Exploratory data analysis

Lecture 10

Dr. Benjamin Soltoff

Cornell University
INFO 4940/5940 - Fall 2025

September 25, 2025

Announcements

  • No homework this week
  • Project 01 begins next week

Learning objectives

  • Review the importance of exploring and cleaning data prior to model development
  • Implement visualization methods for exploring categorical and numeric predictors/outcomes
  • Utilize techniques to identify outliers and missingness patterns in data
  • Document exploratory steps using reproducible documents and Quarto

A flowchart diagramming the machine learning operations lifecycle, including collecting data, understanding and cleaning data, training and evaluating models, deploying model, and monitoring model.

Exploratory data analysis

  1. Generate questions about your data
  2. Search for answers by visualizing, transforming, and modeling your data
  3. Use what you learn to refine your questions and/or generate new questions

Need to better understand the data before you can model it.

Things to investigate

  • Variation
    • Numeric
    • Categorical
  • Covariation
    • Relationships between variables
    • Patterns across groups
  • Data quality
    • Missing values
    • Outliers
    • Errors

Features of exploratory data analysis

Features of confirmatory data analysis

Reproducibility and documentation

It’s important to keep a log of all your EDA and data cleaning steps.

A log allows you and your colleagues to check your work, reproduce your results, and understand your process. This matters most when data cleaning actions can affect model performance.

Use literate programming and Quarto notebooks to record your exploration and analysis.

The Great American Coffee Taste Test

Coffee consumption in the United States

The Great American Coffee Taste Test

  • Survey of 4,042 coffee drinkers conducted in October 2023
  • Each respondent provided with four samples of coffee (single-blind study)
  • Respondents brewed and evaluated each sample
  • Survey includes measures about the four samples as well as general coffee preferences and demographic characteristics

ae-09

Instructions

  • Go to the course GitHub org and find your ae-09 (repo name will be suffixed with your GitHub name).
  • Clone the repo in Positron, run renv::restore() to install the required packages, open the Quarto document in the repo, and follow along and complete the exercises.
  • Render, commit, and push your edits by the AE deadline (end of the day)

Examining continuous variables

Things to look for

  • Asymmetry
  • Outliers
  • Multimodality
  • Gaps
  • Heaping
  • Rounding
  • Impossibilities
  • Errors
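
Several of these features can be checked numerically before any plotting. The sketch below uses simulated data: the variable `age`, its values, and the plausible range of 0 to 120 are assumptions made for illustration and are not taken from the survey.

```r
# Simulated data: `age` is a hypothetical numeric variable, not a survey column
set.seed(123)
toy <- data.frame(
  age = c(round(rnorm(200, mean = 40, sd = 12)), 30, 40, 50, 50, 50, -1, 250)
)

# Asymmetry and outliers: compare the mean with the median, inspect the extremes
summary(toy$age)

# Impossibilities: values outside a plausible range (the bounds are an assumption)
impossible <- toy$age[toy$age < 0 | toy$age > 120]
impossible

# Heaping and rounding: share of values that are exact multiples of 10
heaped_share <- mean(toy$age %% 10 == 0)
heaped_share

# Gaps and heaping: the most frequent values in the data
head(sort(table(toy$age), decreasing = TRUE))
```

A heaped share well above what a smooth distribution would produce suggests respondents rounded their answers.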

Common chart types

  • Histogram: geom_histogram()
    • Frequency polygon: geom_freqpoly()
    • Density plot: geom_density()
  • Boxplot: geom_boxplot()
  • Rug plot: geom_rug()
  • Boxplot: geom_boxplot() (for subgroup distributions)
    • Violin plot: geom_violin()
    • Beeswarm plot: ggbeeswarm::geom_beeswarm()
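
A minimal sketch of how a few of these chart types might be combined, assuming ggplot2 is installed. The data are simulated, and the names `rating` and `group` are hypothetical rather than survey columns.

```r
library(ggplot2)

# Simulated data: `rating` and `group` are hypothetical, not survey columns
set.seed(123)
toy <- data.frame(
  rating = c(rnorm(150, mean = 3, sd = 0.6), rnorm(50, mean = 6, sd = 0.5)),
  group = rep(c("a", "b"), each = 100)
)

# Histogram with a rug: try several binwidths, since features such as
# multimodality can appear or vanish depending on the bins
p_hist <- ggplot(toy, aes(x = rating)) +
  geom_histogram(binwidth = 0.25) +
  geom_rug(alpha = 0.3)

# Density plot: a smoothed view of the same distribution
p_dens <- ggplot(toy, aes(x = rating)) +
  geom_density()

# Violin plus boxplot: compare the distribution across subgroups
p_group <- ggplot(toy, aes(x = group, y = rating)) +
  geom_violin() +
  geom_boxplot(width = 0.1)

# Typing the object name renders the plot in a Quarto notebook
p_hist
p_dens
p_group
```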

๐Ÿ“ Examine the numeric variables

Instructions

Examine the numeric variables in the data set. Document your analysis in the Quarto notebook.

Examining categorical variables

Things to look for

  • Unexpected patterns of results
  • Uneven distributions
  • Extra categories
  • Large numbers of categories
  • Don’t knows, refusals, errors, NAs, …

Common chart types

  • Bar chart: geom_bar()
  • Pie chart: Don’t
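
A minimal sketch of tabulating and charting a categorical variable, assuming ggplot2 is installed. The variable `brew_method` and its values are invented for illustration; the inconsistent capitalization is included on purpose as an example of an extra category.

```r
library(ggplot2)

# Simulated data: `brew_method` is a hypothetical categorical variable
toy <- data.frame(
  brew_method = c(
    rep("Pour over", 40), rep("Espresso", 25), rep("pour over", 3),
    rep("Other", 2), rep("Don't know", 5), rep(NA, 10)
  )
)

# Tabulate every category, including NA, from most to least common
counts <- sort(table(toy$brew_method, useNA = "ifany"), decreasing = TRUE)
counts

# Order the factor levels by frequency so the bars are sorted
level_order <- names(counts)[!is.na(names(counts))]
toy$brew_method <- factor(toy$brew_method, levels = level_order)

# Bar chart: NA values are drawn as their own bar after the named levels
p_bar <- ggplot(toy, aes(x = brew_method)) +
  geom_bar()
p_bar
```

The table shows both "Pour over" and "pour over", which would need to be merged before modeling.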

๐Ÿ“ Examine the categorical variables

Instructions

Examine the categorical variables in the data set. Document your analysis in the Quarto notebook.

Making comparisons

Things to look for

  • Causal relationships
  • Associations/correlations
  • Outliers (multidimensional)
  • Clusters
  • Gaps
  • Barriers
  • Conditional relationships

Common chart types

  • Scatterplot: geom_point()

  • Smoothing line: geom_smooth()

  • High density regions: geom_density_2d() or geom_hex()

  • Comparing groups

    • Color
    • Faceting
  • Scatterplot matrix: GGally::ggpairs()

  • Categorical variables

    • geom_count()
    • geom_bin2d()
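
A minimal sketch of comparing variables with some of the chart types above, assuming ggplot2 is installed. The data are simulated, and `expertise`, `rating`, `group`, and `level` are hypothetical names rather than survey columns.

```r
library(ggplot2)

# Simulated data: all column names and values are hypothetical
set.seed(123)
n <- 300
toy <- data.frame(
  expertise = runif(n, min = 1, max = 10),
  group = sample(c("a", "b"), size = n, replace = TRUE)
)
toy$rating <- 2 + 0.3 * toy$expertise + rnorm(n, sd = 0.8)

# Strength of the linear association between two numeric variables
r <- cor(toy$expertise, toy$rating)
r

# Scatterplot with one smoothing line per group, distinguished by color
p_scatter <- ggplot(toy, aes(x = expertise, y = rating, color = group)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "loess", formula = y ~ x, se = FALSE)

# The same relationship in separate panels, one per group
p_facet <- ggplot(toy, aes(x = expertise, y = rating)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", formula = y ~ x) +
  facet_wrap(vars(group))

# Two categorical variables: point size shows the number of cases per cell
toy$level <- cut(toy$expertise, breaks = c(0, 4, 7, 10),
                 labels = c("low", "mid", "high"))
p_count <- ggplot(toy, aes(x = group, y = level)) +
  geom_count()
```

Plots like these show association. Whether a relationship is causal cannot be read from the chart alone.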

๐Ÿ“ Relationship with outcome

Instructions

Examine the relationship between some of the variables and coffee_d_personal_preference. Document your analysis in the Quarto notebook.

Data quality

Missing values

Why values are missing

  • Feature not measured
  • Error replaced with NA
  • “Don’t know” or “refused” to answer
  • Trolling responses

What to do with missing values

  • Remove the rows
  • Remove the feature
  • Impute plausible values
  • Do nothing
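
The first three options can be sketched in base R as follows. All column names and values are hypothetical, and the 50% cutoff for dropping a feature is an arbitrary choice for illustration.

```r
# Simulated data: all column names and values are hypothetical
toy <- data.frame(
  age = c(25, 31, NA, 47, 52, NA),
  cups = c(1, 2, 2, NA, 4, 3),
  notes = c(NA, NA, NA, NA, "decaf only", NA)
)

# How much is missing in each column?
miss_share <- colMeans(is.na(toy))
miss_share

# Option 1: remove the rows with a missing value in selected columns
rows_dropped <- toy[complete.cases(toy[, c("age", "cups")]), ]

# Option 2: remove a feature that is mostly missing (cutoff is arbitrary)
cols_dropped <- toy[, miss_share <= 0.5, drop = FALSE]

# Option 3: impute a plausible value, here the column median
imputed <- toy
imputed$age[is.na(imputed$age)] <- median(toy$age, na.rm = TRUE)
imputed$age
```

In a modeling workflow, imputation values would normally be estimated from the training set only, so that information from the test set does not leak into the model.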

Visualizing missingness patterns

Use missingno to visualize missingness patterns.
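
missingno is a Python library. For readers working in R, a similar matrix view can be approximated with ggplot2, as in the sketch below. This is a hand-rolled approximation, not the missingno API, and the data and column names are simulated.

```r
library(ggplot2)

# Simulated data: all column names and values are hypothetical
set.seed(123)
n <- 100
toy <- data.frame(
  age = rnorm(n, mean = 40, sd = 12),
  cups = rpois(n, lambda = 2),
  spend = runif(n, min = 0, max = 50)
)
toy$age[sample(n, 15)] <- NA
# Make `cups` and `spend` missing in the same rows to create a pattern
missing_rows <- sample(n, 20)
toy$cups[missing_rows] <- NA
toy$spend[missing_rows] <- NA

# Reshape the is.na() matrix to long form: one row per cell
na_mat <- is.na(toy)
long <- data.frame(
  row = rep(seq_len(nrow(na_mat)), times = ncol(na_mat)),
  variable = rep(colnames(na_mat), each = nrow(na_mat)),
  missing = as.vector(na_mat)
)

# Tile plot: each observation is a horizontal strip across the variables
p_miss <- ggplot(long, aes(x = variable, y = row, fill = missing)) +
  geom_tile() +
  scale_y_reverse()
p_miss

# Co-occurrence: how often are two variables missing in the same row?
co_missing <- crossprod(na_mat * 1)
co_missing
```

Off-diagonal counts close to the diagonal counts indicate variables that tend to be missing together.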

Outliers

Cases which are far away from the bulk of the data

  • Error
  • Extreme value
  • Rare value
  • Unusual value

Whether a case counts as an outlier is often context-dependent: it depends on who or what is in the data set.

What to do about outliers

  • True errors? Remove them
    • Only certain features
    • Drop observation entirely
  • Imputation
  • Transformations
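
A minimal sketch of flagging and handling outliers in base R. The variable `spend` is simulated, and the 1.5 × IQR rule used here is one common convention rather than a definition; the appropriate threshold depends on context.

```r
# Simulated data: `spend` is a hypothetical right-skewed variable
set.seed(123)
spend <- c(rlnorm(200, meanlog = 3, sdlog = 0.5), 500, 900)

# Flag values more than 1.5 * IQR beyond the quartiles (the boxplot rule)
q <- quantile(spend, probs = c(0.25, 0.75))
fence <- 1.5 * diff(q)
is_outlier <- spend < q[[1]] - fence | spend > q[[2]] + fence
sum(is_outlier)

# Option: drop the flagged cases (only defensible for true errors)
spend_dropped <- spend[!is_outlier]

# Option: set them to NA, then impute as for other missing values
spend_na <- replace(spend, is_outlier, NA)

# Option: transform the variable to pull in the long right tail
log_spend <- log(spend)
summary(log_spend)
```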

๐Ÿ“ Relationship with outcome

Instructions

Examine the quality of the data set. Document your analysis in the Quarto notebook.

Wrap-up

Recap

  • Exploratory data analysis is critical for understanding your data
  • Use visualizations to explore numeric and categorical variables
  • Identify and address missingness and outliers
  • Document your exploratory steps in a reproducible document
  • Generate ideas that you will implement and evaluate in the modeling stage