Exploratory data analysis

Lecture 9

Dr. Benjamin Soltoff

Cornell University
INFO 4940/5940 - Fall 2024

September 26, 2024

Debrief of homework 01

Announcements

Regrade requests

Should be submitted within one week of the assignment grade being published
Can be submitted starting at noon the day after the assignment grade is published
Intended for cases where you believe a mistake was made in grading your submission
Be specific and polite in your request. We all make mistakes. If we made a mistake grading your submission, we want to correct it.

Announcements

Homework 02 timing

Learning objectives

Review the importance of exploring and cleaning data prior to model development
Implement visualization methods for exploring categorical and numeric predictors/outcomes
Utilize techniques to identify outliers and missingness patterns in data
Document exploratory steps using reproducible documents and Quarto

A flowchart diagramming the machine learning operations lifecycle, including collecting data, understanding and cleaning data, training and evaluating models, deploying model, and monitoring model.

Exploratory data analysis

Generate questions about your data
Search for answers by visualizing, transforming, and modeling your data
Use what you learn to refine your questions and/or generate new questions

You need to understand the data before you can model it.

Things to investigate

Variation
- Numeric
- Categorical
Covariation
- Relationships between variables
- Patterns across groups
Data quality
- Missing values
- Outliers
- Errors

Features of exploratory data analysis

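library(ggplot2)
library(palmerpenguins) # provides the penguins data set

# quick, unpolished plot: default labels, theme, and smoother
ggplot(
  data = penguins,
  mapping = aes(
    x = body_mass_g,
    y = flipper_length_mm
  )
) +
  geom_point() +
  geom_smooth()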

Features of confirmatory data analysis

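library(ggplot2)
library(palmerpenguins) # provides the penguins data set
library(scales)         # label_number(), cut_si()
library(stringr)        # str_glue()

# polished plot for communication: readable axes, titles, and theme
ggplot(
  data = penguins,
  mapping = aes(
    x = body_mass_g,
    y = flipper_length_mm
  )
) +
  geom_point(alpha = .1) +
  geom_smooth(se = FALSE) +
  # format axis labels with SI units
  scale_x_continuous(labels = label_number(scale_cut = cut_si("g"))) +
  scale_y_continuous(labels = label_number(scale_cut = cut_si("mm"))) +
  labs(
    title = "Relationship between body mass and\nflipper length of a penguin",
    subtitle = str_glue("Sample of {nrow(penguins)} penguins"),
    x = "Body mass",
    y = "Flipper length"
  ) +
  # the font must be installed on your system
  theme_minimal(
    base_family = "Atkinson Hyperlegible",
    base_size = 14
  ) +
  theme(plot.title.position = "plot")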

Reproducibility and documentation

It’s important to keep a log of all your EDA and data cleaning steps.

A log allows you and your colleagues to check your work, reproduce your results, and understand your process. This matters most when data cleaning actions can affect model performance.

Use literate programming and Quarto notebooks to record your exploration and analysis.

The Great American Coffee Taste Test

Coffee consumption in the United States

The Great American Coffee Taste Test

Survey of 4,042 coffee drinkers conducted in October 2023
Each respondent was provided with four samples of coffee (single-blind study)
Respondents brewed and evaluated each sample
Survey includes measures about the four samples as well as general coffee preferences and demographic characteristics

`ae-08`

Go to the course GitHub org and find your ae-08 (repo name will be suffixed with your GitHub name).
Clone the repo in RStudio, run renv::restore() to install the required packages, open the Quarto document in the repo, and follow along and complete the exercises.
Render, commit, and push your edits by the AE deadline – end of the day

Examining continuous variables

Things to look for

Asymmetry
Outliers
Multimodality
Gaps
Heaping
Rounding
Impossibilities
Errors

Common chart types

Histogram: geom_histogram()
- Frequency polygon: geom_freqpoly()
- Density plot: geom_density()
Box plot: geom_boxplot()
Rug plot: geom_rug()
Box plot: geom_boxplot() (for subgroup distributions)
- Violin plot: geom_violin()
- Beeswarm plot: ggbeeswarm::geom_beeswarm()
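
As a rough sketch of how these chart types look in practice, the code below draws a histogram with a rug, a density plot, and a box plot by subgroup. It uses the penguins data from earlier in the lecture as a stand-in for the coffee survey; the bin width of 200 g is an arbitrary choice for illustration, and it is worth trying several widths.

```r
library(ggplot2)
library(palmerpenguins) # penguins is a stand-in here, not the coffee survey

# Histogram with a rug: look for asymmetry, gaps, and multimodality.
# binwidth = 200 is an arbitrary starting point.
p_hist <- ggplot(data = penguins, mapping = aes(x = body_mass_g)) +
  geom_histogram(binwidth = 200, na.rm = TRUE) +
  geom_rug(alpha = 0.2, na.rm = TRUE)

# Density plot: a smoothed alternative to the histogram
p_density <- ggplot(data = penguins, mapping = aes(x = body_mass_g)) +
  geom_density(na.rm = TRUE)

# Box plot for subgroup distributions: one box per species
p_box <- ggplot(data = penguins, mapping = aes(x = body_mass_g, y = species)) +
  geom_boxplot(na.rm = TRUE)

p_hist
```

Printing `p_density` or `p_box` shows the other two views of the same variable.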

⏱️ Your turn

Examine the numeric variables in the data set. Document your analysis in the Quarto notebook.


Examining categorical variables

Things to look for

Unexpected patterns of results
Uneven distributions
Extra categories
Large numbers of categories
Don’t knows, refusals, errors, NAs, …

Common chart types

Bar chart: geom_bar()
Pie chart: Don’t
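
One possible way to examine a categorical variable is to tabulate it first, so that NAs and rare levels are explicit, and then draw a bar chart ordered by frequency. The sketch below again uses the penguins data as a stand-in; ordering with `forcats::fct_infreq()` is a convenience chosen for this example, not something the slides prescribe.

```r
library(ggplot2)
library(forcats)
library(palmerpenguins) # stand-in data, not the coffee survey

# Tabulate with NAs shown explicitly so missing categories are not hidden
sex_counts <- table(penguins$sex, useNA = "ifany")
sex_counts

# Bar chart with categories ordered from most to least frequent
p_bar <- ggplot(data = penguins, mapping = aes(y = fct_infreq(species))) +
  geom_bar() +
  labs(y = "species")

p_bar
```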

⏱️ Your turn

Examine the categorical variables in the data set. Document your analysis in the Quarto notebook.


Making comparisons

Things to look for

Causal relationships
Associations/correlations
Outliers (multidimensional)
Clusters
Gaps
Barriers
Conditional relationships

Common chart types

Scatterplot: geom_point()
Smoothing line: geom_smooth()
High density regions: geom_density_2d() or geom_hex()
Comparing groups
- Color
- Faceting
Scatterplot matrix: GGally::ggpairs()
Categorical variables
- geom_count()
- geom_bin2d()
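
The comparison techniques above can be sketched as follows, using the penguins data as a stand-in. The linear smoother (`method = "lm"`) is an assumption made for this illustration; the default smoother is equally reasonable during exploration.

```r
library(ggplot2)
library(palmerpenguins) # stand-in data, not the coffee survey

# Comparing groups with color: one point cloud and one trend line per species
p_color <- ggplot(
  data = penguins,
  mapping = aes(x = body_mass_g, y = flipper_length_mm, color = species)
) +
  geom_point(alpha = 0.5, na.rm = TRUE) +
  geom_smooth(method = "lm", se = FALSE, na.rm = TRUE)

# Comparing groups with faceting: one panel per species
p_facet <- ggplot(
  data = penguins,
  mapping = aes(x = body_mass_g, y = flipper_length_mm)
) +
  geom_point(alpha = 0.5, na.rm = TRUE) +
  facet_wrap(vars(species))

# Two categorical variables: point size encodes the number of cases
p_count <- ggplot(data = penguins, mapping = aes(x = species, y = island)) +
  geom_count()

p_color
```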

⏱️ Your turn

Examine the relationship between some of the variables and prefer_overall. Document your analysis in the Quarto notebook.


Data quality

Missing values

Why values are missing

Feature not measured
Error replaced with NA
“Don’t know” or “refused” to answer
Trolling responses

What to do with missing values

Remove the rows
Remove the feature
Impute plausible values
Do nothing
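
The first three options can be sketched with dplyr and tidyr. The data frame `toy` and its columns are hypothetical, invented for this illustration, and median imputation is only one of many ways to produce a plausible value.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data (not the coffee survey)
toy <- tibble(
  id = 1:6,
  age = c(25, NA, 41, 33, NA, 58),
  cups = c(2, 1, NA, 3, 2, 4),
  notes = c(NA, NA, NA, "likes espresso", NA, NA)
)

# Option 1: remove the rows with missing values in selected columns
rows_removed <- drop_na(toy, age, cups)

# Option 2: remove a feature that is mostly missing
feature_removed <- select(toy, -notes)

# Option 3: impute a plausible value (here, the median of the observed ages)
imputed <- mutate(
  toy,
  age = if_else(is.na(age), median(age, na.rm = TRUE), age)
)

nrow(rows_removed)
```

Each option trades off information loss against the risk of introducing artificial values, so it is worth recording which one you chose and why.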

Visualizing missingness patterns

vis_dat(coffee_survey)

Visualizing missingness patterns

vis_miss(coffee_survey, sort_miss = TRUE)

Visualizing missingness patterns

vis_miss(coffee_survey, cluster = TRUE)
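
A numeric summary can complement these plots. The sketch below computes the share of missing values per column and cross-tabulates which columns are missing together. It uses the built-in `airquality` data as a stand-in because the column names of `coffee_survey` are not shown here.

```r
# airquality ships with R and stands in for coffee_survey

# Proportion of missing values in each column, largest first
miss_prop <- sort(colMeans(is.na(airquality)), decreasing = TRUE)
round(miss_prop, 3)

# Do the two incomplete columns tend to be missing in the same rows?
miss_pattern <- table(
  ozone_missing = is.na(airquality$Ozone),
  solar_missing = is.na(airquality$Solar.R)
)
miss_pattern
```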

Outliers

Cases which are far away from the bulk of the data.

Error
Extreme value
Rare value
Unusual value

Whether a case counts as an outlier is often context-dependent: it depends on who or what is in the data set.

What to do about outliers

True errors? Remove them
- Only certain features
- Drop observation entirely
Imputation
Transformations
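
A minimal sketch of these options, assuming the common 1.5 × IQR rule as the outlier definition (one convention among several, not one the slides specify). The data frame `toy` and its column `cups` are hypothetical.

```r
library(dplyr)

# Hypothetical toy data with one suspicious value
toy <- tibble(cups = c(1, 2, 2, 3, 3, 4, 40))

flagged <- mutate(
  toy,
  # Fences from the 1.5 * IQR rule (an assumed convention)
  lower = quantile(cups, 0.25, names = FALSE) - 1.5 * IQR(cups),
  upper = quantile(cups, 0.75, names = FALSE) + 1.5 * IQR(cups),
  is_outlier = cups < lower | cups > upper,
  # If it is a true error: blank out only this feature, keep the row
  cups_clean = if_else(is_outlier, NA_real_, cups),
  # Alternative: keep the value but reduce its leverage with a transformation
  cups_log = log1p(cups)
)

# To drop the observation entirely instead:
dropped <- filter(flagged, !is_outlier)

flagged
```

Flagging before removing keeps the decision visible in the notebook, so a colleague can see which cases were treated as outliers and revisit the rule.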

⏱️ Your turn

Examine the quality of the data set. Document your analysis in the Quarto notebook.


Wrap-up

Recap

Exploratory data analysis is critical for understanding your data
Use visualizations to explore numeric and categorical variables
Identify and address missingness and outliers
Document your exploratory steps in a reproducible document
Generate ideas that you will implement and evaluate in the modeling stage
