Exploratory data analysis

Lecture 9

Dr. Benjamin Soltoff

Cornell University
INFO 4940/5940 - Fall 2024

September 26, 2024

Debrief of homework 01

Announcements

Regrade requests

  • Should be submitted within one week of the assignment grade being published
  • Can be submitted starting at noon the day after the assignment grade is published
  • Intended for cases where you believe a mistake was made in grading your submission
  • Be specific and polite in your request. We all make mistakes. If we made a mistake grading your submission, we want to correct it.

Announcements

  • Homework 02 timing

Learning objectives

  • Review the importance of exploring and cleaning data prior to model development
  • Implement visualization methods for exploring categorical and numeric predictors/outcomes
  • Utilize techniques to identify outliers and missingness patterns in data
  • Document exploratory steps using reproducible documents and Quarto

A flowchart diagramming the machine learning operations lifecycle, including collecting data, understanding and cleaning data, training and evaluating models, deploying model, and monitoring model.

Exploratory data analysis

  1. Generate questions about your data
  2. Search for answers by visualizing, transforming, and modeling your data
  3. Use what you learn to refine your questions and/or generate new questions

You need to understand the data before you can model it.

Things to investigate

  • Variation
    • Numeric
    • Categorical
  • Covariation
    • Relationships between variables
    • Patterns across groups
  • Data quality
    • Missing values
    • Outliers
    • Errors

Features of exploratory data analysis

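library(ggplot2)
library(palmerpenguins) # assumption: `penguins` comes from the palmerpenguins package

# Quick exploratory plot: default settings, no polish
ggplot(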
  data = penguins,
  mapping = aes(
    x = body_mass_g,
    y = flipper_length_mm
  )
) +
  geom_point() +
  geom_smooth()

Features of confirmatory data analysis

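library(scales)  # label_number(), cut_si()
library(stringr) # str_glue()

# Polished plot for presentation (also needs ggplot2 and the penguins data)
ggplot(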
  data = penguins,
  mapping = aes(
    x = body_mass_g,
    y = flipper_length_mm
  )
) +
  geom_point(alpha = .1) +
  geom_smooth(se = FALSE) +
  scale_x_continuous(labels = label_number(scale_cut = cut_si("g"))) +
  scale_y_continuous(labels = label_number(scale_cut = cut_si("mm"))) +
  labs(
    title = "Relationship between body mass and\nflipper length of a penguin",
    subtitle = str_glue("Sample of {nrow(penguins)} penguins"),
    x = "Body mass",
    y = "Flipper length"
  ) +
  theme_minimal(
    base_family = "Atkinson Hyperlegible",
    base_size = 14
  ) +
  theme(plot.title.position = "plot")

Reproducibility and documentation

It’s important to keep a log of all your EDA and data cleaning steps.

A log lets you and your colleagues check your work, reproduce your results, and understand your process. This matters most when data cleaning actions can affect model performance.

Use literate programming and Quarto notebooks to record your exploration and analysis.

The Great American Coffee Taste Test

Coffee consumption in the United States

The Great American Coffee Taste Test

  • Survey of 4,042 coffee drinkers conducted in October 2023
  • Each respondent provided with four samples of coffee (single-blind study)
  • Respondents brewed and evaluated each sample
  • Survey includes measures about the four samples as well as general coffee preferences and demographic characteristics

ae-08

  • Go to the course GitHub org and find your ae-08 (repo name will be suffixed with your GitHub name).
  • Clone the repo in RStudio, run renv::restore() to install the required packages, open the Quarto document in the repo, and follow along and complete the exercises.
  • Render, commit, and push your edits by the AE deadline – end of the day

Examining continuous variables

Things to look for

  • Asymmetry
  • Outliers
  • Multimodality
  • Gaps
  • Heaping
  • Rounding
  • Impossibilities
  • Errors

Common chart types

  • Histogram: geom_histogram()
    • Frequency polygon: geom_freqpoly()
    • Density plot: geom_density()
  • Box plot: geom_boxplot()
  • Rug plot: geom_rug()
  • Box plot: geom_boxplot() (for subgroup distributions)
    • Violin plot: geom_violin()
    • Beeswarm plot: ggbeeswarm::geom_beeswarm()
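
A minimal sketch of a few of these chart types. It assumes the `penguins` data from the palmerpenguins package (the same data as the earlier scatterplot) rather than the coffee survey, and the bin width of 250 g is an arbitrary starting point, not a recommendation.

```r
library(ggplot2)
library(palmerpenguins) # assumption: source of the `penguins` data

# Histogram with a rug underneath. The bin width is a judgment call,
# so try several values before reading anything into the shape.
p_hist <- ggplot(penguins, aes(x = body_mass_g)) +
  geom_histogram(binwidth = 250, na.rm = TRUE) +
  geom_rug(alpha = 0.2, na.rm = TRUE)

# Density plot: a smoothed alternative to the histogram.
p_density <- ggplot(penguins, aes(x = body_mass_g)) +
  geom_density(na.rm = TRUE)

# Box plots by subgroup to compare distributions across species.
p_box <- ggplot(penguins, aes(x = species, y = body_mass_g)) +
  geom_boxplot(na.rm = TRUE)
```

Print any of the objects (for example `p_hist`) to draw the plot. Setting `na.rm = TRUE` only silences the warning about dropped rows; the missing values still deserve a look of their own.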

⏱️ Your turn

Examine the numeric variables in the data set. Document your analysis in the Quarto notebook.

Examining categorical variables

Things to look for

  • Unexpected patterns of results
  • Uneven distributions
  • Extra categories
  • Large numbers of categories
  • Don’t knows, refusals, errors, NAs, …

Common chart types

  • Bar chart: geom_bar()
  • Pie chart: Don’t
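
A small sketch of checking a categorical variable, again using `penguins` from palmerpenguins as a stand-in (an assumption; the coffee survey columns are not shown here). Tabulating before plotting makes extra categories and missing values easy to spot.

```r
library(dplyr)
library(ggplot2)
library(palmerpenguins) # assumption: source of the `penguins` data

# count() keeps NA as its own row, so missing responses stay visible.
sex_counts <- penguins |>
  count(sex, sort = TRUE)

# Horizontal bar chart of category frequencies.
p_bar <- ggplot(penguins, aes(y = island)) +
  geom_bar()
```

Putting the categories on the y-axis keeps long labels readable, which helps with survey responses.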

⏱️ Your turn

Examine the categorical variables in the data set. Document your analysis in the Quarto notebook.

Making comparisons

Things to look for

  • Causal relationships
  • Associations/correlations
  • Outliers (multidimensional)
  • Clusters
  • Gaps
  • Barriers
  • Conditional relationships

Common chart types

  • Scatterplot: geom_point()
  • Smoothing line: geom_smooth()
  • High density regions: geom_density_2d() or geom_hex()
  • Comparing groups
    • Color
    • Faceting
  • Scatterplot matrix: GGally::ggpairs()
  • Categorical variables
    • geom_count()
    • geom_bin2d()
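
A sketch of three of these comparisons, assuming the `penguins` data from palmerpenguins. The linear smoother is chosen only to keep the example simple; it is not a claim about the right model for these data.

```r
library(ggplot2)
library(palmerpenguins) # assumption: source of the `penguins` data

# Compare groups with color: one point cloud and one trend line per species.
p_color <- ggplot(
  penguins,
  aes(x = body_mass_g, y = flipper_length_mm, color = species)
) +
  geom_point(alpha = 0.5, na.rm = TRUE) +
  geom_smooth(method = "lm", se = FALSE, na.rm = TRUE)

# Compare groups with faceting: one panel per island.
p_facet <- ggplot(penguins, aes(x = body_mass_g, y = flipper_length_mm)) +
  geom_point(alpha = 0.5, na.rm = TRUE) +
  facet_wrap(vars(island))

# Two categorical variables: point size shows the number of cases per combination.
p_count <- ggplot(penguins, aes(x = island, y = species)) +
  geom_count()
```

Color works well for a handful of groups; faceting scales better when the groups overlap heavily.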

⏱️ Your turn

Examine the relationship between some of the variables and prefer_overall. Document your analysis in the Quarto notebook.

Data quality

Missing values

Why values are missing

  • Feature not measured
  • Error replaced with NA
  • “Don’t know” or “refused” to answer
  • Trolling responses

What to do with missing values

  • Remove the rows
  • Remove the feature
  • Impute plausible values
  • Do nothing
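
The first three options can be sketched with dplyr and tidyr. This assumes the `penguins` data from palmerpenguins as a stand-in, and median imputation is used only because it is the simplest choice to show, not because it is the best one.

```r
library(dplyr)
library(tidyr)
library(palmerpenguins) # assumption: source of the `penguins` data

# Option 1: remove rows that are missing a value in a specific column.
dropped_rows <- penguins |>
  drop_na(bill_length_mm)

# Option 2: remove the feature entirely.
dropped_feature <- penguins |>
  select(-sex)

# Option 3: impute a plausible value (here, the column median).
imputed <- penguins |>
  mutate(
    bill_length_mm = coalesce(
      bill_length_mm,
      median(bill_length_mm, na.rm = TRUE)
    )
  )
```

Each option changes the data the model will see, so record which one you chose and why in the notebook.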

Visualizing missingness patterns

vis_dat(coffee_survey)

vis_miss(coffee_survey, sort_miss = TRUE)

vis_miss(coffee_survey, cluster = TRUE)
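
A numeric summary can complement these plots. The sketch below uses base R on the `penguins` data from palmerpenguins (an assumption, standing in for `coffee_survey`) to compute the share of missing values per column and the share of fully observed rows.

```r
library(palmerpenguins) # assumption: stand-in data for `coffee_survey`

# Share of missing values in each column, largest first.
miss_share <- sort(colMeans(is.na(penguins)), decreasing = TRUE)

# Share of rows with no missing values at all.
complete_share <- mean(complete.cases(penguins))

miss_share
complete_share
```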

Outliers

Cases which are far away from the bulk of the data.

  • Error
  • Extreme value
  • Rare value
  • Unusual value

Whether a case counts as an outlier is often context-dependent, based on who or what is in the data set.

What to do about outliers

  • True errors? Remove them
    • Only certain features
    • Drop observation entirely
  • Imputation
  • Transformations
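
One common way to flag candidate outliers is the 1.5 × IQR rule, the same convention a box plot uses to decide which cases to draw as individual points. The helper `flag_iqr_outliers()` below is hypothetical (written for this sketch, not taken from a package), and the toy vector is made up for illustration.

```r
# Hypothetical helper: flag values more than `k` IQRs beyond the quartiles.
flag_iqr_outliers <- function(x, k = 1.5) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE)
  fence <- k * (q[[2]] - q[[1]])
  # Missing values are never flagged; they are a separate data quality issue.
  !is.na(x) & (x < q[[1]] - fence | x > q[[2]] + fence)
}

x <- c(1, 2, 3, 4, 100) # made-up values for illustration

is_outlier <- flag_iqr_outliers(x)

# A log transformation pulls in the long right tail without dropping the case.
x_log <- log10(x)
```

A flag is a prompt to investigate, not a reason to delete. Whether the case is an error or a genuine extreme value still depends on context.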

⏱️ Your turn

Examine the quality of the data set. Document your analysis in the Quarto notebook.

Wrap-up

Recap

  • Exploratory data analysis is critical for understanding your data
  • Use visualizations to explore numeric and categorical variables
  • Identify and address missingness and outliers
  • Document your exploratory steps in a reproducible document
  • Generate ideas that you will implement and evaluate in the modeling stage

Apple Harvest Festival