Lecture 10
Cornell University
INFO 4940/5940 - Fall 2025
September 25, 2025
Illustration credit: Posit
Need to better understand the data before you can model it.
It's important to keep a log of all your EDA and data cleaning steps.
A log lets you and your colleagues check your work, reproduce your results, and understand your process, especially when data cleaning actions can impact model performance.
Use literate programming and Quarto notebooks to record your exploration and analysis.
ae-09
Instructions
Find your ae-09 repo (repo name will be suffixed with your GitHub name). Run renv::restore() to install the required packages, open the Quarto document in the repo, and follow along and complete the exercises.
Tons more info: Graphical Data Analysis with R
geom_histogram()
geom_freqpoly()
geom_density()
geom_boxplot()
geom_rug()
geom_boxplot() (for subgroup distributions)
geom_violin()
ggbeeswarm::geom_beeswarm()
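A minimal sketch of a few of the geoms listed above. It assumes the ggplot2 package is installed and uses the `mpg` data set bundled with ggplot2 as a stand-in, since the columns of the exercise data are not listed here. The variable names (`p_hist`, `p_density`, `p_box`) are illustrative only.

```r
library(ggplot2)

# `mpg` ships with ggplot2 and stands in for the exercise data (assumption).

# Histogram: the bin width shapes what you see, so try more than one value.
p_hist <- ggplot(mpg, aes(x = hwy)) +
  geom_histogram(binwidth = 2)

# Density curve with a rug marking each individual observation.
p_density <- ggplot(mpg, aes(x = hwy)) +
  geom_density() +
  geom_rug()

# Subgroup distributions: one boxplot per vehicle class.
p_box <- ggplot(mpg, aes(x = class, y = hwy)) +
  geom_boxplot()

p_hist
p_density
p_box
```

Swapping geom_boxplot() for geom_violin() in the last plot shows the shape of each subgroup's distribution rather than only its quartiles.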
Instructions
Examine the numeric variables in the data set. Document your analysis in the Quarto notebook.
NAs, …
geom_bar()
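A short sketch of examining one categorical variable, assuming ggplot2 is installed. The `mpg` data set and its `class` column are stand-ins for the exercise data, and the helper names (`class_counts`, `class_levels`, `plot_data`) are hypothetical.

```r
library(ggplot2)

# Tabulate first: the raw counts reveal rare levels and missing values.
class_counts <- table(mpg$class, useNA = "ifany")
class_counts

# Order the levels from most to least frequent so the bars are easy to compare.
class_levels <- names(sort(table(mpg$class), decreasing = TRUE))
plot_data <- data.frame(class = factor(mpg$class, levels = class_levels))

# geom_bar() counts the rows in each level for you.
p_bar <- ggplot(plot_data, aes(x = class)) +
  geom_bar()

p_bar
```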
Instructions
Examine the categorical variables in the data set. Document your analysis in the Quarto notebook.
Scatterplot: geom_point()
Smoothing line: geom_smooth()
High density regions: geom_density_2d() or geom_hex()
Comparing groups
Scatterplot matrix: GGally::ggpairs()
Categorical variables
geom_count()
geom_bin2d()
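The sketch below illustrates three of the options above for two variables at a time. It assumes ggplot2 is installed and again uses `mpg` as a stand-in data set; the plot names are illustrative.

```r
library(ggplot2)

# Two numeric variables: scatterplot with a smoothing line on top.
p_scatter <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "loess", formula = y ~ x)

# Comparing groups: map a categorical variable to colour.
p_groups <- ggplot(mpg, aes(x = displ, y = hwy, colour = drv)) +
  geom_point()

# Two categorical variables: point size encodes how many rows share each pair.
p_count <- ggplot(mpg, aes(x = drv, y = class)) +
  geom_count()

p_scatter
p_groups
p_count
```

With many overlapping points, replacing geom_point() with geom_hex() (which needs the hexbin package) or geom_density_2d() can make dense regions easier to see.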
Instructions
Examine the relationship between some of the variables and coffee_d_personal_preference. Document your analysis in the Quarto notebook.
NA
Use missingno to visualize missingness patterns.
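Alongside a visualization, a plain tabulation of missing values can serve as a cross-check. The sketch below uses only base R and the built-in `airquality` data set as a stand-in. It is not the missingno interface, and the names `na_counts`, `na_share`, and `na_pattern` are hypothetical.

```r
# `airquality` is built into R and has missing values in two columns (stand-in data).

# How many values are missing in each column, and what share of rows is that?
na_counts <- colSums(is.na(airquality))
na_share <- colMeans(is.na(airquality))

# Missingness pattern: are the two columns missing in the same rows?
na_pattern <- table(
  ozone_missing = is.na(airquality$Ozone),
  solar_missing = is.na(airquality$Solar.R)
)

na_counts
round(na_share, 3)
na_pattern
```

If most missing values fall in the same rows, dropping incomplete cases costs less data than the per-column counts suggest.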
Cases that are far away from the bulk of the data
What counts as "far away" is often context-dependent, based on who or what is in the data set.
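One common heuristic for "far from the bulk of the data" is the 1.5 × IQR rule, similar to the one boxplot whiskers are based on. The sketch below is an illustration of that heuristic in base R, not a definitive test; the function `flag_outliers()` and its multiplier `k` are hypothetical names, and the example vector is made up.

```r
# Flag values more than k * IQR below the first quartile or above the third.
flag_outliers <- function(x, k = 1.5) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  fence <- k * (q[2] - q[1])
  # Missing values are never flagged as outliers.
  !is.na(x) & (x < q[1] - fence | x > q[2] + fence)
}

# Made-up values for illustration only.
x <- c(10, 12, 11, 13, 12, 11, 95)
x[flag_outliers(x)]
```

A flagged case is a prompt to look closer, not an instruction to delete it: whether it is an error or a real but unusual observation depends on the context of the data.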
Instructions
Examine the quality of the data set. Document your analysis in the Quarto notebook.