Lecture 9
Cornell University
INFO 4940/5940 - Fall 2024
September 26, 2024
Illustration credit: Posit
You need to understand the data before you can model it.
library(tidyverse)       # ggplot2 and str_glue()
library(scales)          # label_number() and cut_si()
library(palmerpenguins)  # assumption: penguins comes from this package

ggplot(
  data = penguins,
  mapping = aes(
    x = body_mass_g,
    y = flipper_length_mm
  )
) +
  # semi-transparent points reduce overplotting
  geom_point(alpha = .1) +
  geom_smooth(se = FALSE) +
  # label axes with SI units (g, mm)
  scale_x_continuous(labels = label_number(scale_cut = cut_si("g"))) +
  scale_y_continuous(labels = label_number(scale_cut = cut_si("mm"))) +
  labs(
    title = "Relationship between body mass and\nflipper length of a penguin",
    subtitle = str_glue("Sample of {nrow(penguins)} penguins"),
    x = "Body mass",
    y = "Flipper length"
  ) +
  theme_minimal(
    base_family = "Atkinson Hyperlegible",
    base_size = 14
  ) +
  theme(plot.title.position = "plot")
It’s important to keep a log of all your EDA and data cleaning steps.
A log lets you and your colleagues check your work, reproduce your results, and understand your process. This matters most when data cleaning actions can affect model performance.
Use literate programming and Quarto notebooks to record your exploration and analysis.
ae-08 (repo name will be suffixed with your GitHub name).
Run renv::restore() to install the required packages, open the Quarto document in the repo, and follow along and complete the exercises.
Tons more info: Graphical Data Analysis with R
geom_histogram()
geom_freqpoly()
geom_density()
geom_boxplot()
geom_rug()
geom_boxplot() (for subgroup distributions)
geom_violin()
ggbeeswarm::geom_beeswarm()
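A minimal sketch of a few of the geoms listed above, applied to one numeric variable. It assumes the penguins data from the palmerpenguins package (the slides do not say where penguins comes from); the object names are illustrative only.

```r
library(ggplot2)
library(palmerpenguins)  # assumption: source of the penguins data

# Drop rows with missing body mass so ggplot2 does not warn about removed rows
penguins_mass <- subset(penguins, !is.na(body_mass_g))

# Histogram with a rug: overall shape plus the individual observations
p_hist <- ggplot(penguins_mass, aes(x = body_mass_g)) +
  geom_histogram(bins = 30) +
  geom_rug(alpha = 0.3)

# Density curves by species: compare subgroup distributions on one axis
p_density <- ggplot(penguins_mass, aes(x = body_mass_g, color = species)) +
  geom_density()

# Boxplots by species: median, spread, and candidate outliers per subgroup
p_box <- ggplot(penguins_mass, aes(x = species, y = body_mass_g)) +
  geom_boxplot()

p_hist
p_density
p_box
```

Try several values of bins before settling on one; the apparent shape of a histogram can change with the bin width.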
Examine the numeric variables in the data set. Document your analysis in the Quarto notebook.
NAs, …
geom_bar()
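A short sketch of examining a categorical variable, again assuming the penguins data from the palmerpenguins package. Tabulating the counts first makes it easy to check what the bar chart shows.

```r
library(ggplot2)
library(palmerpenguins)  # assumption: source of the penguins data

# Counts per level; useNA = "ifany" adds an NA entry when values are missing
species_counts <- table(penguins$species, useNA = "ifany")
sex_counts <- table(penguins$sex, useNA = "ifany")
species_counts
sex_counts

# geom_bar() counts the rows in each level for you
p_species <- ggplot(penguins, aes(x = species)) +
  geom_bar()

# Missing values get their own bar, which makes them hard to overlook
p_sex <- ggplot(penguins, aes(x = sex)) +
  geom_bar()

p_species
p_sex
```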
Examine the categorical variables in the data set. Document your analysis in the Quarto notebook.
Scatterplot: geom_point()
Smoothing line: geom_smooth()
High density regions: geom_density_2d() or geom_hex()
Comparing groups
Scatterplot matrix: GGally::ggpairs()
Categorical variables
geom_count()
geom_bin2d()
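The geoms above can be sketched as follows for two numeric variables and for two categorical variables. This assumes the penguins data from the palmerpenguins package; the choice of variables and object names is illustrative, not taken from the exercise.

```r
library(ggplot2)
library(palmerpenguins)  # assumption: source of the penguins data

# Keep rows where both numeric variables are observed
penguins_xy <- subset(
  penguins,
  !is.na(body_mass_g) & !is.na(flipper_length_mm)
)

# Scatterplot with a smoothing line; color separates the groups
p_scatter <- ggplot(penguins_xy, aes(x = body_mass_g, y = flipper_length_mm)) +
  geom_point(aes(color = species), alpha = 0.6) +
  geom_smooth(method = "loess", formula = y ~ x, se = FALSE)

# Rectangular 2D bins: useful when points overlap heavily
p_bin <- ggplot(penguins_xy, aes(x = body_mass_g, y = flipper_length_mm)) +
  geom_bin2d(bins = 20)

# Two categorical variables: point size shows the count in each combination
p_count <- ggplot(penguins, aes(x = island, y = species)) +
  geom_count()

p_scatter
p_bin
p_count
```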
Examine the relationship between some of the variables and prefer_overall. Document your analysis in the Quarto notebook.
NA
Outliers: cases that are far away from the bulk of the data.
Whether a case counts as an outlier is often context-dependent, based on who or what is in the data set.
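A rough sketch of two basic data quality checks: counting missing values per column and flagging candidate outliers. The 1.5 × IQR rule used here is one common convention (the same one boxplots use), not something the slides prescribe, and flag_iqr_outliers() is a hypothetical helper written for this example. The penguins data is assumed to come from the palmerpenguins package.

```r
library(palmerpenguins)  # assumption: source of the penguins data

# Number of missing values in each column
na_counts <- colSums(is.na(penguins))
na_counts

# Hypothetical helper: TRUE for values beyond 1.5 * IQR from the quartiles
flag_iqr_outliers <- function(x) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE)
  iqr <- q[[2]] - q[[1]]
  lower <- q[[1]] - 1.5 * iqr
  upper <- q[[2]] + 1.5 * iqr
  # Missing values are never flagged as outliers
  !is.na(x) & (x < lower | x > upper)
}

# Rows flagged on body mass; inspect them before deciding what to do
mass_flags <- flag_iqr_outliers(penguins$body_mass_g)
penguins[mass_flags, ]
```

A flagged row is only a prompt to look closer. Whether to keep, correct, or drop it depends on the context of the data.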
Examine the quality of the data set. Document your analysis in the Quarto notebook.