Lecture 10
Cornell University
INFO 4940/5940 - Fall 2024
October 1, 2024
include: true
echo: true
message: true
or warning: true
cache: true
```{r}
#| cache: true
scdb_clean <- scdb_case |>
mutate(one_vote = majVotes - minVotes == 1)
scdb_clean
```
# A tibble: 9,299 × 53
caseId docketId caseIssuesId dateDecision decisionType usCite sctCite ledCite
<chr> <chr> <chr> <chr> <dbl> <chr> <chr> <chr>
1 1945-… 1945-00… 1945-001-01… 12/10/1945 1 326 U… 66 S. … 90 L. …
2 1945-… 1945-00… 1945-002-01… 12/3/1945 1 326 U… 66 S. … 90 L. …
3 1945-… 1945-00… 1945-003-01… 11/13/1945 1 326 U… 66 S. … 90 L. …
4 1945-… 1945-00… 1945-004-01… 11/13/1945 1 326 U… 66 S. … 90 L. …
5 1945-… 1945-00… 1945-005-01… 11/5/1945 1 326 U… 66 S. … 90 L. …
6 1945-… 1945-00… 1945-006-01… 11/5/1945 1 326 U… 66 S. … 90 L. …
7 1945-… 1945-00… 1945-007-01… 11/5/1945 2 326 U… 66 S. … 90 L. …
8 1945-… 1945-00… 1945-008-01… 11/5/1945 1 326 U… 66 S. … 90 L. …
9 1945-… 1945-00… 1945-009-01… 11/5/1945 1 326 U… 66 S. … 90 L. …
10 1945-… 1945-01… 1945-010-01… 12/10/1945 1 326 U… 66 S. … 90 L. …
# ℹ 9,289 more rows
# ℹ 45 more variables: lexisCite <chr>, term <dbl>, naturalCourt <dbl>,
# chief <chr>, docket <chr>, caseName <chr>, dateArgument <chr>,
# dateRearg <chr>, petitioner <dbl>, petitionerState <dbl>, respondent <dbl>,
# respondentState <dbl>, jurisdiction <dbl>, adminAction <dbl>,
# adminActionState <dbl>, threeJudgeFdc <dbl>, caseOrigin <dbl>,
# caseOriginState <dbl>, caseSource <dbl>, caseSourceState <dbl>, …
```{r}
#| cache: true
scdb_clean <- scdb_case |>
mutate(one_vote = majVotes - minVotes == 1)
scdb_clean
```
# A tibble: 9,299 × 53
caseId docketId caseIssuesId dateDecision decisionType usCite sctCite ledCite
<chr> <chr> <chr> <chr> <dbl> <chr> <chr> <chr>
1 1945-… 1945-00… 1945-001-01… 12/10/1945 1 326 U… 66 S. … 90 L. …
2 1945-… 1945-00… 1945-002-01… 12/3/1945 1 326 U… 66 S. … 90 L. …
3 1945-… 1945-00… 1945-003-01… 11/13/1945 1 326 U… 66 S. … 90 L. …
4 1945-… 1945-00… 1945-004-01… 11/13/1945 1 326 U… 66 S. … 90 L. …
5 1945-… 1945-00… 1945-005-01… 11/5/1945 1 326 U… 66 S. … 90 L. …
6 1945-… 1945-00… 1945-006-01… 11/5/1945 1 326 U… 66 S. … 90 L. …
7 1945-… 1945-00… 1945-007-01… 11/5/1945 2 326 U… 66 S. … 90 L. …
8 1945-… 1945-00… 1945-008-01… 11/5/1945 1 326 U… 66 S. … 90 L. …
9 1945-… 1945-00… 1945-009-01… 11/5/1945 1 326 U… 66 S. … 90 L. …
10 1945-… 1945-01… 1945-010-01… 12/10/1945 1 326 U… 66 S. … 90 L. …
# ℹ 9,289 more rows
# ℹ 45 more variables: lexisCite <chr>, term <dbl>, naturalCourt <dbl>,
# chief <chr>, docket <chr>, caseName <chr>, dateArgument <chr>,
# dateRearg <chr>, petitioner <dbl>, petitionerState <dbl>, respondent <dbl>,
# respondentState <dbl>, jurisdiction <dbl>, adminAction <dbl>,
# adminActionState <dbl>, threeJudgeFdc <dbl>, caseOrigin <dbl>,
# caseOriginState <dbl>, caseSource <dbl>, caseSourceState <dbl>, …
```{r}
#| label: processed-data-cache
#| cache: true
#| dependson: raw-data-cache
scdb_clean <- scdb_case |>
mutate(one_vote = majVotes - minVotes == 1)
scdb_clean
```
# A tibble: 29,021 × 53
caseId docketId caseIssuesId dateDecision decisionType usCite sctCite ledCite
<chr> <chr> <chr> <chr> <dbl> <chr> <chr> <chr>
1 1791-… 1791-00… 1791-001-01… 8/3/1791 6 2 U.S… <NA> 1 L. E…
2 1791-… 1791-00… 1791-002-01… 8/3/1791 2 2 U.S… <NA> 1 L. E…
3 1792-… 1792-00… 1792-001-01… 2/14/1792 2 2 U.S… <NA> 1 L. E…
4 1792-… 1792-00… 1792-002-01… 8/7/1792 2 2 U.S… <NA> 1 L. E…
5 1792-… 1792-00… 1792-003-01… 8/11/1792 8 2 U.S… <NA> 1 L. E…
6 1792-… 1792-00… 1792-004-01… 8/11/1792 6 2 U.S… <NA> 1 L. E…
7 1793-… 1793-00… 1793-001-01… 2/19/1793 8 2 U.S… <NA> 1 L. E…
8 1793-… 1793-00… 1793-002-01… 2/20/1793 2 2 U.S… <NA> 1 L. E…
9 1793-… 1793-00… 1793-003-01… 2/20/1793 8 2 U.S… <NA> 1 L. E…
10 1794-… 1794-00… 1794-001-01… 2/7/1794 NA 3 U.S… <NA> 1 L. E…
# ℹ 29,011 more rows
# ℹ 45 more variables: lexisCite <chr>, term <dbl>, naturalCourt <dbl>,
# chief <chr>, docket <chr>, caseName <chr>, dateArgument <chr>,
# dateRearg <chr>, petitioner <dbl>, petitionerState <dbl>, respondent <dbl>,
# respondentState <dbl>, jurisdiction <dbl>, adminAction <dbl>,
# adminActionState <dbl>, threeJudgeFdc <dbl>, caseOrigin <dbl>,
# caseOriginState <dbl>, caseSource <dbl>, caseSourceState <dbl>, …
ae-09
ae-09
(repo name will be suffixed with your GitHub name).renv::restore()
to install the required packages, open the Quarto document in the repo, and follow along and complete the exercises."rsq"
and render - what happens to the output document?10:00
Examples include:
Actions that:
Data preparation can be performed in a script.
/data
|-- coffee_survey.csv
/output
|-- coffee_survey_clean.rds
clean-coffee.R
Our client ran a survey to better understand the preferences of potential customers for their new line of coffee. The survey includes questions about the potential customers’ coffee preferences, demographics, and coffee consumption habits, as well as taste test results of four varieties of coffee.
Our client wants to host a survey on their website that potential customers can complete to get a recommendation for which of the four coffees they would like best. What features in the existing survey dataset can be used to predict this outcome? What features should not be used?
04:00
Select (keep) all relevant features that can be used for predicting prefer_overall
.
08:00
Rows: 4,042
Columns: 31
$ age <chr> "18-24 y…
$ cups <chr> NA, NA, …
$ where_drink <chr> NA, NA, …
$ brew <chr> NA, "Pod…
$ purchase <chr> NA, NA, …
$ favorite <chr> "Regular…
$ additions <chr> "No - ju…
$ dairy <chr> NA, NA, …
$ sweetener <chr> NA, NA, …
$ style <chr> "Complex…
$ strength <chr> NA, NA, …
$ roast_level <chr> NA, NA, …
$ caffeine <chr> NA, NA, …
$ expertise <dbl> NA, NA, …
$ prefer_overall <chr> NA, NA, …
$ wfh <chr> NA, NA, …
$ total_spend <chr> NA, NA, …
$ why_drink <chr> NA, NA, …
$ taste <chr> NA, NA, …
$ know_source <chr> NA, NA, …
$ most_paid <chr> NA, NA, …
$ most_willing <chr> NA, NA, …
$ value_cafe <chr> NA, NA, …
$ spent_equipment <chr> NA, NA, …
$ value_equipment <chr> NA, NA, …
$ gender <chr> NA, NA, …
$ education_level <chr> NA, NA, …
$ ethnicity_race <chr> NA, NA, …
$ employment_status <chr> NA, NA, …
$ number_children <chr> NA, NA, …
$ political_affiliation <chr> NA, NA, …
Convert all single-response categorical predictors to factors. Ensure levels are appropriately ordered for ordinal variables.
08:00
coffee_survey |>
select(
-submission_id,
-ends_with("other"),
-ends_with("specify"),
-starts_with("coffee"),
-prefer_abc, -prefer_ad
) |>
# convert categorical variables with single answer to factors
mutate(
# convert all initially with factor()
across(
.cols = c(
age, cups, favorite, style, strength, roast_level, caffeine, prefer_overall,
wfh, total_spend, taste:political_affiliation
),
.fns = factor
),
# adjust order of levels for ordinal variables
age = fct_relevel(
.f = age,
"<18 years old",
"18-24 years old",
"25-34 years old",
"35-44 years old",
"45-54 years old",
"55-64 years old",
">65 years old"
),
cups = fct_relevel(
.f = cups,
"Less than 1", "1", "2", "3", "4", "More than 4"
),
strength = fct_relevel(
.f = strength,
"Weak", "Somewhat light", "Medium", "Somewhat strong", "Very strong"
),
caffeine = fct_relevel(
.f = caffeine,
"Decaf", "Half caff", "Full caffeine"
),
wfh = fct_relevel(
.f = wfh,
"I primarily work from home", "I do a mix of both", "I primarily work in person"
),
total_spend = fct_relevel(
.f = total_spend,
"<$20"
) |>
fct_relevel(
">$100",
after = 5L
),
across(
.cols = c(most_paid, most_willing),
.fns = \(x) fct_relevel(
.f = x,
"Less than $2", "$2-$4", "$4-$6", "$6-$8", "$8-$10", "$10-$15", "$15-$20",
"More than $20"
)
),
spent_equipment = fct_relevel(
.f = spent_equipment,
"Less than $20", "$20-$50", "$50-$100", "$100-$300", "$300-$500",
"$500-$1000", "More than $1,000"
),
education_level = fct_relevel(
.f = education_level,
"Less than high school", "High school graduate", "Some college or associate's degree",
"Bachelor's degree", "Master's degree", "Doctorate or professional degree"
),
number_children = fct_relevel(
.f = number_children,
"None", "1", "2", "3", "More than 3"
),
) |>
glimpse()
Rows: 4,042
Columns: 31
$ age <fct> 18-24 ye…
$ cups <fct> NA, NA, …
$ where_drink <chr> NA, NA, …
$ brew <chr> NA, "Pod…
$ purchase <chr> NA, NA, …
$ favorite <fct> Regular …
$ additions <chr> "No - ju…
$ dairy <chr> NA, NA, …
$ sweetener <chr> NA, NA, …
$ style <fct> Complex,…
$ strength <fct> NA, NA, …
$ roast_level <fct> NA, NA, …
$ caffeine <fct> NA, NA, …
$ expertise <dbl> NA, NA, …
$ prefer_overall <fct> NA, NA, …
$ wfh <fct> NA, NA, …
$ total_spend <fct> NA, NA, …
$ why_drink <chr> NA, NA, …
$ taste <fct> NA, NA, …
$ know_source <fct> NA, NA, …
$ most_paid <fct> NA, NA, …
$ most_willing <fct> NA, NA, …
$ value_cafe <fct> NA, NA, …
$ spent_equipment <fct> NA, NA, …
$ value_equipment <fct> NA, NA, …
$ gender <fct> NA, NA, …
$ education_level <fct> NA, NA, …
$ ethnicity_race <fct> NA, NA, …
$ employment_status <fct> NA, NA, …
$ number_children <fct> NA, NA, …
$ political_affiliation <fct> NA, NA, …
Multiple-response categorical predictors need to be converted to one column-per-response. Use {tidyr} to convert these columns.
10:00
coffee_survey |>
select(
-submission_id,
-ends_with("other"),
-ends_with("specify"),
-starts_with("coffee"),
-prefer_abc, -prefer_ad
) |>
# convert categorical variables with single answer to factors
mutate(
# convert all initially with factor()
across(
.cols = c(
age, cups, favorite, style, strength, roast_level, caffeine, prefer_overall,
wfh, total_spend, taste:political_affiliation
),
.fns = factor
),
# adjust order of levels for ordinal variables
age = fct_relevel(
.f = age,
"<18 years old",
"18-24 years old",
"25-34 years old",
"35-44 years old",
"45-54 years old",
"55-64 years old",
">65 years old"
),
cups = fct_relevel(
.f = cups,
"Less than 1", "1", "2", "3", "4", "More than 4"
),
strength = fct_relevel(
.f = strength,
"Weak", "Somewhat light", "Medium", "Somewhat strong", "Very strong"
),
caffeine = fct_relevel(
.f = caffeine,
"Decaf", "Half caff", "Full caffeine"
),
wfh = fct_relevel(
.f = wfh,
"I primarily work from home", "I do a mix of both", "I primarily work in person"
),
total_spend = fct_relevel(
.f = total_spend,
"<$20"
) |>
fct_relevel(
">$100",
after = 5L
),
across(
.cols = c(most_paid, most_willing),
.fns = \(x) fct_relevel(
.f = x,
"Less than $2", "$2-$4", "$4-$6", "$6-$8", "$8-$10", "$10-$15", "$15-$20",
"More than $20"
)
),
spent_equipment = fct_relevel(
.f = spent_equipment,
"Less than $20", "$20-$50", "$50-$100", "$100-$300", "$300-$500",
"$500-$1000", "More than $1,000"
),
education_level = fct_relevel(
.f = education_level,
"Less than high school", "High school graduate", "Some college or associate's degree",
"Bachelor's degree", "Master's degree", "Doctorate or professional degree"
),
number_children = fct_relevel(
.f = number_children,
"None", "1", "2", "3", "More than 3"
),
) |>
# fix additions and purchase columns to ensure commas are only used to separate values
mutate(
additions = str_replace_all(
string = additions,
pattern = "Milk, dairy alternative, or coffee creamer",
replacement = "Milk dairy alternative or coffee creamer"
),
purchase = str_replace_all(
string = purchase,
pattern = "National chain \\(e.g. Starbucks, Dunkin\\)",
replacement = "National chain \\(e.g. Starbucks Dunkin\\)"
)
) |>
# separate columns with multiple answers into one column per answer
separate_wider_delim(
cols = c(
where_drink, brew, purchase,
additions, dairy, sweetener, why_drink
),
delim = ", ",
names_sep = "_",
too_few = "align_start"
) |>
# convert multi-answer columns to factors
mutate(
across(
.cols = c(starts_with("where_drink"), starts_with("brew"),
starts_with("purchase"), starts_with("additions"),
starts_with("dairy"), starts_with("sweetener"),
starts_with("why_drink")),
.fns = factor
)
) |>
glimpse()
Rows: 4,042
Columns: 64
$ age <fct> 18-24 ye…
$ cups <fct> NA, NA, …
$ where_drink_1 <fct> NA, NA, …
$ where_drink_2 <fct> NA, NA, …
$ where_drink_3 <fct> NA, NA, …
$ where_drink_4 <fct> NA, NA, …
$ brew_1 <fct> NA, Pod/…
$ brew_2 <fct> NA, NA, …
$ brew_3 <fct> NA, NA, …
$ brew_4 <fct> NA, NA, …
$ brew_5 <fct> NA, NA, …
$ brew_6 <fct> NA, NA, …
$ brew_7 <fct> NA, NA, …
$ purchase_1 <fct> NA, NA, …
$ purchase_2 <fct> NA, NA, …
$ purchase_3 <fct> NA, NA, …
$ purchase_4 <fct> NA, NA, …
$ purchase_5 <fct> NA, NA, …
$ favorite <fct> Regular …
$ additions_1 <fct> No - jus…
$ additions_2 <fct> NA, No -…
$ additions_3 <fct> NA, NA, …
$ additions_4 <fct> NA, NA, …
$ additions_5 <fct> NA, NA, …
$ dairy_1 <fct> NA, NA, …
$ dairy_2 <fct> NA, NA, …
$ dairy_3 <fct> NA, NA, …
$ dairy_4 <fct> NA, NA, …
$ dairy_5 <fct> NA, NA, …
$ dairy_6 <fct> NA, NA, …
$ dairy_7 <fct> NA, NA, …
$ dairy_8 <fct> NA, NA, …
$ sweetener_1 <fct> NA, NA, …
$ sweetener_2 <fct> NA, NA, …
$ sweetener_3 <fct> NA, NA, …
$ sweetener_4 <fct> NA, NA, …
$ sweetener_5 <fct> NA, NA, …
$ sweetener_6 <fct> NA, NA, …
$ style <fct> Complex,…
$ strength <fct> NA, NA, …
$ roast_level <fct> NA, NA, …
$ caffeine <fct> NA, NA, …
$ expertise <dbl> NA, NA, …
$ prefer_overall <fct> NA, NA, …
$ wfh <fct> NA, NA, …
$ total_spend <fct> NA, NA, …
$ why_drink_1 <fct> NA, NA, …
$ why_drink_2 <fct> NA, NA, …
$ why_drink_3 <fct> NA, NA, …
$ why_drink_4 <fct> NA, NA, …
$ why_drink_5 <fct> NA, NA, …
$ taste <fct> NA, NA, …
$ know_source <fct> NA, NA, …
$ most_paid <fct> NA, NA, …
$ most_willing <fct> NA, NA, …
$ value_cafe <fct> NA, NA, …
$ spent_equipment <fct> NA, NA, …
$ value_equipment <fct> NA, NA, …
$ gender <fct> NA, NA, …
$ education_level <fct> NA, NA, …
$ ethnicity_race <fct> NA, NA, …
$ employment_status <fct> NA, NA, …
$ number_children <fct> NA, NA, …
$ political_affiliation <fct> NA, NA, …
Drop any rows with missing values for the outcome of interest prefer_overall
. Save the resulting data frame as an RDS file to preserve all the factor levels.
10:00
coffee_survey |>
select(
-submission_id,
-ends_with("other"),
-ends_with("specify"),
-starts_with("coffee"),
-prefer_abc, -prefer_ad
) |>
# convert categorical variables with single answer to factors
mutate(
# convert all initially with factor()
across(
.cols = c(
age, cups, favorite, style, strength, roast_level, caffeine, prefer_overall,
wfh, total_spend, taste:political_affiliation
),
.fns = factor
),
# adjust order of levels for ordinal variables
age = fct_relevel(
.f = age,
"<18 years old",
"18-24 years old",
"25-34 years old",
"35-44 years old",
"45-54 years old",
"55-64 years old",
">65 years old"
),
cups = fct_relevel(
.f = cups,
"Less than 1", "1", "2", "3", "4", "More than 4"
),
strength = fct_relevel(
.f = strength,
"Weak", "Somewhat light", "Medium", "Somewhat strong", "Very strong"
),
caffeine = fct_relevel(
.f = caffeine,
"Decaf", "Half caff", "Full caffeine"
),
wfh = fct_relevel(
.f = wfh,
"I primarily work from home", "I do a mix of both", "I primarily work in person"
),
total_spend = fct_relevel(
.f = total_spend,
"<$20"
) |>
fct_relevel(
">$100",
after = 5L
),
across(
.cols = c(most_paid, most_willing),
.fns = \(x) fct_relevel(
.f = x,
"Less than $2", "$2-$4", "$4-$6", "$6-$8", "$8-$10", "$10-$15", "$15-$20",
"More than $20"
)
),
spent_equipment = fct_relevel(
.f = spent_equipment,
"Less than $20", "$20-$50", "$50-$100", "$100-$300", "$300-$500",
"$500-$1000", "More than $1,000"
),
education_level = fct_relevel(
.f = education_level,
"Less than high school", "High school graduate", "Some college or associate's degree",
"Bachelor's degree", "Master's degree", "Doctorate or professional degree"
),
number_children = fct_relevel(
.f = number_children,
"None", "1", "2", "3", "More than 3"
),
) |>
# fix additions and purchase columns to ensure commas are only used to separate values
mutate(
additions = str_replace_all(
string = additions,
pattern = "Milk, dairy alternative, or coffee creamer",
replacement = "Milk dairy alternative or coffee creamer"
),
purchase = str_replace_all(
string = purchase,
pattern = "National chain \\(e.g. Starbucks, Dunkin\\)",
replacement = "National chain \\(e.g. Starbucks Dunkin\\)"
)
) |>
# separate columns with multiple answers into one column per answer
separate_wider_delim(
cols = c(
where_drink, brew, purchase,
additions, dairy, sweetener, why_drink
),
delim = ", ",
names_sep = "_",
too_few = "align_start"
) |>
# convert multi-answer columns to factors
mutate(
across(
.cols = c(starts_with("where_drink"), starts_with("brew"),
starts_with("purchase"), starts_with("additions"),
starts_with("dairy"), starts_with("sweetener"),
starts_with("why_drink")),
.fns = factor
)
) |>
# drop rows with no prefer_overall
drop_na(prefer_overall) |>
write_rds(file = "data/coffee_survey_clean.rds") |>
glimpse()
Rows: 3,770
Columns: 64
$ age <fct> <18 year…
$ cups <fct> 3, 3, 1,…
$ where_drink_1 <fct> At home,…
$ where_drink_2 <fct> At the o…
$ where_drink_3 <fct> At a caf…
$ where_drink_4 <fct> NA, NA, …
$ brew_1 <fct> Pour ove…
$ brew_2 <fct> Espresso…
$ brew_3 <fct> Instant …
$ brew_4 <fct> NA, NA, …
$ brew_5 <fct> NA, NA, …
$ brew_6 <fct> NA, NA, …
$ brew_7 <fct> NA, NA, …
$ purchase_1 <fct> NA, NA, …
$ purchase_2 <fct> NA, NA, …
$ purchase_3 <fct> NA, NA, …
$ purchase_4 <fct> NA, NA, …
$ purchase_5 <fct> NA, NA, …
$ favorite <fct> Pourover…
$ additions_1 <fct> No - jus…
$ additions_2 <fct> Milk dai…
$ additions_3 <fct> NA, NA, …
$ additions_4 <fct> NA, NA, …
$ additions_5 <fct> NA, NA, …
$ dairy_1 <fct> Whole mi…
$ dairy_2 <fct> NA, NA, …
$ dairy_3 <fct> NA, NA, …
$ dairy_4 <fct> NA, NA, …
$ dairy_5 <fct> NA, NA, …
$ dairy_6 <fct> NA, NA, …
$ dairy_7 <fct> NA, NA, …
$ dairy_8 <fct> NA, NA, …
$ sweetener_1 <fct> NA, NA, …
$ sweetener_2 <fct> NA, NA, …
$ sweetener_3 <fct> NA, NA, …
$ sweetener_4 <fct> NA, NA, …
$ sweetener_5 <fct> NA, NA, …
$ sweetener_6 <fct> NA, NA, …
$ style <fct> Bright, …
$ strength <fct> Medium, …
$ roast_level <fct> Light, B…
$ caffeine <fct> Full caf…
$ expertise <dbl> 10, 7, 6…
$ prefer_overall <fct> Coffee B…
$ wfh <fct> I primar…
$ total_spend <fct> >$100, N…
$ why_drink_1 <fct> NA, NA, …
$ why_drink_2 <fct> NA, NA, …
$ why_drink_3 <fct> NA, NA, …
$ why_drink_4 <fct> NA, NA, …
$ why_drink_5 <fct> NA, NA, …
$ taste <fct> NA, NA, …
$ know_source <fct> NA, NA, …
$ most_paid <fct> NA, NA, …
$ most_willing <fct> NA, NA, …
$ value_cafe <fct> NA, NA, …
$ spent_equipment <fct> NA, NA, …
$ value_equipment <fct> NA, NA, …
$ gender <fct> Other (p…
$ education_level <fct> Bachelor…
$ ethnicity_race <fct> Other (p…
$ employment_status <fct> Employed…
$ number_children <fct> More tha…
$ political_affiliation <fct> Democrat…