Data preparation

Lecture 10

Dr. Benjamin Soltoff

Cornell University
INFO 4940/5940 - Fall 2024

October 1, 2024

Announcements

Announcements

  • Homework 02

Learning objectives

  • Review how to use code caching to improve rendering times for Quarto files
  • Implement approaches to handling outliers and missingness in data
  • Standardize categorical variables and clean character strings
  • Implement data cleaning steps reproducibly using standalone scripts

Code chunk options

Code chunks

```{r}
#| label: clean-data
#| message: false
#| warning: false
#| cache: true

# code goes here
```
  • Labeling code chunks
  • Code chunk options
  • include: true
  • echo: true
  • message: true or warning: true
  • cache: true

Caching with dependencies

```{r}
#| cache: true
scdb_case <- read_csv("data/scdb-case.csv") |>
  filter(term >= 1945)
```
```{r}
#| cache: true
scdb_clean <- scdb_case |> 
  mutate(one_vote = majVotes - minVotes == 1)
scdb_clean
```
# A tibble: 9,299 × 53
   caseId docketId caseIssuesId dateDecision decisionType usCite sctCite ledCite
   <chr>  <chr>    <chr>        <chr>               <dbl> <chr>  <chr>   <chr>  
 1 1945-… 1945-00… 1945-001-01… 12/10/1945              1 326 U… 66 S. … 90 L. …
 2 1945-… 1945-00… 1945-002-01… 12/3/1945               1 326 U… 66 S. … 90 L. …
 3 1945-… 1945-00… 1945-003-01… 11/13/1945              1 326 U… 66 S. … 90 L. …
 4 1945-… 1945-00… 1945-004-01… 11/13/1945              1 326 U… 66 S. … 90 L. …
 5 1945-… 1945-00… 1945-005-01… 11/5/1945               1 326 U… 66 S. … 90 L. …
 6 1945-… 1945-00… 1945-006-01… 11/5/1945               1 326 U… 66 S. … 90 L. …
 7 1945-… 1945-00… 1945-007-01… 11/5/1945               2 326 U… 66 S. … 90 L. …
 8 1945-… 1945-00… 1945-008-01… 11/5/1945               1 326 U… 66 S. … 90 L. …
 9 1945-… 1945-00… 1945-009-01… 11/5/1945               1 326 U… 66 S. … 90 L. …
10 1945-… 1945-01… 1945-010-01… 12/10/1945              1 326 U… 66 S. … 90 L. …
# ℹ 9,289 more rows
# ℹ 45 more variables: lexisCite <chr>, term <dbl>, naturalCourt <dbl>,
#   chief <chr>, docket <chr>, caseName <chr>, dateArgument <chr>,
#   dateRearg <chr>, petitioner <dbl>, petitionerState <dbl>, respondent <dbl>,
#   respondentState <dbl>, jurisdiction <dbl>, adminAction <dbl>,
#   adminActionState <dbl>, threeJudgeFdc <dbl>, caseOrigin <dbl>,
#   caseOriginState <dbl>, caseSource <dbl>, caseSourceState <dbl>, …
```{r}
#| cache: true
scdb_case <- read_csv("data/scdb-case.csv")
```
```{r}
#| cache: true
scdb_clean <- scdb_case |> 
  mutate(one_vote = majVotes - minVotes == 1)
scdb_clean
```
# A tibble: 9,299 × 53
   caseId docketId caseIssuesId dateDecision decisionType usCite sctCite ledCite
   <chr>  <chr>    <chr>        <chr>               <dbl> <chr>  <chr>   <chr>  
 1 1945-… 1945-00… 1945-001-01… 12/10/1945              1 326 U… 66 S. … 90 L. …
 2 1945-… 1945-00… 1945-002-01… 12/3/1945               1 326 U… 66 S. … 90 L. …
 3 1945-… 1945-00… 1945-003-01… 11/13/1945              1 326 U… 66 S. … 90 L. …
 4 1945-… 1945-00… 1945-004-01… 11/13/1945              1 326 U… 66 S. … 90 L. …
 5 1945-… 1945-00… 1945-005-01… 11/5/1945               1 326 U… 66 S. … 90 L. …
 6 1945-… 1945-00… 1945-006-01… 11/5/1945               1 326 U… 66 S. … 90 L. …
 7 1945-… 1945-00… 1945-007-01… 11/5/1945               2 326 U… 66 S. … 90 L. …
 8 1945-… 1945-00… 1945-008-01… 11/5/1945               1 326 U… 66 S. … 90 L. …
 9 1945-… 1945-00… 1945-009-01… 11/5/1945               1 326 U… 66 S. … 90 L. …
10 1945-… 1945-01… 1945-010-01… 12/10/1945              1 326 U… 66 S. … 90 L. …
# ℹ 9,289 more rows
# ℹ 45 more variables: lexisCite <chr>, term <dbl>, naturalCourt <dbl>,
#   chief <chr>, docket <chr>, caseName <chr>, dateArgument <chr>,
#   dateRearg <chr>, petitioner <dbl>, petitionerState <dbl>, respondent <dbl>,
#   respondentState <dbl>, jurisdiction <dbl>, adminAction <dbl>,
#   adminActionState <dbl>, threeJudgeFdc <dbl>, caseOrigin <dbl>,
#   caseOriginState <dbl>, caseSource <dbl>, caseSourceState <dbl>, …

Label your chunks

```{r}
#| label: raw-data-cache
#| cache: true
scdb_case <- read_csv("data/scdb-case.csv")
```
```{r}
#| label: processed-data-cache
#| cache: true
#| dependson: raw-data-cache
scdb_clean <- scdb_case |> 
  mutate(one_vote = majVotes - minVotes == 1)
scdb_clean
```
# A tibble: 29,021 × 53
   caseId docketId caseIssuesId dateDecision decisionType usCite sctCite ledCite
   <chr>  <chr>    <chr>        <chr>               <dbl> <chr>  <chr>   <chr>  
 1 1791-… 1791-00… 1791-001-01… 8/3/1791                6 2 U.S… <NA>    1 L. E…
 2 1791-… 1791-00… 1791-002-01… 8/3/1791                2 2 U.S… <NA>    1 L. E…
 3 1792-… 1792-00… 1792-001-01… 2/14/1792               2 2 U.S… <NA>    1 L. E…
 4 1792-… 1792-00… 1792-002-01… 8/7/1792                2 2 U.S… <NA>    1 L. E…
 5 1792-… 1792-00… 1792-003-01… 8/11/1792               8 2 U.S… <NA>    1 L. E…
 6 1792-… 1792-00… 1792-004-01… 8/11/1792               6 2 U.S… <NA>    1 L. E…
 7 1793-… 1793-00… 1793-001-01… 2/19/1793               8 2 U.S… <NA>    1 L. E…
 8 1793-… 1793-00… 1793-002-01… 2/20/1793               2 2 U.S… <NA>    1 L. E…
 9 1793-… 1793-00… 1793-003-01… 2/20/1793               8 2 U.S… <NA>    1 L. E…
10 1794-… 1794-00… 1794-001-01… 2/7/1794               NA 3 U.S… <NA>    1 L. E…
# ℹ 29,011 more rows
# ℹ 45 more variables: lexisCite <chr>, term <dbl>, naturalCourt <dbl>,
#   chief <chr>, docket <chr>, caseName <chr>, dateArgument <chr>,
#   dateRearg <chr>, petitioner <dbl>, petitionerState <dbl>, respondent <dbl>,
#   respondentState <dbl>, jurisdiction <dbl>, adminAction <dbl>,
#   adminActionState <dbl>, threeJudgeFdc <dbl>, caseOrigin <dbl>,
#   caseOriginState <dbl>, caseSource <dbl>, caseSourceState <dbl>, …

Caching guidelines

  • Label your code chunks
  • Define dependencies
  • Never cache chunks that load packages

Application exercise

ae-09

  • Go to the course GitHub org and find your ae-09 (repo name will be suffixed with your GitHub name).
  • Clone the repo in RStudio, run renv::restore() to install the required packages, open the Quarto document in the repo, and follow along and complete the exercises.
  • Render, commit, and push your edits by the AE deadline – end of the day

⏱️ Your turn

  • Render the Quarto document - what files are generated in the repo?
  • Change the document to show the best tuning parameters based on "rsq" and render - what happens to the output document?
  • Change the document to implement 10-fold CV instead of 5-fold and render - how does this change the tuning results?
10:00

Data preparation

Data preparation

  • Some stuff needs to be done before we do any modeling (e.g. {dplyr}, {tidyr}, {tidyverse})
  • Some stuff is done as part of the modeling workflow (i.e. {recipes})

What happens before the modeling?

Examples include:

  • Restructure/combine data sets to get into tidy form
  • Standardize variable names
  • Drop columns that are not used for modeling
  • Clean and standardize character strings for categorical variables
  • Convert categorical variables to factors
  • Drop rows with missing values for outcome of interest
  • Transformations of the outcome of interest (e.g. log transform)

What things should we do before the modeling?

Actions that:

What happens during the modeling?

  • Data preprocessing
  • Feature engineering

How to perform data preparation

Data preparation can be performed in a script.

Save the resulting object to disk - don’t replace the original file(s)
/data
|-- coffee_survey.csv
/output
|-- coffee_survey_clean.rds
clean-coffee.R
Use an appropriate file format
  • Don’t use CSVs

Apply data preparation to the entire dataset

Predicting coffee preferences

⏱️ Your turn

Our client ran a survey to better understand the preferences of potential customers for their new line of coffee. The survey includes questions about the potential customers’ coffee preferences, demographics, and coffee consumption habits, as well as taste test results of four varieties of coffee.

Our client wants to host a survey on their website that potential customers can complete to get a recommendation for which of the four coffees they would like best. What features in the existing survey dataset can be used to predict this outcome? What features should not be used?

04:00

⏱️ Your turn

Select (keep) all relevant features that can be used for predicting prefer_overall.

08:00

coffee_survey |>
  select(
    -submission_id,
    -ends_with("other"),
    -ends_with("specify"),
    -starts_with("coffee"),
    -prefer_abc, -prefer_ad
  ) |>
  glimpse()
Rows: 4,042
Columns: 31
$ age                   <chr> "18-24 y…
$ cups                  <chr> NA, NA, …
$ where_drink           <chr> NA, NA, …
$ brew                  <chr> NA, "Pod…
$ purchase              <chr> NA, NA, …
$ favorite              <chr> "Regular…
$ additions             <chr> "No - ju…
$ dairy                 <chr> NA, NA, …
$ sweetener             <chr> NA, NA, …
$ style                 <chr> "Complex…
$ strength              <chr> NA, NA, …
$ roast_level           <chr> NA, NA, …
$ caffeine              <chr> NA, NA, …
$ expertise             <dbl> NA, NA, …
$ prefer_overall        <chr> NA, NA, …
$ wfh                   <chr> NA, NA, …
$ total_spend           <chr> NA, NA, …
$ why_drink             <chr> NA, NA, …
$ taste                 <chr> NA, NA, …
$ know_source           <chr> NA, NA, …
$ most_paid             <chr> NA, NA, …
$ most_willing          <chr> NA, NA, …
$ value_cafe            <chr> NA, NA, …
$ spent_equipment       <chr> NA, NA, …
$ value_equipment       <chr> NA, NA, …
$ gender                <chr> NA, NA, …
$ education_level       <chr> NA, NA, …
$ ethnicity_race        <chr> NA, NA, …
$ employment_status     <chr> NA, NA, …
$ number_children       <chr> NA, NA, …
$ political_affiliation <chr> NA, NA, …

⏱️ Your turn

Convert all single-response categorical predictors to factors. Ensure levels are appropriately ordered for ordinal variables.

08:00

coffee_survey |>
  select(
    -submission_id,
    -ends_with("other"),
    -ends_with("specify"),
    -starts_with("coffee"),
    -prefer_abc, -prefer_ad
  ) |>
  # convert categorical variables with single answer to factors
  mutate(
    # convert all initially with factor()
    across(
      .cols = c(
        age, cups, favorite, style, strength, roast_level, caffeine, prefer_overall,
        wfh, total_spend, taste:political_affiliation
      ),
      .fns = factor
    ),
    # adjust order of levels for ordinal variables
    age = fct_relevel(
      .f = age,
      "<18 years old",
      "18-24 years old",
      "25-34 years old",
      "35-44 years old",
      "45-54 years old",
      "55-64 years old",
      ">65 years old"
    ),
    cups = fct_relevel(
      .f = cups,
      "Less than 1", "1", "2", "3", "4", "More than 4"
    ),
    strength = fct_relevel(
      .f = strength,
      "Weak", "Somewhat light", "Medium", "Somewhat strong", "Very strong"
    ),
    caffeine = fct_relevel(
      .f = caffeine,
      "Decaf", "Half caff", "Full caffeine"
    ),
    wfh = fct_relevel(
      .f = wfh,
      "I primarily work from home", "I do a mix of both", "I primarily work in person"
    ),
    total_spend = fct_relevel(
      .f = total_spend,
      "<$20"
    ) |>
      fct_relevel(
        ">$100",
        after = 5L
      ),
    across(
      .cols = c(most_paid, most_willing),
      .fns = \(x) fct_relevel(
        .f = x,
        "Less than $2", "$2-$4", "$4-$6", "$6-$8", "$8-$10", "$10-$15", "$15-$20",
        "More than $20"
      )
    ),
    spent_equipment = fct_relevel(
      .f = spent_equipment,
      "Less than $20", "$20-$50", "$50-$100", "$100-$300", "$300-$500",
      "$500-$1000", "More than $1,000"
    ),
    education_level = fct_relevel(
      .f = education_level,
      "Less than high school", "High school graduate", "Some college or associate's degree",
      "Bachelor's degree", "Master's degree", "Doctorate or professional degree"
    ),
    number_children = fct_relevel(
      .f = number_children,
      "None", "1", "2", "3", "More than 3"
    ),
  ) |>
  glimpse()
Rows: 4,042
Columns: 31
$ age                   <fct> 18-24 ye…
$ cups                  <fct> NA, NA, …
$ where_drink           <chr> NA, NA, …
$ brew                  <chr> NA, "Pod…
$ purchase              <chr> NA, NA, …
$ favorite              <fct> Regular …
$ additions             <chr> "No - ju…
$ dairy                 <chr> NA, NA, …
$ sweetener             <chr> NA, NA, …
$ style                 <fct> Complex,…
$ strength              <fct> NA, NA, …
$ roast_level           <fct> NA, NA, …
$ caffeine              <fct> NA, NA, …
$ expertise             <dbl> NA, NA, …
$ prefer_overall        <fct> NA, NA, …
$ wfh                   <fct> NA, NA, …
$ total_spend           <fct> NA, NA, …
$ why_drink             <chr> NA, NA, …
$ taste                 <fct> NA, NA, …
$ know_source           <fct> NA, NA, …
$ most_paid             <fct> NA, NA, …
$ most_willing          <fct> NA, NA, …
$ value_cafe            <fct> NA, NA, …
$ spent_equipment       <fct> NA, NA, …
$ value_equipment       <fct> NA, NA, …
$ gender                <fct> NA, NA, …
$ education_level       <fct> NA, NA, …
$ ethnicity_race        <fct> NA, NA, …
$ employment_status     <fct> NA, NA, …
$ number_children       <fct> NA, NA, …
$ political_affiliation <fct> NA, NA, …

⏱️ Your turn

Multiple-response categorical predictors need to be converted to one column-per-response. Use {tidyr} to convert these columns.

10:00

coffee_survey |>
  select(
    -submission_id,
    -ends_with("other"),
    -ends_with("specify"),
    -starts_with("coffee"),
    -prefer_abc, -prefer_ad
  ) |>
  # convert categorical variables with single answer to factors
  mutate(
    # convert all initially with factor()
    across(
      .cols = c(
        age, cups, favorite, style, strength, roast_level, caffeine, prefer_overall,
        wfh, total_spend, taste:political_affiliation
      ),
      .fns = factor
    ),
    # adjust order of levels for ordinal variables
    age = fct_relevel(
      .f = age,
      "<18 years old",
      "18-24 years old",
      "25-34 years old",
      "35-44 years old",
      "45-54 years old",
      "55-64 years old",
      ">65 years old"
    ),
    cups = fct_relevel(
      .f = cups,
      "Less than 1", "1", "2", "3", "4", "More than 4"
    ),
    strength = fct_relevel(
      .f = strength,
      "Weak", "Somewhat light", "Medium", "Somewhat strong", "Very strong"
    ),
    caffeine = fct_relevel(
      .f = caffeine,
      "Decaf", "Half caff", "Full caffeine"
    ),
    wfh = fct_relevel(
      .f = wfh,
      "I primarily work from home", "I do a mix of both", "I primarily work in person"
    ),
    total_spend = fct_relevel(
      .f = total_spend,
      "<$20"
    ) |>
      fct_relevel(
        ">$100",
        after = 5L
      ),
    across(
      .cols = c(most_paid, most_willing),
      .fns = \(x) fct_relevel(
        .f = x,
        "Less than $2", "$2-$4", "$4-$6", "$6-$8", "$8-$10", "$10-$15", "$15-$20",
        "More than $20"
      )
    ),
    spent_equipment = fct_relevel(
      .f = spent_equipment,
      "Less than $20", "$20-$50", "$50-$100", "$100-$300", "$300-$500",
      "$500-$1000", "More than $1,000"
    ),
    education_level = fct_relevel(
      .f = education_level,
      "Less than high school", "High school graduate", "Some college or associate's degree",
      "Bachelor's degree", "Master's degree", "Doctorate or professional degree"
    ),
    number_children = fct_relevel(
      .f = number_children,
      "None", "1", "2", "3", "More than 3"
    ),
  ) |>
  # fix additions and purchase columns to ensure commas are only used to separate values
  mutate(
    additions = str_replace_all(
      string = additions,
      pattern = "Milk, dairy alternative, or coffee creamer",
      replacement = "Milk dairy alternative or coffee creamer"
    ),
    purchase = str_replace_all(
      string = purchase,
      pattern = "National chain \\(e.g. Starbucks, Dunkin\\)",
      replacement = "National chain \\(e.g. Starbucks Dunkin\\)"
    )
  ) |>
  # separate columns with multiple answers into one column per answer
  separate_wider_delim(
    cols = c(
      where_drink, brew, purchase,
      additions, dairy, sweetener, why_drink
    ),
    delim = ", ",
    names_sep = "_",
    too_few = "align_start"
  ) |>
  # convert multi-answer columns to factors
  mutate(
    across(
      .cols = c(starts_with("where_drink"), starts_with("brew"),
                 starts_with("purchase"), starts_with("additions"),
                 starts_with("dairy"), starts_with("sweetener"),
                 starts_with("why_drink")),
      .fns = factor
    )
  ) |>
  glimpse()
Rows: 4,042
Columns: 64
$ age                   <fct> 18-24 ye…
$ cups                  <fct> NA, NA, …
$ where_drink_1         <fct> NA, NA, …
$ where_drink_2         <fct> NA, NA, …
$ where_drink_3         <fct> NA, NA, …
$ where_drink_4         <fct> NA, NA, …
$ brew_1                <fct> NA, Pod/…
$ brew_2                <fct> NA, NA, …
$ brew_3                <fct> NA, NA, …
$ brew_4                <fct> NA, NA, …
$ brew_5                <fct> NA, NA, …
$ brew_6                <fct> NA, NA, …
$ brew_7                <fct> NA, NA, …
$ purchase_1            <fct> NA, NA, …
$ purchase_2            <fct> NA, NA, …
$ purchase_3            <fct> NA, NA, …
$ purchase_4            <fct> NA, NA, …
$ purchase_5            <fct> NA, NA, …
$ favorite              <fct> Regular …
$ additions_1           <fct> No - jus…
$ additions_2           <fct> NA, No -…
$ additions_3           <fct> NA, NA, …
$ additions_4           <fct> NA, NA, …
$ additions_5           <fct> NA, NA, …
$ dairy_1               <fct> NA, NA, …
$ dairy_2               <fct> NA, NA, …
$ dairy_3               <fct> NA, NA, …
$ dairy_4               <fct> NA, NA, …
$ dairy_5               <fct> NA, NA, …
$ dairy_6               <fct> NA, NA, …
$ dairy_7               <fct> NA, NA, …
$ dairy_8               <fct> NA, NA, …
$ sweetener_1           <fct> NA, NA, …
$ sweetener_2           <fct> NA, NA, …
$ sweetener_3           <fct> NA, NA, …
$ sweetener_4           <fct> NA, NA, …
$ sweetener_5           <fct> NA, NA, …
$ sweetener_6           <fct> NA, NA, …
$ style                 <fct> Complex,…
$ strength              <fct> NA, NA, …
$ roast_level           <fct> NA, NA, …
$ caffeine              <fct> NA, NA, …
$ expertise             <dbl> NA, NA, …
$ prefer_overall        <fct> NA, NA, …
$ wfh                   <fct> NA, NA, …
$ total_spend           <fct> NA, NA, …
$ why_drink_1           <fct> NA, NA, …
$ why_drink_2           <fct> NA, NA, …
$ why_drink_3           <fct> NA, NA, …
$ why_drink_4           <fct> NA, NA, …
$ why_drink_5           <fct> NA, NA, …
$ taste                 <fct> NA, NA, …
$ know_source           <fct> NA, NA, …
$ most_paid             <fct> NA, NA, …
$ most_willing          <fct> NA, NA, …
$ value_cafe            <fct> NA, NA, …
$ spent_equipment       <fct> NA, NA, …
$ value_equipment       <fct> NA, NA, …
$ gender                <fct> NA, NA, …
$ education_level       <fct> NA, NA, …
$ ethnicity_race        <fct> NA, NA, …
$ employment_status     <fct> NA, NA, …
$ number_children       <fct> NA, NA, …
$ political_affiliation <fct> NA, NA, …

⏱️ Your turn

Drop any rows with missing values for the outcome of interest prefer_overall. Save the resulting data frame as an RDS file to preserve all the factor levels.

10:00

coffee_survey |>
  select(
    -submission_id,
    -ends_with("other"),
    -ends_with("specify"),
    -starts_with("coffee"),
    -prefer_abc, -prefer_ad
  ) |>
  # convert categorical variables with single answer to factors
  mutate(
    # convert all initially with factor()
    across(
      .cols = c(
        age, cups, favorite, style, strength, roast_level, caffeine, prefer_overall,
        wfh, total_spend, taste:political_affiliation
      ),
      .fns = factor
    ),
    # adjust order of levels for ordinal variables
    age = fct_relevel(
      .f = age,
      "<18 years old",
      "18-24 years old",
      "25-34 years old",
      "35-44 years old",
      "45-54 years old",
      "55-64 years old",
      ">65 years old"
    ),
    cups = fct_relevel(
      .f = cups,
      "Less than 1", "1", "2", "3", "4", "More than 4"
    ),
    strength = fct_relevel(
      .f = strength,
      "Weak", "Somewhat light", "Medium", "Somewhat strong", "Very strong"
    ),
    caffeine = fct_relevel(
      .f = caffeine,
      "Decaf", "Half caff", "Full caffeine"
    ),
    wfh = fct_relevel(
      .f = wfh,
      "I primarily work from home", "I do a mix of both", "I primarily work in person"
    ),
    total_spend = fct_relevel(
      .f = total_spend,
      "<$20"
    ) |>
      fct_relevel(
        ">$100",
        after = 5L
      ),
    across(
      .cols = c(most_paid, most_willing),
      .fns = \(x) fct_relevel(
        .f = x,
        "Less than $2", "$2-$4", "$4-$6", "$6-$8", "$8-$10", "$10-$15", "$15-$20",
        "More than $20"
      )
    ),
    spent_equipment = fct_relevel(
      .f = spent_equipment,
      "Less than $20", "$20-$50", "$50-$100", "$100-$300", "$300-$500",
      "$500-$1000", "More than $1,000"
    ),
    education_level = fct_relevel(
      .f = education_level,
      "Less than high school", "High school graduate", "Some college or associate's degree",
      "Bachelor's degree", "Master's degree", "Doctorate or professional degree"
    ),
    number_children = fct_relevel(
      .f = number_children,
      "None", "1", "2", "3", "More than 3"
    ),
  ) |>
  # fix additions and purchase columns to ensure commas are only used to separate values
  mutate(
    additions = str_replace_all(
      string = additions,
      pattern = "Milk, dairy alternative, or coffee creamer",
      replacement = "Milk dairy alternative or coffee creamer"
    ),
    purchase = str_replace_all(
      string = purchase,
      pattern = "National chain \\(e.g. Starbucks, Dunkin\\)",
      replacement = "National chain \\(e.g. Starbucks Dunkin\\)"
    )
  ) |>
  # separate columns with multiple answers into one column per answer
  separate_wider_delim(
    cols = c(
      where_drink, brew, purchase,
      additions, dairy, sweetener, why_drink
    ),
    delim = ", ",
    names_sep = "_",
    too_few = "align_start"
  ) |>
  # convert multi-answer columns to factors
  mutate(
    across(
      .cols = c(starts_with("where_drink"), starts_with("brew"),
                 starts_with("purchase"), starts_with("additions"),
                 starts_with("dairy"), starts_with("sweetener"),
                 starts_with("why_drink")),
      .fns = factor
    )
  ) |>
  # drop rows with no prefer_overall
  drop_na(prefer_overall) |>
  write_rds(file = "data/coffee_survey_clean.rds") |>
  glimpse()
Rows: 3,770
Columns: 64
$ age                   <fct> <18 year…
$ cups                  <fct> 3, 3, 1,…
$ where_drink_1         <fct> At home,…
$ where_drink_2         <fct> At the o…
$ where_drink_3         <fct> At a caf…
$ where_drink_4         <fct> NA, NA, …
$ brew_1                <fct> Pour ove…
$ brew_2                <fct> Espresso…
$ brew_3                <fct> Instant …
$ brew_4                <fct> NA, NA, …
$ brew_5                <fct> NA, NA, …
$ brew_6                <fct> NA, NA, …
$ brew_7                <fct> NA, NA, …
$ purchase_1            <fct> NA, NA, …
$ purchase_2            <fct> NA, NA, …
$ purchase_3            <fct> NA, NA, …
$ purchase_4            <fct> NA, NA, …
$ purchase_5            <fct> NA, NA, …
$ favorite              <fct> Pourover…
$ additions_1           <fct> No - jus…
$ additions_2           <fct> Milk dai…
$ additions_3           <fct> NA, NA, …
$ additions_4           <fct> NA, NA, …
$ additions_5           <fct> NA, NA, …
$ dairy_1               <fct> Whole mi…
$ dairy_2               <fct> NA, NA, …
$ dairy_3               <fct> NA, NA, …
$ dairy_4               <fct> NA, NA, …
$ dairy_5               <fct> NA, NA, …
$ dairy_6               <fct> NA, NA, …
$ dairy_7               <fct> NA, NA, …
$ dairy_8               <fct> NA, NA, …
$ sweetener_1           <fct> NA, NA, …
$ sweetener_2           <fct> NA, NA, …
$ sweetener_3           <fct> NA, NA, …
$ sweetener_4           <fct> NA, NA, …
$ sweetener_5           <fct> NA, NA, …
$ sweetener_6           <fct> NA, NA, …
$ style                 <fct> Bright, …
$ strength              <fct> Medium, …
$ roast_level           <fct> Light, B…
$ caffeine              <fct> Full caf…
$ expertise             <dbl> 10, 7, 6…
$ prefer_overall        <fct> Coffee B…
$ wfh                   <fct> I primar…
$ total_spend           <fct> >$100, N…
$ why_drink_1           <fct> NA, NA, …
$ why_drink_2           <fct> NA, NA, …
$ why_drink_3           <fct> NA, NA, …
$ why_drink_4           <fct> NA, NA, …
$ why_drink_5           <fct> NA, NA, …
$ taste                 <fct> NA, NA, …
$ know_source           <fct> NA, NA, …
$ most_paid             <fct> NA, NA, …
$ most_willing          <fct> NA, NA, …
$ value_cafe            <fct> NA, NA, …
$ spent_equipment       <fct> NA, NA, …
$ value_equipment       <fct> NA, NA, …
$ gender                <fct> Other (p…
$ education_level       <fct> Bachelor…
$ ethnicity_race        <fct> Other (p…
$ employment_status     <fct> Employed…
$ number_children       <fct> More tha…
$ political_affiliation <fct> Democrat…

Wrap-up

Recap

  • Use code caching in Quarto to speed up rendering times
  • Data preparation occurs before the modeling process
  • Clean and standardize your data set to make it ready for modeling
  • Use standalone scripts to perform data preparation reproducibly