Data preparation

Lecture 12

Dr. Benjamin Soltoff

Cornell University
INFO 4940/5940 - Fall 2025

October 2, 2025

Announcements

Announcements

  • Homework 04
  • Project 01 proposal due next week

Learning objectives

  • Distinguish between data preparation and data preprocessing
  • Standardize categorical variables and clean character strings
  • Implement data cleaning steps reproducibly using standalone scripts

Data preparation

Data preparation

  • Some stuff needs to be done before we do any modeling (e.g. {dplyr}, {tidyr}, {tidyverse})
  • Some stuff is done as part of the modeling workflow (i.e. {recipes})

What happens before the modeling?

Examples include:

  • Restructure/combine data sets to get into tidy form
  • Standardize variable names
  • Drop columns that are not used for modeling
  • Clean and standardize character strings for categorical variables
  • Convert categorical variables to factors/Categoricals
  • Drop rows with missing values for outcome of interest
  • Transformations of the outcome of interest (e.g. log transform)

What things should we do before the modeling?

Actions that:

What happens during the modeling?

  • Data preprocessing
  • Feature engineering

How to perform data preparation

Data preparation can be performed in a script

Save the resulting object to disk - don’t replace the original file(s)

data-raw/
  coffee_survey.csv
data/
  coffee_survey_clean.feather
scripts/
  clean-coffee.R
  clean-coffee.py

Use appropriate file formats

Store data in formats that preserve data types and are interoperable between R and Python

Feather
  • Don’t use CSVs

Apply data preparation to the entire dataset

Predicting coffee preferences

📝 What can we use?

Instructions

Our client ran a survey to better understand the preferences of potential customers for their new line of coffee. The survey includes questions about the potential customers’ coffee preferences, demographics, and coffee consumption habits, as well as taste test results of four varieties of coffee.

Our client wants to know if it should recommend a new variety of coffee (coffee D) based on customer demographics, preferences, and ratings for a standardized set of three coffe varieties. What features should not be used in the model?

05:00

Keep the relevant features

coffee_survey |>
  select(
    -submission_id,
    -ends_with("other"),
    -ends_with("specify"),
    -ends_with("notes"),
    -coffee_d_bitterness,
    -coffee_d_acidity,
    -prefer_ad,
    -prefer_overall
  ) |>
  glimpse()
Rows: 4,042
Columns: 41
$ age                          <chr> "18-24 years old…
$ cups                         <chr> NA, NA, NA, NA, …
$ where_drink                  <chr> NA, NA, NA, NA, …
$ brew                         <chr> NA, "Pod/capsule…
$ purchase                     <chr> NA, NA, NA, NA, …
$ favorite                     <chr> "Regular drip co…
$ additions                    <chr> "No - just black…
$ dairy                        <chr> NA, NA, NA, NA, …
$ sweetener                    <chr> NA, NA, NA, NA, …
$ style                        <chr> "Complex", "Ligh…
$ strength                     <chr> NA, NA, NA, NA, …
$ roast_level                  <chr> NA, NA, NA, NA, …
$ caffeine                     <chr> NA, NA, NA, NA, …
$ expertise                    <dbl> NA, NA, NA, NA, …
$ coffee_a_bitterness          <dbl> NA, NA, NA, NA, …
$ coffee_a_acidity             <dbl> NA, NA, NA, NA, …
$ coffee_a_personal_preference <dbl> NA, NA, NA, NA, …
$ coffee_b_bitterness          <dbl> NA, NA, NA, NA, …
$ coffee_b_acidity             <dbl> NA, NA, NA, NA, …
$ coffee_b_personal_preference <dbl> NA, NA, NA, NA, …
$ coffee_c_bitterness          <dbl> NA, NA, NA, NA, …
$ coffee_c_acidity             <dbl> NA, NA, NA, NA, …
$ coffee_c_personal_preference <dbl> NA, NA, NA, NA, …
$ coffee_d_personal_preference <dbl> NA, NA, NA, NA, …
$ prefer_abc                   <chr> NA, NA, NA, NA, …
$ wfh                          <chr> NA, NA, NA, NA, …
$ total_spend                  <chr> NA, NA, NA, NA, …
$ why_drink                    <chr> NA, NA, NA, NA, …
$ taste                        <chr> NA, NA, NA, NA, …
$ know_source                  <chr> NA, NA, NA, NA, …
$ most_paid                    <chr> NA, NA, NA, NA, …
$ most_willing                 <chr> NA, NA, NA, NA, …
$ value_cafe                   <chr> NA, NA, NA, NA, …
$ spent_equipment              <chr> NA, NA, NA, NA, …
$ value_equipment              <chr> NA, NA, NA, NA, …
$ gender                       <chr> NA, NA, NA, NA, …
$ education_level              <chr> NA, NA, NA, NA, …
$ ethnicity_race               <chr> NA, NA, NA, NA, …
$ employment_status            <chr> NA, NA, NA, NA, …
$ number_children              <chr> NA, NA, NA, NA, …
$ political_affiliation        <chr> NA, NA, NA, NA, …

📝 Standardize single-response categorical predictors

# A tibble: 12 × 2
   favorite                             n
   <chr>                            <int>
 1 Americano                          249
 2 Blended drink (e.g. Frappuccino)    45
 3 Cappuccino                         341
 4 Cold brew                          109
 5 Cortado                            312
 6 Espresso                           330
 7 Iced coffee                        156
 8 Latte                              680
 9 Mocha                              118
10 Other                              114
11 Pourover                          1084
12 Regular drip coffee                442
# A tibble: 7 × 2
  spent_equipment      n
  <chr>            <int>
1 $100-$300          750
2 $20-$50            188
3 $300-$500          597
4 $50-$100           310
5 $500-$1000         642
6 Less than $20      239
7 More than $1,000   780

For each feature, identify how the data should be structured before we partition into training/test sets

Things to consider:

  • Ordinal or nominal?
  • How many categories?
  • How are values encoded in the data frame?
06:00

coffee_survey |>
  select(
    -submission_id,
    -ends_with("other"),
    -ends_with("specify"),
    -ends_with("notes"),
    -coffee_d_bitterness,
    -coffee_d_acidity,
    -prefer_ad,
    -prefer_overall
  ) |>
  # convert categorical variables with single answer to factors
  mutate(
    # convert all initially with factor()
    across(
      .cols = c(
        age,
        cups,
        favorite,
        style,
        strength,
        roast_level,
        caffeine,
        wfh,
        total_spend,
        taste:political_affiliation
      ),
      .fns = factor
    ),
    # adjust order of levels for ordinal variables
    age = fct_relevel(
      .f = age,
      "<18 years old",
      "18-24 years old",
      "25-34 years old",
      "35-44 years old",
      "45-54 years old",
      "55-64 years old",
      ">65 years old"
    ),
    cups = fct_relevel(
      .f = cups,
      "Less than 1",
      "1",
      "2",
      "3",
      "4",
      "More than 4"
    ),
    strength = fct_relevel(
      .f = strength,
      "Weak",
      "Somewhat light",
      "Medium",
      "Somewhat strong",
      "Very strong"
    ),
    caffeine = fct_relevel(
      .f = caffeine,
      "Decaf",
      "Half caff",
      "Full caffeine"
    ),
    wfh = fct_relevel(
      .f = wfh,
      "I primarily work from home",
      "I do a mix of both",
      "I primarily work in person"
    ),
    total_spend = fct_relevel(
      .f = total_spend,
      "<$20"
    ) |>
      fct_relevel(
        ">$100",
        after = 5L
      ),
    across(
      .cols = c(most_paid, most_willing),
      .fns = \(x) {
        fct_relevel(
          .f = x,
          "Less than $2",
          "$2-$4",
          "$4-$6",
          "$6-$8",
          "$8-$10",
          "$10-$15",
          "$15-$20",
          "More than $20"
        )
      }
    ),
    spent_equipment = fct_relevel(
      .f = spent_equipment,
      "Less than $20",
      "$20-$50",
      "$50-$100",
      "$100-$300",
      "$300-$500",
      "$500-$1000",
      "More than $1,000"
    ),
    education_level = fct_relevel(
      .f = education_level,
      "Less than high school",
      "High school graduate",
      "Some college or associate's degree",
      "Bachelor's degree",
      "Master's degree",
      "Doctorate or professional degree"
    ),
    number_children = fct_relevel(
      .f = number_children,
      "None",
      "1",
      "2",
      "3",
      "More than 3"
    ),
  ) |>
  glimpse()
Rows: 4,042
Columns: 41
$ age                          <fct> 18-24 years old,…
$ cups                         <fct> NA, NA, NA, NA, …
$ where_drink                  <chr> NA, NA, NA, NA, …
$ brew                         <chr> NA, "Pod/capsule…
$ purchase                     <chr> NA, NA, NA, NA, …
$ favorite                     <fct> Regular drip cof…
$ additions                    <chr> "No - just black…
$ dairy                        <chr> NA, NA, NA, NA, …
$ sweetener                    <chr> NA, NA, NA, NA, …
$ style                        <fct> Complex, Light, …
$ strength                     <fct> NA, NA, NA, NA, …
$ roast_level                  <fct> NA, NA, NA, NA, …
$ caffeine                     <fct> NA, NA, NA, NA, …
$ expertise                    <dbl> NA, NA, NA, NA, …
$ coffee_a_bitterness          <dbl> NA, NA, NA, NA, …
$ coffee_a_acidity             <dbl> NA, NA, NA, NA, …
$ coffee_a_personal_preference <dbl> NA, NA, NA, NA, …
$ coffee_b_bitterness          <dbl> NA, NA, NA, NA, …
$ coffee_b_acidity             <dbl> NA, NA, NA, NA, …
$ coffee_b_personal_preference <dbl> NA, NA, NA, NA, …
$ coffee_c_bitterness          <dbl> NA, NA, NA, NA, …
$ coffee_c_acidity             <dbl> NA, NA, NA, NA, …
$ coffee_c_personal_preference <dbl> NA, NA, NA, NA, …
$ coffee_d_personal_preference <dbl> NA, NA, NA, NA, …
$ prefer_abc                   <chr> NA, NA, NA, NA, …
$ wfh                          <fct> NA, NA, NA, NA, …
$ total_spend                  <fct> NA, NA, NA, NA, …
$ why_drink                    <chr> NA, NA, NA, NA, …
$ taste                        <fct> NA, NA, NA, NA, …
$ know_source                  <fct> NA, NA, NA, NA, …
$ most_paid                    <fct> NA, NA, NA, NA, …
$ most_willing                 <fct> NA, NA, NA, NA, …
$ value_cafe                   <fct> NA, NA, NA, NA, …
$ spent_equipment              <fct> NA, NA, NA, NA, …
$ value_equipment              <fct> NA, NA, NA, NA, …
$ gender                       <fct> NA, NA, NA, NA, …
$ education_level              <fct> NA, NA, NA, NA, …
$ ethnicity_race               <fct> NA, NA, NA, NA, …
$ employment_status            <fct> NA, NA, NA, NA, …
$ number_children              <fct> NA, NA, NA, NA, …
$ political_affiliation        <fct> NA, NA, NA, NA, …

How do you brew coffee at home?

Provided options

Bean-to-cup machine

Coffee brewing machine (e.g. Mr. Coffee)

Coffee extract (e.g. Cometeer)

Cold brew

Espresso

French press

Instant coffee

Other

Pod/capsule machine (e.g. Keurig/Nespresso)

Pour over

Recorded responses

# A tibble: 449 × 1
   brew                                                                                             
   <chr>                                                                                            
 1 Bean-to-cup machine                                                                              
 2 Bean-to-cup machine, Coffee extract (e.g. Cometeer)                                              
 3 Bean-to-cup machine, Coffee extract (e.g. Cometeer), Espresso, Other, Coffee brewing machine (e.…
 4 Bean-to-cup machine, Cold brew                                                                   
 5 Bean-to-cup machine, Espresso                                                                    
 6 Bean-to-cup machine, Espresso, Coffee brewing machine (e.g. Mr. Coffee)                          
 7 Bean-to-cup machine, French press                                                                
 8 Bean-to-cup machine, Other                                                                       
 9 Bean-to-cup machine, Pod/capsule machine (e.g. Keurig/Nespresso)                                 
10 Bean-to-cup machine, Pour over                                                                   
# ℹ 439 more rows

📝 Standardize multiple-response categorical predictors

Instructions

How should we prepare brew for the modeling phase? What should its structure be before we partition the data into training and test sets?

05:00

coffee_survey |>
  select(
    -submission_id,
    -ends_with("other"),
    -ends_with("specify"),
    -ends_with("notes"),
    -coffee_d_bitterness,
    -coffee_d_acidity,
    -prefer_ad,
    -prefer_overall
  ) |>
  # convert categorical variables with single answer to factors
  mutate(
    # convert all initially with factor()
    across(
      .cols = c(
        age,
        cups,
        favorite,
        style,
        strength,
        roast_level,
        caffeine,
        wfh,
        total_spend,
        taste:political_affiliation
      ),
      .fns = factor
    ),
    # adjust order of levels for ordinal variables
    age = fct_relevel(
      .f = age,
      "<18 years old",
      "18-24 years old",
      "25-34 years old",
      "35-44 years old",
      "45-54 years old",
      "55-64 years old",
      ">65 years old"
    ),
    cups = fct_relevel(
      .f = cups,
      "Less than 1",
      "1",
      "2",
      "3",
      "4",
      "More than 4"
    ),
    strength = fct_relevel(
      .f = strength,
      "Weak",
      "Somewhat light",
      "Medium",
      "Somewhat strong",
      "Very strong"
    ),
    caffeine = fct_relevel(
      .f = caffeine,
      "Decaf",
      "Half caff",
      "Full caffeine"
    ),
    wfh = fct_relevel(
      .f = wfh,
      "I primarily work from home",
      "I do a mix of both",
      "I primarily work in person"
    ),
    total_spend = fct_relevel(
      .f = total_spend,
      "<$20"
    ) |>
      fct_relevel(
        ">$100",
        after = 5L
      ),
    across(
      .cols = c(most_paid, most_willing),
      .fns = \(x) {
        fct_relevel(
          .f = x,
          "Less than $2",
          "$2-$4",
          "$4-$6",
          "$6-$8",
          "$8-$10",
          "$10-$15",
          "$15-$20",
          "More than $20"
        )
      }
    ),
    spent_equipment = fct_relevel(
      .f = spent_equipment,
      "Less than $20",
      "$20-$50",
      "$50-$100",
      "$100-$300",
      "$300-$500",
      "$500-$1000",
      "More than $1,000"
    ),
    education_level = fct_relevel(
      .f = education_level,
      "Less than high school",
      "High school graduate",
      "Some college or associate's degree",
      "Bachelor's degree",
      "Master's degree",
      "Doctorate or professional degree"
    ),
    number_children = fct_relevel(
      .f = number_children,
      "None",
      "1",
      "2",
      "3",
      "More than 3"
    ),
  ) |>
  # fix additions and purchase columns to ensure commas are only used to separate values
  mutate(
    additions = str_replace_all(
      string = additions,
      pattern = "Milk, dairy alternative, or coffee creamer",
      replacement = "Milk dairy alternative or coffee creamer"
    ),
    purchase = str_replace_all(
      string = purchase,
      pattern = "National chain \\(e.g. Starbucks, Dunkin\\)",
      replacement = "National chain \\(e.g. Starbucks Dunkin\\)"
    )
  ) |>
  # convert multi-answer columns to factors
  mutate(
    across(
      .cols = c(
        starts_with("where_drink"),
        starts_with("brew"),
        starts_with("purchase"),
        starts_with("additions"),
        starts_with("dairy"),
        starts_with("sweetener"),
        starts_with("why_drink")
      ),
      .fns = factor
    )
  ) |>
  glimpse()
Rows: 4,042
Columns: 41
$ age                          <fct> 18-24 years old,…
$ cups                         <fct> NA, NA, NA, NA, …
$ where_drink                  <fct> NA, NA, NA, NA, …
$ brew                         <fct> NA, "Pod/capsule…
$ purchase                     <fct> NA, NA, NA, NA, …
$ favorite                     <fct> Regular drip cof…
$ additions                    <fct> "No - just black…
$ dairy                        <fct> NA, NA, NA, NA, …
$ sweetener                    <fct> NA, NA, NA, NA, …
$ style                        <fct> Complex, Light, …
$ strength                     <fct> NA, NA, NA, NA, …
$ roast_level                  <fct> NA, NA, NA, NA, …
$ caffeine                     <fct> NA, NA, NA, NA, …
$ expertise                    <dbl> NA, NA, NA, NA, …
$ coffee_a_bitterness          <dbl> NA, NA, NA, NA, …
$ coffee_a_acidity             <dbl> NA, NA, NA, NA, …
$ coffee_a_personal_preference <dbl> NA, NA, NA, NA, …
$ coffee_b_bitterness          <dbl> NA, NA, NA, NA, …
$ coffee_b_acidity             <dbl> NA, NA, NA, NA, …
$ coffee_b_personal_preference <dbl> NA, NA, NA, NA, …
$ coffee_c_bitterness          <dbl> NA, NA, NA, NA, …
$ coffee_c_acidity             <dbl> NA, NA, NA, NA, …
$ coffee_c_personal_preference <dbl> NA, NA, NA, NA, …
$ coffee_d_personal_preference <dbl> NA, NA, NA, NA, …
$ prefer_abc                   <chr> NA, NA, NA, NA, …
$ wfh                          <fct> NA, NA, NA, NA, …
$ total_spend                  <fct> NA, NA, NA, NA, …
$ why_drink                    <fct> NA, NA, NA, NA, …
$ taste                        <fct> NA, NA, NA, NA, …
$ know_source                  <fct> NA, NA, NA, NA, …
$ most_paid                    <fct> NA, NA, NA, NA, …
$ most_willing                 <fct> NA, NA, NA, NA, …
$ value_cafe                   <fct> NA, NA, NA, NA, …
$ spent_equipment              <fct> NA, NA, NA, NA, …
$ value_equipment              <fct> NA, NA, NA, NA, …
$ gender                       <fct> NA, NA, NA, NA, …
$ education_level              <fct> NA, NA, NA, NA, …
$ ethnicity_race               <fct> NA, NA, NA, NA, …
$ employment_status            <fct> NA, NA, NA, NA, …
$ number_children              <fct> NA, NA, NA, NA, …
$ political_affiliation        <fct> NA, NA, NA, NA, …

📝 Structuring the outcome variable

Ordinal scale from 1 to 5:

\[ \begin{aligned} 1 &= \text{Dislike}\\ 5 &= \text{Love} \end{aligned} \]

How should we represent the outcome variable for modeling?

  • Keep as is
  • Collapse to binary (e.g. Like vs. Does not like)
  • Something else?
05:00

Final data cleaning

coffee_survey |>
  select(
    -submission_id,
    -ends_with("other"),
    -ends_with("specify"),
    -ends_with("notes"),
    -coffee_d_bitterness,
    -coffee_d_acidity,
    -prefer_ad,
    -prefer_overall
  ) |>
  # convert categorical variables with single answer to factors
  mutate(
    # convert all initially with factor()
    across(
      .cols = c(
        age,
        cups,
        favorite,
        style,
        strength,
        roast_level,
        caffeine,
        wfh,
        total_spend,
        taste:political_affiliation
      ),
      .fns = factor
    ),
    # adjust order of levels for ordinal variables
    age = fct_relevel(
      .f = age,
      "<18 years old",
      "18-24 years old",
      "25-34 years old",
      "35-44 years old",
      "45-54 years old",
      "55-64 years old",
      ">65 years old"
    ),
    cups = fct_relevel(
      .f = cups,
      "Less than 1",
      "1",
      "2",
      "3",
      "4",
      "More than 4"
    ),
    strength = fct_relevel(
      .f = strength,
      "Weak",
      "Somewhat light",
      "Medium",
      "Somewhat strong",
      "Very strong"
    ),
    caffeine = fct_relevel(
      .f = caffeine,
      "Decaf",
      "Half caff",
      "Full caffeine"
    ),
    wfh = fct_relevel(
      .f = wfh,
      "I primarily work from home",
      "I do a mix of both",
      "I primarily work in person"
    ),
    total_spend = fct_relevel(
      .f = total_spend,
      "<$20"
    ) |>
      fct_relevel(
        ">$100",
        after = 5L
      ),
    across(
      .cols = c(most_paid, most_willing),
      .fns = \(x) {
        fct_relevel(
          .f = x,
          "Less than $2",
          "$2-$4",
          "$4-$6",
          "$6-$8",
          "$8-$10",
          "$10-$15",
          "$15-$20",
          "More than $20"
        )
      }
    ),
    spent_equipment = fct_relevel(
      .f = spent_equipment,
      "Less than $20",
      "$20-$50",
      "$50-$100",
      "$100-$300",
      "$300-$500",
      "$500-$1000",
      "More than $1,000"
    ),
    education_level = fct_relevel(
      .f = education_level,
      "Less than high school",
      "High school graduate",
      "Some college or associate's degree",
      "Bachelor's degree",
      "Master's degree",
      "Doctorate or professional degree"
    ),
    number_children = fct_relevel(
      .f = number_children,
      "None",
      "1",
      "2",
      "3",
      "More than 3"
    ),
  ) |>
  # fix additions and purchase columns to ensure commas are only used to separate values
  mutate(
    additions = str_replace_all(
      string = additions,
      pattern = "Milk, dairy alternative, or coffee creamer",
      replacement = "Milk dairy alternative or coffee creamer"
    ),
    purchase = str_replace_all(
      string = purchase,
      pattern = "National chain \\(e.g. Starbucks, Dunkin\\)",
      replacement = "National chain \\(e.g. Starbucks Dunkin\\)"
    )
  ) |>
  # convert multi-answer columns to factors
  mutate(
    across(
      .cols = c(
        starts_with("where_drink"),
        starts_with("brew"),
        starts_with("purchase"),
        starts_with("additions"),
        starts_with("dairy"),
        starts_with("sweetener"),
        starts_with("why_drink")
      ),
      .fns = factor
    )
  ) |>
  # drop rows with no coffee_d_personal_preference value
  drop_na(coffee_d_personal_preference) |>
  # convert coffee_d_personal_preference to binary factor
  mutate(
    coffee_d_personal_preference = (coffee_d_personal_preference > 3) |>
      factor(
        levels = c(TRUE, FALSE),
        labels = c("Like", "Does not like")
      )
  ) |>
  # write to disk
  write_feather(sink = "data/coffee_survey_clean.feather") |>
  glimpse()
Rows: 3,764
Columns: 41
$ age                          <fct> 25-34 years old,…
$ cups                         <fct> Less than 1, 3, …
$ where_drink                  <fct> "At a cafe", "At…
$ brew                         <fct> NA, "Pour over, …
$ purchase                     <fct> NA, NA, NA, "Nat…
$ favorite                     <fct> Regular drip cof…
$ additions                    <fct> "No - just black…
$ dairy                        <fct> NA, "Whole milk"…
$ sweetener                    <fct> NA, NA, NA, "Gra…
$ style                        <fct> Light, Bright, F…
$ strength                     <fct> NA, Medium, Some…
$ roast_level                  <fct> NA, Light, Blond…
$ caffeine                     <fct> NA, Full caffein…
$ expertise                    <dbl> NA, 10, 7, 6, 4,…
$ coffee_a_bitterness          <dbl> 4, 1, 3, 3, 4, 3…
$ coffee_a_acidity             <dbl> 4, 1, 3, 3, 4, 3…
$ coffee_a_personal_preference <dbl> 4, 1, 3, 3, 4, 4…
$ coffee_b_bitterness          <dbl> 4, 1, 3, 3, 4, 3…
$ coffee_b_acidity             <dbl> 4, 1, 3, 3, 4, 3…
$ coffee_b_personal_preference <dbl> 4, 1, 3, 3, 4, 5…
$ coffee_c_bitterness          <dbl> 4, 1, 3, 3, NA, …
$ coffee_c_acidity             <dbl> 4, 1, 3, 3, 4, 3…
$ coffee_c_personal_preference <dbl> 4, 1, 3, 3, 4, 3…
$ coffee_d_personal_preference <fct> Like, Does not l…
$ prefer_abc                   <chr> NA, "Coffee A", …
$ wfh                          <fct> NA, I primarily …
$ total_spend                  <fct> NA, >$100, NA, $…
$ why_drink                    <fct> NA, NA, NA, NA, …
$ taste                        <fct> NA, NA, NA, NA, …
$ know_source                  <fct> NA, NA, NA, NA, …
$ most_paid                    <fct> NA, NA, NA, NA, …
$ most_willing                 <fct> NA, NA, NA, NA, …
$ value_cafe                   <fct> NA, NA, NA, NA, …
$ spent_equipment              <fct> NA, NA, NA, NA, …
$ value_equipment              <fct> NA, NA, NA, NA, …
$ gender                       <fct> NA, Other (pleas…
$ education_level              <fct> NA, Bachelor's d…
$ ethnicity_race               <fct> NA, Other (pleas…
$ employment_status            <fct> NA, Employed ful…
$ number_children              <fct> NA, More than 3,…
$ political_affiliation        <fct> NA, Democrat, NA…

Equivalent Python data cleaning

import pandas as pd
import pyarrow.feather as feather
import re

# Import data file
coffee_survey = pd.read_csv("data/coffee_survey.csv")

# Process the data
coffee_survey_clean = (
    coffee_survey
    .pipe(lambda df: df.drop(columns=[
        'submission_id'
    ] + [col for col in df.columns if col.endswith('_other')] +
    [col for col in df.columns if col.endswith('_specify')] +
    [col for col in df.columns if col.endswith('_notes')] +
    ['coffee_d_bitterness', 'coffee_d_acidity'] +
    ['prefer_ad', 'prefer_overall']))
    
    # Fix additions and purchase columns
    .assign(
        additions=lambda df: df['additions'].str.replace(
            'Milk, dairy alternative, or coffee creamer',
            'Milk dairy alternative or coffee creamer',
            regex=False
        ),
        purchase=lambda df: df['purchase'].str.replace(
            r'National chain \(e\.g\. Starbucks, Dunkin\)',
            'National chain (e.g. Starbucks Dunkin)',
            regex=True
        )
    )
)

# Separate multi-answer columns
multi_answer_cols = ['where_drink', 'brew', 'purchase', 'additions', 'dairy', 'sweetener', 'why_drink']
for col in multi_answer_cols:
    if col in coffee_survey_clean.columns:
        coffee_survey_clean[col] = coffee_survey_clean[col].str.split(r",\s*")

# Convert categorical variables to factors (categorical type in pandas)
categorical_cols = [
    'age', 'cups', 'favorite', 'style', 'strength', 'roast_level', 
    'caffeine', 'wfh', 'total_spend', 'taste', 'political_affiliation'
]

# Find columns from taste to political_affiliation
taste_to_political = []
if 'taste' in coffee_survey_clean.columns and 'political_affiliation' in coffee_survey_clean.columns:
    taste_idx = coffee_survey_clean.columns.get_loc('taste')
    political_idx = coffee_survey_clean.columns.get_loc('political_affiliation')
    taste_to_political = coffee_survey_clean.columns[taste_idx:political_idx+1].tolist()

categorical_cols.extend(taste_to_political)

# Remove duplicates and filter existing columns
categorical_cols = list(set([col for col in categorical_cols if col in coffee_survey_clean.columns]))

# Convert to categorical
for col in categorical_cols:
    coffee_survey_clean[col] = coffee_survey_clean[col].astype('category')

# Set ordered categories for ordinal variables
if 'age' in coffee_survey_clean.columns:
    age_order = ["<18 years old", "18-24 years old", "25-34 years old", 
                 "35-44 years old", "45-54 years old", "55-64 years old", ">65 years old"]
    coffee_survey_clean['age'] = pd.Categorical(coffee_survey_clean['age'], categories=age_order, ordered=True)

if 'cups' in coffee_survey_clean.columns:
    cups_order = ["Less than 1", "1", "2", "3", "4", "More than 4"]
    coffee_survey_clean['cups'] = pd.Categorical(coffee_survey_clean['cups'], categories=cups_order, ordered=True)

if 'strength' in coffee_survey_clean.columns:
    strength_order = ["Weak", "Somewhat light", "Medium", "Somewhat strong", "Very strong"]
    coffee_survey_clean['strength'] = pd.Categorical(coffee_survey_clean['strength'], categories=strength_order, ordered=True)

if 'caffeine' in coffee_survey_clean.columns:
    caffeine_order = ["Decaf", "Half caff", "Full caffeine"]
    coffee_survey_clean['caffeine'] = pd.Categorical(coffee_survey_clean['caffeine'], categories=caffeine_order, ordered=True)

if 'wfh' in coffee_survey_clean.columns:
    wfh_order = ["I primarily work from home", "I do a mix of both", "I primarily work in person"]
    coffee_survey_clean['wfh'] = pd.Categorical(coffee_survey_clean['wfh'], categories=wfh_order, ordered=True)

if 'total_spend' in coffee_survey_clean.columns:
    total_spend_order = ["<$20", "$20-$40", "$40-$60", "$60-$80", "$80-$100", ">$100"]
    coffee_survey_clean['total_spend'] = pd.Categorical(coffee_survey_clean['total_spend'], categories=total_spend_order, ordered=True)

# Handle most_paid and most_willing columns
price_order = ["Less than $2", "$2-$4", "$4-$6", "$6-$8", "$8-$10", "$10-$15", "$15-$20", "More than $20"]
for col in ['most_paid', 'most_willing']:
    if col in coffee_survey_clean.columns:
        coffee_survey_clean[col] = pd.Categorical(coffee_survey_clean[col], categories=price_order, ordered=True)

if 'spent_equipment' in coffee_survey_clean.columns:
    equipment_order = ["Less than $20", "$20-$50", "$50-$100", "$100-$300", "$300-$500", "$500-$1000", "More than $1,000"]
    coffee_survey_clean['spent_equipment'] = pd.Categorical(coffee_survey_clean['spent_equipment'], categories=equipment_order, ordered=True)

if 'education_level' in coffee_survey_clean.columns:
    education_order = ["Less than high school", "High school graduate", "Some college or associate's degree", 
                      "Bachelor's degree", "Master's degree", "Doctorate or professional degree"]
    coffee_survey_clean['education_level'] = pd.Categorical(coffee_survey_clean['education_level'], categories=education_order, ordered=True)

if 'number_children' in coffee_survey_clean.columns:
    children_order = ["None", "1", "2", "3", "More than 3"]
    coffee_survey_clean['number_children'] = pd.Categorical(coffee_survey_clean['number_children'], categories=children_order, ordered=True)

# Drop rows with missing coffee_d_personal_preference values
coffee_survey_clean = coffee_survey_clean.dropna(subset=['coffee_d_personal_preference'])

# Transform coffee_d_personal_preference
coffee_survey_clean['coffee_d_personal_preference'] = (
    (coffee_survey_clean['coffee_d_personal_preference'] > 3)
    .map({True: 'Like', False: 'Does not like'})
    .astype('category')
)

# Save to feather format
feather.write_feather(coffee_survey_clean, "data/coffee_survey_clean_py.feather")

Wrap-up

Recap

  • Data preparation occurs before the modeling process
  • Clean and standardize your data set to make it ready for modeling
  • Use standalone scripts to perform data preparation reproducibly