AE 08: Explore coffee taste tests

Application exercise

Modified: September 26, 2024

The Great American Coffee Taste Test

In October 2023, James Hoffmann and the coffee company Cometeer held the “Great American Coffee Taste Test” on YouTube, during which viewers were asked to fill out a survey about four coffees they had ordered from Cometeer for the tasting. Tidy Tuesday published the data set we are using.

library(tidyverse)
coffee_survey <- read_csv(file = "data/coffee_survey.csv")

It includes the following features:

variable class description
submission_id character Submission ID
age character What is your age?
cups character How many cups of coffee do you typically drink per day?
where_drink character Where do you typically drink coffee?
brew character How do you brew coffee at home?
brew_other character How else do you brew coffee at home?
purchase character On the go, where do you typically purchase coffee?
purchase_other character Where else do you purchase coffee?
favorite character What is your favorite coffee drink?
favorite_specify character Please specify what your favorite coffee drink is
additions character Do you usually add anything to your coffee?
additions_other character What else do you add to your coffee?
dairy character What kind of dairy do you add?
sweetener character What kind of sugar or sweetener do you add?
style character Before today’s tasting, which of the following best described what kind of coffee you like?
strength character How strong do you like your coffee?
roast_level character What roast level of coffee do you prefer?
caffeine character How much caffeine do you like in your coffee?
expertise numeric Lastly, how would you rate your own coffee expertise?
coffee_a_bitterness numeric Coffee A - Bitterness
coffee_a_acidity numeric Coffee A - Acidity
coffee_a_personal_preference numeric Coffee A - Personal Preference
coffee_a_notes character Coffee A - Notes
coffee_b_bitterness numeric Coffee B - Bitterness
coffee_b_acidity numeric Coffee B - Acidity
coffee_b_personal_preference numeric Coffee B - Personal Preference
coffee_b_notes character Coffee B - Notes
coffee_c_bitterness numeric Coffee C - Bitterness
coffee_c_acidity numeric Coffee C - Acidity
coffee_c_personal_preference numeric Coffee C - Personal Preference
coffee_c_notes character Coffee C - Notes
coffee_d_bitterness numeric Coffee D - Bitterness
coffee_d_acidity numeric Coffee D - Acidity
coffee_d_personal_preference numeric Coffee D - Personal Preference
coffee_d_notes character Coffee D - Notes
prefer_abc character Between Coffee A, Coffee B, and Coffee C which did you prefer?
prefer_ad character Between Coffee A and Coffee D, which did you prefer?
prefer_overall character Lastly, what was your favorite overall coffee?
wfh character Do you work from home or in person?
total_spend character In total, how much money do you typically spend on coffee in a month?
why_drink character Why do you drink coffee?
why_drink_other character Other reason for drinking coffee
taste character Do you like the taste of coffee?
know_source character Do you know where your coffee comes from?
most_paid character What is the most you’ve ever paid for a cup of coffee?
most_willing character What is the most you’d ever be willing to pay for a cup of coffee?
value_cafe character Do you feel like you’re getting good value for your money when you buy coffee at a cafe?
spent_equipment character Approximately how much have you spent on coffee equipment in the past 5 years?
value_equipment character Do you feel like you’re getting good value for your money with regards to your coffee equipment?
gender character Gender
gender_specify character Gender (please specify)
education_level character Education Level
ethnicity_race character Ethnicity/Race
ethnicity_race_specify character Ethnicity/Race (please specify)
employment_status character Employment Status
number_children character Number of Children
political_affiliation character Political Affiliation

Our ultimate goal on the next homework assignment is to predict which coffee a respondent will prefer based on their survey responses. We will use the prefer_overall variable as our target.

A quick {skimr} of the data:

library(skimr)
skim(coffee_survey)
Data summary
Name coffee_survey
Number of rows 4042
Number of columns 57
_______________________
Column type frequency:
character 44
numeric 13
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
submission_id 0 1.00 6 6 0 4042 0
age 31 0.99 13 15 0 7 0
cups 93 0.98 1 11 0 6 0
where_drink 70 0.98 7 44 0 65 0
brew 385 0.90 5 165 0 449 0
brew_other 3364 0.17 2 319 0 160 0
purchase 3332 0.18 5 107 0 89 0
purchase_other 4011 0.01 4 83 0 26 0
favorite 62 0.98 5 32 0 12 0
favorite_specify 3926 0.03 3 92 0 78 0
additions 83 0.98 5 100 0 53 0
additions_other 3994 0.01 3 140 0 42 0
dairy 2356 0.42 8 110 0 175 0
sweetener 3530 0.13 5 99 0 82 0
style 84 0.98 4 11 0 12 0
strength 126 0.97 4 15 0 5 0
roast_level 102 0.97 4 7 0 7 0
caffeine 125 0.97 5 13 0 3 0
coffee_a_notes 1464 0.64 1 377 0 2317 0
coffee_b_notes 1586 0.61 1 980 0 2199 0
coffee_c_notes 1659 0.59 1 438 0 2163 0
coffee_d_notes 1454 0.64 1 528 0 2354 0
prefer_abc 270 0.93 8 8 0 3 0
prefer_ad 281 0.93 8 8 0 2 0
prefer_overall 272 0.93 8 8 0 4 0
wfh 518 0.87 18 26 0 3 0
total_spend 531 0.87 4 8 0 6 0
why_drink 474 0.88 5 93 0 84 0
why_drink_other 3875 0.04 2 195 0 163 0
taste 479 0.88 2 3 0 2 0
know_source 483 0.88 2 3 0 2 0
most_paid 515 0.87 5 13 0 8 0
most_willing 532 0.87 5 13 0 8 0
value_cafe 542 0.87 2 3 0 2 0
spent_equipment 536 0.87 7 16 0 7 0
value_equipment 548 0.86 2 3 0 2 0
gender 519 0.87 4 22 0 5 0
gender_specify 4030 0.00 2 28 0 11 0
education_level 604 0.85 15 34 0 6 0
ethnicity_race 624 0.85 15 29 0 6 0
ethnicity_race_specify 3937 0.03 2 53 0 82 0
employment_status 623 0.85 7 18 0 6 0
number_children 636 0.84 1 11 0 5 0
political_affiliation 753 0.81 8 14 0 4 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
expertise 104 0.97 5.69 1.95 1 5 6 7 10 ▂▃▇▇▁
coffee_a_bitterness 244 0.94 2.14 0.95 1 1 2 3 5 ▅▇▃▂▁
coffee_a_acidity 263 0.93 3.63 0.98 1 3 4 4 5 ▁▂▅▇▃
coffee_a_personal_preference 253 0.94 3.31 1.19 1 2 3 4 5 ▂▅▆▇▅
coffee_b_bitterness 262 0.94 3.01 0.99 1 2 3 4 5 ▂▅▇▆▁
coffee_b_acidity 275 0.93 2.22 0.87 1 2 2 3 5 ▃▇▅▁▁
coffee_b_personal_preference 269 0.93 3.07 1.11 1 2 3 4 5 ▂▆▇▆▃
coffee_c_bitterness 278 0.93 3.07 1.00 1 2 3 4 5 ▁▅▇▆▂
coffee_c_acidity 291 0.93 2.37 0.92 1 2 2 3 5 ▃▇▆▂▁
coffee_c_personal_preference 276 0.93 3.06 1.13 1 2 3 4 5 ▂▆▇▆▃
coffee_d_bitterness 275 0.93 2.16 1.08 1 1 2 3 5 ▇▇▅▂▁
coffee_d_acidity 277 0.93 3.86 1.01 1 3 4 5 5 ▁▂▃▇▆
coffee_d_personal_preference 278 0.93 3.38 1.45 1 2 4 5 5 ▅▃▃▆▇

Examining continuous variables

Your turn: Examine expertise using a histogram and appropriate binwidth. Describe the features of this variable.

# add code here
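One possible starting point, as a sketch (assumes the tidyverse is loaded as above; a binwidth of 1 matches the 1-10 expertise scale):

# one bin per point on the 1-10 expertise scale
ggplot(coffee_survey, aes(x = expertise)) +
  geom_histogram(binwidth = 1) +
  labs(x = "Self-rated coffee expertise", y = "Number of respondents")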

Add response here.

Your turn: Each coffee has three numeric ratings by the respondents: bitterness, acidity, and personal preference. Create a histogram for each of these characteristics, faceted by coffee type. What do you notice?

Wrangling the data for easier visualization

The original structure of the data is one column for each coffee for each characteristic. You could create separate graphs for each of the 12 columns, but that seems like a lot of work. Instead, consider using the pivot_longer() function to restructure the data to one row per coffee per characteristic. This will make it easier to create the faceted histograms.

coffee_survey |>
  select(starts_with("coffee"), -ends_with("notes")) |>
  pivot_longer(
    cols = everything(),
    names_to = c("coffee", "measure"),
    names_prefix = "coffee_",
    names_sep = "_",
    values_to = "value"
  )
# A tibble: 48,504 × 3
   coffee measure    value
   <chr>  <chr>      <dbl>
 1 a      bitterness    NA
 2 a      acidity       NA
 3 a      personal      NA
 4 b      bitterness    NA
 5 b      acidity       NA
 6 b      personal      NA
 7 c      bitterness    NA
 8 c      acidity       NA
 9 c      personal      NA
10 d      bitterness    NA
# ℹ 48,494 more rows
# add code here
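A sketch of one way to finish this off, piping the restructured data into a faceted histogram (the "personal" label is just "personal_preference" truncated by names_sep; rows with missing ratings are dropped):

coffee_survey |>
  select(starts_with("coffee"), -ends_with("notes")) |>
  pivot_longer(
    cols = everything(),
    names_to = c("coffee", "measure"),
    names_prefix = "coffee_",
    names_sep = "_",
    values_to = "value"
  ) |>
  drop_na(value) |>
  ggplot(aes(x = value)) +
  geom_histogram(binwidth = 1) +
  facet_grid(coffee ~ measure)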

Add response here.

Examining categorical variables

Your turn: Examine prefer_overall graphically. Record your observations.

# add code here
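One possible approach, sketched with a simple bar chart (rows missing the outcome are dropped first):

coffee_survey |>
  drop_na(prefer_overall) |>
  ggplot(aes(x = prefer_overall)) +
  geom_bar() +
  labs(x = "Favorite overall coffee", y = "Number of respondents")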

Add response here.

Your turn: Examine cups, brew, and roast_level. Record your observations, in particular how you might need to handle these variables in the modeling stage.

# add code here
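A sketch for cups; the same geom_bar() pattern works for roast_level, while brew needs extra handling (see the sketch after the next chunk):

# frequency of each response to "how many cups per day"
coffee_survey |>
  drop_na(cups) |>
  ggplot(aes(x = cups)) +
  geom_bar()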

Add response here.

# add code here
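Because brew allows multiple selections, the raw values are combinations of methods. A hedged sketch that splits the selections before counting, assuming they are comma-separated (worth verifying against the raw data):

coffee_survey |>
  drop_na(brew) |>
  separate_longer_delim(brew, delim = ", ") |> # assumes ", " separates selections
  count(brew, sort = TRUE)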

Add response here.

# add code here

Add response here.

Making comparisons

Your turn: Examine the relationship between overall preference and the respondents’ preferred roast level. Use a proportional bar chart to visualize the relationship.

Tip

Use position = "fill" with geom_bar() to automatically convert counts to proportions for the chart.

# add code here
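A minimal sketch following the tip (rows missing either variable are dropped first):

coffee_survey |>
  drop_na(roast_level, prefer_overall) |>
  ggplot(aes(x = roast_level, fill = prefer_overall)) +
  geom_bar(position = "fill") +
  labs(
    x = "Preferred roast level",
    y = "Proportion of respondents",
    fill = "Favorite overall coffee"
  )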

Add response here.

Your turn: Examine the relationship between the respondents’ numeric ratings for acidity, bitterness, and personal preference for each of the four coffees, and compare these ratings to their overall preference. Record your observations.
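One possible sketch: restructure the ratings as before, then draw boxplots of each measure by coffee, filled by the respondent’s overall favorite:

coffee_survey |>
  select(prefer_overall, starts_with("coffee"), -ends_with("notes")) |>
  drop_na(prefer_overall) |>
  pivot_longer(
    cols = -prefer_overall,
    names_to = c("coffee", "measure"),
    names_prefix = "coffee_",
    names_sep = "_",
    values_to = "rating"
  ) |>
  drop_na(rating) |>
  ggplot(aes(x = coffee, y = rating, fill = prefer_overall)) +
  geom_boxplot() +
  facet_wrap(vars(measure))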

Add response here.

Data quality

Missingness

Demonstration: Use {visdat} to visualize missingness patterns in the data set.

library(visdat)

# quick glance of missingness by row/column order
vis_dat(coffee_survey)

# reorder columns based on % missing
vis_miss(coffee_survey, sort_miss = TRUE)

# cluster rows based on similarity in missingness patterns
vis_miss(coffee_survey, cluster = TRUE)

Your turn: Record your observations on the missingness patterns in the data set. What variables have high missingness? Is this surprising? What might you do to variables or observations with high degrees of missingness?

Add response here.

Outliers

Demonstration: Generate a scatterplot matrix for all the numeric variables in the data set. (Not particularly helpful for this dataset, but a good practice to get into.)

library(GGally)

coffee_survey |>
  select(where(is.numeric)) |>
  ggpairs()

Your turn: Examine the distribution of roast_level by gender and roast_level by cups. Describe the patterns you see and anything that is of particular interest given the model we will estimate.

# add code here
# add code here
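A sketch for the first comparison, using proportional bars of roast preference within each gender; the same pattern works for roast_level by cups:

coffee_survey |>
  drop_na(roast_level, gender) |>
  ggplot(aes(x = gender, fill = roast_level)) +
  geom_bar(position = "fill") +
  labs(x = "Gender", y = "Proportion of respondents", fill = "Preferred roast level")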

Add response here.