HW 02 - Predict coffee preferences

Homework

Modified

October 2, 2024

Important

This homework is due October 9 at 11:59pm ET.

Learning objectives

Implement data cleaning and preparation
Define data preprocessing and feature engineering steps for predictive models
Estimate a range of model types
Creatively explore alternative model workflows
Utilize tuning parameters to increase model performance
Evaluate model performance

Getting started

Go to the info4940-fa24 organization on GitHub. Click on the repo with the prefix hw-02. It contains the starter documents you need to complete the homework.
Clone the repo and start a new project in RStudio.

Packages

library(tidyverse)
library(tidymodels)
library(scales)
library(themis)
library(glmnet)
library(earth)
library(discrim)
library(mda)
library(bonsai)
library(xgboost)
library(probably)

# preferred theme
theme_set(theme_minimal(base_size = 12, base_family = "Atkinson Hyperlegible"))

Guidelines + tips

Set your random seed to ensure reproducible results.
Use caching to speed up the rendering process.
Use parallel processing to speed up rendering time. Note that this works differently on different systems and operating systems, and it also makes it harder to debug code and track progress in model fitting. Use at your own discretion.
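The first and third tips above can be sketched roughly as follows. The seed value and the worker count are arbitrary placeholders, and the parallel portion assumes a recent version of {tune} that supports the {future} framework (older versions rely on foreach/doParallel instead).

```r
library(future)

# Hypothetical seed; any fixed integer gives reproducible draws
set.seed(4940)
draw_a <- sample(1:10)

set.seed(4940)
draw_b <- sample(1:10)

# Same seed, same "random" result
same_draws <- identical(draw_a, draw_b)

# Optional: run resampling in parallel with 2 background R sessions
# (2 is an arbitrary choice; pick a value that suits your machine)
plan(multisession, workers = 2)

# ... tuning / resampling code would go here ...

# Return to sequential processing when done (easier to debug)
plan(sequential)
```

Setting the seed immediately before each step that involves randomness (splitting, resampling, tuning) tends to make results easier to reproduce chunk by chunk.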

Presenting results of multiple models

For the love of all that is pure in this world, please consider how to present the results of your modeling efforts. Do not just rely on raw output from R to tell us what we need to know.

Your plots should include an informative title, axes should be labeled, and careful consideration should be given to aesthetic choices.
Consider condensing information into one or a handful of custom graphs.
You can create simple formatted tables using {gt}.
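As one possible illustration of a formatted table, here is a minimal {gt} sketch. The numbers in `model_summary` are placeholders invented for the example, not results from any model; in practice the table would be built from something like the output of `collect_metrics()`.

```r
library(tibble)
library(gt)

# Placeholder values for illustration only; these are NOT real results
model_summary <- tribble(
  ~model,                 ~accuracy, ~brier_class, ~roc_auc,
  "Null model",           0.25,      0.75,         0.50,
  "Penalized regression", 0.30,      0.70,         0.55
)

model_table <- model_summary |>
  gt() |>
  tab_header(title = "Cross-validated performance by model") |>
  cols_label(
    model = "Model",
    accuracy = "Accuracy",
    brier_class = "Brier score",
    roc_auc = "ROC AUC"
  ) |>
  # Round all metric columns to three decimal places
  fmt_number(columns = c(accuracy, brier_class, roc_auc), decimals = 3)

model_table
```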

Remember that developing a sound workflow for reproducible data analysis is important as you complete this homework and other assignments in this course. There will be periodic reminders in this assignment to render, commit, and push your changes to GitHub. You should have at least 3 commits with meaningful commit messages by the end of the assignment.

Workflow + formatting

Make sure to

Update author name on your document.
Label all code chunks informatively and concisely.
Follow the Tidyverse code style guidelines.
Make at least 3 commits.
Resize figures where needed, avoid tiny or huge plots.
Turn in an organized, well formatted document.

The Great American Coffee Tasting

In October 2023, James Hoffmann and coffee company Cometeer held the “Great American Coffee Taste Test” on YouTube, during which viewers were asked to fill out a survey about 4 coffees they ordered from Cometeer for the tasting. Tidy Tuesday published the data set we are using.

coffee_survey <- read_csv(file = "data/coffee_survey.csv")

Rows: 4042 Columns: 57
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (44): submission_id, age, cups, where_drink, brew, brew_other, purchase,...
dbl (13): expertise, coffee_a_bitterness, coffee_a_acidity, coffee_a_persona...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

It includes the following features:

variable	class	description
submission_id	character	Submission ID
age	character	What is your age?
cups	character	How many cups of coffee do you typically drink per day?
where_drink	character	Where do you typically drink coffee?
brew	character	How do you brew coffee at home?
brew_other	character	How else do you brew coffee at home?
purchase	character	On the go, where do you typically purchase coffee?
purchase_other	character	Where else do you purchase coffee?
favorite	character	What is your favorite coffee drink?
favorite_specify	character	Please specify what your favorite coffee drink is
additions	character	Do you usually add anything to your coffee?
additions_other	character	What else do you add to your coffee?
dairy	character	What kind of dairy do you add?
sweetener	character	What kind of sugar or sweetener do you add?
style	character	Before today’s tasting, which of the following best described what kind of coffee you like?
strength	character	How strong do you like your coffee?
roast_level	character	What roast level of coffee do you prefer?
caffeine	character	How much caffeine do you like in your coffee?
expertise	numeric	Lastly, how would you rate your own coffee expertise?
coffee_a_bitterness	numeric	Coffee A - Bitterness
coffee_a_acidity	numeric	Coffee A - Acidity
coffee_a_personal_preference	numeric	Coffee A - Personal Preference
coffee_a_notes	character	Coffee A - Notes
coffee_b_bitterness	numeric	Coffee B - Bitterness
coffee_b_acidity	numeric	Coffee B - Acidity
coffee_b_personal_preference	numeric	Coffee B - Personal Preference
coffee_b_notes	character	Coffee B - Notes
coffee_c_bitterness	numeric	Coffee C - Bitterness
coffee_c_acidity	numeric	Coffee C - Acidity
coffee_c_personal_preference	numeric	Coffee C - Personal Preference
coffee_c_notes	character	Coffee C - Notes
coffee_d_bitterness	numeric	Coffee D - Bitterness
coffee_d_acidity	numeric	Coffee D - Acidity
coffee_d_personal_preference	numeric	Coffee D - Personal Preference
coffee_d_notes	character	Coffee D - Notes
prefer_abc	character	Between Coffee A, Coffee B, and Coffee C which did you prefer?
prefer_ad	character	Between Coffee A and Coffee D, which did you prefer?
prefer_overall	character	Lastly, what was your favorite overall coffee?
wfh	character	Do you work from home or in person?
total_spend	character	In total, how much money do you typically spend on coffee in a month?
why_drink	character	Why do you drink coffee?
why_drink_other	character	Other reason for drinking coffee
taste	character	Do you like the taste of coffee?
know_source	character	Do you know where your coffee comes from?
most_paid	character	What is the most you’ve ever paid for a cup of coffee?
most_willing	character	What is the most you’d ever be willing to pay for a cup of coffee?
value_cafe	character	Do you feel like you’re getting good value for your money when you buy coffee at a cafe?
spent_equipment	character	Approximately how much have you spent on coffee equipment in the past 5 years?
value_equipment	character	Do you feel like you’ve gotten good value for your money with regards to your coffee equipment?
gender	character	Gender
gender_specify	character	Gender (please specify)
education_level	character	Education Level
ethnicity_race	character	Ethnicity/Race
ethnicity_race_specify	character	Ethnicity/Race (please specify)
employment_status	character	Employment Status
number_children	character	Number of Children
political_affiliation	character	Political Affiliation

You have been hired by Cometeer to build a model predicting potential customers’ preferred coffee. Our client wants to host a survey on their website that potential customers can complete to get a recommendation for which of the four coffees they would like best.

Exercise 1

Import the data set and prepare it for modeling. Implement the data preparation we performed in the last application exercise. Specifically,

Keep only columns that would plausibly be available without users conducting the taste test and that provide useful, relevant information for modeling.
Convert categorical variables to factor columns. For single response survey questions, convert the columns to factors and order the levels relevantly. For multi-response survey questions, split the variables to one column-per-response and convert them to factors.
Drop observations with missing values for the outcome of interest, prefer_overall.
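A rough sketch of these preparation steps on a toy data frame is shown below. The rows in `toy_survey` are made up to mimic the shape of `coffee_survey`, the `age_levels` ordering is an assumption, and only a handful of columns are shown. Multi-response columns such as `brew` are left as delimited strings here on the assumption that they will be split later.

```r
library(tidyverse)

# Made-up rows that mimic the shape of coffee_survey; not real responses
toy_survey <- tibble(
  submission_id = c("id1", "id2", "id3", "id4"),
  age = c("25-34 years old", "18-24 years old",
          "35-44 years old", "25-34 years old"),
  brew = c("Pour over, Espresso", "Espresso", NA, "French press"),
  coffee_a_bitterness = c(2, 4, 3, 1),
  prefer_overall = c("Coffee A", NA, "Coffee D", "Coffee B")
)

# Assumed ordering of the age brackets, youngest to oldest
age_levels <- c("18-24 years old", "25-34 years old", "35-44 years old")

coffee_clean <- toy_survey |>
  # Drop the ID and the tasting scores, which a website visitor would not have
  select(-submission_id, -starts_with("coffee_")) |>
  # Remove respondents with no recorded overall preference
  drop_na(prefer_overall) |>
  mutate(
    # Single-response question: factor with levels in a meaningful order
    age = factor(age, levels = age_levels),
    prefer_overall = factor(prefer_overall)
  )

coffee_clean
```

Which columns count as "plausibly available" is a judgment call you should justify; the `select()` call above is only one possible reading.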

Exercise 2

Partition the data set. Split the data into training/test sets, with 75% allocated for training. Further partition the training set using an appropriate resampling method. You will use this resampled set for all model training and evaluation. Provide a brief explanation for why you chose this specific technique.
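One way the partitioning could look is sketched below, using simulated data in place of the cleaned survey. Stratified 10-fold cross-validation is used purely as an example of a resampling method; it is an assumption for illustration, and you should choose and justify your own.

```r
library(rsample)

set.seed(4940)

# Simulated stand-in for the cleaned survey data (values are random)
toy <- data.frame(
  prefer_overall = factor(
    sample(paste("Coffee", LETTERS[1:4]), size = 400, replace = TRUE)
  ),
  expertise = sample(1:10, size = 400, replace = TRUE)
)

# 75% training / 25% test, stratified on the outcome
coffee_split <- initial_split(toy, prop = 0.75, strata = prefer_overall)
coffee_train <- training(coffee_split)
coffee_test <- testing(coffee_split)

# Resample the training set only; the test set stays untouched
coffee_folds <- vfold_cv(coffee_train, v = 10, strata = prefer_overall)

coffee_folds
```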

Exercise 3

Estimate a null model. Predict a respondent’s overall coffee preference (coffee A, B, C, or D). Calculate the accuracy, Brier score, and ROC AUC of the model (as well as any other metrics you deem relevant). Interpret the results.
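A minimal sketch of a null model evaluated over resamples follows. It again uses simulated data, so the object names (`toy`, `toy_folds`) are placeholders rather than part of the starter code.

```r
library(tidymodels)

set.seed(4940)

# Simulated stand-in for the training data (values are random)
toy <- tibble(
  prefer_overall = factor(
    sample(paste("Coffee", LETTERS[1:4]), size = 200, replace = TRUE)
  ),
  expertise = sample(1:10, size = 200, replace = TRUE)
)
toy_folds <- vfold_cv(toy, v = 5, strata = prefer_overall)

# The null model ignores predictors and predicts from class frequencies
null_spec <- null_model() |>
  set_engine("parsnip") |>
  set_mode("classification")

null_wf <- workflow() |>
  add_formula(prefer_overall ~ .) |>
  add_model(null_spec)

# Metrics requested in the exercise
coffee_metrics <- metric_set(accuracy, brier_class, roc_auc)

null_res <- fit_resamples(
  null_wf,
  resamples = toy_folds,
  metrics = coffee_metrics
)

null_metrics <- collect_metrics(null_res)
null_metrics
```

The null model's metrics give a baseline that any later model should beat.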

Exercise 4

Evaluate a penalized regression model.

Penalized regression for multinomial outcomes.

Since the outcome of interest contains more than 2 values, use multinom_reg() to specify a penalized regression model. It’s the multinomial counterpart to linear_reg() and logistic_reg().

At minimum, the feature engineering recipe should:

Convert any multi-response categorical variables to one column per response.¹
Do everything recommended/required for a penalized regression model.²

¹ {recipes} has a step function to help you do this. Use the documentation.

² Please remember that penalized regression always requires features to be normalized.

The model should be tuned, at minimum, over the penalty and mixture tuning parameters using the {glmnet} engine.

Beyond that, it is up to you to decide what variables to use as predictors, additional feature engineering steps, tuning parameters, tuning method, etc. You should document these decisions and explain why you made them. Report the results of at least 3 distinct workflows for a penalized regression model. Be sure to include the accuracy, Brier score, and ROC AUCs of the models (as well as any other metrics you deem relevant), and generate a calibration plot for the best performing model.
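To make the minimum requirements more concrete, here is one hedged sketch of a recipe, model specification, and tuning grid. The toy rows, the `", "` separator for multi-response answers, and the grid sizes are all assumptions made for illustration; this is not the expected solution.

```r
library(tidymodels)

# Made-up training rows; brew mimics a multi-response survey column
toy_train <- tibble(
  prefer_overall = factor(c("Coffee A", "Coffee B", "Coffee C", "Coffee D")),
  brew = c("Pour over, Espresso", "Espresso", "French press", "Pour over"),
  roast_level = c("Light", "Medium", "Dark", "Light"),
  expertise = c(3, 7, 5, 9)
)

glmnet_rec <- recipe(prefer_overall ~ ., data = toy_train) |>
  # One indicator column per response in the multi-response variable
  step_dummy_extract(brew, sep = ", ") |>
  # Indicator columns for the remaining categorical predictors
  step_dummy(all_nominal_predictors()) |>
  # Drop columns with a single unique value
  step_zv(all_predictors()) |>
  # Put all predictors on the same scale before penalization
  step_normalize(all_numeric_predictors())

glmnet_spec <- multinom_reg(penalty = tune(), mixture = tune()) |>
  set_engine("glmnet")

glmnet_wf <- workflow() |>
  add_recipe(glmnet_rec) |>
  add_model(glmnet_spec)

# 10 penalty values x 5 mixture values (arbitrary sizes)
glmnet_grid <- grid_regular(penalty(), mixture(), levels = c(10, 5))

# Check that the recipe runs on the toy data
baked <- bake(prep(glmnet_rec), new_data = NULL)
```

The workflow and grid would then be passed to a tuning function such as `tune_grid()` along with your resamples and metric set.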

Note

I have established the floor of what you need to do for each exercise. To earn full credit, please think deliberately about your modeling approach and describe/justify it in your submission.

Exercise 5

Evaluate a MARS model.

At minimum, the feature engineering recipe should:

Convert any multi-response categorical variables to one column per response.
Do everything recommended/required for a MARS model.

The model should be tuned, at minimum, over the num_terms and prod_degree tuning parameters.

Note

mars() does not work for classification models with more than 2 classes. Instead, use the alternative implementation discrim_flexible().

Beyond that, it is up to you to decide what variables to use as predictors, additional feature engineering steps, tuning parameters, tuning method, etc. You should document these decisions and explain why you made them. Report the results of at least 3 distinct workflows for a MARS model. Be sure to include the accuracy, Brier score, and ROC AUCs of the models (as well as any other metrics you deem relevant), and generate a calibration plot for the best performing model.
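A possible starting point for the model specification and grid is sketched below. The upper bound of 20 for `num_terms` and the grid sizes are arbitrary assumptions; `num_terms()` has no default upper limit in {dials}, so some range has to be supplied.

```r
library(tidymodels)
library(discrim)

# Flexible discriminant analysis with MARS basis functions
mars_spec <- discrim_flexible(num_terms = tune(), prod_degree = tune()) |>
  set_engine("earth")

# Regular grid: 10 values of num_terms (2 to 20) x 2 values of prod_degree
mars_grid <- grid_regular(
  num_terms(range = c(2L, 20L)),
  prod_degree(),
  levels = c(10, 2)
)

mars_grid
```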

Exercise 6

Evaluate a boosted tree model.

At minimum, the feature engineering recipe should:

Convert any multi-response categorical variables to one column per response.
Do everything recommended/required for a boosted tree model.

The model should be tuned over an appropriate set of parameters.

Beyond that, it is up to you to decide what variables to use as predictors, additional feature engineering steps, tuning parameters, tuning method, etc. You should document these decisions and explain why you made them. Report the results of at least 2 distinct workflows for a boosted tree model. Be sure to include the accuracy, Brier score, and ROC AUCs of the models (as well as any other metrics you deem relevant), and generate a calibration plot for the best performing model.
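As a rough example of "an appropriate set of parameters", the sketch below tunes tree depth, learning rate, and minimum node size with the {xgboost} engine. The choice of parameters, the fixed 500 trees, and the grid size of 20 are assumptions for illustration; the "lightgbm" engine provided through {bonsai} would be specified the same way.

```r
library(tidymodels)

boost_spec <- boost_tree(
  trees = 500,           # fixed here; could also be tuned
  tree_depth = tune(),
  learn_rate = tune(),
  min_n = tune()
) |>
  set_engine("xgboost") |>
  set_mode("classification")

# Pull the tunable parameters (with default ranges) from the specification
boost_params <- extract_parameter_set_dials(boost_spec)

# 20 random candidate combinations (arbitrary size)
set.seed(4940)
boost_grid <- grid_random(boost_params, size = 20)

boost_grid
```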

Exercise 7

Finalize a predictive model. Choose a model from a previous exercise (or try something else) and finalize it. Report the final accuracy, Brier score, and ROC AUC of the model, and generate a calibration plot for the best performing model. Interpret the performance of your model.
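The finalization steps might look something like the sketch below, shown end to end on simulated data with a deliberately simple lasso model. All object names are placeholders, and selecting on ROC AUC is just one option.

```r
library(tidymodels)

set.seed(4940)

# Simulated stand-in for the cleaned survey data (values are random)
toy <- tibble(
  prefer_overall = factor(
    sample(paste("Coffee", LETTERS[1:4]), size = 400, replace = TRUE)
  ),
  expertise = sample(1:10, size = 400, replace = TRUE),
  cups = rnorm(400)
)

coffee_split <- initial_split(toy, prop = 0.75, strata = prefer_overall)
coffee_folds <- vfold_cv(training(coffee_split), v = 5, strata = prefer_overall)

lasso_rec <- recipe(prefer_overall ~ ., data = training(coffee_split)) |>
  step_normalize(all_numeric_predictors())

lasso_wf <- workflow() |>
  add_recipe(lasso_rec) |>
  add_model(multinom_reg(penalty = tune(), mixture = 1) |> set_engine("glmnet"))

coffee_metrics <- metric_set(accuracy, brier_class, roc_auc)

# Tune over 5 candidate penalty values
tune_res <- tune_grid(
  lasso_wf,
  resamples = coffee_folds,
  grid = 5,
  metrics = coffee_metrics
)

# Pick the best candidate, lock it in, then fit once on the full
# training set and evaluate once on the held-out test set
best_params <- select_best(tune_res, metric = "roc_auc")

final_fit <- lasso_wf |>
  finalize_workflow(best_params) |>
  last_fit(split = coffee_split, metrics = coffee_metrics)

final_metrics <- collect_metrics(final_fit)
final_metrics
```

The test set is only used in the `last_fit()` call, which is why it should happen once, after all model selection is complete.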

Exercise 8

Your client is excited to hear what you have done for this project and eager to deploy this model ASAP. What is your recommendation?

Generative AI (GAI) self-reflection

As stated in the syllabus, include a written reflection for this assignment of how you used GAI tools (e.g. what tools you used, how you used them to assist you with writing code), what skills you believe you acquired, and how you believe you demonstrated mastery of the learning objectives.

Render, commit, and push one last time.

Make sure that you commit and push all changed documents and your Git pane is completely empty before proceeding.

Wrap up

Submission

Go to http://www.gradescope.com and click Log in in the top right corner.
Click School Credentials → Cornell University NetID and log in using your NetID credentials.
Click on your INFO 4940/5940 course.
Click on the assignment, and you’ll be prompted to submit it.
Mark all the pages associated with each exercise. All the pages of your homework should be associated with at least one question (i.e., should be “checked”).

Grading

Exercise 1: 4 points
Exercise 2: 4 points
Exercise 3: 2 points
Exercise 4: 10 points
Exercise 5: 10 points
Exercise 6: 10 points
Exercise 7: 5 points
Exercise 8: 5 points
GAI self-reflection: 0 points
Total: 50 points