HW 02 - Predict coffee preferences

Homework
Modified: October 2, 2024

Important

This homework is due October 9 at 11:59pm ET.

Learning objectives

  • Implement data cleaning and preparation
  • Define data preprocessing and feature engineering steps for predictive models
  • Estimate a range of model types
  • Creatively explore alternative model workflows
  • Utilize tuning parameters to increase model performance
  • Evaluate model performance

Getting started

  • Go to the info4940-fa24 organization on GitHub. Click on the repo with the prefix hw-02. It contains the starter documents you need to complete the homework.

  • Clone the repo and start a new project in RStudio.

Packages

Guidelines + tips

  • Set your random seed to ensure reproducible results.
  • Use caching to speed up the rendering process.
  • Use parallel processing to speed up rendering time. Note that this works differently on different systems and operating systems, and it also makes it harder to debug code and track progress in model fitting. Use at your own discretion.
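
As a minimal sketch of the first tip, the chunk below shows that resetting the same seed reproduces the same random draw. The seed value `4940` is an arbitrary choice for illustration, and `parallel::detectCores()` is included only as one way to check how many cores are available before deciding whether parallel processing is worthwhile on your machine.

```r
# Setting the same seed before a random operation makes it repeatable.
set.seed(4940)
first_draw <- sample(1:100, size = 5)

set.seed(4940)
second_draw <- sample(1:100, size = 5)

# TRUE: both draws are identical because the seed was reset.
identical(first_draw, second_draw)

# Optional: check how many cores are available (may be NA on some systems).
n_cores <- parallel::detectCores()
```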

Presenting results of multiple models

For the love of all that is pure in this world, please consider how to present the results of your modeling efforts. Do not just rely on raw output from R to tell us what we need to know.

  • Your plots should include an informative title, axes should be labeled, and careful consideration should be given to aesthetic choices.
  • Consider condensing information into one or a handful of custom graphs.
  • You can create simple formatted tables using {gt}.
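
One possible way to build such a table with {gt} is sketched below. The model names and metric values are placeholders invented purely for illustration, not results from this data set; replace them with the metrics you collect from your own workflows.

```r
library(tibble)
library(gt)

# PLACEHOLDER values for illustration only; substitute your own results.
model_metrics <- tribble(
  ~model,    ~accuracy, ~brier_class, ~roc_auc,
  "Model 1", 0.30,      0.37,         0.55,
  "Model 2", 0.35,      0.36,         0.60
)

# Build a formatted table with readable labels and consistent rounding.
metrics_table <- model_metrics |>
  gt() |>
  tab_header(title = "Cross-validated performance by model") |>
  cols_label(
    model = "Model",
    accuracy = "Accuracy",
    brier_class = "Brier score",
    roc_auc = "ROC AUC"
  ) |>
  fmt_number(columns = c(accuracy, brier_class, roc_auc), decimals = 3)

metrics_table
```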

Remember that continuing to develop a sound workflow for reproducible data analysis is important as you complete this homework and other assignments in this course. There will be periodic reminders in this assignment to render, commit, and push your changes to GitHub. You should have at least 3 commits with meaningful commit messages by the end of the assignment.

Workflow + formatting

Make sure to

  • Update author name on your document.
  • Label all code chunks informatively and concisely.
  • Follow the Tidyverse code style guidelines.
  • Make at least 3 commits.
  • Resize figures where needed; avoid tiny or huge plots.
  • Turn in an organized, well-formatted document.

The Great American Coffee Tasting

In October 2023, James Hoffmann and coffee company Cometeer held the “Great American Coffee Taste Test” on YouTube, during which viewers were asked to fill out a survey about 4 coffees they ordered from Cometeer for the tasting. Tidy Tuesday published the data set we are using.

coffee_survey <- read_csv(file = "data/coffee_survey.csv")
Rows: 4042 Columns: 57
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (44): submission_id, age, cups, where_drink, brew, brew_other, purchase,...
dbl (13): expertise, coffee_a_bitterness, coffee_a_acidity, coffee_a_persona...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

It includes the following features:

variable class description
submission_id character Submission ID
age character What is your age?
cups character How many cups of coffee do you typically drink per day?
where_drink character Where do you typically drink coffee?
brew character How do you brew coffee at home?
brew_other character How else do you brew coffee at home?
purchase character On the go, where do you typically purchase coffee?
purchase_other character Where else do you purchase coffee?
favorite character What is your favorite coffee drink?
favorite_specify character Please specify what your favorite coffee drink is
additions character Do you usually add anything to your coffee?
additions_other character What else do you add to your coffee?
dairy character What kind of dairy do you add?
sweetener character What kind of sugar or sweetener do you add?
style character Before today’s tasting, which of the following best described what kind of coffee you like?
strength character How strong do you like your coffee?
roast_level character What roast level of coffee do you prefer?
caffeine character How much caffeine do you like in your coffee?
expertise numeric Lastly, how would you rate your own coffee expertise?
coffee_a_bitterness numeric Coffee A - Bitterness
coffee_a_acidity numeric Coffee A - Acidity
coffee_a_personal_preference numeric Coffee A - Personal Preference
coffee_a_notes character Coffee A - Notes
coffee_b_bitterness numeric Coffee B - Bitterness
coffee_b_acidity numeric Coffee B - Acidity
coffee_b_personal_preference numeric Coffee B - Personal Preference
coffee_b_notes character Coffee B - Notes
coffee_c_bitterness numeric Coffee C - Bitterness
coffee_c_acidity numeric Coffee C - Acidity
coffee_c_personal_preference numeric Coffee C - Personal Preference
coffee_c_notes character Coffee C - Notes
coffee_d_bitterness numeric Coffee D - Bitterness
coffee_d_acidity numeric Coffee D - Acidity
coffee_d_personal_preference numeric Coffee D - Personal Preference
coffee_d_notes character Coffee D - Notes
prefer_abc character Between Coffee A, Coffee B, and Coffee C which did you prefer?
prefer_ad character Between Coffee A and Coffee D, which did you prefer?
prefer_overall character Lastly, what was your favorite overall coffee?
wfh character Do you work from home or in person?
total_spend character In total, how much money do you typically spend on coffee in a month?
why_drink character Why do you drink coffee?
why_drink_other character Other reason for drinking coffee
taste character Do you like the taste of coffee?
know_source character Do you know where your coffee comes from?
most_paid character What is the most you’ve ever paid for a cup of coffee?
most_willing character What is the most you’d ever be willing to pay for a cup of coffee?
value_cafe character Do you feel like you’re getting good value for your money when you buy coffee at a cafe?
spent_equipment character Approximately how much have you spent on coffee equipment in the past 5 years?
value_equipment character Do you feel like you’re getting good value for your money with regards to your coffee equipment?
gender character Gender
gender_specify character Gender (please specify)
education_level character Education Level
ethnicity_race character Ethnicity/Race
ethnicity_race_specify character Ethnicity/Race (please specify)
employment_status character Employment Status
number_children character Number of Children
political_affiliation character Political Affiliation

You have been hired by Cometeer to build a model predicting potential customers’ preferred coffee. Our client wants to host a survey on their website that potential customers can complete to get a recommendation for which of the four coffees they would like best.

Exercise 1

Import the data set and prepare it for modeling. Implement the data preparation we performed in the last application exercise. Specifically,

  • Keep only columns that would plausibly be available without users conducting the taste test and that provide useful, relevant information for modeling.
  • Convert categorical variables to factor columns. For single-response survey questions, convert the columns to factors and order the levels meaningfully. For multi-response survey questions, split each variable into one column per response and convert those columns to factors.
  • Drop observations with missing values for the outcome of interest, prefer_overall.
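
The sketch below illustrates two of these steps on a toy tibble: dropping rows with a missing outcome and converting a single-response column to a factor with explicitly ordered levels. The toy values and level labels are hypothetical stand-ins, so check the actual response options in `coffee_survey` before adapting it. Splitting multi-response columns is intentionally left out.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for the survey data; values and level labels are hypothetical.
toy_survey <- tibble(
  cups = c("1", "2", "Less than 1", "3", NA),
  prefer_overall = c("Coffee A", NA, "Coffee D", "Coffee B", "Coffee A")
)

toy_clean <- toy_survey |>
  # Drop observations with a missing outcome.
  drop_na(prefer_overall) |>
  mutate(
    # Single-response question: set the level order explicitly.
    cups = factor(cups, levels = c("Less than 1", "1", "2", "3")),
    prefer_overall = factor(prefer_overall)
  )

toy_clean
```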

Exercise 2

Partition the data set. Split the data into training/test sets, with 75% allocated for training. Further partition the training set using an appropriate resampling method. You will use this resampled set for all model training and evaluation. Provide a brief explanation for why you chose this specific technique.
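
A minimal sketch of the partitioning mechanics with {rsample}, run on simulated data. The use of 10-fold cross-validation and stratification on the outcome are assumptions made for illustration; they are one option among several, and you still need to choose and justify your own resampling method.

```r
library(rsample)
library(tibble)

set.seed(4940)

# Simulated stand-in data with a four-class outcome.
toy <- tibble(
  x = rnorm(200),
  prefer_overall = factor(
    rep(c("Coffee A", "Coffee B", "Coffee C", "Coffee D"), times = 50)
  )
)

# 75% of rows go to training, stratified on the outcome.
toy_split <- initial_split(toy, prop = 0.75, strata = prefer_overall)
toy_train <- training(toy_split)
toy_test <- testing(toy_split)

# One possible resampling scheme: 10-fold cross-validation on the training set.
toy_folds <- vfold_cv(toy_train, v = 10, strata = prefer_overall)
```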

Exercise 3

Estimate a null model. Predict a respondent’s overall coffee preference (coffee A, B, C, or D). Calculate the accuracy, Brier score, and ROC AUC of the model (as well as any other metrics you deem relevant). Interpret the results.
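
The following sketch shows one way a null model could be estimated across resamples using `null_model()` from {parsnip} and a metric set from {yardstick}. It runs on simulated data with a single hypothetical predictor `x`, so the resulting numbers say nothing about the coffee survey.

```r
library(tidymodels)

set.seed(4940)

# Simulated stand-in data; `x` is a hypothetical predictor.
toy <- tibble(
  x = rnorm(200),
  prefer_overall = factor(
    rep(c("Coffee A", "Coffee B", "Coffee C", "Coffee D"), times = 50)
  )
)
toy_folds <- vfold_cv(toy, v = 5, strata = prefer_overall)

# The null model ignores predictors and uses the training class frequencies.
null_spec <- null_model() |>
  set_engine("parsnip") |>
  set_mode("classification")

null_wf <- workflow() |>
  add_formula(prefer_overall ~ x) |>
  add_model(null_spec)

null_rs <- fit_resamples(
  null_wf,
  resamples = toy_folds,
  metrics = metric_set(accuracy, brier_class, roc_auc)
)

null_metrics <- collect_metrics(null_rs)
null_metrics
```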

Exercise 4

Evaluate a penalized regression model.

Penalized regression for multinomial outcomes.

Since the outcome of interest contains more than 2 values, use multinom_reg() to specify a penalized regression model. It is the multinomial counterpart to linear_reg() and logistic_reg().

At minimum, the feature engineering recipe should:

  • Convert any multi-response categorical variables to one column per response. [1]
  • Do everything recommended/required for a penalized regression model. [2]

[1] {recipes} has a step function to help you do this. Use the documentation.

[2] Please remember that penalized regression always requires features to be normalized.

The model should be tuned, at minimum, over the penalty and mixture tuning parameters using the {glmnet} engine.
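
A minimal sketch of how the model specification and a basic recipe might be combined into a tunable workflow. The toy data and the particular recipe steps are assumptions for illustration; handling of multi-response variables is deliberately omitted, and nothing is fit here.

```r
library(tidymodels)

# Tiny stand-in data; column values are hypothetical.
toy <- tibble(
  expertise = c(1, 5, 7, 3),
  age = factor(c("a", "b", "a", "b")),
  prefer_overall = factor(c("Coffee A", "Coffee B", "Coffee C", "Coffee D"))
)

# Dummy-encode nominal predictors, drop zero-variance columns, then normalize.
glmnet_rec <- recipe(prefer_overall ~ ., data = toy) |>
  step_dummy(all_nominal_predictors()) |>
  step_zv(all_predictors()) |>
  step_normalize(all_numeric_predictors())

# Mark both penalty and mixture for tuning.
glmnet_spec <- multinom_reg(penalty = tune(), mixture = tune()) |>
  set_engine("glmnet")

glmnet_wf <- workflow() |>
  add_recipe(glmnet_rec) |>
  add_model(glmnet_spec)

# Confirm which parameters are flagged for tuning.
tune_params <- extract_parameter_set_dials(glmnet_wf)
tune_params
```

The workflow can then be passed to a tuning function such as `tune_grid()` together with your resamples and metric set.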

Beyond that, it is up to you to decide what variables to use as predictors, additional feature engineering steps, tuning parameters, tuning method, etc. You should document these decisions and explain why you made them. Report the results of at least 3 distinct workflows for a penalized regression model. Be sure to include the accuracy, Brier score, and ROC AUCs of the models (as well as any other metrics you deem relevant), and generate a calibration plot for the best performing model.

Note

I have established the floor of what you need to do for each exercise. To earn full credit, please think deliberately about your modeling approach and describe/justify it in your submission.

Exercise 5

Evaluate a MARS model.

At minimum, the feature engineering recipe should:

  • Convert any multi-response categorical variables to one column per response.
  • Do everything recommended/required for a MARS model.

The model should be tuned, at minimum, over the num_terms and prod_degree tuning parameters.

Note

mars() does not work for classification models with more than 2 classes. Instead, use the alternative implementation discrim_flexible() from the {discrim} package.
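
As a rough sketch, the specification could look like the following. It assumes the {discrim} package is installed; the `"earth"` engine additionally needs the {mda} and {earth} packages when the model is actually fit.

```r
library(tidymodels)
library(discrim) # provides discrim_flexible()

# Flexible discriminant analysis with MARS basis functions,
# with both required tuning parameters flagged for tuning.
mars_spec <- discrim_flexible(
  num_terms = tune(),
  prod_degree = tune()
) |>
  set_engine("earth")

mars_spec
```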

Beyond that, it is up to you to decide what variables to use as predictors, additional feature engineering steps, tuning parameters, tuning method, etc. You should document these decisions and explain why you made them. Report the results of at least 3 distinct workflows for a MARS model. Be sure to include the accuracy, Brier score, and ROC AUCs of the models (as well as any other metrics you deem relevant), and generate a calibration plot for the best performing model.

Exercise 6

Evaluate a boosted tree model.

At minimum, the feature engineering recipe should:

  • Convert any multi-response categorical variables to one column per response.
  • Do everything recommended/required for a boosted tree model.

The model should be tuned over an appropriate set of parameters.
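
One possible starting point is sketched below. The choice of the `"xgboost"` engine, the three tuned parameters, and the regular grid are all assumptions made for illustration; the assignment leaves these decisions to you.

```r
library(tidymodels)

# Boosted tree specification; the engine and tuned parameters are assumptions.
boost_spec <- boost_tree(
  trees = tune(),
  tree_depth = tune(),
  learn_rate = tune()
) |>
  set_engine("xgboost") |>
  set_mode("classification")

# A regular grid with 3 levels per parameter gives 3^3 = 27 candidates.
boost_grid <- grid_regular(
  trees(),
  tree_depth(),
  learn_rate(),
  levels = 3
)

nrow(boost_grid)
```

Regular grids grow quickly as parameters are added, so a space-filling design or an iterative search may be more economical.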

Beyond that, it is up to you to decide what variables to use as predictors, additional feature engineering steps, tuning parameters, tuning method, etc. You should document these decisions and explain why you made them. Report the results of at least 2 distinct workflows for a boosted tree model. Be sure to include the accuracy, Brier score, and ROC AUCs of the models (as well as any other metrics you deem relevant), and generate a calibration plot for the best performing model.

Exercise 7

Finalize a predictive model. Choose a model from a previous exercise (or try something else) and finalize it. Report the final accuracy, Brier score, and ROC AUC of the model, and generate a calibration plot for the best performing model. Interpret the performance of your model.
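
The mechanics of finalizing can be sketched with `finalize_workflow()` and `last_fit()`, shown here on simulated data. The model (a multinomial regression with the `"nnet"` engine), the predictors `x1` and `x2`, and the penalty value `0.01` are hypothetical stand-ins; in practice the parameter values would come from something like `select_best()` on your tuning results.

```r
library(tidymodels)

set.seed(4940)

# Simulated stand-in data; x1 and x2 are hypothetical predictors.
toy <- tibble(
  x1 = rnorm(200),
  x2 = rnorm(200),
  prefer_overall = factor(
    rep(c("Coffee A", "Coffee B", "Coffee C", "Coffee D"), times = 50)
  )
)
toy_split <- initial_split(toy, prop = 0.75, strata = prefer_overall)

# A workflow with one tuning parameter left unspecified.
wf <- workflow() |>
  add_formula(prefer_overall ~ x1 + x2) |>
  add_model(multinom_reg(penalty = tune()) |> set_engine("nnet"))

# Plug in the chosen value (hypothetical here).
final_wf <- finalize_workflow(wf, tibble(penalty = 0.01))

# Fit on the full training set and evaluate once on the test set.
final_fit <- last_fit(
  final_wf,
  split = toy_split,
  metrics = metric_set(accuracy, brier_class, roc_auc)
)

final_metrics <- collect_metrics(final_fit)
final_metrics
```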

Exercise 8

Your client is excited to hear what you have done for this project and eager to deploy this model ASAP. What is your recommendation?

Generative AI (GAI) self-reflection

As stated in the syllabus, include a written reflection for this assignment of how you used GAI tools (e.g. what tools you used, how you used them to assist you with writing code), what skills you believe you acquired, and how you believe you demonstrated mastery of the learning objectives.

Render, commit, and push one last time.

Make sure that you commit and push all changed documents and your Git pane is completely empty before proceeding.

Wrap up

Submission

  • Go to http://www.gradescope.com and click Log in in the top right corner.
  • Click School Credentials → Cornell University NetID and log in using your NetID credentials.
  • Click on your INFO 4940/5940 course.
  • Click on the assignment, and you’ll be prompted to submit it.
  • Mark the pages associated with each exercise. All the pages of your homework should be associated with at least one question (i.e., should be “checked”).

Grading

  • Exercise 1: 4 points
  • Exercise 2: 4 points
  • Exercise 3: 2 points
  • Exercise 4: 10 points
  • Exercise 5: 10 points
  • Exercise 6: 10 points
  • Exercise 7: 5 points
  • Exercise 8: 5 points
  • GAI self-reflection: 0 points
  • Total: 50 points