HW 04 - Predict coffee preferences

Homework
Modified

October 3, 2025

Important

This homework is due October 8 at 11:59pm ET.

Learning objectives

  • Implement data cleaning and preparation
  • Define data preprocessing and feature engineering steps for predictive models
  • Estimate a range of model types
  • Creatively explore alternative model workflows
  • Utilize tuning parameters to increase model performance
  • Evaluate model performance

Getting started

  • Go to the info4940-fa25 organization on GitHub. Click on the repo with the prefix hw-04. It contains the starter documents you need to complete the homework.

  • Clone the repo and start a new workspace in Positron. See the Homework 0 instructions for details on cloning a repo and starting a new R project.

General guidance

Tip: Guidelines + tips

  • Set your random seed to ensure reproducible results.
  • Use caching to speed up the rendering process.
  • Use parallel processing to speed up rendering time. Note that this works differently on different systems and operating systems, and it also makes it harder to debug code and track progress in model fitting. Use at your own discretion.
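As a minimal sketch of the seeding tip, assuming Python's standard-library `random` module stands in for whatever generator your modeling library actually uses (the seed value 4940 and the helper name `draw_sample` are illustrative assumptions, not part of the assignment):

```python
import random

SEED = 4940  # arbitrary illustrative value, not prescribed by the assignment


def draw_sample(seed: int, n: int = 5) -> list:
    """Return n pseudo-random numbers from a generator seeded with `seed`."""
    rng = random.Random(seed)  # a local generator avoids touching global state
    return [rng.random() for _ in range(n)]


# Two runs with the same seed produce identical draws, which is what makes
# resampling splits and tuning results repeatable across renders.
first = draw_sample(SEED)
second = draw_sample(SEED)
```

Passing the seed explicitly to each step that uses randomness, rather than relying on one global call, tends to keep results stable when chunks are cached or run in parallel.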

Remember that continuing to develop a sound workflow for reproducible data analysis is important as you complete this homework and other assignments in this course. Periodic reminders throughout this assignment will prompt you to render, commit, and push your changes to GitHub. You should have at least 3 commits with meaningful commit messages by the end of the assignment.

Tip: Workflow + formatting

Make sure to

  • Update author name on your document.
  • Label all code chunks informatively and concisely.
  • Follow consistent code style guidelines.
  • Make at least 3 commits.
  • Resize figures where needed; avoid tiny or huge plots.
  • Turn in an organized, well formatted document.

Warning: Presenting results of multiple models

For the love of all that is pure in this world, please consider how to present the results of your modeling efforts. Do not just rely on raw output from R to tell us what we need to know.

  • Your plots should include an informative title, axes should be labeled, and careful consideration should be given to aesthetic choices.
  • Maybe condense information into one or a handful of custom graphs.
  • You can create simple formatted tables using {gt} (R) or great_tables (Python).

The Great American Coffee Tasting

In October 2023, James Hoffmann and coffee company Cometeer held the “Great American Coffee Taste Test” on YouTube, during which viewers were asked to fill out a survey about 4 coffees they ordered from Cometeer for the tasting. Tidy Tuesday published the data set we are using.

It includes the following features:

variable | class | description
submission_id | character | Submission ID
age | character | What is your age?
cups | character | How many cups of coffee do you typically drink per day?
where_drink | character | Where do you typically drink coffee?
brew | character | How do you brew coffee at home?
brew_other | character | How else do you brew coffee at home?
purchase | character | On the go, where do you typically purchase coffee?
purchase_other | character | Where else do you purchase coffee?
favorite | character | What is your favorite coffee drink?
favorite_specify | character | Please specify what your favorite coffee drink is
additions | character | Do you usually add anything to your coffee?
additions_other | character | What else do you add to your coffee?
dairy | character | What kind of dairy do you add?
sweetener | character | What kind of sugar or sweetener do you add?
style | character | Before today’s tasting, which of the following best described what kind of coffee you like?
strength | character | How strong do you like your coffee?
roast_level | character | What roast level of coffee do you prefer?
caffeine | character | How much caffeine do you like in your coffee?
expertise | numeric | Lastly, how would you rate your own coffee expertise?
coffee_a_bitterness | numeric | Coffee A - Bitterness
coffee_a_acidity | numeric | Coffee A - Acidity
coffee_a_personal_preference | numeric | Coffee A - Personal Preference
coffee_a_notes | character | Coffee A - Notes
coffee_b_bitterness | numeric | Coffee B - Bitterness
coffee_b_acidity | numeric | Coffee B - Acidity
coffee_b_personal_preference | numeric | Coffee B - Personal Preference
coffee_b_notes | character | Coffee B - Notes
coffee_c_bitterness | numeric | Coffee C - Bitterness
coffee_c_acidity | numeric | Coffee C - Acidity
coffee_c_personal_preference | numeric | Coffee C - Personal Preference
coffee_c_notes | character | Coffee C - Notes
coffee_d_bitterness | numeric | Coffee D - Bitterness
coffee_d_acidity | numeric | Coffee D - Acidity
coffee_d_personal_preference | numeric | Coffee D - Personal Preference
coffee_d_notes | character | Coffee D - Notes
prefer_abc | character | Between Coffee A, Coffee B, and Coffee C which did you prefer?
prefer_ad | character | Between Coffee A and Coffee D, which did you prefer?
prefer_overall | character | Lastly, what was your favorite overall coffee?
wfh | character | Do you work from home or in person?
total_spend | character | In total, how much money do you typically spend on coffee in a month?
why_drink | character | Why do you drink coffee?
why_drink_other | character | Other reason for drinking coffee
taste | character | Do you like the taste of coffee?
know_source | character | Do you know where your coffee comes from?
most_paid | character | What is the most you’ve ever paid for a cup of coffee?
most_willing | character | What is the most you’d ever be willing to pay for a cup of coffee?
value_cafe | character | Do you feel like you’re getting good value for your money when you buy coffee at a cafe?
spent_equipment | character | Approximately how much have you spent on coffee equipment in the past 5 years?
value_equipment | character | Do you feel like you’re getting good value for your money with regards to your coffee equipment?
gender | character | Gender
gender_specify | character | Gender (please specify)
education_level | character | Education Level
ethnicity_race | character | Ethnicity/Race
ethnicity_race_specify | character | Ethnicity/Race (please specify)
employment_status | character | Employment Status
number_children | character | Number of Children
political_affiliation | character | Political Affiliation

Our client ran a survey to better understand the preferences of potential customers for their new line of coffee. The survey includes questions about the potential customers’ coffee preferences, demographics, and coffee consumption habits, as well as taste test results of four varieties of coffee.

Our client wants to know if it should recommend a new variety of coffee (coffee D) based on customer demographics, preferences, and ratings for a standardized set of three coffee varieties.

Exercise 1

Import the data set and prepare it for modeling. Implement the data preparation we discussed in the last application exercise. Specifically,

  • Keep only columns that would plausibly be available without users conducting the taste test and that provide useful, relevant information for modeling.
  • Convert categorical variables to an appropriate format (see footnote 1).
  • Drop observations with missing values for the outcome of interest, coffee_d_personal_preference.
  • Convert the outcome of interest to a binary variable indicating whether the respondent liked coffee D or not (\(\text{Like} > 3\)).
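The last two preparation steps can be sketched as follows. This is only an illustration in plain Python on a handful of made-up rows; the new column name `likes_coffee_d`, the level names "Like" and "Dislike", and the helper `label_outcome` are assumptions, and in practice you would apply the same logic with your data frame library of choice.

```python
from typing import Optional

OUTCOME = "coffee_d_personal_preference"

# Hypothetical rows: only the outcome column matters for this illustration.
rows = [
    {"submission_id": "a1", OUTCOME: 5},
    {"submission_id": "b2", OUTCOME: None},  # missing outcome, to be dropped
    {"submission_id": "c3", OUTCOME: 3},
    {"submission_id": "d4", OUTCOME: 4},
]


def label_outcome(rating: Optional[float]) -> Optional[str]:
    """Map a numeric rating to a binary label: 'Like' only if rating > 3."""
    if rating is None:
        return None
    return "Like" if rating > 3 else "Dislike"


# Drop rows with a missing outcome, then attach the binary label.
prepared = [
    {**row, "likes_coffee_d": label_outcome(row[OUTCOME])}
    for row in rows
    if row[OUTCOME] is not None
]
```

Note that a rating of exactly 3 falls on the "Dislike" side, because the threshold is strictly greater than 3.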

Exercise 2

Partition the data set. Split the data into training/test sets. Further partition the training set using an appropriate resampling method. You will use this resampled set for all model training and evaluation. Provide a brief explanation for why you chose this specific technique.
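One possible partitioning scheme, shown here only as a sketch, is a stratified train/test split followed by stratified k-fold cross-validation on the training set. The function names, the 25% test fraction, the choice of 5 folds, and the made-up label vector are all assumptions for illustration; the exercise leaves the choice of resampling method, and its justification, to you.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction, seed):
    """Split row indices into train/test, keeping class proportions similar."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train, test = [], []
    for members in by_class.values():
        rng.shuffle(members)
        n_test = round(len(members) * test_fraction)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return sorted(train), sorted(test)


def stratified_folds(indices, labels, k, seed):
    """Assign training indices to k folds, dealing each class out in turn."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx in indices:
        by_class[labels[idx]].append(idx)
    folds = [[] for _ in range(k)]
    for members in by_class.values():
        rng.shuffle(members)
        for position, idx in enumerate(members):
            folds[position % k].append(idx)
    return folds


labels = ["Like"] * 60 + ["Dislike"] * 40  # made-up outcome vector
train_idx, test_idx = stratified_split(labels, test_fraction=0.25, seed=4940)
folds = stratified_folds(train_idx, labels, k=5, seed=4940)
```

Each fold serves once as the assessment set while the remaining folds form the analysis set. The test indices are set aside and not touched until the final model is chosen.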

Exercise 3

Estimate a null model. Predict a respondent’s preference for coffee D. Calculate the accuracy, Brier score, and ROC AUC of the model (as well as any other metrics you deem relevant). Interpret the results.
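To make the three metrics concrete, here is a small sketch of a null model that assigns every respondent the same predicted probability, namely the share of "Like" responses in the training data. The tiny label vectors and the function names are made up for illustration, and the 0.5 classification threshold is an assumption.

```python
def accuracy(y_true, y_prob, threshold=0.5):
    """Share of cases where the thresholded prediction matches the truth."""
    preds = [1 if p >= threshold else 0 for p in y_prob]
    return sum(p == y for p, y in zip(preds, y_true)) / len(y_true)


def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for p, y in zip(y_prob, y_true)) / len(y_true)


def roc_auc(y_true, y_prob):
    """Chance that a random positive outranks a random negative (ties = 0.5)."""
    pos = [p for p, y in zip(y_prob, y_true) if y == 1]
    neg = [p for p, y in zip(y_prob, y_true) if y == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


# Made-up outcomes: 1 = likes coffee D, 0 = does not.
y_train = [1] * 6 + [0] * 4
y_assess = [1, 1, 0, 1, 0]

# The null model ignores all predictors and predicts the training base rate.
base_rate = sum(y_train) / len(y_train)
null_prob = [base_rate] * len(y_assess)

null_metrics = {
    "accuracy": accuracy(y_assess, null_prob),
    "brier": brier_score(y_assess, null_prob),
    "roc_auc": roc_auc(y_assess, null_prob),
}
```

Because every prediction is identical, the null model cannot rank respondents, so its ROC AUC is 0.5 by construction, and its accuracy equals the share of the majority class in the assessment data. These values are the baseline that later models need to beat.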

Exercise 4

Evaluate a penalized regression model.

At minimum, the feature engineering recipe should do everything recommended/required for a penalized regression model.

The model should be tuned, at minimum, over the penalty and mixture tuning parameters.
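A regular tuning grid over these two parameters might be sketched like this. The ranges, the grid size, and the helper name `log_spaced` are assumptions chosen for illustration. The penalty is spaced on a log10 scale because useful values typically span several orders of magnitude, while the mixture runs from 0 (pure ridge) to 1 (pure lasso).

```python
import itertools


def log_spaced(low_exp, high_exp, n):
    """Return n values evenly spaced on a log10 scale from 10**low_exp to 10**high_exp."""
    step = (high_exp - low_exp) / (n - 1)
    return [10 ** (low_exp + i * step) for i in range(n)]


penalties = log_spaced(-4, 0, 5)          # total amount of regularization
mixtures = [0.0, 0.25, 0.5, 0.75, 1.0]    # 0 = ridge, 1 = lasso

# Every combination of penalty and mixture becomes one candidate model.
grid = [
    {"penalty": penalty, "mixture": mixture}
    for penalty, mixture in itertools.product(penalties, mixtures)
]
```

Each candidate is then fit and scored on every resample, so the cost grows with the product of grid size and number of folds.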

Beyond that, it is up to you to decide what variables to use as predictors, additional feature engineering steps, tuning parameters, tuning method, etc. You should document these decisions and explain why you made them. Report the results of at least 3 distinct workflows for a penalized regression model. Be sure to include the accuracy, Brier score, and ROC AUCs of the models (as well as any other metrics you deem relevant), and generate a calibration plot for the best performing model.
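The numbers behind a calibration plot can be sketched as follows: predicted probabilities are grouped into equal-width bins, and within each bin the mean predicted probability is compared to the observed rate of "Like" outcomes. The function name `calibration_table`, the number of bins, and the example predictions are assumptions for illustration, not output from any real model.

```python
def calibration_table(y_true, y_prob, n_bins=5):
    """Summarize predicted vs. observed rates within equal-width probability bins."""
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, y_prob):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))
    table = []
    for members in bins:
        if not members:
            continue  # skip empty bins rather than dividing by zero
        table.append({
            "mean_predicted": sum(p for p, _ in members) / len(members),
            "observed_rate": sum(y for _, y in members) / len(members),
            "n": len(members),
        })
    return table


# Made-up held-out predictions (1 = likes coffee D).
y_true = [0, 0, 0, 1, 1, 1]
y_prob = [0.10, 0.15, 0.45, 0.55, 0.90, 0.95]
table = calibration_table(y_true, y_prob)
```

Plotting `observed_rate` against `mean_predicted` gives the calibration curve; points near the diagonal indicate that predicted probabilities can be read at face value.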

Note: Generating calibration plots

Note

I have established the floor of what you need to do for each exercise. To earn full credit, please think deliberately about your modeling approach and describe/justify it in your submission.

Exercise 5

Evaluate a boosted trees model.

At minimum, the feature engineering recipe should do everything recommended/required for a boosted tree model.

The model should be tuned over an appropriate set of parameters.

Beyond that, it is up to you to decide what variables to use as predictors, additional feature engineering steps, tuning parameters, tuning method, etc. You should document these decisions and explain why you made them. Report the results of at least 2 distinct workflows for a boosted tree model. Be sure to include the accuracy, Brier score, and ROC AUCs of the models (as well as any other metrics you deem relevant), and generate a calibration plot for the best performing model.
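With several tuning parameters, a regular grid grows quickly, so one alternative is to sample candidates at random from the parameter space. The sketch below assumes three commonly tuned boosted tree parameters (number of trees, tree depth, and learning rate); the ranges, the number of candidates, and the function name `sample_candidates` are illustrative assumptions, not recommendations.

```python
import random


def sample_candidates(n, seed):
    """Draw n random hyperparameter combinations for a boosted tree model."""
    rng = random.Random(seed)
    return [
        {
            "trees": rng.randint(100, 1000),         # number of boosting rounds
            "tree_depth": rng.randint(2, 8),         # maximum depth of each tree
            "learn_rate": 10 ** rng.uniform(-3, -1), # sampled on a log10 scale
        }
        for _ in range(n)
    ]


candidates = sample_candidates(10, seed=4940)
```

Fixing the seed keeps the candidate set identical across renders, so results from different workflows remain comparable.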

Exercise 6

Choose your own adventure. Choose and fit a third model of your choice to predict whether or not the respondent likes coffee D. This can be any model you want, but it must be different from the penalized regression and boosted trees models you already fit. Document your choice of model and justify why you think it is appropriate for this problem. Perform any data preprocessing or feature engineering steps you think are appropriate to improve the model’s performance. Tune the model across relevant hyperparameters. How does this model perform?

Exercise 7

Finalize a predictive model. Choose a model from a previous exercise (or try something else) and finalize it. Report the final accuracy, Brier score, and ROC AUC of the model using the test set, and generate a calibration plot for the best performing model. Interpret the performance of your model.

Exercise 8

Your client is excited to hear what you have done for this project and eager to deploy this model ASAP. What is your recommendation?

Generative AI (GAI) self-reflection

As stated in the syllabus, include a written reflection for this assignment of how you used GAI tools (e.g. what tools you used, how you used them to assist you with writing code), what skills you believe you acquired, and how you believe you demonstrated mastery of the learning objectives.

Render, commit, and push one last time.

Make sure that you commit and push all changed documents and your Git pane is completely empty before proceeding.

Wrap up

Submission

  • Go to http://www.gradescope.com and click Log in in the top right corner.
  • Click School Credentials \(\rightarrow\) Cornell University NetID and log in using your NetID credentials.
  • Click on your INFO 4940/5940 course.
  • Click on the assignment, and you’ll be prompted to submit it.
  • Mark all the pages associated with each exercise. All the pages of your homework should be associated with at least one question (i.e., should be “checked”).

Grading

  • Exercise 1: 4 points
  • Exercise 2: 4 points
  • Exercise 3: 2 points
  • Exercise 4: 10 points
  • Exercise 5: 10 points
  • Exercise 6: 10 points
  • Exercise 7: 5 points
  • Exercise 8: 5 points
  • GAI self-reflection: 0 points
  • Total: 50 points

Footnotes

  1. In R this would be factor variables. In Python we’re probably looking at Categoricals.