HW 01 - Machine learning workflows and methods

Homework

Modified

September 18, 2024

Important

This homework is due September 18 at 11:59pm ET.

Learning objectives

Develop a workflow for fitting machine learning models using the {tidymodels} framework.
Evaluate the bias and variance of different resampling methods.
Implement methods for handling class imbalance.
Utilize a variety of metrics to evaluate model performance.
Implement feature engineering techniques for categorical variables.

Getting started

Go to the info4940-fa24 organization on GitHub. Click on the repo with the prefix hw-01. It contains the starter documents you need to complete the homework.
Clone the repo and start a new project in RStudio.

Packages

library(tidyverse)
library(tidymodels)
library(scales)
library(themis)
library(tictoc)
library(embed)
library(textrecipes)
library(probably)

Guidelines + tips

Your plots should include an informative title, axes should be labeled, and careful consideration should be given to aesthetic choices.
Set your random seed to ensure reproducible results.
Use caching to speed up the rendering process.
Use parallel processing to speed up rendering time. Note that this works differently on different systems and operating systems, and it also makes it harder to debug code and track progress in model fitting. Use at your own discretion.

Remember that continuing to develop a sound workflow for reproducible data analysis is important as you complete this homework and other assignments in this course. Periodic reminders throughout this assignment will prompt you to render, commit, and push your changes to GitHub. You should have at least 3 commits with meaningful commit messages by the end of the assignment.

Workflow + formatting

Make sure to

Update author name on your document.
Label all code chunks informatively and concisely.
Follow the Tidyverse code style guidelines.
Make at least 3 commits.
Resize figures where needed; avoid tiny or huge plots.
Turn in an organized, well formatted document.

Simulation: Bias vs. variance with resampling methods

Let’s evaluate the claims we made in class related to the relative bias and variance of different resampling methods. We will use a simulated dataset to compare the performance of \(V\)-fold cross-validation, repeated \(V\)-fold cross-validation, Monte Carlo cross-validation, and bootstrapping.

Warning

For this section, do not use parallel processing. Just use the default (sequential processing).

data/resample_sim_train.csv contains a simulated dataset with 5,000 observations, 1 outcome (outcome), and 20 predictors. The outcome of interest is a continuous variable.

sim_train <- read_csv("data/resample_sim_train.csv")

Rows: 5000 Columns: 21
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (21): outcome, predictor_01, predictor_02, predictor_03, predictor_04, p...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

glimpse(sim_train)

Rows: 5,000
Columns: 21
$ outcome      <dbl> 3.715720, -28.253089, 15.803279, 18.255588, 2.321731, 13.…
$ predictor_01 <dbl> -1.37765970, -4.70388939, 0.03288553, 0.97303952, 1.84176…
$ predictor_02 <dbl> -0.89896948, -2.38690900, -0.07971423, -1.10346354, 1.645…
$ predictor_03 <dbl> 2.4280448, -4.5665401, 1.3775750, 7.5457395, -2.9159828, …
$ predictor_04 <dbl> -0.4517618, -0.4225646, 0.1853948, 1.6947051, -0.7426996,…
$ predictor_05 <dbl> 0.8128607, -2.9563295, 0.4365355, -0.5164981, 2.7157868, …
$ predictor_06 <dbl> 6.23290764, 3.14134488, 1.97238523, 2.45810059, -0.153283…
$ predictor_07 <dbl> 0.6222154, -2.0954605, 4.8815526, -3.9386134, -0.8049892,…
$ predictor_08 <dbl> -6.19278008, -5.28192443, -0.21642289, -5.00111640, 3.674…
$ predictor_09 <dbl> 2.3515567, 0.6170276, -0.2853378, 6.1092727, 2.4543098, 0…
$ predictor_10 <dbl> 1.8225728, -0.4053848, -0.1229824, 2.3404391, 1.0544255, …
$ predictor_11 <dbl> -0.5123522, -4.8681184, -0.2878140, -2.2457106, 1.3718077…
$ predictor_12 <dbl> 4.24047792, -0.08321799, 3.22990898, -2.78487430, -4.8905…
$ predictor_13 <dbl> -4.5012803, 2.0754524, -0.7997411, -2.0700509, 2.6306113,…
$ predictor_14 <dbl> 0.541826751, 0.420860265, -0.707170291, 0.861794064, 4.96…
$ predictor_15 <dbl> -2.3390576, 0.9432659, 2.9177289, 0.6405517, 1.3824467, -…
$ predictor_16 <dbl> 0.53484223, -3.19350666, -7.43957140, -0.74147010, -0.279…
$ predictor_17 <dbl> -0.68871190, 0.53370414, -2.26836840, -3.51271096, -3.751…
$ predictor_18 <dbl> 2.57010785, 6.31938206, -3.87701320, 4.50806322, 5.073874…
$ predictor_19 <dbl> -1.8537024, -0.9869667, -2.0694273, 4.4028691, 3.5209847,…
$ predictor_20 <dbl> -0.914632996, -3.111775176, 2.353765622, -4.705561921, 1.…

Fit a model using different resampling methods

Exercise 1

Implement a simple linear regression model predicting outcome as a function of all the other predictors, varying your resampling procedure. Use the entire training set to fit the model.

The resampling methods you will evaluate are:

\(V\)-fold cross-validation with \(V = 5, 10, 25, 50, 100\)
Repeated \(V\)-fold cross-validation with all combinations of \(V = 5, 10, 25, 50, 100\) and \(\text{repeats} = 5, 10, 25\)
Monte Carlo cross-validation with all combinations of \(\text{prop} = 0.5, 0.6, 0.7, 0.8, 0.9, 0.95\) and \(\text{times} = 5, 10, 25, 50, 100\)
Bootstrap resampling with \(\text{times} = 5, 10, 25, 50, 100, 500, 1000\)

For each resampling method, calculate the mean and standard error of the mean absolute error (MAE) across all resamples. Also record how long it takes to train the models.
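
As a rough sketch of how the four resampling schemes map onto {rsample} functions, the block below builds one resampling object of each kind on a small simulated data frame. The `toy` data and the specific parameter values are placeholders chosen for illustration, not the assignment data; the point is only that each object holds one row per resample.

```r
library(rsample)

set.seed(123)

# toy data standing in for sim_train (simulated here, purely illustrative)
toy <- data.frame(outcome = rnorm(200), predictor_01 = rnorm(200))

# one resampling object per method
folds_v   <- vfold_cv(toy, v = 5)                # V-fold CV
folds_rep <- vfold_cv(toy, v = 5, repeats = 5)   # repeated V-fold CV
folds_mc  <- mc_cv(toy, prop = 0.8, times = 10)  # Monte Carlo CV
folds_bt  <- bootstraps(toy, times = 25)         # bootstrap

# number of resamples (model fits) implied by each object
c(
  vfold = nrow(folds_v),
  repeated = nrow(folds_rep),
  mc = nrow(folds_mc),
  boot = nrow(folds_bt)
)
```

Repeated 5-fold cross-validation with 5 repeats produces 25 resamples, which is one reason computation time grows quickly as you move through the parameter combinations.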

Timing code in R

There are several ways you can track how long it takes for code to execute in R. I like the {tictoc} package which uses two functions to start and stop a timer.

library(tictoc)

# start the timer
tic()

# do something to make time pass by
Sys.sleep(time = 1)

# stop the timer
toc()

1.002 sec elapsed

# what if you need to store the time elapsed in an object?
tic()
Sys.sleep(time = 1)
time_elapsed <- toc()

1.007 sec elapsed

# check the results
time_elapsed

$tic
elapsed 
   3.06 

$toc
elapsed 
  4.067 

$msg
logical(0)

$callback_msg
[1] "1.007 sec elapsed"

# list object - what if we just want the elapsed time?
time_elapsed$toc - time_elapsed$tic

elapsed 
  1.007
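
If you plan to time many model fits, the two steps above can be wrapped in a small helper. This is only a sketch: `time_it()` is a hypothetical name, not a {tictoc} function, and it relies on R's lazy evaluation so that the expression runs after the timer starts.

```r
library(tictoc)

# hypothetical helper: returns the elapsed seconds for an expression
time_it <- function(expr) {
  tic()
  force(expr)                  # the expression is evaluated here, after tic()
  timing <- toc(quiet = TRUE)  # quiet = TRUE suppresses the printed message
  unname(timing$toc - timing$tic)
}

elapsed <- time_it(Sys.sleep(time = 0.2))
elapsed
```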

Visualize the results using a plot with MAE on the \(y\)-axis and the resampling parameters on the \(x\)-axis or through color. Make sure to incorporate both the estimated MAE as well as the uncertainty of the estimate. You’ll probably make multiple plots - that’s okay. Just make sure you can reasonably compare across the different resampling methods.

The test set MAE for the linear regression model is \(13.51\).¹ Benchmark your resampling methods against this value and visually incorporate it into your charts.

¹ I did not give you the test set. That’s fine.
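
One possible layout for these charts is a point range (estimate plus or minus one standard error) with the test set benchmark drawn as a horizontal reference line. The sketch below uses invented placeholder numbers in `toy_results` solely to show the plot structure; they are not results from the assignment, and only the 13.51 benchmark comes from the text above.

```r
library(ggplot2)

# invented placeholder values, NOT real results
toy_results <- data.frame(
  v = c(5, 10, 25),
  mean = c(13.60, 13.55, 13.52),
  std_err = c(0.20, 0.15, 0.10)
)

p <- ggplot(toy_results, aes(x = factor(v), y = mean)) +
  # estimate with +/- 1 standard error
  geom_pointrange(aes(ymin = mean - std_err, ymax = mean + std_err)) +
  # test set MAE reported in the assignment
  geom_hline(yintercept = 13.51, linetype = "dashed") +
  labs(
    title = "Resampled MAE by number of folds (placeholder data)",
    x = "Number of folds (V)",
    y = "Mean absolute error"
  )

p
```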

Need some advice getting started?

You’re doing essentially the same thing for each resampling method/combination of parameters. Consider writing a function that uses the splits object created by the resampling functions. Within the function, define the model specification and workflow, fit the model, and collect the metrics. Return a data frame containing the model metrics and the computation time.
Within each resampling method, every parameter combination calls the same function (e.g. vfold_cv(), mc_cv(), bootstraps()) with different arguments. Use iterative operations such as for loops or map() functions to apply your function to each resampling method and set of parameters.
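
The advice above might be sketched as follows. This is one possible structure under stated assumptions, not a reference solution: `fit_with_resamples()` and its `label` argument are hypothetical names, and `toy` is simulated data standing in for the training set.

```r
library(tidymodels)
library(tictoc)

# hypothetical helper: fit a linear regression across the resamples in
# `splits` and return the MAE summary plus the elapsed time
fit_with_resamples <- function(splits, label) {
  lm_wf <- workflow() |>
    add_formula(outcome ~ .) |>
    add_model(linear_reg())

  tic()
  res <- fit_resamples(lm_wf, resamples = splits, metrics = metric_set(mae))
  timing <- toc(quiet = TRUE)

  # collect_metrics() reports the mean and standard error across resamples
  collect_metrics(res) |>
    mutate(method = label, seconds = unname(timing$toc - timing$tic))
}

set.seed(123)
toy <- tibble(
  x1 = rnorm(300),
  x2 = rnorm(300),
  outcome = 2 * x1 - x2 + rnorm(300)
)

# iterate over a named list of resampling objects
results <- list(
  "5-fold" = vfold_cv(toy, v = 5),
  "bootstrap (10)" = bootstraps(toy, times = 10)
) |>
  imap(fit_with_resamples) |>
  bind_rows()

results
```

Keeping the model specification inside the function means every resampling method is evaluated with exactly the same workflow, so differences in the results come from the resampling scheme alone.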

Now is a good time to render, commit, and push.

Make sure that you commit and push all changed documents and your Git pane is completely empty before proceeding.

Exercise 2

Analyze the results in terms of the relative bias and variance of the resampling methods. Specifically address these three questions, using your results from the previous exercise to support your claims.

Which resampling methods seem to have higher/lower bias?
Which resampling methods seem to have higher/lower variance?
We rarely have the ability to choose the “perfect” method in machine learning. Given the trade-offs between bias and variance as well as the computation time of each method, in what situations would you use each resampling method?

Now is a good time to render, commit, and push.

Make sure that you commit and push all changed documents and your Git pane is completely empty before proceeding.

Data 1: Project FeederWatch

Project FeederWatch is a citizen science project that collects data on birds and other wildlife that visit feeders in North America.² A subset of their data was published in 2021 as part of the Tidy Tuesday project. Here we will use the data to model the presence of squirrels at bird feeders.

² Co-operated by our very own Cornell Lab of Ornithology!

Note

As you age, you will develop Old Person interests. One of mine and my wife’s is watching the animals around our house. We have multiple bird feeders which attract songbirds. Unfortunately the squirrels also like to eat the birdseed, cluttering our front porch with seed shells.

data/squirrels.csv contains a lightly modified version of the raw data file, filtering for missing values and operationalizing our outcome of interest squirrels as either “squirrels” or “no squirrels” based on whether squirrels take food from the feeder at least 3 times per week.³

³ {tidymodels} expects discrete outcomes to be defined as character strings or factors, not logicals.

squirrels <- read_csv("data/squirrels.csv")

Rows: 235685 Columns: 59
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (1): squirrels
dbl (58): yard_type_pavement, yard_type_garden, yard_type_landsca, yard_type...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

glimpse(squirrels)

Rows: 235,685
Columns: 59
$ squirrels                    <chr> "no squirrels", "no squirrels", "no squir…
$ yard_type_pavement           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ yard_type_garden             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ yard_type_landsca            <dbl> 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1,…
$ yard_type_woods              <dbl> 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1,…
$ yard_type_desert             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ hab_dcid_woods               <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, NA, NA, NA,…
$ hab_evgr_woods               <dbl> NA, NA, NA, NA, 0, 0, NA, NA, NA, NA, NA,…
$ hab_mixed_woods              <dbl> 1, 1, 1, 1, 1, 1, NA, NA, NA, NA, 1, 1, 1…
$ hab_orchard                  <dbl> NA, NA, NA, NA, 0, 0, NA, NA, NA, NA, NA,…
$ hab_park                     <dbl> NA, NA, NA, NA, 0, 0, NA, NA, NA, NA, 1, …
$ hab_water_fresh              <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ hab_water_salt               <dbl> NA, NA, NA, NA, 0, 0, NA, NA, NA, NA, 1, …
$ hab_residential              <dbl> 1, 1, 1, 1, 1, 1, NA, NA, NA, NA, 1, 1, 1…
$ hab_industrial               <dbl> NA, NA, NA, NA, 0, 0, NA, NA, NA, NA, 1, …
$ hab_agricultural             <dbl> 1, 1, 1, 1, 1, 1, NA, NA, NA, NA, 1, 1, 1…
$ hab_desert_scrub             <dbl> NA, NA, NA, NA, 0, 0, NA, NA, NA, NA, NA,…
$ hab_young_woods              <dbl> NA, NA, NA, NA, 0, 0, NA, NA, NA, NA, NA,…
$ hab_swamp                    <dbl> NA, NA, NA, NA, 0, 0, NA, NA, NA, NA, NA,…
$ hab_marsh                    <dbl> 1, 1, 1, 1, 1, 1, NA, NA, NA, NA, 1, 1, 1…
$ evgr_trees_atleast           <dbl> 11, 11, 11, 11, 11, 11, 0, 0, 0, 4, 1, 1,…
$ evgr_shrbs_atleast           <dbl> 4, 4, 4, 4, 1, 1, 1, 1, 1, 1, 4, 4, 4, 4,…
$ dcid_trees_atleast           <dbl> 11, 11, 11, 11, 1, 1, 11, 11, 11, 11, 4, …
$ dcid_shrbs_atleast           <dbl> 4, 4, 4, 4, 4, 4, 11, 11, 11, 11, 4, 4, 4…
$ fru_trees_atleast            <dbl> 4, 4, 4, 4, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ cacti_atleast                <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ brsh_piles_atleast           <dbl> 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1,…
$ water_srcs_atleast           <dbl> 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0,…
$ bird_baths_atleast           <dbl> 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1,…
$ nearby_feeders               <dbl> 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1,…
$ cats                         <dbl> 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1,…
$ dogs                         <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1,…
$ humans                       <dbl> 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1,…
$ housing_density              <dbl> 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 2, 2, 2, 2,…
$ fed_in_jan                   <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, NA, 1, N…
$ fed_in_feb                   <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, NA, 1, N…
$ fed_in_mar                   <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, NA, 1, N…
$ fed_in_apr                   <dbl> 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, NA, 0, N…
$ fed_in_may                   <dbl> 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, NA, 0, N…
$ fed_in_jun                   <dbl> 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, NA, 0, N…
$ fed_in_jul                   <dbl> 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, NA, 0, N…
$ fed_in_aug                   <dbl> 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, NA, 0, N…
$ fed_in_sep                   <dbl> 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, NA, 0, N…
$ fed_in_oct                   <dbl> 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, NA, 1, N…
$ fed_in_nov                   <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, NA, 1, N…
$ fed_in_dec                   <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, NA, 1, N…
$ numfeeders_suet              <dbl> 1, 1, 1, 1, 1, 1, 3, 3, 3, 3, 1, 1, 1, 2,…
$ numfeeders_ground            <dbl> NA, 0, 0, 0, NA, NA, 1, 1, 1, 1, 3, 3, 3,…
$ numfeeders_hanging           <dbl> 1, 1, 1, 3, NA, NA, 2, 2, 2, 2, 2, 2, 1, …
$ numfeeders_platfrm           <dbl> 1, 1, 1, 0, NA, NA, 1, 1, 1, 2, 1, 1, 1, …
$ numfeeders_humming           <dbl> NA, 0, 0, 0, NA, NA, 1, 1, 1, 1, NA, 0, 0…
$ numfeeders_water             <dbl> 1, 1, 1, 1, NA, NA, 2, 2, 2, 2, 1, 1, 1, …
$ numfeeders_thistle           <dbl> NA, 0, 0, 0, NA, NA, 1, 1, 1, 2, 1, 1, 1,…
$ numfeeders_fruit             <dbl> NA, 0, 0, 0, NA, NA, 1, 1, 1, 1, NA, 0, 0…
$ numfeeders_hopper            <dbl> NA, NA, NA, NA, 1, 1, NA, NA, NA, NA, NA,…
$ numfeeders_tube              <dbl> NA, NA, NA, NA, 1, 1, NA, NA, NA, NA, NA,…
$ numfeeders_other             <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ population_atleast           <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 5001, 5001,…
$ count_area_size_sq_m_atleast <dbl> 1.01, 1.01, 1.01, 1.01, 1.01, 1.01, 375.0…

Before we conduct analysis, we need to partition the data into training and test sets.

set.seed(235)
feeder_split <- initial_split(data = squirrels, prop = 0.75, strata = squirrels)

feeder_train <- training(feeder_split)
feeder_test <- testing(feeder_split)
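
To see what `strata` does here, the sketch below repeats the same kind of split on simulated data and compares the class proportions in the two partitions. The `toy` data frame and the `class_props()` helper are hypothetical and used only for illustration.

```r
library(rsample)
library(dplyr)

set.seed(235)

# simulated outcome with roughly an 80/20 class split
toy <- tibble(
  squirrels = sample(
    c("squirrels", "no squirrels"),
    size = 1000, replace = TRUE, prob = c(0.8, 0.2)
  )
)

toy_split <- initial_split(toy, prop = 0.75, strata = squirrels)

# hypothetical helper: class counts and proportions
class_props <- function(df) {
  df |>
    count(squirrels) |>
    mutate(prop = n / sum(n))
}

train_props <- class_props(training(toy_split))
test_props <- class_props(testing(toy_split))

train_props
test_props
```

Because the split is drawn within each class, the training and test sets end up with nearly identical class proportions.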

Important

Unless specified, use the training set for the exercises below.

Exploratory analysis

Exercise 3

Inspect the distribution of the outcome variable squirrels and conduct exploratory analysis with potential features of interest. Use visualizations and summary statistics to explore the data focusing on the outcome of interest and at least five other variables. Briefly summarize your findings.

Now is a good time to render, commit, and push.

Make sure that you commit and push all changed documents and your Git pane is completely empty before proceeding.

Dealing with class imbalance

Hopefully one of the characteristics you noticed is that the large majority of feeders do have squirrels. This is a common issue in classification problems known as class imbalance, where one of the outcome classes dominates or is significantly larger than the other class(es).

One technique for handling this issue is to downsample the majority class. This involves randomly removing observations from the majority class until the class sizes achieve a desired ratio. But why would this be good? Isn’t this just throwing away data? Didn’t we learn that more is better?!?!?!

Exercise 4

Let’s evaluate these claims by comparing the performance of a model trained on the original data to one trained on a downsampled dataset.

Specifically, you will fit two penalized regression models using the specification:⁴

⁴ Setting the mixture tuning parameter to 1 will fit a pure lasso model, which we will discuss in future weeks.

glmnet_spec <- logistic_reg(penalty = tune(), mixture = 1) |>
  set_engine("glmnet")

For both models, create a feature engineering recipe to predict squirrels as a function of all the other variables. The recipe (at minimum) needs to impute all numeric predictors, remove predictors that are highly sparse and unbalanced, and standardize all the numeric predictors to mean of 0 and variance of 1. For the downsampled model, use step_downsample() from the {themis} package to randomly remove observations from the majority class until the class sizes are equal.
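
The required steps could be chained roughly as below. This is a minimal sketch on simulated data, assuming median imputation and the default near-zero-variance filter; the `toy` columns `x1`, `x2`, and `x3` are invented, and you may prefer a different imputation method or step order.

```r
library(recipes)
library(themis)

set.seed(123)
n <- 1000

# simulated data: x1 has missing values, x2 is sparse and unbalanced
toy <- data.frame(
  squirrels = factor(
    sample(c("squirrels", "no squirrels"), n, replace = TRUE, prob = c(0.8, 0.2))
  ),
  x1 = replace(rnorm(n), 1:50, NA),
  x2 = rep(c(0, 1), times = c(990, 10)),
  x3 = runif(n)
)

toy_rec <- recipe(squirrels ~ ., data = toy) |>
  step_impute_median(all_numeric_predictors()) |> # impute numeric predictors
  step_nzv(all_predictors()) |>                   # drop sparse, unbalanced predictors
  step_normalize(all_numeric_predictors()) |>     # center and scale
  step_downsample(squirrels)                      # equalize class sizes

# bake(new_data = NULL) returns the processed training data
baked <- toy_rec |>
  prep() |>
  bake(new_data = NULL)

table(baked$squirrels)
```

`step_downsample()` is skipped by default when the recipe is applied to new data, so the assessment sets keep their original class balance.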

Tune both models using 10-fold cross-validation and at least 20 possible values for the penalty parameter. Evaluate the models’ performance using accuracy, sensitivity, specificity, and J-index.
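
One way to set up the tuning inputs is a regular grid over the penalty plus a metric set containing the four requested metrics. The object names `penalty_grid` and `cls_metrics` are placeholders; a regular grid is just one option, and other grid designs would work equally well.

```r
library(dials)
library(yardstick)

# 20 candidate penalty values, evenly spaced on the log10 scale
penalty_grid <- grid_regular(penalty(), levels = 20)

# metrics to compute for every candidate
cls_metrics <- metric_set(accuracy, sensitivity, specificity, j_index)

penalty_grid
```

Both objects can then be supplied to `tune_grid()` through its `grid` and `metrics` arguments.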

What is J-index?

J-index is simply

\[ \text{Sensitivity} + \text{Specificity} - 1 \]

It ranges between \([0, 1]\) for any model that performs at least as well as chance (higher is better) and is 1 when there are no false positives and no false negatives.
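
A small worked example may help. The predictions below are invented: 6 of 8 feeders with squirrels and 3 of 4 feeders without squirrels are classified correctly, so sensitivity and specificity are both 0.75 and the J-index is 0.75 + 0.75 - 1 = 0.5.

```r
library(yardstick)

# the first factor level is treated as the event of interest by default
lvls <- c("squirrels", "no squirrels")

toy_preds <- data.frame(
  truth = factor(rep(lvls, times = c(8, 4)), levels = lvls),
  estimate = factor(
    c(
      rep("squirrels", 6), rep("no squirrels", 2), # 6 of 8 squirrels found
      rep("no squirrels", 3), rep("squirrels", 1)  # 3 of 4 non-squirrels found
    ),
    levels = lvls
  )
)

sens_val <- sens_vec(toy_preds$truth, toy_preds$estimate)
spec_val <- spec_vec(toy_preds$truth, toy_preds$estimate)
j_val <- j_index_vec(toy_preds$truth, toy_preds$estimate)

c(sensitivity = sens_val, specificity = spec_val, j_index = j_val)
```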

If our goal is to identify both the positive and negative outcomes (e.g. squirrels and no squirrels), which approach is preferred: downsampling or not?
If our goal is to identify both the positive and negative outcomes (e.g. squirrels and no squirrels), which metric should we use to finalize the model (i.e. select the appropriate penalty value)?

Using metrics to select a final model

Remember that you do not always have to choose the model with the absolute largest or smallest value for a metric. Sometimes you are willing to trade off performance for simplicity. See the select_*() functions from {tune} for other possible approaches.

Use workflow sets

Consider using workflow sets to streamline the process of fitting multiple models. Not required, just a suggestion.

Looking to learn more?

Looking for more fun and practice (that does not get you any extra credit)? What happens if you vary the ratio used to downsample the dataset? How does this affect model performance?

Now is a good time to render, commit, and push.

Data 2: Student debt

Median student debt in the United States has increased substantially over the past twenty years.

ggplot(data = sc_debt, mapping = aes(x = year, y = debt_50)) +
  # geom_ribbon(mapping = aes(ymin = debt_20, ymax = debt_80), alpha = 0.2) +
  geom_line() +
  scale_y_continuous(labels = label_dollar()) +
  labs(
    title = "Average median federal debt among first-time,\nfull-time borrowers upon leaving the institution",
    x = NULL,
    y = "2020-adjusted dollars",
    caption = "Source: College Scorecard"
  )

data/scorecard.csv contains a portion of the database⁵ for academic year 2022-23 with the following variables:

⁵ The full database has almost 3500 features for each college.

Column name	Variable description
`unit_id`	Unit ID for institution
`name`	Institution name
`state`	State postcode
`act_med`	Median ACT score for admitted students
`adm_rate`	Admission rate
`comp_rate`	Completion rate for first-time, full-time students
`cost_net`	The average annual total cost of attendance, including tuition and fees, books and supplies, and living expenses for all full-time, first-time, degree/certificate-seeking undergraduates, minus the average grant/scholarship aid
`cost_sticker`	The average annual total cost of attendance, including tuition and fees, books and supplies, and living expenses for all full-time, first-time, degree/certificate-seeking undergraduates
`death_pct`	Percent of students who died within 4 years at original institution
`debt`	Median debt of students graduating in 2022-23
`earnings_med`	Median earnings of students working and not enrolled 6 years after entry
`earnings_sd`	Standard deviation of earnings of students working and not enrolled 6 years after entry
`faculty_salary_mean`	Average faculty salary
`faculty_full_time`	Proportion of faculty that are full-time
`female`	Share of female students
`first_gen`	Share of first-generation students
`instruct_exp`	Instructional expenditures per full-time student
`locale`	Locale of institution
`median_hh_inc`	Median household income of students
`open_admissions`	Open admissions policy
`pbi`	Predominantly black institution
`pell_pct`	Percentage of students who receive a Pell grant
`retention_rate`	First-time, full-time student retention rate
`sat_mean`	Average (mean) SAT score of admitted students
`test_policy`	Test score requirements for admission
`type`	Type of institution
`veteran`	Share of veteran students
`women_only`	Women-only college

Our goal is to predict the median debt of students graduating in 2022-23.

Exercise 5

Partition the data and evaluate the null model performance. First you should partition the dataset into training and testing sets.

# import data
scorecard <- read_csv("data/scorecard.csv")

Rows: 2678 Columns: 28
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (8): name, state, locale, open_admissions, pbi, test_policy, type, wome...
dbl (20): unit_id, act_med, adm_rate, comp_rate, cost_net, cost_sticker, dea...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

set.seed(123)
scorecard_split <- scorecard |>
  # drop schools with missing debt - nothing to predict
  drop_na(debt) |>
  initial_split(prop = 0.75, strata = debt)

scorecard_train <- training(scorecard_split)
scorecard_test <- testing(scorecard_split)

Further partition the training data using a cross-validation procedure of your choosing.

Fit a null model to predict the median debt for students graduating from a four-year college or university in 2022-23. Report the mean absolute error (MAE) and root mean squared error (RMSE) and interpret them in the context of the model.
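
For a regression outcome, a null model ignores the predictors and predicts the mean of the outcome in the data used to fit it. The sketch below shows this with {parsnip}'s `null_model()` on simulated data; `toy_train` and `toy_assess` are invented stand-ins, and in the exercise you would evaluate across your resamples rather than a single hold-out set.

```r
library(parsnip)
library(yardstick)

set.seed(123)

# simulated stand-ins for an analysis set and an assessment set
toy_train <- data.frame(
  debt = rnorm(100, mean = 15000, sd = 4000),
  adm_rate = runif(100)
)
toy_assess <- data.frame(
  debt = rnorm(30, mean = 15000, sd = 4000),
  adm_rate = runif(30)
)

null_fit <- null_model(mode = "regression") |>
  set_engine("parsnip") |>
  fit(debt ~ ., data = toy_train)

# every prediction is the mean of debt in toy_train
preds <- predict(null_fit, new_data = toy_assess)

null_mae <- mae_vec(toy_assess$debt, preds$.pred)
null_rmse <- rmse_vec(toy_assess$debt, preds$.pred)

c(mae = null_mae, rmse = null_rmse)
```

These two numbers are the baseline that any model using the predictors should beat.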

Now is a good time to render, commit, and push.

Exercise 6

One might expect median debt load to vary based on state since state funding for higher education varies significantly across the United States. However, state is a categorical variable with many levels, which can be difficult to incorporate into machine learning models.

Estimate a series of random forest models to predict median debt, comparing different methods for encoding the state variable. Specifically, compare the performance of the following methods:

Do nothing. Just use the raw state variable. Tree-based models are generally fine with character or factor columns.⁶
Collapse infrequent states into a single “other” category to reduce the number of levels.
Feature hashing
Effect encoding

⁶ This does not apply to other forms of models such as regression or nearest neighbors.

For each method, create a feature engineering recipe that includes the state variable and any other variables you think are relevant. Tune the random forest models using your chosen form of resampling and grid tuning for at least 10 possible combinations of values for the mtry and min_n parameters.
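
The three non-trivial encodings could be expressed as recipe steps roughly as follows. This is a sketch on simulated data with an invented five-state `toy` data frame; the threshold, the number of hash columns, and the choice of `step_lencode_glm()` (rather than another effect-encoding step from {embed}) are assumptions. The hashing step also relies on the {text2vec} package being installed, as far as I am aware.

```r
library(recipes)
library(embed)
library(textrecipes)

set.seed(123)

# simulated data: two common states, one moderately common, two rare
toy <- data.frame(
  state = factor(rep(c("NY", "CA", "TX", "VT", "WY"), times = c(40, 40, 15, 3, 2))),
  debt = rnorm(100, mean = 15000, sd = 4000)
)

# collapse states making up less than 5% of rows into "other"
rec_other <- recipe(debt ~ state, data = toy) |>
  step_other(state, threshold = 0.05)

# feature hashing into a fixed number of indicator columns
rec_hash <- recipe(debt ~ state, data = toy) |>
  step_dummy_hash(state, num_terms = 8)

# effect encoding: replace each state with a numeric estimate of its effect
rec_effect <- recipe(debt ~ state, data = toy) |>
  step_lencode_glm(state, outcome = vars(debt))

baked_other <- rec_other |> prep() |> bake(new_data = NULL)
baked_hash <- rec_hash |> prep() |> bake(new_data = NULL)
baked_effect <- rec_effect |> prep() |> bake(new_data = NULL)

levels(baked_other$state)
```

Inspecting the baked data for each recipe is a quick way to confirm what the random forest will actually receive: a factor with fewer levels, a set of hash columns, or a single numeric column.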

Gotchas to watch out for with these models

The ranger engine does not like predictors with missing values. Decide how you want to handle these when the situation arises.
In order to generate predictions from a fitted model with categorical predictors, the levels must have appeared in the data used to fit the model. If they are new levels, the model will not know how to use those values to generate predictions. Either make sure all possible values for state (and other categorical predictors) are present in the training data or use a feature engineering method that allows the model to handle new levels.
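
Both gotchas can be handled inside the recipe. The sketch below shows one option, assuming median imputation for missing numeric values and `step_novel()` for unseen levels; the tiny `train` and `new_school` data frames are invented for illustration.

```r
library(recipes)

train <- data.frame(
  state = factor(c("NY", "CA", "NY", "CA")),
  sat_mean = c(1200, NA, 1100, 1300),
  debt = c(15000, 12000, 18000, 14000)
)

# a school from a state that never appears in the training data
new_school <- data.frame(
  state = factor("ZZ"),
  sat_mean = NA_real_,
  debt = 16000
)

rec <- recipe(debt ~ ., data = train) |>
  step_novel(state) |>                           # unseen levels become "new"
  step_impute_median(all_numeric_predictors())   # fill NAs with the training median

prepped <- prep(rec)
baked_new <- bake(prepped, new_data = new_school)

baked_new
```

The unseen state is mapped to the level "new" and the missing SAT score is filled with the training median, so the model can still generate a prediction for this row.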

Now is a good time to render, commit, and push.

Exercise 7

Evaluate the models’ performance using the assessment set mean absolute error (MAE) and root mean squared error (RMSE). How does each of the four techniques tend to perform? Why might that be the case?

Finalize your preferred model, fit it using the entire training set, and evaluate its performance using the test set. Along with the MAE and RMSE, visualize the model’s predictions against the true values to assess its calibration. How does your final model perform?

Now is a good time to render, commit, and push.

Generative AI (GAI) self-reflection

As stated in the syllabus, include a written reflection for this assignment of how you used GAI tools (e.g. what tools you used, how you used them to assist you with writing code), what skills you believe you acquired, and how you believe you demonstrated mastery of the learning objectives.

Render, commit, and push one last time.

Make sure that you commit and push all changed documents and your Git pane is completely empty before proceeding.

Wrap up

Submission

Go to http://www.gradescope.com and click Log in in the top right corner.
Click School Credentials \(\rightarrow\) Cornell University NetID and log in using your NetID credentials.
Click on your INFO 4940/5940 course.
Click on the assignment, and you’ll be prompted to submit it.
Mark all the pages associated with each exercise. All the pages of your homework should be associated with at least one question (i.e., should be “checked”).
Select all pages of your .pdf submission to be associated with the “Workflow & formatting” question.

Grading

Exercise 1: 12 points
Exercise 2: 5 points
Exercise 3: 5 points
Exercise 4: 12 points
Exercise 5: 2 points
Exercise 6: 10 points
Exercise 7: 4 points
GAI self-reflection: 0 points
Total: 50 points

Acknowledgments

Squirrel feeders example inspired by blog posts by Julia Silge and licensed under CC BY-SA 4.0.