Data budget/making a model

Lecture 3

Dr. Benjamin Soltoff

Cornell University
INFO 4940/5940 - Fall 2024

September 3, 2024

Announcements

Announcements

  • Office hours beginning this week

Learning objectives

  • Identify the importance of budgeting data for machine learning
  • Partition data into training and test sets
  • Specify models using the {parsnip} package
  • Utilize workflows to bundle preprocessing and modeling tasks
  • Fit models using the {tidymodels} framework
  • Generate predictions from the fitted model

Application exercise

ae-02

  • Install the {renv} package

  • Go to the course GitHub org and find your ae-02 (repo name will be suffixed with your GitHub name).

  • Clone the repo in RStudio, open the Quarto document in the repo, install the required packages, and follow along and complete the exercises.

  • Render, commit, and push your edits by the AE deadline – end of the day

Data on forests in Washington

Data on forests in Washington

  • The U.S. Forest Service maintains ML models to predict whether a plot of land is “forested.”
  • This classification is important for all sorts of research, legislation, and land management purposes.
  • Plots are typically remeasured every 10 years and this dataset contains the most recent measurement per plot.
  • Type ?forested to learn more about this dataset, including references.

Data on forests in Washington

  • N = 7,107 plots of land, one from each of 7,107 6000-acre hexagons in WA.
  • A nominal outcome, forested, with levels "Yes" and "No", measured “on-the-ground.”
  • 18 remotely-sensed and easily-accessible predictors:
    • numeric variables based on weather and topography.
    • nominal variables based on classifications from other governmental orgs.

Checklist for predictors

  • Is it ethical to use this variable? (Or even legal?)

  • Will this variable be available at prediction time?

  • Does this variable contribute to explainability?

Data on forests in Washington

library(tidymodels)
library(forested)

forested
# A tibble: 7,107 × 19
   forested  year elevation eastness northness roughness tree_no_tree dew_temp
   <fct>    <dbl>     <dbl>    <dbl>     <dbl>     <dbl> <fct>           <dbl>
 1 Yes       2005       881       90        43        63 Tree             0.04
 2 Yes       2005       113      -25        96        30 Tree             6.4 
 3 No        2005       164      -84        53        13 Tree             6.06
 4 Yes       2005       299       93        34         6 No tree          4.43
 5 Yes       2005       806       47       -88        35 Tree             1.06
 6 Yes       2005       736      -27       -96        53 Tree             1.35
 7 Yes       2005       636      -48        87         3 No tree          1.42
 8 Yes       2005       224      -65       -75         9 Tree             6.39
 9 Yes       2005        52      -62        78        42 Tree             6.5 
10 Yes       2005      2240      -67       -74        99 No tree         -5.63
# ℹ 7,097 more rows
# ℹ 11 more variables: precip_annual <dbl>, temp_annual_mean <dbl>,
#   temp_annual_min <dbl>, temp_annual_max <dbl>, temp_january_min <dbl>,
#   vapor_min <dbl>, vapor_max <dbl>, canopy_cover <dbl>, lon <dbl>, lat <dbl>,
#   land_type <fct>

Data splitting and spending

Data splitting and spending

For machine learning, we typically split data into training and test sets:

  • The training set is used to estimate model parameters.
  • The test set is used to find an independent assessment of model performance.

Do not 🚫 use the test set during training.

Data splitting and spending

The more data we spend

the better estimates we’ll get.

Data splitting and spending

  • Spending too much data in training prevents us from computing a good assessment of predictive performance.
  • Spending too much data in testing prevents us from computing a good estimate of model parameters.

⏱️ Your turn

When is a good time to split your data?

01:00

The testing data is precious 💎

The initial split

set.seed(123)
forested_split <- initial_split(forested)
forested_split
<Training/Testing/Total>
<5330/1777/7107>

What is set.seed()?

To create that split of the data, R generates “pseudo-random” numbers: while they are made to behave like random numbers, their generation is deterministic given a “seed”.

This allows us to reproduce results by setting that seed.

Which seed you pick doesn’t matter, as long as you don’t try a bunch of seeds and pick the one that gives you the best performance.

Accessing the data

forested_train <- training(forested_split)
forested_test <- testing(forested_split)

The training set

forested_train
# A tibble: 5,330 × 19
   forested  year elevation eastness northness roughness tree_no_tree dew_temp
   <fct>    <dbl>     <dbl>    <dbl>     <dbl>     <dbl> <fct>           <dbl>
 1 No        2016       464       -5       -99         7 No tree          1.71
 2 Yes       2016       166       92        37         7 Tree             6   
 3 No        2016       644      -85       -52        24 No tree          0.67
 4 Yes       2014      1285        4        99        79 Tree             1.91
 5 Yes       2013       822       87        48        68 Tree             1.95
 6 Yes       2017         3        6       -99         5 Tree             7.93
 7 Yes       2014      2041      -95        28        49 Tree            -4.22
 8 Yes       2015      1009       -8        99        72 Tree             1.72
 9 No        2017       436      -98        19        10 No tree          1.8 
10 No        2018       775       63        76       103 No tree          0.62
# ℹ 5,320 more rows
# ℹ 11 more variables: precip_annual <dbl>, temp_annual_mean <dbl>,
#   temp_annual_min <dbl>, temp_annual_max <dbl>, temp_january_min <dbl>,
#   vapor_min <dbl>, vapor_max <dbl>, canopy_cover <dbl>, lon <dbl>, lat <dbl>,
#   land_type <fct>

The test set

🙈️

There are 1777 rows and 19 columns in the test set.

⏱️ Your turn

Split your data so 20% is held out for the test set.

Try out different values in set.seed() to see how the results change.

05:00

Data splitting and spending

set.seed(123)
forested_split <- initial_split(forested, prop = 0.8)
forested_train <- training(forested_split)
forested_test <- testing(forested_split)

nrow(forested_train)
[1] 5685
nrow(forested_test)
[1] 1422

Exploratory data analysis for ML 🧐

⏱️ Your turn

Explore the forested_train data on your own!

  • What’s the distribution of the outcome, forested?
  • What’s the distribution of numeric variables like precip_annual?
  • How does the distribution of forested differ across the categorical variables?
08:00

forested_train |> 
  ggplot(aes(x = forested)) +
  geom_bar()

forested_train |> 
  ggplot(aes(x = forested, fill = tree_no_tree)) +
  geom_bar()

forested_train |> 
  ggplot(aes(x = precip_annual, fill = forested, color = forested)) +
  geom_freqpoly(bins = 30)

forested_train |> 
  ggplot(aes(x = precip_annual, fill = forested, group = forested)) +
  geom_histogram(position = "fill")

forested_train |> 
  ggplot(aes(x = lon, y = lat, col = forested)) +
  geom_point()

The whole game - status update

Fitting models in R

Linear models

How do you fit a linear model in R?

  • lm for linear model

  • glmnet for regularized regression

  • keras for regression using TensorFlow

  • stan for Bayesian regression

  • spark for large data sets

  • brulee for regression using torch

To specify a model

  • Choose a model
  • Specify an engine
  • Set the mode




To specify a model

logistic_reg()
Logistic Regression Model Specification (classification)

Computational engine: glm 

To specify a model

  • Choose a model
  • Specify an engine
  • Set the mode




To specify a model

logistic_reg() |>
  set_engine("glmnet")
Logistic Regression Model Specification (classification)

Computational engine: glmnet 

To specify a model

logistic_reg() |>
  set_engine("stan")
Logistic Regression Model Specification (classification)

Computational engine: stan 

To specify a model

  • Choose a model
  • Specify an engine
  • Set the mode




To specify a model

decision_tree()
Decision Tree Model Specification (unknown mode)

Computational engine: rpart 

To specify a model

decision_tree() |> 
  set_mode("classification")
Decision Tree Model Specification (classification)

Computational engine: rpart 



All available models are listed at https://www.tidymodels.org/find/parsnip/

To specify a model

  • Choose a model
  • Specify an engine
  • Set the mode




⏱️ Your turn

Run the tree_spec chunk in your .qmd.

Edit this code to use a logistic regression model.

All available models are listed at https://www.tidymodels.org/find/parsnip/



Extension/Challenge: Edit this code to use a different model. For example, try using a conditional inference tree as implemented in the {partykit} package by changing the engine - or try an entirely different model type!

05:00

Models we’ll use this week

  • Logistic regression
  • Decision trees

Logistic regression

Logistic regression

Logistic regression

  • Logit of outcome probability modeled as linear combination of predictors:

\(\log(\frac{p}{1 - p}) = \beta_0 + \beta_1 \times \text{A}\)

  • Find a sigmoid line that separates the two classes

Decision trees

Decision trees

  • Series of splits or if/then statements based on predictors

  • First the tree grows until some condition is met (maximum depth, no more data)

  • Then the tree is pruned to reduce its complexity

Decision trees

All models are wrong, but some are useful!

Logistic regression

Decision trees

A model workflow

Workflows bind preprocessors and models

What is wrong with this?

Why a workflow()?

  • Workflows handle new data better than base R tools in terms of new factor levels
  • You can use other preprocessors besides formulas (see “Feature engineering”)
  • They can help organize your work when working with multiple models
  • Most importantly, a workflow captures the entire modeling process: fit() and predict() apply to the preprocessing steps in addition to the actual model fit

A model workflow

tree_spec <- decision_tree() |> 
  set_mode("classification")

tree_spec |> 
  fit(forested ~ ., data = forested_train) 
parsnip model object

n= 5685 

node), split, n, loss, yval, (yprob)
      * denotes terminal node

 1) root 5685 2550 Yes (0.55145119 0.44854881)  
   2) land_type=Tree 3064  300 Yes (0.90208877 0.09791123) *
   3) land_type=Barren,Non-tree vegetation 2621  371 No (0.14154903 0.85845097)  
     6) temp_annual_max< 13.395 347  153 Yes (0.55907781 0.44092219)  
      12) tree_no_tree=Tree 92    6 Yes (0.93478261 0.06521739) *
      13) tree_no_tree=No tree 255  108 No (0.42352941 0.57647059) *
     7) temp_annual_max>=13.395 2274  177 No (0.07783641 0.92216359) *

A model workflow

tree_spec <- decision_tree() |> 
  set_mode("classification")

workflow() |>
  add_formula(forested ~ .) |>
  add_model(tree_spec) |>
  fit(data = forested_train) 
══ Workflow [trained] ════════════════════════════════════════════════
Preprocessor: Formula
Model: decision_tree()

── Preprocessor ──────────────────────────────────────────────────────
forested ~ .

── Model ─────────────────────────────────────────────────────────────
n= 5685 

node), split, n, loss, yval, (yprob)
      * denotes terminal node

 1) root 5685 2550 Yes (0.55145119 0.44854881)  
   2) land_type=Tree 3064  300 Yes (0.90208877 0.09791123) *
   3) land_type=Barren,Non-tree vegetation 2621  371 No (0.14154903 0.85845097)  
     6) temp_annual_max< 13.395 347  153 Yes (0.55907781 0.44092219)  
      12) tree_no_tree=Tree 92    6 Yes (0.93478261 0.06521739) *
      13) tree_no_tree=No tree 255  108 No (0.42352941 0.57647059) *
     7) temp_annual_max>=13.395 2274  177 No (0.07783641 0.92216359) *

A model workflow

tree_spec <- decision_tree() |> 
  set_mode("classification")

workflow(forested ~ ., tree_spec) |> 
  fit(data = forested_train) 
══ Workflow [trained] ════════════════════════════════════════════════
Preprocessor: Formula
Model: decision_tree()

── Preprocessor ──────────────────────────────────────────────────────
forested ~ .

── Model ─────────────────────────────────────────────────────────────
n= 5685 

node), split, n, loss, yval, (yprob)
      * denotes terminal node

 1) root 5685 2550 Yes (0.55145119 0.44854881)  
   2) land_type=Tree 3064  300 Yes (0.90208877 0.09791123) *
   3) land_type=Barren,Non-tree vegetation 2621  371 No (0.14154903 0.85845097)  
     6) temp_annual_max< 13.395 347  153 Yes (0.55907781 0.44092219)  
      12) tree_no_tree=Tree 92    6 Yes (0.93478261 0.06521739) *
      13) tree_no_tree=No tree 255  108 No (0.42352941 0.57647059) *
     7) temp_annual_max>=13.395 2274  177 No (0.07783641 0.92216359) *

⏱️ Your turn

Run the tree_wflow chunk in your .qmd.

Edit this code to make a workflow with your own model of choice.



Extension/Challenge: Other than formulas, what kinds of preprocessors are supported?

05:00

Predict with your model

How do you use your new tree_fit model?

tree_spec <- decision_tree() |> 
  set_mode("classification")

tree_fit <- workflow(forested ~ ., tree_spec) |> 
  fit(data = forested_train) 

⏱️ Your turn

Run:

predict(tree_fit, new_data = forested_test)

What do you notice about the structure of the result?

01:00

⏱️ Your turn

Run:

augment(tree_fit, new_data = forested_test)

How does the output compare to the output from predict()?

01:00

The tidymodels prediction guarantee!

  • The predictions will always be inside a tibble
  • The column names and types are unsurprising and predictable
  • The number of rows in new_data and the output are the same

Understand your model

How do you understand your new tree_fit model?

Understand your model

How do you understand your new tree_fit model?

library(rpart.plot)
tree_fit |>
  extract_fit_engine() |>
  rpart.plot(roundint = FALSE)

You can extract_*() several components of your fitted workflow.

⚠️ Never predict() with any extracted components!

Understand your model

How do you understand your new tree_fit model?

You can use your fitted workflow for model and/or prediction explanations:

  • overall variable importance, such as with the {vip} package
  • flexible model explainers, such as with the {DALEXtra} package

⏱️ Your turn


Extract the model engine object from your fitted workflow and check it out.

02:00

The whole game - status update

Wrap-up

Recap

  • Splitting data into training and test sets ensures we have a valid way to assess the final model’s performance
  • Exploratory data analysis is important to understanding your data’s structure and (simple) patterns/relationships
  • Use {parsnip} to specify machine learning models in R
  • Use {workflows} to bundle preprocessing and modeling tasks

Acknowledgments

My latest school tax bill