Build better data (I)

Lecture 5

Dr. Benjamin Soltoff

Cornell University
INFO 4940/5940 - Fall 2025

September 9, 2025

Announcements

  • Homework 1 due on Wednesday by 11:59pm
  • Office hours schedule

Learning objectives

  • Identify the importance of preparing predictors for a machine learning model
  • Distinguish between data preprocessing and feature engineering
  • Demonstrate the usage of pipelines for preparing data

Hotel reservations

Hotel data

We’ll use data on hotels to predict the cost of a room

 [1] "avg_price_per_room"             "lead_time"                     
 [3] "stays_in_weekend_nights"        "stays_in_week_nights"          
 [5] "adults"                         "children"                      
 [7] "babies"                         "meal"                          
 [9] "country"                        "market_segment"                
[11] "distribution_channel"           "is_repeated_guest"             
[13] "previous_cancellations"         "previous_bookings_not_canceled"
[15] "reserved_room_type"             "assigned_room_type"            
[17] "booking_changes"                "agent"                         
[19] "company"                        "days_in_waiting_list"          
[21] "customer_type"                  "required_car_parking_spaces"   
[23] "total_of_special_requests"      "arrival_date"                  
[25] "near_christmas"                 "near_new_years"                

Data spending

Let’s split the data into a training set (75%) and testing set (25%) using stratification:

set.seed(523)
hotel_split <- initial_split(hotel_rates, strata = avg_price_per_room)

hotel_train <- training(hotel_split)
hotel_test <- testing(hotel_split)
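
initial_split() defaults to a 3/4 split, which matches the 75%/25% spending above. A quick sanity check (a sketch; exact row counts depend on the data):

# proportions of the full data in each piece (roughly 0.75 and 0.25)
nrow(hotel_train) / nrow(hotel_rates)
nrow(hotel_test) / nrow(hotel_rates)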

Exploratory analysis

[exploratory plots of the hotel data omitted]

Working with predictors

We might want to modify our predictor columns for a few reasons:

  • The model requires them in a different format (e.g. dummy variables for linear regression)
  • The model needs certain data qualities (e.g. same units for K-NN)
  • The outcome is better predicted when one or more columns are transformed in some way (a.k.a. “feature engineering”)

What is feature engineering?

Think of a feature as some representation of a predictor that will be used in a model

Example representations:

  • Interactions
  • Polynomial expansions/splines
  • Principal component analysis (PCA) feature extraction

Example: Dates

How can we represent date columns for our model?

Most models can’t handle date columns directly, either throwing errors or converting them to integers.

We can re-engineer a date column as any of the following (a quick sketch follows the list):

  • Days since a reference date
  • Day of the week
  • Month
  • Year
  • Indicators for holidays
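
As a sketch, these representations can be computed directly with {lubridate} (the reference date here is an arbitrary choice):

library(lubridate)

dates <- as.Date(c("2016-09-02", "2016-12-25"))

as.numeric(dates - as.Date("2015-01-01"))  # days since a reference date
wday(dates, label = TRUE)                  # day of the week
month(dates, label = TRUE)                 # month
year(dates)                                # year
dates == as.Date("2016-12-25")             # indicator for a specific holiday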

📝 How should we handle dates for this predictive task?

Instructions

Identify at least four ways to represent the arrival_date column that would potentially be useful for predicting avg_price_per_room.


General definitions

  • Data preprocessing steps allow your model to fit

  • Feature engineering steps help the model do the least work to predict the outcome as well as possible

Resampling Strategy

[diagram of the resampling scheme omitted]

We’ll use simple 10-fold cross-validation (stratified sampling):

set.seed(472)
hotel_rs <- vfold_cv(hotel_train, strata = avg_price_per_room)
hotel_rs
#  10-fold cross-validation using stratification 
# A tibble: 10 × 2
   splits             id    
   <list>             <chr> 
 1 <split [3372/377]> Fold01
 2 <split [3373/376]> Fold02
 3 <split [3373/376]> Fold03
 4 <split [3373/376]> Fold04
 5 <split [3373/376]> Fold05
 6 <split [3374/375]> Fold06
 7 <split [3375/374]> Fold07
 8 <split [3376/373]> Fold08
 9 <split [3376/373]> Fold09
10 <split [3376/373]> Fold10
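
Each split can be pulled apart into its analysis (fitting) and assessment (holdout) sets. A quick sketch:

fold_1 <- hotel_rs$splits[[1]]
analysis(fold_1)    # the 3,372 rows used to fit the model for this resample
assessment(fold_1)  # the 377 held-out rows used to measure performance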

Prepare your data for modeling

  • Create pipeable (chainable) sequences of data preprocessing and feature engineering steps
  • Statistical parameters for the steps can be estimated from an initial data set and then applied to other data sets
  • The resulting processed output can be used as inputs for statistical or machine learning models
  • In R, use the {recipes} package from the {tidymodels} ecosystem

scikit-learn provides Pipeline and ColumnTransformer to create reusable sequences of data transformations.

A first recipe

hotel_rec <- recipe(avg_price_per_room ~ ., data = hotel_train)

The recipe() function assigns columns to roles of “outcome” or “predictor” using the formula

  • In scikit-learn, use ColumnTransformer to define a sequence of transformations for different columns.
  • scikit-learn requires predictors and outcomes to be passed as separate objects.

Create indicator variables

hotel_rec <- recipe(avg_price_per_room ~ ., data = hotel_train) |>
  step_dummy(all_nominal_predictors())

  • For any categorical predictors, make binary indicators

  • There are many recipe steps that can convert categorical predictors to numeric columns

  • step_dummy() records the levels of the categorical predictors in the training set

Before recipe

# A tibble: 4 × 1
  meal                        
  <chr>                       
1 Bed and Breakfast           
2 breakfast and one other meal
3 no meal package             
4 breakfast lunch and dinner  

After recipe

# A tibble: 4 × 3
  meal_breakfast.and.one.other.meal meal_breakfast.lunch.…¹ meal_no.meal.package
                              <dbl>                   <dbl>                <dbl>
1                                 0                       0                    0
2                                 1                       0                    0
3                                 0                       0                    1
4                                 0                       1                    0
# ℹ abbreviated name: ¹​meal_breakfast.lunch.and.dinner
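
The “after” view above can be reproduced by training the recipe with prep() and applying it with bake(). A sketch:

hotel_rec |>
  prep() |>                                   # estimate the factor levels from hotel_train
  bake(new_data = NULL, starts_with("meal"))  # apply to the training set; keep the meal columns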

step_*()

Adds a single transformation to a recipe. Transformations are replayed in order when the recipe is run on data.

hotel_rec <- recipe(avg_price_per_room ~ ., data = hotel_train) |>
  step_date(arrival_date) |>
  step_dummy(all_nominal_predictors())

Before recipe

# A tibble: 4 × 1
  arrival_date
  <date>      
1 2016-09-02  
2 2016-09-09  
3 2016-09-15  
4 2016-09-16  

After recipe

# A tibble: 4 × 19
  arrival_date arrival_date_year arrival_date_dow_Mon arrival_date_dow_Tue arrival_date_dow_Wed
  <date>                   <int>                <dbl>                <dbl>                <dbl>
1 2016-09-02                2016                    0                    0                    0
2 2016-09-09                2016                    0                    0                    0
3 2016-09-15                2016                    0                    0                    0
4 2016-09-16                2016                    0                    0                    0
# ℹ 14 more variables: arrival_date_dow_Thu <dbl>, arrival_date_dow_Fri <dbl>,
#   arrival_date_dow_Sat <dbl>, arrival_date_month_Feb <dbl>, arrival_date_month_Mar <dbl>,
#   arrival_date_month_Apr <dbl>, arrival_date_month_May <dbl>, arrival_date_month_Jun <dbl>,
#   arrival_date_month_Jul <dbl>, arrival_date_month_Aug <dbl>, arrival_date_month_Sep <dbl>,
#   arrival_date_month_Oct <dbl>, arrival_date_month_Nov <dbl>, arrival_date_month_Dec <dbl>

step_holiday() + step_rm()

Generate a set of indicator variables for specific holidays, then remove the original arrival_date column with step_rm().

holidays <- c(
  "AllSouls",
  "AshWednesday",
  "ChristmasEve",
  "Easter",
  "ChristmasDay",
  "GoodFriday",
  "NewYearsDay",
  "PalmSunday"
)

hotel_rec <- recipe(avg_price_per_room ~ ., data = hotel_train) |>
  step_date(arrival_date) |>
  step_holiday(arrival_date, holidays = holidays) |>
  step_rm(arrival_date) |>
  step_dummy(all_nominal_predictors())

step_holiday() + step_rm()

Rows: 3,749
Columns: 26
$ arrival_date_year         <int> 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016…
$ arrival_date_AllSouls     <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ arrival_date_AshWednesday <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ arrival_date_ChristmasEve <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ arrival_date_Easter       <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ arrival_date_ChristmasDay <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ arrival_date_GoodFriday   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ arrival_date_NewYearsDay  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ arrival_date_PalmSunday   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ arrival_date_dow_Mon      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0…
$ arrival_date_dow_Tue      <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ arrival_date_dow_Wed      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1…
$ arrival_date_dow_Thu      <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0…
$ arrival_date_dow_Fri      <dbl> 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ arrival_date_dow_Sat      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0…
$ arrival_date_month_Feb    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ arrival_date_month_Mar    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ arrival_date_month_Apr    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ arrival_date_month_May    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ arrival_date_month_Jun    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ arrival_date_month_Jul    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ arrival_date_month_Aug    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ arrival_date_month_Sep    <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0…
$ arrival_date_month_Oct    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1…
$ arrival_date_month_Nov    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ arrival_date_month_Dec    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…

Additional feature engineering steps

Filter out constant columns

hotel_rec <- recipe(avg_price_per_room ~ ., data = hotel_train) |>
  step_date(arrival_date) |>
  step_holiday(arrival_date, holidays = holidays) |>
  step_rm(arrival_date) |>
  step_dummy(all_nominal_predictors()) |>
  step_zv(all_predictors())

  • Remove any predictors that have zero variance
  • Zero variance can be an artifact of step_dummy() (factor levels that were never observed in training) or come from original predictors that are constant

Normalization

hotel_rec <- recipe(avg_price_per_room ~ ., data = hotel_train) |>
  step_date(arrival_date) |>
  step_holiday(arrival_date, holidays = holidays) |>
  step_rm(arrival_date) |>
  step_dummy(all_nominal_predictors()) |>
  step_zv(all_predictors()) |>
  step_normalize(all_numeric_predictors())

  • This centers and scales the numeric predictors

  • The recipe will use the training set to estimate the means and standard deviations of the data

  • All data the recipe is applied to will be normalized using those statistics (there is no re-estimation; see the sketch below)
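
A sketch of that last point: the statistics are estimated from the training set and reused as-is on new data.

norm_rec <- recipe(avg_price_per_room ~ lead_time, data = hotel_train) |>
  step_normalize(lead_time) |>
  prep()  # mean and sd of lead_time estimated from hotel_train only

bake(norm_rec, new_data = hotel_test)  # test set scaled with the training-set statistics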

Reduce correlation

hotel_rec <- recipe(avg_price_per_room ~ ., data = hotel_train) |>
  step_date(arrival_date) |>
  step_holiday(arrival_date, holidays = holidays) |>
  step_rm(arrival_date) |>
  step_dummy(all_nominal_predictors()) |>
  step_zv(all_predictors()) |>
  step_normalize(all_numeric_predictors()) |>
  step_corr(all_numeric_predictors(), threshold = 0.9)

To deal with highly correlated predictors, step_corr() removes the minimum set of columns needed so that all remaining pairwise correlations fall below the threshold.
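
After a recipe is trained, tidy() can report what each step did. A sketch, assuming step_corr() is the seventh step as written above:

prepped <- prep(hotel_rec)
tidy(prepped, number = 7)  # lists the columns removed by step_corr()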

Other possible steps

hotel_rec <- recipe(avg_price_per_room ~ ., data = hotel_train) |>
  step_date(arrival_date) |>
  step_holiday(arrival_date, holidays = holidays) |>
  step_rm(arrival_date) |>
  step_dummy(all_nominal_predictors()) |>
  step_zv(all_predictors()) |>
  step_normalize(all_numeric_predictors()) |>
  step_corr(all_numeric_predictors(), threshold = 0.9) |>
  step_pca(all_numeric_predictors())

PCA feature extraction…

Other possible steps

hotel_rec <- recipe(avg_price_per_room ~ ., data = hotel_train) |>
  step_date(arrival_date) |>
  step_holiday(arrival_date, holidays = holidays) |>
  step_rm(arrival_date) |>
  step_dummy(all_nominal_predictors()) |>
  step_zv(all_predictors()) |>
  step_normalize(all_numeric_predictors()) |>
  step_corr(all_numeric_predictors(), threshold = 0.9) |>
  embed::step_umap(all_numeric_predictors(), outcome = vars(avg_price_per_room))

A fancy supervised dimension reduction technique called UMAP, available via the {embed} package

Other possible steps

hotel_rec <- recipe(avg_price_per_room ~ ., data = hotel_train) |>
  step_date(arrival_date) |>
  step_holiday(arrival_date, holidays = holidays) |>
  step_rm(arrival_date) |>
  step_dummy(all_nominal_predictors()) |>
  step_zv(all_predictors()) |>
  step_normalize(all_numeric_predictors()) |>
  step_corr(all_numeric_predictors(), threshold = 0.9) |>
  step_spline_natural(lead_time, deg_free = 10)

Nonlinear transforms like natural splines, and so on!

Now we’ve built a recipe.

But, how do we use a recipe?

Axiom

Feature engineering and modeling are two halves of a single predictive workflow.

Minimal recipe

hotel_rec <- recipe(avg_price_per_room ~ ., data = hotel_train) |>
  step_date(arrival_date) |>
  step_holiday(arrival_date, holidays = holidays) |>
  step_rm(arrival_date) |>
  step_dummy(all_nominal_predictors())

workflow()

Creates a workflow to which you can add a model (and more)

In scikit-learn, use Pipeline to combine preprocessing and modeling steps.

add_formula()

Adds a formula to a workflow

workflow() |> add_formula(avg_price_per_room ~ children)

add_recipe()

Adds a recipe() to a workflow

workflow() |> add_recipe(hotel_rec)

add_model()

Adds a {parsnip} model specification to a workflow

knn_mod <- nearest_neighbor(neighbors = 5) |>
  set_engine("kknn") |>
  set_mode("regression")

workflow() |> add_model(knn_mod)
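
Putting the pieces together: a workflow holding both a recipe and a model can be fit in one call. A sketch (knn_wflow is a name chosen here for illustration):

knn_wflow <- workflow() |>
  add_recipe(hotel_rec) |>
  add_model(knn_mod)

fit(knn_wflow, data = hotel_train)  # preps the recipe and fits the model together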

Measuring performance

We’ll compute three measures: root mean squared error, mean absolute error, and the coefficient of determination (a.k.a. \(R^2\)).

\[
\begin{aligned}
RMSE &= \sqrt{\frac{1}{n}\sum_{i=1}^n (y_i - \hat{y}_i)^2} \\
MAE &= \frac{1}{n}\sum_{i=1}^n |y_i - \hat{y}_i| \\
R^2 &= \operatorname{cor}(y, \hat{y})^2
\end{aligned}
\]

reg_metrics <- metric_set(rmse, mae, rsq)
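
A metric set is itself a function. A toy sketch with made-up truth/estimate values:

toy <- tibble::tibble(truth = c(100, 150, 200), estimate = c(110, 140, 210))
reg_metrics(toy, truth = truth, estimate = estimate)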

Establish a baseline

null_mod <- null_model() |>
  set_engine("parsnip") |>
  set_mode("regression")

null_wflow <- workflow() |>
  add_recipe(hotel_rec) |>
  add_model(null_mod)

null_wflow |>
  fit_resamples(hotel_rs, metrics = reg_metrics) |>
  collect_metrics()
# A tibble: 3 × 6
  .metric .estimator  mean     n std_err .config             
  <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
1 mae     standard    53.1    10   0.460 Preprocessor1_Model1
2 rmse    standard    65.9    10   0.758 Preprocessor1_Model1
3 rsq     standard   NaN       0  NA     Preprocessor1_Model1

The null model predicts the training-set mean for every observation, so its predictions have zero variance and \(R^2\) cannot be computed (hence the NaN).

Using a workflow

set.seed(9)

hotel_lm_wflow <- workflow() |>
  add_recipe(hotel_rec) |>
  add_model(linear_reg())

ctrl <- control_resamples(save_pred = TRUE)
hotel_lm_res <- hotel_lm_wflow |>
  fit_resamples(hotel_rs, control = ctrl, metrics = reg_metrics)

collect_metrics(hotel_lm_res)
# A tibble: 3 × 6
  .metric .estimator   mean     n std_err .config             
  <chr>   <chr>       <dbl> <int>   <dbl> <chr>               
1 mae     standard   17.5      10 0.283   Preprocessor1_Model1
2 rmse    standard   24.1      10 0.540   Preprocessor1_Model1
3 rsq     standard    0.867    10 0.00461 Preprocessor1_Model1

Calibration Plot

[calibration plot of the linear model’s resampled predictions vs. observed values omitted]

Train a nearest neighbor model

K Nearest Neighbors (KNN)

To predict the outcome of a new data point:

  • Find the K most similar old data points
  • Take the average/mode/etc. outcome (a toy sketch follows)
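
A toy sketch of the regression case (illustrative only, not the {kknn} implementation):

predict_knn <- function(x_new, x_train, y_train, k = 5) {
  # Euclidean distance from the new point to every training row
  dists <- sqrt(rowSums(sweep(x_train, 2, x_new)^2))
  # average the outcome of the k nearest neighbors
  mean(y_train[order(dists)[1:k]])
}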

Fact

KNN requires all predictors to be numeric, and they all need to be centered and scaled.

What does that mean?

Quiz

Why do you need to “train” a recipe?

Imagine “scaling” a new data point. What do you subtract from it? What do you divide it by?

📝 Properly sequence your steps

Instructions

Arrange these data preprocessing/feature engineering steps in the correct order for a KNN model:

  • Center and scale numeric predictors
  • Remove highly correlated predictors
  • Convert arrival_date to indicators for day of week, month, year
  • Remove arrival_date
  • Indicators for holidays
  • Remove zero-variance predictors
  • Convert categorical predictors to binary indicators

Define the recipe

knn_rec <- recipe(avg_price_per_room ~ ., data = hotel_train) |>
  step_date(arrival_date) |>
  step_holiday(arrival_date, holidays = holidays) |>
  step_rm(arrival_date) |>
  step_dummy(all_nominal_predictors()) |>
  step_zv(all_predictors()) |>
  step_normalize(all_numeric_predictors()) |>
  step_corr(all_numeric_predictors(), threshold = 0.9)

Fit the model

knn_mod <- nearest_neighbor(neighbors = 10) |>
  set_engine("kknn") |>
  set_mode("regression")

set.seed(12)

hotel_knn_wflow <- workflow() |>
  add_recipe(knn_rec) |>
  add_model(knn_mod)

hotel_knn_res <- hotel_knn_wflow |>
  fit_resamples(hotel_rs, control = ctrl, metrics = reg_metrics)

collect_metrics(hotel_knn_res)
# A tibble: 3 × 6
  .metric .estimator   mean     n std_err .config             
  <chr>   <chr>       <dbl> <int>   <dbl> <chr>               
1 mae     standard   27.6      10  0.529  Preprocessor1_Model1
2 rmse    standard   39.6      10  0.752  Preprocessor1_Model1
3 rsq     standard    0.641    10  0.0107 Preprocessor1_Model1

Calibration Plot

cal_plot_regression(hotel_knn_res)

Wrap-up

Recap

  • Predictors often need modification prior to modeling, either for data preprocessing or feature engineering
  • Data preprocessing steps ensure that the data meets the model’s requirements
  • Feature engineering steps may improve model performance, but you will only find out after fitting the model
  • Use {recipes} or Pipeline to create reusable sequences of data transformations

Acknowledgments