Feature engineering for categorical predictors

Lecture 5

Dr. Benjamin Soltoff

Cornell University
INFO 4940/5940 - Fall 2024

September 12, 2024

Announcements

Announcements

  • Homework 01

Learning objectives

  • Identify the importance of preparing predictors for a machine learning model
  • Distinguish between data preprocessing and feature engineering
  • Implement data preprocessing steps using the {recipes} package
  • Identify common methods for encoding categorical predictors
  • Define and implement feature hashing for categorical predictors with a large number of levels

Application exercise

ae-04

  • Go to the course GitHub org and find your ae-04 (repo name will be suffixed with your GitHub name).
  • Clone the repo in RStudio, open the Quarto document in the repo, install the required packages, and follow along and complete the exercises.
  • Render, commit, and push your edits by the AE deadline (end of the day).

A new day, a new data set

Hotel data

We’ll use data on hotels to predict the cost of a room.

library(tidymodels)

# Add another package:
library(textrecipes)

data(hotel_rates)
set.seed(295)

hotel_rates <- hotel_rates |>
  sample_n(5000) |>
  arrange(arrival_date) |>
  select(-arrival_date) |>
  mutate(
    company = factor(as.character(company)),
    country = factor(as.character(country)),
    agent = factor(as.character(agent))
  )

Hotel data columns

names(hotel_rates)
 [1] "avg_price_per_room"             "lead_time"                     
 [3] "stays_in_weekend_nights"        "stays_in_week_nights"          
 [5] "adults"                         "children"                      
 [7] "babies"                         "meal"                          
 [9] "country"                        "market_segment"                
[11] "distribution_channel"           "is_repeated_guest"             
[13] "previous_cancellations"         "previous_bookings_not_canceled"
[15] "reserved_room_type"             "assigned_room_type"            
[17] "booking_changes"                "agent"                         
[19] "company"                        "days_in_waiting_list"          
[21] "customer_type"                  "required_car_parking_spaces"   
[23] "total_of_special_requests"      "arrival_date_num"              
[25] "near_christmas"                 "near_new_years"                
[27] "historical_adr"                

Data spending

Let’s split the data into a training set (75%) and testing set (25%) using stratification:

set.seed(4028)
hotel_split <- initial_split(hotel_rates, strata = avg_price_per_room)

hotel_train <- training(hotel_split)
hotel_test <- testing(hotel_split)

⏱️ Your turn

Let’s take some time and investigate the training data. The outcome is avg_price_per_room.

Are there any interesting characteristics of the data?

10:00

Exploratory analysis

library(skimr)
skim(hotel_train)
Data summary
Name hotel_train
Number of rows 3749
Number of columns 27
_______________________
Column type frequency:
factor 9
numeric 18
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
meal 0 1 FALSE 4 bed: 2865, bre: 743, no_: 116, bre: 25
country 0 1 FALSE 66 prt: 1206, gbr: 814, esp: 353, irl: 216
market_segment 0 1 FALSE 5 onl: 1617, dir: 756, off: 736, gro: 416
distribution_channel 0 1 FALSE 3 ta_: 2616, dir: 812, cor: 321, und: 0
reserved_room_type 0 1 FALSE 7 a: 2034, d: 787, e: 500, g: 138
assigned_room_type 0 1 FALSE 9 a: 1395, d: 1066, e: 562, c: 250
agent 0 1 FALSE 98 dev: 1132, not: 834, ale: 360, cha: 208
company 0 1 FALSE 100 not: 3416, par: 83, lin: 24, ber: 14
customer_type 0 1 FALSE 4 tra: 2699, tra: 811, con: 207, gro: 32

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
avg_price_per_room 0 1 105.62 66.08 19.35 55.00 83.00 142.76 426.25 ▇▃▂▁▁
lead_time 0 1 88.65 101.77 0.00 7.00 43.00 152.00 542.00 ▇▂▁▁▁
stays_in_weekend_nights 0 1 1.17 1.14 0.00 0.00 1.00 2.00 13.00 ▇▁▁▁▁
stays_in_week_nights 0 1 3.08 2.41 0.00 1.00 3.00 5.00 32.00 ▇▁▁▁▁
adults 0 1 1.85 0.47 1.00 2.00 2.00 2.00 4.00 ▂▇▁▁▁
children 0 1 0.13 0.43 0.00 0.00 0.00 0.00 2.00 ▇▁▁▁▁
babies 0 1 0.01 0.12 0.00 0.00 0.00 0.00 2.00 ▇▁▁▁▁
is_repeated_guest 0 1 0.07 0.25 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
previous_cancellations 0 1 0.01 0.11 0.00 0.00 0.00 0.00 4.00 ▇▁▁▁▁
previous_bookings_not_canceled 0 1 0.25 1.37 0.00 0.00 0.00 0.00 29.00 ▇▁▁▁▁
booking_changes 0 1 0.36 0.77 0.00 0.00 0.00 0.00 7.00 ▇▁▁▁▁
days_in_waiting_list 0 1 0.57 7.76 0.00 0.00 0.00 0.00 125.00 ▇▁▁▁▁
required_car_parking_spaces 0 1 0.20 0.42 0.00 0.00 0.00 0.00 8.00 ▇▁▁▁▁
total_of_special_requests 0 1 0.74 0.86 0.00 0.00 1.00 1.00 5.00 ▇▂▁▁▁
arrival_date_num 0 1 2017.08 0.33 2016.50 2016.80 2017.09 2017.36 2017.66 ▇▇▇▇▇
near_christmas 0 1 0.01 0.08 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
near_new_years 0 1 0.01 0.09 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
historical_adr 0 1 86.64 39.48 41.96 52.27 71.35 116.78 167.49 ▇▃▂▂▂

Exploratory analysis

ggplot(data = hotel_train, mapping = aes(x = avg_price_per_room)) +
  geom_histogram(color = "white", binwidth = 20, boundary = 0)

Exploratory analysis

ggplot(data = hotel_train, mapping = aes(x = lead_time)) +
  geom_histogram(color = "white")

ggplot(data = hotel_train, mapping = aes(x = lead_time)) +
  geom_histogram(color = "white") +
  scale_x_continuous(transform = boxcox_trans(p = 0.4))

Exploratory analysis

ggplot(data = hotel_train, mapping = aes(x = historical_adr, y = avg_price_per_room)) +
  geom_point(alpha = 0.2) +
  geom_smooth()

Exploratory analysis

hotel_train |>
  pivot_longer(
    cols = starts_with("near"),
    names_to = "holiday",
    values_to = "value"
  ) |>
  ggplot(mapping = aes(
    x = factor(value),
    y = avg_price_per_room
  )) +
  geom_boxplot() +
  facet_wrap(facets = vars(holiday), ncol = 1)

Exploratory analysis

ggplot(data = hotel_train, mapping = aes(x = company, y = avg_price_per_room)) +
  geom_boxplot()

Working with predictors

Working with predictors

We might want to modify our predictor columns for a few reasons:

  • The model requires them in a different format (e.g. dummy variables for linear regression).
  • The model needs certain data qualities (e.g. same units for K-NN).
  • The outcome is better predicted when one or more columns are transformed in some way (a.k.a. “feature engineering”).

What is feature engineering?

Think of a feature as some representation of a predictor that will be used in a model.

Example representations:

  • Interactions
  • Polynomial expansions/splines
  • Principal component analysis (PCA) feature extraction

Example: Dates

How can we represent date columns for our model?

When we use a date column in its native format, most models in R convert it to an integer.

We can re-engineer it as:

  • Days since a reference date
  • Day of the week
  • Month
  • Year
  • Indicators for holidays
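A hedged sketch of these representations as recipe steps, assuming a hypothetical data frame with a Date column named `arrival_date` (our hotel data dropped that column earlier):

```r
library(recipes)
library(timeDate)  # provides the holiday calendars used by step_holiday()

# Hypothetical data: a small data frame with a Date column and a numeric outcome
df <- data.frame(
  arrival_date = as.Date("2024-12-20") + 0:9,
  y = rnorm(10)
)

date_rec <- recipe(y ~ arrival_date, data = df) |>
  step_date(arrival_date, features = c("dow", "month", "year")) |>
  step_holiday(arrival_date, holidays = c("ChristmasDay", "NewYearsDay")) |>
  step_rm(arrival_date)  # keep only the derived features

bake(prep(date_rec), new_data = NULL)
```

Each step creates new columns (e.g. `arrival_date_dow`, `arrival_date_ChristmasDay`) rather than using the raw integer representation of the date.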

General definitions

  • Data preprocessing steps allow your model to fit.

  • Feature engineering steps help the model do the least work to predict the outcome as well as possible.

The {recipes} package can handle both!

Resampling Strategy

Resampling Strategy

We’ll use simple 10-fold cross-validation (stratified sampling):

set.seed(472)
hotel_rs <- vfold_cv(hotel_train, strata = avg_price_per_room)
hotel_rs
#  10-fold cross-validation using stratification 
# A tibble: 10 × 2
   splits             id    
   <list>             <chr> 
 1 <split [3372/377]> Fold01
 2 <split [3373/376]> Fold02
 3 <split [3373/376]> Fold03
 4 <split [3373/376]> Fold04
 5 <split [3373/376]> Fold05
 6 <split [3374/375]> Fold06
 7 <split [3375/374]> Fold07
 8 <split [3376/373]> Fold08
 9 <split [3376/373]> Fold09
10 <split [3376/373]> Fold10

Prepare your data for modeling

  • The {recipes} package is an extensible framework for pipeable sequences of preprocessing and feature engineering steps.
  • Statistical parameters for the steps can be estimated from an initial data set and then applied to other data sets.
  • The resulting processed output can be used as inputs for statistical or machine learning models.

A first recipe

hotel_rec <- recipe(avg_price_per_room ~ ., data = hotel_train)
  • The recipe() function assigns columns to roles of “outcome” or “predictor” using the formula

A first recipe

summary(hotel_rec)
# A tibble: 27 × 4
   variable                type      role      source  
   <chr>                   <list>    <chr>     <chr>   
 1 lead_time               <chr [2]> predictor original
 2 stays_in_weekend_nights <chr [2]> predictor original
 3 stays_in_week_nights    <chr [2]> predictor original
 4 adults                  <chr [2]> predictor original
 5 children                <chr [2]> predictor original
 6 babies                  <chr [2]> predictor original
 7 meal                    <chr [3]> predictor original
 8 country                 <chr [3]> predictor original
 9 market_segment          <chr [3]> predictor original
10 distribution_channel    <chr [3]> predictor original
# ℹ 17 more rows

The type column contains information on the variables

⏱️ Your turn

What do you think are in the type vectors for the lead_time and country columns?

02:00

summary(hotel_rec)$type[[1]]
[1] "double"  "numeric"
summary(hotel_rec)$type[[8]]
[1] "factor"    "unordered" "nominal"  

Create indicator variables

Create indicator variables

hotel_rec <- recipe(avg_price_per_room ~ ., data = hotel_train) |>
  step_dummy(all_nominal_predictors())
  • For any factor or character predictors, make binary indicators.

  • There are many recipe steps that can convert categorical predictors to numeric columns.

  • step_dummy() records the levels of the categorical predictors in the training set.

Filter out constant columns

hotel_rec <- recipe(avg_price_per_room ~ ., data = hotel_train) |>
  step_dummy(all_nominal_predictors()) |>
  step_zv(all_predictors())

In case a factor level was never observed in the training data (resulting in a dummy column of all 0s), we can delete any zero-variance predictors, i.e. columns with only a single unique value.

Normalization

hotel_rec <- recipe(avg_price_per_room ~ ., data = hotel_train) |>
  step_dummy(all_nominal_predictors()) |>
  step_zv(all_predictors()) |>
  step_normalize(all_numeric_predictors())
  • This centers and scales the numeric predictors.

  • The recipe will use the training set to estimate the means and standard deviations of the data.

  • All data the recipe is applied to will be normalized using those statistics (there is no re-estimation).
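To see that there is no re-estimation, here is a minimal sketch with made-up numbers: the statistics are computed once from the "training" data by prep(), then applied unchanged to new data by bake():

```r
library(recipes)

# Toy data: training mean of x is 20, sd is 10
train_df <- data.frame(y = 1:3, x = c(10, 20, 30))
new_df   <- data.frame(y = 0,   x = 40)

norm_prep <- recipe(y ~ x, data = train_df) |>
  step_normalize(x) |>
  prep()

tidy(norm_prep, number = 1)         # the stored training mean and sd
bake(norm_prep, new_data = new_df)  # x becomes (40 - 20) / 10 = 2
```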

Reduce correlation

hotel_rec <- recipe(avg_price_per_room ~ ., data = hotel_train) |>
  step_dummy(all_nominal_predictors()) |>
  step_zv(all_predictors()) |>
  step_normalize(all_numeric_predictors()) |>
  step_corr(all_numeric_predictors(), threshold = 0.9)

To deal with highly correlated predictors, find the minimum set of predictor columns that make the pairwise correlations less than the threshold.

Other possible steps

hotel_rec <- recipe(avg_price_per_room ~ ., data = hotel_train) |>
  step_dummy(all_nominal_predictors()) |>
  step_zv(all_predictors()) |>
  step_normalize(all_numeric_predictors()) |>
  step_pca(all_numeric_predictors())

PCA feature extraction…

Other possible steps

hotel_rec <- recipe(avg_price_per_room ~ ., data = hotel_train) |>
  step_dummy(all_nominal_predictors()) |>
  step_zv(all_predictors()) |>
  step_normalize(all_numeric_predictors()) |>
  embed::step_umap(all_numeric_predictors(), outcome = vars(avg_price_per_room))

A fancy supervised dimension reduction technique from machine learning called UMAP

Other possible steps

hotel_rec <- recipe(avg_price_per_room ~ ., data = hotel_train) |>
  step_dummy(all_nominal_predictors()) |>
  step_zv(all_predictors()) |>
  step_normalize(all_numeric_predictors()) |>
  step_spline_natural(arrival_date_num, deg_free = 10)

Nonlinear transforms like natural splines, and so on!

⏱️ Your turn

Create a recipe() for the hotel data to:

  • use a Yeo-Johnson (YJ) transformation on lead_time
  • convert factors to indicator variables
  • remove zero-variance variables
  • add the spline technique shown previously
03:00

Minimal recipe

hotel_indicators <- recipe(avg_price_per_room ~ ., data = hotel_train) |>
  step_YeoJohnson(lead_time) |>
  step_dummy(all_nominal_predictors()) |>
  step_zv(all_predictors()) |>
  step_spline_natural(arrival_date_num, deg_free = 10)

Measuring Performance

Measuring Performance

We’ll compute two measures: mean absolute error and the coefficient of determination (a.k.a \(R^2\)).

\[\begin{align} MAE &= \frac{1}{n}\sum_{i=1}^n |y_i - \hat{y}_i| \notag \\ R^2 &= cor(y_i, \hat{y}_i)^2 \end{align}\]

The focus will be on MAE for parameter optimization. We’ll use a metric set to compute these:

reg_metrics <- metric_set(mae, rsq)
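As a sanity check on the formulas above, the yardstick metrics can be compared against by-hand computations on made-up predictions:

```r
library(yardstick)

# Made-up truth/estimate pairs
preds <- tibble::tibble(
  truth    = c(100, 120, 80, 150),
  estimate = c(110, 115, 90, 140)
)

mean(abs(preds$truth - preds$estimate))  # MAE by hand: 8.75
cor(preds$truth, preds$estimate)^2       # R^2 as squared correlation

# The same values via a yardstick metric set
reg_metrics <- metric_set(mae, rsq)
reg_metrics(preds, truth = truth, estimate = estimate)
```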

Using a workflow

set.seed(9)

hotel_lm_wflow <- workflow() |>
  add_recipe(hotel_indicators) |>
  add_model(linear_reg())

ctrl <- control_resamples(save_pred = TRUE)
hotel_lm_res <- hotel_lm_wflow |>
  fit_resamples(hotel_rs, control = ctrl, metrics = reg_metrics)

collect_metrics(hotel_lm_res)
# A tibble: 2 × 6
  .metric .estimator   mean     n std_err .config             
  <chr>   <chr>       <dbl> <int>   <dbl> <chr>               
1 mae     standard   16.6      10 0.214   Preprocessor1_Model1
2 rsq     standard    0.884    10 0.00339 Preprocessor1_Model1

⏱️ Your turn

Use fit_resamples() to fit your workflow with a recipe.

Collect the predictions from the results.

05:00

Holdout predictions

# Since we used `save_pred = TRUE`
lm_cv_pred <- collect_predictions(hotel_lm_res)
lm_cv_pred
# A tibble: 3,749 × 5
   .pred id      .row avg_price_per_room .config             
   <dbl> <chr>  <int>              <dbl> <chr>               
 1  75.1 Fold01    20               40   Preprocessor1_Model1
 2  49.3 Fold01    28               54   Preprocessor1_Model1
 3  64.9 Fold01    45               50   Preprocessor1_Model1
 4  52.8 Fold01    49               42   Preprocessor1_Model1
 5  48.6 Fold01    61               49   Preprocessor1_Model1
 6  29.8 Fold01    66               40   Preprocessor1_Model1
 7  36.9 Fold01    88               49   Preprocessor1_Model1
 8  28.2 Fold01   103               45   Preprocessor1_Model1
 9  56.6 Fold01   111               44   Preprocessor1_Model1
10  36.5 Fold01   112               54.9 Preprocessor1_Model1
# ℹ 3,739 more rows

Calibration Plot

library(probably)

cal_plot_regression(hotel_lm_res)

Handling large numbers of categorical values

What do we do with the agent and company data?

There are 98 unique agent values and 100 unique companies in our training set. How can we include this information in our model?

We could:

  • make the full set of indicator variables 😳

  • lump agents and companies that rarely occur into an “other” group

  • use feature hashing to create a smaller set of indicator variables

  • use effect encoding to replace the agent and company columns with the estimated effect of that predictor

Per-agent statistics

Collapsing factor levels

There is a recipe step that will redefine factor levels based on their frequency in the training set:

hotel_other_rec <- recipe(avg_price_per_room ~ ., data = hotel_train) |>
  step_YeoJohnson(lead_time) |>
  step_other(agent, threshold = 0.001) |>
  step_dummy(all_nominal_predictors()) |>
  step_zv(all_predictors()) |>
  step_spline_natural(arrival_date_num, deg_free = 10)

Using this code, 34 agents (out of 98) were collapsed into “other” based on the training set.

We could try to optimize the threshold for collapsing (see “model tuning”).

Does othering help?

hotel_other_wflow <- hotel_lm_wflow |>
  update_recipe(hotel_other_rec)

hotel_other_res <-
  hotel_other_wflow |>
  fit_resamples(hotel_rs, control = ctrl, metrics = reg_metrics)

collect_metrics(hotel_other_res)
# A tibble: 2 × 6
  .metric .estimator   mean     n std_err .config             
  <chr>   <chr>       <dbl> <int>   <dbl> <chr>               
1 mae     standard   16.7      10 0.213   Preprocessor1_Model1
2 rsq     standard    0.884    10 0.00341 Preprocessor1_Model1

About the same MAE and much faster to complete.

Feature Hashing

Feature Hashing

Between agent and company, simple dummy variables would create 198 new columns (that are mostly zeros).

Another option is to have a binary indicator that combines some levels of these variables.

Feature hashing:

  • uses the character values of the levels
  • converts them to integer hash values
  • uses the integers to assign them to a specific indicator column.
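A toy illustration of these three steps, using a made-up hash function (summing character codes) rather than the MurmurHash implementation that {textrecipes} actually uses:

```r
# Toy hash for illustration only: sum the character codes of the level.
# The real implementation in {textrecipes} uses MurmurHash3.
toy_hash <- function(x) sum(utf8ToInt(x))

agents <- c("Benjamin_Soltoff", "cynthia_worsley", "carlos_bryant")
hash_vals <- vapply(agents, toy_hash, integer(1))

hash_vals %% 32  # indicator-column index (0-31) for each agent
```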

Feature Hashing

Suppose we want to use 32 indicator variables for agent.

For an agent with value “Benjamin_Soltoff”, a hashing function converts it to an integer (say 6768247).

To place it in one of the 32 columns, we use modular arithmetic:

# For "Benjamin_Soltoff" put a '1' in column:
6768247 %% 32
[1] 23

Hash functions are meant to emulate randomness.

Feature Hashing Pros

  • The procedure will automatically work on new values of the predictors
  • It is fast.
  • “Signed” hashes add a sign to help avoid aliasing

Feature Hashing Cons

  • There is no real logic behind which factor levels are combined
  • We don’t know how many columns to add
  • Some columns may have all zeros
  • If an indicator column is important to the model, we can’t easily determine why

Feature Hashing in recipes

The {textrecipes} package has a step that can be added to the recipe:

library(textrecipes)

hash_rec <- recipe(avg_price_per_room ~ ., data = hotel_train) |>
  step_YeoJohnson(lead_time) |>
  # Defaults to 32 signed indicator columns
  step_dummy_hash(agent) |>
  step_dummy_hash(company) |>
  # Regular indicators for the others
  step_dummy(all_nominal_predictors()) |>
  step_zv(all_predictors()) |>
  step_spline_natural(arrival_date_num, deg_free = 10)

hotel_hash_wflow <- hotel_lm_wflow |>
  update_recipe(hash_rec)

Feature Hashing in recipes

hotel_hash_res <- hotel_hash_wflow |>
  fit_resamples(hotel_rs, control = ctrl, metrics = reg_metrics)

collect_metrics(hotel_hash_res)
# A tibble: 2 × 6
  .metric .estimator   mean     n std_err .config             
  <chr>   <chr>       <dbl> <int>   <dbl> <chr>               
1 mae     standard   16.7      10 0.239   Preprocessor1_Model1
2 rsq     standard    0.884    10 0.00324 Preprocessor1_Model1

About the same performance but now we can handle new values.

Effect encoding

What is an effect encoding?

We replace the qualitative predictor’s values with their estimated effect on the outcome.

Data before:

before
# A tibble: 7 × 3
  avg_price_per_room agent            .row
               <dbl> <fct>           <int>
1               52.7 cynthia_worsley     1
2               51.8 carlos_bryant       2
3               53.8 lance_hitchcock     3
4               51.8 lance_hitchcock     4
5               46.8 cynthia_worsley     5
6               54.7 charles_najera      6
7               46.8 cynthia_worsley     7

Data after:

after
# A tibble: 7 × 3
  avg_price_per_room agent  .row
               <dbl> <dbl> <int>
1               52.7  88.5     1
2               51.8  89.5     2
3               53.8  79.8     3
4               51.8  79.8     4
5               46.8  88.5     5
6               54.7 109.      6
7               46.8  88.5     7

The agent column is replaced with an estimate of the average daily rate (ADR).
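Without partial pooling (discussed next), this encoding is just the per-agent mean of the outcome. A naive no-pooling sketch with made-up bookings:

```r
library(dplyr)

# Made-up bookings: the naive (no-pooling) encoding replaces each agent
# with its mean outcome in the training data
toy <- tibble(
  agent = c("a", "a", "b", "b", "b", "c"),
  avg_price_per_room = c(50, 54, 80, 82, 78, 120)
)

encoding <- toy |>
  group_by(agent) |>
  summarize(agent_effect = mean(avg_price_per_room))

left_join(toy, encoding, by = "agent")
```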

Per-agent statistics again

  • Good statistical methods for estimating these means use partial pooling.
  • Pooling borrows strength across agents and shrinks extreme values towards the mean for agents with very few transactions.
  • The {embed} package has recipe steps for effect encodings.
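A sketch of the partial-pooling idea on simulated data: a random-intercept model (roughly what step_lencode_mixed() fits via {lme4} for a numeric outcome) shrinks the estimate for a rarely seen agent toward the overall mean:

```r
library(lme4)

# Simulated bookings: agents "a" and "b" are common, "c" appears only twice
set.seed(1)
toy <- data.frame(
  agent = rep(c("a", "b", "c"), times = c(50, 50, 2)),
  price = c(rnorm(50, mean = 60), rnorm(50, mean = 90), rnorm(2, mean = 200))
)

fit <- lmer(price ~ 1 + (1 | agent), data = toy)

# Partially pooled per-agent estimates; "c" is pulled toward the grand mean
coef(fit)$agent
```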

Partial pooling

Agent effects

library(embed)

hotel_effect_rec <- recipe(avg_price_per_room ~ ., data = hotel_train) |>
  step_YeoJohnson(lead_time) |>
  step_lencode_mixed(agent, company, outcome = vars(avg_price_per_room)) |>
  step_dummy(all_nominal_predictors()) |>
  step_zv(all_predictors())

It is very important to appropriately validate the effect encoding step to make sure that we are not overfitting.

Effect encoding results

hotel_effect_wflow <- workflow() |>
  add_model(linear_reg()) |>
  add_recipe(hotel_effect_rec)

reg_metrics <- metric_set(mae, rsq)

hotel_effect_res <-
  hotel_effect_wflow |>
  fit_resamples(hotel_rs, metrics = reg_metrics)

collect_metrics(hotel_effect_res)
# A tibble: 2 × 6
  .metric .estimator   mean     n std_err .config             
  <chr>   <chr>       <dbl> <int>   <dbl> <chr>               
1 mae     standard   17.8      10 0.189   Preprocessor1_Model1
2 rsq     standard    0.870    10 0.00357 Preprocessor1_Model1

Slightly worse but it can handle new agents (if they occur).

Debugging a recipe

Debugging a recipe

  • Typically, you will want to use a workflow to estimate and apply a recipe.
  • If you have an error and need to debug your recipe, the original recipe object (e.g. hash_rec) can be estimated manually with a function called prep(). It is analogous to fit().
  • Another function (bake()) is analogous to predict(), and gives you the processed data back.
  • The tidy() function can be used to get specific results from the recipe.

Example

hash_rec_fit <- prep(hash_rec)

# Get the transformation coefficient
tidy(hash_rec_fit, number = 1)
# A tibble: 1 × 3
  terms     value id              
  <chr>     <dbl> <chr>           
1 lead_time 0.173 YeoJohnson_gSg4m
# Get the processed data
bake(hash_rec_fit, hotel_train |> slice(1:3), contains("_agent_"))
# A tibble: 3 × 30
  dummyhash_agent_01 dummyhash_agent_02 dummyhash_agent_03 dummyhash_agent_04
               <int>              <int>              <int>              <int>
1                  0                  0                  0                  0
2                  0                 -1                  0                  0
3                  0                  0                  0                  0
# ℹ 26 more variables: dummyhash_agent_05 <int>, dummyhash_agent_06 <int>,
#   dummyhash_agent_07 <int>, dummyhash_agent_08 <int>,
#   dummyhash_agent_10 <int>, dummyhash_agent_11 <int>,
#   dummyhash_agent_12 <int>, dummyhash_agent_13 <int>,
#   dummyhash_agent_14 <int>, dummyhash_agent_15 <int>,
#   dummyhash_agent_16 <int>, dummyhash_agent_18 <int>,
#   dummyhash_agent_19 <int>, dummyhash_agent_20 <int>, …

More on recipes

Wrap-up

Recap

  • Predictors often need modification prior to modeling, either for data preprocessing or feature engineering
  • The {recipes} package is a flexible framework for preparing predictors
  • Feature hashing can be used to reduce the number of columns for high-cardinality predictors
  • Effect encoding can be used to replace categorical predictors with their estimated effect on the outcome

Acknowledgments

Riding the thermals