Feature reduction/selection

Lecture 12

Dr. Benjamin Soltoff

Cornell University
INFO 4940/5940 - Fall 2024

October 8, 2024

Announcements

Announcements

  • Homework 02 due tomorrow

Learning objectives

  • Identify the benefits of a simplified machine learning model
  • Consider how dimension reduction can be used to reduce the number of features in a model
  • Examine how tree-based and regularization models naturally incorporate feature selection
  • Implement feature selection and reduction techniques using {tidymodels}

Keep It Simple, Stupid

Keep it simple, stupid - why?

  • Reduces overfitting
  • Improves accuracy
  • Enhances model interpretability
  • Reduces computational costs
  • Mitigates the curse of dimensionality
  • Removes multicollinearity

Feature reduction

Feature reduction

Use dimension reduction techniques to transform the predictors and use a reduced number of variables to fit the model.

Highly multicollinear data

# A tibble: 400 × 201
         y      X1     X2      X3      X4     X5     X6      X7     X8     X9
     <dbl>   <dbl>  <dbl>   <dbl>   <dbl>  <dbl>  <dbl>   <dbl>  <dbl>  <dbl>
 1  -2.71   0.767   1.64   1.19   -1.13    0.912  1.97   1.56    1.22  -0.919
 2  40.4   -0.645   0.857 -1.58    2.03   -3.08  -1.59  -0.785  -1.74   0.910
 3 -26.9   -0.597   1.60  -0.499   1.67    1.52   0.918  0.335   2.49  -2.36 
 4 -16.6   -1.75   -1.94  -0.0462 -0.0284  2.69  -0.802  0.533   0.821 -2.06 
 5 -45.0    0.674   0.815  1.39   -2.37    2.21   2.98   1.78    1.57  -1.33 
 6   0.550 -0.0497  4.18  -1.50    2.91   -1.09   1.47   0.0364  1.80  -1.69 
 7  45.6   -0.960  -3.09  -2.07    3.24   -1.91  -4.99  -3.48   -1.88   1.70 
 8 -10.1    1.16   -0.309  3.08   -2.38    4.78   1.80   2.36    3.36  -3.02 
 9  41.0   -0.170  -0.831 -0.484   1.33   -1.02  -2.22  -0.765  -0.590  0.188
10  14.9    0.881   0.102  0.436  -0.308  -0.492 -0.163 -0.179  -0.241  0.713
# ℹ 390 more rows
# ℹ 191 more variables: X10 <dbl>, X11 <dbl>, X12 <dbl>, X13 <dbl>, X14 <dbl>,
#   X15 <dbl>, X16 <dbl>, X17 <dbl>, X18 <dbl>, X19 <dbl>, X20 <dbl>,
#   X21 <dbl>, X22 <dbl>, X23 <dbl>, X24 <dbl>, X25 <dbl>, X26 <dbl>,
#   X27 <dbl>, X28 <dbl>, X29 <dbl>, X30 <dbl>, X31 <dbl>, X32 <dbl>,
#   X33 <dbl>, X34 <dbl>, X35 <dbl>, X36 <dbl>, X37 <dbl>, X38 <dbl>,
#   X39 <dbl>, X40 <dbl>, X41 <dbl>, X42 <dbl>, X43 <dbl>, X44 <dbl>, …

Variance inflation factor (VIF)

The ratio of the variance of a coefficient estimate in the full model (with all predictors) to its variance in a model containing only that predictor.

Measures how much the variance for individual parameters is inflated due to multicollinearity.

Ideally VIF for all variables is below 10.

Variance inflation factor (VIF)
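
As a hedged illustration (assuming the {car} package and that the multicollinear tibble above is stored under the hypothetical name collinear_df), VIFs can be computed from a fitted linear model:

library(car)

# fit an ordinary linear model on the highly multicollinear data
lm_fit <- lm(y ~ ., data = collinear_df)

# one VIF per predictor; values above ~10 flag problematic multicollinearity
vif(lm_fit)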

Summary of feature reduction

Feature reduction methods include:

  • Principal components analysis
  • Partial least squares (PLS)
  • Independent component analysis (ICA)
  • Uniform manifold approximation and projection (UMAP)

Remember these are all methods of feature reduction - they still require all the original predictors and are transformations of the original variables.
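
A minimal sketch of the first of these, assuming the multicollinear tibble above is stored as collinear_df (hypothetical name): a {recipes} specification that replaces the original predictors with enough principal components to capture 90% of their variance.

library(tidymodels)

pca_rec <- recipe(y ~ ., data = collinear_df) |>
  # PCA requires predictors on a common scale
  step_normalize(all_numeric_predictors()) |>
  # keep as many components as needed to explain 90% of the variance
  step_pca(all_numeric_predictors(), threshold = 0.9)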

What if we want to select a subset of the original features?

Warning

Know your data! Don’t follow feature selection results blindly. If you know from theory or experience that certain predictors should explain/predict the outcome, then strongly consider including them in the model regardless of the feature selection results.

Subset selection methods

Subset selection methods

Identify a subset of the original predictors that are most relevant to the outcome of interest and fit a model to it.

  • Best subset selection - fit all possible combinations of variables and select the best model
  • Forward stepwise selection - start with no predictors and add one at a time, always adding the best variable
  • Backward stepwise selection - start with all predictors and remove one at a time, always removing the worst variable
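
As a hedged sketch of the forward stepwise approach above using the {leaps} package (outside {tidymodels}, again assuming the multicollinear data is stored as collinear_df):

library(leaps)

# forward stepwise selection, considering models with up to 10 predictors
fwd_fit <- regsubsets(y ~ ., data = collinear_df, nvmax = 10, method = "forward")

# adjusted R-squared for the best model of each size
summary(fwd_fit)$adjr2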

Drawbacks to subset selection methods

  • Methods do not guarantee consistent results
  • Big risk of overfitting
  • Computationally intensive
    • Best subset selection - estimate \(2^p\) models

    • Forward/backward stepwise selection - estimate \(1 + \frac{p(p+1)}{2}\) models

    • For \(p = 20\), that requires estimating 1,048,576 or 211 models, respectively (see the quick check below)

  • Only works with linear models
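
A quick check of these counts in R for \(p = 20\):

p <- 20
2^p                  # best subset: 1,048,576 candidate models
1 + p * (p + 1) / 2  # forward/backward stepwise: 211 candidate models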

Regularization methods

Brief overview

Linear regression estimates \(\beta_0, \beta_1, \ldots, \beta_p\) by minimizing the residual sum of squares (RSS).

\[\min_{\boldsymbol{\beta}} \sum_{i=1}^{n} \left( y_i - \left( \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij} \right) \right)^2\]

Regularization methods add a penalty term to the RSS to shrink the coefficients towards zero.

  • Bias vs. variance trade-off
  • Reduces overfitting
  • Unimportant features are shrunken closer to zero

Forms of regularization penalty

Ridge regression

\[\min_{\boldsymbol{\beta}} \sum_{i=1}^{n} \left( y_i - \left( \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij} \right) \right)^2 + \textcolor{orange}{\lambda \sum_{j=1}^{p} \beta_j^2 }\]

Lasso regression

\[\min_{\boldsymbol{\beta}} \sum_{i=1}^{n} \left( y_i - \left( \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij} \right) \right)^2 + \textcolor{orange}{\lambda \sum_{j=1}^{p} \left| \beta_j \right|}\]

  • \(\lambda\) is the penalty parameter (a tuning parameter) and controls how much weight the penalty term receives
  • The form of the penalty does not change with \(\lambda\), so a single penalized regression fit can produce coefficients for an entire path of \(\lambda\) values

Differences between ridge and lasso

Ridge penalty \[\sum_{j=1}^{p} \beta_j^2\]

Lasso penalty \[\sum_{j=1}^{p} \left| \beta_j \right|\]

Ridge penalty pushes coefficients near zero, whereas lasso penalty can shrink coefficients all the way to zero.

Example of shrinkage
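
As a minimal sketch of this shrinkage behavior using {glmnet} directly (outside {tidymodels}, and assuming a numeric predictor matrix x and response vector y), the coefficient paths can be plotted as \(\lambda\) grows:

library(glmnet)

# alpha = 1 gives the lasso penalty; alpha = 0 gives the ridge penalty
lasso_path <- glmnet(x, y, alpha = 1)
ridge_path <- glmnet(x, y, alpha = 0)

# lasso coefficients hit exactly zero as lambda increases;
# ridge coefficients only approach zero
plot(lasso_path, xvar = "lambda")
plot(ridge_path, xvar = "lambda")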

Elastic net

Want to include both?

\[\min_{\boldsymbol{\beta}} \sum_{i=1}^{n} \left( y_i - \left( \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij} \right) \right)^2 + \textcolor{orange}{\lambda_2 \sum_{j=1}^{p} \beta_j^2 } + \textcolor{orange}{\lambda_1 \sum_{j=1}^{p} \left| \beta_j \right|}\]

Regularization using {tidymodels}

Model specification

# lasso regression
linear_reg(penalty = tune(), mixture = 1) |>
  set_engine("glmnet")

# ridge regression
linear_reg(penalty = tune(), mixture = 0) |>
  set_engine("glmnet")

# elastic net
linear_reg(penalty = tune(), mixture = tune()) |>
  set_engine("glmnet")

Tuning parameters

penalty()
Amount of Regularization (quantitative)
Transformer: log-10 [1e-100, Inf]
Range (transformed scale): [-10, 0]
mixture()
Proportion of Lasso Penalty (quantitative)
Range: [0, 1]
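
If the default search range is too wide, it can be narrowed when building the grid; a hedged example on the log-10 transformed scale:

library(dials)

# restrict candidate penalties to between 1e-5 and 1
penalty(range = c(-5, 0))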

Minimum {recipe} steps

recipe(outcome ~ ., data = df_train) |>
  step_dummy(all_nominal_predictors()) |>
  step_normalize(all_predictors())

Example process

sim_df
# A tibble: 5,000 × 21
   outcome predictor_01 predictor_02 predictor_03 predictor_04 predictor_05
     <dbl>        <dbl>        <dbl>        <dbl>        <dbl>        <dbl>
 1   72.1        4.05         -0.988       -2.37        -6.87        -5.51 
 2   -5.39       0.556         3.46        -1.52        -2.47         1.37 
 3   -1.45       1.29          1.17         1.91         1.20         0.177
 4   74.4       -0.572        -0.539       -0.135        4.18         4.61 
 5   20.2       -2.91         -1.64        -5.10        -2.20         1.78 
 6   -2.19       2.30          1.07         0.460       -0.809       -6.10 
 7   43.3       -0.535         3.71         0.804       -2.76        -0.215
 8   16.4       -0.0130        0.346       -0.481        3.16        -2.54 
 9   37.8        4.98         -6.56        -3.90         1.40        -5.42 
10   17.4       -1.97          0.867       -4.50        -2.38        -2.78 
# ℹ 4,990 more rows
# ℹ 15 more variables: predictor_06 <dbl>, predictor_07 <dbl>,
#   predictor_08 <dbl>, predictor_09 <dbl>, predictor_10 <dbl>,
#   predictor_11 <dbl>, predictor_12 <dbl>, predictor_13 <dbl>,
#   predictor_14 <dbl>, predictor_15 <dbl>, predictor_16 <dbl>,
#   predictor_17 <dbl>, predictor_18 <dbl>, predictor_19 <dbl>,
#   predictor_20 <dbl>

Specify workflow

glmnet_rec <- recipe(outcome ~ ., data = sim_df) |>
  step_dummy(all_nominal_predictors()) |>
  step_normalize(all_predictors())

glmnet_spec <- linear_reg(penalty = tune(), mixture = 1) |>
  set_engine("glmnet")

glmnet_wf <- workflow() |>
  add_model(glmnet_spec) |>
  add_recipe(glmnet_rec)

Fit model

set.seed(532)

glmnet_tune <- tune_grid(
  glmnet_wf,
  resamples = sim_resamples,
  # 30 penalty values, evenly spaced on the log-10 scale
  grid = grid_regular(
    penalty(),
    levels = 30
  ),
  control = control_grid(save_workflow = TRUE)
)

Model performance

autoplot(glmnet_tune)

Fit a final model

# get the best penalty parameter and fit the model using the entire training set
final_fit <- fit_best(glmnet_tune)
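
fit_best() is a shortcut; a sketch of the longer, more explicit route (assuming the training data is stored as sim_train, as in the random forest example later):

best_penalty <- select_best(glmnet_tune, metric = "rmse")

final_fit <- glmnet_wf |>
  finalize_workflow(best_penalty) |>
  fit(data = sim_train)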

Feature importance

# extract coefficients
final_fit |>
  tidy() |>
  filter(estimate != 0)
# A tibble: 14 × 3
   term         estimate penalty
   <chr>           <dbl>   <dbl>
 1 (Intercept)  14.8       0.204
 2 predictor_01  2.57      0.204
 3 predictor_02 -0.00658   0.204
 4 predictor_04  0.00826   0.204
 5 predictor_07 -0.128     0.204
 6 predictor_08  0.0486    0.204
 7 predictor_10  0.0174    0.204
 8 predictor_11  1.51      0.204
 9 predictor_12  0.189     0.204
10 predictor_14  5.89      0.204
11 predictor_15 -0.184     0.204
12 predictor_16 -0.493     0.204
13 predictor_17  1.38      0.204
14 predictor_18 -6.20      0.204

Feature importance with {vip}

library(vip)

# data frame
final_fit |>
  extract_fit_parsnip() |>
  vi()
# A tibble: 20 × 3
   Variable     Importance Sign 
   <chr>             <dbl> <chr>
 1 predictor_18     6.42   NEG  
 2 predictor_14     6.08   POS  
 3 predictor_01     2.76   POS  
 4 predictor_11     1.70   POS  
 5 predictor_17     1.56   POS  
 6 predictor_16     0.693  NEG  
 7 predictor_12     0.374  POS  
 8 predictor_15     0.369  NEG  
 9 predictor_07     0.318  NEG  
10 predictor_08     0.242  POS  
11 predictor_04     0.207  POS  
12 predictor_10     0.202  POS  
13 predictor_13     0.194  POS  
14 predictor_02     0.180  NEG  
15 predictor_05     0.175  POS  
16 predictor_09     0.0775 POS  
17 predictor_06     0.0663 NEG  
18 predictor_19     0.0489 POS  
19 predictor_03     0.0367 NEG  
20 predictor_20     0.0212 NEG  
# plot
final_fit |>
  extract_fit_parsnip() |>
  vip()

Application exercise

ae-11

  • Go to the course GitHub org and find your ae-11 (repo name will be suffixed with your GitHub name).
  • Clone the repo in RStudio, run renv::restore() to install the required packages, open the Quarto document in the repo, and follow along and complete the exercises.
  • Render, commit, and push your edits by the AE deadline – end of the day

⏱️ Your turn

Estimate the regularized model to predict Taylor Swift vs. Beyoncé songs. Examine the coefficients and interpret in terms of feature selection.


Challenges with regularized regression for feature selection

  • Regularized regression still assumes a linear functional form for the model
  • Requires standardizing the predictors so they are on a common scale, which means the coefficients are no longer interpretable in the variables' original units

What if we want to account for non-linear, interactive functional forms in the variables’ original units?

Tree-based feature importance

Decision trees

  • Decision trees are a natural ML technique for feature importance
  • Tree-fitting procedure automatically excludes predictors that do not provide sufficient information to create pure nodes
  • If a feature is not used in the tree, then it is not important
  • Decision trees naturally take into account non-linear and interactive relationships between the predictors and outcome of interest
  • However, ensemble methods such as random forests and gradient boosting use features inconsistently across their many trees, so importance must be aggregated over the whole ensemble

Impurity-based feature importance

Measure of the total decrease in node impurity that results from splits over that variable.

  • Based on the training set
  • Biased towards high-cardinality features
  • Some methods available to “correct” for the bias

Mean decrease of accuracy in predictions on the out-of-bag samples

Using the out-of-bag observations for each decision tree and predictor:

  1. Calculate the prediction accuracy for the OOB observations.
  2. Randomly permute (shuffle) the values of the predictor.
  3. Calculate the prediction accuracy for the permuted OOB observations.
  4. Record the difference in accuracy between the true and permuted OOB observations.
  5. Rinse and repeat a sufficient number of times.

Uses OOB observations that were not used to train each tree, so it avoids the training-set bias of impurity-based importance.

Also shown to be problematic when predictors are highly correlated.
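
A minimal sketch of this permutation idea for a single predictor in a regression setting (not ranger's implementation; fit, dat, outcome, and pred are hypothetical names for a fitted model whose predict() returns a numeric vector, the held-out or OOB data, and the outcome/predictor columns):

permute_importance <- function(fit, dat, outcome, pred, times = 25) {
  rmse <- function(truth, est) sqrt(mean((truth - est)^2))

  # error on the untouched held-out data
  baseline <- rmse(dat[[outcome]], predict(fit, dat))

  # shuffle the predictor, re-predict, and record the increase in error
  increases <- replicate(times, {
    shuffled <- dat
    shuffled[[pred]] <- sample(shuffled[[pred]])
    rmse(dat[[outcome]], predict(fit, shuffled)) - baseline
  })

  mean(increases)
}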

Tree-based feature importance using {tidymodels}

Specify workflow model

rf_spec <- rand_forest(trees = 1000) |>
  set_mode("regression") |>
  set_engine("ranger", importance = "permutation")

rf_wf <- workflow() |>
  add_model(rf_spec) |>
  add_formula(outcome ~ .)
Use importance = "impurity" for impurity-based feature importance.

Fit model

set.seed(056)

rf_fit <- rf_wf |>
  fit(data = sim_train)

Note

Currently the model does not require tuning, so we can fit() it once using the entire training set. If you have tuning parameters, tune and finalize those parameters first.

Feature importance

rf_fit |>
  extract_fit_parsnip() |>
  vip(method = "model", num_features = 20L) +
  ggtitle("Increase in mean squared error")

Comparing techniques

⏱️ Your turn

Estimate the random forest model to predict Taylor Swift vs. Beyoncé songs. Examine the feature importance scores and interpret in terms of feature selection. How do these results compare to the regularized regression approach?


Wrap-up

Recap

  • Feature reduction methods transform the predictors to reduce the number of (transformed) variables in the model
  • Subset selection methods identify a subset of the original predictors that are most relevant to the outcome of interest
  • Regularization methods add a penalty term to shrink the coefficients towards zero
  • Tree-based methods are a common ML technique for feature importance
  • Feature selection is not automatic – you as the developer need to make final decisions on what predictors to keep