AE 17: Explaining student debt predictions

Application exercise
Modified: November 5, 2024

Load packages

# packages for wrangling data and the original models
library(tidyverse)
library(tidymodels)
library(rcis)

# packages for model interpretation/explanation
library(DALEX)
library(DALEXtra)

# helper packages
library(tictoc)
library(patchwork)

# set random number generator seed value for reproducibility
set.seed(123)

Student debt in the United States has increased substantially over the past twenty years. In this application exercise we will interpret the results of a set of machine learning models predicting the median student debt load for students graduating in 2020-21 at four-year colleges and universities as a function of university-specific factors (e.g. public vs. private school, admissions rate, cost of attendance).

Tip

Consult the documentation for rcis::scorecard to identify the variables used in the models.
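
You can open the codebook directly from the R console:

# open the documentation for the scorecard data
?rcis::scorecard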

Preparations

Import models

We have estimated three distinct machine learning models to predict the median student debt load for students graduating in 2020-21 at four-year colleges and universities. Each model uses the same set of predictors, but the algorithms differ. Specifically, we have estimated:

  • Random forest
  • Penalized regression
  • 10-nearest neighbors

All models were estimated using tidymodels. We will load the training set, test set, and ML workflows from data/scorecard-models.RData.

# load Rdata file with all the data frames and pre-trained models
load("data/scorecard-models.RData")

Create explainer objects

To generate our interpretations, we will use the {DALEX} package. The first step in any {DALEX} operation is to create an explainer object, which bundles together everything needed to interpret the model's predictions: the fitted model, the predictor data, and the true outcome values. We will create explainer objects for each of the three models.

Your turn: Review the syntax below for creating explainer objects using the explain_tidymodels() function. Then, create explainer objects for the random forest and k-nearest neighbors models.

# use explain_*() to create explainer object
# first step of a {DALEX} operation
explainer_glmnet <- explain_tidymodels(
  model = glmnet_wf,
  # data should exclude the outcome feature
  data = scorecard_train |> select(-debt),
  # y should be a vector containing the outcome of interest for the training set
  y = scorecard_train$debt,
  # assign a label to clearly identify model in later plots
  label = "penalized regression"
)

# explainer for random forest model
explainer_rf <- explain_tidymodels(
  model = TODO,
  data = TODO,
  y = TODO,
  label = "random forest"
)

# explainer for nearest neighbors model
explainer_kknn <- explain_tidymodels(
  model = TODO,
  data = TODO,
  y = TODO,
  label = "k nearest neighbors"
)
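
A possible completion is sketched below. It assumes the workflow objects loaded from scorecard-models.RData follow the same naming pattern as glmnet_wf (i.e. rf_wf and kknn_wf); adjust the names if your objects differ.

# possible solution - rf_wf and kknn_wf are assumed object names
explainer_rf <- explain_tidymodels(
  model = rf_wf,
  data = scorecard_train |> select(-debt),
  y = scorecard_train$debt,
  label = "random forest"
)

explainer_kknn <- explain_tidymodels(
  model = kknn_wf,
  data = scorecard_train |> select(-debt),
  y = scorecard_train$debt,
  label = "k nearest neighbors"
)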

Shapley values

Choose the observations to explain

Shapley values explain individual predictions, so we first select the observations we want to explain: Cornell University and Ithaca College. We drop the ID and name columns because they are not predictors in the models.

cornell <- filter(.data = scorecard, name == "Cornell University") |>
  select(-unitid, -name)
ic <- filter(.data = scorecard, name == "Ithaca College") |>
  select(-unitid, -name)

Estimate Shapley values

predict_parts() with type = "shap" estimates how much each feature contributes to a single observation's prediction, averaging the contributions over B random orderings of the features (B = 25 by default).

# explain Cornell with rf model
shap_cornell_rf <- predict_parts(
  explainer = explainer_rf,
  new_observation = cornell,
  type = "shap"
)
plot(shap_cornell_rf)

# explain Cornell with kknn model
shap_cornell_kknn <- predict_parts(
  explainer = explainer_kknn,
  new_observation = cornell,
  type = "shap"
)
plot(shap_cornell_kknn)

# increase the number of feature order permutations
shap_cornell_kknn_40 <- predict_parts(
  explainer = explainer_kknn,
  new_observation = cornell,
  type = "shap",
  B = 40
)

plot(shap_cornell_kknn_40)

Pair with {ggplot2}

The object returned by predict_parts() is a data frame underneath, so we can build a custom visualization rather than relying on the default plot() method.

# based on example from https://www.tmwr.org/explain#local-explanations

shap_cornell_kknn |>
  # convert to pure tibble-formatted data frame
  as_tibble() |>
  # calculate average contribution per variable across permutations
  mutate(mean_val = mean(contribution), .by = variable) |>
  # reorder variable levels in order of absolute value of mean contribution
  mutate(variable = fct_reorder(variable, abs(mean_val))) |>
  # define basic ggplot object for horizontal boxplot
  ggplot(mapping = aes(x = contribution, y = variable, fill = mean_val > 0)) +
  # add a bar plot
  geom_col(
    data = ~ distinct(., variable, mean_val),
    mapping = aes(x = mean_val, y = variable),
    alpha = 0.5
  ) +
  # overlay with boxplot to show distribution
  geom_boxplot(width = 0.5) +
  # outcome variable is measured in dollars - contributions are in the same units
  scale_x_continuous(labels = label_dollar()) +
  # use viridis color palette
  scale_fill_viridis_d(guide = "none") +
  labs(y = NULL)

Exercises

Your turn: Explain each model’s prediction for Ithaca College. How do they differ from each other?

# calculate shapley values
shap_ic_rf <- predict_parts(
  explainer = explainer_rf,
  new_observation = ic,
  type = "shap"
)

shap_ic_kknn <- predict_parts(
  explainer = explainer_kknn,
  new_observation = ic,
  type = "shap"
)

shap_ic_glmnet <- predict_parts(
  explainer = explainer_glmnet,
  new_observation = ic,
  type = "shap"
)

# generate plots for each
plot(shap_ic_rf)
plot(shap_ic_kknn)
plot(shap_ic_glmnet)

Add response here.

Choose your own adventure

Your turn: Choose at least two other universities in the scorecard dataset. Generate explanations of their predicted median student debt from the random forest model using Shapley values. Compare the results. What are the most important features for each university? How do the explanations differ?

# choose at least two other universities (e.g. ones you have attended or worked at)
my_obs <- filter(.data = scorecard, name %in% c(
  TODO
)) |>
  select(-unitid, -name)

# calculate shapley values
# add code here
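
One possible sketch, using two institutions chosen purely for illustration (the names are assumptions and must match their spelling in scorecard$name exactly; swap in your own choices):

# example institutions - names assumed to appear verbatim in scorecard
my_obs <- filter(.data = scorecard, name %in% c(
  "University of Chicago",
  "Michigan State University"
)) |>
  select(-unitid, -name)

# calculate shapley values with the random forest model, one row at a time
# (rows follow their order in scorecard, not the order of the name vector)
shap_obs1_rf <- predict_parts(
  explainer = explainer_rf,
  new_observation = my_obs[1, ],
  type = "shap"
)
shap_obs2_rf <- predict_parts(
  explainer = explainer_rf,
  new_observation = my_obs[2, ],
  type = "shap"
)

plot(shap_obs1_rf)
plot(shap_obs2_rf)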

Add response here.

Permutation-based feature importance

The {DALEX} package provides a variety of methods for interpreting machine learning models. One common method is to calculate feature importance. Permutation-based feature importance measures each predictor's contribution to the model's predictions by randomly permuting that predictor's values and recording how much the model's loss increases. We will use the model_parts() function to calculate feature importance for the random forest model. The result includes a built-in plot() method using {ggplot2} to visualize the results.

# generate feature importance measures
vip_rf <- model_parts(explainer_rf)
vip_rf

# visualize feature importance
plot(vip_rf)

Your turn: Calculate feature importance for the random forest model using 100, 1000, and all observations for permutations. How does the compute time change?

# N = 100
tic()
model_parts(explainer_rf, N = 100)
toc()

# default N = 1000
tic()
model_parts(explainer_rf, N = 1000)
toc()

# all observations
tic()
model_parts(explainer_rf, N = NULL)
toc()

Add response here.

Your turn: Calculate feature importance for the random forest model using the ratio of the loss function after permutation to the baseline loss, rather than the raw change. How does this differ from the raw change?

# calculate ratio rather than raw change
model_parts(explainer_rf, type = "ratio") |>
  plot()

Add response here.

Exercises

Your turn: Calculate feature importance for the penalized regression and k-nearest neighbors models using all observations for permutations. How do they compare to the random forest model?

# random forest model
vip_rf <- model_parts(explainer_rf, N = NULL)
plot(vip_rf)

# compare to the glmnet model
vip_glmnet <- model_parts(explainer_glmnet, N = NULL)
plot(vip_glmnet)

# compare to the kknn model
vip_kknn <- model_parts(explainer_kknn, N = NULL)
plot(vip_kknn)

Add response here.

Your turn: Calculate feature importance for the random forest model three times using N = 100, changing the random seed value before each calculation. How do the results change? How does that compare to N = 1000?

# calculate random forest feature importance thrice, changing the seed each time
set.seed(123)
model_parts(explainer_rf, N = 100) |> plot()

set.seed(456)
model_parts(explainer_rf, N = 100) |> plot()

set.seed(789)
model_parts(explainer_rf, N = 100) |> plot()

# compare to the default N = 1000
set.seed(123)
model_parts(explainer_rf, N = 1000) |> plot()

Add response here.

Partial dependence plots

To generate partial dependence plots, we will use model_profile(). This function varies a single variable across a grid of values and averages the model's predictions at each value, holding all other variables at their observed values. We will generate partial dependence plots for the random forest model using the netcost variable.

# basic pdp for RF model and netcost variable
pdp_netcost <- model_profile(explainer_rf, variables = "netcost")
pdp_netcost

We can visualize just the PDP, or the PDP along with the individual conditional expectation (ICE) curves that underlie it.

# just the PDP
plot(pdp_netcost)

# PDP with ICE curves
plot(pdp_netcost, geom = "profiles")

By default, model_profile() uses only 100 randomly selected observations to generate the profiles. Increasing the sample size may improve our estimate of the PDP, at the expense of a longer compute time.

# larger sample size
model_profile(explainer_rf, variables = "netcost", N = 500) |>
  plot(geom = "profiles")

We can also examine the PDP of a variable separately for distinct subgroups in the dataset. This is relevant for ML models that allow for interactions between predictor variables (either by design or due to the type of algorithm used). Here we examine the PDP of the net cost of attendance separately for public, private non-profit, and private for-profit institutions.

pdp_cost_group <- model_profile(explainer_rf, variables = "netcost", groups = "type")
plot(pdp_cost_group, geom = "profiles")

We might also want to generate a visualization directly from the aggregated profiles, rather than relying on {DALEX} to generate it for us.

# PDP for state - very hard to read
pdp_state_kknn <- model_profile(explainer_kknn, variables = "state")
plot(pdp_state_kknn)

# examine the data structure
pdp_state_kknn

# extract aggregated profiles
pdp_state_kknn$agr_profiles |>
  # convert to tibble
  as_tibble() |>
  # reorder for plotting
  mutate(`_x_` = fct_reorder(.f = `_x_`, .x = `_yhat_`)) |>
  ggplot(mapping = aes(x = `_yhat_`, y = `_x_`, fill = `_yhat_`)) +
  geom_col() +
  scale_x_continuous(labels = label_dollar()) +
  scale_fill_viridis_c(guide = "none") +
  labs(
    title = "Partial dependence plot for state",
    subtitle = "Created for the k nearest neighbors model",
    x = "Average prediction",
    y = NULL
  )

Exercises

Your turn: Create PDP + ICE curves for average faculty salary using all three models. How does the role of average faculty salary differ across the models?

# create PDP + ICE curves for avgfacsal from all three models
model_profile(explainer_rf, variables = "avgfacsal") |> plot(geom = "profiles")
model_profile(explainer_glmnet, variables = "avgfacsal") |> plot(geom = "profiles")
model_profile(explainer_kknn, variables = "avgfacsal") |> plot(geom = "profiles")

Add response here.

Your turn: Create a PDP for all numeric variables in the penalized regression model. How do the variables compare in terms of their impact on the model?

# create PDP for all numeric variables in glmnet model
model_profile(explainer_glmnet, variables = TODO) |>
  plot()
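
One way to approach this, assuming the numeric predictors in rcis::scorecard are the columns listed below (verify against the codebook before running):

# numeric predictors (assumed - check ?rcis::scorecard)
numeric_vars <- c(
  "admrate", "satavg", "cost", "netcost",
  "avgfacsal", "pctpell", "comprate", "firstgen"
)

model_profile(explainer_glmnet, variables = numeric_vars) |>
  plot()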

Add response here.