Fairness in machine learning

Lecture 13

Dr. Benjamin Soltoff

Cornell University
INFO 4940/5940 - Fall 2024

October 10, 2024

Announcements

  • Project teams

Learning objectives

  • Define the concept of fairness in machine learning
  • Introduce metrics for evaluating fairness and equity in model performance
  • Implement fairness-aware machine learning models
  • Utilize workflow sets to screen many models
  • Justify appropriate criteria for evaluating model performance and selecting a final model

What is fairness?

⏱️ Your turn

What is “fairness”? Discuss with a partner and write down your definition.

03:00

High-risk health care management programs

These programs seek to improve the care of patients with complex health needs by providing additional resources, including greater attention from trained providers, to help ensure that care is well coordinated. Most health systems use these programs as the cornerstone of population health management efforts, and they are widely considered effective at improving outcomes and satisfaction while reducing costs. […] Because the programs are themselves expensive—with costs going toward teams of dedicated nurses, extra primary care appointment slots, and other scarce resources—health systems rely extensively on algorithms to identify patients who will benefit the most.

Source: Obermeyer et al., “Dissecting racial bias in an algorithm used to manage the health of populations,” Science (2019)

Competing interests

  • Patients
  • Doctors
  • Hospital administrators

⏱️ Your turn

We want to create a fair algorithm to identify patients who will benefit the most from high-risk health care management programs. What does it mean for this algorithm to be “fair”?

  • Patients
  • Doctors
  • Hospital administrators

05:00

{readmission}

Publicly available database containing information on 71,515 hospital stays from diabetes patients

library(readmission)

readmission
# A tibble: 71,515 × 12
   readmitted race   sex   age   admission_source blood_glucose insurer duration
   <fct>      <fct>  <fct> <fct> <fct>            <fct>         <fct>      <dbl>
 1 Yes        Afric… Male  [60-… Referral         <NA>          <NA>           7
 2 No         Cauca… Fema… [50-… Emergency        Normal        Private        4
 3 Yes        Cauca… Fema… [70-… Referral         <NA>          Medica…        5
 4 No         Cauca… Fema… [80-… Referral         <NA>          Private        5
 5 No         Cauca… Fema… [70-… Referral         <NA>          <NA>           4
 6 No         Cauca… Male  [50-… Emergency        Very High     <NA>           2
 7 Yes        Afric… Fema… [70-… Referral         <NA>          Private        3
 8 No         Cauca… Fema… [20-… Emergency        <NA>          <NA>           1
 9 No         Cauca… Male  [60-… Other            <NA>          <NA>          12
10 No         Cauca… Fema… [80-… Referral         <NA>          Medica…        1
# ℹ 71,505 more rows
# ℹ 4 more variables: n_previous_visits <dbl>, n_diagnoses <dbl>,
#   n_procedures <dbl>, n_medications <dbl>

Modeling goals

  • Predict whether or not a patient is readmitted within 30 days of discharge (i.e. needed additional care)
  • Account for whether the model consistently identifies lesser need in one subgroup than in another
  • Train a model that is both fair in how it treats race groups and as performant as possible

Load packages

library(tidymodels)
library(desirability2)
library(GGally)

Exploratory analysis

Distribution of outcome of interest

readmission |>
  count(readmitted)
# A tibble: 2 × 2
  readmitted     n
  <fct>      <int>
1 Yes         6293
2 No         65222
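
The outcome is imbalanced: only about 9% of stays end in a readmission. A quick check:

readmission |>
  count(readmitted) |>
  mutate(prop = n / sum(n))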

Distribution of race

readmission |>
  count(race)
# A tibble: 6 × 2
  race                 n
  <fct>            <int>
1 African American 12887
2 Asian              497
3 Caucasian        53491
4 Hispanic          1517
5 Other             1177
6 Unknown           1946

Impact on variability of metrics

Proportion of readmissions across race groups, estimated with 10-fold cross-validation

# A tibble: 6 × 4
  race               mean      sd     n
  <fct>             <dbl>   <dbl> <int>
1 African American 0.0849 0.00713 12887
2 Asian            0.0825 0.0295    497
3 Caucasian        0.0900 0.00369 53491
4 Hispanic         0.0805 0.0285   1517
5 Other            0.0680 0.0265   1177
6 Unknown          0.0720 0.0178   1946
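
A sketch of one way to build a table like this, assuming the statistic is the proportion readmitted within each fold’s assessment set (n is the overall group size):

set.seed(1)
vfold_cv(readmission, v = 10) |>
  # stack the ten assessment sets, keeping the fold id
  mutate(assessment = map(splits, assessment)) |>
  unnest(assessment) |>
  # proportion readmitted per race group within each fold
  summarize(prop = mean(readmitted == "Yes"), .by = c(id, race)) |>
  # mean and standard deviation of that proportion across the folds
  summarize(mean = mean(prop), sd = sd(prop), .by = race)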

⏱️ Your turn

What should we do with the infrequent race classes?

  • Remove rows arising from infrequent classes?
  • Collapse infrequent classes into a generic “other” category?
  • Something else?

03:00

Collapse into “other”

readmission_collapsed <- readmission |>
  mutate(
    race = case_when(
      !(race %in% c("Caucasian", "African American")) ~ "Other",
      .default = race
    ),
    race = factor(race)
  )

readmission_collapsed |>
  count(race)
# A tibble: 3 × 2
  race                 n
  <fct>            <int>
1 African American 12887
2 Caucasian        53491
3 Other             5137
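
An equivalent route: forcats (not attached above) provides fct_other() for exactly this kind of collapse:

library(forcats)

readmission |>
  mutate(race = fct_other(race, keep = c("Caucasian", "African American"))) |>
  count(race)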

[Exploratory plots omitted: reason for readmission, insurance coverage, and the remaining N variables]

Prep for modeling

Resample data

set.seed(1)
readmission_splits <- initial_split(readmission_collapsed, strata = readmitted)
readmission_train <- training(readmission_splits)
readmission_test <- testing(readmission_splits)
readmission_folds <- vfold_cv(readmission_train, strata = readmitted)
readmission_folds
#  10-fold cross-validation using stratification 
# A tibble: 10 × 2
   splits               id    
   <list>               <chr> 
 1 <split [48272/5364]> Fold01
 2 <split [48272/5364]> Fold02
 3 <split [48272/5364]> Fold03
 4 <split [48272/5364]> Fold04
 5 <split [48272/5364]> Fold05
 6 <split [48272/5364]> Fold06
 7 <split [48273/5363]> Fold07
 8 <split [48273/5363]> Fold08
 9 <split [48273/5363]> Fold09
10 <split [48273/5363]> Fold10

Basic recipe

We’ll first define a basic recipe that assigns a factor level to missing values in the nominal predictors, applies a Yeo-Johnson transformation to the numeric predictors, centers and scales them, and then creates dummy variables:

recipe_basic <- recipe(readmitted ~ ., data = readmission_train) |>
  step_unknown(all_nominal_predictors()) |>
  step_YeoJohnson(all_numeric_predictors()) |>
  step_normalize(all_numeric_predictors()) |>
  step_dummy(all_nominal_predictors())
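
To sanity-check what the recipe produces, we can prep and bake it outside of a workflow (a quick check, not shown in the slides):

recipe_basic |>
  prep() |>
  bake(new_data = NULL) |>
  glimpse()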

Treat age as numeric

# e.g. "[10-20]" -> 15
age_bin_to_midpoint <- function(age_bin) {
  # ensure factors are treated as their label
  age <- as.character(age_bin)
  # take the second character, e.g. "[10-20]" -> "1"
  age <- substr(age, 2, 2)
  # convert to numeric, e.g. "1" -> 1
  age <- as.numeric(age)
  # scale to the bin's midpoint, e.g. 1 -> 1 * 10 + 5 = 15
  age * 10 + 5
}
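
For example, with a few hypothetical bin labels:

age_bin_to_midpoint(factor(c("[10-20)", "[60-70)", "[90-100)")))
[1] 15 65 95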

recipe_age <- recipe(readmitted ~ ., data = readmission_train) |>
  step_mutate(age_num = age_bin_to_midpoint(age)) |>
  step_rm(age) |>
  step_unknown(all_nominal_predictors()) |>
  step_YeoJohnson(all_numeric_predictors()) |>
  step_normalize(all_numeric_predictors()) |>
  step_dummy(all_nominal_predictors())

Model specifications

A logistic regression (glm engine by default), a boosted tree (xgboost), and a neural network (nnet); hyperparameters to optimize are marked with tune():

spec_lr <- logistic_reg("classification")

spec_bt <- boost_tree(
  "classification",
  mtry = tune(), learn_rate = tune(),
  stop_iter = 10, trees = 500
)

spec_nn <- mlp("classification", hidden_units = tune(), penalty = tune())

Define all the workflows using {workflowsets}

wflow_set <- workflow_set(
  preproc = list(basic = recipe_basic, age = recipe_age),
  models = list(lr = spec_lr, bt = spec_bt, nn = spec_nn)
)
wflow_set
# A workflow set/tibble: 6 × 4
  wflow_id info             option    result    
  <chr>    <list>           <list>    <list>    
1 basic_lr <tibble [1 × 4]> <opts[0]> <list [0]>
2 basic_bt <tibble [1 × 4]> <opts[0]> <list [0]>
3 basic_nn <tibble [1 × 4]> <opts[0]> <list [0]>
4 age_lr   <tibble [1 × 4]> <opts[0]> <list [0]>
5 age_bt   <tibble [1 × 4]> <opts[0]> <list [0]>
6 age_nn   <tibble [1 × 4]> <opts[0]> <list [0]>

Measuring fairness in ML

equal_opportunity()

Equal opportunity: when a model’s predictions have the same true positive and false negative rates across protected groups

Correctly predict readmission and incorrectly predict non-readmission at the same rate across race groups.

Stakeholders: The patient

Since this metric does not consider false positives, it does not penalize disparately providing additional care resources to a patient who may not need them.
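
Since equal opportunity compares true positive rates across groups, a groupwise sensitivity tells the same story. A sketch, assuming a hypothetical tibble preds with columns readmitted (the truth), .pred_class (the prediction), and race; yardstick metrics are grouped-data aware:

preds |>
  group_by(race) |>
  sens(truth = readmitted, estimate = .pred_class)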

equalized_odds()

Equalized odds: when a model’s predictions have the same false positive, true positive, false negative, and true negative rates across protected groups.

Extends equal opportunity to additionally constrain based on the false positive and true negative rates.

Correctly predict both readmission and non-readmission and incorrectly predict readmission and non-readmission at the same rate across race groups.

Stakeholders: The patient

Beyond equal opportunity, it also aims to prevent group disparities in:

  1. Providing additional care resources to patients who may not need them (false positives)
  2. Correctly identifying patients who do not need additional care resources (true negatives)

demographic_parity()

Demographic parity: when a model’s predictions have the same predicted positive rate across groups.

Predict readmission at the same rate across race groups.

Does not depend on the true outcome readmitted - only the prediction of readmission.

Stakeholders: Those who would like to see additional care resources provisioned at the same rate across race groups, even if the actual need for those resources differs among groups (e.g. hospitals).
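
Because it depends only on the predictions, demographic parity reduces to comparing positive prediction rates. A sketch with the same hypothetical preds tibble:

preds |>
  group_by(race) |>
  summarize(pred_readmit_rate = mean(.pred_class == "Yes"))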

Metrics

  • equal_opportunity()
  • equalized_odds()
  • demographic_parity()

Values closer to zero indicate that a model is more fair.

  • accuracy()
  • roc_auc()
  • Other traditional metrics

Stakeholders: Those who prefer the most performant model regardless of group membership.

Create our metric set

m_set <- metric_set(
  accuracy,
  roc_auc,
  equal_opportunity(race),
  equalized_odds(race),
  demographic_parity(race)
)

m_set
A metric set, consisting of:
- `accuracy()`, a class metric                 | direction: maximize
- `roc_auc()`, a probability metric            | direction: maximize
- `equal_opportunity(race)()`, a class metric  | direction: minimize,
group-wise on: race
- `equalized_odds(race)()`, a class metric     | direction: minimize,
group-wise on: race
- `demographic_parity(race)()`, a class metric | direction: minimize,
group-wise on: race
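
A groupwise metric can also be called on its own, outside of a metric set. A sketch, again with the hypothetical preds tibble:

# equal_opportunity(race) returns a yardstick metric function
eo_by_race <- equal_opportunity(race)
eo_by_race(preds, truth = readmitted, estimate = .pred_class)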

Fit and evaluate the models

Fit our workflow set

set.seed(1)
wflow_set_fit <- workflow_map(
  wflow_set,
  seed = 1,
  metrics = m_set,
  resamples = readmission_folds
)

Fitted workflow set

The logistic regression workflows have no tuning parameters, so they were simply fit to the resamples (<rsmp[+]>); the boosted trees and neural networks were tuned over a default grid (<tune[+]>):

wflow_set_fit
# A workflow set/tibble: 6 × 4
  wflow_id info             option    result   
  <chr>    <list>           <list>    <list>   
1 basic_lr <tibble [1 × 4]> <opts[2]> <rsmp[+]>
2 basic_bt <tibble [1 × 4]> <opts[2]> <tune[+]>
3 basic_nn <tibble [1 × 4]> <opts[2]> <tune[+]>
4 age_lr   <tibble [1 × 4]> <opts[2]> <rsmp[+]>
5 age_bt   <tibble [1 × 4]> <opts[2]> <tune[+]>
6 age_nn   <tibble [1 × 4]> <opts[2]> <tune[+]>

Model selection

Rank the models

rank_results(wflow_set_fit, rank_metric = "roc_auc") |>
  filter(.metric == "roc_auc")
# A tibble: 42 × 9
   wflow_id .config         .metric  mean std_err     n preprocessor model  rank
   <chr>    <chr>           <chr>   <dbl>   <dbl> <int> <chr>        <chr> <int>
 1 age_bt   Preprocessor1_… roc_auc 0.605 0.00424    10 recipe       boos…     1
 2 basic_bt Preprocessor1_… roc_auc 0.605 0.00423    10 recipe       boos…     2
 3 age_bt   Preprocessor1_… roc_auc 0.604 0.00378    10 recipe       boos…     3
 4 age_bt   Preprocessor1_… roc_auc 0.603 0.00421    10 recipe       boos…     4
 5 basic_bt Preprocessor1_… roc_auc 0.603 0.00410    10 recipe       boos…     5
 6 basic_bt Preprocessor1_… roc_auc 0.603 0.00436    10 recipe       boos…     6
 7 age_bt   Preprocessor1_… roc_auc 0.602 0.00372    10 recipe       boos…     7
 8 basic_bt Preprocessor1_… roc_auc 0.599 0.00449    10 recipe       boos…     8
 9 age_bt   Preprocessor1_… roc_auc 0.599 0.00441    10 recipe       boos…     9
10 basic_lr Preprocessor1_… roc_auc 0.599 0.00441    10 recipe       logi…    10
# ℹ 32 more rows

Boosted tree model results

autoplot(wflow_set_fit, id = "age_bt")

Choose the best parameters

How do we choose a model that performs well both with respect to a typical performance metric like roc_auc() and the fairness metrics we’ve chosen?

Collect the metrics

# extract the tuning results for the boosted tree model
age_bt_metrics <- extract_workflow_set_result(wflow_set_fit, "age_bt") |>
  # collect the metrics associated with it
  collect_metrics() |>
  # pivot the metrics so that each is in a column
  pivot_wider(
    id_cols = c(mtry, learn_rate),
    names_from = .metric,
    values_from = mean
  )
age_bt_metrics
# A tibble: 10 × 7
    mtry learn_rate accuracy demographic_parity equal_opportunity equalized_odds
   <int>      <dbl>    <dbl>              <dbl>             <dbl>          <dbl>
 1    13    0.173      0.910           0.00262           0.0138         0.0141  
 2     8    0.0817     0.912           0.00122           0.00259        0.00309 
 3     7    0.00456    0.912           0                 0              0       
 4    17    0.0392     0.912           0.000586          0.00127        0.00179 
 5    10    0.00172    0.912           0                 0              0       
 6    19    0.00842    0.912           0.000103          0              0.000115
 7    15    0.0237     0.912           0.000398          0.000253       0.000568
 8     1    0.0146     0.912           0                 0              0       
 9     3    0.00268    0.912           0                 0              0       
10    21    0.222      0.909           0.00400           0.0145         0.0149  
# ℹ 1 more variable: roc_auc <dbl>

Calculate desirability

age_bt_metrics |>
  mutate(
    # higher roc values are better; detect max and min from the data
    d_roc     = d_max(roc_auc, use_data = TRUE),
    # lower equalized odds are better; detect max and min from the data
    d_e_odds  = d_min(equalized_odds, use_data = TRUE),
    # compute overall desirability based on d_roc and d_e_odds
    d_overall = d_overall(across(starts_with("d_")))
  ) |>
  select(roc_auc, d_roc, equalized_odds, d_e_odds, d_overall)
# A tibble: 10 × 5
   roc_auc d_roc equalized_odds d_e_odds d_overall
     <dbl> <dbl>          <dbl>    <dbl>     <dbl>
 1   0.563 0.113       0.0141     0.0538    0.0780
 2   0.587 0.614       0.00309    0.792     0.698 
 3   0.605 1           0          1         1     
 4   0.591 0.710       0.00179    0.880     0.790 
 5   0.603 0.964       0          1         0.982 
 6   0.599 0.870       0.000115   0.992     0.929 
 7   0.597 0.823       0.000568   0.962     0.890 
 8   0.602 0.948       0          1         0.974 
 9   0.604 0.983       0          1         0.991 
10   0.558 0           0.0149     0         0     
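
Under the hood, with use_data = TRUE, d_max() rescales a metric linearly from its observed minimum (desirability 0) to its observed maximum (desirability 1), d_min() does the reverse, and d_overall() takes the geometric mean. A rough check against the first row above (small gaps are rounding in the printed values):

(0.563 - 0.558) / (0.605 - 0.558)  # d_roc: position between observed min and max
(0.0149 - 0.0141) / (0.0149 - 0)   # d_e_odds: direction flipped for a minimize-metric
sqrt(0.113 * 0.0538)               # geometric mean, ~0.078, matching d_overall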

Choose the best parameters

best_params <- age_bt_metrics |>
  mutate(
    # higher roc values are better; detect max and min from the data
    d_roc     = d_max(roc_auc, use_data = TRUE),
    # lower equalized odds are better; detect max and min from the data
    d_e_odds  = d_min(equalized_odds, use_data = TRUE),
    # compute overall desirability based on d_roc and d_e_odds
    d_overall = d_overall(across(starts_with("d_")))
  ) |>
  # pick the model with the highest desirability value
  slice_max(d_overall)
best_params
# A tibble: 1 × 10
   mtry learn_rate accuracy demographic_parity equal_opportunity equalized_odds
  <int>      <dbl>    <dbl>              <dbl>             <dbl>          <dbl>
1     7    0.00456    0.912                  0                 0              0
# ℹ 4 more variables: roc_auc <dbl>, d_roc <dbl>, d_e_odds <dbl>,
#   d_overall <dbl>

Finalize the workflow

finalize_workflow() splices the selected hyperparameters into the workflow; last_fit() then fits it on the full training set and evaluates it once on the held-out test set:

final_model_config <- extract_workflow(wflow_set_fit, "age_bt") |>
  finalize_workflow(best_params)
final_model <- last_fit(final_model_config, readmission_splits, metrics = m_set)

Examine model performance

collect_metrics(final_model)
# A tibble: 5 × 5
  .metric            .estimator .estimate .by   .config             
  <chr>              <chr>          <dbl> <chr> <chr>               
1 accuracy           binary         0.912 <NA>  Preprocessor1_Model1
2 equal_opportunity  binary         0     race  Preprocessor1_Model1
3 equalized_odds     binary         0     race  Preprocessor1_Model1
4 demographic_parity binary         0     race  Preprocessor1_Model1
5 roc_auc            binary         0.602 <NA>  Preprocessor1_Model1

Further steps

  • Additional metrics focused on predictions for the minority class, e.g. sensitivity (see the sketch below)
  • Fairness related to sex
  • Fairness related to the intersection of race and sex
  • How are these predictions used by practitioners, and are they trusted by patients?
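
For instance, last_fit() keeps the test-set predictions, so class-level metrics like sensitivity can be computed directly. A sketch:

final_model |>
  collect_predictions() |>
  sens(truth = readmitted, estimate = .pred_class)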

Wrap-up

Recap

  • Fairness is an important goal of machine learning, but one that has historically been ignored
  • Quantifying fairness allows us to evaluate models on multiple dimensions
  • There is no single definition of “fairness”; the appropriate one must be identified with stakeholders before the modeling project begins
