Monitor models

Lecture 22

Dr. Benjamin Soltoff

Cornell University
INFO 4940/5940 - Fall 2024

December 3, 2024

Announcements

Announcements

  • Adjustments to project components
  • Homework 05
  • Homework 06?

Learning objectives

  • Define model performance
  • Identify causes of model drift
  • Monitor model inputs using summary statistics and visualizations
  • Monitor model outputs using metrics and visualizations
  • Generate Quarto dashboards to communicate model monitoring artifacts

Monitoring models

Data for model development

Data that you use while building a model for training/testing

library(readr)

housing_train <- read_csv("data/housing_train.csv")
housing_val <- read_csv("data/housing_val.csv")

Data for model monitoring

New data that you predict on after your model is deployed

housing_monitor <- read_csv("data/housing_monitor.csv")

Data for model monitoring

My model is performing well!

👩🏼‍🔧 My model returns predictions quickly, doesn’t use too much memory or processing power, and doesn’t have outages.

Metrics

  • Latency
  • Memory and CPU usage
  • Uptime

My model is performing well!

👩🏽‍🔬 My model returns predictions that are close to the true values for the predicted quantity.

Metrics

  • Accuracy
  • ROC AUC
  • F1 score
  • RMSE
  • Log loss
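As a minimal sketch for a regression problem like our housing prices (the housing_val_preds data frame of validation-set predictions and its columns are hypothetical here), metrics like RMSE can be computed with {yardstick}:

library(tidymodels)

# bundle regression metrics into a single metric set
reg_metrics <- metric_set(rmse, rsq, mae)

# housing_val_preds: hypothetical validation predictions with truth (price) and estimate (.pred)
reg_metrics(housing_val_preds, truth = price, estimate = .pred)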

Model drift 📉

Model drift 📉

Degradation of ML model performance due to changes in the data or in the relationships between input and output variables.

Causes of model drift

  • Data drift
  • Concept drift
    • Seasonal
    • Sudden
    • Gradual

When should you retrain your model? 🧐

⏱️ Your turn

Activity

Using our data, what could be an example of data drift? Concept drift?

05:00

Monitor your model’s inputs

Monitor your model’s inputs

Typically it is most useful to compare to your model development data
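One way to make that comparison, sketched with {ggplot2} (assuming the price column is present in both the training and monitoring data):

library(tidyverse)

# stack the development and monitoring data, labeled by source
bind_rows(
  train = housing_train,
  monitor = housing_monitor,
  .id = "source"
) |>
  ggplot(aes(x = price, fill = source)) +
  geom_density(alpha = 0.5)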

Monitor your model’s inputs

Monitor your model’s inputs

Application exercise

ae-21

  • Go to the course GitHub org and find your ae-21 (repo name will be suffixed with your GitHub name).
  • Clone the repo in RStudio, run renv::restore() to install the required packages, open the Quarto document in the repo, and follow along and complete the exercises.
  • Render, commit, and push your edits by the AE deadline – end of the day

⏱️ Your turn

Activity

Create a plot or table comparing the development vs. monitoring distributions of a model input/feature.

How might you make this comparison if you didn’t have all the model development data available when monitoring?

What summary statistics might you record during model development, to prepare for monitoring?

07:00

Monitor your model’s outputs

Monitor your model’s outputs

  • If a realtor used a model like this one before putting a house on the market, they would get:
    • A predicted price from their model
    • A real price result after the home was sold
  • In this case, we can monitor our model’s statistical performance
  • If you don’t ever get a “real” result, you can still monitor the distribution of your outputs

Monitor your model’s outputs

library(vetiver)
library(tidymodels)

# connect to the deployed model's prediction API
url <- "http://appliedml.infosci.cornell.edu:2300/predict"
endpoint <- vetiver_endpoint(url)

# predict on the new data, compute metrics by month, and plot them over time
augment(endpoint, new_data = housing_monitor) |>
  vetiver_compute_metrics(
    date_var = sold_date,
    period = "month",
    truth = price,
    estimate = .pred,
    metric_set = metric_set(rmse, rsq, mae)
  ) |>
  vetiver_plot_metrics()

Monitor your model’s outputs

⏱️ Your turn

Activity

Use the functions for metrics monitoring from {vetiver} to create a monitoring visualization.

Choose a different set of metrics or time aggregation.

Note that there are functions for using {pins} as a way to version and update monitoring results too!
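For example, a minimal sketch of versioning monitoring metrics with {pins} (the board and pin name here are illustrative, not the course's setup):

library(pins)
library(vetiver)
library(tidymodels)

# a board to store versioned artifacts (could also be board_connect(), board_s3(), etc.)
board <- board_local()

# compute metrics for the new monitoring data
new_metrics <- augment(endpoint, new_data = housing_monitor) |>
  vetiver_compute_metrics(
    date_var = sold_date,
    period = "month",
    truth = price,
    estimate = .pred
  )

# write (or update) the versioned metrics on the board
vetiver_pin_metrics(board, new_metrics, "housing-monitoring-metrics")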

05:00

Feedback loops 🔁

Deployment of an ML model may alter the training data

  • Movie recommendation systems on Netflix, Disney+, Hulu, etc.
  • Identifying fraudulent credit card transactions at Stripe
  • Recidivism models

Feedback loops can have unexpected consequences

Feedback loops 🔁

  • Users take some action as a result of a prediction
  • Users rate or correct the quality of a prediction
  • Produce annotations (crowdsource or expert)
  • Produce feedback automatically

⏱️ Your turn

Activity

What is a possible feedback loop for the Tompkins County housing data?

Do you think your example would be harmful or helpful? To whom?

05:00

ML metrics ➡️ organizational outcomes

  • Are machine learning metrics like F1 score or RMSE what matter to your organization?
  • Consider how ML metrics are related to important outcomes or KPIs for your business or org
  • There isn’t always a 1-to-1 mapping 😔
  • You can partner with stakeholders to monitor what’s truly important about your model

⏱️ Your turn

Activity

Let’s say that the most important organizational outcome for an Ithaca realtor is how accurate a pricing model is in percentage terms rather than in absolute dollars. (Think about being 20% wrong vs. $20,000 wrong.)

We can measure this with the mean absolute percentage error (MAPE).
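For $n$ sold homes with true prices $y_i$ and predicted prices $\hat{y}_i$,

$$
\text{MAPE} = \frac{100}{n} \sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right|
$$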

Compute this quantity with the monitoring data, and aggregate by week/month, number of bedrooms/bathrooms, or town location.

If you have time, make a visualization showing your results.

07:00

ML metrics ➡️ organizational outcomes

# convert back to raw dollars for calculating percentage error
augment(endpoint, housing_monitor) |>
  mutate(
    .pred = 10^.pred,
    price = 10^price
  ) |>
  group_by(town) |>
  mape(price, .pred)
# A tibble: 12 × 4
   town          .metric .estimator .estimate
   <chr>         <chr>   <chr>          <dbl>
 1 Caroline      mape    standard       22.8 
 2 Cortlandville mape    standard        5.88
 3 Danby         mape    standard       47.1 
 4 Dryden        mape    standard       31.5 
 5 Enfield       mape    standard       21.1 
 6 Groton        mape    standard       47.2 
 7 Harford       mape    standard      113.  
 8 Hector        mape    standard       19.8 
 9 Ithaca        mape    standard       20.3 
10 Lansing       mape    standard       24.4 
11 Newfield      mape    standard       24.3 
12 Ulysses       mape    standard       27.4 
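A similar sketch, aggregating MAPE by month rather than by town (again converting the log10-scale prices and predictions back to raw dollars first):

# convert back to raw dollars, then compute monthly MAPE
augment(endpoint, housing_monitor) |>
  mutate(
    .pred = 10^.pred,
    price = 10^price
  ) |>
  vetiver_compute_metrics(
    date_var = sold_date,
    period = "month",
    truth = price,
    estimate = .pred,
    metric_set = metric_set(mape)
  )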

Possible model monitoring artifacts

Possible model monitoring artifacts

  • Ad hoc analysis that you post in Slack
  • Report that you share in Google Drive
  • Fully automated dashboard published online

Possible model monitoring artifacts

⏱️ Your turn

Demonstration

Create a Quarto dashboard for model monitoring.
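As a rough starting point (the layout and chunk contents are illustrative, not a prescribed template), a monitoring dashboard is a Quarto document whose YAML header sets format: dashboard, with code chunks rendered as cards:

---
title: "Housing model monitoring"
format: dashboard
---

## Row

```{r}
#| title: Monthly model metrics
# e.g., augment(endpoint, housing_monitor) |> vetiver_compute_metrics(...) |> vetiver_plot_metrics()
```

## Row

```{r}
#| title: Input distributions (development vs. monitoring)
# e.g., density plots comparing housing_train and housing_monitor features
```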

10:00

Wrap-up

Recap

  • Model drift can occur due to changes in data or relationships between input and output variables
  • Monitor your models to catch drift early and retrain as necessary
  • Create dashboards to communicate monitoring artifacts within your organization

Acknowledgments