Make a model

Lecture 3

Dr. Benjamin Soltoff

Cornell University
INFO 4940/5940 - Fall 2025

September 2, 2025

Announcements

Announcements

  • Complete Homework 00 by Wednesday at 11:59pm

Learning objectives

  • Identify the purpose of a predictive model
  • Categorize models based on common characteristics
  • Distinguish between interpretable and black-box models
  • Evaluate trade-offs between model complexity and interpretability
  • Identify packages for defining models in R and Python
  • Review the Quarto document format

What is machine learning?

An AI generated image of a whimsical house with a front porch and vultures on the roof, not too spring-like

Examples of supervised learning

  • Will a user click on this ad?
  • Will this property flood in the next year?
  • Will a police officer engage in misconduct in the next six months?
  • How many individuals will become infected with COVID-19 in the next week?
  • What will be the volatility of the stock market over the next month?

Two modes

Classification

Will this home sell in the next 30 days?

Regression

What will the sale price be for this home?

What is the goal of machine learning?

Build models that

generate accurate predictions*

for future, yet-to-be-seen data.

Build a model

Predictive models

A computer program that learns from data to make predictions

Many different types of models or algorithms

Common supervised learning models

  • Linear regression
  • Logistic regression
  • Generalized linear models
  • Naive Bayes
  • Penalized regression (LASSO, Ridge)
  • Regression splines
  • Local regression
  • Generalized additive models
  • Multivariate adaptive regression splines (MARS)
  • Decision trees
  • Bagged trees
  • Random forests
  • Boosting (GBM, XGBoost, LightGBM, CatBoost)
  • Support vector machines
  • Neural networks

Linear regression

Find an equation that takes the form

\[\hat{Y} = b_0 + b_1 X\]

which minimizes the sum of squared errors (SSE)

Decision tree

To predict the outcome of a new data point:

  • Uses rules learned from splits
  • Each split maximizes information gain

Neural network

  • Pass information forward through layers of interconnected nodes
  • Each node applies a transformation (activation function) to the inputs
  • Layers are used to identify patterns in the data and make predictions

📝 How do you know if a model is good?

Instructions

With a partner, identify characteristics that make for a “good” predictive model.

04:00

Good is in the eye of the beholder

Quantitative measures of “good”

  • Accuracy
  • ROC AUC
  • Brier score

Qualitative measures of “good”

  • How did it make this prediction?
  • Is the model equitable across different groups?
  • Can I explain the results to a non-technical stakeholder?

Interpretability-complexity trade-off

A graph showing the trade-off between predictive power and interpretability. As predictive power increases, interpretability decreases.

Choosing the desirable qualities for a model

📝 Identify a desirable model

Instructions

With a partner, discuss the following predictive problems and identify what characteristics you would want in a model. Carefully consider the competing desires of the stakeholders.

  • Assessing personal property value for tax purposes in Tompkins County, NY
    • The Assessor’s Office
    • Tax jurisdiction (e.g. City of Ithaca)
    • Property owners
  • Approving mortgage applications
    • Lending institution
    • Loan applicants
    • Government regulators
12:00

Defining models in R and Python

scikit-learn

  • Python library for machine learning
  • Almost 20 years old
  • Supports classification, regression, clustering, and more
  • Feature preprocessing and model evaluation tools
  • Directly implements machine learning algorithms

scikit-learn example

from sklearn.linear_model import LinearRegression
import pandas as pd
from palmerpenguins import load_penguins

# Load and clean data
penguins = load_penguins()
df = penguins.dropna()
X = df[["bill_length_mm"]]
y = df["body_mass_g"]

# Fit model
model = LinearRegression()
model.fit(X, y)

# Predict
y_pred = model.predict(X)

{tidymodels}

Core {tidymodels} packages

  • {parsnip}: a unified interface for creating and using models
  • {recipes}: a framework for preprocessing and feature engineering
  • {workflows}: a way to bundle models and preprocessing together
  • {tune}: tools for hyperparameter tuning
  • {rsample}: functions for resampling data sets
  • {yardstick}: functions for measuring model performance
  • {dials}: tools for creating and managing tuning parameters

{tidymodels} example

library(tidymodels)

# Load and clean data
data("penguins", package = "datasets")

# Define model
model <- linear_reg() |>
  set_engine("lm")

# Fit model
fit <- model |>
  fit(body_mass ~ bill_len, data = penguins)

# Predict
pred <- predict(fit, new_data = penguins)

Alternative ML ecosystems

Quarto

Major components

  1. A YAML header surrounded by ---s
  2. Cells of code surounded by ```
  3. Text mixed with simple text formatting using the Markdown syntax

Rendering process

A schematic representing rendering of Quarto documents from .qmd, to knitr or jupyter, to plain text markdown, then converted by pandoc into any number of output types including html, PDF, or Word document.

Rendering process

A schematic representing the multi-language input (e.g. Python, R, Observable, Julia) and multi-format output (e.g. PDF, html, Word documents, and more) versatility of Quarto.

⌨️ Render a Quarto document

Instructions

Follow along with the demo and continue working on hw-00.

Wrap-up

Recap

  • Machine learning is a set of tools for building predictive models
  • There are many different types of models with different strengths and weaknesses
  • Model choice depends on the problem and stakeholders
  • scikit-learn and {tidymodels} are two popular ecosystems for defining models in Python and R, respectively
  • Quarto is a powerful tool for authoring documents that combine code and text