Lecture 15
Cornell University
INFO 4940/5940 - Fall 2025
October 21, 2025
ae-14
Instructions
ae-14 (repo name will be suffixed with your GitHub name).
Run renv::restore() to install the required packages, open the Quarto document in the repo, and follow along and complete the exercises.

Vetiver, the oil of tranquility, is used as a stabilizing ingredient in perfumery to preserve more volatile fragrances.
you can operationalize that model!
you likely should be the one to operationalize that model!
price is the outcome
beds, baths, area, and year_built are numeric predictors
town and municipality could be nominal predictors
sold_date could be a date predictor

sold_date | price | beds | baths | area | lot_size | year_built | hoa_month | town | municipality | long | lat |
---|---|---|---|---|---|---|---|---|---|---|---|
2022-08-16 | 335500 | 3 | 2.0 | 1957 | 4.50000000 | 1880 | NA | Ulysses | Unincorporated | -76.67680 | 42.53255 |
2022-11-14 | 331500 | 3 | 2.0 | 1416 | 0.58999082 | 1930 | NA | Lansing | Unincorporated | -76.50347 | 42.53340 |
2022-03-31 | 302385 | 3 | 1.5 | 1476 | 0.20000000 | 1900 | NA | Ithaca | Ithaca city | -76.50439 | 42.44250 |
2022-09-28 | 285000 | 3 | 2.0 | 1728 | 0.46999541 | 2002 | NA | Dryden | Dryden village | -76.29495 | 42.48415 |
2022-07-22 | 350000 | 4 | 1.0 | 1698 | 0.12396694 | 1925 | NA | Ithaca | Ithaca city | -76.50146 | 42.43264 |
2023-11-28 | 225000 | 2 | 1.5 | 1047 | 0.08000459 | 1939 | NA | Ithaca | Ithaca city | -76.50576 | 42.43373 |
2023-09-13 | 285000 | 3 | 2.0 | 2311 | 1.26999541 | 1965 | NA | Caroline | Unincorporated | -76.33375 | 42.39048 |
2023-06-23 | 145000 | 2 | 2.0 | 1215 | 0.03999082 | 1990 | NA | Danby | Unincorporated | -76.49228 | 42.38340 |
2023-11-27 | 90900 | 5 | 3.0 | 2238 | 0.38000459 | 1880 | NA | Groton | Groton village | -76.36311 | 42.58533 |
2022-11-09 | 467500 | 6 | 4.0 | 2304 | 0.13000459 | 2017 | NA | Ithaca | Ithaca city | -76.50205 | 42.43136 |
Or your model of choice!
Activity
Split your data into training and testing sets.
Fit a model to your training data.
05:00
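A minimal Python sketch of this activity, assuming scikit-learn and pandas are installed. It uses only the ten sample rows from the table above, which is far too few for a real model, and the names `housing`, `features`, and `housing_fit` are chosen here for illustration.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# The ten sample rows shown in the table above (numeric columns only).
housing = pd.DataFrame({
    "price": [335500, 331500, 302385, 285000, 350000,
              225000, 285000, 145000, 90900, 467500],
    "beds": [3, 3, 3, 3, 4, 2, 3, 2, 5, 6],
    "baths": [2.0, 2.0, 1.5, 2.0, 1.0, 1.5, 2.0, 2.0, 3.0, 4.0],
    "area": [1957, 1416, 1476, 1728, 1698, 1047, 2311, 1215, 2238, 2304],
    "year_built": [1880, 1930, 1900, 2002, 1925, 1939, 1965, 1990, 1880, 2017],
})
features = ["beds", "baths", "area", "year_built"]

# Hold out 20% of the rows for testing; the seed makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    housing[features], housing["price"], test_size=0.2, random_state=123
)

# Fit a linear regression on the training rows only.
housing_fit = LinearRegression().fit(X_train, y_train)
print(housing_fit.predict(X_test))
```

Any scikit-learn estimator could replace `LinearRegression` here; the rest of the workflow does not depend on the model type.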
── tompkins-housing ─ <bundled_workflow> model for deployment
A lm regression modeling workflow using 4 features
Activity
Create your vetiver model object.
Check out the default description that is created, and try out using a custom description.
Show your custom description to your neighbor.
05:00
Data, models, R/Python objects, etc.
❌ Email
❌ GitHub
🫤 Shared network drive
🫤 Dropbox, Google Drive, Box.com, etc.
✅ Amazon S3
✅ Azure
✅ Google Cloud
✅ Microsoft 365
The pins package publishes data, models, and other R and Python objects, making it easy to share them across projects and with your colleagues.
Creating new version '20251017T160311Z-ed6ba'
Writing to pin 'tompkins-housing'
Create a Model Card for your published model
• Model Cards provide a framework for transparent, responsible reporting
• Use the vetiver `.Rmd` template as a place to start
Model Cards provide a framework for transparent, responsible reporting.
Use the vetiver `.qmd` Quarto template as a place to start,
with vetiver.model_card()
Writing pin:
Name: 'tompkins-housing'
Version: 20251017T120311Z-1ee78
Activity
Pin your vetiver model object to a temporary board.
Retrieve the model metadata with pin_meta().
05:00
rf_rec <- recipe(
  price ~ beds + baths + area + year_built + town,
  data = housing_train
) |>
  step_impute_mean(all_numeric_predictors()) |>
  step_impute_mode(all_nominal_predictors())

housing_fit <- workflow() |>
  add_recipe(rf_rec) |>
  add_model(rand_forest(trees = 200, mode = "regression")) |>
  fit(data = housing_train)
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# Define feature columns
numeric_features = ["beds", "baths", "area", "year_built"]
categorical_features = ["town"]

# Create preprocessing steps
numeric_transformer = SimpleImputer(strategy="mean")
categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore"))
])

# Combine preprocessors
preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features)
    ]
)

# Create pipeline with preprocessor and model
housing_fit = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("regressor", RandomForestRegressor(n_estimators=200, random_state=123))
])

# Prepare training data with all features
X_train_full = housing.loc[X_train.index, numeric_features + categorical_features]
housing_fit.fit(X_train_full, y_train)
Pipeline(steps=[('preprocessor', ColumnTransformer(transformers=[('num', SimpleImputer(), ['beds', 'baths', 'area', 'year_built']), ('cat', Pipeline(steps=[('imputer', SimpleImputer(strategy='most_frequent')), ('onehot', OneHotEncoder(handle_unknown='ignore'))]), ['town'])])), ('regressor', RandomForestRegressor(n_estimators=200, random_state=123))])
Activity
Create a new vetiver model object using your linear regression model that is explicitly versioned = TRUE and pin it to your temporary board.
Then train a random forest model and create a new vetiver model object that is also versioned = TRUE with the same name.
Write this new version of your model to the same pin, and see what versions you have with pin_versions().
05:00
REST API
An interface that can connect applications in a standard way
RESTful queries
Activity
Create a vetiver API for your model and run it locally.
Explore the visual documentation.
How many endpoints are there?
Discuss what you notice with your neighbor.
05:00
Image credit: Isabel Zimmerman
requests or {httr2}
/predict endpoint
Any tool that can make an HTTP request can be used to interact with your model API!
You can treat your model API much like it is a local model in memory!
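To make the HTTP mechanics concrete, here is a standard-library-only sketch. It is a stand-in, not vetiver's server: the toy "model" always predicts the mean sale price of the ten sample rows shown earlier, and the `/ping` and `/predict` response shapes are assumptions chosen for illustration. The client side uses `urllib`, but `requests`, {httr2}, or curl would send the same requests.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Sale prices from the ten sample rows shown earlier in the lecture.
PRICES = [335500, 331500, 302385, 285000, 350000,
          225000, 285000, 145000, 90900, 467500]
BASELINE = sum(PRICES) / len(PRICES)  # toy "model": always predict the mean


class ModelHandler(BaseHTTPRequestHandler):
    def _send_json(self, payload, status=200):
        body = json.dumps(payload).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def do_GET(self):
        if self.path == "/ping":
            self._send_json({"status": "online"})
        else:
            self._send_json({"error": "not found"}, status=404)

    def do_POST(self):
        if self.path != "/predict":
            self._send_json({"error": "not found"}, status=404)
            return
        length = int(self.headers.get("Content-Length", 0))
        rows = json.loads(self.rfile.read(length))
        # One prediction per incoming row.
        self._send_json({"predict": [BASELINE for _ in rows]})

    def log_message(self, format, *args):
        pass  # silence per-request logging


def call(url, payload=None):
    """GET when payload is None, otherwise POST the payload as JSON."""
    data = None if payload is None else json.dumps(payload).encode("utf-8")
    request = urllib.request.Request(
        url, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())


# Port 0 asks the operating system for any free port.
server = HTTPServer(("127.0.0.1", 0), ModelHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
base_url = f"http://127.0.0.1:{server.server_port}"

ping = call(f"{base_url}/ping")
new_homes = [{"beds": 3, "baths": 2.0, "area": 1957, "year_built": 1880}]
prediction = call(f"{base_url}/predict", new_homes)
print(ping, prediction)

server.shutdown()
server.server_close()
```

The client never imports the model or its dependencies; it only needs the URL and the JSON shape of a row, which is what lets other applications use the model.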
Activity
Create a vetiver endpoint object for your API.
Predict with your endpoint for new data.
Optional: call another endpoint like /ping or /metadata.
05:00