```r
# tune some complex models
tune_rf_res <- tune_grid(...)
tune_lgbm_res <- tune_grid(...)

# save each result as a single object
write_rds(tune_rf_res, file = "output/tune_rf_res.rds")
write_rds(tune_lgbm_res, file = "output/tune_lgbm_res.rds")

# or save them together
save(tune_rf_res, tune_lgbm_res, file = "output/tune_results.RData")
```
## Draft
The purpose of the draft is to give you an opportunity to get early feedback on your analysis. Therefore, the draft will focus primarily on the exploratory analysis and initial drafts of the final product(s).
Write the draft write-up in the `report.qmd` file in your project repo. This should document your modeling strategies to date. At minimum, you are expected to include:
- Objective(s). State the problem(s) you are solving clearly.
- Data description. Your data description should be about your analysis-ready data.
- Decisions based on EDA. Based on your exploratory analysis, explicitly identify the decisions you made about your data (e.g., what to exclude, what to include, what to transform). These could be steps you take to preprocess the data before partitioning into training/test sets, or possible feature engineering steps you will evaluate in the modeling phase.
- Resampling strategy. All teams are expected to partition their data into training/test sets using an appropriate strategy. Many teams will further partition the training set using a resampling strategy. Document your resampling strategy here and justify the approach you chose.
- Overview of modeling strategies. Provide an overview of the modeling strategies you plan to use. This should include a brief description of the models you plan to use, potential preprocessing or feature engineering steps, tuning parameters, and the evaluation metrics you plan to use to compare models.
- Initial results. Report any initial results you have. This should at least include a null model, as well as any of the modeling strategies from above that you have already tested. Include any relevant evaluation metrics and techniques we have learned in this class.
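For teams working in Python, the partition-plus-resampling expectation above can be sketched with the standard library alone. This is a minimal illustration, not a prescribed approach; the 80/20 split, seed, and fold count are all assumptions you should adapt to your data:

```python
import random

random.seed(42)  # make the partition reproducible

# stand-in for row indices of the analysis-ready data
ids = list(range(100))
random.shuffle(ids)

# 80/20 train/test partition
cut = int(0.8 * len(ids))
train, test = ids[:cut], ids[cut:]

# 5-fold cross-validation assignments within the training set only;
# the test set stays held out until final evaluation
k = 5
folds = [train[i::k] for i in range(k)]
```

In practice you would use a dedicated tool for this step, such as {rsample} in R or scikit-learn's `KFold`/`train_test_split` in Python, which handle stratification and grouping for you.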
### `report.qmd` file
You may fit all the models in a separate R/Python script or Quarto file and save/export any appropriate model objects so you can report relevant metrics or create visualizations/tables on the models’ performance. You should include the results of the models in the `report.qmd` file.
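The same save-in-script, load-in-report workflow in Python typically goes through `pickle` (or `joblib` for large scikit-learn objects). A minimal sketch, where the output path and the stand-in results object are illustrative:

```python
import pickle
from pathlib import Path

# stand-in for a fitted model or tuning result; any picklable object
# (e.g. a scikit-learn estimator) is saved the same way
best_rf = {"model": "random forest", "cv_rmse": 0.42}

out = Path("output")
out.mkdir(exist_ok=True)

# in the modeling script: save the object to disk
with open(out / "best_rf.pkl", "wb") as f:
    pickle.dump(best_rf, f)

# in report.qmd: load it back to build tables/figures
with open(out / "best_rf.pkl", "rb") as f:
    reloaded = pickle.load(f)
```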
Standard R objects can be saved to disk using `readr::write_rds()` or `save()`. For model `fit()` objects, you will likely want to `butcher()` the object to reduce its overall size (otherwise the file size may be hundreds of megabytes).
```r
# fit the best rf model
best_rf <- fit_best(tune_rf_res)

# shrink the fitted model before saving it to disk
best_rf_lite <- butcher(best_rf)
```
Read the documentation for {butcher} for more information.
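There is no direct Python analogue to `butcher()`, but if pickled model files grow large, compressing the stream with `gzip` is a common workaround. A sketch with an illustrative stand-in object and file name:

```python
import gzip
import pickle

# stand-in for a bulky fitted model object
big_object = {"coefs": list(range(10000))}

# write a gzip-compressed pickle ...
with gzip.open("best_rf.pkl.gz", "wb") as f:
    pickle.dump(big_object, f)

# ... and read it back transparently
with gzip.open("best_rf.pkl.gz", "rb") as f:
    restored = pickle.load(f)
```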
## Evaluation criteria
| Category | Less developed projects | Typical projects | More developed projects |
|---|---|---|---|
| Objectives | Objective is not clearly stated or significantly limits potential analysis. | Clearly states the objective(s), which have moderate potential for relevant impact. | Clearly states complex objective(s) that lead to significant potential for relevant impact. |
| Data description | Simple description of some aspects of the dataset, with little consideration for sources. The description is missing answers to applicable questions detailed in the “Datasheets for Datasets” paper. | Answers all relevant questions in the “Datasheets for Datasets” paper. | All expectations of typical projects + credits and values data sources. |
| Decisions based on EDA | Identifies minimal actions taken during the modeling stage based on the results of the EDA. Actions are unlikely to affect predictions. | Identifies concrete actions taken during the modeling stage based on the results of the EDA. | All expectations of typical projects + actions demonstrate deliberate and careful consideration of the exploratory analysis. |
| Resampling strategy | Does not use resampling methods (or uses an inappropriate method) to ensure robust model evaluation. | Selects an appropriate resampling strategy. | All expectations of typical projects + provides a thorough justification for the resampling strategy. |
| Modeling strategies | Includes only simplistic models. Does not demonstrate understanding of the types of models covered in the course. Feature engineering steps are non-existent. Does not select evaluation metrics, or metrics are not appropriate to the objective(s) + models. | Identifies several modeling strategies which could generate a high-performance model. Documents relevant feature engineering steps to be evaluated for specific modeling strategies. Steps are selectively applied to appropriate models. Evaluation metrics are appropriate for the objective(s) + models. | All expectations of typical projects + provides a thorough explanation for the models/feature engineering/metrics. Shows care in selecting their methods. |
| Initial results | Only reports results of a null model. Results are presented in a disjointed or uninterpretable manner. | Reports the results of some (but not all) of their modeling strategies. Results are presented in a clear and interpretable manner. | Reports the results of the majority of their modeling strategies. Results are effectively communicated through the use of tables and/or figures. |