Proposal

Project 01

Modified: September 26, 2025

There are two main purposes of the project proposal:

To help you think about the project early, so you can get a head start on finding data, reading relevant literature, thinking about the questions you wish to answer, etc.
To ensure that the data you wish to analyze, methods you plan to use, and the scope of your analysis are feasible and will allow you to be successful for this project.

Identify 3 problems you’re interested in potentially solving for the project. These problems should be solved using a classical machine learning model and real-world data. If you’re unsure where to find data, you can use the list of potential data sources in the Resources for datasets section as a starting point. It may also help to think of topics you’re interested in investigating and find datasets on those topics.

Write the proposal in the proposal.qmd file in your project repo.

Important

You must use one of the topics in the proposal for the final project, unless instructed otherwise when given feedback.

Proposal components

For each topic, include the following components.

Problem or question

What is the problem you will solve?

A well-formulated objective. (You may include more than one objective if you want feedback on different ideas for your project, but at least one per topic is required.)
Statement on why this topic is important.
Identify the types of variables you will use. Categorical? Quantitative?
How will you make the model usable? As an interactive web application, such as a Shiny app? As a deployable API?

Introduction and data

You need to begin identifying potential datasets for your project as early as possible. You should include at least one dataset for each topic you propose; keep in mind the project will likely incorporate data from multiple sources.

Identify the source of the data.
State when and how it was originally collected (by the original data curator, not necessarily how you found the data).
Write a brief description of the observations.
Address ethical concerns about the data, if any.

You may not use data from a secondary data archive

In the plainest terms, do not use datasets you find on Kaggle or in the UCI Machine Learning Repository. Your data should come from your own collection process (e.g., an API or web scraping) or from the primary source (e.g., a government agency or research group).

Please ask a member of the course staff if you’re unsure whether your dataset meets the criteria.
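If you collect the data yourself from an API, the general shape of the process can be sketched as follows. This is a minimal sketch using only the Python standard library, not a required workflow: the function names `fetch_records` and `save_raw`, the dated file name, and the assumption that the endpoint returns a JSON array of objects are all illustrative.

```python
import csv
import json
import urllib.request
from datetime import date
from pathlib import Path


def fetch_records(url: str) -> list[dict]:
    """Download a JSON array of objects from `url` (a hypothetical endpoint)."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return json.load(response)


def save_raw(records: list[dict], name: str, out_dir: str = "data-raw") -> Path:
    """Write records to <out_dir>/<name>_<YYYY-MM-DD>.csv and return the path."""
    folder = Path(out_dir)
    folder.mkdir(parents=True, exist_ok=True)
    # Use the union of keys, in first-seen order, so no field is dropped.
    fieldnames = list(dict.fromkeys(key for row in records for key in row))
    path = folder / f"{name}_{date.today().isoformat()}.csv"
    with path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(records)  # keys missing from a row are left blank
    return path
```

Calling `save_raw(fetch_records(url), "observations")`, where `url` stands for an endpoint documented by the primary source, would write a dated CSV file into the data-raw folder. Putting the retrieval date in the file name is one way to keep track of when the data was collected.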

Glimpse of data

For each dataset:

Place the file containing your data in the data-raw folder of the project repo.
Provide a brief statistical overview of the data.
- R: Use the skimr::skim() function to provide a glimpse of the dataset.
- Python: Use skimpy or a similar package to provide a comprehensive statistical overview of the dataset.
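To illustrate what such an overview reports, the sketch below summarizes each column of a CSV file using only the Python standard library. It is a stand-in for illustration, not a substitute for skimpy or skimr. The function name `overview`, the CSV input, and the choice to treat empty strings and "NA" as missing are assumptions.

```python
import csv
import statistics
from collections import Counter


def _to_float(value):
    """Return value as a float, or None if it is not numeric."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def overview(path: str) -> dict:
    """Summarize each column of a CSV file as quantitative or categorical."""
    with open(path, newline="", encoding="utf-8") as handle:
        rows = list(csv.DictReader(handle))
    summary = {}
    for column in (rows[0].keys() if rows else []):
        values = [row[column] for row in rows]
        # Assumption: empty strings, "NA", and absent cells count as missing.
        present = [v for v in values if v not in ("", "NA", None)]
        numbers = [_to_float(v) for v in present]
        info = {"n": len(values), "missing": len(values) - len(present)}
        if present and all(n is not None for n in numbers):
            info.update(type="quantitative", mean=statistics.mean(numbers),
                        min=min(numbers), max=max(numbers))
        else:
            info.update(type="categorical", n_unique=len(set(present)),
                        most_common=Counter(present).most_common(1))
        summary[column] = info
    return summary
```

The per-column type in the result (quantitative or categorical) also answers the question about variable types in the "Problem or question" component.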

Did you install a new package for the proposal?

R: Update your {renv} lockfile (renv.lock) to include the new package(s) for maximum reproducibility.

Python: If you installed the package using uv add, your pyproject.toml and uv.lock files should already be updated.

Resources for datasets

You can find data wherever you like, but here are some recommendations to get you started. You shouldn’t feel constrained to datasets that are already in a tidy format: you can start with data that needs cleaning and tidying, scrape data off the web, or collect your own data.

Evaluation criteria

Category	Less developed projects	Typical projects	More developed projects
Topic ideas	Fewer than three topics are included. Topics are vague, or the data are impossible or excessively difficult to collect.	Three topics are included, and all or most of the data could feasibly be collected or accessed by the end of the project. Each dataset is described alongside a note about its availability, with the source cited.	All expectations of typical projects, plus each dataset could reasonably be part of a data science project driven by an interesting research question or objective.