Proposal
There are two main purposes of the project proposal:
- To help you think about the project early, so you can get a head start on finding data, reading relevant literature, thinking about the questions you wish to answer, etc.
- To ensure that the data you wish to analyze, methods you plan to use, and the scope of your analysis are feasible and will allow you to be successful for this project.
Identify 3 problems you’re interested in potentially solving for the project. These problems should be solved using a classical machine learning model and real-world data. If you’re unsure where to find data, you can use the list of potential data sources in the Resources for datasets section as a starting point. It may also help to think of topics you’re interested in investigating and find datasets on those topics.
Write the proposal in the proposal.qmd file in your project repo.
You must use one of the topics in the proposal for the final project, unless instructed otherwise when given feedback.
Proposal components
For each topic, include the following components.
Problem or question
What is the problem you will solve?
- A well-formulated objective. (You may include more than one idea if you want to receive feedback on different ideas for your project. However, only one per topic is required.)
- Statement on why this topic is important.
- Identify the types of variables you will use. Categorical? Quantitative?
- How will you make the model usable? An interactive web application a la Shiny? A deployable API?
Introduction and data
You need to begin identifying potential datasets for your project as early as possible. You should include at least one dataset for each topic you propose; keep in mind the project will likely incorporate data from multiple sources.
- Identify the source of the data.
- State when and how it was originally collected (by the original data curator, not necessarily how you found the data).
- Write a brief description of the observations.
- Address ethical concerns about the data, if any.
In the plainest terms: do not use datasets from Kaggle or the UCI Machine Learning Repository. Your data should come from your own collection process (e.g. an API or web scraping) or from the primary source (e.g. a government agency, research group, etc.).
Please ask a member of the course staff if you’re unsure whether your dataset meets the criteria.
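As one possible way to collect data yourself from an API, the sketch below downloads JSON records and stores them unchanged as a CSV file. It is only an illustration under stated assumptions: the endpoint URL, the output file name, and the function names fetch_records and save_raw are hypothetical, and it assumes the API returns a JSON array of flat records. Real APIs often need authentication, pagination, or rate limiting that is not shown here.

```python
import csv
import json
from pathlib import Path
from urllib.request import urlopen


def fetch_records(url: str) -> list[dict]:
    """Download a JSON array of flat records from an API endpoint (assumed format)."""
    with urlopen(url, timeout=30) as response:
        return json.load(response)


def save_raw(records: list[dict], path: Path) -> Path:
    """Write the records to a CSV file without modifying any values."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # Use the union of all keys so records with missing fields still fit.
    fieldnames = sorted({key for record in records for key in record})
    with path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(records)  # missing fields are written as empty cells
    return path


# Usage (the URL and file name are placeholders, not a real data source):
# records = fetch_records("https://example.org/api/observations")
# save_raw(records, Path("data-raw/observations.csv"))
```

Keeping the download step separate from the saving step makes it easier to record when and how the data were collected, which the proposal asks you to state.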
Glimpse of data
For each dataset:
- Place the file containing your data in the data-raw folder of the project repo.
- Provide a brief statistical overview of the data.
  - In R, use the skimr::skim() function to provide a glimpse of the dataset. Update your {renv} lockfile to include the new package(s) in your renv.lock file for maximum reproducibility.
  - In Python, use skimpy or a similar package to provide a comprehensive statistical overview of the dataset. If you installed the package using uv add, then your pyproject.toml and uv.lock files should already be updated.
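As a rough illustration of what such an overview can contain, the Python sketch below builds a small per-column summary with pandas. It is a stand-in, not the required method: the overview function is hypothetical, the three-row dataset is made up purely for demonstration, and treating every non-numeric column as categorical is a simplifying assumption (dates or free text would need separate handling).

```python
from io import StringIO

import pandas as pd


def overview(df: pd.DataFrame) -> pd.DataFrame:
    """Summarize each column: dtype, rough variable type, missing and distinct counts."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        # Simplifying assumption: numeric columns are quantitative, the rest categorical.
        "kind": [
            "quantitative" if pd.api.types.is_numeric_dtype(dtype) else "categorical"
            for dtype in df.dtypes
        ],
        "n_missing": df.isna().sum(),
        "n_unique": df.nunique(),
    })


# Made-up toy data standing in for a file in the data-raw folder.
toy_csv = StringIO("site,mass_g,length_mm\nA,1.5,10\nA,,12\nB,2.5,11\n")
df = pd.read_csv(toy_csv)

print(overview(df))
print(df.describe())  # count, mean, std, and quartiles for the numeric columns

# With skimpy installed, `from skimpy import skim; skim(df)` prints a richer report.
```

A summary like this also helps with the proposal's question about variable types, since it shows at a glance which columns are quantitative and which are categorical.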
Resources for datasets
You can find data wherever you like, but here are some recommendations to get you started. You shouldn’t feel constrained to datasets that are already in a tidy format; you can start with data that needs cleaning and tidying, scrape data off the web, or collect your own data.
- Awesome public datasets
- CDC
- Chicago Open Data Portal
- Data.gov
- Data is Plural
- Election Studies
- European Statistics
- FiveThirtyEight
- General Social Survey
- Goodreads
- Google Dataset Search
- Harvard Dataverse
- International Monetary Fund
- IPUMS survey data from around the world
- Los Angeles Open Data
- National Weather Service
- NHS Scotland Open Data
- NYC OpenData
- Open access to Scotland’s official statistics
- Pew Research
- Project Gutenberg
- Reddit posts and/or comments
- Sports Reference
- Statistics Canada
- The National Bureau of Economic Research
- UK Government Data
- UNICEF Data
- United Nations Data
- United Nations Statistics Division
- US Census Data
- World Bank Data
- Youth Risk Behavior Surveillance System (YRBSS)
Evaluation criteria
Category | Less developed projects | Typical projects | More developed projects |
---|---|---|---|
Topic ideas | Fewer than three topics are included. Topics are vague, and the data are impossible or excessively difficult to collect. | Three topics are included and all or most data could feasibly be collected or accessed by the end of the project. Each dataset is described alongside a note about availability with a source cited. | All expectations of typical projects + each dataset could reasonably be part of a data science project, driven by an interesting research question or objective. |