Project description
Important dates
- Proposal due Thu, Oct 31st 🎃
- Exploration due Thu, Nov 14th
- Draft due Tue, Nov 26th
- Presentation + slides due TBD
- Final report + deployed model due on Thu, Dec 19th
Details will be updated as each deadline approaches.
Introduction
TL;DR: Solve a problem using machine learning.
This is intentionally vague – part of the challenge is to design a project that best showcases your interests and strengths. However, we expect you to solve some sort of problem using machine learning (broadly construed). This could be a prediction problem, a classification problem, a clustering problem, etc. The problem could be in any domain and could involve any type of data. The project should be a substantial piece of work, involving a significant amount of data and computation.
The project is very open-ended. Neatness, coherency, and clarity will count. All components of the project must be reproducible.
Deliverables
The four primary deliverables for the final project are
- A project proposal with three ideas.
- A deployed product that is publicly accessible.
- A final report that documents the model.
- A presentation with slides.
There will be additional submissions throughout the semester to facilitate completion of the final product and presentation.
Teams
Projects will be completed in teams of 3-4 students. Every team member should be involved in all aspects of planning and executing the project, and each should make an equal contribution to every part. The expected scope of your project scales with the number of contributing team members: a team with 4 contributing members will be expected to produce a larger project than a team with 3.
The course staff will assign students to teams. To facilitate this process, we will provide a short survey identifying study and communication habits. Once teams are assigned, they cannot be changed.
Team conflicts
Conflict is a healthy part of any team relationship. If your team doesn’t have conflict, then your team members are likely not communicating their issues with each other. Use your team contract (written at the beginning of the project) to help keep your team dynamic healthy.
When you have conflict, you should follow this procedure:
1. Refer to the team contract and follow it to address the conflict.
2. If you resolve the conflict without issue, great! Otherwise, update the team contract and try to resolve the conflict yourselves.
3. If your team is unable to resolve the conflict, contact soltoffbc@cornell.edu and explain your situation. We’ll ask to meet with all the group members and figure out how we can work together to move forward.
Please do not avoid confrontation if you have conflict. If there’s a conflict, the best way to handle it is to bring it into the open and address it.
Project grade adjustments
Your team will initially receive a grade that assumes all team members contributed to the project. If you have a 5-person team but only 3 members contributed, your team will likely receive a lower initial grade, because only 3 members’ worth of effort exists for a 5-person project. About a week after the initial project grades are released, each individual team member’s project grade will be adjusted.
We use your project’s Git history (to view the contributions of each team member) and the peer evaluations to adjust each team member’s grade. Adjustments may increase or decrease your grade based on each individual’s contributions. For example, if you have a 4-person team but only 3 contributing members, the 3 contributing members may have their grades increased to reflect the effort of a 3-person team. The non-contributing member will likely have their grade decreased significantly.
I am serious about every member of the team equitably contributing to the project. Students who fail to contribute equitably may receive up to a 100% deduction on their project grade.
Please be patient with the grade adjustments; doing them fairly takes time. I handle this entire process myself and take it very seriously. If you think your initial group project grade is unfair, please wait for your grade adjustment before contacting us.
The slacking team member
Please do not cover for a slacking/freeloading team member. Please do not do their work for them! This only rewards their bad behavior. Simply leave their work unfinished. (We will not increase your grade during adjustments for doing more than your fair share.)
Remember, we have your Git history. We can see who contributes to the project and who doesn’t. If a team member rarely commits to Git and only makes very small commits, we can see that they did not contribute their fair share.
All students should make their project contributions through their own GitHub account. Do not commit changes to the repository from another team member’s GitHub account. Your Git history should reflect your individual contributions to the project.
Proposal
There are two main purposes of the project proposal:
- To help you think about the project early, so you can get a head start on finding data, reading relevant literature, thinking about the questions you wish to answer, etc.
- To ensure that the topic you wish to analyze, methods you plan to use, and the scope of your analysis are feasible and will allow you to be successful for this project.
Identify 3 problems you’re interested in potentially solving for the project. These problems should be solved using an ML model and real-world data. If you’re unsure where to find data, you can use the list of potential data sources in the Tips + Resources section as a starting point. It may also help to think of topics you’re interested in investigating and find datasets on those topics.
Write the proposal in the `proposal.qmd` file in your project repo.
You must use one of the topics in the proposal for the final project, unless instructed otherwise when given feedback.
Criteria for datasets
The datasets should meet the following criteria:
- At least 1000 observations
- At least 8 columns
- At least 6 of the columns must be useful and unique explanatory variables.
  - Identifier variables such as “name”, “social security number”, etc. are not useful explanatory variables.
  - If you have multiple columns with the same information (e.g. “state abbreviation” and “state name”), then they are not unique explanatory variables.
- You may not use data that has previously been used in any course materials, or any derivation of data that has been used in course materials.
You may not use data from a secondary data archive. In plainest terms, do not use datasets you find from Kaggle or the UCI Machine Learning Repository. Your data should come from your own collection process (e.g. API or web scraping) or the primary source (e.g. government agency, research group, etc.).
Please ask a member of the course staff if you’re unsure whether your dataset meets the criteria.
Questions for project mentor
Include specific, relevant questions you have for the project mentor about your proposed topics. These questions should be about the feasibility of the project, the quality of the data, the potential for interesting analysis, etc.
Resources for datasets
You can find data wherever you like, but here are some recommendations to get you started. You shouldn’t feel constrained to datasets that are already in a tidy format; you can start with data that needs cleaning and tidying, scrape data off the web, or collect your own data.
- Awesome public datasets
- CDC
- Chicago Open Data Portal
- Data.gov
- Data is Plural
- Election Studies
- European Statistics
- FiveThirtyEight
- General Social Survey
- Goodreads
- Google Dataset Search
- Harvard Dataverse
- International Monetary Fund
- IPUMS survey data from around the world
- Los Angeles Open Data
- National Weather Service
- NHS Scotland Open Data
- NYC OpenData
- Open access to Scotland’s official statistics
- Pew Research
- Project Gutenberg
- Reddit posts and/or comments
- Sports Reference
- Statistics Canada
- The National Bureau of Economic Research
- UK Government Data
- UNICEF Data
- United Nations Data
- United Nations Statistics Division
- US Census Data
- World Bank Data
- Youth Risk Behavior Surveillance System (YRBSS)
Proposal components
For each topic, include the following:
Problem
What is the problem you will solve?
- One or more well-formulated objectives.
- Statement on why this topic is important.
- Identify the types of variables you will use. Categorical? Quantitative?
- How will you make the model usable? An interactive web application a la Shiny? A deployable API?
Introduction and data
For each dataset:
- Identify the source of the data.
- State when and how it was originally collected (by the original data curator, not necessarily how you found the data).
- Write a brief description of the observations.
- Address ethical concerns about the data, if any.
Glimpse of data
For each dataset (if one is provided):
- Place the file containing your data in the `data` folder of the project repo.
- Use the `skimr::skim()` function to provide a glimpse of the dataset.
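For example, a minimal sketch (the file name `data/survey.csv` is hypothetical — substitute your own dataset):

```r
library(readr)
library(skimr)

# read the analysis-ready data from the data folder (hypothetical file name)
survey <- read_csv("data/survey.csv")

# one-line-per-column summary: types, missingness, and distributions
skim(survey)
```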
Exploration
Settle on a single topic and state your objective(s) clearly. You will carry out most of your data collection and cleaning, compute some relevant summary statistics, and show some plots of your data as applicable to your objective(s).
Write up your exploration in the `explore.qmd` file in your project repo. It should include the following sections:
- Objective(s). State the problem(s) you are solving clearly.
- Data collection and cleaning. Have an initial draft of your data cleaning appendix. Document every step that takes your raw data file(s) and turns it into the analysis-ready data set that you would submit with your final project. Include text narrative describing your data collection (downloading, scraping, surveys, etc.) and any additional data curation/cleaning (merging data frames, filtering, transformations of variables, etc.). Include code for data curation/cleaning, but not collection.¹
- Data description. Have an initial draft of your data description section. Your data description should be about your analysis-ready data.
- Data limitations. Identify any potential problems with your dataset.
- Exploratory data analysis. Perform an (initial) exploratory data analysis.
- Questions for reviewers. List specific questions for your project mentor to answer in giving you feedback on this phase.
¹ If you have written code to collect your data (e.g. using an API or web scraping), store this in a separate `.qmd` file or `.R` script in the repo.
Thorough EDA requires substantial review and analysis of your data. You should not expect to complete this phase in a single day. You should expect to iterate through 20-30 charts, sets of summary statistics, etc., to get a good understanding of your data.
Visualizations are not expected to look perfect at this point since they are mainly intended for you and your team members. Standard expectations for visualizations (e.g. clearly labeled charts and axes, optimized color palettes) are not necessary at the EDA stage.
Draft
The purpose of the draft is to give you an opportunity to get early feedback on your analysis. Therefore, the draft will focus primarily on the exploratory analysis and initial drafts of the final product(s).
Write the draft write-up in the `report.qmd` file in your project repo. This should document your modeling strategies to date. At minimum, you are expected to include:
- Objective(s). State the problem(s) you are solving clearly.
- Data description. Your data description should be about your analysis-ready data.
- Decisions based on EDA. Based on your exploratory analysis, explicitly identify the decisions you made about your data (e.g. what to exclude, what to include, what to transform, etc). These could be things you do to preprocess the data before partitioning into training/test set, or possible feature engineering steps you will evaluate in the modeling phase.
- Resampling strategy. All teams are expected to partition their data into training/test sets using an appropriate strategy. Many teams will further partition the training set using a resampling strategy. Document your resampling strategy here and justify the approach you chose.
- Overview of modeling strategies. Provide an overview of the modeling strategies you plan to use. This should include a brief description of the models you plan to use, potential preprocessing or feature engineering steps, tuning parameters, and the evaluation metrics you plan to use to compare models.
- Initial results. Report any initial results you have. This should at least include a null model, as well as any of the modeling strategies from above that you have already tested. Include any relevant evaluation metrics and techniques we have learned in this class.
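As a sketch of the resampling bullet above, a typical tidymodels partition might look like the following (the data frame `analysis_df` and the column `outcome` are hypothetical placeholders for your own data):

```r
library(tidymodels)

set.seed(123)

# hold out a test set, stratified on the outcome
data_split <- initial_split(analysis_df, prop = 0.8, strata = outcome)
train_df <- training(data_split)
test_df <- testing(data_split)

# resample the training set with 10-fold cross-validation
folds <- vfold_cv(train_df, v = 10, strata = outcome)
```

Whatever strategy you choose, document it and justify it in the report.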
You may fit all the models in a separate R script or Quarto file and save/export any appropriate model objects so you can report relevant metrics or create visualizations/tables to report on the models’ performance. You should include the results of the models in the `report.qmd` file.
Standard R objects can be saved to disk using `readr::write_rds()` or `save()`.
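For example, tuning results can be written out individually as `.rds` files or bundled together in a single `.RData` file (the `...` arguments to `tune_grid()` are placeholders for your own workflow and grid):

```r
library(readr)

# tune some complex models
tune_rf_res <- tune_grid(...)
tune_lgbm_res <- tune_grid(...)

# save each object individually
write_rds(tune_rf_res, file = "output/tune_rf_res.rds")
write_rds(tune_lgbm_res, file = "output/tune_lgbm_res.rds")

# or save them together
save(tune_rf_res, tune_lgbm_res, file = "output/tune_results.RData")
```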
For model `fit()` objects, you will likely want to `butcher()` the object to reduce its overall size (otherwise the file size may be hundreds of megabytes). For example:
```r
# fit the best random forest model from the tuning results
best_rf <- fit_best(tune_rf_res)

# strip components not needed for prediction before saving
best_rf_lite <- butcher(best_rf)
write_rds(best_rf_lite, file = "output/best_rf_lite.rds")
```
Read the documentation for {butcher} for more information.
Deployed model
All teams will publish their model for public usage. This could be an interactive web application where users can generate custom predictions from the model, a deployable API allowing developers to interface with the model to generate predictions, or some other form.
We will learn how to publish models using APIs in this course. At this time I do not expect to cover Shiny applications in the course, but your team is free to learn how to use Shiny or a similar tool independently for the project.
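One common approach in the R ecosystem uses the {vetiver} and {plumber} packages to serve a fitted model as a prediction API. A minimal sketch (the fitted workflow `final_fit` and the model name are hypothetical):

```r
library(vetiver)
library(plumber)

# wrap the fitted model (hypothetical object `final_fit`) with its metadata
v <- vetiver_model(final_fit, model_name = "my-model")

# serve a /predict endpoint locally
pr() |>
  vetiver_api(v) |>
  pr_run(port = 8080)
```

This is only a sketch; the specific deployment tooling we cover will be discussed in class.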
Report
Document your model for internal and external communication. The report should be written in the `report.qmd` file in your project repo. The exact details and structure of the report will be provided later in the semester.
Presentation + slides
TODO
Reproducibility + organization
All written work should be reproducible, and the GitHub repo should be neatly organized.
- Points for reproducibility + organization will be based on the reproducibility of the entire repository and the organization of the project GitHub repo.
- The repo should be neatly organized as described above; there should be no extraneous files, and all text in the README should be easily readable.
Teamwork
Every team member should make an equal contribution to all parts of the project. Every team member should have an equal experience designing, coding, testing, etc.
At the completion of the project, you will be asked to fill out a survey where you rate the contribution and teamwork of each team member by assigning a contribution percentage for each team member. Working as a team is every team member’s responsibility.
If you are suggesting that an individual did less than half the expected contribution given your team size (e.g., for a team of four students, if a student contributed less than 12.5% of the total effort), please provide some explanation. If any individual gets an average peer score indicating that they underperformed on the project, we will conduct further analysis and their overall project grade may be adjusted accordingly.
Overall grading
Evaluation criteria
Component | Points |
---|---|
Project proposal | 10 pts |
Exploration | 15 pts |
Draft | 20 pts |
Deployed model | 50 pts |
Report | 20 pts |
Slides + presentation | 15 pts |
Slides + presentation (peer) | 5 pts |
Reproducibility + organization | 15 pts |
Total | 150 pts |
Project proposal
Category | Less developed projects | Typical projects | More developed projects |
---|---|---|---|
Dataset ideas | Fewer than three topics are included. Topic ideas are vague and impossible or excessively difficult to collect. | Three topic ideas are included and all or most datasets could feasibly be collected or accessed by the end of the semester. Each dataset is described alongside a note about availability with a source cited. | Three topic ideas are included and all or most datasets could feasibly be collected or accessed by the end of the semester. Each dataset is described alongside a note about availability with (possibly multiple) sources cited. Each dataset could reasonably be part of a data science project, driven by an interesting research question. |
Questions for reviewers | The questions for reviewers are vague or unclear. | The questions for reviewers are specific to the datasets and are based on group discussions between team members. | The questions for reviewers are specific to the datasets and are based on group discussions between team members. Questions for reviewers look toward the next stage of the project. |
Exploration
Category | Less developed projects | Typical projects | More developed projects |
---|---|---|---|
Objective(s) | Objective is not clearly stated or significantly limits potential analysis. | Clearly states the objective(s), which have moderate potential for relevant impact. | Clearly states complex objective(s) that lead to significant potential for relevant impact. |
Data cleaning | Data is minimally cleaned, with little documentation and description of the steps undertaken. | Completes all necessary data cleaning for subsequent analyses. Describes cleaning steps with some detail. | Completes all necessary data cleaning for subsequent analyses. Describes all cleaning steps in full detail, so that the reader has an excellent grasp of how the raw data was transformed into the analysis-ready dataset. |
Data description | Simple description of some aspects of the dataset, little consideration for sources. The description is missing answers to applicable questions detailed in the “Datasheets for Datasets” paper. | Answers all relevant questions in the “Datasheets for Datasets” paper. | All expectations of typical projects + credits and values data sources. |
Data limitations | The limitations are not explained in depth. There is no mention of how these limitations may affect the meaning of results. | Identifies potential harms and data gaps, and describes how these could affect the meaning of results. | Creatively identifies potential harms and data gaps, and describes how these could affect the meaning of results and the impact of results on people. It is evident that significant thought has been put into the limitations of the collected data. |
Exploratory data analysis | Motivation for choice of analysis methods is unclear. Does not justify decisions to either confirm / update research questions and data description. | Sufficient plots (20-30) and summary statistics to identify typical values in single variables and connections between pairs of variables. Uses exploratory analysis to confirm/update research questions and data description. | All expectations of typical projects + analysis methods are carefully chosen to identify important characteristics of data. |
Draft
TODO
Deployed model
TODO
Report
TODO
Presentation + slides
TODO
Presentation + slides (peer)
TODO
Reproducibility + organization
Category | Less developed projects | Typical projects |
---|---|---|
Reproducibility | Required files are missing. Quarto files do not render successfully (except for if a package needs to be installed). | All required files are provided. Project files (e.g. Quarto, Shiny apps, R scripts) render without issues and reproduce the necessary outputs. |
Data documentation | Codebook is missing. No local copies of data files. | All datasets are stored in a data folder, a codebook is provided, and a local copy of the data file is used in the code where needed. |
File readability | Documents lack a clear structure. There are extraneous materials in the repo and/or files are not clearly organized. | Documents (Quarto files and R scripts) are well structured and easy to follow. No extraneous materials. |
Issues | Issues have been left open, or are closed mostly without specific commits addressing them. | All issues are closed, mostly with specific commits addressing them. |
Late work policy
There is no late work accepted on this project. Be sure to turn in your work early to avoid any technological mishaps.