HW 06 - Bring it together
This homework is due December 16 at 11:59pm ET.
Learning objectives
- Train and evaluate a machine learning model
- Utilize tuning parameters to increase model performance
- Evaluate the performance of machine learning models
- Explain how machine learning models make predictions
Getting started
Go to the info4940-fa24 organization on GitHub. Click on the repo with the prefix hw-06. It contains the starter documents you need to complete the homework.
Clone the repo and start a new project in RStudio.
Packages
Guidelines + tips
- Set your random seed to ensure reproducible results.
- Use caching to speed up the rendering process.
- Use parallel processing to speed up rendering time. Note that it behaves differently across systems and operating systems, and it makes it harder to debug code and track progress in model fitting. Use at your own discretion.
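As a minimal sketch of the first two tips, the snippet below shows that resetting the same seed reproduces the same random draw, and notes where a caching option would go. The seed value `4940` is an arbitrary choice for illustration, and the `cache: true` chunk option assumes the document is rendered with the knitr engine.

```r
# Set the seed once, near the top of the document, so that resampling and
# model fitting give the same results on every render.
set.seed(4940) # arbitrary value chosen for illustration
draw_one <- sample(1:100, size = 5)

# Resetting the same seed reproduces exactly the same draw.
set.seed(4940)
draw_two <- sample(1:100, size = 5)
identical(draw_one, draw_two)

# Caching is requested per code chunk in the Quarto document. Assuming the
# knitr engine, the first line of a slow chunk would carry this option:
# #| cache: true
```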
For the love of all that is pure in this world, please consider how to present the results of your modeling efforts. Do not just rely on raw output from R to tell us what we need to know.
- Your plots should include an informative title, axes should be labeled, and careful consideration should be given to aesthetic choices.
- Consider condensing information into one or a handful of custom graphs.
- You can create simple formatted tables using {gt}.
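One possible way to turn raw output into a formatted table is sketched below. It uses the built-in `mtcars` data purely as a stand-in (it is not a Tidy Tuesday dataset), and the title and column labels are illustrative choices.

```r
library(dplyr)
library(gt)

# Stand-in data: summarize mtcars rather than printing raw console output.
mpg_summary <- mtcars |>
  group_by(cyl) |>
  summarize(mean_mpg = mean(mpg), n_cars = n())

# Build a formatted table with a title, readable labels, and rounded values.
mpg_table <- mpg_summary |>
  gt() |>
  tab_header(title = "Fuel efficiency by number of cylinders") |>
  cols_label(cyl = "Cylinders", mean_mpg = "Mean MPG", n_cars = "Cars") |>
  fmt_number(columns = mean_mpg, decimals = 1)

mpg_table
```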
Remember that continuing to develop a sound workflow for reproducible data analysis is important as you complete this homework and other assignments in this course. This assignment includes periodic reminders to render, commit, and push your changes to GitHub. You should have at least 3 commits with meaningful commit messages by the end of the assignment.
Workflow + formatting
Make sure to
- Update author name on your document.
- Label all code chunks informatively and concisely.
- Follow the Tidyverse code style guidelines.
- Make at least 3 commits.
- Resize figures where needed; avoid tiny or huge plots.
- Turn in an organized, well-formatted document.
Tidy Tuesday
Tidy Tuesday is a weekly data project to promote wrangling and visualization skills. It is hosted by the Data Science Learning Community, which aims to “create a supportive and responsive online space for learners” to improve their programming and data analysis skills.
Every week they post a raw dataset on GitHub and ask people to explore the data. The ultimate goal is to apply R skills, get feedback, explore others’ work, and connect with the greater #RStats community. Contributors frequently publish their work on social media under the #TidyTuesday hashtag. Datasets are posted on Mondays.
You are expected to solve a predictive problem using a Tidy Tuesday dataset published during 2024.
Exercise 1
Identify your objective. Describe the problem you are trying to solve and the dataset you are using.
Exercise 2
Exploratory analysis. Perform an exploratory analysis of the data. This should include summary statistics, visualizations, and any other relevant information.
Exercise 3
Data preprocessing. Perform and document any required preprocessing steps before you partition the data. Then use an appropriate method to partition the data into training/test sets, and use an appropriate resampling method on the training set.
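A minimal sketch of the partitioning step, assuming the {rsample} package from tidymodels. The built-in `mtcars` data stands in for your Tidy Tuesday dataset, and the 75/25 split and 5 folds are illustrative choices, not requirements.

```r
library(rsample)

set.seed(4940) # arbitrary seed so the split is reproducible

# mtcars is only a stand-in for a Tidy Tuesday dataset.
data_split <- initial_split(mtcars, prop = 0.75)
data_train <- training(data_split)
data_test <- testing(data_split)

# Resample the training set only; the test set stays untouched until the
# final evaluation.
data_folds <- vfold_cv(data_train, v = 5)

data_folds
```

For a classification problem with imbalanced classes, the `strata` argument of `initial_split()` and `vfold_cv()` may be worth considering.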
Exercise 4
Model selection. Train and evaluate some set of machine learning models. This should include some combination of feature selection, feature reduction, feature engineering, and/or hyperparameter tuning. Document what techniques you utilized and report the results using appropriate metrics, tables, and/or visualizations.
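One way hyperparameter tuning might look with tidymodels is sketched below. Everything specific here is an assumption for illustration: `mtcars` stands in for your training set, a decision tree fit with the rpart engine is just one candidate model, and the grid size and RMSE metric are arbitrary choices.

```r
library(tidymodels)

set.seed(4940)

# Stand-in for the training set produced in Exercise 3.
train_data <- mtcars
folds <- vfold_cv(train_data, v = 5)

# Mark the hyperparameters to tune with tune().
tree_spec <- decision_tree(cost_complexity = tune(), min_n = tune()) |>
  set_engine("rpart") |>
  set_mode("regression")

tree_wf <- workflow() |>
  add_formula(mpg ~ .) |>
  add_model(tree_spec)

# A small regular grid: 3 values per parameter, 9 combinations in total.
tree_grid <- grid_regular(cost_complexity(), min_n(), levels = 3)

tree_res <- tune_grid(
  tree_wf,
  resamples = folds,
  grid = tree_grid,
  metrics = metric_set(rmse)
)

# Candidates ranked by cross-validated RMSE.
show_best(tree_res, metric = "rmse", n = 3)
```

The tibble returned by `collect_metrics(tree_res)` is a convenient starting point for a summary table or plot of the tuning results.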
You can render your model fitting code in the hw-06.qmd file. However, you may also choose to fit the models in a separate R script/Quarto document and save the necessary objects in your repo, then import those objects in your submission to generate appropriate tables and figures.
I don’t need to see the code used to tune and fit all your models - your documentation should be sufficient to understand what you’ve done.
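If you fit models outside hw-06.qmd, saving and reloading objects could look roughly like this sketch. The file name `final-fit.rds` and the use of `tempdir()` are hypothetical (in practice you would save to a folder inside your repo), and the `lm()` fit is only a stand-in for a tuned model.

```r
# In the separate fitting script: fit the model and save the object.
# Hypothetical path; in practice point this at a folder inside your repo.
fit_path <- file.path(tempdir(), "final-fit.rds")

model_fit <- lm(mpg ~ wt + hp, data = mtcars) # stand-in for a tuned model
saveRDS(model_fit, fit_path)

# In hw-06.qmd: load the saved object and use it for tables and figures.
loaded_fit <- readRDS(fit_path)
coef(loaded_fit)
```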
Exercise 5
Model evaluation. Fit a single, final model and evaluate its performance using the test set. Explain the model or individual predictions as necessary.
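A minimal sketch of the final fit, assuming tidymodels. `last_fit()` trains the workflow on the training portion of the split and evaluates it once on the test portion. The `mtcars` data and the linear model are stand-ins for your own dataset and selected model.

```r
library(tidymodels)

set.seed(4940)

# mtcars is only a stand-in for a Tidy Tuesday dataset.
data_split <- initial_split(mtcars, prop = 0.75)

final_wf <- workflow() |>
  add_formula(mpg ~ wt + hp) |>
  add_model(linear_reg() |> set_engine("lm"))

# Fit on the training set, then evaluate once on the test set.
final_fit <- last_fit(final_wf, split = data_split)

collect_metrics(final_fit) # test set performance
collect_predictions(final_fit) # test set predictions, one row per observation

# For a model with interpretable coefficients, inspect the fitted parameters.
final_fit |>
  extract_workflow() |>
  extract_fit_parsnip() |>
  tidy()
```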
Generative AI (GAI) self-reflection
As stated in the syllabus, include a written reflection for this assignment on how you used GAI tools (e.g. what tools you used, how you used them to assist you with writing code), what skills you believe you acquired, and how you believe you demonstrated mastery of the learning objectives.
Render, commit, and push one last time.
Make sure that you commit and push all changed documents and your Git pane is completely empty before proceeding.
Wrap up
Submission
- Go to http://www.gradescope.com and click Log in in the top right corner.
- Click School Credentials → Cornell University NetID and log in using your NetID credentials.
- Click on your INFO 4940/5940 course.
- Click on the assignment, and you’ll be prompted to submit it.
- Mark the pages associated with each exercise. All the pages of your homework should be associated with at least one question (i.e., should be “checked”).
Grading
- Exercise 1: 5 points
- Exercise 2: 5 points
- Exercise 3: 5 points
- Exercise 4: 25 points
- Exercise 5: 10 points
- GAI self-reflection: 0 points
- Total: 50 points