HW 01 - Make a model
This homework is due September 10 at 11:59pm ET.
Learning objectives
- Conduct exploratory data analysis
- Implement resampling methods
- Define and fit machine learning models
Getting started
Go to the info4940-fa25 organization on GitHub. Click on the repo with the prefix hw-01. It contains the starter documents you need to complete the homework.
Clone the repo and start a new workspace in Positron. See the Homework 0 instructions for details on cloning a repo and starting a new R project.
General guidance
As we’ve discussed in lecture, your plots should include an informative title, labeled axes, and carefully considered aesthetic choices.
Remember that continuing to develop a sound workflow for reproducible data analysis is important as you complete this homework and other assignments in this course. Periodic reminders throughout this assignment will prompt you to render, commit, and push your changes to GitHub. You should have at least 3 commits with meaningful commit messages by the end of the assignment.
Make sure to
- Update author name on your document.
- Label all code chunks informatively and concisely.
- Follow consistent code style guidelines.
- Make at least 3 commits.
- Resize figures where needed; avoid tiny or huge plots.
- Turn in an organized, well-formatted document.
Project FeederWatch
Project FeederWatch is a citizen science project that collects data on birds and other wildlife that visit feeders in North America.[1] A subset of their data was published in 2021 as part of the Tidy Tuesday project. Here we will use the data to model the presence of squirrels at bird feeders.
[1] Co-operated by our very own Cornell Lab of Ornithology!
data/squirrels-lite.csv contains a lightly modified version of the raw data file, filtering for missing values and operationalizing our outcome of interest, squirrels, as either “squirrels” or “no squirrels” based on whether squirrels take food from the feeder at least 3 times per week. Since the original data set contained over 200,000 rows, we have taken a balanced sample of 5,000 rows for each outcome to make the computations more manageable (total size of 10,000 rows).
As you age, you will develop Old Person interests. One that my wife and I share is watching the animals around our house. We have multiple bird feeders which attract songbirds. Unfortunately the squirrels also like to eat the birdseed, cluttering our front porch with seed shells and stealing food from the birds. We want to put up a squirrel-proof bird feeder, but they are expensive and we don’t want to put one up if there aren’t many squirrels around. I want to use your model to inform our decision about when and where to put a feeder.
Exercise 1
Import and partition the data. Import the data and reproducibly partition it into training and test sets. Briefly justify your choice of partitioning strategy.
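To illustrate what "reproducibly partition" means, here is a minimal base R sketch on a made-up data frame. The toy data, the seed value, and the 75/25 proportion are all assumptions chosen for illustration, not requirements of the assignment. The split is stratified by the outcome so that both sets keep the same class balance; tidymodels users could get a similar result from `initial_split()` in rsample with its `strata` argument.

```r
# Toy stand-in for the homework data (hypothetical; the real file has more columns).
set.seed(4940)  # fixing the seed makes the random split reproducible
toy <- data.frame(
  squirrels = factor(rep(c("squirrels", "no squirrels"), each = 50)),
  x = rnorm(100)
)

# Stratified 75/25 split: sample 75% of the row indices within each outcome class.
train_idx <- unlist(lapply(
  split(seq_len(nrow(toy)), toy$squirrels),
  function(idx) sample(idx, size = floor(0.75 * length(idx)))
))

toy_train <- toy[train_idx, ]
toy_test <- toy[-train_idx, ]

# Both sets keep the 50/50 class balance of the full data.
table(toy_train$squirrels)
table(toy_test$squirrels)
```

Re-running the script with the same seed yields the same rows in each set, which is what makes the partition reproducible.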
Now is a good time to render, commit, and push.
Exercise 2
Inspect the distribution of the outcome variable squirrels and conduct exploratory analysis with potential features of interest. Use visualizations and summary statistics to explore the data, focusing on the outcome of interest and at least five other variables. Summarize your findings.
Now is a good time to render, commit, and push.
Exercise 3
Resample the training data and fit a null model. Resample the training data using an appropriate method (justify your choice) and fit a null model to establish a baseline of comparison. Interpret the results of the null model.
A null model (also known as a baseline model) is a model that includes no predictors and establishes a baseline by which we can evaluate the relative effectiveness of our machine learning models. In the absence of predictors, our best guess for a classification model is to predict the modal outcome for all observations.
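The idea of predicting the modal outcome can be sketched in a few lines of base R. The outcome vectors below are invented for illustration, and this is only a hand-rolled version of the concept, not necessarily how you should fit the null model in your workflow.

```r
# Toy outcome vectors (hypothetical, not the homework data).
train_y <- factor(c("squirrels", "squirrels", "squirrels",
                    "no squirrels", "no squirrels"))
test_y <- factor(c("squirrels", "no squirrels", "squirrels", "no squirrels"),
                 levels = levels(train_y))

# The null model "learns" only the most frequent class in the training data.
modal_class <- names(which.max(table(train_y)))

# It predicts that class for every observation, ignoring all features.
null_pred <- factor(rep(modal_class, length(test_y)), levels = levels(train_y))

# Accuracy equals the share of evaluated observations in the modal class.
null_accuracy <- mean(null_pred == test_y)
null_accuracy  # 0.5 for this toy example
```

Because the homework data is a balanced sample of 5,000 rows per outcome, a null model evaluated on it should score close to 50% accuracy, which is the baseline your models need to beat.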
Now is a good time to render, commit, and push.
Exercise 4
Train and compare at least three models. Fit at least three distinct models using the resampled training set. Explain your choice of models and predictors. At least one should be a parsimonious model (i.e., don’t just throw all the variables in as predictors). Compare the models using accuracy and ROC AUC.[2]
[2] If you haven’t used ROC AUC before, review its definition in ISL ch 4.4.2 (it’s near the end of the section).
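As a rough illustration of how the two metrics differ, the base R sketch below computes both by hand on six made-up predictions. The labels, the probabilities, and the 0.5 classification threshold are assumptions for illustration. Accuracy depends on the threshold, while ROC AUC depends only on how well the predicted probabilities rank positives above negatives; here it is computed through the rank-sum (Mann-Whitney) identity rather than by tracing the ROC curve.

```r
# Hypothetical truth labels and predicted probabilities of "squirrels".
truth <- c(1, 1, 1, 0, 0, 0)              # 1 = "squirrels", 0 = "no squirrels"
prob  <- c(0.9, 0.8, 0.4, 0.6, 0.3, 0.2)

# Accuracy: share of correct class predictions at a 0.5 threshold.
pred <- as.integer(prob >= 0.5)
accuracy <- mean(pred == truth)

# ROC AUC: probability that a randomly chosen positive is scored higher
# than a randomly chosen negative, computed from the ranks of the scores.
n_pos <- sum(truth == 1)
n_neg <- sum(truth == 0)
pos_rank_sum <- sum(rank(prob)[truth == 1])
auc <- (pos_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

accuracy  # 4 of 6 predictions correct
auc       # 8 of 9 positive-negative pairs ranked correctly
```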
Now is a good time to render, commit, and push.
Exercise 5
Finalize your model and evaluate on the test set. Pick the best model (explain why you decided it is the best). Train it using the entire training set (not resampled) and evaluate it using the test set. Report the model’s accuracy and ROC AUC. How does it compare to the null model? How does it compare to the metrics from the resampled training set? How effective do you find this model to be?
Now is a good time to render, commit, and push.
Generative AI (GAI) self-reflection
As stated in the syllabus, include a written reflection for this assignment of how you used GAI tools (e.g. what tools you used, how you used them to assist you with writing code), what skills you believe you acquired, and how you believe you demonstrated mastery of the learning objectives.
Render, commit, and push one last time.
Make sure that you commit and push all changed documents and that your Git pane is completely empty before proceeding.
Wrap up
Submission
- Go to http://www.gradescope.com and click Log in in the top right corner.
- Click School Credentials → Cornell University NetID and log in using your NetID credentials.
- Click on your INFO 4940/5940 course.
- Click on the assignment, and you’ll be prompted to submit it.
- Mark the pages associated with each exercise. All the pages of your homework should be associated with at least one question (i.e., should be “checked”).
Grading
- Exercise 1: 4 points
- Exercise 2: 10 points
- Exercise 3: 8 points
- Exercise 4: 20 points
- Exercise 5: 8 points
- GAI self-reflection: 0 points
- Total: 50 points