HW 04 - LLMs and explaining models
This homework is due Monday, November 18 at 11:59pm ET.
Learning objectives
- Utilize prompt engineering to improve LLM performance
- Create interactive web applications using Shiny
- Use LLMs to assist with development of data science tools
- Explain how machine learning models make predictions
Getting started
Go to the info4940-fa24 organization on GitHub. Click on the repo with the prefix hw-04. It contains the starter documents you need to complete the homework.
Clone the repo and start a new project in RStudio.
Packages
Guidelines + tips
- Set your random seed to ensure reproducible results.
- Use caching to speed up the rendering process.
- Use parallel processing to speed up rendering time. Note that this works differently across operating systems, and it also makes it harder to debug code and track progress in model fitting. Use at your own discretion.
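A minimal setup sketch for the seed and parallel-processing suggestions above (the package choices and core count are assumptions; adapt to your own workflow):

```r
# Reproducibility: set the seed once near the top of the document
set.seed(2024)

# Optional parallel backend for model tuning (use at your own discretion)
library(doParallel)

cl <- makePSOCKcluster(parallel::detectCores() - 1)  # leave one core free
registerDoParallel(cl)

# ... fit/tune models here ...

stopCluster(cl)  # release the workers when finished
```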
For the love of all that is pure in this world, please consider how to present the results of your modeling efforts. Do not just rely on raw output from R to tell us what we need to know.
- Your plots should include an informative title, axes should be labeled, and careful consideration should be given to aesthetic choices.
- Consider condensing information into one or a handful of custom graphs.
- You can create simple formatted tables using {gt}.
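For instance, a small formatted {gt} table might look like this (the dataset and column choices are purely illustrative):

```r
library(gt)

mtcars |>
  head(5) |>
  gt() |>
  tab_header(title = "First five rows of mtcars") |>
  fmt_number(columns = mpg, decimals = 1)
```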
Remember that continuing to develop a sound workflow for reproducible data analysis is important as you complete this homework and other assignments in this course. You will find periodic reminders throughout this assignment to render, commit, and push your changes to GitHub. You should have at least 3 commits with meaningful commit messages by the end of the assignment.
Workflow + formatting
Make sure to
- Update author name on your document.
- Label all code chunks informatively and concisely.
- Follow the Tidyverse code style guidelines.
- Make at least 3 commits.
- Resize figures where needed, avoid tiny or huge plots.
- Turn in an organized, well formatted document.
Part 1 - Prompt engineering
Recall a previous application exercise where we attempted to develop a system prompt for a generative AI tutor for INFO 2950/5001. You will refine the prompt to be used for an R tutor chatbot. The design requirements are:
The chatbot will be deployed for INFO 2950 or INFO 5001 to assist students in meeting the learning objectives for the courses. It should behave similarly to a human TA in that it supports students without providing direct answers to assignments or exams. Test your new system prompt on the student prompts below and evaluate the responses it produces.
- It is a support tool for students in the class to improve their understanding of data science concepts and R programming, and meet the course learning objectives.
- It should not provide direct answers or fix students’ code on assignments. Instead, it should answer general questions about R programming, debugging, and data analysis.
- It can provide troubleshooting assistance on assignment questions, but it should not directly solve the problem for the student.
Exercise 1
Create a system prompt that meets these design goals and test it on the following example student queries:
I am running into issues with step_textfeature() in the following recipe:
```r
lasso_improved_rec <- recipe(state ~ ., data = kickstarter_train) |>
  update_role(.id, new_role = "id var") |>
  step_tokenize(blurb) |>
  step_stopwords(blurb) |>
  step_stem(blurb) |>
  step_tokenfilter(blurb, max_tokens = 500) |>
  step_ngram(blurb, min_num_tokens = 1, num_tokens = 3) |>
  step_textfeature(blurb) |>
  step_tfidf(blurb) |>
  step_normalize(all_predictors())
```
The error I am encountering is as follows, and I am not sure what to do/have not been able to find a helpful answer in the documentation or online given my recipe:
```
i Fold01: preprocessor 1/1
x Fold01: preprocessor 1/1:
Error in step_textfeature():
Caused by error in prep():
✖ All columns selecte... string, factor, or ordered.
• 1 tokenlist variable found: b...
i Fold02: preprocessor 1/1
x Fold02: preprocessor 1/1:
Error in step_textfeature():
Caused by error in prep():
✖ All columns selecte... string, factor, or ordered.
• 1 tokenlist variable found: b...
```
And so on ...
What is the difference between a for loop and a map function in R?
I was just curious about what constitutes a good "binwidth" for a histogram. Is a good binwidth one where the bars are separated and can each be individually visible (rather than the bars appearing stuck together)? Or is a good binwidth one where all of the bars are large enough to adequately fill up the space on the graph itself?
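One way to test a candidate system prompt against the student queries above from R is with the {ellmer} package (the model function and prompt text below are placeholders, not requirements for this exercise):

```r
library(ellmer)

# A hypothetical tutor chatbot; replace the system prompt with your own draft
tutor <- chat_claude(
  system_prompt = "You are a TA for INFO 2950/5001. Support student learning
    of R and data science concepts, but never provide direct answers to
    assignment or exam questions."
)

# Paste in one of the student queries and inspect the response
tutor$chat("What is the difference between a for loop and a map function in R?")
```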
Part 2 - Using LLMs to build data science tools
{Shiny} is a software package in R and Python that allows users to create interactive web applications. In this exercise, you will use Shiny to build a dashboard for the Democracy and Dictatorship dataset.
But wait! I haven’t taught you anything about Shiny yet! How can you be expected to learn this new tool in such a short time frame? Let’s leverage the power of LLMs to help you build this dashboard.
Shiny Assistant is an online tool that helps you build Shiny apps by generating code based on your inputs. It uses Anthropic’s Claude 3.5 Sonnet model to generate custom Shiny apps based on your prompts while still allowing you to edit the source code directly.
Exercise 2
Use this tool to build a Shiny app for the Democracy and Dictatorship dataset. The data is a panel structure with multiple years of observations for each country. You can view the codebook on the Tidy Tuesday website.
Some important guidelines:
- You may build the app in R or Python. Use the toggle in the top-left corner of Shiny Assistant to switch between R and Python.
- You do not need to include all the information from the dataset in the app. Instead, focus on a few key variables that you think are important for understanding the data. You can use the Shiny Assistant to generate the initial code for the app and ask questions about Shiny concepts or techniques, but also feel free to customize the app to include additional features or functionality that you think are important.
- Your app should include at least one plot and one table or formatted text content.
- Be aware of how you are using Shiny Assistant to help build the application. You will need to reflect on your usage of the tool in your self-reflection at the end of the assignment.
In theory Shiny Assistant has the ability to upload data files as part of the editing process, though I found this not to be fully functional. What I found easier was to create a new file in the top-left panel of the R IDE, name it `democracy_data.csv`, and manually paste in the contents of the CSV file. This way, you can reference the data file in your Shiny app code and every LLM prompt will include a copy of the dataset.
You cannot use the full CSV data file with Shiny Assistant. Because the entire data file is passed to the LLM model, you will get an error message that the user prompt exceeds the token limit. I have included a reduced-size version of the dataset in the `data` folder of the homework repository called `democracy-lite.csv`, which contains the last two rows for each country. Use this file to build your Shiny app with Shiny Assistant, but use the full dataset for your final submission.
If your browser goes idle, your session with Shiny Assistant will time out and there is no way to retrieve your chat history. Make sure to save a copy of the source code for your Shiny app locally so you can reuse it in a future Shiny Assistant session and publish the app.
Your working Shiny app must be published online so we can access and test it. You can use the free shinyapps.io service to host your Shiny app. Follow the Getting Started guide to create a free account and publish your app directly from your IDE.
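Publishing from the R console follows the standard {rsconnect} pattern (the account name, token, and app path below are placeholders you must replace with your own values from the shinyapps.io dashboard):

```r
library(rsconnect)

# One-time setup: copy the token/secret from your shinyapps.io account page
setAccountInfo(
  name   = "your-account-name",  # placeholder
  token  = "YOUR_TOKEN",         # placeholder
  secret = "YOUR_SECRET"         # placeholder
)

# Deploy the folder containing app.R
deployApp(appDir = "path/to/app")  # placeholder path
```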
In the PDF submitted to Gradescope, include a link to your published Shiny app. Just paste the URL directly into the document. Gradescope does not allow us to click on embedded links, so we need to be able to read the URL directly in the document.
Your source code for the Shiny app should be included in the homework repository. Use either `app.R` or `app.py` depending on whether you use Shiny for R or Python. Your Quarto notebook will automatically embed the code from these files when you render the document.
Part 3 - Explain predicted danceability
You will interpret the result of a random forest model that predicts the “danceability” of a collection of songs.
The source of the data is Spotify and contains detailed song-level data for every song in a playlist created by or liked by the instructor.
`spotify-model.RData` contains the training set, test set, and {tidymodels} workflow for a fitted random forest model. You do not need to estimate a new machine learning model for this assignment. Instead, you will interpret the random forest model to understand how it makes predictions about the danceability of songs. In addition, for a handful of songs you will explain the model's predictions to understand why it predicts a particular danceability score for a given song.
The dataset includes the following variables:
| Column name | Variable description |
|---|---|
| `.id` | Unique identification number for each song in the dataset |
| `acousticness` | A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic. |
| `album` | Name of the album from which the song originates. |
| `artist` | The artist who recorded the song. |
| `danceability` | Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable. |
| `energy` | Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy. |
| `explicit` | A logical value which indicates whether or not the song contains explicit lyrics. |
| `instrumentalness` | Predicts whether a track contains no vocals. "Ooh" and "aah" sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly "vocal". The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0. |
| `key_name` | The key the track is in. |
| `liveness` | Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live. |
| `loudness` | The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB. |
| `mode_name` | Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. |
| `playlist_name` | The (anonymized) name of the Spotify playlist where the song is included. |
| `speechiness` | Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks. |
| `tempo` | The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration. |
| `time_signature` | An estimated time signature. The time signature (meter) is a notational convention to specify how many beats are in each bar (or measure). The time signature ranges from 3 to 7, indicating time signatures of "3/4" to "7/4". |
| `track` | Name of the song. |
| `valence` | A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry). |
The dataset contains several ID variables that were not used to fit the model. These include `.id`, `album`, `artist`, and `track`. Because of how {tidymodels} combines the feature engineering recipe and the model specification, these columns are passed automatically to {DALEX}. If they appear in any of your interpretations/explanations, they should be ignored since they were not actually used to fit the model.
Exercise 3
Load the data files and model, and prepare the data for interpretation. Import the training set, test set, and {tidymodels} workflow from `spotify-model.RData`. Create an appropriate explainer object using {DALEX}.
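Creating the explainer might look roughly like this, assuming object names such as `rf_wf` and `spotify_train` (these names are guesses; check the actual objects after loading the `.RData` file):

```r
library(tidymodels)
library(DALEX)
library(DALEXtra)

load("data/spotify-model.RData")  # path is an assumption; adjust as needed

# explain_tidymodels() wraps a fitted workflow for use with {DALEX} tools
explainer_rf <- explain_tidymodels(
  rf_wf,                                         # fitted workflow (assumed name)
  data  = select(spotify_train, -danceability),  # predictors only
  y     = spotify_train$danceability,            # the outcome
  label = "random forest"
)
```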
Exercise 4
What are the most important features in the model? Use permutation-based feature importance to identify the most relevant features in the model. Estimate all feature importance scores using a random sample of 1000 observations from the training set, and report your results as the ratio change in the RMSE. Provide a substantive written interpretation of the results.
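With a {DALEX} explainer (here called `explainer_rf`, an assumed name from the previous exercise), permutation importance reported as an RMSE ratio can be sketched along these lines:

```r
library(DALEX)

set.seed(2024)
vip_rf <- model_parts(
  explainer_rf,
  N = 1000,                              # random sample of 1000 training rows
  type = "ratio",                        # report the ratio change in RMSE
  loss_function = loss_root_mean_square
)

plot(vip_rf)
```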
Exercise 5
Evaluate the marginal effect of the most important features on the predicted danceability. Use the top-5 features from the previous exercise and interpret the marginal effect of each feature on the predicted danceability using a partial dependence profile (PDP) plot. Provide a substantive written interpretation of the results.
- Include the ICE curves for each feature.
- Choose an appropriate visualization type depending on if the feature is categorical or continuous.
Review Tidy Modeling with R for guidance on how to better display the results from the partial dependence calculations.
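A sketch of the PDP/ICE computation with {DALEX} (the feature names below are placeholders for your actual top-5, and `explainer_rf` is an assumed name):

```r
set.seed(2024)
pdp_rf <- model_profile(
  explainer_rf,
  variables = c("tempo", "energy"),  # placeholders for your top-5 features
  N = 500                            # observations used for the profiles
)

# geom = "profiles" overlays the individual ICE curves on the aggregate PDP
plot(pdp_rf, geom = "profiles")
```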
Exercise 6
Explain the predictions for specific songs in the test set using Shapley values. Explain why the random forest model generated its predicted danceability for these songs found in the test set:
- Bananaphone by Raffi
- Dance The Night - From Barbie The Album by Dua Lipa
- The Fire Of Eternal Glory by Dimitri Shostakovich
- I’m Good (Blue) by David Guetta
- Pokemon Theme by Pokémon
Feel free to pick other songs from the test set to explain.
Use Shapley values to generate your explanations. Include a graph for each song that includes the actual prediction from the model and the average prediction for the entire test set, along with the average contributions as determined by your Shapley values. Provide a written interpretation for each song.
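Each song's explanation follows the same {DALEX} pattern; a sketch for one song (the filter expression, column names, and `explainer_rf`/`spotify_test` names are assumptions):

```r
library(dplyr)
library(DALEX)

# Pull one song from the test set
bananaphone <- spotify_test |>
  filter(track == "Bananaphone")  # assumed column and object names

set.seed(2024)
shap_bananaphone <- predict_parts(
  explainer_rf,
  new_observation = bananaphone,
  type = "shap",
  B = 25                          # number of feature orderings to average over
)

plot(shap_bananaphone)
```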
Generative AI (GAI) self-reflection
As stated in the syllabus, include a written reflection for this assignment of how you used GAI tools (e.g. what tools you used, how you used them to assist you with writing code), what skills you believe you acquired, and how you believe you demonstrated mastery of the learning objectives.
Render, commit, and push one last time.
Make sure that you commit and push all changed documents and your Git pane is completely empty before proceeding.
Wrap up
Submission
- Go to http://www.gradescope.com and click Log in in the top right corner.
- Click School Credentials → Cornell University NetID and log in using your NetID credentials.
- Click on your INFO 4940/5940 course.
- Click on the assignment, and you’ll be prompted to submit it.
- Mark all the pages associated with each exercise. All the pages of your homework should be associated with at least one question (i.e., should be "checked").
Grading
- Exercise 1: 10 points
- Exercise 2: 18 points
- Exercise 3: 2 points
- Exercise 4: 6 points
- Exercise 5: 6 points
- Exercise 6: 8 points
- GAI self-reflection: 0 points
- Total: 50 points