HW 03 - Label misinformation
This homework is due Monday, November 4th at 11:59pm ET.
Learning objectives
- Apply feature engineering techniques to text data
- Fit machine learning models using text features
- Evaluate the performance of machine learning models
- Consider the effect of metric choice on model evaluation
Getting started
Go to the info4940-fa24 organization on GitHub. Click on the repo with the prefix hw-03. It contains the starter documents you need to complete the homework.
Clone the repo and start a new project in RStudio.
Packages
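The starter repo specifies the exact packages you need; as a hedged sketch (assuming a tidymodels + {textrecipes} workflow, plus {discrim} for naive Bayes and {vip} for importance plots), your setup chunk will look something like this:

```r
# Core packages for this assignment (adjust to match the starter document)
library(tidyverse)    # data wrangling and visualization
library(tidymodels)   # recipes, workflows, tuning, yardstick metrics
library(textrecipes)  # text-specific recipe steps (tokenize, tf-idf, etc.)
library(discrim)      # parsnip engines for naive Bayes
library(vip)          # variable importance plots
library(gt)           # formatted tables
```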
Guidelines + tips
- Set your random seed to ensure reproducible results.
- Use caching to speed up the rendering process.
- Use parallel processing to speed up rendering time. Note that this works differently on different systems and operating systems, and it also makes it harder to debug code and track progress in model fitting. Use at your own discretion.
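For instance, a minimal setup along these lines (the seed value and core count are arbitrary choices; {doParallel} is one common backend but not the only option):

```r
# Fix the random seed so splits, resamples, and tuning are reproducible
set.seed(20241104)

# Optional: register a parallel backend. Leave this out if you prefer
# easier debugging and progress tracking during model fitting.
library(doParallel)
registerDoParallel(cores = parallel::detectCores() - 1)

# For caching, set the chunk option `#| cache: true` on slow model-fitting chunks
```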
For the love of all that is pure in this world, please consider how to present the results of your modeling efforts. Do not just rely on raw output from R to tell us what we need to know.
- Your plots should include an informative title, axes should be labeled, and careful consideration should be given to aesthetic choices.
- Maybe condense information into one or a handful of custom graphs.
- You can create simple formatted tables using {gt}.
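For example, a minimal sketch that formats a metrics tibble (here `my_metrics` is a placeholder for something like the output of `collect_metrics()` later in the assignment):

```r
# Turn a tibble of resampled metrics into a presentable table
my_metrics |>
  select(.metric, mean, std_err) |>
  gt() |>
  fmt_number(columns = c(mean, std_err), decimals = 3) |>
  tab_header(title = "Model performance across resamples")
```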
Remember that continuing to develop a sound workflow for reproducible data analysis is important as you complete this homework and other assignments in this course. There will be periodic reminders in this assignment to render, commit, and push your changes to GitHub. You should have at least 3 commits with meaningful commit messages by the end of the assignment.
Workflow + formatting
Make sure to
- Update author name on your document.
- Label all code chunks informatively and concisely.
- Follow the Tidyverse code style guidelines.
- Make at least 3 commits.
- Resize figures where needed; avoid tiny or huge plots.
- Turn in an organized, well formatted document.
Detecting misinformation
Misinformation and “fake news” have become a scourge on society over the past decade.1 Social media platforms have been particularly susceptible to the spread of misinformation, and the consequences can be dire.
1 See Calling Bullshit by Carl Bergstrom and Jevin D. West for a great introduction to this problem, as well as tactics for combating the spread of bullshit in your life.
In this homework, you will take on the role of a data scientist at an amorphous and vaguely described social media company. Your team is tasked with creating a machine learning model to detect misinformation in articles shared on the platform. You have been provided with a dataset of articles that have been labeled as either real or fake. Your job is to build a model that can predict whether an article is misinformation or not given the text of the article.
The data set comes from Verma et al. (2021). The authors collated a data set of over 72,000 news articles, with roughly 50% of the articles labeled as “Real” and the rest as “Fake”.
Exercise 1
What is reality anymore? What does it mean for news articles to be “real” or “fake”? How does that pose challenges to estimating a model? Respond to the questions with 2-3 thoughtful paragraphs that identify concrete challenges to the problem at hand.
This exercise requires you to use your brain for critical thinking. Don’t use ChatGPT or generative AI to write your response.
Exercise 2
Define success. Your company plans to use the real/fake labels you generate to label misinformation on the platform and reduce/deprioritize these types of posts so they are less visible. What are the potential consequences of mislabeling an article as fake when it is real? What are the potential consequences of mislabeling an article as real when it is fake? How do you balance these concerns when evaluating a model?
Respond to the questions with 2-3 thoughtful paragraphs and select an appropriate set of metrics to evaluate your models.
Ditto to this one. Use your brain, not a computer program.
Exercise 3
Partition the data set and estimate the null model. Split the data into training/test sets, with 75% allocated for training. Further partition the training set using an appropriate resampling method. You will use this resampled set for all classical ML training and evaluation. Provide a brief explanation for why you chose this specific technique.
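As a sketch of one reasonable setup (the data frame name `articles` and outcome column `label` are placeholders; substitute the names from the starter repo):

```r
set.seed(20241104)

# 75/25 train/test split, stratified on the outcome
articles_split <- initial_split(articles, prop = 0.75, strata = label)
articles_train <- training(articles_split)
articles_test  <- testing(articles_split)

# Resample the training set, e.g. with 10-fold cross-validation
articles_folds <- vfold_cv(articles_train, v = 10, strata = label)
```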
Predict whether or not the article is fake using a null model. Calculate the accuracy, Brier score, and ROC AUC of the model, as well as any other metrics you determined to be appropriate from exercise 2. Generate a confusion matrix of the predictions. Interpret the results.
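A hedged sketch of the null model fit, assuming the folds from above and a placeholder recipe `articles_rec` that produces numeric features (for example, the tf-idf recipe you will build in Exercise 4):

```r
# Null model: always predicts the majority class / base rate
null_spec <- null_model() |>
  set_engine("parsnip") |>
  set_mode("classification")

null_wf <- workflow() |>
  add_recipe(articles_rec) |>   # placeholder recipe with numeric features
  add_model(null_spec)

null_results <- fit_resamples(
  null_wf,
  resamples = articles_folds,
  metrics = metric_set(accuracy, brier_class, roc_auc),
  control = control_resamples(save_pred = TRUE)
)

collect_metrics(null_results)

# Confusion matrix from the held-out predictions
collect_predictions(null_results) |>
  conf_mat(truth = label, estimate = .pred_class)
```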
Exercise 4
Fit a naive Bayes model. Use the article text to fit a naive Bayes model predicting whether or not the article is fake.2
2 Don’t recognize this model type? Look back at the preparation materials from last week.
At minimum, create an appropriate feature engineering recipe to tokenize the data, retain the 5000 most frequently occurring tokens, and calculate tf-idf scores. But you are encouraged to go beyond the minimum (and will likely be rewarded for doing so) by using additional pre-processing steps based on your understanding of the data.
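A minimal sketch of the recipe and model specification (column names are placeholders; the "naivebayes" engine comes from {discrim}):

```r
# Minimum feature engineering: tokenize, keep the 5000 most frequent
# tokens, and convert counts to tf-idf scores
nb_rec <- recipe(label ~ text, data = articles_train) |>
  step_tokenize(text) |>
  step_tokenfilter(text, max_tokens = 5000) |>
  step_tfidf(text)

nb_spec <- naive_Bayes() |>
  set_engine("naivebayes") |>
  set_mode("classification")

nb_wf <- workflow() |>
  add_recipe(nb_rec) |>
  add_model(nb_spec)

nb_results <- fit_resamples(
  nb_wf,
  resamples = articles_folds,
  metrics = metric_set(accuracy, brier_class, roc_auc),
  control = control_resamples(save_pred = TRUE)
)
```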
Calculate the accuracy, Brier score, and ROC AUC of the model, as well as any other metrics you determined to be appropriate from exercise 2. Generate a confusion matrix of the predictions.
How does this model perform? Which outcome is it more likely to predict incorrectly?
Exercise 5
Fit a lasso regression model. Estimate a lasso logistic regression model to predict whether or not the article is fake.
At minimum, create an appropriate feature engineering recipe to tokenize the data, retain the 5000 most frequently occurring tokens, and calculate tf-idf scores. But you are encouraged to go beyond the minimum (and will likely be rewarded for doing so) by using additional pre-processing steps based on your understanding of the data.
Should you normalize the features for this model? No! Lasso regression requires all features to be scaled consistently, but because of how they are constructed, tf-idf scores are already directly comparable across tokens and do not need to be rescaled. If you include features other than tf-idf scores, then you should still normalize those features.3
Note that this allows us to leverage the sparse encoding blueprint for baking the recipe, improving the efficiency of the model fitting process.
3 Check out The effects of feature scaling for more thorough coverage of this topic.
Tune the model over the penalty hyperparameter using a regular grid of at least 30 values.
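One way to set this up, sketched below (assuming a recipe `lasso_rec` built like the one in Exercise 4; the sparse blueprint comes from {hardhat}, per the note above):

```r
# Lasso logistic regression: mixture = 1, penalty tuned
lasso_spec <- logistic_reg(penalty = tune(), mixture = 1) |>
  set_engine("glmnet") |>
  set_mode("classification")

# Sparse blueprint keeps the tf-idf matrix sparse when the recipe is baked
library(hardhat)
sparse_bp <- default_recipe_blueprint(composition = "dgCMatrix")

lasso_wf <- workflow() |>
  add_recipe(lasso_rec, blueprint = sparse_bp) |>
  add_model(lasso_spec)

# Regular grid of 30 penalty values
lasso_grid <- grid_regular(penalty(), levels = 30)

lasso_results <- tune_grid(
  lasso_wf,
  resamples = articles_folds,
  grid = lasso_grid,
  metrics = metric_set(accuracy, brier_class, roc_auc),
  control = control_grid(save_pred = TRUE)
)
```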
Calculate the accuracy, Brier score, and ROC AUC of the model, as well as any other metrics you determined to be appropriate from exercise 2. Generate a confusion matrix of the predictions from the best parameter values. Identify the most influential tokens in the model using a variable importance plot.
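For instance (selecting by ROC AUC here; substitute the metric you justified in Exercise 2):

```r
best_penalty <- select_best(lasso_results, metric = "roc_auc")

# Confusion matrix at the best penalty value
collect_predictions(lasso_results, parameters = best_penalty) |>
  conf_mat(truth = label, estimate = .pred_class)

# Variable importance: finalize the workflow, fit once on the training set,
# then extract the glmnet coefficients at the chosen penalty
lasso_fit <- lasso_wf |>
  finalize_workflow(best_penalty) |>
  fit(data = articles_train)

lasso_fit |>
  extract_fit_parsnip() |>
  vi(lambda = best_penalty$penalty) |>
  slice_max(abs(Importance), n = 20) |>
  ggplot(aes(x = abs(Importance), y = fct_reorder(Variable, abs(Importance)))) +
  geom_col() +
  labs(
    title = "Most influential tokens in the lasso model",
    x = "Importance (absolute coefficient)",
    y = NULL
  )
```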
How does this model perform? Which outcome is it more likely to predict incorrectly? Which tokens are most predictive of real/fake news? Do they make sense to you?
Exercise 6
Develop a better feature engineering strategy. Tune a new lasso regression model and improve on the previous feature engineering approach. This may include any or all of the following:
- Remove common stop words
- Remove domain-specific stop words
- Stem the tokens
- Incorporate n-grams
- Generate additional text features using step_textfeature(). This creates a series of numeric features based on the original character strings. Remember that any features that are not tf-idf scores will need to be rescaled.
But you are encouraged to go beyond the minimum (and will likely be rewarded for doing so) by using additional pre-processing steps based on your understanding of the data.
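One possible combination of the steps listed above, as a sketch (step names come from {textrecipes}; step_textfeature() additionally requires the {textfeatures} package, and step_stopwords() requires {stopwords}; you do not need every step, and the parameters are choices you should justify):

```r
better_rec <- recipe(label ~ text, data = articles_train) |>
  # copy the raw text so we can compute string-level features and still tokenize
  step_mutate(text_copy = text) |>
  step_textfeature(text_copy) |>
  step_tokenize(text) |>
  step_stopwords(text) |>                                  # drop common stop words
  step_stem(text) |>                                       # stem the tokens
  step_ngram(text, num_tokens = 2, min_num_tokens = 1) |>  # unigrams + bigrams
  step_tokenfilter(text, max_tokens = 5000) |>
  step_tfidf(text) |>
  # rescale the non-tf-idf features created by step_textfeature()
  step_normalize(starts_with("textfeature_"))
```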
Calculate the accuracy, Brier score, and ROC AUC of the model, as well as any other metrics you determined to be appropriate from exercise 2. Generate a confusion matrix of the predictions from the best parameter values. Identify the most influential tokens in the model using a variable importance plot.
How does this model perform? Which outcome is it more likely to predict incorrectly? Which tokens are most predictive of real/fake news? Do they make sense to you?
Exercise 7
Fit a new model of your own choosing. Fit a new type of model to predict whether or not the article is fake. This could be a tree-based approach, gradient boosting machine, simple neural network, or deep learning model. You are responsible for implementing appropriate feature engineering and/or hyperparameter tuning for the model.
- Use an appropriate partitioning strategy. You should split the training set into training and validation sets, rather than using a resampling strategy (see the sketch after this list).
- Carefully consider the architecture of the model. I highly recommend you spend some time with chapter 11 in Deep Learning with R, especially 11.3.
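A minimal sketch of that partitioning (the proportion and object names are illustrative):

```r
set.seed(20241104)

# Carve a validation set out of the existing training data
nn_split <- initial_split(articles_train, prop = 0.8, strata = label)
nn_train <- training(nn_split)   # used to fit the model
nn_val   <- testing(nn_split)    # used to evaluate while tuning
```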
Document why you chose this feature engineering/model strategy. Calculate the accuracy, Brier score, and ROC AUC of the model, as well as any other metrics you determined to be appropriate from exercise 2. Generate a confusion matrix of the predictions from the best parameter values. Identify the most influential tokens in the model using a variable importance plot (if you choose a compatible model).
How does this model perform? Which outcome is it more likely to predict incorrectly? Which tokens are most predictive of real/fake news? Do they make sense to you?
Exercise 8
Finalize a predictive model. Choose a model from a previous exercise (or try something else) and finalize it. Report the final accuracy, Brier score, and ROC AUC of the model, and generate a calibration plot for the best performing model. Interpret the performance of your model in the context of the original objective: will this model be appropriate to use in a production environment?
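A hedged sketch of the finalization step, assuming you selected the lasso workflow and `best_penalty` from Exercise 5 (swap in whichever workflow and parameter values you chose); the calibration plot uses the {probably} package:

```r
final_wf <- lasso_wf |>
  finalize_workflow(best_penalty)

# Fit on the full training set, evaluate once on the held-out test set
final_fit <- last_fit(
  final_wf,
  split = articles_split,
  metrics = metric_set(accuracy, brier_class, roc_auc)
)

collect_metrics(final_fit)

# Calibration plot of the test set probabilities
library(probably)
collect_predictions(final_fit) |>
  cal_plot_breaks(truth = label, estimate = .pred_fake)  # adjust the .pred_ column to your factor levels
```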
Generative AI (GAI) self-reflection
As stated in the syllabus, include a written reflection for this assignment of how you used GAI tools (e.g. what tools you used, how you used them to assist you with writing code), what skills you believe you acquired, and how you believe you demonstrated mastery of the learning objectives.
Render, commit, and push one last time.
Make sure that you commit and push all changed documents and your Git pane is completely empty before proceeding.
Wrap up
Submission
- Go to http://www.gradescope.com and click Log in in the top right corner.
- Click School Credentials → Cornell University NetID and log in using your NetID credentials.
- Click on your INFO 4940/5940 course.
- Click on the assignment, and you’ll be prompted to submit it.
- Mark all the pages associated with each exercise. All the pages of your homework should be associated with at least one question (i.e., should be “checked”).
Grading
- Exercise 1: 5 points
- Exercise 2: 5 points
- Exercise 3: 2 points
- Exercise 4: 5 points
- Exercise 5: 8 points
- Exercise 6: 10 points
- Exercise 7: 10 points
- Exercise 8: 5 points
- GAI self-reflection: 0 points
- Total: 50 points