AE 11: Slay - Predicting song artist based on lyrics
Import data
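This exercise assumes the following packages are loaded (a reasonable loading block; adjust to your setup):

# load packages used throughout this exercise
library(tidyverse)   # read_csv(), mutate(), stringr, forcats, ggplot2
library(tidymodels)  # rsample, recipes, parsnip, workflows, tune
library(textrecipes) # step_tokenize(), step_stopwords(), step_ngram(), step_tfidf()
library(themis)      # step_downsample()
library(vip)         # vi(), vip()
library(scales)      # label_percent()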
lyrics <- read_csv(file = "data/beyonce-swift-lyrics.csv") |>
  mutate(artist = factor(artist))
lyrics
# A tibble: 355 × 19
album_name track_number track_name artist lyrics danceability energy loudness
<chr> <dbl> <chr> <fct> <chr> <dbl> <dbl> <dbl>
1 COWBOY CA… 1 AMERIICAN… Beyon… "Noth… 0.374 0.515 -6.48
2 COWBOY CA… 3 16 CARRIA… Beyon… "Sixt… 0.525 0.456 -7.04
3 COWBOY CA… 5 MY ROSE Beyon… "How … 0.384 0.177 -11.8
4 COWBOY CA… 7 TEXAS HOL… Beyon… "This… 0.727 0.711 -6.55
5 COWBOY CA… 8 BODYGUARD Beyon… "One,… 0.726 0.779 -5.43
6 COWBOY CA… 10 JOLENE Beyon… "(Jol… 0.567 0.812 -5.43
7 COWBOY CA… 11 DAUGHTER Beyon… "Your… 0.374 0.448 -10.0
8 COWBOY CA… 13 ALLIIGATO… Beyon… "High… 0.618 0.651 -9.66
9 COWBOY CA… 18 FLAMENCO Beyon… "My m… 0.497 0.351 -9.25
10 COWBOY CA… 20 YA YA Beyon… "Hell… 0.617 0.904 -5.37
# ℹ 345 more rows
# ℹ 11 more variables: speechiness <dbl>, acousticness <dbl>,
# instrumentalness <dbl>, liveness <dbl>, valence <dbl>, tempo <dbl>,
# time_signature <dbl>, duration_ms <dbl>, explicit <lgl>, key_name <chr>,
# mode_name <chr>
Split the data into analysis/assessment/test sets
Demonstration:
- Split the data into training/test sets with 75% allocated for training
- Split the training set into 10 cross-validation folds
# split into training/testing
set.seed(123)
lyrics_split <- initial_split(data = lyrics, strata = artist, prop = 0.75)
lyrics_train <- training(lyrics_split)
lyrics_test <- testing(lyrics_split)
# create 10 cross-validation folds
lyrics_folds <- vfold_cv(data = lyrics_train, v = 10, strata = artist)
Regularized regression
Define the feature engineering recipe
Demonstration:
- Define a feature engineering recipe to predict the song’s artist as a function of the lyrics + audio features
- Exclude the ID variables from the recipe
- Tokenize the song lyrics
- Remove stop words
- Calculate all possible 1-grams, 2-grams, 3-grams, 4-grams, and 5-grams (see the toy sketch after this list)
- Only keep the 4000 most frequently appearing tokens
- Calculate tf-idf scores for the remaining tokens
  - This will generate one column for every token. Each column will have the standardized name tfidf_lyrics_* where * is the specific token. Instead we would prefer the column names to simply be *. You can remove the tfidf_lyrics_ prefix using

      # Simplify these names
      step_rename_at(
        starts_with("tfidf_lyrics_"),
        fn = \(x) str_replace_all(
          string = x,
          pattern = "tfidf_lyrics_",
          replacement = ""
        )
      )

  - This does cause a conflict between the energy audio feature and the energy token. Before removing the tfidf_lyrics_ prefix, we will add an af_ prefix to the audio features to avoid this conflict.

      # Simplify these names
      step_rename_at(
        all_predictors(), -starts_with("tfidf_lyrics_"),
        fn = \(x) str_glue("af_{x}")
      )

- Apply required steps for regularized regression models:
  - Convert the explicit variable to a factor
  - Convert nominal predictors to dummy variables
  - Get rid of zero-variance predictors
  - Normalize all predictors to mean of 0 and variance of 1
- Downsample the observations so there are an equal number of songs by Beyoncé and Taylor Swift in the analysis set
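For intuition about the n-gram step, here is a toy sketch using the standalone tokenizers package (the engine behind step_tokenize()); the sentence is hypothetical, and the output approximates what step_ngram() produces:

# all 1- through 3-grams for a single hypothetical lyric line
# (step_ngram() joins tokens with "_" rather than spaces)
library(tokenizers)
tokenize_ngrams("I knew you were trouble when you walked in", n = 3, n_min = 1)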
glmnet_rec <- recipe(artist ~ ., data = lyrics_train) |>
  # exclude ID variables
  update_role(album_name, track_number, track_name, new_role = "id vars") |>
  # tokenize and prep lyrics
  step_tokenize(lyrics) |>
  step_stopwords(lyrics) |>
  step_ngram(lyrics, num_tokens = 5L, min_num_tokens = 1L) |>
  step_tokenfilter(lyrics, max_tokens = 4000) |>
  step_tfidf(lyrics) |>
  # simplify the column names: prefix the audio features with "af_",
  # then strip the "tfidf_lyrics_" prefix from the token columns
  step_rename_at(
    all_predictors(), -starts_with("tfidf_lyrics_"),
    fn = \(x) str_glue("af_{x}")
  ) |>
  step_rename_at(
    starts_with("tfidf_lyrics_"),
    fn = \(x) str_replace_all(
      string = x,
      pattern = "tfidf_lyrics_",
      replacement = ""
    )
  ) |>
  # convert the explicit variable to a factor
  step_bin2factor(af_explicit) |>
  # required steps for regularized regression:
  # dummy-code, drop zero-variance predictors, normalize
  step_dummy(all_nominal_predictors()) |>
  step_zv(all_predictors()) |>
  step_normalize(all_numeric_predictors()) |>
  # balance the classes in the analysis set
  step_downsample(artist)
glmnet_rec
What does this produce?
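The baked analysis set shown below can be produced by prepping the recipe and baking the data it was estimated on (a sketch; this step is implied rather than shown):

# estimate the recipe on the analysis data, then apply it
# to materialize the processed design matrix
glmnet_rec |>
  prep() |>
  bake(new_data = NULL)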
# A tibble: 176 × 4,028
album_name track_number track_name af_danceability af_energy af_loudness
<fct> <dbl> <fct> <dbl> <dbl> <dbl>
1 COWBOY CARTER 7 TEXAS HOLD … 1.10 0.782 0.478
2 COWBOY CARTER 8 BODYGUARD 1.10 1.16 0.920
3 COWBOY CARTER 10 JOLENE -0.113 1.35 0.919
4 COWBOY CARTER 11 DAUGHTER -1.58 -0.689 -0.903
5 COWBOY CARTER 13 ALLIIGATOR … 0.275 0.446 -0.753
6 COWBOY CARTER 18 FLAMENCO -0.646 -1.23 -0.589
7 COWBOY CARTER 22 DESERT EAGLE 0.831 -0.459 -0.278
8 COWBOY CARTER 23 RIIVERDANCE 0.892 1.23 -0.304
9 COWBOY CARTER 24 II HANDS II… 0.701 0.172 -0.0427
10 COWBOY CARTER 27 AMEN -1.89 -0.666 0.266
# ℹ 166 more rows
# ℹ 4,022 more variables: af_speechiness <dbl>, af_acousticness <dbl>,
# af_instrumentalness <dbl>, af_liveness <dbl>, af_valence <dbl>,
# af_tempo <dbl>, af_time_signature <dbl>, af_duration_ms <dbl>,
# artist <fct>, `1` <dbl>, `2` <dbl>, `3` <dbl>, `4` <dbl>, across <dbl>,
# act <dbl>, actin <dbl>, acting <dbl>, actress <dbl>, adore <dbl>,
# adore_adore <dbl>, adore_adore_adore <dbl>, affair <dbl>, afraid <dbl>, …
Tune the regularized regression model
Demonstration:
- Define the regularized regression model specification, including a tuning placeholder for penalty
- Create the workflow object
- Define a tuning grid over penalty
- Tune the model using the cross-validation folds
- Evaluate the tuning procedure and identify the best performing models based on ROC AUC
# define the regularized regression model specification
# (mixture = 1 corresponds to a pure lasso penalty)
glmnet_spec <- logistic_reg(penalty = tune(), mixture = 1) |>
  set_mode("classification") |>
  set_engine("glmnet")
# define the new workflow
glmnet_wf <- workflow() |>
  add_recipe(glmnet_rec) |>
  add_model(glmnet_spec)
# create the tuning grid
glmnet_grid <- grid_regular(
  penalty(),
  levels = 30
)
# tune over the model hyperparameters
glmnet_tune <- tune_grid(
  object = glmnet_wf,
  resamples = lyrics_folds,
  grid = glmnet_grid,
  # save the workflow so fit_best() can be used later
  control = control_grid(save_workflow = TRUE)
)
# evaluate results
autoplot(glmnet_tune)
# identify the five best hyperparameter combinations
show_best(x = glmnet_tune, metric = "roc_auc")
# A tibble: 5 × 7
penalty .metric .estimator mean n std_err .config
<dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
1 1.89e- 2 roc_auc binary 0.850 10 0.0260 Preprocessor1_Model25
2 3.86e- 3 roc_auc binary 0.846 10 0.0260 Preprocessor1_Model23
3 4.18e- 2 roc_auc binary 0.846 10 0.0264 Preprocessor1_Model26
4 1 e-10 roc_auc binary 0.846 10 0.0262 Preprocessor1_Model01
5 2.21e-10 roc_auc binary 0.846 10 0.0262 Preprocessor1_Model02
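The held-out test set from the initial split is not used in this demonstration. One possible follow-up, sketched here, is to finalize the workflow with the best penalty and evaluate it once on the test set:

# finalize the workflow with the best penalty value,
# refit on the full training set, and assess on the test set
glmnet_final <- glmnet_wf |>
  finalize_workflow(select_best(x = glmnet_tune, metric = "roc_auc")) |>
  last_fit(split = lyrics_split)
collect_metrics(glmnet_final)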
Variable importance
Your turn: Identify the most relevant features for predicting whether a song is by Beyoncé or Taylor Swift using the regularized regression model. How many features are used in the final model?
# extract parsnip model fit
glmnet_imp <- glmnet_tune |>
  fit_best() |>
  extract_fit_parsnip() |>
  vi(method = "model", lambda = select_best(x = glmnet_tune, metric = "roc_auc")$penalty)
# top features
vip(glmnet_imp)
# features with non-zero coefficients
glmnet_imp |>
  # importance must be greater than 0
  filter(Importance > 0)
# A tibble: 108 × 3
Variable Importance Sign
<chr> <dbl> <chr>
1 af_speechiness 1.46 NEG
2 problems 0.806 NEG
3 love 0.623 NEG
4 tear 0.580 NEG
5 missed 0.574 NEG
6 af_key_name_C. 0.506 NEG
7 one 0.350 POS
8 just 0.338 POS
9 af_mode_name_minor 0.299 NEG
10 wanna_say 0.252 NEG
# ℹ 98 more rows
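To answer the count question directly, the number of features retained in the final model is the number of rows with non-zero importance (a one-liner sketch):

# count features with non-zero coefficients in the final model
glmnet_imp |>
  filter(Importance > 0) |>
  nrow()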
Add response here.
Your turn: Interpret the most relevant features for predicting whether a song is by Beyoncé or Taylor Swift. As a gut check, do these results make sense to you based on your knowledge of these artists?
# clean up the data frame for visualization
glmnet_imp |>
  # because Taylor Swift is labeled the "success" outcome,
  # positive coefficients indicate higher probability of being a Taylor Swift song
  mutate(Sign = if_else(Sign == "POS", "Taylor Swift", "Beyoncé")) |>
  # importance must be greater than 0
  filter(Importance > 0) |>
  # keep top 30 features for each artist
  slice_max(n = 30, order_by = Importance, by = Sign) |>
  mutate(Variable = fct_reorder(.f = Variable, .x = Importance)) |>
  ggplot(mapping = aes(
    x = Importance,
    y = Variable,
    fill = Sign
  )) +
  geom_col(show.legend = FALSE) +
  scale_fill_brewer(type = "qual") +
  facet_wrap(facets = vars(Sign), scales = "free_y") +
  labs(
    y = NULL,
    title = "Most relevant features for predicting whether a song\nis by Beyoncé or Taylor Swift",
    subtitle = "Regularized regression model"
  )
Add response here.
Fit a random forest model
Define the feature engineering recipe
Demonstration:
- Define a feature engineering recipe to predict the song’s artist as a function of the lyrics + audio features
- Exclude the ID variables from the recipe
- Tokenize the song lyrics
- Remove stop words
- Only keep the 500 most frequently appearing tokens
- Calculate tf-idf scores for the remaining tokens
- Rename the audio feature and tf-idf columns as before
- Downsample the observations so there are an equal number of songs by Beyoncé and Taylor Swift in the analysis set
# define preprocessing recipe
rf_rec <- recipe(artist ~ ., data = lyrics_train) |>
  # exclude ID variables
  update_role(album_name, track_number, track_name, new_role = "id vars") |>
  step_tokenize(lyrics) |>
  step_stopwords(lyrics) |>
  step_tokenfilter(lyrics, max_tokens = 500) |>
  step_tfidf(lyrics) |>
  # simplify the column names as before
  step_rename_at(
    all_predictors(), -starts_with("tfidf_lyrics_"),
    fn = \(x) str_glue("af_{x}")
  ) |>
  step_rename_at(
    starts_with("tfidf_lyrics_"),
    fn = \(x) str_replace_all(
      string = x,
      pattern = "tfidf_lyrics_",
      replacement = ""
    )
  ) |>
  # balance the classes in the analysis set
  step_downsample(artist)
rf_rec
What does this produce?
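As with the regularized regression recipe, the baked output below can be generated by prepping and baking (a sketch):

rf_rec |>
  prep() |>
  bake(new_data = NULL)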
# A tibble: 176 × 518
album_name track_number track_name af_danceability af_energy af_loudness
<fct> <dbl> <fct> <dbl> <dbl> <dbl>
1 COWBOY CARTER 7 TEXAS HOLD … 0.727 0.711 -6.55
2 COWBOY CARTER 8 BODYGUARD 0.726 0.779 -5.43
3 COWBOY CARTER 10 JOLENE 0.567 0.812 -5.43
4 COWBOY CARTER 11 DAUGHTER 0.374 0.448 -10.0
5 COWBOY CARTER 13 ALLIIGATOR … 0.618 0.651 -9.66
6 COWBOY CARTER 18 FLAMENCO 0.497 0.351 -9.25
7 COWBOY CARTER 22 DESERT EAGLE 0.691 0.489 -8.46
8 COWBOY CARTER 23 RIIVERDANCE 0.699 0.791 -8.53
9 COWBOY CARTER 24 II HANDS II… 0.674 0.602 -7.87
10 COWBOY CARTER 27 AMEN 0.333 0.452 -7.09
# ℹ 166 more rows
# ℹ 512 more variables: af_speechiness <dbl>, af_acousticness <dbl>,
# af_instrumentalness <dbl>, af_liveness <dbl>, af_valence <dbl>,
# af_tempo <dbl>, af_time_signature <dbl>, af_duration_ms <dbl>,
# af_explicit <lgl>, af_key_name <fct>, af_mode_name <fct>, artist <fct>,
# adore <dbl>, afraid <dbl>, ah <dbl>, `ain't` <dbl>, air <dbl>, alive <dbl>,
# almost <dbl>, alone <dbl>, along <dbl>, alright <dbl>, always <dbl>, …
Fit the model
Demonstration:
- Define a random forest model grown with 1000 trees using the ranger engine.
- Define a workflow using the feature engineering recipe and random forest model specification. Fit the workflow using the entire training set (no cross-validation necessary).
# define the model specification
rf_spec <- rand_forest(trees = 1000) |>
  set_mode("classification") |>
  # calculate feature importance metrics using the ranger engine
  set_engine("ranger", importance = "permutation")
# define the workflow
rf_wf <- workflow() |>
  add_recipe(rf_rec) |>
  add_model(rf_spec)
# fit the model to the training set
rf_fit <- rf_wf |>
  fit(data = lyrics_train)
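Cross-validation is unnecessary here because bagging yields an out-of-bag (OOB) error estimate as a by-product; printing the fitted workflow surfaces it (a quick check):

# the ranger printout includes the OOB prediction error
rf_fit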
Feature importance
Your turn: Interpret the most relevant features for predicting whether a song is by Beyoncé or Taylor Swift using the random forest approach. As a gut check, do these results make sense to you based on your knowledge of these artists? How do they compare to the results from regularized regression?
# extract parsnip model fit
rf_imp <- rf_fit |>
  extract_fit_parsnip() |>
  vi(method = "model")
# naive attempt to interpret
vip(rf_imp)
# clean up the data frame for visualization
rf_imp |>
  # extract 30 most important n-grams
  slice_max(order_by = Importance, n = 30) |>
  mutate(Variable = fct_reorder(.f = Variable, .x = Importance)) |>
  ggplot(mapping = aes(
    x = Importance,
    y = Variable
  )) +
  geom_col() +
  scale_x_continuous(labels = label_percent()) +
  labs(
    x = "Increase in OOB error rate",
    y = NULL,
    title = "Most relevant features for predicting whether a song\nis by Beyoncé or Taylor Swift",
    subtitle = "Random forest model"
  )
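To compare the two models directly, one hypothetical approach is to join the importance tables on the feature names (dummy-coded names such as af_key_name_C. exist only in the regularized regression table, so they will not match):

# which top random forest features also receive non-zero lasso coefficients?
rf_imp |>
  slice_max(order_by = Importance, n = 30) |>
  inner_join(
    glmnet_imp |> filter(Importance > 0),
    by = "Variable",
    suffix = c("_rf", "_glmnet")
  )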
Add response here.