AE 03: Predicting forestation (II)

Application exercise
Modified: September 5, 2024

Important

Go to the course GitHub organization and locate the repo titled ae-03-YOUR_GITHUB_USERNAME to get started.

Setup

library(tidymodels)
library(forested)

set.seed(123)
forested_split <- initial_split(forested, prop = 0.8)
forested_train <- training(forested_split)
forested_test <- testing(forested_split)

# decrease cost_complexity from its default 0.01 to make a more
# complex and performant tree. See `?decision_tree()` to learn more.
tree_spec <- decision_tree(cost_complexity = 0.0001, mode = "classification")
forested_wflow <- workflow(forested ~ ., tree_spec)
forested_fit <- fit(forested_wflow, forested_train)

Metrics for model performance

A metric set combines several metric functions of the same kind into a single new function that computes all of them at once.

forested_metrics <- metric_set(accuracy, specificity, sensitivity)
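
The resulting forested_metrics() behaves like a single yardstick metric function: it takes a data frame plus truth and estimate columns and returns one row per metric. A minimal sketch, using the test set purely for illustration:

augment(forested_fit, new_data = forested_test) |>
  forested_metrics(truth = forested, estimate = .pred_class)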

Your turn: Apply the forested_metrics metric set to the augment() output for the decision tree model, grouped by tree_no_tree.

Do any metrics differ substantially between groups?

augment(forested_fit, new_data = ______) |>
  group_by(______) |>
  ______(truth = forested, estimate = .pred_class)
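
One possible completion, assuming the training set is the intended data (the prompt doesn't name a split):

augment(forested_fit, new_data = forested_train) |>
  group_by(tree_no_tree) |>
  # compute each metric separately within the tree/no-tree groups
  forested_metrics(truth = forested, estimate = .pred_class)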

Your turn: Compute and plot an ROC curve for the decision tree model.

What data are being used for this ROC curve plot?

# Your code here!
augment(forested_fit, new_data = ______) |>
  ______(truth = forested, .pred_Yes) |>
  autoplot()
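
A possible completion, again assuming the training set; roc_curve() takes the truth column and the predicted probability of the event class, .pred_Yes:

augment(forested_fit, new_data = forested_train) |>
  roc_curve(truth = forested, .pred_Yes) |>
  autoplot()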

Add response here.

Dangers of overfitting

Your turn: Use augment() and a metric function to compute a classification metric like brier_class().

Compute the metric for both the training and testing data, and notice the evidence of overfitting!

augment(forested_fit, new_data = forested_train) |>
  TODO

augment(forested_fit, new_data = forested_test) |>
  TODO
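
One way to fill in the TODOs with brier_class(), which takes the truth column and the predicted probability of the event class:

# Brier score on the data the tree was trained on (optimistic)
augment(forested_fit, new_data = forested_train) |>
  brier_class(truth = forested, .pred_Yes)

# Brier score on held-out data (a more honest estimate)
augment(forested_fit, new_data = forested_test) |>
  brier_class(truth = forested, .pred_Yes)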

Resampling and cross-validation

Your turn: If we use 10 folds, what percent of the training data:

  • ends up in the analysis set
  • ends up in the assessment set

for each fold?

Add response here.

We’ll use this setup:

set.seed(123)
forested_folds <- vfold_cv(forested_train, v = 10)
forested_folds
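
Printing a single split shows its analysis/assessment/total row counts, which is one way to check your answer; a quick look at the first fold:

# an rsplit prints as <Analysis/Assess/Total>
forested_folds$splits[[1]]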

Create a random forest model

rf_spec <- rand_forest(trees = 1000, mode = "classification")
rf_spec
rf_wflow <- workflow(forested ~ ., rf_spec)
rf_wflow

Your turn: Use fit_resamples() and rf_wflow to:

  • Keep predictions
  • Compute metrics

# control option to keep predictions
ctrl_forested <- control_resamples(TODO)

# Random forest uses random numbers so set the seed first
set.seed(234)
rf_res <- fit_resamples(TODO)
collect_metrics(rf_res)
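
One possible way to fill in the TODOs, keeping the held-out predictions with save_pred = TRUE:

# control option to keep predictions
ctrl_forested <- control_resamples(save_pred = TRUE)

# Random forest uses random numbers so set the seed first
set.seed(234)
rf_res <- fit_resamples(
  rf_wflow,
  resamples = forested_folds,
  control = ctrl_forested
)
collect_metrics(rf_res)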

Tuning models

Your turn: Modify your model workflow to tune one or more parameters.

Use grid search to find the best parameter(s).

rf_spec <- rand_forest(min_n = ______) |> 
  set_mode("classification")

rf_wflow <- workflow(forested ~ ., rf_spec)
rf_wflow

set.seed(22)
rf_res <- ______(
  rf_wflow,
  resamples = forested_folds,
  grid = 5
)

show_best(rf_res)

best_parameter <- select_best(rf_res)
best_parameter
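
A possible completion: tune() marks min_n as a placeholder to optimize, and tune_grid() runs the grid search over the resamples:

# flag min_n for tuning
rf_spec <- rand_forest(min_n = tune()) |>
  set_mode("classification")

rf_wflow <- workflow(forested ~ ., rf_spec)

# try 5 automatically chosen candidate values of min_n
set.seed(22)
rf_res <- tune_grid(
  rf_wflow,
  resamples = forested_folds,
  grid = 5
)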

Acknowledgments