AE 03: Predicting forestation (II)
Go to the course GitHub organization and locate the repo titled ae-03-YOUR_GITHUB_USERNAME to get started.
Setup
library(tidymodels)
library(forested)
set.seed(123)
forested_split <- initial_split(forested, prop = 0.8)
forested_train <- training(forested_split)
forested_test <- testing(forested_split)
# decrease cost_complexity from its default 0.01 to make a more
# complex and performant tree. see `?decision_tree()` to learn more.
tree_spec <- decision_tree(cost_complexity = 0.0001, mode = "classification")
forested_wflow <- workflow(forested ~ ., tree_spec)
forested_fit <- fit(forested_wflow, forested_train)
Metrics for model performance
Metric sets are a way to combine multiple similar metric functions together into a new function.
forested_metrics <- metric_set(accuracy, specificity, sensitivity)
Your turn: Apply the forested_metrics metric set to augment() output for the decision tree model, grouped by tree_no_tree.
Do any metrics differ substantially between groups?
augment(forested_fit, new_data = ______) |>
group_by(______) |>
______(truth = forested, estimate = .pred_class)
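One way to fill in the blanks above (a sketch; it evaluates on the test set and relies on the tree_no_tree column from the forested data):

```r
# Possible completion: compute the metric set on test-set predictions,
# grouped by whether each plot contains trees (tree_no_tree)
augment(forested_fit, new_data = forested_test) |>
  group_by(tree_no_tree) |>
  forested_metrics(truth = forested, estimate = .pred_class)
```

Because the data are grouped, yardstick computes each metric separately per group, making it easy to spot performance gaps between the two classes of plots.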
Your turn: Compute and plot an ROC curve for the decision tree model.
What data are being used for this ROC curve plot?
# Your code here!
augment(forested_fit, new_data = ______) |>
______(truth = forested, .pred_Yes) |>
autoplot()
Add response here.
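One possible completion of the ROC curve scaffold (a sketch, using the test set and the predicted probability of the "Yes" class):

```r
# Possible completion: ROC curve from test-set predictions.
# roc_curve() needs the true class and the "Yes" class probability.
augment(forested_fit, new_data = forested_test) |>
  roc_curve(truth = forested, .pred_Yes) |>
  autoplot()
```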
Dangers of overfitting
Your turn: Use augment() and a metric function to compute a classification metric like brier_class().
Compute the metric on both the training and testing data, and notice the evidence of overfitting!
augment(forested_fit, new_data = forested_train) |>
TODO
augment(forested_fit, new_data = forested_test) |>
TODO
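One way to fill in the TODOs (a sketch using the Brier score; any yardstick class metric would work similarly):

```r
# Possible completion: Brier score on the training data
# (re-predicting the data the tree was fit on is optimistic)
augment(forested_fit, new_data = forested_train) |>
  brier_class(truth = forested, .pred_Yes)

# ...and on the testing data, a more honest estimate. A markedly
# better (lower) training score is evidence of overfitting.
augment(forested_fit, new_data = forested_test) |>
  brier_class(truth = forested, .pred_Yes)
```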
Resampling and cross-validation
Your turn: If we use 10 folds, what percent of the training data:
- ends up in analysis
- ends up in assessment
for each fold?
Add response here.
We’ll use this setup:
set.seed(123)
forested_folds <- vfold_cv(forested_train, v = 10)
forested_folds
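To check the analysis/assessment proportions empirically, you can count rows in a single fold (a sketch; with v = 10, expect roughly 90% analysis and 10% assessment):

```r
# Inspect the first resample: analysis() and assessment() return
# the two partitions of the training data for that fold
fold_1 <- forested_folds$splits[[1]]
nrow(analysis(fold_1)) / nrow(forested_train)    # ~0.9
nrow(assessment(fold_1)) / nrow(forested_train)  # ~0.1
```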
Create a random forest model
rf_spec <- rand_forest(trees = 1000, mode = "classification")
rf_spec
rf_wflow <- workflow(forested ~ ., rf_spec)
rf_wflow
Your turn: Use fit_resamples() and rf_wflow to:
- Keep predictions
- Compute metrics
# control option to keep predictions
ctrl_forested <- control_resamples(TODO)
# Random forest uses random numbers so set the seed first
set.seed(234)
rf_res <- fit_resamples(TODO)
collect_metrics(rf_res)
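One possible completion of the resampling scaffold (a sketch; save_pred = TRUE keeps each fold's out-of-sample predictions for later inspection):

```r
# Possible completion: keep predictions from each resample
ctrl_forested <- control_resamples(save_pred = TRUE)

# Random forest uses random numbers so set the seed first
set.seed(234)
rf_res <- fit_resamples(
  rf_wflow,
  resamples = forested_folds,
  control = ctrl_forested
)
collect_metrics(rf_res)
```

collect_metrics() then reports each metric averaged across the 10 folds, and collect_predictions(rf_res) would return the saved per-fold predictions.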
Tuning models
Your turn: Modify your model workflow to tune one or more parameters.
Use grid search to find the best parameter(s).
rf_spec <- rand_forest(min_n = ______) |>
  set_mode("classification")
rf_spec
rf_wflow <- workflow(forested ~ ., rf_spec)
rf_wflow
set.seed(22)
rf_res <- ______(
  rf_wflow,
  resamples = forested_folds,
  grid = 5
)
show_best(rf_res)
best_parameter <- select_best(rf_res)
best_parameter
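One way to fill in the blanks (a sketch: mark min_n for tuning with tune(), then search a small random grid with tune_grid()):

```r
# Possible completion: flag min_n as a tuning parameter
rf_spec <- rand_forest(min_n = tune()) |>
  set_mode("classification")
rf_wflow <- workflow(forested ~ ., rf_spec)

# grid = 5 asks tune_grid() to generate 5 candidate values,
# each evaluated across all 10 cross-validation folds
set.seed(22)
rf_res <- tune_grid(
  rf_wflow,
  resamples = forested_folds,
  grid = 5
)
show_best(rf_res)
best_parameter <- select_best(rf_res)
best_parameter
```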
Acknowledgments
- Materials derived in part from Machine learning with {tidymodels} and licensed under a Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA) License.