AE 03: Predicting forestation (II)

Setup

library(tidymodels)

── Attaching packages ────────────────────────────────────── tidymodels 1.2.0 ──

✔ broom        1.0.6      ✔ recipes      1.0.10
✔ dials        1.3.0      ✔ rsample      1.2.1 
✔ dplyr        1.1.4      ✔ tibble       3.2.1 
✔ ggplot2      3.5.1      ✔ tidyr        1.3.1 
✔ infer        1.0.7      ✔ tune         1.2.1 
✔ modeldata    1.4.0      ✔ workflows    1.1.4 
✔ parsnip      1.2.1      ✔ workflowsets 1.1.0 
✔ purrr        1.0.2      ✔ yardstick    1.3.1

── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
✖ purrr::%||%()    masks base::%||%()
✖ purrr::discard() masks scales::discard()
✖ dplyr::filter()  masks stats::filter()
✖ dplyr::lag()     masks stats::lag()
✖ recipes::step()  masks stats::step()
• Search for functions across packages at https://www.tidymodels.org/find/

library(forested)

set.seed(123)
forested_split <- initial_split(forested, prop = 0.8)
forested_train <- training(forested_split)
forested_test <- testing(forested_split)

# decrease cost_complexity from its default 0.01 to make a more
# complex and performant tree. see `?decision_tree()` to learn more.
tree_spec <- decision_tree(cost_complexity = 0.0001, mode = "classification")
forested_wflow <- workflow(forested ~ ., tree_spec)
forested_fit <- fit(forested_wflow, forested_train)

Metrics for model performance

Metric sets are a way to combine multiple similar metric functions together into a new function.

forested_metrics <- metric_set(accuracy, specificity, sensitivity)

Your turn: Apply the forested_metrics metric set to augment() output for the decision tree model grouped by tree_no_tree.

Do any metrics differ substantially between groups?

augment(forested_fit, new_data = forested_train) |>
  group_by(tree_no_tree) |>
  forested_metrics(truth = forested, estimate = .pred_class)

# A tibble: 6 × 4
  tree_no_tree .metric     .estimator .estimate
  <fct>        <chr>       <chr>          <dbl>
1 Tree         accuracy    binary         0.946
2 No tree      accuracy    binary         0.941
3 Tree         specificity binary         0.582
4 No tree      specificity binary         0.974
5 Tree         sensitivity binary         0.984
6 No tree      sensitivity binary         0.762

Add response here. Specificity and sensitivity differ substantially between groups. The model does much better at predicting forested land when there are trees present. The model does much better at predicting non-forested land when trees are not present.

Your turn: Compute and plot an ROC curve for the decision tree model.

What data are being used for this ROC curve plot?

# Your code here!
augment(forested_fit, new_data = forested_train) |>
  roc_curve(truth = forested, .pred_Yes) |>
  autoplot()

Add response here. This is constructed using the training set.

Dangers of overfitting

Your turn: Use augment() and a metric function to compute a classification metric like brier_class().

Compute the metrics for both training and testing data to demonstrate overfitting!

Notice the evidence of overfitting!

augment(forested_fit, new_data = forested_train) |>
  brier_class(truth = forested, .pred_Yes)

# A tibble: 1 × 3
  .metric     .estimator .estimate
  <chr>       <chr>          <dbl>
1 brier_class binary        0.0469

augment(forested_fit, new_data = forested_test) |>
  brier_class(truth = forested, .pred_Yes)

# A tibble: 1 × 3
  .metric     .estimator .estimate
  <chr>       <chr>          <dbl>
1 brier_class binary        0.0888

Resampling and cross-validation

Your turn: If we use 10 folds, what percent of the training data:

ends up in analysis
ends up in assessment

for each fold?

Add response here. 90% of observations are used for analysis, 10% of observations are used for assessment.

We’ll use this setup:

set.seed(123)
forested_folds <- vfold_cv(forested_train, v = 10)
forested_folds

#  10-fold cross-validation 
# A tibble: 10 × 2
   splits             id    
   <list>             <chr> 
 1 <split [5116/569]> Fold01
 2 <split [5116/569]> Fold02
 3 <split [5116/569]> Fold03
 4 <split [5116/569]> Fold04
 5 <split [5116/569]> Fold05
 6 <split [5117/568]> Fold06
 7 <split [5117/568]> Fold07
 8 <split [5117/568]> Fold08
 9 <split [5117/568]> Fold09
10 <split [5117/568]> Fold10

Create a random forest model

rf_spec <- rand_forest(trees = 1000, mode = "classification")
rf_spec

Random Forest Model Specification (classification)

Main Arguments:
  trees = 1000

Computational engine: ranger

rf_wflow <- workflow(forested ~ ., rf_spec)
rf_wflow

══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Formula
Model: rand_forest()

── Preprocessor ────────────────────────────────────────────────────────────────
forested ~ .

── Model ───────────────────────────────────────────────────────────────────────
Random Forest Model Specification (classification)

Main Arguments:
  trees = 1000

Computational engine: ranger

Your turn: Use fit_resamples() and rf_wflow to:

Keep predictions
Compute metrics

ctrl_forested <- control_resamples(save_pred = TRUE)

# Random forest uses random numbers so set the seed first

set.seed(234)
rf_res <- fit_resamples(rf_wflow, forested_folds, control = ctrl_forested)
collect_metrics(rf_res)

# A tibble: 3 × 6
  .metric     .estimator   mean     n std_err .config             
  <chr>       <chr>       <dbl> <int>   <dbl> <chr>               
1 accuracy    binary     0.918     10 0.00563 Preprocessor1_Model1
2 brier_class binary     0.0616    10 0.00338 Preprocessor1_Model1
3 roc_auc     binary     0.972     10 0.00308 Preprocessor1_Model1

Tuning models

Your turn: Modify your model workflow to tune one or more parameters.

Use grid search to find the best parameter(s).

rf_spec <- rand_forest(min_n = tune()) |> 
  set_mode("classification")

rf_wflow <- workflow(forested ~ ., rf_spec)
rf_wflow

══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Formula
Model: rand_forest()

── Preprocessor ────────────────────────────────────────────────────────────────
forested ~ .

── Model ───────────────────────────────────────────────────────────────────────
Random Forest Model Specification (classification)

Main Arguments:
  min_n = tune()

Computational engine: ranger

set.seed(22)
rf_res <- tune_grid(
  rf_wflow,
  resamples = forested_folds,
  grid = 5
)

show_best(rf_res)

Warning in show_best(rf_res): No value of `metric` was given; "roc_auc" will be
used.

# A tibble: 5 × 7
  min_n .metric .estimator  mean     n std_err .config             
  <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
1    21 roc_auc binary     0.972    10 0.00295 Preprocessor1_Model4
2     6 roc_auc binary     0.972    10 0.00303 Preprocessor1_Model2
3    31 roc_auc binary     0.972    10 0.00317 Preprocessor1_Model3
4    13 roc_auc binary     0.972    10 0.00311 Preprocessor1_Model5
5    33 roc_auc binary     0.972    10 0.00322 Preprocessor1_Model1

best_parameter <- select_best(rf_res)

Warning in select_best(rf_res): No value of `metric` was given; "roc_auc" will
be used.

best_parameter

# A tibble: 1 × 2
  min_n .config             
  <int> <chr>               
1    21 Preprocessor1_Model4

Acknowledgments

Materials derived in part from Machine learning with {tidymodels} and licensed under a Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA) License.

Session information

sessioninfo::session_info()

─ Session info ───────────────────────────────────────────────────────────────
 setting  value
 version  R version 4.4.1 (2024-06-14)
 os       macOS Sonoma 14.6.1
 system   aarch64, darwin20
 ui       X11
 language (EN)
 collate  en_US.UTF-8
 ctype    en_US.UTF-8
 tz       America/New_York
 date     2024-09-11
 pandoc   3.3 @ /usr/local/bin/ (via rmarkdown)

─ Packages ───────────────────────────────────────────────────────────────────
 package      * version    date (UTC) lib source
 backports      1.5.0      2024-05-23 [1] CRAN (R 4.4.0)
 broom        * 1.0.6      2024-05-17 [1] CRAN (R 4.4.0)
 class          7.3-22     2023-05-03 [1] CRAN (R 4.4.0)
 cli            3.6.3      2024-06-21 [1] CRAN (R 4.4.0)
 codetools      0.2-20     2024-03-31 [1] CRAN (R 4.4.1)
 colorspace     2.1-1      2024-07-26 [1] CRAN (R 4.4.0)
 data.table     1.15.4     2024-03-30 [1] CRAN (R 4.3.1)
 dials        * 1.3.0      2024-07-30 [1] CRAN (R 4.4.0)
 DiceDesign     1.10       2023-12-07 [1] CRAN (R 4.3.1)
 digest         0.6.35     2024-03-11 [1] CRAN (R 4.3.1)
 dplyr        * 1.1.4      2023-11-17 [1] CRAN (R 4.3.1)
 evaluate       0.24.0     2024-06-10 [1] CRAN (R 4.4.0)
 fansi          1.0.6      2023-12-08 [1] CRAN (R 4.3.1)
 farver         2.1.2      2024-05-13 [1] CRAN (R 4.3.3)
 fastmap        1.2.0      2024-05-15 [1] CRAN (R 4.4.0)
 foreach        1.5.2      2022-02-02 [1] CRAN (R 4.3.0)
 forested     * 0.1.0      2024-07-31 [1] CRAN (R 4.4.0)
 furrr          0.3.1      2022-08-15 [1] CRAN (R 4.3.0)
 future         1.33.2     2024-03-26 [1] CRAN (R 4.3.1)
 future.apply   1.11.2     2024-03-28 [1] CRAN (R 4.3.1)
 generics       0.1.3      2022-07-05 [1] CRAN (R 4.3.0)
 ggplot2      * 3.5.1      2024-04-23 [1] CRAN (R 4.3.1)
 globals        0.16.3     2024-03-08 [1] CRAN (R 4.3.1)
 glue           1.7.0      2024-01-09 [1] CRAN (R 4.3.1)
 gower          1.0.1      2022-12-22 [1] CRAN (R 4.3.0)
 GPfit          1.0-8      2019-02-08 [1] CRAN (R 4.3.0)
 gtable         0.3.5      2024-04-22 [1] CRAN (R 4.3.1)
 hardhat        1.4.0      2024-06-02 [1] CRAN (R 4.4.0)
 here           1.0.1      2020-12-13 [1] CRAN (R 4.3.0)
 htmltools      0.5.8.1    2024-04-04 [1] CRAN (R 4.3.1)
 htmlwidgets    1.6.4      2023-12-06 [1] CRAN (R 4.3.1)
 infer        * 1.0.7      2024-03-25 [1] CRAN (R 4.3.1)
 ipred          0.9-14     2023-03-09 [1] CRAN (R 4.3.0)
 iterators      1.0.14     2022-02-05 [1] CRAN (R 4.3.0)
 jsonlite       1.8.8      2023-12-04 [1] CRAN (R 4.3.1)
 knitr          1.47       2024-05-29 [1] CRAN (R 4.4.0)
 labeling       0.4.3      2023-08-29 [1] CRAN (R 4.3.0)
 lattice        0.22-6     2024-03-20 [1] CRAN (R 4.4.0)
 lava           1.8.0      2024-03-05 [1] CRAN (R 4.3.1)
 lhs            1.1.6      2022-12-17 [1] CRAN (R 4.3.0)
 lifecycle      1.0.4      2023-11-07 [1] CRAN (R 4.3.1)
 listenv        0.9.1      2024-01-29 [1] CRAN (R 4.3.1)
 lubridate      1.9.3      2023-09-27 [1] CRAN (R 4.3.1)
 magrittr       2.0.3      2022-03-30 [1] CRAN (R 4.3.0)
 MASS           7.3-61     2024-06-13 [1] CRAN (R 4.4.0)
 Matrix         1.7-0      2024-03-22 [1] CRAN (R 4.4.0)
 modeldata    * 1.4.0      2024-06-19 [1] CRAN (R 4.4.0)
 modelenv       0.1.1      2023-03-08 [1] CRAN (R 4.3.0)
 munsell        0.5.1      2024-04-01 [1] CRAN (R 4.3.1)
 nnet           7.3-19     2023-05-03 [1] CRAN (R 4.4.0)
 parallelly     1.37.1     2024-02-29 [1] CRAN (R 4.3.1)
 parsnip      * 1.2.1      2024-03-22 [1] CRAN (R 4.3.1)
 pillar         1.9.0      2023-03-22 [1] CRAN (R 4.3.0)
 pkgconfig      2.0.3      2019-09-22 [1] CRAN (R 4.3.0)
 prodlim        2023.08.28 2023-08-28 [1] CRAN (R 4.3.0)
 purrr        * 1.0.2      2023-08-10 [1] CRAN (R 4.3.0)
 R6             2.5.1      2021-08-19 [1] CRAN (R 4.3.0)
 ranger       * 0.16.0     2023-11-12 [1] CRAN (R 4.3.1)
 Rcpp           1.0.12     2024-01-09 [1] CRAN (R 4.3.1)
 recipes      * 1.0.10     2024-02-18 [1] CRAN (R 4.3.1)
 rlang          1.1.4      2024-06-04 [1] CRAN (R 4.3.3)
 rmarkdown      2.27       2024-05-17 [1] CRAN (R 4.4.0)
 rpart          4.1.23     2023-12-05 [1] CRAN (R 4.4.0)
 rprojroot      2.0.4      2023-11-05 [1] CRAN (R 4.3.1)
 rsample      * 1.2.1      2024-03-25 [1] CRAN (R 4.3.1)
 rstudioapi     0.16.0     2024-03-24 [1] CRAN (R 4.3.1)
 scales       * 1.3.0.9000 2024-05-07 [1] Github (r-lib/scales@c0f79d3)
 sessioninfo    1.2.2      2021-12-06 [1] CRAN (R 4.3.0)
 survival       3.7-0      2024-06-05 [1] CRAN (R 4.4.0)
 tibble       * 3.2.1      2023-03-20 [1] CRAN (R 4.3.0)
 tidymodels   * 1.2.0      2024-03-25 [1] CRAN (R 4.3.1)
 tidyr        * 1.3.1      2024-01-24 [1] CRAN (R 4.3.1)
 tidyselect     1.2.1      2024-03-11 [1] CRAN (R 4.3.1)
 timechange     0.3.0      2024-01-18 [1] CRAN (R 4.3.1)
 timeDate       4032.109   2023-12-14 [1] CRAN (R 4.3.1)
 tune         * 1.2.1      2024-04-18 [1] CRAN (R 4.3.1)
 utf8           1.2.4      2023-10-22 [1] CRAN (R 4.3.1)
 vctrs          0.6.5      2023-12-01 [1] CRAN (R 4.3.1)
 withr          3.0.1      2024-07-31 [1] CRAN (R 4.4.0)
 workflows    * 1.1.4      2024-02-19 [1] CRAN (R 4.3.1)
 workflowsets * 1.1.0      2024-03-21 [1] CRAN (R 4.3.1)
 xfun           0.45       2024-06-16 [1] CRAN (R 4.4.0)
 yaml           2.3.8      2023-12-11 [1] CRAN (R 4.3.1)
 yardstick    * 1.3.1      2024-03-21 [1] CRAN (R 4.3.1)

 [1] /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/library

──────────────────────────────────────────────────────────────────────────────