AE 02: Predicting forestation (I)

Application exercise
Modified

August 30, 2024

Important

Go to the course GitHub organization and locate the repo titled ae-02-YOUR_GITHUB_USERNAME to get started.

Data on forests in Washington

library(tidymodels)
── Attaching packages ────────────────────────────────────── tidymodels 1.2.0 ──
✔ broom        1.0.6      ✔ recipes      1.0.10
✔ dials        1.3.0      ✔ rsample      1.2.1 
✔ dplyr        1.1.4      ✔ tibble       3.2.1 
✔ ggplot2      3.5.1      ✔ tidyr        1.3.1 
✔ infer        1.0.7      ✔ tune         1.2.1 
✔ modeldata    1.4.0      ✔ workflows    1.1.4 
✔ parsnip      1.2.1      ✔ workflowsets 1.1.0 
✔ purrr        1.0.2      ✔ yardstick    1.3.1 
── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
✖ purrr::%||%()    masks base::%||%()
✖ purrr::discard() masks scales::discard()
✖ dplyr::filter()  masks stats::filter()
✖ dplyr::lag()     masks stats::lag()
✖ recipes::step()  masks stats::step()
• Search for functions across packages at https://www.tidymodels.org/find/
library(forested)

forested
# A tibble: 7,107 × 19
   forested  year elevation eastness northness roughness tree_no_tree dew_temp
   <fct>    <dbl>     <dbl>    <dbl>     <dbl>     <dbl> <fct>           <dbl>
 1 Yes       2005       881       90        43        63 Tree             0.04
 2 Yes       2005       113      -25        96        30 Tree             6.4 
 3 No        2005       164      -84        53        13 Tree             6.06
 4 Yes       2005       299       93        34         6 No tree          4.43
 5 Yes       2005       806       47       -88        35 Tree             1.06
 6 Yes       2005       736      -27       -96        53 Tree             1.35
 7 Yes       2005       636      -48        87         3 No tree          1.42
 8 Yes       2005       224      -65       -75         9 Tree             6.39
 9 Yes       2005        52      -62        78        42 Tree             6.5 
10 Yes       2005      2240      -67       -74        99 No tree         -5.63
# ℹ 7,097 more rows
# ℹ 11 more variables: precip_annual <dbl>, temp_annual_mean <dbl>,
#   temp_annual_min <dbl>, temp_annual_max <dbl>, temp_january_min <dbl>,
#   vapor_min <dbl>, vapor_max <dbl>, canopy_cover <dbl>, lon <dbl>, lat <dbl>,
#   land_type <fct>

Your turn

When is a good time to split your data?

Add response here.

Data splitting and spending

set.seed(123)

forested_split <- initial_split(forested)
forested_split
<Training/Testing/Total>
<5330/1777/7107>

Extract the training and testing sets

forested_train <- training(forested_split)
forested_test <- testing(forested_split)
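
By default, `initial_split()` holds out 25% of rows for testing. For a classification outcome like `forested`, it can also help to stratify the split so the Yes/No proportions stay similar in both sets. A sketch (the `strata` argument is optional, not something the exercise requires):

```r
# Stratify on the outcome so the training and testing sets
# have similar Yes/No proportions
set.seed(123)
forested_split_strat <- initial_split(forested, strata = forested)
forested_train_strat <- training(forested_split_strat)

# Compare class proportions in the full data vs. the training set
forested |> count(forested) |> mutate(prop = n / sum(n))
forested_train_strat |> count(forested) |> mutate(prop = n / sum(n))
```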

Your turn

Split your data so 20% is held out for the test set.

Try out different values in set.seed() to see how the results change.

Hint: Which argument in initial_split() handles the proportion split into training vs testing?

# add code here
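
One possible answer (a sketch, not the only one): `prop` sets the proportion of data that goes into *training*, so `prop = 0.8` leaves 20% held out for testing.

```r
# prop is the training proportion, so prop = 0.8 holds out 20%
set.seed(123)  # try other values, e.g. set.seed(456), and compare the rows
forested_split_20 <- initial_split(forested, prop = 0.8)
forested_split_20
```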

Exploratory data analysis

Explore the forested_train data on your own!

  • What’s the distribution of the outcome, forested?
  • What’s the distribution of numeric variables like precip_annual?
  • How does the distribution of forested differ across the categorical variables?
# add code here
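
A few starting points for the EDA prompts above (a sketch; any ggplot2 or dplyr summaries you prefer will do):

```r
# Distribution of the outcome
forested_train |>
  ggplot(aes(x = forested)) +
  geom_bar()

# Distribution of a numeric predictor
forested_train |>
  ggplot(aes(x = precip_annual)) +
  geom_histogram()

# Outcome proportions across a categorical variable
forested_train |>
  ggplot(aes(x = land_type, fill = forested)) +
  geom_bar(position = "fill")
```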

Fitting models

To specify a model

# Model
linear_reg()
Linear Regression Model Specification (regression)

Computational engine: lm 
# Engine
linear_reg() |>
  set_engine("glmnet")
Linear Regression Model Specification (regression)

Computational engine: glmnet 
# Mode - Some models have a default mode, others don't
decision_tree() |> 
  set_mode("regression")
Decision Tree Model Specification (regression)

Computational engine: rpart 

Your turn

Edit the chunk below to use a logistic regression model.

Extension/Challenge: Edit this code to use a different model. For example, try a conditional inference tree as implemented in the partykit package by changing the engine, or try an entirely different model type!

All available models are listed at https://www.tidymodels.org/find/parsnip/

tree_spec <- decision_tree() |> 
  set_mode("classification")

tree_spec
Decision Tree Model Specification (classification)

Computational engine: rpart 
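
One way to complete this exercise (a sketch): swap `decision_tree()` for `logistic_reg()`, whose default engine is `glm` and whose only mode is classification, so no `set_mode()` call is needed.

```r
# Logistic regression: default engine is "glm",
# and classification is its only mode
logreg_spec <- logistic_reg()
logreg_spec

# Challenge variant: a conditional inference tree via partykit
# (assumes the bonsai and partykit packages are installed)
ctree_spec <- decision_tree() |>
  set_engine("partykit") |>
  set_mode("classification")
```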

A model workflow

tree_spec <- decision_tree() |> 
  set_mode("classification")

Fit the parsnip specification directly:

tree_spec |> 
  fit(forested ~ ., data = forested_train) 
parsnip model object

n= 5330 

node), split, n, loss, yval, (yprob)
      * denotes terminal node

 1) root 5330 2398 Yes (0.55009381 0.44990619)  
   2) land_type=Tree 2868  282 Yes (0.90167364 0.09832636) *
   3) land_type=Barren,Non-tree vegetation 2462  346 No (0.14053615 0.85946385)  
     6) temp_annual_max< 13.385 323  143 Yes (0.55727554 0.44272446)  
      12) tree_no_tree=Tree 84    6 Yes (0.92857143 0.07142857) *
      13) tree_no_tree=No tree 239  102 No (0.42677824 0.57322176) *
     7) temp_annual_max>=13.385 2139  166 No (0.07760636 0.92239364) *

Fit with a workflow:

workflow() |>
  add_formula(forested ~ .) |>
  add_model(tree_spec) |>
  fit(data = forested_train) 
══ Workflow [trained] ══════════════════════════════════════════════════════════
Preprocessor: Formula
Model: decision_tree()

── Preprocessor ────────────────────────────────────────────────────────────────
forested ~ .

── Model ───────────────────────────────────────────────────────────────────────
n= 5330 

node), split, n, loss, yval, (yprob)
      * denotes terminal node

 1) root 5330 2398 Yes (0.55009381 0.44990619)  
   2) land_type=Tree 2868  282 Yes (0.90167364 0.09832636) *
   3) land_type=Barren,Non-tree vegetation 2462  346 No (0.14053615 0.85946385)  
     6) temp_annual_max< 13.385 323  143 Yes (0.55727554 0.44272446)  
      12) tree_no_tree=Tree 84    6 Yes (0.92857143 0.07142857) *
      13) tree_no_tree=No tree 239  102 No (0.42677824 0.57322176) *
     7) temp_annual_max>=13.385 2139  166 No (0.07760636 0.92239364) *

“Shortcut” by specifying the preprocessor and model spec directly in the workflow() call:

workflow(forested ~ ., tree_spec) |> 
  fit(data = forested_train) 
══ Workflow [trained] ══════════════════════════════════════════════════════════
Preprocessor: Formula
Model: decision_tree()

── Preprocessor ────────────────────────────────────────────────────────────────
forested ~ .

── Model ───────────────────────────────────────────────────────────────────────
n= 5330 

node), split, n, loss, yval, (yprob)
      * denotes terminal node

 1) root 5330 2398 Yes (0.55009381 0.44990619)  
   2) land_type=Tree 2868  282 Yes (0.90167364 0.09832636) *
   3) land_type=Barren,Non-tree vegetation 2462  346 No (0.14053615 0.85946385)  
     6) temp_annual_max< 13.385 323  143 Yes (0.55727554 0.44272446)  
      12) tree_no_tree=Tree 84    6 Yes (0.92857143 0.07142857) *
      13) tree_no_tree=No tree 239  102 No (0.42677824 0.57322176) *
     7) temp_annual_max>=13.385 2139  166 No (0.07760636 0.92239364) *

Your turn

Edit the chunk below to make a workflow with your own model of choice!

Extension/Challenge: Other than formulas, what kinds of preprocessors are supported?

tree_spec <- decision_tree() |> 
  set_mode("classification")

tree_wflow <- workflow() |>
  add_formula(forested ~ .) |>
  add_model(tree_spec)

tree_wflow
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Formula
Model: decision_tree()

── Preprocessor ────────────────────────────────────────────────────────────────
forested ~ .

── Model ───────────────────────────────────────────────────────────────────────
Decision Tree Model Specification (classification)

Computational engine: rpart 
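
On the challenge question: besides formulas, workflows also accept recipes (via `add_recipe()`) and bare variable selections (via `add_variables()`). A sketch using `add_variables()`, reusing the `tree_spec` defined above:

```r
# add_variables() selects outcomes and predictors directly,
# with no formula processing applied
tree_wflow_vars <- workflow() |>
  add_variables(outcomes = forested, predictors = everything()) |>
  add_model(tree_spec)

tree_wflow_vars

# A recipe is another option, e.g.:
# workflow() |>
#   add_recipe(recipe(forested ~ ., data = forested_train)) |>
#   add_model(tree_spec)
```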

Predict with your model

tree_fit <- workflow(forested ~ ., tree_spec) |> 
  fit(data = forested_train) 

Your turn

Run:

predict(tree_fit, new_data = forested_test)
# A tibble: 1,777 × 1
   .pred_class
   <fct>      
 1 Yes        
 2 Yes        
 3 No         
 4 No         
 5 No         
 6 Yes        
 7 Yes        
 8 Yes        
 9 Yes        
10 Yes        
# ℹ 1,767 more rows
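
Note that the columns `predict()` returns depend on its `type` argument. For classification the default is hard class predictions; asking for probabilities instead returns one column per class (a sketch):

```r
# type = "prob" returns class probability columns
# (.pred_Yes and .pred_No for this outcome)
predict(tree_fit, new_data = forested_test, type = "prob")
```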

What do you notice about the structure of the result?

Add response here.

Your turn

Run:

augment(tree_fit, new_data = forested_test)
# A tibble: 1,777 × 22
   .pred_class .pred_Yes .pred_No forested  year elevation eastness northness
   <fct>           <dbl>    <dbl> <fct>    <dbl>     <dbl>    <dbl>     <dbl>
 1 Yes            0.902    0.0983 No        2005       164      -84        53
 2 Yes            0.902    0.0983 Yes       2005       299       93        34
 3 No             0.0776   0.922  Yes       2005       636      -48        87
 4 No             0.0776   0.922  Yes       2005       224      -65       -75
 5 No             0.427    0.573  Yes       2005      2240      -67       -74
 6 Yes            0.902    0.0983 Yes       2004      1044       96       -26
 7 Yes            0.902    0.0983 Yes       2003      1031      -49        86
 8 Yes            0.902    0.0983 Yes       2005      1330       99         7
 9 Yes            0.902    0.0983 Yes       2005      1423       46        88
10 Yes            0.902    0.0983 Yes       2014       546      -92       -38
# ℹ 1,767 more rows
# ℹ 14 more variables: roughness <dbl>, tree_no_tree <fct>, dew_temp <dbl>,
#   precip_annual <dbl>, temp_annual_mean <dbl>, temp_annual_min <dbl>,
#   temp_annual_max <dbl>, temp_january_min <dbl>, vapor_min <dbl>,
#   vapor_max <dbl>, canopy_cover <dbl>, lon <dbl>, lat <dbl>, land_type <fct>

How does the output compare to the output from predict()?

Add response here.

Understand your model

library(rpart.plot)
Loading required package: rpart

Attaching package: 'rpart'
The following object is masked from 'package:dials':

    prune
tree_fit |>
  extract_fit_engine() |>
  rpart.plot(roundint = FALSE)

Your turn

Try extracting the model engine object from your fitted workflow!

# add code here
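
A sketch of one approach: `extract_fit_engine()` returns the underlying engine object (here, an rpart fit), which you can then use with that package's own functions.

```r
# Pull out the underlying engine-level model object
engine_fit <- extract_fit_engine(tree_fit)

class(engine_fit)    # should be an rpart object, so rpart methods apply
# e.g. summary(engine_fit), or plotting with rpart.plot::rpart.plot()
```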

What kind of object is it? What can you do with it?

Add response here.

Acknowledgments