Data on forests in Washington
library(tidymodels)
── Attaching packages ────────────────────────────────────── tidymodels 1.2.0 ──
✔ broom 1.0.6 ✔ recipes 1.0.10
✔ dials 1.3.0 ✔ rsample 1.2.1
✔ dplyr 1.1.4 ✔ tibble 3.2.1
✔ ggplot2 3.5.1 ✔ tidyr 1.3.1
✔ infer 1.0.7 ✔ tune 1.2.1
✔ modeldata 1.4.0 ✔ workflows 1.1.4
✔ parsnip 1.2.1 ✔ workflowsets 1.1.0
✔ purrr 1.0.2 ✔ yardstick 1.3.1
── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
✖ purrr::%||%() masks base::%||%()
✖ purrr::discard() masks scales::discard()
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
✖ recipes::step() masks stats::step()
• Search for functions across packages at https://www.tidymodels.org/find/
# A tibble: 7,107 × 19
forested year elevation eastness northness roughness tree_no_tree dew_temp
<fct> <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <dbl>
1 Yes 2005 881 90 43 63 Tree 0.04
2 Yes 2005 113 -25 96 30 Tree 6.4
3 No 2005 164 -84 53 13 Tree 6.06
4 Yes 2005 299 93 34 6 No tree 4.43
5 Yes 2005 806 47 -88 35 Tree 1.06
6 Yes 2005 736 -27 -96 53 Tree 1.35
7 Yes 2005 636 -48 87 3 No tree 1.42
8 Yes 2005 224 -65 -75 9 Tree 6.39
9 Yes 2005 52 -62 78 42 Tree 6.5
10 Yes 2005 2240 -67 -74 99 No tree -5.63
# ℹ 7,097 more rows
# ℹ 11 more variables: precip_annual <dbl>, temp_annual_mean <dbl>,
# temp_annual_min <dbl>, temp_annual_max <dbl>, temp_january_min <dbl>,
# vapor_min <dbl>, vapor_max <dbl>, canopy_cover <dbl>, lon <dbl>, lat <dbl>,
# land_type <fct>
Your turn
When is a good time to split your data?
Add response here
Data splitting and spending
set.seed(123)
forested_split <- initial_split(forested)
forested_split
<Training/Testing/Total>
<5330/1777/7107>
Extract the training and testing sets
forested_train <- training(forested_split)
forested_test <- testing(forested_split)
Your turn
Split your data so 20% is held out for the test set.
Try out different values in set.seed() to see how the results change.
Hint: Which argument in initial_split() handles the proportion split into training vs testing?
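One possible approach, using the `prop` argument of `initial_split()` (the object names ending in `_80` are just illustrative):

```r
# Keep 80% for training, so 20% is held out for testing
set.seed(123)
forested_split_80 <- initial_split(forested, prop = 0.8)
forested_train_80 <- training(forested_split_80)
forested_test_80  <- testing(forested_split_80)

# A different seed gives a different random partition of the rows
set.seed(456)
initial_split(forested, prop = 0.8)
```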
Exploratory data analysis
Explore the forested_train data on your own!
What’s the distribution of the outcome, forested?
What’s the distribution of numeric variables like precip_annual?
How does the distribution of forested differ across the categorical variables?
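A few starting points for this exploration might look like the following (the plot choices are only suggestions):

```r
library(ggplot2)

# Distribution of the outcome
ggplot(forested_train, aes(x = forested)) +
  geom_bar()

# Distribution of a numeric predictor, colored by the outcome
ggplot(forested_train, aes(x = precip_annual, fill = forested)) +
  geom_histogram(bins = 30, position = "identity", alpha = 0.5)

# How the outcome varies across a categorical predictor
ggplot(forested_train, aes(x = land_type, fill = forested)) +
  geom_bar(position = "fill")
```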
Fitting models
To specify a model
# Model
linear_reg()
Linear Regression Model Specification (regression)
Computational engine: lm
# Engine
linear_reg() |>
  set_engine("glmnet")
Linear Regression Model Specification (regression)
Computational engine: glmnet
# Mode - Some models have a default mode, others don't
decision_tree() |>
  set_mode("regression")
Decision Tree Model Specification (regression)
Computational engine: rpart
Your turn
Edit the chunk below to use a logistic regression model.
Extension/Challenge: Edit this code to use a different model. For example, try using a conditional inference tree as implemented in the partykit package by changing the engine - or try an entirely different model type!
All available models are listed at https://www.tidymodels.org/find/parsnip/
tree_spec <- decision_tree() |>
  set_mode("classification")
tree_spec
Decision Tree Model Specification (classification)
Computational engine: rpart
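One way to complete this exercise (a sketch, not the only answer): `logistic_reg()` defaults to classification, so no `set_mode()` call is needed. For the challenge, the "partykit" engine for `decision_tree()` is registered by the bonsai extension package:

```r
# Logistic regression: classification is its only mode
lr_spec <- logistic_reg()

# Challenge: a conditional inference tree via the partykit engine
# (the bonsai extension package provides this engine)
library(bonsai)
ctree_spec <- decision_tree() |>
  set_engine("partykit") |>
  set_mode("classification")
```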
A model workflow
tree_spec <- decision_tree() |>
  set_mode("classification")
Fit parsnip specification:
tree_spec |>
  fit(forested ~ ., data = forested_train)
parsnip model object
n= 5330
node), split, n, loss, yval, (yprob)
* denotes terminal node
1) root 5330 2398 Yes (0.55009381 0.44990619)
2) land_type=Tree 2868 282 Yes (0.90167364 0.09832636) *
3) land_type=Barren,Non-tree vegetation 2462 346 No (0.14053615 0.85946385)
6) temp_annual_max< 13.385 323 143 Yes (0.55727554 0.44272446)
12) tree_no_tree=Tree 84 6 Yes (0.92857143 0.07142857) *
13) tree_no_tree=No tree 239 102 No (0.42677824 0.57322176) *
7) temp_annual_max>=13.385 2139 166 No (0.07760636 0.92239364) *
Fit with a workflow:
workflow() |>
  add_formula(forested ~ .) |>
  add_model(tree_spec) |>
  fit(data = forested_train)
══ Workflow [trained] ══════════════════════════════════════════════════════════
Preprocessor: Formula
Model: decision_tree()
── Preprocessor ────────────────────────────────────────────────────────────────
forested ~ .
── Model ───────────────────────────────────────────────────────────────────────
n= 5330
node), split, n, loss, yval, (yprob)
* denotes terminal node
1) root 5330 2398 Yes (0.55009381 0.44990619)
2) land_type=Tree 2868 282 Yes (0.90167364 0.09832636) *
3) land_type=Barren,Non-tree vegetation 2462 346 No (0.14053615 0.85946385)
6) temp_annual_max< 13.385 323 143 Yes (0.55727554 0.44272446)
12) tree_no_tree=Tree 84 6 Yes (0.92857143 0.07142857) *
13) tree_no_tree=No tree 239 102 No (0.42677824 0.57322176) *
7) temp_annual_max>=13.385 2139 166 No (0.07760636 0.92239364) *
“Shortcut” by specifying the preprocessor and model spec directly in the workflow() call:
workflow(forested ~ ., tree_spec) |>
  fit(data = forested_train)
══ Workflow [trained] ══════════════════════════════════════════════════════════
Preprocessor: Formula
Model: decision_tree()
── Preprocessor ────────────────────────────────────────────────────────────────
forested ~ .
── Model ───────────────────────────────────────────────────────────────────────
n= 5330
node), split, n, loss, yval, (yprob)
* denotes terminal node
1) root 5330 2398 Yes (0.55009381 0.44990619)
2) land_type=Tree 2868 282 Yes (0.90167364 0.09832636) *
3) land_type=Barren,Non-tree vegetation 2462 346 No (0.14053615 0.85946385)
6) temp_annual_max< 13.385 323 143 Yes (0.55727554 0.44272446)
12) tree_no_tree=Tree 84 6 Yes (0.92857143 0.07142857) *
13) tree_no_tree=No tree 239 102 No (0.42677824 0.57322176) *
7) temp_annual_max>=13.385 2139 166 No (0.07760636 0.92239364) *
Your turn
Edit the chunk below to make a workflow with your own model of choice!
Extension/Challenge: Other than formulas, what kinds of preprocessors are supported?
tree_spec <- decision_tree() |>
  set_mode("classification")
tree_wflow <- workflow() |>
  add_formula(forested ~ .) |>
  add_model(tree_spec)
tree_wflow
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Formula
Model: decision_tree()
── Preprocessor ────────────────────────────────────────────────────────────────
forested ~ .
── Model ───────────────────────────────────────────────────────────────────────
Decision Tree Model Specification (classification)
Computational engine: rpart
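For the challenge: besides formulas, workflows also accept recipes via `add_recipe()` and tidyselect-style variable specifications via `add_variables()`. A sketch of both:

```r
# A recipe preprocessor
rec <- recipe(forested ~ ., data = forested_train)
workflow() |>
  add_recipe(rec) |>
  add_model(tree_spec)

# A variables preprocessor, using tidyselect helpers
workflow() |>
  add_variables(outcomes = forested, predictors = everything()) |>
  add_model(tree_spec)
```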
Predict with your model
tree_fit <- workflow(forested ~ ., tree_spec) |>
  fit(data = forested_train)
Your turn
Run:
predict(tree_fit, new_data = forested_test)
# A tibble: 1,777 × 1
.pred_class
<fct>
1 Yes
2 Yes
3 No
4 No
5 No
6 Yes
7 Yes
8 Yes
9 Yes
10 Yes
# ℹ 1,767 more rows
What do you notice about the structure of the result?
Add response here.
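If you want class probabilities rather than hard class predictions, `predict()` also accepts a `type` argument:

```r
# Returns a tibble with .pred_Yes and .pred_No probability columns
predict(tree_fit, new_data = forested_test, type = "prob")
```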
Your turn
Run:
augment(tree_fit, new_data = forested_test)
# A tibble: 1,777 × 22
.pred_class .pred_Yes .pred_No forested year elevation eastness northness
<fct> <dbl> <dbl> <fct> <dbl> <dbl> <dbl> <dbl>
1 Yes 0.902 0.0983 No 2005 164 -84 53
2 Yes 0.902 0.0983 Yes 2005 299 93 34
3 No 0.0776 0.922 Yes 2005 636 -48 87
4 No 0.0776 0.922 Yes 2005 224 -65 -75
5 No 0.427 0.573 Yes 2005 2240 -67 -74
6 Yes 0.902 0.0983 Yes 2004 1044 96 -26
7 Yes 0.902 0.0983 Yes 2003 1031 -49 86
8 Yes 0.902 0.0983 Yes 2005 1330 99 7
9 Yes 0.902 0.0983 Yes 2005 1423 46 88
10 Yes 0.902 0.0983 Yes 2014 546 -92 -38
# ℹ 1,767 more rows
# ℹ 14 more variables: roughness <dbl>, tree_no_tree <fct>, dew_temp <dbl>,
# precip_annual <dbl>, temp_annual_mean <dbl>, temp_annual_min <dbl>,
# temp_annual_max <dbl>, temp_january_min <dbl>, vapor_min <dbl>,
# vapor_max <dbl>, canopy_cover <dbl>, lon <dbl>, lat <dbl>, land_type <fct>
How does the output compare to the output from predict()
?
Add response here.
Understand your model
library(rpart.plot)
Loading required package: rpart
Attaching package: 'rpart'
The following object is masked from 'package:dials':
prune
tree_fit |>
  extract_fit_engine() |>
  rpart.plot(roundint = FALSE)
Your turn
Try extracting the model engine object from your fitted workflow!
What kind of object is it? What can you do with it?
Add response here
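As a starting point for this exercise (a sketch; the fields shown assume the rpart engine used above):

```r
# The extracted object is a plain rpart fit, so anything that works
# on rpart objects works here
engine_fit <- extract_fit_engine(tree_fit)
class(engine_fit)

# e.g. inspect variable importance scores stored by rpart
engine_fit$variable.importance
```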