AE 04: Predicting hotel price
Application exercise
Setup
library(tidymodels)
library(textrecipes)
reg_metrics <- metric_set(mae, rsq)
data(hotel_rates)
set.seed(295)
hotel_rates <- hotel_rates |>
  sample_n(5000) |>
  arrange(arrival_date) |>
  select(-arrival_date) |>
  mutate(
    company = factor(as.character(company)),
    country = factor(as.character(country)),
    agent = factor(as.character(agent))
  )
set.seed(4028)
hotel_split <- initial_split(hotel_rates, strata = avg_price_per_room)
hotel_train <- training(hotel_split)
hotel_test <- testing(hotel_split)
Explore the data
Your turn: Investigate the training data. The outcome is avg_price_per_room. What trends or patterns do you see?
# add code here
Add response here.
Resampling Strategy
set.seed(472)
hotel_rs <- vfold_cv(hotel_train, strata = avg_price_per_room)
hotel_rs
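To make the structure of a resampling object concrete, here is a minimal sketch on the built-in mtcars data. The toy data, the choice of 5 folds, and the names toy_rs and first_split are assumptions made only to keep the example small and self-contained; they are not part of the exercise. analysis() and assessment() are the rsample accessors for the portion of a fold used for fitting and the portion held out.

```r
library(rsample)

set.seed(472)
# Toy stand-in for hotel_train: 32 rows split into 5 folds
toy_rs <- vfold_cv(mtcars, v = 5)

# Each row of toy_rs holds one split; inspect the first one
first_split <- toy_rs$splits[[1]]
n_fit <- nrow(analysis(first_split))       # rows used to fit the model
n_holdout <- nrow(assessment(first_split)) # rows held out for evaluation

c(analysis = n_fit, assessment = n_holdout)
```

Every row appears in exactly one assessment set across the folds, so the two counts always add up to the number of rows in the data.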
A first recipe
Your turn
What do you think is in the type vectors for the lead_time and country columns?
# add code here
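One way to look at the type information is to call summary() on a recipe. The sketch below uses a small made-up data frame (toy, with invented values) instead of hotel_train so that it runs on its own; the same calls should apply to a recipe built on the hotel data. The output is deliberately not shown, since the exact contents of the type column depend on the installed version of recipes.

```r
library(recipes)

# Hypothetical miniature of the hotel data: one numeric and one factor predictor
toy <- data.frame(
  avg_price_per_room = c(80, 95, 120, 150),
  lead_time = c(0, 3, 10, 45),
  country = factor(c("prt", "esp", "prt", "fra"))
)

toy_rec <- recipe(avg_price_per_room ~ ., data = toy)

# summary() returns one row per variable with its type and role
rec_info <- summary(toy_rec)
rec_info

# Pull out the type entries for the two columns of interest
rec_info$type[rec_info$variable == "lead_time"]
rec_info$type[rec_info$variable == "country"]
```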
Create a recipe
Your turn: Create a recipe() for the hotel data to:
- use a Yeo-Johnson (YJ) transformation on lead_time
- convert factors to indicator variables
- remove zero-variance variables
- add the spline technique shown previously
# add code here
hotel_indicators <- recipe(avg_price_per_room ~ ., data = hotel_train) |>
  ...
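As a hint at the general shape, the first three bullets can be sketched on a small made-up data frame. The data, the constant column, and the names toy_rec and toy_baked are assumptions for illustration only. The spline step is left out on purpose, because the technique "shown previously" is not reproduced in this document.

```r
library(recipes)

# Hypothetical toy data: a skewed numeric predictor, a factor,
# and a column with a single value (zero variance)
toy <- data.frame(
  price = c(80, 95, 120, 150, 110, 90, 200, 130),
  lead_time = c(0, 3, 10, 45, 7, 1, 120, 30),
  country = factor(c("prt", "esp", "prt", "fra", "esp", "prt", "fra", "esp")),
  constant = rep(1, 8)
)

toy_rec <- recipe(price ~ ., data = toy) |>
  step_YeoJohnson(lead_time) |>            # make lead_time more symmetric
  step_dummy(all_nominal_predictors()) |>  # factors -> indicator columns
  step_zv(all_predictors())                # drop zero-variance predictors

# Estimate the steps on the toy data and return the processed version
toy_baked <- toy_rec |>
  prep() |>
  bake(new_data = NULL)

names(toy_baked)
```

The order of the steps matters: creating indicator columns before step_zv() lets the zero-variance filter also catch any indicator that ends up constant.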
Measuring Performance
We’ll compute two measures, mean absolute error (MAE) and the coefficient of determination (a.k.a. \(R^2\)), and focus on the MAE for parameter optimization.
reg_metrics <- metric_set(mae, rsq)
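To see what these two metrics measure, the following sketch computes them by hand on four invented observed/predicted pairs and compares the results with the metric set. The data frame toy_pred and its values are made up for illustration. Note that yardstick's rsq() is the squared correlation between the observed and predicted values.

```r
library(yardstick)

# Hypothetical observed prices and model predictions
toy_pred <- data.frame(
  truth = c(100, 120, 80, 150),
  estimate = c(110, 115, 90, 140)
)

# MAE: average absolute difference between observed and predicted values
mae_by_hand <- mean(abs(toy_pred$truth - toy_pred$estimate)) # 8.75

# R^2 as computed by rsq(): squared correlation of truth and estimate
rsq_by_hand <- cor(toy_pred$truth, toy_pred$estimate)^2

# The metric set returns one row per metric
reg_metrics <- metric_set(mae, rsq)
toy_metrics <- reg_metrics(toy_pred, truth = truth, estimate = estimate)
toy_metrics
```

MAE is in the units of the outcome (here, price), which makes it easier to interpret than \(R^2\) when judging how far off a typical prediction is.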
Your turn: Use fit_resamples() to fit your workflow with a recipe.
Collect the predictions from the results. How did you do?
set.seed(9)
# add code here
hotel_lm_wflow <- workflow() |>
add_recipe(hotel_indicators) |>
add_model(linear_reg())
ctrl <- control_resamples(TODO)
hotel_lm_res <- hotel_lm_wflow |>
fit_resamples(hotel_rs, control = ctrl, metrics = reg_metrics)
collect_metrics(hotel_lm_res)
Holdout predictions
# Since we used `save_pred = TRUE`
lm_cv_pred <- collect_predictions(hotel_lm_res)
lm_cv_pred |> slice(1:7)
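The comment above refers to the save_pred = TRUE option of control_resamples(), which keeps the out-of-sample predictions from every fold so that collect_predictions() can return them. A minimal, self-contained sketch of that pattern on the built-in mtcars data follows; the toy data, the formula mpg ~ wt + hp, and the toy_* names are assumptions for illustration and are not the hotel workflow.

```r
library(tidymodels)

set.seed(9)
toy_rs <- vfold_cv(mtcars, v = 5)

toy_wflow <- workflow() |>
  add_formula(mpg ~ wt + hp) |>
  add_model(linear_reg())

# save_pred = TRUE keeps the holdout predictions for each fold
toy_ctrl <- control_resamples(save_pred = TRUE)

toy_res <- toy_wflow |>
  fit_resamples(toy_rs, control = toy_ctrl, metrics = metric_set(mae, rsq))

# Metrics averaged over folds, then the row-level holdout predictions
collect_metrics(toy_res)
toy_pred <- collect_predictions(toy_res)
toy_pred |> slice(1:7)
```

Because each row is held out exactly once, the collected predictions contain one row per observation in the resampled data.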
Calibration Plot
library(probably)
cal_plot_regression(hotel_lm_res)
Your turn: What does this plot tell us about the performance of our model?
Add response here.
Acknowledgments
- Materials derived in part from Machine learning with {tidymodels} and licensed under a Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA) License.