AE 04: Predicting hotel price

Application exercise
Modified: September 13, 2024

Setup

library(tidymodels)
library(textrecipes)

reg_metrics <- metric_set(mae, rsq)

data(hotel_rates)
set.seed(295)
hotel_rates <- hotel_rates |> 
  sample_n(5000) |> 
  arrange(arrival_date) |> 
  select(-arrival_date) |> 
  mutate(
    company = factor(as.character(company)),
    country = factor(as.character(country)),
    agent = factor(as.character(agent))
  )

set.seed(4028)
hotel_split <- initial_split(hotel_rates, strata = avg_price_per_room)

hotel_train <- training(hotel_split)
hotel_test <- testing(hotel_split)

Explore the data

Your turn: Investigate the training data. The outcome is avg_price_per_room. What trends or patterns do you see?

# add code here

Add response here.

Resampling Strategy

set.seed(472)
hotel_rs <- vfold_cv(hotel_train, strata = avg_price_per_room)
hotel_rs
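To make the structure of an object like hotel_rs more concrete, here is a minimal sketch on simulated data. The data frame `toy` and its columns `price` and `lead_time` are hypothetical stand-ins, not the hotel data, and 5 folds are used only to keep things short (vfold_cv() defaults to 10). Each split holds an analysis set for fitting and an assessment set for measuring performance.

```r
library(rsample)

set.seed(1)
# Simulated stand-in for hotel_train (hypothetical columns, made-up values)
toy <- data.frame(
  price     = rlnorm(200, meanlog = 4.5, sdlog = 0.4),
  lead_time = rpois(200, lambda = 30)
)

# Stratify on the outcome so every fold covers a similar range of prices
toy_rs <- vfold_cv(toy, v = 5, strata = price)

# Pull the two pieces of the first fold
first_split    <- toy_rs$splits[[1]]
toy_analysis   <- analysis(first_split)    # rows used to fit the model
toy_assessment <- assessment(first_split)  # rows held out to measure performance

# Every row of toy lands in exactly one of the two sets
nrow(toy_analysis) + nrow(toy_assessment)
```

Across the folds, each row appears in exactly one assessment set, which is what later makes it possible to collect one holdout prediction per training row.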

A first recipe

hotel_rec <- recipe(avg_price_per_room ~ ., data = hotel_train)

summary(hotel_rec)

Your turn: What do you think is in the type vectors for the lead_time and country columns?

# add code here

Create a recipe

Your turn: Create a recipe() for the hotel data to:

  • use a Yeo-Johnson (YJ) transformation on lead_time
  • convert factors to indicator variables
  • remove zero-variance variables
  • add the spline technique shown previously

# add code here
hotel_indicators <- recipe(avg_price_per_room ~ ., data = hotel_train) |> 
  ...

Measuring Performance

We’ll compute two measures, mean absolute error (MAE) and the coefficient of determination (a.k.a. \(R^2\)), and focus on the MAE for parameter optimization.

reg_metrics <- metric_set(mae, rsq)
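As a quick check on what these two numbers mean, the sketch below applies the same metric set to a tiny made-up data frame and recomputes both values by hand. The data frame `toy` and its values are invented for illustration. Note that yardstick's rsq() is the squared correlation between truth and estimate.

```r
library(yardstick)

# Four made-up observed prices and predictions (illustration only)
toy <- data.frame(
  truth    = c(100, 150, 200, 250),
  estimate = c(110, 140, 190, 270)
)

reg_metrics <- metric_set(mae, rsq)
toy_metrics <- reg_metrics(toy, truth = truth, estimate = estimate)
toy_metrics

# MAE: the average size of the errors, in the units of the outcome.
# The absolute errors are 10, 10, 10, 20, so the mean is 12.5.
mae_by_hand <- mean(abs(toy$truth - toy$estimate))

# R^2 as computed by rsq(): squared correlation of observed and predicted
rsq_by_hand <- cor(toy$truth, toy$estimate)^2
```

Because the MAE is on the scale of the outcome, it reads directly as a typical miss in price units, which is one reason to prefer it when comparing models.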

Your turn: Use fit_resamples() to fit your workflow with a recipe.

Collect the predictions from the results. How did you do?

set.seed(9)

# add code here
hotel_lm_wflow <- workflow() |>
  add_recipe(hotel_indicators) |>
  add_model(linear_reg())
 
ctrl <- control_resamples(TODO)
hotel_lm_res <- hotel_lm_wflow |>
  fit_resamples(hotel_rs, control = ctrl, metrics = reg_metrics)

collect_metrics(hotel_lm_res)

Holdout predictions

# Since we used `save_pred = TRUE`
lm_cv_pred <- collect_predictions(hotel_lm_res)
lm_cv_pred |> slice(1:7)
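To show what collect_predictions() returns, here is a minimal sketch on the built-in mtcars data rather than the hotel data. The formula mpg ~ wt + hp and the object names are assumptions made for illustration. The predictions are only available because save_pred = TRUE is passed to control_resamples().

```r
library(tidymodels)

set.seed(1)
toy_rs <- vfold_cv(mtcars, v = 5)

toy_wflow <- workflow() |>
  add_formula(mpg ~ wt + hp) |>
  add_model(linear_reg())

# save_pred = TRUE keeps the assessment-set predictions from every fold
toy_res <- toy_wflow |>
  fit_resamples(
    toy_rs,
    control = control_resamples(save_pred = TRUE),
    metrics = metric_set(mae, rsq)
  )

toy_pred <- collect_predictions(toy_res)

# One holdout prediction per original row: .pred is the prediction,
# .row is the row index in the original data, and mpg is the observed outcome
toy_pred |> slice(1:7)
```

Each prediction comes from a model that did not see that row during fitting, so these values give a fairer picture of performance than predictions on the data used to fit the model.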

Calibration Plot
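The plot itself is not reproduced in this text. A calibration plot for a regression model compares observed and predicted values; the probably package provides cal_plot_regression() for plots of this kind. The sketch below draws a similar plot with ggplot2 alone, using simulated values. The data frame `toy_pred` is a made-up stand-in for holdout predictions, so the pattern it shows says nothing about the hotel model.

```r
library(ggplot2)

set.seed(1)
# Simulated stand-in for holdout predictions; the column names mirror this
# exercise, the values are invented
toy_pred <- data.frame(avg_price_per_room = runif(200, min = 50, max = 250))
toy_pred$.pred <- toy_pred$avg_price_per_room + rnorm(200, sd = 20)

cal_plot <- ggplot(toy_pred, aes(x = avg_price_per_room, y = .pred)) +
  geom_point(alpha = 0.3) +
  # Dashed diagonal: where points would fall if predicted equaled observed
  geom_abline(linetype = "dashed") +
  # Smoothed trend: systematic departures from the diagonal suggest bias
  geom_smooth(se = FALSE) +
  # Same scale on both axes so the diagonal sits at 45 degrees
  coord_equal() +
  labs(x = "Observed price", y = "Predicted price")

cal_plot
```

When reading such a plot, look for regions where the trend line pulls away from the diagonal (the model over- or under-predicts there) and for places where the scatter around the line widens.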

Your turn: What does this plot tell us about the performance of our model?

Add response here.

Acknowledgments