AE 05: Predicting hotel price (with numeric engineering!)

Application exercise
Modified: September 17, 2024

Setup

library(tidymodels)
── Attaching packages ────────────────────────────────────── tidymodels 1.2.0 ──
✔ broom        1.0.6      ✔ recipes      1.0.10
✔ dials        1.3.0      ✔ rsample      1.2.1 
✔ dplyr        1.1.4      ✔ tibble       3.2.1 
✔ ggplot2      3.5.1      ✔ tidyr        1.3.1 
✔ infer        1.0.7      ✔ tune         1.2.1 
✔ modeldata    1.4.0      ✔ workflows    1.1.4 
✔ parsnip      1.2.1      ✔ workflowsets 1.1.0 
✔ purrr        1.0.2      ✔ yardstick    1.3.1 
── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
✖ purrr::%||%()    masks base::%||%()
✖ purrr::discard() masks scales::discard()
✖ dplyr::filter()  masks stats::filter()
✖ dplyr::lag()     masks stats::lag()
✖ recipes::step()  masks stats::step()
• Dig deeper into tidy modeling with R at https://www.tmwr.org
Registered S3 method overwritten by 'GGally':
  method from   
  +.gg   ggplot2

reg_metrics <- metric_set(mae, rmse, rsq)

data(hotel_rates)
set.seed(295)
hotel_rates <- hotel_rates |> 
  sample_n(5000) |> 
  arrange(arrival_date) |> 
  select(-arrival_date) |> 
  mutate(
    company = factor(as.character(company)),
    country = factor(as.character(country)),
    agent = factor(as.character(agent))
  )

set.seed(4028)
hotel_split <- initial_split(hotel_rates, strata = avg_price_per_room)

hotel_train <- training(hotel_split)
hotel_test <- testing(hotel_split)

set.seed(472)
hotel_folds <- vfold_cv(hotel_train, strata = avg_price_per_room)
hotel_folds
#  10-fold cross-validation using stratification 
# A tibble: 10 × 2
   splits             id    
   <list>             <chr> 
 1 <split [3372/377]> Fold01
 2 <split [3373/376]> Fold02
 3 <split [3373/376]> Fold03
 4 <split [3373/376]> Fold04
 5 <split [3373/376]> Fold05
 6 <split [3374/375]> Fold06
 7 <split [3375/374]> Fold07
 8 <split [3376/373]> Fold08
 9 <split [3376/373]> Fold09
10 <split [3376/373]> Fold10

Adjust for skewness

Your turn: Examine hotel_train and identify a numeric predictor that is skewed. Incorporate an appropriate transformation into the recipe below and estimate a linear regression model using 10-fold cross-validation. How does the model perform with and without the transformation?

hotel_rec <- recipe(avg_price_per_room ~ ., data = hotel_train) |>
  step_dummy(all_nominal_predictors()) |>
  step_zv(all_predictors())

Tip

Use GGally::ggpairs() to generate bivariate comparisons for all variables in your dataset.
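
As a hedged illustration of the tip, the sketch below plots only three columns rather than the whole dataset, because the number of panels grows quadratically with the number of columns. It assumes hotel_train from the Setup section; the choice of columns is an example, not a recommendation.

```r
library(tidymodels)
library(GGally)

# Assumes hotel_train exists from the Setup section.
# The three columns are an illustrative choice; swap in any you want to inspect.
pairs_plot <- hotel_train |>
  select(avg_price_per_room, lead_time, historical_adr) |>
  ggpairs()

# Diagonal panels show each variable's distribution, which is where skew shows up.
pairs_plot
```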

# add code here
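
One possible approach, offered as a sketch rather than the intended solution: treat lead_time as the skewed predictor (an assumption to confirm against your own plots) and apply a Yeo-Johnson transformation, which, unlike a log, accepts zeros. The code assumes hotel_rec, hotel_folds, and reg_metrics from the sections above; names such as yj_rec and lm_wflow are made up for this example.

```r
library(tidymodels)

# Assumes hotel_rec, hotel_folds, and reg_metrics exist from the code above.
# Assumption: lead_time is the skewed predictor; verify this before relying on it.
yj_rec <- hotel_rec |>
  step_YeoJohnson(lead_time)

# Baseline workflow: linear regression with the untransformed recipe.
lm_wflow <- workflow() |>
  add_model(linear_reg()) |>
  add_recipe(hotel_rec)

base_res <- lm_wflow |>
  fit_resamples(resamples = hotel_folds, metrics = reg_metrics)

# Same model, with the recipe swapped for the transformed version.
yj_res <- lm_wflow |>
  update_recipe(yj_rec) |>
  fit_resamples(resamples = hotel_folds, metrics = reg_metrics)

# Stack the cross-validated metrics so the two recipes can be compared.
bind_rows(
  collect_metrics(base_res) |> mutate(recipe = "no transformation"),
  collect_metrics(yj_res) |> mutate(recipe = "Yeo-Johnson on lead_time")
)
```

Because both fits use the same hotel_folds object, any difference in the metrics reflects the transformation rather than a different resampling split.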

Spline functions

Your turn: Implement a natural spline for lead_time and historical_adr. Use grid tuning to determine the optimal value for deg_free. Evaluate the model’s performance.

# add code here
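
A minimal sketch of one way to set this up, assuming hotel_train, hotel_folds, and reg_metrics from the sections above. It uses step_ns() from recipes with a small hand-picked grid of deg_free values; the grid values and the object names (spline_rec, spline_wflow, spline_res) are arbitrary choices for illustration.

```r
library(tidymodels)

# Assumes hotel_train, hotel_folds, and reg_metrics exist from the code above.
# The same tuned deg_free is applied to both predictors in this sketch.
spline_rec <- recipe(avg_price_per_room ~ ., data = hotel_train) |>
  step_ns(lead_time, historical_adr, deg_free = tune()) |>
  step_dummy(all_nominal_predictors()) |>
  step_zv(all_predictors())

spline_wflow <- workflow() |>
  add_model(linear_reg()) |>
  add_recipe(spline_rec)

# Illustrative grid; widen or refine it as needed.
spline_grid <- tibble(deg_free = c(2L, 4L, 6L, 8L, 10L))

spline_res <- tune_grid(
  spline_wflow,
  resamples = hotel_folds,
  grid = spline_grid,
  metrics = reg_metrics
)

# Rank candidates by RMSE and plot performance against deg_free.
show_best(spline_res, metric = "rmse")
autoplot(spline_res)
```

Giving each predictor its own step_ns() call with separately named tune("...") placeholders would let the two degrees of freedom vary independently, at the cost of a larger grid.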

MARS model

Your turn: Implement a MARS model. Use grid tuning to determine the optimal values for num_terms and prod_degree. Evaluate the model’s performance.

# add code here
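
One way this could look, as a sketch under stated assumptions: the earth package is installed, and hotel_rec, hotel_folds, and reg_metrics exist from the sections above. The grid bounds and the names mars_spec, mars_wflow, mars_grid, and mars_res are illustrative only.

```r
library(tidymodels)

# Assumes hotel_rec, hotel_folds, and reg_metrics exist from the code above,
# and that the earth package is installed for the "earth" engine.
mars_spec <- mars(num_terms = tune(), prod_degree = tune()) |>
  set_engine("earth") |>
  set_mode("regression")

mars_wflow <- workflow() |>
  add_model(mars_spec) |>
  add_recipe(hotel_rec)

# Illustrative regular grid: 8 values of num_terms crossed with
# additive (1) and two-way interaction (2) models.
mars_grid <- crossing(
  num_terms = seq(2, 30, by = 4),
  prod_degree = 1:2
)

mars_res <- tune_grid(
  mars_wflow,
  resamples = hotel_folds,
  grid = mars_grid,
  metrics = reg_metrics
)

# Inspect the top candidates and pull out the best one by RMSE.
show_best(mars_res, metric = "rmse")
select_best(mars_res, metric = "rmse")
```

Comparing these metrics with the linear and spline results is fair only when all models are resampled on the same hotel_folds object.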