AE 04: Predicting hotel price

Suggested answers

Application exercise

September 13, 2024


reg_metrics <- metric_set(mae, rsq)

hotel_rates <- hotel_rates |>
  sample_n(5000) |>
  arrange(arrival_date) |>
  select(-arrival_date) |>
  # re-create the factors so that unused levels are dropped after sampling
  mutate(
    company = factor(as.character(company)),
    country = factor(as.character(country)),
    agent = factor(as.character(agent))
  )

hotel_split <- initial_split(hotel_rates, strata = avg_price_per_room)

hotel_train <- training(hotel_split)
hotel_test <- testing(hotel_split)

Explore the data

Your turn: Investigate the training data. The outcome is avg_price_per_room. What trends or patterns do you see?

Data summary

Name                    hotel_train
Number of rows          3749
Number of columns       27
Column type frequency:
  factor                9
  numeric               18
Group variables         None

Variable type: factor

skim_variable        n_missing complete_rate ordered n_unique top_counts
meal                         0             1 FALSE          4 bed: 2865, bre: 743, no_: 116, bre: 25
country                      0             1 FALSE         66 prt: 1206, gbr: 814, esp: 353, irl: 216
market_segment               0             1 FALSE          5 onl: 1617, dir: 756, off: 736, gro: 416
distribution_channel         0             1 FALSE          3 ta_: 2616, dir: 812, cor: 321, und: 0
reserved_room_type           0             1 FALSE          7 a: 2034, d: 787, e: 500, g: 138
assigned_room_type           0             1 FALSE          9 a: 1395, d: 1066, e: 562, c: 250
agent                        0             1 FALSE         98 dev: 1132, not: 834, ale: 360, cha: 208
company                      0             1 FALSE        100 not: 3416, par: 83, lin: 24, ber: 14
customer_type                0             1 FALSE          4 tra: 2699, tra: 811, con: 207, gro: 32

Variable type: numeric

skim_variable                  n_missing complete_rate    mean     sd      p0     p25     p50     p75    p100 hist
avg_price_per_room                     0             1  105.62  66.08   19.35   55.00   83.00  142.76  426.25 ▇▃▂▁▁
lead_time                              0             1   88.65 101.77    0.00    7.00   43.00  152.00  542.00 ▇▂▁▁▁
stays_in_weekend_nights                0             1    1.17   1.14    0.00    0.00    1.00    2.00   13.00 ▇▁▁▁▁
stays_in_week_nights                   0             1    3.08   2.41    0.00    1.00    3.00    5.00   32.00 ▇▁▁▁▁
adults                                 0             1    1.85   0.47    1.00    2.00    2.00    2.00    4.00 ▂▇▁▁▁
children                               0             1    0.13   0.43    0.00    0.00    0.00    0.00    2.00 ▇▁▁▁▁
babies                                 0             1    0.01   0.12    0.00    0.00    0.00    0.00    2.00 ▇▁▁▁▁
is_repeated_guest                      0             1    0.07   0.25    0.00    0.00    0.00    0.00    1.00 ▇▁▁▁▁
previous_cancellations                 0             1    0.01   0.11    0.00    0.00    0.00    0.00    4.00 ▇▁▁▁▁
previous_bookings_not_canceled         0             1    0.25   1.37    0.00    0.00    0.00    0.00   29.00 ▇▁▁▁▁
booking_changes                        0             1    0.36   0.77    0.00    0.00    0.00    0.00    7.00 ▇▁▁▁▁
days_in_waiting_list                   0             1    0.57   7.76    0.00    0.00    0.00    0.00  125.00 ▇▁▁▁▁
required_car_parking_spaces            0             1    0.20   0.42    0.00    0.00    0.00    0.00    8.00 ▇▁▁▁▁
total_of_special_requests              0             1    0.74   0.86    0.00    0.00    1.00    1.00    5.00 ▇▂▁▁▁
arrival_date_num                       0             1 2017.08   0.33 2016.50 2016.80 2017.09 2017.36 2017.66 ▇▇▇▇▇
near_christmas                         0             1    0.01   0.08    0.00    0.00    0.00    0.00    1.00 ▇▁▁▁▁
near_new_years                         0             1    0.01   0.09    0.00    0.00    0.00    0.00    1.00 ▇▁▁▁▁
historical_adr                         0             1   86.64  39.48   41.96   52.27   71.35  116.78  167.49 ▇▃▂▂▂

ggplot(data = hotel_train, mapping = aes(x = avg_price_per_room)) +
  geom_histogram(color = "white", binwidth = 20, boundary = 0)

ggplot(data = hotel_train, mapping = aes(x = lead_time)) +
  geom_histogram(color = "white")
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(data = hotel_train, mapping = aes(x = lead_time)) +
  geom_histogram(color = "white") +
  scale_x_continuous(transform = boxcox_trans(p = 0.4))
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(data = hotel_train, mapping = aes(x = historical_adr, y = avg_price_per_room)) +
  geom_point(alpha = 0.2) +
  geom_smooth()
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

hotel_train |>
  pivot_longer(
    cols = starts_with("near"),
    names_to = "holiday",
    values_to = "value"
  ) |>
  ggplot(mapping = aes(
    x = factor(value),
    y = avg_price_per_room
  )) +
  geom_boxplot() +
  facet_wrap(facets = vars(holiday), ncol = 1)

ggplot(data = hotel_train, mapping = aes(x = company, y = avg_price_per_room)) +
  geom_boxplot()

  • The avg_price_per_room variable is right-skewed.
  • The lead_time variable is also right-skewed; a Box-Cox transformation with \(\lambda = 0.4\) makes it more symmetric.
  • There is a positive relationship between historical_adr and avg_price_per_room.
  • The distribution of avg_price_per_room differs somewhat between bookings that are near a holiday (near_christmas, near_new_years) and those that are not.
  • The distribution of avg_price_per_room differs across company, but the variable has a very large number of levels (100 in the training set).
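
To make the effect of the Box-Cox transformation concrete, here is a minimal base R sketch. It uses simulated data as a stand-in for lead_time (an assumption, not the hotel data), and the helper names box_cox() and skewness() are made up for illustration. It applies the usual Box-Cox form \((x^\lambda - 1) / \lambda\) with \(\lambda = 0.4\) and compares the sample skewness before and after.

```r
# Box-Cox transform for lambda != 0: (x^lambda - 1) / lambda
# (hypothetical helper, written out here for illustration)
box_cox <- function(x, lambda) (x^lambda - 1) / lambda

# Sample skewness: third central moment divided by sd^3
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

set.seed(1)
# Simulated, right-skewed, non-negative values standing in for lead_time
x <- rexp(2000, rate = 1 / 90)

skew_raw <- skewness(x)
skew_bc <- skewness(box_cox(x, lambda = 0.4))

# The transformed values should be much closer to symmetric (skewness near 0)
c(raw = skew_raw, box_cox = skew_bc)
```

Note that lead_time contains zeros (its minimum is 0.00 in the summary above). A Yeo-Johnson transformation is also defined for zero and negative values, which is one plausible reason to prefer it when building a recipe.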

Resampling Strategy

hotel_rs <- vfold_cv(hotel_train, strata = avg_price_per_room)
hotel_rs
#  10-fold cross-validation using stratification 
# A tibble: 10 × 2
   splits             id    
   <list>             <chr> 
 1 <split [3372/377]> Fold01
 2 <split [3373/376]> Fold02
 3 <split [3373/376]> Fold03
 4 <split [3373/376]> Fold04
 5 <split [3373/376]> Fold05
 6 <split [3374/375]> Fold06
 7 <split [3375/374]> Fold07
 8 <split [3376/373]> Fold08
 9 <split [3376/373]> Fold09
10 <split [3376/373]> Fold10
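
The fold sizes above are not all identical because the folds are built within strata of the outcome. The following base R sketch illustrates the idea only; it is not the {rsample} implementation. It assumes a simulated right-skewed outcome in place of avg_price_per_room, bins it into quartiles, and deals the fold labels 1 to 10 at random within each bin.

```r
set.seed(2)
n <- 3749
# Simulated right-skewed outcome standing in for avg_price_per_room (an assumption)
price <- rlnorm(n, meanlog = 4.5, sdlog = 0.6)

# Bin the outcome into quartiles
bins <- cut(
  price,
  breaks = quantile(price, probs = seq(0, 1, by = 0.25)),
  include.lowest = TRUE
)

# Within each bin, assign fold labels 1..10 in random order
fold <- integer(n)
for (b in levels(bins)) {
  idx <- which(bins == b)
  fold[idx] <- sample(rep_len(1:10, length(idx)))
}

table(fold)                          # fold sizes are all close to n / 10
round(tapply(price, fold, mean), 1)  # mean outcome is similar across folds
```

Because every fold draws from each quartile of the outcome, no single fold ends up with a disproportionate share of the expensive bookings.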

A first recipe

hotel_rec <- recipe(avg_price_per_room ~ ., data = hotel_train)
summary(hotel_rec)

# A tibble: 27 × 4
   variable                type      role      source  
   <chr>                   <list>    <chr>     <chr>   
 1 lead_time               <chr [2]> predictor original
 2 stays_in_weekend_nights <chr [2]> predictor original
 3 stays_in_week_nights    <chr [2]> predictor original
 4 adults                  <chr [2]> predictor original
 5 children                <chr [2]> predictor original
 6 babies                  <chr [2]> predictor original
 7 meal                    <chr [3]> predictor original
 8 country                 <chr [3]> predictor original
 9 market_segment          <chr [3]> predictor original
10 distribution_channel    <chr [3]> predictor original
# ℹ 17 more rows

Your turn

What do you think are in the type vectors for the lead_time and country columns?

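# add code here
summary(hotel_rec)$type[[1]] # lead_time is row 1 of the summary
[1] "double"  "numeric"
summary(hotel_rec)$type[[8]] # country is row 8 of the summary
[1] "factor"    "unordered" "nominal"  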

Add response here. Each type vector contains both the R data type and the substantive type of the variable as used by {recipes} (e.g. numeric, nominal, ordinal).

Create a recipe

Your turn: Create a recipe() for the hotel data to:

  • use a Yeo-Johnson (YJ) transformation on lead_time
  • convert factors to indicator variables
  • remove zero-variance variables
  • add the spline technique shown previously
# add code here
hotel_indicators <- recipe(avg_price_per_room ~ ., data = hotel_train) |> 
  step_YeoJohnson(lead_time) |>
  step_dummy(all_nominal_predictors()) |>
  step_zv(all_predictors()) |> 
  step_spline_natural(arrival_date_num, deg_free = 10)
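
One way to check what a recipe like this produces is to prep() it and bake() the training data. The sketch below does that on a small made-up data frame rather than the hotel data: the names toy, lead, grp, and const are hypothetical, and the spline step is left out so that only {recipes} is needed.

```r
library(recipes)

set.seed(3)
toy <- data.frame(
  y     = rnorm(60),
  lead  = rexp(60, rate = 1 / 90),                       # right-skewed, like lead_time
  grp   = factor(rep(c("a", "b", "c"), length.out = 60)), # nominal predictor
  const = 1                                               # zero-variance predictor
)

toy_rec <- recipe(y ~ ., data = toy) |>
  step_YeoJohnson(lead) |>
  step_dummy(all_nominal_predictors()) |>
  step_zv(all_predictors())

# Estimate the steps on the training data, then return the processed training set
baked <- toy_rec |>
  prep() |>
  bake(new_data = NULL)

names(baked)
```

In the result, grp is replaced by indicator columns (with the first level as the reference) and const is dropped by step_zv().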

Measuring Performance

We’ll compute two measures, mean absolute error (MAE) and the coefficient of determination (a.k.a. \(R^2\)), and focus on the MAE for parameter optimization.

reg_metrics <- metric_set(mae, rsq)

Your turn: Use fit_resamples() to fit your workflow with a recipe.

Collect the predictions from the results. How did you do?


# add code here
hotel_lm_wflow <- workflow() |>
  add_recipe(hotel_indicators) |>
  add_model(linear_reg())

ctrl <- control_resamples(save_pred = TRUE)
hotel_lm_res <- hotel_lm_wflow |>
  fit_resamples(hotel_rs, control = ctrl, metrics = reg_metrics)
→ A | warning: prediction from rank-deficient fit; consider predict(., rankdeficient="NA")
There were issues with some computations   A: x9
collect_metrics(hotel_lm_res)
# A tibble: 2 × 6
  .metric .estimator   mean     n std_err .config             
  <chr>   <chr>       <dbl> <int>   <dbl> <chr>               
1 mae     standard   16.6      10 0.214   Preprocessor1_Model1
2 rsq     standard    0.884    10 0.00339 Preprocessor1_Model1

Fine. The MAE is about €16.6, which is not too bad in raw terms given a median price of about €83 in the training set. The \(R^2\) of 0.884 is high as well.

Holdout predictions

# Since we used `save_pred = TRUE`
lm_cv_pred <- collect_predictions(hotel_lm_res)
lm_cv_pred |> slice(1:7)
# A tibble: 7 × 5
  .pred id      .row avg_price_per_room .config             
  <dbl> <chr>  <int>              <dbl> <chr>               
1  75.1 Fold01    20                 40 Preprocessor1_Model1
2  49.3 Fold01    28                 54 Preprocessor1_Model1
3  64.9 Fold01    45                 50 Preprocessor1_Model1
4  52.8 Fold01    49                 42 Preprocessor1_Model1
5  48.6 Fold01    61                 49 Preprocessor1_Model1
6  29.8 Fold01    66                 40 Preprocessor1_Model1
7  36.9 Fold01    88                 49 Preprocessor1_Model1
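
As a check on what the MAE measures, the seven holdout rows printed above can be scored by hand in base R. This is only an illustration on those seven rows; the cross-validated MAE reported earlier is averaged over all 10 folds.

```r
# Predictions and observed prices copied from the seven rows shown above
pred  <- c(75.1, 49.3, 64.9, 52.8, 48.6, 29.8, 36.9)
truth <- c(40, 54, 50, 42, 49, 40, 49)

# Absolute error for each row; the first row is off by about 35
abs_err <- abs(truth - pred)

# MAE is the mean of the absolute errors
mae_7 <- mean(abs_err)
mae_7
# [1] 12.6
```

A single large miss (the first row) contributes 35.1 of the total absolute error of 88.2, which shows how a few poorly predicted bookings can dominate the MAE.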

Calibration Plot

library(probably)

Attaching package: 'probably'
The following objects are masked from 'package:base':

    as.factor, as.ordered

cal_plot_regression(hotel_lm_res)

Your turn: What does this plot tell us about the performance of our model?

Add response here. The model is reasonably well calibrated over most of the price range, but it tends to under-predict when the true average price is above about 200.


