Feature engineering for numeric predictors and nonlinear relationships

Lecture 6

Dr. Benjamin Soltoff

Cornell University
INFO 4940/5940 - Fall 2024

September 17, 2024

Announcements

Announcements

  • Dr. Soltoff office hours tomorrow 9:30-11:30am
  • Homework 1 due tomorrow

Learning objectives

  • Identify 1-to-1 and 1-to-many transformation methods for encoding numeric predictors
  • Implement transformations for numeric predictors to reduce skewness
  • Introduce basis expansions and splines to incorporate non-linear relationships in predictors
  • Define multivariate adaptive regression spline (MARS) for modeling non-linear relationships
  • Model nonlinear relationships between predictors and the response variable

Application exercise

ae-05

  • Go to the course GitHub org and find your ae-05 (repo name will be suffixed with your GitHub name).
  • Clone the repo in RStudio, open the Quarto document in the repo, install the required packages, and follow along and complete the exercises.
  • Render, commit, and push your edits by the AE deadline – end of the day

From last time

library(tidymodels)

# Add another package:
library(textrecipes)

data(hotel_rates)
set.seed(295)

hotel_rates <- hotel_rates |>
  sample_n(5000) |>
  arrange(arrival_date) |>
  select(-arrival_date) |>
  mutate(
    company = factor(as.character(company)),
    country = factor(as.character(country)),
    agent = factor(as.character(agent))
  )
set.seed(4028)
hotel_split <- initial_split(hotel_rates, strata = avg_price_per_room)

hotel_train <- training(hotel_split)
hotel_test <- testing(hotel_split)

Problems with numeric predictors

Potential issues with numeric predictors

  • Skewness
  • Outliers
  • Correlation

Mitigating issues with numeric predictors

  • Choose a robust model
  • 1 : 1 transformations
  • 1 : many transformations
  • Many : many transformations (in a couple of weeks)

1:1 transformations

1:1 transformations

  • Scale
  • Center

Skewed variables

Outliers

Something (such as a geologic feature) that is situated away from or classed differently from a main or related body.1

A statistical observation that is markedly different in value from the others of the sample.2

Extreme values are not inherently “wrong”. Removing them can also skew our analysis by excluding relevant data points.

But we also do not want to give abnormal weight to these outliers.

What do we do with outliers?

  • Remove them
  • Truncate them
  • Impute them
  • Transform them

Transformation functions

Transformation functions

Box-Cox procedure

\[ x^* = \begin{cases} \lambda^{-1}(x^\lambda-1) & \text{if $\lambda \ne 0$,} \\[3pt] \log(x) &\text{if $\lambda = 0$.} \end{cases} \]

Box-Cox procedure

Yeo-Johnson

  • Box-Cox transformations only work on positive values
  • Yeo-Johnson extends this to work with all numeric values

\[ x^* = \begin{cases} \lambda^{-1}\left[(x + 1)^\lambda-1\right] & \text{if $\lambda \ne 0$ and $x \ge 0$,} \\[3pt] \log(x + 1) &\text{if $\lambda = 0$ and $x \ge 0$.} \\[3pt] -(2 - \lambda)^{-1}\left[(-x + 1)^{2 - \lambda}-1\right] & \text{if $\lambda \ne 2$ and $x < 0$,} \\[3pt] -\log(-x + 1) &\text{if $\lambda = 2$ and $x < 0$.} \end{cases} \]

How is \(\lambda\) determined?

\(\lambda\) is a parameter to be learned from the data

recipe(avg_price_per_room ~ ., data = hotel_train) |>
  step_YeoJohnson(lead_time) |>
  prep() |>
  tidy(number = 1)
# A tibble: 1 × 3
  terms     value id              
  <chr>     <dbl> <chr>           
1 lead_time 0.173 YeoJohnson_YKdYz
  • Training data used to estimate model parameters
  • Training data used to estimate transformation parameters

Resampling Strategy

We’ll use simple 10-fold cross-validation (stratified sampling):

set.seed(472)
hotel_folds <- vfold_cv(hotel_train, strata = avg_price_per_room)
hotel_folds
#  10-fold cross-validation using stratification 
# A tibble: 10 × 2
   splits             id    
   <list>             <chr> 
 1 <split [3372/377]> Fold01
 2 <split [3373/376]> Fold02
 3 <split [3373/376]> Fold03
 4 <split [3373/376]> Fold04
 5 <split [3373/376]> Fold05
 6 <split [3374/375]> Fold06
 7 <split [3375/374]> Fold07
 8 <split [3376/373]> Fold08
 9 <split [3376/373]> Fold09
10 <split [3376/373]> Fold10
reg_metrics <- metric_set(mae, rmse, rsq)

⏱️ Your turn

Examine hotel_train and identify a numeric predictor that is skewed. Incorporate an appropriate transformation into the recipe below and estimate a linear regression model using 10-fold cross-validation. How does the model perform with and without the transformation?

hotel_rec <- recipe(avg_price_per_room ~ ., data = hotel_train) |>
  step_dummy(all_nominal_predictors()) |>
  step_zv(all_predictors())
08:00

hotel_yj_rec <- recipe(avg_price_per_room ~ ., data = hotel_train) |>
  step_YeoJohnson(all_numeric_predictors()) |>
  step_dummy(all_nominal_predictors()) |>
  step_zv(all_predictors())

# check the recipe
hotel_yj_rec |>
  prep() |>
  tidy(number = 1)
# A tibble: 7 × 3
  terms                        value id              
  <chr>                        <dbl> <chr>           
1 lead_time                  0.173   YeoJohnson_f7mCz
2 stays_in_weekend_nights    0.00547 YeoJohnson_f7mCz
3 stays_in_week_nights       0.121   YeoJohnson_f7mCz
4 booking_changes           -4.16    YeoJohnson_f7mCz
5 total_of_special_requests -0.672   YeoJohnson_f7mCz
6 arrival_date_num          -3.21    YeoJohnson_f7mCz
7 historical_adr            -0.591   YeoJohnson_f7mCz

set.seed(9)

hotel_lm_wflow <- workflow() |>
  add_recipe(hotel_rec) |>
  add_model(linear_reg())

hotel_lm_res <- hotel_lm_wflow |>
  fit_resamples(hotel_folds,
    metrics = reg_metrics,
    control = control_resamples(save_pred = TRUE)
  )

collect_metrics(hotel_lm_res)
# A tibble: 3 × 6
  .metric .estimator   mean     n std_err .config             
  <chr>   <chr>       <dbl> <int>   <dbl> <chr>               
1 mae     standard   17.1      10 0.176   Preprocessor1_Model1
2 rmse    standard   22.8      10 0.296   Preprocessor1_Model1
3 rsq     standard    0.881    10 0.00359 Preprocessor1_Model1
set.seed(9)

hotel_lm_skew_wflow <- workflow() |>
  add_recipe(hotel_yj_rec) |>
  add_model(linear_reg())

hotel_lm_skew_res <- hotel_lm_skew_wflow |>
  fit_resamples(hotel_folds,
    metrics = reg_metrics,
    control = control_resamples(save_pred = TRUE)
  )

collect_metrics(hotel_lm_skew_res)
# A tibble: 3 × 6
  .metric .estimator   mean     n std_err .config             
  <chr>   <chr>       <dbl> <int>   <dbl> <chr>               
1 mae     standard   18.9      10 0.261   Preprocessor1_Model1
2 rmse    standard   25.3      10 0.374   Preprocessor1_Model1
3 rsq     standard    0.853    10 0.00384 Preprocessor1_Model1

Standardizing to a common scale

Some models require predictors to be on a common scale (data preprocessing)

  • \(k\) nearest neighbors
  • Support vector machines
  • Penalized regression

Standardizing to a common scale

Standardizing to a common scale

hotel_rec <- recipe(avg_price_per_room ~ ., data = hotel_train) |>
  step_dummy(all_nominal_predictors()) |>
  step_zv(all_predictors()) |>
  step_normalize(all_numeric_predictors())

1:many transformations

Basis expansion

Features of a predictor \(x\) derived from a set of functions \(f_1(x), f_2(x), \ldots, f_i(x)\) that can be combined using a linear combination.

Polynomial expansion

Simple linear regression

\[ y_i = \beta_0 + \beta_1 x_{i} + \epsilon_i \]

Cubic polynomial expansion

\[ y_i = \beta_0 + \beta_1 x_{i} + \beta_2 x_{i}^2 + \beta_3 x_{i}^3 + \epsilon_i \]

Polynomial expansion

#| label: fig-global-polynomial
#| viewerHeight: 700
#| viewerWidth: "100%"
#| standalone: true
library(shiny)
library(patchwork)
library(dplyr)
library(tidyr)
library(ggplot2)
library(splines2)
library(bslib)
library(viridis)
library(aspline)

data(fossil)

ui <- page_fillable(
  theme = bs_theme(bg = "#fcfefe", fg = "#595959",
                   base_font = "Atkinson Hyperlegible", font_scale = 1.5),
  padding = "1rem",
  layout_columns(
    fill = FALSE,
    col_widths = breakpoints(sm = c(-3, 6, -3)),
    sliderInput(
      "global_deg",
      label = "Polynomial Degree",
      min = 1L, max = 21L, step = 1L, value = 3L,
      ticks = TRUE
    ) # sliderInput
  ), # layout_columns
  layout_columns(
    fill = FALSE,
    col_widths = breakpoints(sm = c(-1, 10, -1)),
    as_fill_carrier(plotOutput('global', height = "500px"))
  )
)


server <- function(input, output, session) {
  
  light_bg <- "#fcfefe" # from aml4td.scss
  grid_theme <- bs_theme(
    bg = "#fcfefe", fg = "#595959"
  )
  
  # ------------------------------------------------------------------------------
  
  theme_light_bl<- function(...) {
    
    ret <- ggplot2::theme_bw(...)
    
    col_rect <- ggplot2::element_rect(fill = light_bg, colour = light_bg)
    ret$panel.background  <- col_rect
    ret$plot.background   <- col_rect
    ret$legend.background <- col_rect
    ret$legend.key        <- col_rect
    
    larger_x_text <- ggplot2::element_text(size = rel(1.25))
    larger_y_text <- ggplot2::element_text(size = rel(1.25), angle = 90)
    ret$axis.text.x <- larger_x_text
    ret$axis.text.y <- larger_y_text
    ret$axis.title.x <- larger_x_text
    ret$axis.title.y <- larger_y_text  
    
    ret$legend.position <- "top"
    
    ret
  }
  
  col_rect <- ggplot2::element_rect(fill = light_bg, colour = light_bg)
  
  maybe_lm <- function(x) {
    try(lm(y ~ poly(x, input$piecewise_deg), data = x), silent = TRUE)
  }
  
  expansion_to_tibble <- function(x, original, prefix = "term ") {
    cls <- class(x)[1]
    nms <- recipes::names0(ncol(x), prefix)
    colnames(x) <- nms
    x <- as_tibble(x)
    x$variable <- original
    res <- tidyr::pivot_longer(x, cols = c(-variable))
    if (cls != "poly") {
      res <- res[res$value > .Machine$double.eps,]
    }
    res
  } 
  
  mult_poly <- function(dat, degree = 4) {
    rng <- extendrange(dat$x, f = .025)
    grid <- seq(rng[1], rng[2], length.out = 1000)
    grid_df <- tibble(x = grid)
    feat <- poly(grid_df$x, degree)
    res <- expansion_to_tibble(feat, grid_df$x)
    
    # make some random names so that we can plot the features with distinct colors
    rand_names <- lapply(1:degree, function(x) paste0(sample(letters)[1:10], collapse = ""))
    rand_names<- unlist(rand_names)
    rand_names <- tibble(name = unique(res$name), name2 = rand_names)
    res <- 
      dplyr::inner_join(res, rand_names, by = dplyr::join_by(name)) |> 
      dplyr::select(-name) |> 
      dplyr::rename(name = name2)
    res
  }
  
  # ------------------------------------------------------------------------------
  
  spline_example <- tibble(x = fossil$age, y = fossil$strontium.ratio)
  rng <- extendrange(fossil$age, f = .025)
  grid <- seq(rng[1], rng[2], length.out = 1000)
  grid_df <- tibble(x = grid)
  alphas <- 1 / 4
  line_wd <- 1.0
  
  base_p <-
    spline_example |>
    ggplot(aes(x = x, y = y)) +
    geom_point(alpha = 3 / 4, pch = 1, cex = 3) +
    labs(x = "Age", y = "Isotope Ratio") +
    lims(x = rng) +
    theme_light_bl(base_size = 16)
  
  output$global <- renderPlot({
    
    poly_fit <- lm(y ~ poly(x, input$global_deg), data = spline_example)
    poly_pred <- 
      predict(poly_fit, grid_df, interval = "confidence", level = .90) |> 
      bind_cols(grid_df)
    
    global_p <- base_p
    
    if (input$global_deg > 0) {
      global_p <-
        global_p +
        geom_ribbon(
          data = poly_pred,
          aes(y = NULL, ymin = lwr, ymax = upr),
          alpha = 1 / 8) +
        geom_line(
          data = poly_pred,
          aes(y = fit),
          col = "black",
          linewidth = line_wd) +
        theme(
          plot.margin = margin(t = -20, r = 0, b = 0, l = 0),
          panel.background = col_rect,
          plot.background = col_rect,
          legend.background = col_rect,
          legend.key = col_rect
        )
      
      feature_p <-
        poly(grid_df$x,  input$global_deg) |> 
        expansion_to_tibble(grid_df$x) |> 
        ggplot(aes(variable, y = value, group = name, col = name)) +
        geom_line(show.legend = FALSE) + # , linewidth = 1, alpha = 1 / 2
        theme_void() +
        theme(
          plot.margin = margin(t = 0, r = 0, b = -20, l = 0),
          panel.background = col_rect,
          plot.background = col_rect,
          legend.background = col_rect,
          legend.key = col_rect
        ) +
        scale_color_viridis(discrete = TRUE, option = "turbo")
      
      p <- (feature_p / global_p) + plot_layout(heights = c(1.5, 4))
    }
    
    print(p)
    
  })
  
}

app <- shinyApp(ui = ui, server = server)

Issues with polynomial expansion

  • Variance is high as number of polynomial terms increases
  • Confidence intervals expand rapidly
  • Overfitting
  • Global pattern not always appropriate

Split polynomial fits to different regions

#| label: fig-piecewise-polynomials
#| viewerHeight: 700
#| viewerWidth: "100%"
#| standalone: true

library(shiny)
library(patchwork)
library(dplyr)
library(tidyr)
library(ggplot2)
library(splines2)
library(bslib)
library(viridis)
library(aspline)

data(fossil)

ui <- page_fillable(
  theme = bs_theme(bg = "#fcfefe", fg = "#595959",
                   base_font = "Atkinson Hyperlegible", font_scale = 1.5),
  padding = "1rem",
  layout_columns(
    fill = FALSE,
    col_widths = breakpoints(sm = c(-1, 5, 5, -1)),
    sliderInput(
      "piecewise_deg",
      label = "Polynomial Degree",
      min = 0L, max = 6L, step = 1L, value = 4L
    ), # sliderInput
    sliderInput(
      "cuts",
      label = "Cutpoints",
      min = 93L, max = 122L, step = 1, value = c(101, 118)
    ) # sliderInput
  ), # layout_columns
  layout_columns(
    fill = FALSE,
    col_widths = breakpoints(sm = c(-1, 10, -1)),
    as_fill_carrier(plotOutput('pieces', height = "500px"))
  )      
)

server <- function(input, output, session) {
  
  light_bg <- "#fcfefe" # from aml4td.scss
  grid_theme <- bs_theme(
    bg = "#fcfefe", fg = "#595959"
  )
  
  # ------------------------------------------------------------------------------
  
  theme_light_bl<- function(...) {
    
    ret <- ggplot2::theme_bw(...)
    
    col_rect <- ggplot2::element_rect(fill = light_bg, colour = light_bg)
    ret$panel.background  <- col_rect
    ret$plot.background   <- col_rect
    ret$legend.background <- col_rect
    ret$legend.key        <- col_rect
    
    larger_x_text <- ggplot2::element_text(size = rel(1.25))
    larger_y_text <- ggplot2::element_text(size = rel(1.25), angle = 90)
    ret$axis.text.x <- larger_x_text
    ret$axis.text.y <- larger_y_text
    ret$axis.title.x <- larger_x_text
    ret$axis.title.y <- larger_y_text  
    
    ret$legend.position <- "top"
    
    ret
  }
  
  col_rect <- ggplot2::element_rect(fill = light_bg, colour = light_bg)
  
  maybe_lm <- function(x) {
    try(lm(y ~ poly(x, input$piecewise_deg), data = x), silent = TRUE)
  }
  
  expansion_to_tibble <- function(x, original, prefix = "term ") {
    cls <- class(x)[1]
    nms <- recipes::names0(ncol(x), prefix)
    colnames(x) <- nms
    x <- as_tibble(x)
    x$variable <- original
    res <- tidyr::pivot_longer(x, cols = c(-variable))
    if (cls != "poly") {
      res <- res[res$value > .Machine$double.eps,]
    }
    res
  } 
  
  mult_poly <- function(dat, degree = 4) {
    rng <- extendrange(dat$x, f = .025)
    grid <- seq(rng[1], rng[2], length.out = 1000)
    grid_df <- tibble(x = grid)
    feat <- poly(grid_df$x, degree)
    res <- expansion_to_tibble(feat, grid_df$x)
    
    # make some random names so that we can plot the features with distinct colors
    rand_names <- lapply(1:degree, function(x) paste0(sample(letters)[1:10], collapse = ""))
    rand_names<- unlist(rand_names)
    rand_names <- tibble(name = unique(res$name), name2 = rand_names)
    res <- 
      dplyr::inner_join(res, rand_names, by = dplyr::join_by(name)) |> 
      dplyr::select(-name) |> 
      dplyr::rename(name = name2)
    res
  }
  
  # ------------------------------------------------------------------------------
  
  spline_example <- tibble(x = fossil$age, y = fossil$strontium.ratio)
  rng <- extendrange(fossil$age, f = .025)
  grid <- seq(rng[1], rng[2], length.out = 1000)
  grid_df <- tibble(x = grid)
  alphas <- 1 / 4
  line_wd <- 1.0
  
  base_p <-
    spline_example |>
    ggplot(aes(x = x, y = y)) +
    geom_point(alpha = 3 / 4, pch = 1, cex = 3) +
    labs(x = "Age", y = "Isotope Ratio") +
    theme_light_bl(base_size = 16)
  
  output$pieces <- renderPlot({
    
    cuts <- c(0, sort(input$cuts), 60)
    piece_cols <- c("#1B9E77", "#D95F02", "#7570B3")
    piece_p <- base_p
    
    if (input$piecewise_deg > 0) {
      data_splt <-
        spline_example |>
        dplyr::mutate(x_cut = cut(x, breaks = cuts, include.lowest = TRUE)) |>
        tidyr::nest(.by = x_cut) |>
        mutate(
          fit = lapply(data, maybe_lm),
          features = lapply(data, mult_poly, degree = input$piecewise_deg)
        )
      grid_splt <-
        dplyr::tibble(x = grid) |>
        dplyr::mutate(x_cut = cut(x, breaks = cuts, include.lowest = TRUE))  |>
        tidyr::nest(.by = x_cut)
      
      for (i in 1:3) {
        sub_pred <- grid_splt$data[[i]]
        if (!inherits(data_splt$fit[[i]], "try-error")) {
          sub_pred <-
            sub_pred |>
            dplyr::bind_cols(predict(data_splt$fit[[i]], sub_pred, 
                                     interval = "confidence", level = .90))
          
          piece_p <-
            piece_p +
            geom_ribbon(
              data = sub_pred,
              aes(y = NULL, ymin = lwr, ymax = upr),
              alpha = 1 / 15
            ) +
            geom_line(
              data = sub_pred,
              aes(y = fit),
              linewidth = line_wd
            )
        }
      }
      
      set.seed(383) # to control colors
      feature_p <- 
        data_splt |> 
        dplyr::select(features) |> 
        tidyr::unnest(features) |> 
        ggplot(aes(x = variable, y = value, col = name)) + 
        geom_line(show.legend = FALSE) +
        theme_void() +
        theme(
          plot.margin = margin(t = 0, r = 0, b = -20, l = 0),
          panel.background = col_rect,
          plot.background = col_rect,
          legend.background = col_rect,
          legend.key = col_rect
        ) +
        scale_color_viridis(discrete = TRUE, option = "turbo")
      
      p <- (feature_p / piece_p) + plot_layout(heights = c(1, 4))
      
    }
    
    print(p)
    
  })
}

app <- shinyApp(ui = ui, server = server)

Polynomial splines

  • Basis expansion creates different regions of the predictor space using boundaries called knots
  • Within each region, use polynomial functions
  • Ensure globally we have a smooth, continuous function

Simple spline functions

#| label: fig-simple-spline
#| viewerHeight: 700
#| viewerWidth: "100%"
#| standalone: true
library(shiny)
library(patchwork)
library(dplyr)
library(tidyr)
library(ggplot2)
library(bslib)
library(aspline)

data(fossil)

ui <- page_fillable(
  theme = bs_theme(bg = "#fcfefe", fg = "#595959",
                   base_font = "Atkinson Hyperlegible", font_scale = 1.5),
  padding = "1rem",
  layout_columns(
    fill = FALSE,
    col_widths = breakpoints(sm = c(-2, 8, -2)),
    sliderInput(
      "cuts",
      label = "Cutpoints",
      min = 93L, max = 122L, step = 1, value = c(107, 114)
    ) # sliderInput
  ), # layout_columns
  layout_columns(
    fill = FALSE,
    col_widths = breakpoints(sm = c(-1, 10, -1)),
    as_fill_carrier(plotOutput('splime_spline', height = "500px"))
  )      
)

server <- function(input, output, session) {
  
  light_bg <- "#fcfefe" # from aml4td.scss
  grid_theme <- bs_theme(
    bg = "#fcfefe", fg = "#595959"
  )
  
  # ------------------------------------------------------------------------------
  
  theme_light_bl<- function(...) {
    
    ret <- ggplot2::theme_bw(...)
    
    col_rect <- ggplot2::element_rect(fill = light_bg, colour = light_bg)
    ret$panel.background  <- col_rect
    ret$plot.background   <- col_rect
    ret$legend.background <- col_rect
    ret$legend.key        <- col_rect
    
    larger_x_text <- ggplot2::element_text(size = rel(1.25))
    larger_y_text <- ggplot2::element_text(size = rel(1.25), angle = 90)
    ret$axis.text.x <- larger_x_text
    ret$axis.text.y <- larger_y_text
    ret$axis.title.x <- larger_x_text
    ret$axis.title.y <- larger_y_text  
    
    ret$legend.position <- "top"
    
    ret
  }
  
  col_rect <- ggplot2::element_rect(fill = light_bg, colour = light_bg)
  
  # ------------------------------------------------------------------------------
  
  spline_example <- tibble(x = fossil$age, y = fossil$strontium.ratio)
  rng <- extendrange(fossil$age, f = .025)
  grid <- seq(rng[1], rng[2], length.out = 1000)
  grid_df <- tibble(x = grid)
  alphas <- 1 / 4
  line_wd <- 1.0
  
  base_p <-
    spline_example |>
    ggplot(aes(x = x, y = y)) +
    geom_point(alpha = 3 / 4, pch = 1, cex = 3) +
    labs(x = "Age", y = "Isotope Ratio") +
    theme_light_bl(base_size = 16)
  
  output$splime_spline <- renderPlot({
    
    spline_p <- base_p
    
    h <- function(x) {
      ifelse(x > 0, x, 0)
    }
    
    mod_dat <- 
      spline_example |> 
      mutate(
        x_2 = x^2, 
        x_3 = x^3,
        x_k_1 = pmax(h(x-min(input$cuts))^3, 0),
        x_k_2 = pmax(h(x-max(input$cuts))^3, 0)
      )
    
    grid_spln <- 
      grid_df |> 
      mutate(
        x_2 = x^2, 
        x_3 = x^3,
        x_k_1 = pmax(h(x-min(input$cuts))^3, 0),
        x_k_2 = pmax(h(x-max(input$cuts))^3, 0)
      )
    
    features <- 
      rbind(
        tibble::tibble(x = grid_spln$x, value = grid_spln$x_k_1, term = "4") |> 
          filter(value != 0),
        tibble::tibble(x = grid_spln$x, value = grid_spln$x_k_2, term = "5") |> 
          filter(value != 0)
      )
    
    fit_1 <- lm(y ~ ., data = mod_dat)
    spline_pred <- 
      predict(fit_1, grid_spln, interval = "confidence", level = .90) |> 
      bind_cols(grid_df)
    
    spline_p <-
      spline_p +
      geom_ribbon(
        data = spline_pred,
        aes(y = NULL, ymin = lwr, ymax = upr),
        alpha = 1 / 15
      ) +
      geom_line(
        data = spline_pred,
        aes(y = fit),
        linewidth = line_wd
      ) +
      geom_vline(xintercept = min(input$cuts), col = "#A6CEE3", lty = 2, linewidth = 1) +
      geom_vline(xintercept = max(input$cuts), col = "#1F78B4", lty = 2, linewidth = 1)
    
    term_p <- 
      features |> 
      ggplot(aes(x, value, col = term)) +
      geom_line(show.legend = FALSE, linewidth = 1) +
      lims(x = rng) +
      theme_void() +
      theme(
        plot.margin = margin(t = 0, r = 0, b = -20, l = 0),
        panel.background = col_rect,
        plot.background = col_rect,
        legend.background = col_rect,
        legend.key = col_rect
      ) +
      scale_color_brewer(palette = "Paired")
    
    p <- (term_p / spline_p) + plot_layout(heights = c(1, 4))
    
    print(p)
    
  })
}

app <- shinyApp(ui = ui, server = server)

How to define the knots

Knot locations

  1. \(n\) groups with equal range
  2. \(n\) groups with equal number of observations

How many knots?

Tuning parameter

Types of spline functions

  • Orthogonal polynomial basis functions: step_poly()
  • Basis splines: step_spline_b()
  • B-spline basis functions: step_bs()
  • Natural splines: step_spline_natural()
  • Natural spline basis functions: step_ns()

Common tuning parameters

  • Number of knots
  • Degrees of freedom - number of individual basis functions

Natural cubic spline

#| label: fig-natural-cubic-spline
#| viewerHeight: 700
#| viewerWidth: "100%"
#| standalone: true

library(shiny)
library(patchwork)
library(dplyr)
library(tidyr)
library(ggplot2)
library(splines2)
library(bslib)
library(viridis)
library(aspline)

data(fossil)

ui <- page_fillable(
  theme = bs_theme(bg = "#fcfefe", fg = "#595959",
                   base_font = "Atkinson Hyperlegible", font_scale = 1.5),
  padding = "1rem",
  layout_columns(
    fill = FALSE,
    col_widths = breakpoints(sm = c(-3, 6, -3)),
    sliderInput(
      "spline_df",
      label = "# Spline Terms",
      min = 3L, max = 20L, step = 1L, value = 9L
    ), # sliderInput
  ), # layout_columns
  layout_columns(
    fill = FALSE,
    col_widths = breakpoints(sm = c(-1, 10, -1)),
    as_fill_carrier(plotOutput('spline', height = "500px"))
  )         
)

server <- function(input, output, session) {
  
  light_bg <- "#fcfefe" # from aml4td.scss
  grid_theme <- bs_theme(
    bg = "#fcfefe", fg = "#595959"
  )
  
  # ------------------------------------------------------------------------------
  
  theme_light_bl<- function(...) {
    
    ret <- ggplot2::theme_bw(...)
    
    col_rect <- ggplot2::element_rect(fill = light_bg, colour = light_bg)
    ret$panel.background  <- col_rect
    ret$plot.background   <- col_rect
    ret$legend.background <- col_rect
    ret$legend.key        <- col_rect
    
    larger_x_text <- ggplot2::element_text(size = rel(1.25))
    larger_y_text <- ggplot2::element_text(size = rel(1.25), angle = 90)
    ret$axis.text.x <- larger_x_text
    ret$axis.text.y <- larger_y_text
    ret$axis.title.x <- larger_x_text
    ret$axis.title.y <- larger_y_text  
    
    ret$legend.position <- "top"
    
    ret
  }
  
  col_rect <- ggplot2::element_rect(fill = light_bg, colour = light_bg)
  
  maybe_lm <- function(x) {
    try(lm(y ~ poly(x, input$piecewise_deg), data = x), silent = TRUE)
  }
  
  expansion_to_tibble <- function(x, original, prefix = "term ") {
    cls <- class(x)[1]
    nms <- recipes::names0(ncol(x), prefix)
    colnames(x) <- nms
    x <- as_tibble(x)
    x$variable <- original
    res <- tidyr::pivot_longer(x, cols = c(-variable))
    if (cls != "poly") {
      res <- res[res$value > .Machine$double.eps,]
    }
    res
  } 
  
  mult_poly <- function(dat, degree = 4) {
    rng <- extendrange(dat$x, f = .025)
    grid <- seq(rng[1], rng[2], length.out = 1000)
    grid_df <- tibble(x = grid)
    feat <- poly(grid_df$x, degree)
    res <- expansion_to_tibble(feat, grid_df$x)
    
    # make some random names so that we can plot the features with distinct colors
    rand_names <- lapply(1:degree, function(x) paste0(sample(letters)[1:10], collapse = ""))
    rand_names<- unlist(rand_names)
    rand_names <- tibble(name = unique(res$name), name2 = rand_names)
    res <- 
      dplyr::inner_join(res, rand_names, by = dplyr::join_by(name)) |> 
      dplyr::select(-name) |> 
      dplyr::rename(name = name2)
    res
  }
  
  # ------------------------------------------------------------------------------
  
  spline_example <- tibble(x = fossil$age, y = fossil$strontium.ratio)
  rng <- extendrange(fossil$age, f = .025)
  grid <- seq(rng[1], rng[2], length.out = 1000)
  grid_df <- tibble(x = grid)
  alphas <- 1 / 4
  line_wd <- 1.0
  
  base_p <-
    spline_example |>
    ggplot(aes(x = x, y = y)) +
    geom_point(alpha = 3 / 4, pch = 1, cex = 3) +
    labs(x = "Age", y = "Isotope Ratio") +
    theme_light_bl(base_size = 16)
  
  output$spline <- renderPlot({
    
    spline_fit <- lm(y ~ naturalSpline(x, df = input$spline_df), data = spline_example)
    spline_pred <- 
      predict(spline_fit, grid_df, interval = "confidence", level = .90) |> 
      bind_cols(grid_df)
    
    spline_p <- base_p +
      geom_ribbon(
        data = spline_pred,
        aes(y = NULL, ymin = lwr, ymax = upr),
        alpha = 1 / 15) +
      geom_line(
        data = spline_pred,
        aes(y = fit),
        col = "black",
        linewidth = line_wd)
    
    feature_p <-
      naturalSpline(grid_df$x, df = input$spline_df) |> 
      expansion_to_tibble(grid_df$x) |> 
      ggplot(aes(variable, y = value, group = name, col = name)) +
      geom_line(show.legend = FALSE) + 
      lims(x = rng) +
      theme_void() +
      theme(
        plot.margin = margin(t = 0, r = 0, b = -20, l = 0),
        panel.background = col_rect,
        plot.background = col_rect,
        legend.background = col_rect,
        legend.key = col_rect
      ) +
      scale_color_viridis(discrete = TRUE, option = "turbo")
    
    p <- (feature_p / spline_p) + plot_layout(heights = c(1, 4))
    
    print(p)
    
  })
  
}

app <- shinyApp(ui = ui, server = server)

⏱️ Your turn

Implement a natural spline for lead_time and historical_adr. Use grid tuning to determine the optimal value for deg_free. Evaluate the model’s performance.

08:00

ns_rec <- recipe(avg_price_per_room ~ ., data = hotel_train) |>
  step_ns(lead_time, historical_adr, deg_free = tune()) |>
  step_novel(all_nominal_predictors()) |>
  step_dummy(all_nominal_predictors())

ns_wf <- workflow() |>
  add_recipe(ns_rec) |>
  add_model(linear_reg())

ns_res <- ns_wf |>
  tune_grid(
    hotel_folds,
    grid = 10,
    metrics = reg_metrics
  )

Locally weighted smoothing (LOESS)

#| label: fig-loess
#| viewerHeight: 700
#| viewerWidth: "100%"
#| standalone: true

library(shiny)
library(ggplot2)
library(bslib)

assess_color <- "#006699"

# Generate sample data
set.seed(123)
n <- 100
x <- seq(0, 10, length.out = n)
y <- sin(x) + rnorm(n, sd = 0.3)
data <- data.frame(x = x, y = y)

ui <- page_fillable(
  theme = bs_theme(bg = "#fcfefe", fg = "#595959",
                   base_font = "Atkinson Hyperlegible", font_scale = 1.5),
  padding = "1rem",
  title = NULL,
  layout_columns(
    fill = FALSE,
    col_widths = breakpoints(sm = c(-3, 6, -3)),
    sliderInput("span", "Span",
                min = 0.1, max = 1, value = 0.3, step = 0.05)
  ),
  layout_columns(
    plotOutput("loess_plot")
  )
)

server <- function(input, output, session) {
  light_bg <- "#fcfefe" # from aml4td.scss
  grid_theme <- bs_theme(
    bg = "#fcfefe", fg = "#595959"
  )
  
  # ------------------------------------------------------------------------------
  
  theme_light_bl<- function(...) {
    
    ret <- ggplot2::theme_bw(...)
    
    col_rect <- ggplot2::element_rect(fill = light_bg, colour = light_bg)
    ret$panel.background  <- col_rect
    ret$plot.background   <- col_rect
    ret$legend.background <- col_rect
    ret$legend.key        <- col_rect
    
    larger_x_text <- ggplot2::element_text(size = rel(1.25))
    larger_y_text <- ggplot2::element_text(size = rel(1.25), angle = 90)
    ret$axis.text.x <- larger_x_text
    ret$axis.text.y <- larger_y_text
    ret$axis.title.x <- larger_x_text
    ret$axis.title.y <- larger_y_text  
    
    ret$legend.position <- "top"
    
    ret
  }
  
  output$loess_plot <- renderPlot({
    ggplot(data, aes(x, y)) +
      geom_point(alpha = 0.5) +
      geom_smooth(method = "loess", 
                  formula = y ~ x,
                  se = FALSE,
                  span = input$span,
                  color = assess_color) +
      labs(subtitle = paste("Span:", input$span),
           x = "X", y = "Y") +
      theme_light_bl(base_size = 16)
  })
}

shinyApp(ui, server)

Multivariate adaptive regression spline (MARS)

Simple spline functions

MARS

  • Combines
    • Stepwise regression
    • Polynomial regression
    • Step functions
  • Good for high-dimensional data
  • Efficient estimation with cross-validation

Piecewise basis functions

\[(x - t)_+ = \begin{cases} x - t, & \text{if } x > t, \\ 0, & \text{otherwise} \end{cases}\]

and

\[(t - x)_+ = \begin{cases} t - x, & \text{if } x < t, \\ 0, & \text{otherwise} \end{cases}\]

MARS

  • Piecewise linear basis functions
  • Reflected pairs
  • Form pairs for inputs \(X_j\) with knots at \(x_{ij}\) based on selection criteria
  • Typically overfit to some maximum number of terms, then prune model back

MARS model

mars()

mars_spec <- mars(mode = "regression", num_terms = tune(), prod_degree = tune())
  • num_terms - number of model terms (e.g. how many hinge functions)
  • prod_degree - degree of interaction terms

MARS

Advantages

  • Accurate if local linear relationships are correct
  • Quick computation
  • Automated feature selection
  • Intuitive non-linear relationships

Disadvantages

  • Not accurate if local linear relationships are incorrect
  • Not as accurate as more advanced non-linear algorithms
  • Difficult to implement more advanced spline features

⏱️ Your turn

Implement a MARS model. Use grid tuning to determine the optimal value for num_terms and prod_degree. Evaluate the model’s performance.

08:00

# recipe
mars_rec <- recipe(avg_price_per_room ~ ., data = hotel_train) |>
  step_novel(all_nominal_predictors()) |>
  step_dummy(all_nominal_predictors())

# model specification
mars_spec <- mars(mode = "regression", num_terms = tune(), prod_degree = tune())

# workflow
mars_wf <- workflow() |>
  add_recipe(mars_rec) |>
  add_model(mars_spec)

# space filling grid
set.seed(749)
mars_params <- extract_parameter_set_dials(mars_wf) |>
  grid_space_filling(size = 10)

# tune the model
mars_res <- mars_wf |>
  tune_grid(
    hotel_folds,
    grid = mars_params,
    metrics = reg_metrics
  )

Wrap-up

Recap

  • Variables can be rescaled to reduce skewness, sometimes improving model performance
  • Basis expansions and splines can be used to model non-linear relationships
  • MARS is a flexible method for modeling non-linear relationships
  • Always compare outcomes of different models to ensure transformations or nonlinear components improve model performance

My sunflower

Dr Soltoff standing next to a homegrown sunflower. Notice it's height, it's vivid color. Very demure.