Feature engineering for numeric predictors and nonlinear relationships

Lecture 6

Dr. Benjamin Soltoff

Cornell University
INFO 4940/5940 - Fall 2024

September 17, 2024

Announcements

Dr. Soltoff office hours tomorrow 9:30-11:30am
Homework 1 due tomorrow

Learning objectives

Identify 1-to-1 and 1-to-many transformation methods for encoding numeric predictors
Implement transformations for numeric predictors to reduce skewness
Introduce basis expansions and splines to incorporate non-linear relationships in predictors
Define multivariate adaptive regression spline (MARS) for modeling non-linear relationships
Model nonlinear relationships between predictors and the response variable

Application exercise

`ae-05`

Go to the course GitHub org and find your ae-05 (repo name will be suffixed with your GitHub name).
Clone the repo in RStudio, open the Quarto document in the repo, install the required packages, and follow along and complete the exercises.
Render, commit, and push your edits by the AE deadline – end of the day

From last time

library(tidymodels)

# Add another package:
library(textrecipes)

data(hotel_rates)
set.seed(295)

hotel_rates <- hotel_rates |>
  sample_n(5000) |>
  arrange(arrival_date) |>
  select(-arrival_date) |>
  mutate(
    company = factor(as.character(company)),
    country = factor(as.character(country)),
    agent = factor(as.character(agent))
  )

set.seed(4028)
hotel_split <- initial_split(hotel_rates, strata = avg_price_per_room)

hotel_train <- training(hotel_split)
hotel_test <- testing(hotel_split)

Problems with numeric predictors

Potential issues with numeric predictors

Skewness
Outliers
Correlation

Mitigating issues with numeric predictors

Choose a robust model
1 : 1 transformations
1 : many transformations
Many : many transformations (in a couple of weeks)

1:1 transformations

Scale
Center

Skewed variables

Outliers

Something (such as a geologic feature) that is situated away from or classed differently from a main or related body.¹

A statistical observation that is markedly different in value from the others of the sample.²

Extreme values are not inherently “wrong”. Removing them can also skew our analysis by excluding relevant data points.

But we also do not want to give abnormal weight to these outliers.

What do we do with outliers?

Remove them
Truncate them
Impute them
Transform them

Transformation functions

Box-Cox procedure

\[ x^* = \begin{cases} \lambda^{-1}(x^\lambda-1) & \text{if $\lambda \ne 0$,} \\[3pt] \log(x) &\text{if $\lambda = 0$.} \end{cases} \]

Box-Cox procedure

Yeo-Johnson

Box-Cox transformations only work on positive values
Yeo-Johnson extends this to work with all numeric values

\[ x^* = \begin{cases} \lambda^{-1}\left[(x + 1)^\lambda-1\right] & \text{if $\lambda \ne 0$ and $x \ge 0$,} \\[3pt] \log(x + 1) &\text{if $\lambda = 0$ and $x \ge 0$.} \\[3pt] -(2 - \lambda)^{-1}\left[(-x + 1)^{2 - \lambda}-1\right] & \text{if $\lambda \ne 2$ and $x < 0$,} \\[3pt] -\log(-x + 1) &\text{if $\lambda = 2$ and $x < 0$.} \end{cases} \]

How is $\lambda$ determined?

$\lambda$ is a parameter to be learned from the data

recipe(avg_price_per_room ~ ., data = hotel_train) |>
  step_YeoJohnson(lead_time) |>
  prep() |>
  tidy(number = 1)

# A tibble: 1 × 3
  terms     value id              
  <chr>     <dbl> <chr>           
1 lead_time 0.173 YeoJohnson_YKdYz

Training data used to estimate model parameters
Training data used to estimate transformation parameters

Resampling Strategy

We’ll use simple 10-fold cross-validation (stratified sampling):

set.seed(472)
hotel_folds <- vfold_cv(hotel_train, strata = avg_price_per_room)
hotel_folds

#  10-fold cross-validation using stratification 
# A tibble: 10 × 2
   splits             id    
   <list>             <chr> 
 1 <split [3372/377]> Fold01
 2 <split [3373/376]> Fold02
 3 <split [3373/376]> Fold03
 4 <split [3373/376]> Fold04
 5 <split [3373/376]> Fold05
 6 <split [3374/375]> Fold06
 7 <split [3375/374]> Fold07
 8 <split [3376/373]> Fold08
 9 <split [3376/373]> Fold09
10 <split [3376/373]> Fold10

reg_metrics <- metric_set(mae, rmse, rsq)

⏱️ Your turn

Examine hotel_train and identify a numeric predictor that is skewed. Incorporate an appropriate transformation into the recipe below and estimate a linear regression model using 10-fold cross-validation. How does the model perform with and without the transformation?

hotel_rec <- recipe(avg_price_per_room ~ ., data = hotel_train) |>
  step_dummy(all_nominal_predictors()) |>
  step_zv(all_predictors())

08:00

hotel_yj_rec <- recipe(avg_price_per_room ~ ., data = hotel_train) |>
  step_YeoJohnson(all_numeric_predictors()) |>
  step_dummy(all_nominal_predictors()) |>
  step_zv(all_predictors())

# check the recipe
hotel_yj_rec |>
  prep() |>
  tidy(number = 1)

# A tibble: 7 × 3
  terms                        value id              
  <chr>                        <dbl> <chr>           
1 lead_time                  0.173   YeoJohnson_f7mCz
2 stays_in_weekend_nights    0.00547 YeoJohnson_f7mCz
3 stays_in_week_nights       0.121   YeoJohnson_f7mCz
4 booking_changes           -4.16    YeoJohnson_f7mCz
5 total_of_special_requests -0.672   YeoJohnson_f7mCz
6 arrival_date_num          -3.21    YeoJohnson_f7mCz
7 historical_adr            -0.591   YeoJohnson_f7mCz

set.seed(9)

hotel_lm_wflow <- workflow() |>
  add_recipe(hotel_rec) |>
  add_model(linear_reg())

hotel_lm_res <- hotel_lm_wflow |>
  fit_resamples(hotel_folds,
    metrics = reg_metrics,
    control = control_resamples(save_pred = TRUE)
  )

collect_metrics(hotel_lm_res)

# A tibble: 3 × 6
  .metric .estimator   mean     n std_err .config             
  <chr>   <chr>       <dbl> <int>   <dbl> <chr>               
1 mae     standard   17.1      10 0.176   Preprocessor1_Model1
2 rmse    standard   22.8      10 0.296   Preprocessor1_Model1
3 rsq     standard    0.881    10 0.00359 Preprocessor1_Model1

set.seed(9)

hotel_lm_skew_wflow <- workflow() |>
  add_recipe(hotel_yj_rec) |>
  add_model(linear_reg())

hotel_lm_skew_res <- hotel_lm_skew_wflow |>
  fit_resamples(hotel_folds,
    metrics = reg_metrics,
    control = control_resamples(save_pred = TRUE)
  )

collect_metrics(hotel_lm_skew_res)

# A tibble: 3 × 6
  .metric .estimator   mean     n std_err .config             
  <chr>   <chr>       <dbl> <int>   <dbl> <chr>               
1 mae     standard   18.9      10 0.261   Preprocessor1_Model1
2 rmse    standard   25.3      10 0.374   Preprocessor1_Model1
3 rsq     standard    0.853    10 0.00384 Preprocessor1_Model1

Standardizing to a common scale

Some models require predictors to be on a common scale (data preprocessing)

$k$ nearest neighbors
Support vector machines
Penalized regression

Standardizing to a common scale

Standardizing to a common scale

hotel_rec <- recipe(avg_price_per_room ~ ., data = hotel_train) |>
  step_dummy(all_nominal_predictors()) |>
  step_zv(all_predictors()) |>
  step_normalize(all_numeric_predictors())

1:many transformations

Basis expansion

Features of a predictor $x$ derived from a set of functions $f_1(x), f_2(x), \ldots, f_i(x)$ that can be combined using a linear combination.

Polynomial expansion

Simple linear regression

\[ y_i = \beta_0 + \beta_1 x_{i} + \epsilon_i \]

Cubic polynomial expansion

\[ y_i = \beta_0 + \beta_1 x_{i} + \beta_2 x_{i}^2 + \beta_3 x_{i}^3 + \epsilon_i \]

Polynomial expansion

#| label: fig-global-polynomial
#| viewerHeight: 700
#| viewerWidth: "100%"
#| standalone: true
library(shiny)
library(patchwork)
library(dplyr)
library(tidyr)
library(ggplot2)
library(splines2)
library(bslib)
library(viridis)
library(aspline)

data(fossil)

ui <- page_fillable(
  theme = bs_theme(bg = "#fcfefe", fg = "#595959",
                   base_font = "Atkinson Hyperlegible", font_scale = 1.5),
  padding = "1rem",
  layout_columns(
    fill = FALSE,
    col_widths = breakpoints(sm = c(-3, 6, -3)),
    sliderInput(
      "global_deg",
      label = "Polynomial Degree",
      min = 1L, max = 21L, step = 1L, value = 3L,
      ticks = TRUE
    ) # sliderInput
  ), # layout_columns
  layout_columns(
    fill = FALSE,
    col_widths = breakpoints(sm = c(-1, 10, -1)),
    as_fill_carrier(plotOutput('global', height = "500px"))
  )
)


server <- function(input, output, session) {
  
  light_bg <- "#fcfefe" # from aml4td.scss
  grid_theme <- bs_theme(
    bg = "#fcfefe", fg = "#595959"
  )
  
  # ------------------------------------------------------------------------------
  
  theme_light_bl<- function(...) {
    
    ret <- ggplot2::theme_bw(...)
    
    col_rect <- ggplot2::element_rect(fill = light_bg, colour = light_bg)
    ret$panel.background  <- col_rect
    ret$plot.background   <- col_rect
    ret$legend.background <- col_rect
    ret$legend.key        <- col_rect
    
    larger_x_text <- ggplot2::element_text(size = rel(1.25))
    larger_y_text <- ggplot2::element_text(size = rel(1.25), angle = 90)
    ret$axis.text.x <- larger_x_text
    ret$axis.text.y <- larger_y_text
    ret$axis.title.x <- larger_x_text
    ret$axis.title.y <- larger_y_text  
    
    ret$legend.position <- "top"
    
    ret
  }
  
  col_rect <- ggplot2::element_rect(fill = light_bg, colour = light_bg)
  
  maybe_lm <- function(x) {
    try(lm(y ~ poly(x, input$piecewise_deg), data = x), silent = TRUE)
  }
  
  expansion_to_tibble <- function(x, original, prefix = "term ") {
    cls <- class(x)[1]
    nms <- recipes::names0(ncol(x), prefix)
    colnames(x) <- nms
    x <- as_tibble(x)
    x$variable <- original
    res <- tidyr::pivot_longer(x, cols = c(-variable))
    if (cls != "poly") {
      res <- res[res$value > .Machine$double.eps,]
    }
    res
  } 
  
  mult_poly <- function(dat, degree = 4) {
    rng <- extendrange(dat$x, f = .025)
    grid <- seq(rng[1], rng[2], length.out = 1000)
    grid_df <- tibble(x = grid)
    feat <- poly(grid_df$x, degree)
    res <- expansion_to_tibble(feat, grid_df$x)
    
    # make some random names so that we can plot the features with distinct colors
    rand_names <- lapply(1:degree, function(x) paste0(sample(letters)[1:10], collapse = ""))
    rand_names<- unlist(rand_names)
    rand_names <- tibble(name = unique(res$name), name2 = rand_names)
    res <- 
      dplyr::inner_join(res, rand_names, by = dplyr::join_by(name)) |> 
      dplyr::select(-name) |> 
      dplyr::rename(name = name2)
    res
  }
  
  # ------------------------------------------------------------------------------
  
  spline_example <- tibble(x = fossil$age, y = fossil$strontium.ratio)
  rng <- extendrange(fossil$age, f = .025)
  grid <- seq(rng[1], rng[2], length.out = 1000)
  grid_df <- tibble(x = grid)
  alphas <- 1 / 4
  line_wd <- 1.0
  
  base_p <-
    spline_example |>
    ggplot(aes(x = x, y = y)) +
    geom_point(alpha = 3 / 4, pch = 1, cex = 3) +
    labs(x = "Age", y = "Isotope Ratio") +
    lims(x = rng) +
    theme_light_bl(base_size = 16)
  
  output$global <- renderPlot({
    
    poly_fit <- lm(y ~ poly(x, input$global_deg), data = spline_example)
    poly_pred <- 
      predict(poly_fit, grid_df, interval = "confidence", level = .90) |> 
      bind_cols(grid_df)
    
    global_p <- base_p
    
    if (input$global_deg > 0) {
      global_p <-
        global_p +
        geom_ribbon(
          data = poly_pred,
          aes(y = NULL, ymin = lwr, ymax = upr),
          alpha = 1 / 8) +
        geom_line(
          data = poly_pred,
          aes(y = fit),
          col = "black",
          linewidth = line_wd) +
        theme(
          plot.margin = margin(t = -20, r = 0, b = 0, l = 0),
          panel.background = col_rect,
          plot.background = col_rect,
          legend.background = col_rect,
          legend.key = col_rect
        )
      
      feature_p <-
        poly(grid_df$x,  input$global_deg) |> 
        expansion_to_tibble(grid_df$x) |> 
        ggplot(aes(variable, y = value, group = name, col = name)) +
        geom_line(show.legend = FALSE) + # , linewidth = 1, alpha = 1 / 2
        theme_void() +
        theme(
          plot.margin = margin(t = 0, r = 0, b = -20, l = 0),
          panel.background = col_rect,
          plot.background = col_rect,
          legend.background = col_rect,
          legend.key = col_rect
        ) +
        scale_color_viridis(discrete = TRUE, option = "turbo")
      
      p <- (feature_p / global_p) + plot_layout(heights = c(1.5, 4))
    }
    
    print(p)
    
  })
  
}

app <- shinyApp(ui = ui, server = server)

Issues with polynomial expansion

Variance is high as number of polynomial terms increases
Confidence intervals expand rapidly
Overfitting
Global pattern not always appropriate

Split polynomial fits to different regions

#| label: fig-piecewise-polynomials
#| viewerHeight: 700
#| viewerWidth: "100%"
#| standalone: true

library(shiny)
library(patchwork)
library(dplyr)
library(tidyr)
library(ggplot2)
library(splines2)
library(bslib)
library(viridis)
library(aspline)

data(fossil)

ui <- page_fillable(
  theme = bs_theme(bg = "#fcfefe", fg = "#595959",
                   base_font = "Atkinson Hyperlegible", font_scale = 1.5),
  padding = "1rem",
  layout_columns(
    fill = FALSE,
    col_widths = breakpoints(sm = c(-1, 5, 5, -1)),
    sliderInput(
      "piecewise_deg",
      label = "Polynomial Degree",
      min = 0L, max = 6L, step = 1L, value = 4L
    ), # sliderInput
    sliderInput(
      "cuts",
      label = "Cutpoints",
      min = 93L, max = 122L, step = 1, value = c(101, 118)
    ) # sliderInput
  ), # layout_columns
  layout_columns(
    fill = FALSE,
    col_widths = breakpoints(sm = c(-1, 10, -1)),
    as_fill_carrier(plotOutput('pieces', height = "500px"))
  )      
)

server <- function(input, output, session) {
  
  light_bg <- "#fcfefe" # from aml4td.scss
  grid_theme <- bs_theme(
    bg = "#fcfefe", fg = "#595959"
  )
  
  # ------------------------------------------------------------------------------
  
  theme_light_bl<- function(...) {
    
    ret <- ggplot2::theme_bw(...)
    
    col_rect <- ggplot2::element_rect(fill = light_bg, colour = light_bg)
    ret$panel.background  <- col_rect
    ret$plot.background   <- col_rect
    ret$legend.background <- col_rect
    ret$legend.key        <- col_rect
    
    larger_x_text <- ggplot2::element_text(size = rel(1.25))
    larger_y_text <- ggplot2::element_text(size = rel(1.25), angle = 90)
    ret$axis.text.x <- larger_x_text
    ret$axis.text.y <- larger_y_text
    ret$axis.title.x <- larger_x_text
    ret$axis.title.y <- larger_y_text  
    
    ret$legend.position <- "top"
    
    ret
  }
  
  col_rect <- ggplot2::element_rect(fill = light_bg, colour = light_bg)
  
  maybe_lm <- function(x) {
    try(lm(y ~ poly(x, input$piecewise_deg), data = x), silent = TRUE)
  }
  
  expansion_to_tibble <- function(x, original, prefix = "term ") {
    cls <- class(x)[1]
    nms <- recipes::names0(ncol(x), prefix)
    colnames(x) <- nms
    x <- as_tibble(x)
    x$variable <- original
    res <- tidyr::pivot_longer(x, cols = c(-variable))
    if (cls != "poly") {
      res <- res[res$value > .Machine$double.eps,]
    }
    res
  } 
  
  mult_poly <- function(dat, degree = 4) {
    rng <- extendrange(dat$x, f = .025)
    grid <- seq(rng[1], rng[2], length.out = 1000)
    grid_df <- tibble(x = grid)
    feat <- poly(grid_df$x, degree)
    res <- expansion_to_tibble(feat, grid_df$x)
    
    # make some random names so that we can plot the features with distinct colors
    rand_names <- lapply(1:degree, function(x) paste0(sample(letters)[1:10], collapse = ""))
    rand_names<- unlist(rand_names)
    rand_names <- tibble(name = unique(res$name), name2 = rand_names)
    res <- 
      dplyr::inner_join(res, rand_names, by = dplyr::join_by(name)) |> 
      dplyr::select(-name) |> 
      dplyr::rename(name = name2)
    res
  }
  
  # ------------------------------------------------------------------------------
  
  spline_example <- tibble(x = fossil$age, y = fossil$strontium.ratio)
  rng <- extendrange(fossil$age, f = .025)
  grid <- seq(rng[1], rng[2], length.out = 1000)
  grid_df <- tibble(x = grid)
  alphas <- 1 / 4
  line_wd <- 1.0
  
  base_p <-
    spline_example |>
    ggplot(aes(x = x, y = y)) +
    geom_point(alpha = 3 / 4, pch = 1, cex = 3) +
    labs(x = "Age", y = "Isotope Ratio") +
    theme_light_bl(base_size = 16)
  
  output$pieces <- renderPlot({
    
    cuts <- c(0, sort(input$cuts), 60)
    piece_cols <- c("#1B9E77", "#D95F02", "#7570B3")
    piece_p <- base_p
    
    if (input$piecewise_deg > 0) {
      data_splt <-
        spline_example |>
        dplyr::mutate(x_cut = cut(x, breaks = cuts, include.lowest = TRUE)) |>
        tidyr::nest(.by = x_cut) |>
        mutate(
          fit = lapply(data, maybe_lm),
          features = lapply(data, mult_poly, degree = input$piecewise_deg)
        )
      grid_splt <-
        dplyr::tibble(x = grid) |>
        dplyr::mutate(x_cut = cut(x, breaks = cuts, include.lowest = TRUE))  |>
        tidyr::nest(.by = x_cut)
      
      for (i in 1:3) {
        sub_pred <- grid_splt$data[[i]]
        if (!inherits(data_splt$fit[[i]], "try-error")) {
          sub_pred <-
            sub_pred |>
            dplyr::bind_cols(predict(data_splt$fit[[i]], sub_pred, 
                                     interval = "confidence", level = .90))
          
          piece_p <-
            piece_p +
            geom_ribbon(
              data = sub_pred,
              aes(y = NULL, ymin = lwr, ymax = upr),
              alpha = 1 / 15
            ) +
            geom_line(
              data = sub_pred,
              aes(y = fit),
              linewidth = line_wd
            )
        }
      }
      
      set.seed(383) # to control colors
      feature_p <- 
        data_splt |> 
        dplyr::select(features) |> 
        tidyr::unnest(features) |> 
        ggplot(aes(x = variable, y = value, col = name)) + 
        geom_line(show.legend = FALSE) +
        theme_void() +
        theme(
          plot.margin = margin(t = 0, r = 0, b = -20, l = 0),
          panel.background = col_rect,
          plot.background = col_rect,
          legend.background = col_rect,
          legend.key = col_rect
        ) +
        scale_color_viridis(discrete = TRUE, option = "turbo")
      
      p <- (feature_p / piece_p) + plot_layout(heights = c(1, 4))
      
    }
    
    print(p)
    
  })
}

app <- shinyApp(ui = ui, server = server)

Polynomial splines

Basis expansion creates different regions of the predictor space using boundaries called knots
Within each region, use polynomial functions
Ensure globally we have a smooth, continuous function

Simple spline functions

#| label: fig-simple-spline
#| viewerHeight: 700
#| viewerWidth: "100%"
#| standalone: true
library(shiny)
library(patchwork)
library(dplyr)
library(tidyr)
library(ggplot2)
library(bslib)
library(aspline)

data(fossil)

ui <- page_fillable(
  theme = bs_theme(bg = "#fcfefe", fg = "#595959",
                   base_font = "Atkinson Hyperlegible", font_scale = 1.5),
  padding = "1rem",
  layout_columns(
    fill = FALSE,
    col_widths = breakpoints(sm = c(-2, 8, -2)),
    sliderInput(
      "cuts",
      label = "Cutpoints",
      min = 93L, max = 122L, step = 1, value = c(107, 114)
    ) # sliderInput
  ), # layout_columns
  layout_columns(
    fill = FALSE,
    col_widths = breakpoints(sm = c(-1, 10, -1)),
    as_fill_carrier(plotOutput('splime_spline', height = "500px"))
  )      
)

server <- function(input, output, session) {
  
  light_bg <- "#fcfefe" # from aml4td.scss
  grid_theme <- bs_theme(
    bg = "#fcfefe", fg = "#595959"
  )
  
  # ------------------------------------------------------------------------------
  
  theme_light_bl<- function(...) {
    
    ret <- ggplot2::theme_bw(...)
    
    col_rect <- ggplot2::element_rect(fill = light_bg, colour = light_bg)
    ret$panel.background  <- col_rect
    ret$plot.background   <- col_rect
    ret$legend.background <- col_rect
    ret$legend.key        <- col_rect
    
    larger_x_text <- ggplot2::element_text(size = rel(1.25))
    larger_y_text <- ggplot2::element_text(size = rel(1.25), angle = 90)
    ret$axis.text.x <- larger_x_text
    ret$axis.text.y <- larger_y_text
    ret$axis.title.x <- larger_x_text
    ret$axis.title.y <- larger_y_text  
    
    ret$legend.position <- "top"
    
    ret
  }
  
  col_rect <- ggplot2::element_rect(fill = light_bg, colour = light_bg)
  
  # ------------------------------------------------------------------------------
  
  spline_example <- tibble(x = fossil$age, y = fossil$strontium.ratio)
  rng <- extendrange(fossil$age, f = .025)
  grid <- seq(rng[1], rng[2], length.out = 1000)
  grid_df <- tibble(x = grid)
  alphas <- 1 / 4
  line_wd <- 1.0
  
  base_p <-
    spline_example |>
    ggplot(aes(x = x, y = y)) +
    geom_point(alpha = 3 / 4, pch = 1, cex = 3) +
    labs(x = "Age", y = "Isotope Ratio") +
    theme_light_bl(base_size = 16)
  
  output$splime_spline <- renderPlot({
    
    spline_p <- base_p
    
    h <- function(x) {
      ifelse(x > 0, x, 0)
    }
    
    mod_dat <- 
      spline_example |> 
      mutate(
        x_2 = x^2, 
        x_3 = x^3,
        x_k_1 = pmax(h(x-min(input$cuts))^3, 0),
        x_k_2 = pmax(h(x-max(input$cuts))^3, 0)
      )
    
    grid_spln <- 
      grid_df |> 
      mutate(
        x_2 = x^2, 
        x_3 = x^3,
        x_k_1 = pmax(h(x-min(input$cuts))^3, 0),
        x_k_2 = pmax(h(x-max(input$cuts))^3, 0)
      )
    
    features <- 
      rbind(
        tibble::tibble(x = grid_spln$x, value = grid_spln$x_k_1, term = "4") |> 
          filter(value != 0),
        tibble::tibble(x = grid_spln$x, value = grid_spln$x_k_2, term = "5") |> 
          filter(value != 0)
      )
    
    fit_1 <- lm(y ~ ., data = mod_dat)
    spline_pred <- 
      predict(fit_1, grid_spln, interval = "confidence", level = .90) |> 
      bind_cols(grid_df)
    
    spline_p <-
      spline_p +
      geom_ribbon(
        data = spline_pred,
        aes(y = NULL, ymin = lwr, ymax = upr),
        alpha = 1 / 15
      ) +
      geom_line(
        data = spline_pred,
        aes(y = fit),
        linewidth = line_wd
      ) +
      geom_vline(xintercept = min(input$cuts), col = "#A6CEE3", lty = 2, linewidth = 1) +
      geom_vline(xintercept = max(input$cuts), col = "#1F78B4", lty = 2, linewidth = 1)
    
    term_p <- 
      features |> 
      ggplot(aes(x, value, col = term)) +
      geom_line(show.legend = FALSE, linewidth = 1) +
      lims(x = rng) +
      theme_void() +
      theme(
        plot.margin = margin(t = 0, r = 0, b = -20, l = 0),
        panel.background = col_rect,
        plot.background = col_rect,
        legend.background = col_rect,
        legend.key = col_rect
      ) +
      scale_color_brewer(palette = "Paired")
    
    p <- (term_p / spline_p) + plot_layout(heights = c(1, 4))
    
    print(p)
    
  })
}

app <- shinyApp(ui = ui, server = server)

How to define the knots

Knot locations

$n$ groups with equal range
$n$ groups with equal number of observations

How many knots?

Tuning parameter

Types of spline functions

Orthogonal polynomial basis functions: step_poly()
Basis splines: step_spline_b()
B-spline basis functions: step_bs()
Natural splines: step_spline_natural()
Natural spline basis functions: step_ns()

Common tuning parameters

Number of knots
Degrees of freedom - number of individual basis functions

Natural cubic spline

#| label: fig-natural-cubic-spline
#| viewerHeight: 700
#| viewerWidth: "100%"
#| standalone: true

library(shiny)
library(patchwork)
library(dplyr)
library(tidyr)
library(ggplot2)
library(splines2)
library(bslib)
library(viridis)
library(aspline)

data(fossil)

ui <- page_fillable(
  theme = bs_theme(bg = "#fcfefe", fg = "#595959",
                   base_font = "Atkinson Hyperlegible", font_scale = 1.5),
  padding = "1rem",
  layout_columns(
    fill = FALSE,
    col_widths = breakpoints(sm = c(-3, 6, -3)),
    sliderInput(
      "spline_df",
      label = "# Spline Terms",
      min = 3L, max = 20L, step = 1L, value = 9L
    ), # sliderInput
  ), # layout_columns
  layout_columns(
    fill = FALSE,
    col_widths = breakpoints(sm = c(-1, 10, -1)),
    as_fill_carrier(plotOutput('spline', height = "500px"))
  )         
)

server <- function(input, output, session) {
  
  light_bg <- "#fcfefe" # from aml4td.scss
  grid_theme <- bs_theme(
    bg = "#fcfefe", fg = "#595959"
  )
  
  # ------------------------------------------------------------------------------
  
  theme_light_bl<- function(...) {
    
    ret <- ggplot2::theme_bw(...)
    
    col_rect <- ggplot2::element_rect(fill = light_bg, colour = light_bg)
    ret$panel.background  <- col_rect
    ret$plot.background   <- col_rect
    ret$legend.background <- col_rect
    ret$legend.key        <- col_rect
    
    larger_x_text <- ggplot2::element_text(size = rel(1.25))
    larger_y_text <- ggplot2::element_text(size = rel(1.25), angle = 90)
    ret$axis.text.x <- larger_x_text
    ret$axis.text.y <- larger_y_text
    ret$axis.title.x <- larger_x_text
    ret$axis.title.y <- larger_y_text  
    
    ret$legend.position <- "top"
    
    ret
  }
  
  col_rect <- ggplot2::element_rect(fill = light_bg, colour = light_bg)
  
  maybe_lm <- function(x) {
    try(lm(y ~ poly(x, input$piecewise_deg), data = x), silent = TRUE)
  }
  
  expansion_to_tibble <- function(x, original, prefix = "term ") {
    cls <- class(x)[1]
    nms <- recipes::names0(ncol(x), prefix)
    colnames(x) <- nms
    x <- as_tibble(x)
    x$variable <- original
    res <- tidyr::pivot_longer(x, cols = c(-variable))
    if (cls != "poly") {
      res <- res[res$value > .Machine$double.eps,]
    }
    res
  } 
  
  mult_poly <- function(dat, degree = 4) {
    rng <- extendrange(dat$x, f = .025)
    grid <- seq(rng[1], rng[2], length.out = 1000)
    grid_df <- tibble(x = grid)
    feat <- poly(grid_df$x, degree)
    res <- expansion_to_tibble(feat, grid_df$x)
    
    # make some random names so that we can plot the features with distinct colors
    rand_names <- lapply(1:degree, function(x) paste0(sample(letters)[1:10], collapse = ""))
    rand_names<- unlist(rand_names)
    rand_names <- tibble(name = unique(res$name), name2 = rand_names)
    res <- 
      dplyr::inner_join(res, rand_names, by = dplyr::join_by(name)) |> 
      dplyr::select(-name) |> 
      dplyr::rename(name = name2)
    res
  }
  
  # ------------------------------------------------------------------------------
  
  spline_example <- tibble(x = fossil$age, y = fossil$strontium.ratio)
  rng <- extendrange(fossil$age, f = .025)
  grid <- seq(rng[1], rng[2], length.out = 1000)
  grid_df <- tibble(x = grid)
  alphas <- 1 / 4
  line_wd <- 1.0
  
  base_p <-
    spline_example |>
    ggplot(aes(x = x, y = y)) +
    geom_point(alpha = 3 / 4, pch = 1, cex = 3) +
    labs(x = "Age", y = "Isotope Ratio") +
    theme_light_bl(base_size = 16)
  
  output$spline <- renderPlot({
    
    spline_fit <- lm(y ~ naturalSpline(x, df = input$spline_df), data = spline_example)
    spline_pred <- 
      predict(spline_fit, grid_df, interval = "confidence", level = .90) |> 
      bind_cols(grid_df)
    
    spline_p <- base_p +
      geom_ribbon(
        data = spline_pred,
        aes(y = NULL, ymin = lwr, ymax = upr),
        alpha = 1 / 15) +
      geom_line(
        data = spline_pred,
        aes(y = fit),
        col = "black",
        linewidth = line_wd)
    
    feature_p <-
      naturalSpline(grid_df$x, df = input$spline_df) |> 
      expansion_to_tibble(grid_df$x) |> 
      ggplot(aes(variable, y = value, group = name, col = name)) +
      geom_line(show.legend = FALSE) + 
      lims(x = rng) +
      theme_void() +
      theme(
        plot.margin = margin(t = 0, r = 0, b = -20, l = 0),
        panel.background = col_rect,
        plot.background = col_rect,
        legend.background = col_rect,
        legend.key = col_rect
      ) +
      scale_color_viridis(discrete = TRUE, option = "turbo")
    
    p <- (feature_p / spline_p) + plot_layout(heights = c(1, 4))
    
    print(p)
    
  })
  
}

app <- shinyApp(ui = ui, server = server)

⏱️ Your turn

Implement a natural spline for lead_time and historical_adr. Use grid tuning to determine the optimal value for deg_free. Evaluate the model’s performance.

08:00

ns_rec <- recipe(avg_price_per_room ~ ., data = hotel_train) |>
  step_ns(lead_time, historical_adr, deg_free = tune()) |>
  step_novel(all_nominal_predictors()) |>
  step_dummy(all_nominal_predictors())

ns_wf <- workflow() |>
  add_recipe(ns_rec) |>
  add_model(linear_reg())

ns_res <- ns_wf |>
  tune_grid(
    hotel_folds,
    grid = 10,
    metrics = reg_metrics
  )

Locally weighted smoothing (LOESS)

#| label: fig-loess
#| viewerHeight: 700
#| viewerWidth: "100%"
#| standalone: true

library(shiny)
library(ggplot2)
library(bslib)

assess_color <- "#006699"

# Generate sample data
set.seed(123)
n <- 100
x <- seq(0, 10, length.out = n)
y <- sin(x) + rnorm(n, sd = 0.3)
data <- data.frame(x = x, y = y)

ui <- page_fillable(
  theme = bs_theme(bg = "#fcfefe", fg = "#595959",
                   base_font = "Atkinson Hyperlegible", font_scale = 1.5),
  padding = "1rem",
  title = NULL,
  layout_columns(
    fill = FALSE,
    col_widths = breakpoints(sm = c(-3, 6, -3)),
    sliderInput("span", "Span",
                min = 0.1, max = 1, value = 0.3, step = 0.05)
  ),
  layout_columns(
    plotOutput("loess_plot")
  )
)

server <- function(input, output, session) {
  light_bg <- "#fcfefe" # from aml4td.scss
  grid_theme <- bs_theme(
    bg = "#fcfefe", fg = "#595959"
  )
  
  # ------------------------------------------------------------------------------
  
  theme_light_bl<- function(...) {
    
    ret <- ggplot2::theme_bw(...)
    
    col_rect <- ggplot2::element_rect(fill = light_bg, colour = light_bg)
    ret$panel.background  <- col_rect
    ret$plot.background   <- col_rect
    ret$legend.background <- col_rect
    ret$legend.key        <- col_rect
    
    larger_x_text <- ggplot2::element_text(size = rel(1.25))
    larger_y_text <- ggplot2::element_text(size = rel(1.25), angle = 90)
    ret$axis.text.x <- larger_x_text
    ret$axis.text.y <- larger_y_text
    ret$axis.title.x <- larger_x_text
    ret$axis.title.y <- larger_y_text  
    
    ret$legend.position <- "top"
    
    ret
  }
  
  output$loess_plot <- renderPlot({
    ggplot(data, aes(x, y)) +
      geom_point(alpha = 0.5) +
      geom_smooth(method = "loess", 
                  formula = y ~ x,
                  se = FALSE,
                  span = input$span,
                  color = assess_color) +
      labs(subtitle = paste("Span:", input$span),
           x = "X", y = "Y") +
      theme_light_bl(base_size = 16)
  })
}

shinyApp(ui, server)

Multivariate adaptive regression spline (MARS)

Simple spline functions

MARS

Combines
- Stepwise regression
- Polynomial regression
- Step functions
Good for high-dimensional data
Efficient estimation with cross-validation

Piecewise basis functions

\[(x - t)_+ = \begin{cases} x - t, & \text{if } x > t, \\ 0, & \text{otherwise} \end{cases}\]

and

\[(t - x)_+ = \begin{cases} t - x, & \text{if } x < t, \\ 0, & \text{otherwise} \end{cases}\]

MARS

Piecewise linear basis functions
Reflected pairs
Form pairs for inputs $X_j$ with knots at $x_{ij}$ based on selection criteria
Typically overfit to some maximum number of terms, then prune model back

MARS model

`mars()`

mars_spec <- mars(mode = "regression", num_terms = tune(), prod_degree = tune())

num_terms - number of model terms (e.g. how many hinge functions)
prod_degree - degree of interaction terms

MARS

Advantages

Accurate if local linear relationships are correct
Quick computation
Automated feature selection
Intuitive non-linear relationships

Disadvantages

Not accurate if local linear relationships are incorrect
Not as accurate as more advanced non-linear algorithms
Difficult to implement more advanced spline features

⏱️ Your turn

Implement a MARS model. Use grid tuning to determine the optimal value for num_terms and prod_degree. Evaluate the model’s performance.

08:00

# recipe
mars_rec <- recipe(avg_price_per_room ~ ., data = hotel_train) |>
  step_novel(all_nominal_predictors()) |>
  step_dummy(all_nominal_predictors())

# model specification
mars_spec <- mars(mode = "regression", num_terms = tune(), prod_degree = tune())

# workflow
mars_wf <- workflow() |>
  add_recipe(mars_rec) |>
  add_model(mars_spec)

# space filling grid
set.seed(749)
mars_params <- extract_parameter_set_dials(mars_wf) |>
  grid_space_filling(size = 10)

# tune the model
mars_res <- mars_wf |>
  tune_grid(
    hotel_folds,
    grid = mars_params,
    metrics = reg_metrics
  )

Wrap-up

Recap

Variables can be rescaled to reduce skewness, sometimes improving model performance
Basis expansions and splines can be used to model non-linear relationships
MARS is a flexible method for modeling non-linear relationships
Always compare outcomes of different models to ensure transformations or nonlinear components improve model performance

My sunflower

Dr Soltoff standing next to a homegrown sunflower. Notice it's height, it's vivid color. Very demure.

Feature engineering for numeric predictors and nonlinear relationships

Announcements

Announcements

Learning objectives

Application exercise

ae-05

From last time

Problems with numeric predictors

Potential issues with numeric predictors

Mitigating issues with numeric predictors

1:1 transformations

1:1 transformations

Skewed variables

Outliers

What do we do with outliers?

Transformation functions

Transformation functions

Box-Cox procedure

Box-Cox procedure

Yeo-Johnson

How is \(\lambda\) determined?

Resampling Strategy

⏱️ Your turn

Standardizing to a common scale

Standardizing to a common scale

Standardizing to a common scale

1:many transformations

Basis expansion

Polynomial expansion

Polynomial expansion

Issues with polynomial expansion

Split polynomial fits to different regions

Polynomial splines

Simple spline functions

How to define the knots

Knot locations

How many knots?

Types of spline functions

Natural cubic spline

⏱️ Your turn

Locally weighted smoothing (LOESS)

Multivariate adaptive regression spline (MARS)

Simple spline functions

MARS

Piecewise basis functions

MARS

MARS model

mars()

MARS

Advantages

Disadvantages

⏱️ Your turn

Wrap-up

Recap

My sunflower

`ae-05`

`mars()`