
Regression to the Mean in Sports

Last week, scientist David Borg posted an article to Twitter talking about regression to the mean in epidemiological research (1). Regression to the mean is a statistical phenomenon where extreme observations tend to move closer to the population mean in subsequent observations, due simply to natural variation. To steal Galton’s example, tall parents often have tall children, but those children, while taller than the average child, tend to be shorter than their parents (i.e., they regress toward the mean). It’s also one of the reasons why clinicians have such a difficult time knowing whether their intervention actually made the patient better or whether the observed improvement is simply regression to the mean over the course of treatment (something that well-designed studies attempt to rule out by using randomized controlled experiments).
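To see how natural variation alone produces this effect, here is a minimal simulation sketch (the setup and numbers are my own, purely for illustration): 1,000 “players” each have a fixed true ability, and we observe each of them twice with independent noise. The top performers in the first observation are, on average, closer to the mean in the second.

set.seed(123)
true_ability <- rnorm(n = 1000, mean = 0, sd = 1)
obs_1 <- true_ability + rnorm(n = 1000, mean = 0, sd = 1)
obs_2 <- true_ability + rnorm(n = 1000, mean = 0, sd = 1)

## compare the top 10% of first observations to their second observations
top_group <- obs_1 >= quantile(obs_1, 0.9)
mean(obs_1[top_group])   # far above the mean
mean(obs_2[top_group])   # still above the mean, but regressed toward it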

Of course, this phenomenon is not unique to epidemiology or biostatistics. In fact, the phrase comes up frequently in sport when discussing players or teams whose performance in a season is extremely high or low, with the general expectation that they will perform closer to average the following year. An example of this is the sophomore slump exhibited by rookies who perform at an extremely high level in their first season.

Given how common this phenomenon is in our lives, the goal of this blog article is to show what regression to the mean looks like for team wins in baseball from one year to the next.

Data

We will use data from the Lahman baseball database (freely available in R) and concentrate on team wins in the 2015 and 2016 MLB seasons.

library(tidyverse)
library(Lahman)
library(ggalt)

theme_set(theme_bw())

data(Teams)

dat <- Teams %>%
  select(yearID, teamID, W) %>%
  arrange(teamID, yearID) %>%
  filter(yearID %in% c(2015, 2016)) %>%
  group_by(teamID) %>%
  mutate(yr_2 = lead(W)) %>%
  rename(yr_1 = W) %>%
  filter(!is.na(yr_2)) %>%
  ungroup() 

dat %>%
  head()


Exploratory Data Analysis

dat %>%
  ggplot(aes(x = yr_1, y = yr_2, label = teamID)) +
  geom_point() +
  ggrepel::geom_text_repel() +
  geom_abline(intercept = 0,
              slope = 1,
              color = "red",
              linetype = "dashed",
              size = 1.1) +
  labs(x = "2015 wins",
       y = "2016 wins")

The dashed line is the line of equality. A team lying exactly on this line would have had the exact same number of wins in 2015 as in 2016. While no team lies exactly on this line, we can deduce from the chart that teams below the red line had more wins in 2015 and fewer in 2016, while the opposite is true for those above the line. Minnesota had a large decline in performance, going from just over 80 wins in 2015 to below 60 wins in 2016.
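As a quick check on that reading, a small supplementary snippet can count how many teams fall on each side of the line:

## count teams above vs below the line of equality
dat %>%
  mutate(direction = ifelse(yr_2 > yr_1, "above (more 2016 wins)", "below (more 2015 wins)")) %>%
  count(direction)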

The correlation between wins in 2015 and wins in 2016 is 0.54.

cor.test(dat$yr_1, dat$yr_2)

Plotting changes in wins from 2015 to 2016

We can plot each team and show its change in wins from 2015 (green) to 2016 (blue). We will break the teams into two groups: those that saw a decrease in wins in 2016 relative to 2015, and those that saw an increase. We will plot the z-score of team wins on the x-axis so that average is represented by 0.

## get the z-score of team wins for each season
dat <- dat %>%
  mutate(z_yr1 = (yr_1 - mean(yr_1)) / sd(yr_1),
         z_yr2 = (yr_2 - mean(yr_2)) / sd(yr_2))
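(As an aside, base R’s scale() function returns the same z-scores; the explicit mutate() above just keeps the calculation transparent.)

## equivalent z-scores using base R's scale()
dat %>%
  mutate(z_yr1 = as.numeric(scale(yr_1)),
         z_yr2 = as.numeric(scale(yr_2)))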

dat %>%
  mutate(pred_z_dir = ifelse(yr_2 > yr_1, "increase wins", "decrease wins")) %>%
  ggplot(aes(x = z_yr1, xend = z_yr2, y = reorder(teamID, z_yr1))) +
  geom_vline(xintercept = 0,
             linetype = "dashed",
             size = 1.2,
             color = "grey") +
  geom_dumbbell(size = 1.2,
                colour_x = "green",
                colour_xend = "blue",
                size_x = 6,
                size_xend = 6) +
  facet_wrap(~pred_z_dir, scales = "free_y") +
  labs(x = "Team Wins (z-score)",
       y = NULL,
       title = "Green = 2015 Wins\nBlue = 2016 Wins")

On the decreasing wins side, notice that all teams from SLN to LAA had more wins in 2015 (green) and then regressed towards the mean (or slightly below the mean) in 2016. From MIN down, those teams actually got even worse in 2016.

On the increasing wins side, from BOS down to PHI all of the teams regressed upward towards the mean (NOTE: regressing upward sounds weird, which is why some refer to this as reversion to the mean). From CHN to BAL, those teams were at or above average in 2015 and got better in 2016.

It makes sense that not all teams revert towards the mean in the second year given that teams attempt to upgrade their roster from one season to the next in order to maximize their chances of winning more games.

Regression to the Mean with Linear Regression

There are three ways we can evaluate regression to the mean using linear regression, and all three approaches will lead us to the same conclusion.

1. Linear regression with the raw data, predicting 2016 wins from 2015 wins.

2. Linear regression predicting 2016 wins from the grand mean centered 2015 wins (grand mean centered just means subtracting the league average wins, 81, from the observed wins for each team).

3. Linear regression using z-scores. This approach will produce results in standard deviation units rather than raw values.

## linear regression with raw values
fit <- lm(yr_2 ~ yr_1, data = dat)
summary(fit)

## linear regression with grand mean centered 2015 wins 
fit_grand_mean <- lm(yr_2 ~ I(yr_1 - mean(yr_1)), data = dat)
summary(fit_grand_mean)

## linear regression with z-score values
fit_z <- lm(z_yr2 ~ z_yr1, data = dat)
summary(fit_z)

Next, we take these equations and simply make predictions for 2016 win totals for each team.

dat$pred_fit <- fitted(fit)
dat$pred_fit_grand_mean <- fitted(fit_grand_mean)
dat$pred_fit_z <- fitted(fit_z)

dat %>%
  head()

You’ll notice that the first two predictions are exactly the same. The third prediction is in standard deviation units. For example, Arizona (ARI) finished 2015 with a z-score of -0.188 (0.188 standard deviations below the mean) and was predicted to regress toward the mean in 2016, to a z-score of -0.102. However, they actually went the other way and got even worse, finishing the season with a z-score of -1.12!

We can plot the residuals and see how far off our projections were for each team’s 2016 win total.

par(mfrow = c(2,2))
hist(resid(fit), main = "Linear reg with raw values")
hist(resid(fit_grand_mean), main = "Linear reg with\ngrand mean centered values")
hist(resid(fit_z), main = "Linear reg with z-scored values")

The residuals look weird here because we are only dealing with two seasons of data. At the end of this blog article I will run the residual plot for all seasons from 2000 – 2019 to show that, with more data, the residuals more closely resemble a normal distribution.

We can see that there seems to be an extreme outlier that has a -20 win residual. Let’s pull that team out and see who it was.

dat %>%
  filter((yr_2 - pred_fit) < -19)

It was Minnesota, who went from an average season in 2015 (83 wins) all the way down to 59 wins in 2016 (2.05 standard deviations below the mean), something we couldn’t have really predicted.

The average absolute change from year1 to year2 for these teams was 8 wins.

mean(abs(dat$yr_2 - dat$yr_1))

Calculating Regression to the mean by hand

To explore the concept more, we can calculate regression toward the mean for the 2016 season by using the year-to-year correlation of team wins, the league average wins, and each team’s z-score for wins in 2015. The equation looks like this:

predicted_yr2_wins = avg_wins + sd_wins * predicted_yr2_z

Where predicted_yr2_z is calculated as:

predicted_yr2_z = z_yr1 * r

(here, z_yr1 is the team’s 2015 z-score and r is the year-to-year correlation)

We calculated the correlation coefficient above, but let’s now store it as its own variable. Additionally, we will store the league average team wins and the standard deviation of wins in 2015.

avg_wins <- mean(dat$yr_1)
sd_wins <- sd(dat$yr_1)
r <- cor.test(dat$yr_1, dat$yr_2)$estimate

avg_wins
sd_wins
r

On average, teams won 81 games (which makes sense for a 162-game season) with a standard deviation of about 10 games.
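One way to read the by-hand formula: with r = 0.54, each team’s deviation from the league average is shrunk roughly in half. A quick check with the rounded values above:

## a team 10 wins above average in year 1 is predicted to be
## only about 5.4 wins above average in year 2
10 * 0.54        # predicted deviation from the mean
81 + 10 * 0.54   # predicted win total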

Let’s look at Pittsburgh (PIT).

dat %>%
  filter(teamID == "PIT") %>%
  select(yearID:z_yr1) %>%
  mutate(predicted_yr2_z = z_yr1 * r,
         predicted_yr2_wins = avg_wins + sd_wins * predicted_yr2_z)

  • In 2015, Pittsburgh had 98 wins, a z-score of 1.63.
  • We predict them to regress toward the mean in 2016, with a z-score of 0.881 (about 90 wins).

Now we add these by-hand regression to the mean predictions for all teams.

dat <- dat %>%
  mutate(predicted_yr2_z = z_yr1 * r,
         predicted_yr2_wins = avg_wins + sd_wins * predicted_yr2_z)

dat %>%
  head()


We see that our by-hand calculation produces the same predictions, as it should.

Show the residuals for all seasons from 2000 – 2019

Here, we will filter the database for the desired seasons, refit the model, and plot the residuals.


dat2 <- Teams %>%
  select(yearID, teamID, W) %>%
  arrange(teamID, yearID) %>%
  filter(between(x = yearID,
                 left = 2000,
                 right = 2019)) %>%
  group_by(teamID) %>%
  mutate(yr_2 = lead(W)) %>%
  rename(yr_1 = W) %>%
  filter(!is.na(yr_2)) %>%
  ungroup() 

## Correlation between year 1 and year 2
cor.test(dat2$yr_1, dat2$yr_2)


## linear model
fit2 <- lm(yr_2 ~ yr_1, data = dat2)
summary(fit2)

## residual plot
hist(resid(fit2),
     main = "Residuals\n(Seasons 2000 - 2019)")

Now that we have more than two seasons of data, we see a normal distribution of the residuals. The correlation between year1 and year2 in this larger data set was 0.54, the same correlation we saw with the two seasons of data.

With this larger data set, the average absolute change in wins from year1 to year2 was 9 (not far from what we saw in the smaller data set above).

mean(abs(dat2$yr_2 - dat2$yr_1))

Wrapping Up

Regression to the mean is a common phenomenon in life. It can be difficult for practitioners in sports medicine and strength and conditioning to tease out the effects of regression to the mean when applying a specific training/rehab intervention. Often, regression to the mean fools us into believing that the intervention we applied has some sort of causal relationship with the observed outcome. This phenomenon is also prevalent in sport when evaluating the performance of individual players and teams from one year to the next. With some simple calculations we can explore what regression to the mean could look like for data in our own setting, providing a compass and some base rates for us to evaluate observations going forward.

All of the code for this blog can be accessed on my GitHub page. If you notice any errors, please reach out!

References

1) Barnett AG, van der Pols JC, Dobson AJ. (2005). Regression to the mean: what it is and how to deal with it. International Journal of Epidemiology; 34: 215-220.

2) Schall T, Smith G. (2000). Do baseball players regress toward the mean? The American Statistician; 54(4): 231-235.

TidyX Episode 112: s3 objects part 3 – building a sports tournament simulation

After a few weeks’ break, Ellis Hughes and I are back with our third and final installment on s3 objects (Part 1, Part 2). In this episode we complete our sports tournament simulation function and show how it can be used to simulate each successive round, moving the winners through each match until an eventual winner is crowned.

To watch our screen cast, CLICK HERE.

To access our code, CLICK HERE.

tidymodels – Extract model coefficients for all cross validated folds

As I’ve discussed previously, we sometimes don’t have enough data for a train/test split to make sense. In those cases, we are better off building our model using cross-validation. In previous blog articles, I’ve talked about how to build models using cross-validation within the {tidymodels} framework (see HERE and HERE). In those examples, we fit the model over the cross-validation folds and then constructed a final model that we could use to make predictions later on.

Recently, I ran into a situation where I wanted to see what the model coefficients look like across all of the cross-validation folds. So, I decided to make a quick blog post on how to do this, in case it is useful to others.

Load Packages & Data

We will use the mtcars data set from R and build a regression model, using several independent variables, to predict miles per gallon (mpg).

### Packages -------------------------------------------------------

library(tidyverse)
library(tidymodels)

### Data -------------------------------------------------------

dat <- mtcars

dat %>%
  head()


Create Cross-Validation Folds of the Data

I’ll use 10-fold cross validation.

### Modelling -------------------------------------------------------
## Create 10 Cross Validation Folds

set.seed(1)
cv_folds <- vfold_cv(dat, v = 10)
cv_folds

Specify a linear model and set up the model formula

## Specify the linear regression engine
## model specs
lm_spec <- linear_reg() %>%
  set_engine("lm") 


## Model formula
mpg_formula <- mpg ~ cyl + disp + wt + drat

Set up the model workflow and fit the model to the cross-validated folds

## Set up workflow
lm_wf <- workflow() %>%
  add_formula(mpg_formula) %>%
  add_model(lm_spec) 

## Fit the model to the cross validation folds
lm_fit <- lm_wf %>%
  fit_resamples(
    resamples = cv_folds,
    control = control_resamples(extract = extract_model, save_pred = TRUE)
  )

Extract the model coefficients for each of the 10 folds (this is the fun part!)

Looking at the lm_fit output above, we see that it is a tibble consisting of various nested lists. The id column indicates which cross-validation fold the lists in each row pertain to. The model coefficients for each fold are stored in the .extracts column of lists. Instead of printing out all 10, let’s just have a look at the first 3 folds to see what they look like.

lm_fit$.extracts %>% 
  .[1:3]

There, in the .extracts column, we see <lm>, indicating the linear model for each fold. With a series of unnesting we can snag the model coefficients and then put them into a tidy format using the {broom} package. I’ve commented each line of code below so that you know exactly what is happening.

# Let's unnest this and get the coefficients out
model_coefs <- lm_fit %>% 
  select(id, .extracts) %>%                    # get the id and .extracts columns
  unnest(cols = .extracts) %>%                 # unnest .extracts, which produces the model in a list
  mutate(coefs = map(.extracts, tidy)) %>%     # use map() to apply the tidy function and get the coefficients in their own column
  unnest(coefs)                                # unnest the coefs column you just made to get the coefficients for each fold

model_coefs

Now that we have a table of estimates, we can plot the coefficient estimates and their 95% confidence intervals. The term column indicates each variable. We will remove the (Intercept) for plotting purposes.

Plot the Coefficients

## Plot the model coefficients and 2*SE across all folds
model_coefs %>%
  filter(term != "(Intercept)") %>%
  select(id, term, estimate, std.error) %>%
  group_by(term) %>%
  mutate(avg_estimate = mean(estimate)) %>%
  ggplot(aes(x = id, y = estimate)) +
  geom_hline(aes(yintercept = avg_estimate),
             size = 1.2,
             linetype = "dashed") +
  geom_point(size = 4) +
  geom_errorbar(aes(ymin = estimate - 2*std.error, ymax = estimate + 2*std.error),
                width = 0.1,
                size = 1.2) +
  facet_wrap(~term, scales = "free_y") +
  labs(x = "CV Folds",
       y = "Estimate ± 95% CI",
       title = "Regression Coefficients ± 95% CI for 10-fold CV",
       subtitle = "Dashed Line = Average Coefficient Estimate over 10 CV Folds per Independent Variable") +
  theme_classic() +
  theme(strip.background = element_rect(fill = "black"),
        strip.text = element_text(face = "bold", size = 12, color = "white"),
        axis.title = element_text(size = 14, face = "bold"),
        axis.text.x = element_text(angle = 60, hjust = 1, face = "bold", size = 12),
        axis.text.y = element_text(face = "bold", size = 12),
        plot.title = element_text(size = 18),
        plot.subtitle = element_text(size = 16))

Now we can clearly see the model coefficients and confidence intervals for each of the 10 cross validated folds.

Wrapping Up

This was just a quick and easy way of fitting a model using cross-validation and extracting the model coefficients for each fold. Often, this is probably not necessary, as you will fit your model, evaluate your model, and be off and running. However, there may be times where more specific interrogation of the model is required, or you might want to dig a little deeper into the various outputs of the cross-validated folds.
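For example, since we fit with fit_resamples(), the same lm_fit object can also summarize performance across the folds with collect_metrics(), and, because we set save_pred = TRUE, collect_predictions() will return the held-out predictions:

## average performance metrics across the 10 folds
lm_fit %>%
  collect_metrics()

## held-out predictions from each fold (available because save_pred = TRUE)
lm_fit %>%
  collect_predictions()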

All of the code is available on my GitHub page.

If you notice any errors in code, please reach out!


TidyX Episode 111: Interview with Data Scientist Nate Latshaw

This week, Ellis Hughes and I have the pleasure of interviewing Nate Latshaw. Nate is an extraordinary data scientist who does a ton of amazing public analysis of UFC fights and shares his work freely on Twitter.

We start by having Nate take us through one of the UFC fighter plots he commonly presents. What’s exciting about this is that Nate uses some R packages that we haven’t touched on in the past 110 episodes. For instance, Nate tends to work in {data.table} instead of the {tidyverse} for his data manipulation needs. Additionally, to organize the figures and tables in his fighter report, he builds his tables with ggtexttable() and arranges everything into the final report with {ggpubr}. After Nate walks us through some code, we have an interview about his background, how he got into data science, and his recommendations for how others can break into data science.

To watch the screen cast, CLICK HERE.

To access Nate’s code, CLICK HERE.

Optimization Algorithms in R – returning model fit metrics

Introduction

A colleague had asked me if I knew of a way to obtain model fit metrics, such as AIC or r-squared, from the optim() function. First, some background: optim() provides a general-purpose method of optimization, searching for the weights that minimize or maximize whatever success metric you are scoring your model with (e.g., sum of squared error, maximum likelihood, etc.). It iterates until the model coefficients are optimal for the data.

To make optim() work for us, we need to code the aspects of the model we are interested in optimizing (e.g., the regression coefficients) as well as code a function that calculates the output we are comparing the results to (e.g., sum of squared error).

Before we get to model fit metrics, let’s walk through how optim() works by comparing our results to a simple linear regression. I’ll admit, optim() can be a little hard to wrap your head around (at least for me, it was), but building up a simple example can help us understand the power of this function and how we can use it later on down the road in more complicated analysis.

Data

We will use data from the Lahman baseball database. I’ll stick with all years from 2006 on and retain only players with a minimum of 250 at-bats per season.

library(Lahman)
library(tidyverse)

data(Batting)

df <- Batting %>% 
  filter(yearID >= 2006,
         AB >= 250)

head(df)


Linear Regression

First, let’s just write a linear regression to predict HR from Hits, so that we have something to compare our optim() function against.

fit_lm <- lm(HR ~ H, data = df)
summary(fit_lm)

plot(df$H, df$HR, pch = 19, col = "light grey")
abline(fit_lm, col = "red", lwd = 2)

The above model is a simple linear regression model that fits a line of best fit based on the squared error of predictions to the actual values.

(NOTE: We can see from the plot that this relationship is not really linear, but it will be okay for our purposes of discussion here.)

We can also use an optimizer to solve this.

Optimizer

To write an optimizer, we need two functions:

  1.  A function that allows us to model the relationship between HR and H and identify the optimal coefficients for the intercept and the slope, minimizing the residual between the actual and predicted values.
  2.  A function that helps us keep track of the error between actual and predicted values as optim() runs through various iterations, attempting a number of possible coefficients for the slope and intercept.

Function 1: A linear function to predict HR from hits

hr_from_h <- function(H, a, b){
  return(a + b*H)
}

This simple function is just a linear equation: it takes the values of our independent variable (H), a value for the intercept (a), and a value for the slope (b). Although we can plug in numbers and use the function right now, the values of a and b have not been optimized yet. Thus, the model will return a weird prediction for HR. Still, we can plug in some values to see how it works. For example, what if a player has 30 hits?

hr_from_h(H = 30, a = 1, b = 1)

Not a really good prediction of home runs!

Function 2: Sum of Squared Error Function

Optimizers will try and identify the key weights in the function by either maximizing or minimizing some value. Since we are using a linear model, it makes sense to try and minimize the sum of the squared error.

The function will take 3 inputs:

  1.  A data set of our dependent and independent variables
  2.  A value for our intercept (a)
  3.  A value for the slope of our model (b)

sum_sq_error <- function(df, a, b){
  
  # make predictions for HR
  predictions <- with(df, hr_from_h(H, a, b))
  
  # get model errors
  errors <- with(df, HR - predictions)
  
  # return the sum of squared errors
  return(sum(errors^2))
  
}


Let’s make a fake data set of a few values of H and HR and then assign some weights for a and b to see if the sum_sq_error() produces a single sum of squared error value.

fake_df <- data.frame(H = c(35, 40, 55), 
                      HR = c(4, 10, 12))

sum_sq_error(fake_df,
             a = 3, 
             b = 1)

It worked! Now let’s find the ideal weights for a and b that minimize that sum of squared error. One way to do this is to create a large grid of values, loop over every combination, and keep the pair that minimizes the sum of squared error. The issue with this approach is that with more independent variables it takes a really long time. A more efficient way is to write an optimizer that can take care of this for us.
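For intuition, here is a rough sketch of that grid-search approach (the grid bounds here are arbitrary choices of mine):

## brute-force grid search over candidate intercepts (a) and slopes (b)
grid <- expand.grid(a = seq(-10, 10, by = 0.5),
                    b = seq(0, 0.5, by = 0.01))

grid$sse <- mapply(function(a, b) sum_sq_error(df, a, b), grid$a, grid$b)

## the pair of values with the smallest sum of squared error
grid[which.min(grid$sse), ]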

We will use the optim() function from base R.

The optim() function takes 2 inputs:

  1.  A numeric vector of starting points for the parameters you are trying to optimize. These can be any values.
  2.  A function that receives the vector of starting points and returns the value we want to minimize. In our case it wraps sum_sq_error(): optim() passes candidate values for a and b and searches for the pair that minimizes the sum of squared error.

optimizer_results <- optim(par = c(0, 0),
                           fn = function(x){
                             sum_sq_error(df, x[1], x[2])
                             }
                           )

optimizer_results

Let’s have a look at the results.

In the output:

  • value tells us the sum of the squared error
  • par tells us the weighting for a (intercept) and b (slope)
  • counts tells us how many times the optimizer ran the function. The gradient is NA here because we didn’t specify a gradient argument in our optimizer
  • convergence tells us if the optimizer found the optimal values (a value of 0 means everything worked out)
  • message is any message that R needs to inform us about when running the optimizer

Let’s focus on par and value since those are the two values we really want to know about.

First, notice how the values for a and b are nearly the exact same values we got from our linear regression.

a_optim_weight <- optimizer_results$par[1]
b_optim_weight <- optimizer_results$par[2]

a_reg_coef <- fit_lm$coef[1]
b_reg_coef <- fit_lm$coef[2]

a_optim_weight
a_reg_coef

b_optim_weight
b_reg_coef

Next, we can see that the sum of squared error from the optimizer is the same as the sum of squared error from the linear regression.

sse_optim <- optimizer_results$value
sse_reg <- sum(fit_lm$residuals^2)

sse_optim
sse_reg

We can finish by plotting the two regression lines over the data and show that they produce the same fit.

plot(df$H, 
     df$HR,
     col = "light grey",
     pch = 19,
     xlab = "Hits",
     ylab = "Home Runs")
title(main = "Predicting Home Runs from Hits",
     sub = "Red Line = Regression | Black Dashed Line = Optimizer")
abline(fit_lm, col = "red", lwd = 5)
abline(a = a_optim_weight,
       b = b_optim_weight,
       lty = 2,
       lwd = 3,
       col = "black")

Model Fit Metrics

The value parameter from our optimizer output returned the sum of squared error. What if we wanted to return a model fit metric, such as AIC or r-squared, so that we can compare several models later on?

Instead of using the sum of squared errors function, we can attempt to minimize AIC by writing our own AIC function.

aic_func <- function(df, a, b){
  
  # make predictions for HR
  predictions <- with(df, hr_from_h(H, a, b))
  
  # get model errors
  errors <- with(df, HR - predictions)
  
  # calculate AIC for a Gaussian model:
  # n*(log(2*pi) + 1 + log(SSE/n)) + 2*(k + 1),
  # where k = 2 coefficients (a, b) and the extra 1 counts the residual variance
  aic <- nrow(df)*(log(2*pi)+1+log((sum(errors^2)/nrow(df)))) + ((length(c(a, b))+1)*2)
  return(aic)
  
}

We can try out the new function on the fake data set we created above.

aic_func(fake_df,
         a = 3, 
         b = 1)

Now, let’s run the optimizer with the AIC function instead of the sum of square error function.

optimizer_results <- optim(par = c(0, 0),
                           fn = function(x){
                             aic_func(df, x[1], x[2])
                             }
                           )

optimizer_results

We get the same coefficients, with the difference being that the value parameter now returns AIC. We can check how this AIC compares to the one from our original linear model.

AIC(fit_lm)

If we instead wanted to obtain an r-squared value, we can write an r-squared function. In the optimizer, since a higher r-squared is better, we need to indicate that we want to maximize this value (optim defaults to minimization). To do this we set the fnscale argument to -1. The only issue I have with this function is that it doesn’t return the coefficients properly in the par section of the results. This is likely because r-squared is unchanged by shifting the predictions or scaling them by a positive constant, so no unique pair of a and b maximizes it. I am able to produce the exact r-squared from the linear model, however.

## r-squared function
r_sq_func <- function(df, a, b){
  
  # make predictions for HR
  predictions <- with(df, hr_from_h(H, a, b))
  
  # r-squared between predicted and actual HR values
  r2 <- cor(predictions, df$HR)^2
  return(r2)
  
}

## run optimizer
optimizer_results <- optim(par = c(0.1, 0.1),
                           fn = function(x){
                             r_sq_func(df, x[1], x[2])
                             },
                           control = list(fnscale = -1)
                           )

## get results
optimizer_results

## Compare results to what was obtained from our linear regression
summary(fit_lm)$r.squared
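As a quick check on that explanation: correlation ignores shifts and positive rescaling of the predictions, so very different (a, b) pairs return the identical r-squared.

## many (a, b) pairs tie for the maximum r-squared
r_sq_func(df, a = 0, b = 1)
r_sq_func(df, a = 50, b = 10)   # same value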

Wrapping up

That was a short tutorial on writing an optimizer in R. There is a lot going on with these types of functions, and they can get pretty complicated very quickly. I find that starting with a simple example and building up from there is always useful. We additionally looked at having the optimizer return various model fit metrics.

If you notice any errors in the code, please reach out!

Access to the full code is available on my GitHub page.