Category Archives: Sports Analytics

The Role of Skill vs Luck in Team Sport Winning

This week on the Wharton Moneyball Podcast, the hosts were discussing World Cup results following the group play stage.

At one point, they talked about the variance in performance and the role that luck can play in winning or losing. They felt like they didn’t have as good an intuition about how variable the games were and one of the hosts, Eric Bradlow, said that he’d try and look into it and have an answer for next week’s episode.

The discussion reminded me of a chapter in one of my favorite books, The Success Equation by Michael Mauboussin. The book goes into nice detail about the role of skill and luck in sports, investing, and life. On page 78, Mauboussin provides the following equation:

Skill = Observed Outcome – Luck

From there, he explains the steps for determining the contribution of luck to winning in a variety of team sports. Basically, the amount of luck plays is represented as the ratio of the variance of luck to the variance of observed outcomes.

To calculate the variance of observed outcomes, we find the win% of each team in a given season and calculate the standard deviation of that win%. Squaring this value gives us the variance of observed outcomes.

When calculating the variance of luck we first find the average win% of all of the teams and, treating this as a binomial, we calculate the standard deviation as sqrt((win% * (1 – win%)) / n_games), where n_games is the number of games in a season.

I scraped data for the previous 3 complete seasons for the Premier League, NHL, NBA, NFL, and MLB from sports-reference.com. All of the data and code for calculating the contribution of luck to winning for these sports is available on my GitHub page.

Let’s walk through an example

Since the guys on Wharton Moneyball Podcast were talking about Soccer, I’ll use the Premier League as an example.

First, we create a function that does all the calculations for us. All we need to do is obtain the relevant summary statistics to feed into the function and then let it do all the work.

league_perf <- function(avg_win_pct, obs_sd, luck_sd, league){
  
  ## convert standard deviations to variance
  var_obs <- obs_sd^2
  var_luck <- luck_sd^2
  
  ## calculate the variation due to skill
  var_skill <- var_obs + var_luck
  
  ## calculate the contribution of luck to winning
  luck_contribution <- var_luck / var_obs
  
  
  ## Results table
  output <- tibble(
    league = {{league}},
    avg_win_pct = avg_win_pct,
    luck_contribution = luck_contribution,
  )
  
  return(output)
  
}

 

Using the Premier League data set we calculate the games per season, league average win%, the observed standard deviation, and the luck standard deviation.

NOTE: I’m not much of a soccer guy so I wasn’t sure how to handle draws. I read somewhere that they equate to about 1/3 of a win, so that is what I use here to credit a team for a draw.

## get info for function
# NOTE: Treating Draws as 1/3 of a win, since that reflects the points the team is awarded in the standings
premier_lg %>%
  select(Squad, MP, W, D) %>%
  mutate(win_pct = (W + 1/3*D) / MP) %>%
  summarize(games = max(MP),
            avg_win_pct = mean(win_pct),
            obs_sd = sd(win_pct),
            luck_sd = sqrt((avg_win_pct * (1 - avg_win_pct)) / games))

## Run the function
premier_lg_output <- league_perf(avg_win_pct = 0.462,
                                 obs_sd = 0.155,
                                 luck_sd = 0.0809,
                                 league = "Premier League")

premier_lg_output

Looking at our summary table, it appears that teams have had an average win% over the past 3 seasons of 46.2% (not quite 50%, as we will see in NBA, MLB, or NFL, since draws happen so frequently). The standard deviation of team win% was 15.5% (this is the observed standard deviation), while the luck standard deviation, 8.1%, is the binomial standard deviation using the average win percentage and 38 games in a season.

Feeding these values into the function created above we find that luck contributes approximately 27.2% to winning in the Premier League. In Mauboussin’s book he found that the contribution of luck to winning was 31% in the Premier League from 2007 – 2011. I don’t feel like the below value is too far off of that, though I don’t have a good sense for what sort of magnitude of change would be surprising. That said, the change could be due to improved skill in the Premier League (already one of the most skillful leagues in the world) or perhaps how he handled draws, which was not discussed in the book and may differ from the approach I took here.

Let’s look at the table of results for all sports

Again, you can get the code for calculating the contribution of luck to winning in these sports from my GitHub page, so no need to rehash it all here. Instead, let’s go directly to the results.

  • NBA appears to be the most skill driven league with only about 15% of the contribution to winning coming from luck. Mauboussin found this value to be 12% from 2007 – 2011.
  • The NFL appears to be drive most by luck, contributing 39% to winning. This is identical to what Mauboussin observed using NFL data from 2007 – 2011.
  • The two most surprising outcomes here are the MLB and NHL. Mauboussin found the MLB to have a 34% contribution of luck (2007 – 2011) and the NHL a 53% contribution of luck (2008 – 2012). However, in my table below it would appear that the contribution of luck has decreased substantially in these two sports, using data from previous 3 seasons (NOTE: throwing out the COVID year doesn’t alter this drastically enough to make it close to what Mauhoussin showed).

Digging Deeper into MLB and NHL

I pulled data from the exact same seasons that Mauboussin used in his book and obtained the following results.

  • The results I observed for the MLB are identical to what Mauboussin had in his book (pg. 80)
  • The results for the NFL are slightly off (47.7% compared to 53%) but this might have to do with how I am handling overtime wins and losses (which awards teams points in the standings), as I don’t know enough about hockey to determine what value to assign to them. Perhaps Mauboussin addressed these two outcomes in his calculations in a more specific way.

Additional hypotheses:

  • Maybe skill has improved in hockey over the past decade and a half?
  • Maybe a change in tactics (e.g., the shift) or strategy (e.g., hitters trying to hit more home runs instead of just making contact or pitchers trying to explicitly train to increase max velocity) has taken some of the luck events out of baseball and turned it into more of a zero-sum duel between the batter and pitcher, making game outcomes more dependent on the skill of both players?
  • Maybe I have an error somewhere…let me know if you spot one!

Wrapping Up

Although we marvel at the skill of the athletes we watch in their arena of competition, it’s important to recognize that luck plays a role in the outcomes, and for some sports it plays more of a role than others. But, luck is also what keeps us watching and makes things fun! Sometimes the ball bounces your way and you can steal a win! The one thing I always appreciate form the discussions on the Wharton Moneyball Podcast is that the hosts don’t shy away from explaining things like luck, randomness, variance, regression to the mean, and weighting observed outcomes with prior knowledge. This way of thinking isn’t just about sport, it’s about life. In this sense, we may consider sport to be the vehicle through which these hosts are teaching us about our world.

Approximating a Bayesian Posterior Prediction

This past week on the Wharton Moneyball Podcast, during Quarter 2 of the show the hosts got into a discussion about Bayesian methods, simulating uncertainty, and frequentist approaches to statistical analysis. The show hosts are all strong Bayesian advocates but at one point in the discussion, Eric Bradlow mentioned that “frequenstist can answer similar questions by building distributions from their model predictions.” (paraphrasing)

This comment reminded me of Chapter 7 in Gelman and Hill’s brilliant book, Data Analysis Using Regression and Multilevel/Hierarchical Models. In this chapter, the authors do what they call an informal Bayesian approach by simulating the predictions from a linear regression model. It’s an interesting (and easy) approach that can be helpful for getting a sense for how Bayesian posterior predictions work without building a full Bayesian model. I call it, “building a bridge to Bayes”.

Since the entire book is coded in R, I decided to code an example in Python.

The Jupyter notebook is accessible on my GITHUB page.

(Special Material: In the Jupyter Notebook I also include additional material on how to calculate prediction intervals and confidence intervals by hand in python. I wont go over those entire sections here as I don’t want the post to get too long.)

Libraries & Data

We will start by loading up the libraries that we need and the data. I’ll use the iris data set, since it is conveniently available from the sklearn library. We will build the regression model using the statsmodels.api library.

## import libraries

from sklearn import datasets
import pandas as pd
import numpy as np
import statsmodels.api as smf
from scipy import stats
import matplotlib.pyplot as plt

## iris data set
iris = datasets.load_iris()

## convert to pandas sata frame
data = iris['data']
target = iris['target']

iris_df = pd.DataFrame(data)
iris_df.columns = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
iris_df.head(3)

Build a linear regression model

Next, we build a simple ordinary least squares regression model to predict petal_width from petal_length.

 

## Set up our X and y variables
X = iris_df['petal_length']
y = iris_df['petal_width']

# NOTE: statsmodels does not automatically add an intercept, so you need to do that manually

X = smf.add_constant(X)

# Build regression model
# NOTE: the X and y variables are reversed in the function compared to sklearn

fit_lm = smf.OLS(y, X).fit()

# Get an R-like output of the model

fit_lm.summary()

Simulate a distribution around the slope coefficient

The slope coefficient for the petal_length variable, from the model output above, is 0.4158 with a standard error of 0.01. We will store these two values in their own variables and then use them to create a simulation of 10,000 samples and plot the distribution.

## get summary stats
mu_slope = 0.4158
se_slope = 0.01

## create simulation
n_samples = 10000
coef_sim = np.random.normal(loc = mu_slope, scale = se_slope, size = n_samples)

## plot simulated distribution

plt.hist(coef_sim, bins = 60)
plt.show()

We can also grab the summary statistics from the simulated distribution. We will snag the mean and the 90% quantile interval.

## get summary stats from our simulation
summary_stats = {
    'Mean': coef_sim.mean(),
    'Low90': np.quantile(coef_sim, 0.05),
    'High90': np.quantile(coef_sim, 0.95)
}

summary_stats

Making a prediction on a new observation and building a posterior predictive distribution

Now that we’ve gotten a good sense for how to create a simulation in python, we can create a new observation of petal_length and make a prediction about what the petal_width would be based on our model. In addition, we will get the prediction intervals from the output and use them to calculate a standard error for the prediction, which we will use for the posterior predictive simulation.

Technical Note: In the prediction output you will see that python returns a mean_ci_lower and mean_ci_upper and an obs_ci_lower and obs_ci_higher. The latter two are the prediction intervals  while the former two are the confidence intervals. I previously discussed the difference HERE and this is confirmed in the Jupyter Notebook where I calculate these values by hand.

# create a new petal_length observation
new_dat = np.array([[1, 1.7]])    # add a 1 upfront to indicate the model intercept

# make prediction of petal_width using the model
prediction = fit_lm.get_prediction(new_dat)

# get summary of the prediction
prediction.summary_frame()

Store the predicted value (0.343709) and then calculate the standard error from the lower and upper prediction intervals. Run a simulation and then plot the distribution of predictions.

mu_pred = 0.343709
se_pred = 0.754956 - 0.343709     # subtract the upper prediction interval from the mean to get the variability
n_sims = 10000

pred_obs = np.random.normal(loc = mu_pred, scale = se_pred, size = n_sims)

plt.hist(pred_obs, bins = 60)
plt.show()

Just as we did for the simulation of the slope coefficient we can extract our summary statistics (mean and 90% quantile intervals).

## get summary stats from our simulation
summary_stats = {
    'Mean': pred_obs.mean(),
    'Low90': np.quantile(pred_obs, 0.05),
    'High90': np.quantile(pred_obs, 0.95)
}

summary_stats

Wrapping Up

That is a pretty easy way to get a sense for approximating a Bayesian posterior predictive distribution. Rather than simply reporting the predicted value and a confidence interval or prediction interval, it is sometimes nice to build an entire distribution. Aside from it being visually appealing, it allows us to answer other questions we might have, for example, what percentage of the data is greater or less than 0.5 (or any other threshold value you might be interested in)?

As stated earlier, all of this code is accessible on my GITHUB page and the Jupyter notebook also has additional sections on how to calculate confidence intervals and prediction intervals by hand.

If you notice any errors, let me know!

Shiny – User Defined Chart Parameters

A colleague was working on a web app for his basketball team and asked me if there was a way to create a {shiny} web app that allowed the user to define which parameters they would like to see on the plot. I figured this would be something others might be interested in as well, so here we go!

Load Packages, Helper Functions & Data

I’ll use data from the Lahman baseball database (seasons 2017 – 2019). I’m also going to create two helper functions, one for calculating the z-scores for our stats of interest and one for calculating the t-value from the z-score. The t-value will put the z-score on a 0 to 100 scale for plotting purposes in our polar plot. Additionally, we will use these standardized scores to conditionally format colors on our {gt} table (but we will hide the standardized columns so that the user only sees the raw data and colors). Finally, I’m going to create both a wide and long format of the data as it will be easier to use one or the other, depending on the type of plot or table I am building.

#### Load packages ------------------------------------------------
library(tidyverse)
library(shiny)
library(Lahman)
library(gt)
library(plotly)

theme_set(theme_minimal() + 
            theme(
              axis.text = element_text(face = "bold", size = 12),
              legend.title = element_blank(),
              legend.position = "none"
            ) )

#### helper functions -------------------------------------------

z_score <- function(x){
  z = (x - mean(x, na.rm = T)) / sd(x, na.rm = T)
  return(z)
}

t_score <- function(x){ t = (x * 10) + 50 t = ifelse(t > 100, 100, 
             ifelse(t < 0, 0, t))
  return(t)
}


#### Get Data ---------------------------------------------------

dat <- Batting %>%
  filter(between(yearID, left = 2017, right = 2019),
         AB >= 200) %>% 
  group_by(yearID, playerID) %>%
  summarize(across(.cols = G:GIDP,
         ~sum(.x)),
         .groups = "drop") %>%
  mutate(ba = H / AB,
         obp = (H + BB + HBP) / (AB + HBP + SF),
         slg = ((H - X2B - X3B - HR) + X2B*2 + X3B*3 + HR*4) / AB,
         ops = obp + slg,
         hr_rate = H / AB) %>%
  select(playerID, yearID, AB, ba:hr_rate) %>%
  mutate(across(.cols = ba:hr_rate,
                list(z = z_score)),
         across(.cols = ba_z:hr_rate_z,
                list(t = t_score))) %>%
  left_join(People %>%
              mutate(name = paste(nameLast, nameFirst, sep = ", ")) %>%
              select(playerID, name)) %>%
  relocate(name, .before = yearID)


dat_long <- Batting %>%
  filter(between(yearID, left = 2017, right = 2019),
         AB >= 200) %>% 
  group_by(playerID) %>%
  summarize(across(.cols = G:GIDP,
                   ~sum(.x)),
            .groups = "drop") %>%
  mutate(ba = H / AB,
         obp = (H + BB + HBP) / (AB + HBP + SF),
         slg = ((H - X2B - X3B - HR) + X2B*2 + X3B*3 + HR*4) / AB,
         ops = obp + slg,
         hr_rate = H / AB) %>%
  select(playerID, AB, ba:hr_rate) %>%
  mutate(across(.cols = ba:hr_rate,
                list(z = z_score)),
         across(.cols = ba_z:hr_rate_z,
                list(t = t_score))) %>%
  left_join(People %>%
              mutate(name = paste(nameLast, nameFirst, sep = ", ")) %>%
              select(playerID, name)) %>%
  relocate(name, .before = AB) %>%
  select(playerID:AB, ends_with("z_t")) %>%
  pivot_longer(cols = -c(playerID, name, AB),
               names_to = "stat") %>%
  mutate(stat = case_when(stat == "ba_z_t" ~ "BA",
                          stat == "obp_z_t" ~ "OBP",
                          stat == "slg_z_t" ~ "SLG",
                          stat == "ops_z_t" ~ "OPS",
                          stat == "hr_rate_z_t" ~ "HR Rate"))


dat %>%
  head()

dat_long %>%
  head()

 

The Figures for our App

Before I build the {shiny} app, I wanted to first construct the three figures I will include. The code for these will be accessible in Github, but here is what they look like:

  • For the polar plot, I will allow the user to define which variables they want on the chart.
  • For the time series plot, I am going to create an interactive {plotly} chart that allows the user to select the stat they want to see and then hover over the player’s points and obtain information like the raw value and the number of at bats in the given season via a simple tool tip.
  • The table, as discussed above, will user conditional formatting to provide the user with extra context about how that player performed relative to his peers in a given season.

Because I don’t like to clutter up my {shiny} apps, I tend to build my plots and tables into custom functions. That way, all I need to do is set up a reactive() in the server to obtain the user selected data and then call the function on that data. Here are the functions for the three figures above.

## table function
tbl_func <- function(NAME){ dat %>%
  filter(name == NAME) %>%
  select(yearID, AB:hr_rate, ends_with("z_t")) %>%
  gt(rowname_col = "yearID") %>%
  fmt_number(columns = ba:hr_rate,
             decimals = 3) %>%
  cols_label(
    AB = md("**AB**"),
    ba = md("**Batting Avg**"),
    obp = md("**OBP**"),
    slg = md("**SLG**"),
    ops = md("**OPS**"),
    hr_rate = md("**Home Run Rate**")
  ) %>%
  tab_header(title = NAME) %>%
  opt_align_table_header(align = "left") %>%
  tab_options(column_labels.border.top.color = "transparent",
              column_labels.border.top.width = px(3),
              table.border.top.color = "transparent",
              table.border.bottom.color = "transparent") %>%
  cols_align(align = "center") %>%
  cols_hide(columns = ends_with("z_t")) %>%
    tab_style(
      style = cell_fill(color = "palegreen"),
      location = cells_body(
        columns = ba,
        rows = ba_z_t > 60
      )
    )  %>%
    tab_style(
      style = cell_fill(color = "red"),
      location = cells_body(
        columns = ba,
        rows = ba_z_t < 40 ) ) %>%
    tab_style(
      style = cell_fill(color = "palegreen"),
      location = cells_body(
        columns = obp,
        rows = obp_z_t > 60
      )
    )  %>%
    tab_style(
      style = cell_fill(color = "red"),
      location = cells_body(
        columns = obp,
        rows = obp_z_t < 40 ) ) %>%
    tab_style(
      style = cell_fill(color = "palegreen"),
      location = cells_body(
        columns = slg,
        rows = slg_z_t > 60
      )
    )  %>%
    tab_style(
      style = cell_fill(color = "red"),
      location = cells_body(
        columns = slg,
        rows = slg_z_t < 40 ) ) %>%
    tab_style(
      style = cell_fill(color = "palegreen"),
      location = cells_body(
        columns = ops,
        rows = ops_z_t > 60
      )
    )  %>%
    tab_style(
      style = cell_fill(color = "red"),
      location = cells_body(
        columns = ops,
        rows = ops_z_t < 40 ) ) %>%
    tab_style(
      style = cell_fill(color = "palegreen"),
      location = cells_body(
        columns = hr_rate,
        rows = hr_rate_z_t > 60
      )
    )  %>%
    tab_style(
      style = cell_fill(color = "red"),
      location = cells_body(
        columns = hr_rate,
        rows = hr_rate_z_t < 40
      )
    ) 
}


## Polar plot function
polar_plt <- function(NAME, STATS){ dat_long %>%
    filter(name == NAME,
           stat %in% STATS) %>%
    ggplot(aes(x = stat, y = value, fill = stat)) +
    geom_col(color = "white", width = 0.75) +
    coord_polar(theta = "x") +
    geom_hline(yintercept = seq(50, 50, by = 1), size = 1.2) +
    labs(x = "", y = "") +
    ylim(0, 100)
  
} 


## time series plot function
time_plt <- function(NAME, STAT){
  
  STAT <- case_when(STAT == "BA" ~ "ba",
                    STAT == "OBP" ~ "obp",
                    STAT == "SLG" ~ "slg",
                    STAT == "OPS" ~ "ops",
                    STAT == "HR Rate" ~ "hr_rate")
  
  stat_z <- paste0(STAT, "_z")
  
  p <- dat %>% 
    filter(name == NAME) %>%
    select(yearID, AB, STAT, stat_z) %>%
    setNames(., c("yearID", "AB", "STAT", "stat_z")) %>%
    ggplot(aes(x = as.factor(yearID), 
               y = stat_z,
               group = 1,
               label = NAME,
               label2 = AB,
               lable3 = STAT)) +
    geom_hline(yintercept = 0,
               size = 1.1,
               linetype = "dashed") +
    geom_line(size = 1.2) +
    geom_point(shape = 21,
               size = 6,
               color = "black",
               fill = "white") +
    ylim(-4, 4) 
  
  
  ggplotly(p)
  
}

 

Build the {shiny} app

The below code will construct the {shiny} app. We allow the user to select a player, select the stats of interest for the polar plot, and select the stat they’d like to track over time.

If you’d like to see a video of the app in use, CLICK HERE <shiny – user defined chart parameters>

If you want to run this yourself or build one similar to it you can access my code on GitHub.

 

#### Shiny App ---------------------------------------------------------------

## User Interface
ui <- fluidPage(
  
  titlePanel("MLB Hitters Shiny App\n2017-2019"),
  
  
  sidebarPanel(width = 3,
             selectInput("name",
                             label = "Choose a Player:",
                             choices = unique(dat$name),
                             selected = NULL,
                             multiple = FALSE),
              
              selectInput("stat",
                          label = "Choose stats for polar plot:",
                          choices = unique(dat_long$stat),
                          selected = NULL,
                          multiple = TRUE),
              
              selectInput("time_stat",
                          label = "Choose stat for time series:",
                          choices = unique(dat_long$stat),
                          selected = NULL,
                          multiple = FALSE)
  ),
  
  
  mainPanel(
    
    gt_output(outputId = "tbl"),
    
    fluidRow(
      
      column(6, plotOutput(outputId = "polar")),
      column(6, plotlyOutput(outputId = "time"))
    )
    
  )
)



server <- function(input, output){
  
  ## get player selected for table
  NAME <- reactive({ dat_long %>%
      filter(name == input$name) %>%
      distinct(name, .keep_all = FALSE) %>%
      pull(name)
    })
  
  ## get stats for polar plot
  polar_stats <- reactive({ dat_long %>%
      filter(stat %in% c(input$stat)) %>%
      pull(stat)
    
  })
  
  ## get stat for time series
  ts_stat <- reactive({ dat %>%
      select(ba:hr_rate) %>%
      setNames(., c("BA", "OBP", "SLG", "OPS", "HR Rate")) %>%
      select(input$time_stat) %>% 
      colnames()
    
  })
  
  ## table output
  output$tbl <- render_gt(
      tbl_func(NAME = NAME())
  )
  
  ## polar plot output
  output$polar <- renderPlot(
    
    polar_plt(NAME = NAME(),
              STAT = polar_stats())
    
  )
  
  ## time series plot output
  output$time <- renderPlotly(
    
    time_plt(NAME = NAME(),
             STAT = ts_stat())
    
  )

  
}





shinyApp(ui, server)

Regression to the Mean in Sports

Last week, scientist David Borg posted an article to twitter talking about regression to the mean in epidemiological research (1). Regression to the Mean is a statistical phenomenon where extreme observations tend to move closer towards the population mean in subsequent observations, due simply to natural variation. To steal Galton’s example, tall parents will often have tall children but those children, while taller than the average child, will tend to be shorter than their parents (regressed to the mean). It’s also one of the reasons why clinicians have such a difficult time understanding whether their intervention actually made the patient better or whether observed improvements are simply due to regression to the mean over the course of treatment (something that well designed studies attempt to rule out by using randomized controlled experiments).

Of course, this phenomenon is not unique to epidemiology or biostatistics. In fact, the phrase is commonly used in sport when discussing players or teams that have extremely high or low performances in a season and there is a general expectation that they will be more normal next year. An example of this could be the sophomore slump exhibited by rookies who perform at an extremely high level in their first season.

Given that this phenomenon is so common in our lives, the goal with this blog article is to show what regression to the mean looks like for team wins in baseball from one year to the next.

Data

We will use data from the Lahman baseball database (freely available in R) and concentrate on team wins in the 2015 and 2016 MLB seasons.

library(tidyverse)
library(Lahman)
library(ggalt)

theme_set(theme_bw())

data(Teams)

dat <- Teams %>%
  select(yearID, teamID, W) %>%
  arrange(teamID, yearID) %>%
  filter(yearID %in% c(2015, 2016)) %>%
  group_by(teamID) %>%
  mutate(yr_2 = lead(W)) %>%
  rename(yr_1 = W) %>%
  filter(!is.na(yr_2)) %>%
  ungroup() 

dat %>%
  head()

 

Exploratory Data Analysis

dat %>%
  ggplot(aes(x = yr_1, y = yr_2, label = teamID)) +
  geom_point() +
  ggrepel::geom_text_repel() +
  geom_abline(intercept = 0,
              slope = 1,
              color = "red",
              linetype = "dashed",
              size = 1.1) +
  labs(x = "2015 wins",
       y = "2016 wins")

The dashed line is the line of equality. A team that lies exactly on this line would be a team that had the exact number of wins in 2015 as they did in 2016. While no team lies exactly on this line, looking at the chart, what we can deduce is that teams below the red line had more wins in 2015 and less in 2016 while the opposite is true for those that lie above the line. Minnesota had a large decline in performance going from just over 80 wins in 2015 to below 60 wins in 2017.

The correlation between wins in 2015 to wins in 2016 is 0.54.

cor.test(dat$yr_1, dat$yr_2)

Plotting changes in wins from 2015 to 2016

We can plot each team and show their change in wins from 2015 (green) to 2016 (blue). We will break them into two groups, teams that saw a decrease in wins in 2016 relative to 2015 and teams that saw an increase in wins from 2015 to 2016. We will plot the z-score of team wins on the x-axis so that we can reflect average as being “0”.

## get the z-score of team wins for each season
dat <- dat %>%
  mutate(z_yr1 = (yr_1 - mean(yr_1)) / sd(yr_1),
         z_yr2 = (yr_2 - mean(yr_2)) / sd(yr_2))

dat %>%
  mutate(pred_z_dir = ifelse(yr_2 > yr_1, "increase wins", "decrease wins")) %>%
  ggplot(aes(x = z_yr1, xend = z_yr2, y = reorder(teamID, z_yr1))) +
  geom_vline(xintercept = 0,
             linetype = "dashed",
             size = 1.2,
             color = "grey") +
  geom_dumbbell(size = 1.2,
                colour_x = "green",
                colour_xend = "blue",
                size_x = 6,
                size_xend = 6) +
  facet_wrap(~pred_z_dir, scales = "free_y") +
  scale_color_manual(values = c("decrease wins" = "red", "increase wins" = "black")) +
  labs(x = "2015",
       y = "2016",
       title = "Green = 2015 Wins\nBlue = 2016 Wins",
       color = "Predicted Win Direction")

On the decreasing wins side, notice that all teams from SLN to LAA had more wins in 2015 (green) and then regressed towards the mean (or slightly below the mean) in 2016. From MIN down, those teams actually got even worse in 2016.

On the increasing wins side, from BOS down to PHI all of the teams regressed upward towards the mean (NOTE: regressing upward sounds weird, which is why some refer to this as reversion to the mean). From CHN to BAL, those teams were at or above average in 2015 and got better in 2016.

It makes sense that not all teams revert towards the mean in the second year given that teams attempt to upgrade their roster from one season to the next in order to maximize their chances of winning more games.

Regression to the with linear regression

There are three ways we can evaluate regression to the mean using linear regression and all three approaches will lead us to the same conclusion.

1. Linear regression with the raw data, predicting 2016 wins from 2015 wins.

2. Linear regression predicting 2016 wins from the grand mean centered 2015 wins (grand mean centered just means subtracting the league average wins, 81, from the observed wins for each team).

3. Linear regression using z-scores. This approach will produce results in standard deviation units rather than raw values.

## linear regression with raw values
fit <- lm(yr_2 ~ yr_1, data = dat)
summary(fit)

## linear regression with grand mean centered 2015 wins 
fit_grand_mean <- lm(yr_2 ~ I(yr_1 - mean(yr_1)), data = dat)
summary(fit_grand_mean)

## linear regression with z-score values
fit_z <- lm(z_yr2 ~ z_yr1, data = dat)
summary(fit_z)

Next, we take these equations and simply make predictions for 2016 win totals for each team.

dat$pred_fit <- fitted(fit)
dat$pred_fit_grand_mean <- fitted(fit_grand_mean)
dat$pred_fit_z <- fitted(fit_z) dat %>%
  head()

You’ll notice that the first two predictions are exactly the same. The third prediction is in standard deviation units. For example, Arizona (ARI) was -0.188 standard deviations below the mean in 2015 and in 2016 they were predicted to regress towards the mean and be -0.102 standard deviations below the mean. However, they actually went the other way and got even worse, finishing the season -1.12 standard deviations below the mean!

We can plot the residuals and see how off our projections where for each team’s 2016 win total.

par(mfrow = c(2,2))
hist(resid(fit), main = "Linear reg with raw values")
hist(resid(fit_grand_mean), main = "Linear reg with\ngrand mean centered values")
hist(resid(fit_z), main = "Linear reg with z-scored values")

The residuals look weird here because we are only dealing with two seasons of data. At the end of this blog article I will run the residual plot for all seasons from 2000 – 2019 to show how the residuals resemble more of a normal distribution.

We can see that there seems to be an extreme outlier that has a -20 win residual. Let’s pull that team out and see who it was.

dat %>%
  filter((yr_2 - pred_fit) < -19)

It was Minnesota, who went from an average season in 2015 (83 wins) and dropped all the way to 59 wins in 2016 (-2.05 standard deviations below the mean), something we couldn’t have really predicted.

The average absolute change from year1 to year2 for these teams was 8 wins.

mean(abs(dat$yr_2 - dat$yr_1))

Calculating Regression to the mean by hand

To explore the concept more, we can calculate regression toward the mean for the 2016 season by using the year-to-year correlation of team wins, the average wins for the league, and the z-score for each teams wins in 2015. The equation looks like this:

yr2.wins = avg.league.wins + sd_wins * predicted_yr2_z

Where predicted_yr2_z is calculated as:

predicted_yr2_z = (yr_1z * year.to.year.correlation)

We calculated the correlation coefficient above but let’s now store it as its own variable. Additionally we will store the league average team wins and standard deviation in 2015.

avg_wins <- mean(dat$yr_1)
sd_wins <- sd(dat$yr_1)
r <- cor.test(dat$yr_1, dat$yr_2)$estimate

avg_wins
sd_wins
r

On average teams won 81 games (which makes sense for a 162 season) with a standard deviation of about 10 games.

Let’s look at Pittsburgh (Pitt)

dat %>%
  filter(teamID == "PIT") %>%
  select(yearID:z_yr1) %>%
  mutate(predicted_yr2_z = z_yr1 * r,
         predicted_yr2_wins = avg_wins + sd_wins * predicted_yr2_z)

  • In 2015 Pittsburgh had 98 wins, a z-score of 1.63.
  • We predict them in 2016 to regress to the mean and have a z-score of 0.881 (90 wins)

Add these by hand regression to the mean predictions to all teams.

dat <- dat %>%
  mutate(predicted_yr2_z = z_yr1 * r,
         predicted_yr2_wins = avg_wins + sd_wins * predicted_yr2_z)

dat %>%
  head()


We see that our by hand calculation produces the same prediction, as it should.

Show the residuals for all seasons from 2000 – 2019

Here, we will filter out the data from the data base for the desired seasons, refit the model, and plot the residuals.

 

dat2 <- Teams %>%
  select(yearID, teamID, W) %>%
  arrange(teamID, yearID) %>%
  filter(between(x = yearID,
                 left = 2000,
                 right = 2019)) %>%
  group_by(teamID) %>%
  mutate(yr_2 = lead(W)) %>%
  rename(yr_1 = W) %>%
  filter(!is.na(yr_2)) %>%
  ungroup() 

## Correlation between year 1 and year 2
cor.test(dat2$yr_1, dat2$yr_2)


## linear model
fit2 <- lm(yr_2 ~ yr_1, data = dat2)
summary(fit2)

## residual plot
hist(resid(fit2),
     main = "Residuals\n(Seasons 2000 - 2019")

Now that we have more than a two seasons of data we see a normal distribution of the residuals. The correlation between year1 and year2 in this larger data set was 0.54, the same correlation we saw with the two seasons data.

With this larger data set, the average change in wins from year1 to year2 was 9 (not far from what we saw in the smaller data set above).

mean(abs(dat2$yr_2 - dat2$yr_1))

Wrapping Up

Regression to the mean is a common phenomenon in life. It can be difficult for practitioners in sports medicine and strength and conditioning to tease out the effects of regression to the mean when applying a specific training/rehab intervention. Often, regression to the mean fools us into believing that the intervention we did apply has some sort of causal relationship with the observed outcome. This phenomenon is also prevalent in sport when evaluating the performance of individual players and teams from one year to the next. With some simple calculations we can explore what regression to the mean could look like for data in our own setting, providing a compass and some base rates for us to evaluate observations going forward.

All of the code for this blog can be accessed on my GitHub page. If you notice any errors, please reach out!

References

1) Barnett, AG. van der Pols, JC. Dobson, AJ. (2005). Regression to the mean: what it is and how to deal with it. International Journal of Epidemiology; 34: 215-220.

2) Schall, T. Smith, G. Do baseball players regress toward the mean? The American Statistician; 54(4): 231-235.

tidymodels – Extract model coefficients for all cross validated folds

As I’ve discussed previously, we sometimes don’t have enough data where doing a train/test split makes sense. As such, we are better off building our model using cross-validation. In previous blog articles, I’ve talked about how to build models using cross-validation within the {tidymodels} framework (see HERE and HERE). In my prior examples, we fit the model over the cross-validation folds and then constructed the final model that we could then use to make predictions with, later on.

Recently, I ran into a situation where I wanted to see what the model coefficients look like across all of the cross-validation folds. So, I decided to make a quick blog post on how to do this, in case it is useful to others.

Load Packages & Data

We will use the {mtcars} package from R and build a regression model, using several independent variables, to predict miles per gallon (mpg).

### Packages -------------------------------------------------------

library(tidyverse)
library(tidymodels)

### Data -------------------------------------------------------

dat <- mtcars dat %>%
  head()

 

Create Cross-Validation Folds of the Data

I’ll use 10-fold cross validation.

### Modelling -------------------------------------------------------
## Create 10 Cross Validation Folds

set.seed(1)
cv_folds <- vfold_cv(dat, v = 10)
cv_folds

Specify a linear model and set up the model formula

## Specify the linear regression engine
## model specs
lm_spec <- linear_reg() %>%
  set_engine("lm") 


## Model formula
mpg_formula <- mpg ~ cyl + disp + wt + drat

Set up the model workflow  and fit the model to the cross-validated folds

## Set up workflow
lm_wf <- workflow() %>%
  add_formula(mpg_formula) %>%
  add_model(lm_spec) 

## Fit the model to the cross validation folds
lm_fit <- lm_wf %>%
  fit_resamples(
    resamples = cv_folds,
    control = control_resamples(extract = extract_model, save_pred = TRUE)
  )

Extract the model coefficients for each of the 10 folds (this is the fun part!)

Looking at the lm_fit output above, we see that it is a tibble consisting of various nested lists. The id column indicates which cross-validation fold the lists in each row pertain to. The model coefficients for each fold are stored in the .extracts column of lists. Instead of printing out all 10, let’s just have a look at the first 3 folds to see what they look like.

lm_fit$.extracts %>% 
  .[1:3]

There we see in the .extracts column, <lm> indicating the linear model for each fold. With a series of unnesting we can snag the model coefficients and then put them into a tidy format using the {broom} package. I’ve commented out each line of code below so that you know exactly what is happening.

# Let's unnest this and get the coefficients out
model_coefs <- lm_fit %>% 
  select(id, .extracts) %>%                    # get the id and .extracts columns
  unnest(cols = .extracts) %>%                 # unnest .extracts, which produces the model in a list
  mutate(coefs = map(.extracts, tidy)) %>%     # use map() to apply the tidy function and get the coefficients in their own column
  unnest(coefs)                                # unnest the coefs column you just made to get the coefficients for each fold

model_coefs

Now that we have a table of estimates, we can plot the coefficient estimates and their 95% confidence intervals. The term column indicates each variable. We will remove the (Intercept) for plotting purposes.

Plot the Coefficients

## Plot the model coefficients and 2*SE across all folds
model_coefs %>%
  filter(term != "(Intercept)") %>%
  select(id, term, estimate, std.error) %>%
  group_by(term) %>%
  mutate(avg_estimate = mean(estimate)) %>%
  ggplot(aes(x = id, y = estimate)) +
  geom_hline(aes(yintercept = avg_estimate),
             size = 1.2,
             linetype = "dashed") +
  geom_point(size = 4) +
  geom_errorbar(aes(ymin = estimate - 2*std.error, ymax = estimate + 2*std.error),
                width = 0.1,
                size = 1.2) +
  facet_wrap(~term, scales = "free_y") +
  labs(x = "CV Folds",
       y = "Estimate ± 95% CI",
       title = "Regression Coefficients ± 95% CI for 10-fold CV",
       subtitle = "Dashed Line = Average Coefficient Estimate over 10 CV Folds per Independent Variable") +
  theme_classic() +
  theme(strip.background = element_rect(fill = "black"),
        strip.text = element_text(face = "bold", size = 12, color = "white"),
        axis.title = element_text(size = 14, face = "bold"),
        axis.text.x = element_text(angle = 60, hjust = 1, face = "bold", size = 12),
        axis.text.y = element_text(face = "bold", size = 12),
        plot.title = element_text(size = 18),
        plot.subtitle = element_text(size = 16))

Now we can clearly see the model coefficients and confidence intervals for each of the 10 cross validated folds.

Wrapping Up

This was just a quick and easy way of fitting a model using cross-validation to extract out the model coefficients for each fold. Often, this is probably not necessary as you will fit your model, evaluate your model, and be off and running. However, there may be times where more specific interrogation of the model is required or, you might want to dig a little deeper into the various outputs of the cross-validated folds.

All of the code is available on my GitHub page.

If you notice any errors in code, please reach out!