Category Archives: Sports Science

Box & Dotplots for Performance Visuals – Creating an Interactive Plot with plotly

Yesterday, I provided some code to make a simple boxplot with a dotplot in order to visualize an athlete’s performance relative to their peers.

Today, we will make this plot interactive. To do so, we will use the {plotly} package and save the plots as HTML files that can be sent to our coworkers or decision-makers so that they can interact directly with the data.

Data

We will use the same simulated data from yesterday and also load the {plotly} library.

### Load libraries -----------------------------------------------
library(tidyverse)
library(randomNames)
library(plotly)

### Data -------------------------------------------------------
set.seed(2022)
dat <- tibble( participant = randomNames(n = 20), performance = rnorm(n = 20, mean = 100, sd = 10)) 

Interactive plotly plot

Our first two plots will show the same data, one vertical and one horizontal. The plotly syntax is a little different from that of ggplot2.

  • I start by creating a base plot, indicating I want the plot to be a boxplot. I also tell plotly that I want to group by participant, so that the dots will show up alongside the boxplot, as they did in yesterday’s visual.
  • Once I’ve specified the base plot, I use the boxpoints argument to add the points next to the boxplot and set some colors (again, using a colorblind-friendly palette).
  • Finally, I add axis labels and a title.
  • For ease of comparison, I use the subplot() function to place the two plots next to each other so that you can compare the vertical and horizontal plots and see which you prefer.
### Build plotly plot -------------------------------------------
# Set plot base
perf_plt <- plot_ly(dat, type = "box") %>%
  group_by(participant)

# Vertical plot
vert_plt <- perf_plt %>%
  add_boxplot(y = ~performance,
              boxpoints = "all",
              line = list(color = 'black'),
              text = ~participant,
              marker = list(color = '#56B4E9',
                            size = 15)) %>% 
  layout(xaxis = list(showticklabels = FALSE)) %>%
  layout(yaxis = list(title = "Performance")) %>%
  layout(title = "Team Performance")       

# Horizontal plot
hz_plt <- perf_plt %>%
  add_boxplot(x = ~performance,
              boxpoints = "all",
              line = list(color = 'black'),
              text = ~participant,
              marker = list(color = '#E69F00',
                            size = 15)) %>% 
  layout(yaxis = list(showticklabels = FALSE)) %>%
  layout(xaxis = list(title = "Performance")) %>%
  layout(title = "Team Performance") 

## put the two plots next to each other
subplot(vert_plt, hz_plt)
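
Since the goal is to share these plots with coworkers or decision-makers, one way to produce the HTML files mentioned at the top of the post is htmlwidgets::saveWidget(), which works on any plotly object (a minimal sketch; the file name here is just an example):

## save the interactive plot as a standalone HTML file
library(htmlwidgets)

combined_plt <- subplot(vert_plt, hz_plt)
saveWidget(combined_plt, file = "team_performance.html")

The resulting HTML file can be emailed around and opened in any browser; the same call works for the selector plot built later in this post.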



 

  • Statically, we can see the plot below; if you click the red link beneath it, you will be taken to the interactive version, where you can hover over the points and see each athlete’s name and performance on the test.

[interactive_plt1: link to the interactive version of the plot]

 

Interactive plotly with selector option

Next, we build the same plot but add a selector box so that the user can select the individual of interest and see their point relative to the boxplot (the population data).

This approach requires a few steps:

  • I have to create a highlight key to explicitly tell plotly that I want to be able to highlight the participants.
  • Next I create the base plot but this time instead of using the original data, I pass in the highlight key that I created in step 1.
  • I build the plot just like before.
  • Once the plot has been built I use the highlight() function to tell plotly how I want the plot to behave.

NOTE: This approach is super useful and easy and doesn’t require a shiny server to share the results. That said, I find this aspect of plotly to be a bit clunky and, when given the choice between this or using shiny, I take shiny because it has a lot more options to customize it exactly how you want. The downside is that you’d need a shiny server to share your results with colleagues or decision-makers, so there is a trade off.

### plotly with selection box -------------------------------------------
# set 'participant' as the key to highlight/select on
person_of_interest <- highlight_key(dat, ~participant)

# create a new base plotly plot using the person_of_interest highlight key
selection_perf_plt <- plot_ly(person_of_interest, type = "box") %>%
  group_by(participant)

# build the plot
plt_selector <- selection_perf_plt %>%
  add_boxplot(x = ~performance,
              boxpoints = "all",
              line = list(color = 'black'),
              text = ~participant,
              marker = list(color = '#56B4E9',
                            size = 15)) %>% 
  layout(yaxis = list(showticklabels = FALSE)) %>%
  layout(xaxis = list(title = "Performance")) %>%
  layout(title = "Team Performance")   

# create the selector tool
plt_selector %>%
  highlight(on = 'plotly_click',
            off = 'plotly_doubleclick',
            selectize = TRUE,
            dynamic = TRUE,
            persistent = TRUE)

 

  • Statically, we can see what the plot looks like below.
  • Below the static image you can click the red link to see me walk through the interactive plot. Notice that as I select participants they are highlighted. I can add as many as I want and change colors to emphasize certain participants over others. Additionally, once I begin to remove participants you’ll notice that plotly will create a boxplot for the selected subpopulation, which may be useful when communicating performance results.
  • Finally, the last red link will allow you to open the interactive tool yourself and play around with it.

[interactive_plt2_video: video walkthrough of the interactive selector plot]

[interactive_plt2: link to the interactive selector plot]

Wrapping Up

Today’s article provided some interactive options for the static plots that were created in yesterday’s blog article.

As always, the complete code for this article is available on my GITHUB page.

Box & Dotplots for Performance Visuals

A colleague recently asked me about visualizing an athlete’s performance relative to their teammates. More specifically, they wanted something that showed some sort of team average and normal range, and then a way to highlight where the individual athlete of interest resided within the population.

Immediately, my mind went to some type of boxplot visualization combined with a dotplot where the athlete can be clearly identified. Here are a few examples I quickly came up with.

Simulate Performance Data

First we will simulate some performance data for a group of athletes.

### Load libraries -----------------------------------------------
library(tidyverse)
library(randomNames)

### Data -------------------------------------------------------
set.seed(2022)
dat <- tibble(
  participant = randomNames(n = 20),
  performance = rnorm(n = 20, mean = 100, sd = 10))

Plot 1 – Boxplot with Points

The first plot is a simple boxplot with dots below it.

A couple of notes:

  • I’ve selected a few athletes to be our “of_interest” players for the plot.
  • This plot doesn’t have a y-axis, since all I am doing is plotting the boxplot for the distribution of performance. Therefore, I set the y-axis variable to a factor, so that it simply identifies a space within the grid to organize my plot.
  • I’m using a colorblind friendly palette to ensure that the colors are viewable to a broad audience.
  • Everything else after that is basic {ggplot2} code with some simple theme styling for the plot space and the legend position.
dat %>%
  mutate(of_interest = case_when(participant %in% c("Gallegos, Dennis", "Vonfeldt, Mckenna") ~ participant,
                                 TRUE ~ "everyone else")) %>%
  ggplot(aes(x = performance, y = factor(0))) +
  geom_boxplot(width = 0.2) +
  geom_point(aes(x = performance, y = factor(0),
                  fill = of_interest),
              position = position_nudge(y = -0.2),
              shape = 21,
              size = 8,
              color = "black",
              alpha = 0.6) +
  scale_fill_manual(values = c("Gallegos, Dennis" = "#E69F00", "Vonfeldt, Mckenna" = "#56B4E9", "everyone else" = "#999999")) +
  labs(x = "Performance",
       title = "Team Performance",
       fill = "Participants") +
  theme_classic() +
  theme(axis.text.y = element_blank(),
        axis.title.y = element_blank(),
        axis.text.x = element_text(size = 10, face = "bold"),
        axis.title.x = element_text(size = 12, face = "bold"),
        plot.title = element_text(size = 18),
        legend.position = "top")

Not too bad! You might have more data than we’ve simulated and thus the inline dots might start to get busy. We can use geom_dotplot() to create separation between dots in areas where there is more density and expose that density of performance scores a bit better.

Plot 2 – Boxplot with Dotplots

Swapping out geom_point() with geom_dotplot() allows us to produce the same plot above just with a dotplot.

  • Here, I set “y = positional” as a placeholder since, as before, we don’t have a true y-axis. Each layer then overrides y in its own aes(), which allows me to specify where on the y-axis I want to place my boxplot and dotplot.
# Horizontal
dat %>%
  mutate(of_interest = case_when(participant %in% c("Gallegos, Dennis", "Vonfeldt, Mckenna") ~ participant,
                                 TRUE ~ "everyone else")) %>%
  ggplot(aes(x = performance, y = positional)) +
  geom_boxplot(aes(y = 0.2),
    width = 0.2) +
  geom_dotplot(aes(y = 0, 
                   fill = of_interest),
              color = "black",
              stackdir="center",
              binaxis = "x",
              alpha = 0.6) +
  scale_fill_manual(values = c("Gallegos, Dennis" = "#E69F00", "Vonfeldt, Mckenna" = "#56B4E9", "everyone else" = "#999999")) +
  labs(x = "Performance",
       title = "Team Performance",
       fill = "Participants") +
  theme_classic() +
  theme(axis.text.y = element_blank(),
        axis.title.y = element_blank(),
        axis.text.x = element_text(size = 10, face = "bold"),
        axis.title.x = element_text(size = 12, face = "bold"),
        plot.title = element_text(size = 18),
        legend.position = "top",
        legend.title = element_text(size = 13),
        legend.text = element_text(size = 11),
        legend.key.size = unit(2, "line")) 

If you don’t like the idea of a horizontal plot, you can also build it vertically.

# vertical
dat %>%
  mutate(of_interest = case_when(participant %in% c("Gallegos, Dennis", "Vonfeldt, Mckenna") ~ participant,
                                 TRUE ~ "everyone else")) %>%
  ggplot(aes(x=positional, y= performance)) +
  geom_dotplot(aes(x = 1.75, 
                   fill = of_interest), 
               binaxis="y", 
               stackdir="center") +
  geom_boxplot(aes(x = 2), 
               width=0.2) +
  scale_fill_manual(values = c("Gallegos, Dennis" = "#E69F00", "Vonfeldt, Mckenna" = "#56B4E9", "everyone else" = "#999999")) +
  labs(y = "Performance",
       title = "Team Performance",
       fill = "Participants") +
  theme_classic() +
  theme(axis.text.x = element_blank(),
        axis.title.x = element_blank(),
        axis.text.y = element_text(size = 10, face = "bold"),
        axis.title.y = element_text(size = 12, face = "bold"),
        plot.title = element_text(size = 18),
        legend.position = "top",
        legend.title = element_text(size = 13),
        legend.text = element_text(size = 11),
        legend.key.size = unit(2, "line")) +
  xlim(1.5, 2.25)

Wrapping Up

Visualizing an athlete’s performance relative to their team or population can be useful for communicating data. Boxplots with dotplots are a compelling way to show where an athlete falls compared to their peers, and the plot can easily be built into an interactive version using Shiny. Another option would be a density plot with points below it (like a Raincloud plot). However, I often feel that people have a harder time grasping a density plot, whereas the boxplot clearly gives them a center (the median line inside the box) and some “normal” range (the interquartile range, represented by the box) to anchor themselves to. Still, a minimal sketch of the density-plot option is below for reference.
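
Here is that sketch, using the simulated data from above (the fixed y value for the points is just an illustrative way to tuck them beneath the density curve):

### Density plot with points (Raincloud-style alternative) --------
dat %>%
  mutate(of_interest = case_when(participant %in% c("Gallegos, Dennis", "Vonfeldt, Mckenna") ~ participant,
                                 TRUE ~ "everyone else")) %>%
  ggplot(aes(x = performance)) +
  geom_density(fill = "grey", alpha = 0.3) +
  geom_point(aes(y = -0.004, fill = of_interest),
             shape = 21,
             size = 6,
             color = "black",
             alpha = 0.6) +
  scale_fill_manual(values = c("Gallegos, Dennis" = "#E69F00", "Vonfeldt, Mckenna" = "#56B4E9", "everyone else" = "#999999")) +
  labs(x = "Performance",
       title = "Team Performance",
       fill = "Participants") +
  theme_classic() +
  theme(axis.text.y = element_blank(),
        axis.title.y = element_blank(),
        legend.position = "top")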

All of the code for this article is available on my GITHUB page.

Approximating a Bayesian Posterior Prediction

This past week on the Wharton Moneyball Podcast, during Quarter 2 of the show, the hosts got into a discussion about Bayesian methods, simulating uncertainty, and frequentist approaches to statistical analysis. The show hosts are all strong Bayesian advocates, but at one point in the discussion Eric Bradlow mentioned that “frequentists can answer similar questions by building distributions from their model predictions” (paraphrasing).

This comment reminded me of Chapter 7 in Gelman and Hill’s brilliant book, Data Analysis Using Regression and Multilevel/Hierarchical Models. In this chapter, the authors do what they call an informal Bayesian approach by simulating the predictions from a linear regression model. It’s an interesting (and easy) approach that can be helpful for getting a sense for how Bayesian posterior predictions work without building a full Bayesian model. I call it, “building a bridge to Bayes”.

Since the entire book is coded in R, I decided to code an example in Python.

The Jupyter notebook is accessible on my GITHUB page.

(Special Material: In the Jupyter Notebook I also include additional material on how to calculate prediction intervals and confidence intervals by hand in python. I won’t go over those sections here, as I don’t want the post to get too long.)

Libraries & Data

We will start by loading up the libraries that we need and the data. I’ll use the iris data set, since it is conveniently available from the sklearn library. We will build the regression model using the statsmodels.api library.

## import libraries

from sklearn import datasets
import pandas as pd
import numpy as np
import statsmodels.api as smf
from scipy import stats
import matplotlib.pyplot as plt

## iris data set
iris = datasets.load_iris()

## convert to a pandas data frame
data = iris['data']
target = iris['target']

iris_df = pd.DataFrame(data)
iris_df.columns = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
iris_df.head(3)

Build a linear regression model

Next, we build a simple ordinary least squares regression model to predict petal_width from petal_length.

 

## Set up our X and y variables
X = iris_df['petal_length']
y = iris_df['petal_width']

# NOTE: statsmodels does not automatically add an intercept, so you need to do that manually

X = smf.add_constant(X)

# Build regression model
# NOTE: the X and y variables are reversed in the function compared to sklearn

fit_lm = smf.OLS(y, X).fit()

# Get an R-like output of the model

fit_lm.summary()

Simulate a distribution around the slope coefficient

The slope coefficient for the petal_length variable, from the model output above, is 0.4158 with a standard error of 0.01. We will store these two values in their own variables and then use them to create a simulation of 10,000 samples and plot the distribution.

## get summary stats
mu_slope = 0.4158
se_slope = 0.01

## create simulation
n_samples = 10000
coef_sim = np.random.normal(loc = mu_slope, scale = se_slope, size = n_samples)

## plot simulated distribution

plt.hist(coef_sim, bins = 60)
plt.show()

We can also grab the summary statistics from the simulated distribution. We will snag the mean and the 90% quantile interval.

## get summary stats from our simulation
summary_stats = {
    'Mean': coef_sim.mean(),
    'Low90': np.quantile(coef_sim, 0.05),
    'High90': np.quantile(coef_sim, 0.95)
}

summary_stats

Making a prediction on a new observation and building a posterior predictive distribution

Now that we’ve gotten a good sense for how to create a simulation in python, we can create a new observation of petal_length and make a prediction about what the petal_width would be based on our model. In addition, we will get the prediction intervals from the output and use them to calculate a standard error for the prediction, which we will use for the posterior predictive simulation.

Technical Note: In the prediction output you will see that python returns a mean_ci_lower and mean_ci_upper and an obs_ci_lower and obs_ci_upper. The latter two are the prediction intervals while the former two are the confidence intervals. I previously discussed the difference HERE and this is confirmed in the Jupyter Notebook where I calculate these values by hand.

# create a new petal_length observation
new_dat = np.array([[1, 1.7]])    # add a 1 upfront to indicate the model intercept

# make prediction of petal_width using the model
prediction = fit_lm.get_prediction(new_dat)

# get summary of the prediction
prediction.summary_frame()

Store the predicted value (0.343709) and then calculate the standard error from the lower and upper prediction intervals. Run a simulation and then plot the distribution of predictions.

mu_pred = 0.343709
se_pred = 0.754956 - 0.343709     # upper prediction bound minus the predicted mean (the interval half-width) as a measure of variability
n_sims = 10000

pred_obs = np.random.normal(loc = mu_pred, scale = se_pred, size = n_sims)

plt.hist(pred_obs, bins = 60)
plt.show()

Just as we did for the simulation of the slope coefficient we can extract our summary statistics (mean and 90% quantile intervals).

## get summary stats from our simulation
summary_stats = {
    'Mean': pred_obs.mean(),
    'Low90': np.quantile(pred_obs, 0.05),
    'High90': np.quantile(pred_obs, 0.95)
}

summary_stats

Wrapping Up

That is a pretty easy way to get a sense for approximating a Bayesian posterior predictive distribution. Rather than simply reporting the predicted value and a confidence interval or prediction interval, it is sometimes nice to build an entire distribution. Aside from being visually appealing, it allows us to answer other questions we might have. For example, what percentage of the simulated predictions is greater than (or less than) 0.5, or any other threshold you might be interested in? With the simulation stored, that is a one-liner: np.mean(pred_obs > 0.5).

As stated earlier, all of this code is accessible on my GITHUB page and the Jupyter notebook also has additional sections on how to calculate confidence intervals and prediction intervals by hand.

If you notice any errors, let me know!

Shiny – User Defined Chart Parameters

A colleague was working on a web app for his basketball team and asked me if there was a way to create a {shiny} web app that allowed the user to define which parameters they would like to see on the plot. I figured this would be something others might be interested in as well, so here we go!

Load Packages, Helper Functions & Data

I’ll use data from the Lahman baseball database (seasons 2017 – 2019). I’m also going to create two helper functions: one for calculating the z-scores of our stats of interest and one for calculating a t-score from the z-score. The t-score puts the z-score on a 0 to 100 scale for plotting purposes in our polar plot. Additionally, we will use these standardized scores to conditionally format colors in our {gt} table (but we will hide the standardized columns so that the user only sees the raw data and colors). Finally, I’m going to create both a wide and a long format of the data, as it will be easier to use one or the other depending on the type of plot or table I am building.

#### Load packages ------------------------------------------------
library(tidyverse)
library(shiny)
library(Lahman)
library(gt)
library(plotly)

theme_set(theme_minimal() + 
            theme(
              axis.text = element_text(face = "bold", size = 12),
              legend.title = element_blank(),
              legend.position = "none"
            ) )

#### helper functions -------------------------------------------

z_score <- function(x){
  z = (x - mean(x, na.rm = T)) / sd(x, na.rm = T)
  return(z)
}

t_score <- function(x){
  t = (x * 10) + 50
  t = ifelse(t > 100, 100,
             ifelse(t < 0, 0, t))
  return(t)
}
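
As a quick illustration of how the two helpers compose (a toy example of my own, with made-up values):

## demo: standardize a vector, then rescale it to the 0-100 scale
x <- c(90, 95, 100, 105, 110)
z_score(x)           # z-scores, centered at 0
t_score(z_score(x))  # 37.4, 43.7, 50.0, 56.3, 62.6 (centered at 50)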


#### Get Data ---------------------------------------------------

dat <- Batting %>%
  filter(between(yearID, left = 2017, right = 2019),
         AB >= 200) %>% 
  group_by(yearID, playerID) %>%
  summarize(across(.cols = G:GIDP,
         ~sum(.x)),
         .groups = "drop") %>%
  mutate(ba = H / AB,
         obp = (H + BB + HBP) / (AB + HBP + SF),
         slg = ((H - X2B - X3B - HR) + X2B*2 + X3B*3 + HR*4) / AB,
         ops = obp + slg,
         hr_rate = HR / AB) %>%
  select(playerID, yearID, AB, ba:hr_rate) %>%
  mutate(across(.cols = ba:hr_rate,
                list(z = z_score)),
         across(.cols = ba_z:hr_rate_z,
                list(t = t_score))) %>%
  left_join(People %>%
              mutate(name = paste(nameLast, nameFirst, sep = ", ")) %>%
              select(playerID, name)) %>%
  relocate(name, .before = yearID)


dat_long <- Batting %>%
  filter(between(yearID, left = 2017, right = 2019),
         AB >= 200) %>% 
  group_by(playerID) %>%
  summarize(across(.cols = G:GIDP,
                   ~sum(.x)),
            .groups = "drop") %>%
  mutate(ba = H / AB,
         obp = (H + BB + HBP) / (AB + HBP + SF),
         slg = ((H - X2B - X3B - HR) + X2B*2 + X3B*3 + HR*4) / AB,
         ops = obp + slg,
         hr_rate = HR / AB) %>%
  select(playerID, AB, ba:hr_rate) %>%
  mutate(across(.cols = ba:hr_rate,
                list(z = z_score)),
         across(.cols = ba_z:hr_rate_z,
                list(t = t_score))) %>%
  left_join(People %>%
              mutate(name = paste(nameLast, nameFirst, sep = ", ")) %>%
              select(playerID, name)) %>%
  relocate(name, .before = AB) %>%
  select(playerID:AB, ends_with("z_t")) %>%
  pivot_longer(cols = -c(playerID, name, AB),
               names_to = "stat") %>%
  mutate(stat = case_when(stat == "ba_z_t" ~ "BA",
                          stat == "obp_z_t" ~ "OBP",
                          stat == "slg_z_t" ~ "SLG",
                          stat == "ops_z_t" ~ "OPS",
                          stat == "hr_rate_z_t" ~ "HR Rate"))


dat %>%
  head()

dat_long %>%
  head()

 

The Figures for our App

Before I build the {shiny} app, I wanted to first construct the three figures I will include. The code for these is accessible on GitHub, but here is what they look like:

  • For the polar plot, I will allow the user to define which variables they want on the chart.
  • For the time series plot, I am going to create an interactive {plotly} chart that allows the user to select the stat they want to see and then hover over the player’s points and obtain information like the raw value and the number of at bats in the given season via a simple tool tip.
  • The table, as discussed above, will use conditional formatting to provide the user with extra context about how that player performed relative to his peers in a given season.

Because I don’t like to clutter up my {shiny} apps, I tend to build my plots and tables into custom functions. That way, all I need to do is set up a reactive() in the server to obtain the user selected data and then call the function on that data. Here are the functions for the three figures above.

## table function
tbl_func <- function(NAME){ dat %>%
  filter(name == NAME) %>%
  select(yearID, AB:hr_rate, ends_with("z_t")) %>%
  gt(rowname_col = "yearID") %>%
  fmt_number(columns = ba:hr_rate,
             decimals = 3) %>%
  cols_label(
    AB = md("**AB**"),
    ba = md("**Batting Avg**"),
    obp = md("**OBP**"),
    slg = md("**SLG**"),
    ops = md("**OPS**"),
    hr_rate = md("**Home Run Rate**")
  ) %>%
  tab_header(title = NAME) %>%
  opt_align_table_header(align = "left") %>%
  tab_options(column_labels.border.top.color = "transparent",
              column_labels.border.top.width = px(3),
              table.border.top.color = "transparent",
              table.border.bottom.color = "transparent") %>%
  cols_align(align = "center") %>%
  cols_hide(columns = ends_with("z_t")) %>%
    tab_style(
      style = cell_fill(color = "palegreen"),
      location = cells_body(
        columns = ba,
        rows = ba_z_t > 60
      )
    )  %>%
    tab_style(
      style = cell_fill(color = "red"),
      location = cells_body(
        columns = ba,
        rows = ba_z_t < 40 ) ) %>%
    tab_style(
      style = cell_fill(color = "palegreen"),
      location = cells_body(
        columns = obp,
        rows = obp_z_t > 60
      )
    )  %>%
    tab_style(
      style = cell_fill(color = "red"),
      location = cells_body(
        columns = obp,
        rows = obp_z_t < 40 ) ) %>%
    tab_style(
      style = cell_fill(color = "palegreen"),
      location = cells_body(
        columns = slg,
        rows = slg_z_t > 60
      )
    )  %>%
    tab_style(
      style = cell_fill(color = "red"),
      location = cells_body(
        columns = slg,
        rows = slg_z_t < 40 ) ) %>%
    tab_style(
      style = cell_fill(color = "palegreen"),
      location = cells_body(
        columns = ops,
        rows = ops_z_t > 60
      )
    )  %>%
    tab_style(
      style = cell_fill(color = "red"),
      location = cells_body(
        columns = ops,
        rows = ops_z_t < 40 ) ) %>%
    tab_style(
      style = cell_fill(color = "palegreen"),
      location = cells_body(
        columns = hr_rate,
        rows = hr_rate_z_t > 60
      )
    )  %>%
    tab_style(
      style = cell_fill(color = "red"),
      location = cells_body(
        columns = hr_rate,
        rows = hr_rate_z_t < 40
      )
    ) 
}


## Polar plot function
polar_plt <- function(NAME, STATS){ dat_long %>%
    filter(name == NAME,
           stat %in% STATS) %>%
    ggplot(aes(x = stat, y = value, fill = stat)) +
    geom_col(color = "white", width = 0.75) +
    coord_polar(theta = "x") +
    geom_hline(yintercept = 50, size = 1.2) +
    labs(x = "", y = "") +
    ylim(0, 100)
  
} 


## time series plot function
time_plt <- function(NAME, STAT){
  
  STAT <- case_when(STAT == "BA" ~ "ba",
                    STAT == "OBP" ~ "obp",
                    STAT == "SLG" ~ "slg",
                    STAT == "OPS" ~ "ops",
                    STAT == "HR Rate" ~ "hr_rate")
  
  stat_z <- paste0(STAT, "_z")
  
  p <- dat %>% 
    filter(name == NAME) %>%
    select(yearID, AB, STAT, stat_z) %>%
    setNames(., c("yearID", "AB", "STAT", "stat_z")) %>%
    ggplot(aes(x = as.factor(yearID), 
               y = stat_z,
               group = 1,
               label = NAME,
               label2 = AB,
               label3 = STAT)) +
    geom_hline(yintercept = 0,
               size = 1.1,
               linetype = "dashed") +
    geom_line(size = 1.2) +
    geom_point(shape = 21,
               size = 6,
               color = "black",
               fill = "white") +
    ylim(-4, 4) 
  
  
  ggplotly(p)
  
}

 

Build the {shiny} app

The below code will construct the {shiny} app. We allow the user to select a player, select the stats of interest for the polar plot, and select the stat they’d like to track over time.

If you’d like to see a video of the app in use, CLICK HERE.

If you want to run this yourself or build one similar to it you can access my code on GitHub.

 

#### Shiny App ---------------------------------------------------------------

## User Interface
ui <- fluidPage(
  
  titlePanel("MLB Hitters Shiny App\n2017-2019"),
  
  
  sidebarPanel(width = 3,
             selectInput("name",
                             label = "Choose a Player:",
                             choices = unique(dat$name),
                             selected = NULL,
                             multiple = FALSE),
              
              selectInput("stat",
                          label = "Choose stats for polar plot:",
                          choices = unique(dat_long$stat),
                          selected = NULL,
                          multiple = TRUE),
              
              selectInput("time_stat",
                          label = "Choose stat for time series:",
                          choices = unique(dat_long$stat),
                          selected = NULL,
                          multiple = FALSE)
  ),
  
  
  mainPanel(
    
    gt_output(outputId = "tbl"),
    
    fluidRow(
      
      column(6, plotOutput(outputId = "polar")),
      column(6, plotlyOutput(outputId = "time"))
    )
    
  )
)



server <- function(input, output){
  
  ## get player selected for table
  NAME <- reactive({ dat_long %>%
      filter(name == input$name) %>%
      distinct(name, .keep_all = FALSE) %>%
      pull(name)
    })
  
  ## get stats for polar plot
  polar_stats <- reactive({ dat_long %>%
      filter(stat %in% c(input$stat)) %>%
      pull(stat)
    
  })
  
  ## get stat for time series
  ts_stat <- reactive({ dat %>%
      select(ba:hr_rate) %>%
      setNames(., c("BA", "OBP", "SLG", "OPS", "HR Rate")) %>%
      select(input$time_stat) %>% 
      colnames()
    
  })
  
  ## table output
  output$tbl <- render_gt(
      tbl_func(NAME = NAME())
  )
  
  ## polar plot output
  output$polar <- renderPlot(
    
    polar_plt(NAME = NAME(),
              STATS = polar_stats())
    
  )
  
  ## time series plot output
  output$time <- renderPlotly(
    
    time_plt(NAME = NAME(),
             STAT = ts_stat())
    
  )

  
}





shinyApp(ui, server)

Regression to the Mean in Sports

Last week, scientist David Borg posted an article to twitter talking about regression to the mean in epidemiological research (1). Regression to the Mean is a statistical phenomenon where extreme observations tend to move closer towards the population mean in subsequent observations, due simply to natural variation. To steal Galton’s example, tall parents will often have tall children but those children, while taller than the average child, will tend to be shorter than their parents (regressed to the mean). It’s also one of the reasons why clinicians have such a difficult time understanding whether their intervention actually made the patient better or whether observed improvements are simply due to regression to the mean over the course of treatment (something that well designed studies attempt to rule out by using randomized controlled experiments).

Of course, this phenomenon is not unique to epidemiology or biostatistics. In fact, the phrase is commonly used in sport when discussing players or teams that have extremely high or low performances in a season, with the general expectation that they will be closer to average the next year. An example of this is the sophomore slump exhibited by rookies who perform at an extremely high level in their first season.
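
To build some intuition before looking at real data, here is a quick toy simulation (my own sketch, with made-up numbers): each “athlete” gets a fixed true ability, and each season’s observed performance is that ability plus random noise. The top performers in season one tend to score closer to the mean in season two, even though nothing about them changed.

## regression to the mean from noise alone
set.seed(55)
true_ability <- rnorm(n = 1000, mean = 100, sd = 10)
season_1 <- true_ability + rnorm(n = 1000, mean = 0, sd = 10)
season_2 <- true_ability + rnorm(n = 1000, mean = 0, sd = 10)

## top 10% of season-1 performers
top <- season_1 >= quantile(season_1, 0.9)

mean(season_1[top])    # well above the population mean of 100
mean(season_2[top])    # pulled back toward 100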

Given that this phenomenon is so common in our lives, the goal with this blog article is to show what regression to the mean looks like for team wins in baseball from one year to the next.

Data

We will use data from the Lahman baseball database (freely available in R) and concentrate on team wins in the 2015 and 2016 MLB seasons.

library(tidyverse)
library(Lahman)
library(ggalt)

theme_set(theme_bw())

data(Teams)

dat <- Teams %>%
  select(yearID, teamID, W) %>%
  arrange(teamID, yearID) %>%
  filter(yearID %in% c(2015, 2016)) %>%
  group_by(teamID) %>%
  mutate(yr_2 = lead(W)) %>%
  rename(yr_1 = W) %>%
  filter(!is.na(yr_2)) %>%
  ungroup() 

dat %>%
  head()

 

Exploratory Data Analysis

dat %>%
  ggplot(aes(x = yr_1, y = yr_2, label = teamID)) +
  geom_point() +
  ggrepel::geom_text_repel() +
  geom_abline(intercept = 0,
              slope = 1,
              color = "red",
              linetype = "dashed",
              size = 1.1) +
  labs(x = "2015 wins",
       y = "2016 wins")

The dashed line is the line of equality. A team that lies exactly on this line would have the exact same number of wins in 2015 and 2016. While no team lies exactly on this line, what we can deduce from the chart is that teams below the red line had more wins in 2015 and fewer in 2016, while the opposite is true for those that lie above the line. Minnesota had a large decline in performance, going from just over 80 wins in 2015 to below 60 wins in 2016.

The correlation between wins in 2015 and wins in 2016 is 0.54.

cor.test(dat$yr_1, dat$yr_2)

Plotting changes in wins from 2015 to 2016

We can plot each team and show their change in wins from 2015 (green) to 2016 (blue). We will break them into two groups: teams that saw a decrease in wins from 2015 to 2016 and teams that saw an increase. We will plot the z-score of team wins on the x-axis so that league average is represented as 0.

## get the z-score of team wins for each season
dat <- dat %>%
  mutate(z_yr1 = (yr_1 - mean(yr_1)) / sd(yr_1),
         z_yr2 = (yr_2 - mean(yr_2)) / sd(yr_2))

dat %>%
  mutate(pred_z_dir = ifelse(yr_2 > yr_1, "increase wins", "decrease wins")) %>%
  ggplot(aes(x = z_yr1, xend = z_yr2, y = reorder(teamID, z_yr1))) +
  geom_vline(xintercept = 0,
             linetype = "dashed",
             size = 1.2,
             color = "grey") +
  geom_dumbbell(size = 1.2,
                colour_x = "green",
                colour_xend = "blue",
                size_x = 6,
                size_xend = 6) +
  facet_wrap(~pred_z_dir, scales = "free_y") +
  scale_color_manual(values = c("decrease wins" = "red", "increase wins" = "black")) +
  labs(x = "2015",
       y = "2016",
       title = "Green = 2015 Wins\nBlue = 2016 Wins",
       color = "Predicted Win Direction")

On the decreasing wins side, notice that all teams from SLN to LAA had more wins in 2015 (green) and then regressed towards the mean (or slightly below the mean) in 2016. From MIN down, those teams actually got even worse in 2016.

On the increasing wins side, from BOS down to PHI all of the teams regressed upward towards the mean (NOTE: regressing upward sounds weird, which is why some refer to this as reversion to the mean). From CHN to BAL, those teams were at or above average in 2015 and got better in 2016.

It makes sense that not all teams revert towards the mean in the second year given that teams attempt to upgrade their roster from one season to the next in order to maximize their chances of winning more games.

Regression to the mean with linear regression

There are three ways we can evaluate regression to the mean using linear regression and all three approaches will lead us to the same conclusion.

1. Linear regression with the raw data, predicting 2016 wins from 2015 wins.

2. Linear regression predicting 2016 wins from the grand mean centered 2015 wins (grand mean centered just means subtracting the league average wins, 81, from the observed wins for each team).

3. Linear regression using z-scores. This approach will produce results in standard deviation units rather than raw values.

## linear regression with raw values
fit <- lm(yr_2 ~ yr_1, data = dat)
summary(fit)

## linear regression with grand mean centered 2015 wins 
fit_grand_mean <- lm(yr_2 ~ I(yr_1 - mean(yr_1)), data = dat)
summary(fit_grand_mean)

## linear regression with z-score values
fit_z <- lm(z_yr2 ~ z_yr1, data = dat)
summary(fit_z)

Next, we take these equations and simply make predictions for 2016 win totals for each team.

dat$pred_fit <- fitted(fit)
dat$pred_fit_grand_mean <- fitted(fit_grand_mean)
dat$pred_fit_z <- fitted(fit_z)

dat %>%
  head()

You’ll notice that the first two predictions are exactly the same. The third prediction is in standard deviation units. For example, Arizona (ARI) had a z-score of -0.188 in 2015 (just below the mean) and was predicted to regress toward the mean in 2016, to a z-score of -0.102. However, they actually went the other way and got even worse, finishing the season with a z-score of -1.12!

We can plot the residuals and see how far off our projections were for each team’s 2016 win total.

par(mfrow = c(2,2))
hist(resid(fit), main = "Linear reg with raw values")
hist(resid(fit_grand_mean), main = "Linear reg with\ngrand mean centered values")
hist(resid(fit_z), main = "Linear reg with z-scored values")

The residuals look weird here because we are only dealing with two seasons of data. At the end of this blog article I will run the residual plot for all seasons from 2000 – 2019 to show how the residuals resemble more of a normal distribution.

We can see that there seems to be an extreme outlier that has a -20 win residual. Let’s pull that team out and see who it was.

dat %>%
  filter((yr_2 - pred_fit) < -19)

It was Minnesota, who went from an average season in 2015 (83 wins) and dropped all the way to 59 wins in 2016 (a z-score of -2.05), something we couldn’t have really predicted.

The average absolute change from year 1 to year 2 for these teams was 8 wins.

mean(abs(dat$yr_2 - dat$yr_1))

Calculating Regression to the mean by hand

To explore the concept more, we can calculate regression toward the mean for the 2016 season by hand, using the year-to-year correlation of team wins (r), the league average wins, and each team’s z-score for wins in 2015. Using the same names as the code below, the equations look like this:

predicted_yr2_wins = avg_wins + sd_wins * predicted_yr2_z

Where predicted_yr2_z is calculated as:

predicted_yr2_z = z_yr1 * r

We calculated the correlation coefficient above but let’s now store it as its own variable. Additionally we will store the league average team wins and standard deviation in 2015.

avg_wins <- mean(dat$yr_1)
sd_wins <- sd(dat$yr_1)
r <- cor.test(dat$yr_1, dat$yr_2)$estimate

avg_wins
sd_wins
r

On average, teams won 81 games (which makes sense for a 162-game season) with a standard deviation of about 10 games.

Let’s look at Pittsburgh (PIT).

dat %>%
  filter(teamID == "PIT") %>%
  select(yearID:z_yr1) %>%
  mutate(predicted_yr2_z = z_yr1 * r,
         predicted_yr2_wins = avg_wins + sd_wins * predicted_yr2_z)

  • In 2015 Pittsburgh had 98 wins, a z-score of 1.63.
  • We predict them to regress toward the mean in 2016, with a z-score of 0.881 (about 90 wins).

Now we add these by-hand regression-to-the-mean predictions for all teams.

dat <- dat %>%
  mutate(predicted_yr2_z = z_yr1 * r,
         predicted_yr2_wins = avg_wins + sd_wins * predicted_yr2_z)

dat %>%
  head()


We see that our by-hand calculation produces the same prediction, as it should.

Show the residuals for all seasons from 2000 – 2019

Here, we will filter the data from the database for the desired seasons, refit the model, and plot the residuals.

 

dat2 <- Teams %>%
  select(yearID, teamID, W) %>%
  arrange(teamID, yearID) %>%
  filter(between(x = yearID,
                 left = 2000,
                 right = 2019)) %>%
  group_by(teamID) %>%
  mutate(yr_2 = lead(W)) %>%
  rename(yr_1 = W) %>%
  filter(!is.na(yr_2)) %>%
  ungroup() 

## Correlation between year 1 and year 2
cor.test(dat2$yr_1, dat2$yr_2)


## linear model
fit2 <- lm(yr_2 ~ yr_1, data = dat2)
summary(fit2)

## residual plot
hist(resid(fit2),
     main = "Residuals\n(Seasons 2000 - 2019")

Now that we have more than two seasons of data, we see a normal distribution of the residuals. The correlation between year 1 and year 2 in this larger data set was 0.54, the same correlation we saw with the two seasons of data.

With this larger data set, the average absolute change in wins from year 1 to year 2 was 9 (not far from what we saw in the smaller data set above).

mean(abs(dat2$yr_2 - dat2$yr_1))

Wrapping Up

Regression to the mean is a common phenomenon in life. It can be difficult for practitioners in sports medicine and strength and conditioning to tease out the effects of regression to the mean when applying a specific training/rehab intervention. Often, regression to the mean fools us into believing that the intervention we did apply has some sort of causal relationship with the observed outcome. This phenomenon is also prevalent in sport when evaluating the performance of individual players and teams from one year to the next. With some simple calculations we can explore what regression to the mean could look like for data in our own setting, providing a compass and some base rates for us to evaluate observations going forward.

All of the code for this blog can be accessed on my GitHub page. If you notice any errors, please reach out!

References

1) Barnett, AG., van der Pols, JC., Dobson, AJ. (2005). Regression to the mean: what it is and how to deal with it. International Journal of Epidemiology, 34: 215-220.

2) Schall, T., Smith, G. (2000). Do baseball players regress toward the mean? The American Statistician, 54(4): 231-235.