# Approximating a Bayesian Posterior Prediction

This past week on the Wharton Moneyball Podcast, during Quarter 2 of the show the hosts got into a discussion about Bayesian methods, simulating uncertainty, and frequentist approaches to statistical analysis. The show hosts are all strong Bayesian advocates but at one point in the discussion, Eric Bradlow mentioned that “frequenstist can answer similar questions by building distributions from their model predictions.” (paraphrasing)

This comment reminded me of Chapter 7 in Gelman and Hill’s brilliant book, Data Analysis Using Regression and Multilevel/Hierarchical Models. In this chapter, the authors do what they call an informal Bayesian approach by simulating the predictions from a linear regression model. It’s an interesting (and easy) approach that can be helpful for getting a sense for how Bayesian posterior predictions work without building a full Bayesian model. I call it, “building a bridge to Bayes”.

Since the entire book is coded in R, I decided to code an example in Python.

The Jupyter notebook is accessible on my GITHUB page.

(Special Material: In the Jupyter Notebook I also include additional material on how to calculate prediction intervals and confidence intervals by hand in python. I wont go over those entire sections here as I don’t want the post to get too long.)

Libraries & Data

We will start by loading up the libraries that we need and the data. I’ll use the iris data set, since it is conveniently available from the sklearn library. We will build the regression model using the statsmodels.api library.

```## import libraries

from sklearn import datasets
import pandas as pd
import numpy as np
import statsmodels.api as smf
from scipy import stats
import matplotlib.pyplot as plt

## iris data set

## convert to pandas sata frame
data = iris['data']
target = iris['target']

iris_df = pd.DataFrame(data)
iris_df.columns = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
```

Build a linear regression model

Next, we build a simple ordinary least squares regression model to predict petal_width from petal_length.

```## Set up our X and y variables
X = iris_df['petal_length']
y = iris_df['petal_width']

# NOTE: statsmodels does not automatically add an intercept, so you need to do that manually

# Build regression model
# NOTE: the X and y variables are reversed in the function compared to sklearn

fit_lm = smf.OLS(y, X).fit()

# Get an R-like output of the model

fit_lm.summary()
```

Simulate a distribution around the slope coefficient

The slope coefficient for the petal_length variable, from the model output above, is 0.4158 with a standard error of 0.01. We will store these two values in their own variables and then use them to create a simulation of 10,000 samples and plot the distribution.

```## get summary stats
mu_slope = 0.4158
se_slope = 0.01

## create simulation
n_samples = 10000
coef_sim = np.random.normal(loc = mu_slope, scale = se_slope, size = n_samples)

## plot simulated distribution

plt.hist(coef_sim, bins = 60)
plt.show()
```

We can also grab the summary statistics from the simulated distribution. We will snag the mean and the 90% quantile interval.

```## get summary stats from our simulation
summary_stats = {
'Mean': coef_sim.mean(),
'Low90': np.quantile(coef_sim, 0.05),
'High90': np.quantile(coef_sim, 0.95)
}

summary_stats
```

Making a prediction on a new observation and building a posterior predictive distribution

Now that we’ve gotten a good sense for how to create a simulation in python, we can create a new observation of petal_length and make a prediction about what the petal_width would be based on our model. In addition, we will get the prediction intervals from the output and use them to calculate a standard error for the prediction, which we will use for the posterior predictive simulation.

Technical Note: In the prediction output you will see that python returns a mean_ci_lower and mean_ci_upper and an obs_ci_lower and obs_ci_higher. The latter two are the prediction intervals  while the former two are the confidence intervals. I previously discussed the difference HERE and this is confirmed in the Jupyter Notebook where I calculate these values by hand.

```# create a new petal_length observation
new_dat = np.array([[1, 1.7]])    # add a 1 upfront to indicate the model intercept

# make prediction of petal_width using the model
prediction = fit_lm.get_prediction(new_dat)

# get summary of the prediction
prediction.summary_frame()
```

Store the predicted value (0.343709) and then calculate the standard error from the lower and upper prediction intervals. Run a simulation and then plot the distribution of predictions.

```mu_pred = 0.343709
se_pred = 0.754956 - 0.343709     # subtract the upper prediction interval from the mean to get the variability
n_sims = 10000

pred_obs = np.random.normal(loc = mu_pred, scale = se_pred, size = n_sims)

plt.hist(pred_obs, bins = 60)
plt.show()
```

Just as we did for the simulation of the slope coefficient we can extract our summary statistics (mean and 90% quantile intervals).

```## get summary stats from our simulation
summary_stats = {
'Mean': pred_obs.mean(),
'Low90': np.quantile(pred_obs, 0.05),
'High90': np.quantile(pred_obs, 0.95)
}

summary_stats
```

Wrapping Up

That is a pretty easy way to get a sense for approximating a Bayesian posterior predictive distribution. Rather than simply reporting the predicted value and a confidence interval or prediction interval, it is sometimes nice to build an entire distribution. Aside from it being visually appealing, it allows us to answer other questions we might have, for example, what percentage of the data is greater or less than 0.5 (or any other threshold value you might be interested in)?

As stated earlier, all of this code is accessible on my GITHUB page and the Jupyter notebook also has additional sections on how to calculate confidence intervals and prediction intervals by hand.

If you notice any errors, let me know!

# Shiny – User Defined Chart Parameters

A colleague was working on a web app for his basketball team and asked me if there was a way to create a {shiny} web app that allowed the user to define which parameters they would like to see on the plot. I figured this would be something others might be interested in as well, so here we go!

Load Packages, Helper Functions & Data

I’ll use data from the Lahman baseball database (seasons 2017 – 2019). I’m also going to create two helper functions, one for calculating the z-scores for our stats of interest and one for calculating the t-value from the z-score. The t-value will put the z-score on a 0 to 100 scale for plotting purposes in our polar plot. Additionally, we will use these standardized scores to conditionally format colors on our {gt} table (but we will hide the standardized columns so that the user only sees the raw data and colors). Finally, I’m going to create both a wide and long format of the data as it will be easier to use one or the other, depending on the type of plot or table I am building.

```#### Load packages ------------------------------------------------
library(tidyverse)
library(shiny)
library(Lahman)
library(gt)
library(plotly)

theme_set(theme_minimal() +
theme(
axis.text = element_text(face = "bold", size = 12),
legend.title = element_blank(),
legend.position = "none"
) )

#### helper functions -------------------------------------------

z_score <- function(x){
z = (x - mean(x, na.rm = T)) / sd(x, na.rm = T)
return(z)
}

t_score <- function(x){ t = (x * 10) + 50 t = ifelse(t > 100, 100,
ifelse(t < 0, 0, t))
return(t)
}

#### Get Data ---------------------------------------------------

dat <- Batting %>%
filter(between(yearID, left = 2017, right = 2019),
AB >= 200) %>%
group_by(yearID, playerID) %>%
summarize(across(.cols = G:GIDP,
~sum(.x)),
.groups = "drop") %>%
mutate(ba = H / AB,
obp = (H + BB + HBP) / (AB + HBP + SF),
slg = ((H - X2B - X3B - HR) + X2B*2 + X3B*3 + HR*4) / AB,
ops = obp + slg,
hr_rate = H / AB) %>%
select(playerID, yearID, AB, ba:hr_rate) %>%
mutate(across(.cols = ba:hr_rate,
list(z = z_score)),
across(.cols = ba_z:hr_rate_z,
list(t = t_score))) %>%
left_join(People %>%
mutate(name = paste(nameLast, nameFirst, sep = ", ")) %>%
select(playerID, name)) %>%
relocate(name, .before = yearID)

dat_long <- Batting %>%
filter(between(yearID, left = 2017, right = 2019),
AB >= 200) %>%
group_by(playerID) %>%
summarize(across(.cols = G:GIDP,
~sum(.x)),
.groups = "drop") %>%
mutate(ba = H / AB,
obp = (H + BB + HBP) / (AB + HBP + SF),
slg = ((H - X2B - X3B - HR) + X2B*2 + X3B*3 + HR*4) / AB,
ops = obp + slg,
hr_rate = H / AB) %>%
select(playerID, AB, ba:hr_rate) %>%
mutate(across(.cols = ba:hr_rate,
list(z = z_score)),
across(.cols = ba_z:hr_rate_z,
list(t = t_score))) %>%
left_join(People %>%
mutate(name = paste(nameLast, nameFirst, sep = ", ")) %>%
select(playerID, name)) %>%
relocate(name, .before = AB) %>%
select(playerID:AB, ends_with("z_t")) %>%
pivot_longer(cols = -c(playerID, name, AB),
names_to = "stat") %>%
mutate(stat = case_when(stat == "ba_z_t" ~ "BA",
stat == "obp_z_t" ~ "OBP",
stat == "slg_z_t" ~ "SLG",
stat == "ops_z_t" ~ "OPS",
stat == "hr_rate_z_t" ~ "HR Rate"))

dat %>%

dat_long %>%
```

The Figures for our App

Before I build the {shiny} app, I wanted to first construct the three figures I will include. The code for these will be accessible in Github, but here is what they look like:

• For the polar plot, I will allow the user to define which variables they want on the chart.
• For the time series plot, I am going to create an interactive {plotly} chart that allows the user to select the stat they want to see and then hover over the player’s points and obtain information like the raw value and the number of at bats in the given season via a simple tool tip.
• The table, as discussed above, will user conditional formatting to provide the user with extra context about how that player performed relative to his peers in a given season.

Because I don’t like to clutter up my {shiny} apps, I tend to build my plots and tables into custom functions. That way, all I need to do is set up a reactive() in the server to obtain the user selected data and then call the function on that data. Here are the functions for the three figures above.

```## table function
tbl_func <- function(NAME){ dat %>%
filter(name == NAME) %>%
select(yearID, AB:hr_rate, ends_with("z_t")) %>%
gt(rowname_col = "yearID") %>%
fmt_number(columns = ba:hr_rate,
decimals = 3) %>%
cols_label(
AB = md("**AB**"),
ba = md("**Batting Avg**"),
obp = md("**OBP**"),
slg = md("**SLG**"),
ops = md("**OPS**"),
hr_rate = md("**Home Run Rate**")
) %>%
tab_options(column_labels.border.top.color = "transparent",
column_labels.border.top.width = px(3),
table.border.top.color = "transparent",
table.border.bottom.color = "transparent") %>%
cols_align(align = "center") %>%
cols_hide(columns = ends_with("z_t")) %>%
tab_style(
style = cell_fill(color = "palegreen"),
location = cells_body(
columns = ba,
rows = ba_z_t > 60
)
)  %>%
tab_style(
style = cell_fill(color = "red"),
location = cells_body(
columns = ba,
rows = ba_z_t < 40 ) ) %>%
tab_style(
style = cell_fill(color = "palegreen"),
location = cells_body(
columns = obp,
rows = obp_z_t > 60
)
)  %>%
tab_style(
style = cell_fill(color = "red"),
location = cells_body(
columns = obp,
rows = obp_z_t < 40 ) ) %>%
tab_style(
style = cell_fill(color = "palegreen"),
location = cells_body(
columns = slg,
rows = slg_z_t > 60
)
)  %>%
tab_style(
style = cell_fill(color = "red"),
location = cells_body(
columns = slg,
rows = slg_z_t < 40 ) ) %>%
tab_style(
style = cell_fill(color = "palegreen"),
location = cells_body(
columns = ops,
rows = ops_z_t > 60
)
)  %>%
tab_style(
style = cell_fill(color = "red"),
location = cells_body(
columns = ops,
rows = ops_z_t < 40 ) ) %>%
tab_style(
style = cell_fill(color = "palegreen"),
location = cells_body(
columns = hr_rate,
rows = hr_rate_z_t > 60
)
)  %>%
tab_style(
style = cell_fill(color = "red"),
location = cells_body(
columns = hr_rate,
rows = hr_rate_z_t < 40
)
)
}

## Polar plot function
polar_plt <- function(NAME, STATS){ dat_long %>%
filter(name == NAME,
stat %in% STATS) %>%
ggplot(aes(x = stat, y = value, fill = stat)) +
geom_col(color = "white", width = 0.75) +
coord_polar(theta = "x") +
geom_hline(yintercept = seq(50, 50, by = 1), size = 1.2) +
labs(x = "", y = "") +
ylim(0, 100)

}

## time series plot function
time_plt <- function(NAME, STAT){

STAT <- case_when(STAT == "BA" ~ "ba",
STAT == "OBP" ~ "obp",
STAT == "SLG" ~ "slg",
STAT == "OPS" ~ "ops",
STAT == "HR Rate" ~ "hr_rate")

stat_z <- paste0(STAT, "_z")

p <- dat %>%
filter(name == NAME) %>%
select(yearID, AB, STAT, stat_z) %>%
setNames(., c("yearID", "AB", "STAT", "stat_z")) %>%
ggplot(aes(x = as.factor(yearID),
y = stat_z,
group = 1,
label = NAME,
label2 = AB,
lable3 = STAT)) +
geom_hline(yintercept = 0,
size = 1.1,
linetype = "dashed") +
geom_line(size = 1.2) +
geom_point(shape = 21,
size = 6,
color = "black",
fill = "white") +
ylim(-4, 4)

ggplotly(p)

}
```

Build the {shiny} app

The below code will construct the {shiny} app. We allow the user to select a player, select the stats of interest for the polar plot, and select the stat they’d like to track over time.

If you’d like to see a video of the app in use, CLICK HERE <shiny – user defined chart parameters>

If you want to run this yourself or build one similar to it you can access my code on GitHub.

```#### Shiny App ---------------------------------------------------------------

## User Interface
ui <- fluidPage(

titlePanel("MLB Hitters Shiny App\n2017-2019"),

sidebarPanel(width = 3,
selectInput("name",
label = "Choose a Player:",
choices = unique(dat\$name),
selected = NULL,
multiple = FALSE),

selectInput("stat",
label = "Choose stats for polar plot:",
choices = unique(dat_long\$stat),
selected = NULL,
multiple = TRUE),

selectInput("time_stat",
label = "Choose stat for time series:",
choices = unique(dat_long\$stat),
selected = NULL,
multiple = FALSE)
),

mainPanel(

gt_output(outputId = "tbl"),

fluidRow(

column(6, plotOutput(outputId = "polar")),
column(6, plotlyOutput(outputId = "time"))
)

)
)

server <- function(input, output){

## get player selected for table
NAME <- reactive({ dat_long %>%
filter(name == input\$name) %>%
distinct(name, .keep_all = FALSE) %>%
pull(name)
})

## get stats for polar plot
polar_stats <- reactive({ dat_long %>%
filter(stat %in% c(input\$stat)) %>%
pull(stat)

})

## get stat for time series
ts_stat <- reactive({ dat %>%
select(ba:hr_rate) %>%
setNames(., c("BA", "OBP", "SLG", "OPS", "HR Rate")) %>%
select(input\$time_stat) %>%
colnames()

})

## table output
output\$tbl <- render_gt(
tbl_func(NAME = NAME())
)

## polar plot output
output\$polar <- renderPlot(

polar_plt(NAME = NAME(),
STAT = polar_stats())

)

## time series plot output
output\$time <- renderPlotly(

time_plt(NAME = NAME(),
STAT = ts_stat())

)

}

shinyApp(ui, server)
```

# Regression to the Mean in Sports

Last week, scientist David Borg posted an article to twitter talking about regression to the mean in epidemiological research (1). Regression to the Mean is a statistical phenomenon where extreme observations tend to move closer towards the population mean in subsequent observations, due simply to natural variation. To steal Galton’s example, tall parents will often have tall children but those children, while taller than the average child, will tend to be shorter than their parents (regressed to the mean). It’s also one of the reasons why clinicians have such a difficult time understanding whether their intervention actually made the patient better or whether observed improvements are simply due to regression to the mean over the course of treatment (something that well designed studies attempt to rule out by using randomized controlled experiments).

Of course, this phenomenon is not unique to epidemiology or biostatistics. In fact, the phrase is commonly used in sport when discussing players or teams that have extremely high or low performances in a season and there is a general expectation that they will be more normal next year. An example of this could be the sophomore slump exhibited by rookies who perform at an extremely high level in their first season.

Given that this phenomenon is so common in our lives, the goal with this blog article is to show what regression to the mean looks like for team wins in baseball from one year to the next.

Data

We will use data from the Lahman baseball database (freely available in R) and concentrate on team wins in the 2015 and 2016 MLB seasons.

```library(tidyverse)
library(Lahman)
library(ggalt)

theme_set(theme_bw())

data(Teams)

dat <- Teams %>%
select(yearID, teamID, W) %>%
arrange(teamID, yearID) %>%
filter(yearID %in% c(2015, 2016)) %>%
group_by(teamID) %>%
rename(yr_1 = W) %>%
filter(!is.na(yr_2)) %>%
ungroup()

dat %>%
```

Exploratory Data Analysis

```dat %>%
ggplot(aes(x = yr_1, y = yr_2, label = teamID)) +
geom_point() +
ggrepel::geom_text_repel() +
geom_abline(intercept = 0,
slope = 1,
color = "red",
linetype = "dashed",
size = 1.1) +
labs(x = "2015 wins",
y = "2016 wins")
```

The dashed line is the line of equality. A team that lies exactly on this line would be a team that had the exact number of wins in 2015 as they did in 2016. While no team lies exactly on this line, looking at the chart, what we can deduce is that teams below the red line had more wins in 2015 and less in 2016 while the opposite is true for those that lie above the line. Minnesota had a large decline in performance going from just over 80 wins in 2015 to below 60 wins in 2017.

The correlation between wins in 2015 to wins in 2016 is 0.54.

```cor.test(dat\$yr_1, dat\$yr_2)
```

Plotting changes in wins from 2015 to 2016

We can plot each team and show their change in wins from 2015 (green) to 2016 (blue). We will break them into two groups, teams that saw a decrease in wins in 2016 relative to 2015 and teams that saw an increase in wins from 2015 to 2016. We will plot the z-score of team wins on the x-axis so that we can reflect average as being “0”.

```## get the z-score of team wins for each season
dat <- dat %>%
mutate(z_yr1 = (yr_1 - mean(yr_1)) / sd(yr_1),
z_yr2 = (yr_2 - mean(yr_2)) / sd(yr_2))

dat %>%
mutate(pred_z_dir = ifelse(yr_2 > yr_1, "increase wins", "decrease wins")) %>%
ggplot(aes(x = z_yr1, xend = z_yr2, y = reorder(teamID, z_yr1))) +
geom_vline(xintercept = 0,
linetype = "dashed",
size = 1.2,
color = "grey") +
geom_dumbbell(size = 1.2,
colour_x = "green",
colour_xend = "blue",
size_x = 6,
size_xend = 6) +
facet_wrap(~pred_z_dir, scales = "free_y") +
scale_color_manual(values = c("decrease wins" = "red", "increase wins" = "black")) +
labs(x = "2015",
y = "2016",
title = "Green = 2015 Wins\nBlue = 2016 Wins",
color = "Predicted Win Direction")
```

On the decreasing wins side, notice that all teams from SLN to LAA had more wins in 2015 (green) and then regressed towards the mean (or slightly below the mean) in 2016. From MIN down, those teams actually got even worse in 2016.

On the increasing wins side, from BOS down to PHI all of the teams regressed upward towards the mean (NOTE: regressing upward sounds weird, which is why some refer to this as reversion to the mean). From CHN to BAL, those teams were at or above average in 2015 and got better in 2016.

It makes sense that not all teams revert towards the mean in the second year given that teams attempt to upgrade their roster from one season to the next in order to maximize their chances of winning more games.

Regression to the with linear regression

There are three ways we can evaluate regression to the mean using linear regression and all three approaches will lead us to the same conclusion.

1. Linear regression with the raw data, predicting 2016 wins from 2015 wins.

2. Linear regression predicting 2016 wins from the grand mean centered 2015 wins (grand mean centered just means subtracting the league average wins, 81, from the observed wins for each team).

3. Linear regression using z-scores. This approach will produce results in standard deviation units rather than raw values.

```## linear regression with raw values
fit <- lm(yr_2 ~ yr_1, data = dat)
summary(fit)

## linear regression with grand mean centered 2015 wins
fit_grand_mean <- lm(yr_2 ~ I(yr_1 - mean(yr_1)), data = dat)
summary(fit_grand_mean)

## linear regression with z-score values
fit_z <- lm(z_yr2 ~ z_yr1, data = dat)
summary(fit_z)
```

Next, we take these equations and simply make predictions for 2016 win totals for each team.

```dat\$pred_fit <- fitted(fit)
dat\$pred_fit_grand_mean <- fitted(fit_grand_mean)
dat\$pred_fit_z <- fitted(fit_z) dat %>%
```

You’ll notice that the first two predictions are exactly the same. The third prediction is in standard deviation units. For example, Arizona (ARI) was -0.188 standard deviations below the mean in 2015 and in 2016 they were predicted to regress towards the mean and be -0.102 standard deviations below the mean. However, they actually went the other way and got even worse, finishing the season -1.12 standard deviations below the mean!

We can plot the residuals and see how off our projections where for each team’s 2016 win total.

```par(mfrow = c(2,2))
hist(resid(fit), main = "Linear reg with raw values")
hist(resid(fit_grand_mean), main = "Linear reg with\ngrand mean centered values")
hist(resid(fit_z), main = "Linear reg with z-scored values")
```

The residuals look weird here because we are only dealing with two seasons of data. At the end of this blog article I will run the residual plot for all seasons from 2000 – 2019 to show how the residuals resemble more of a normal distribution.

We can see that there seems to be an extreme outlier that has a -20 win residual. Let’s pull that team out and see who it was.

```dat %>%
filter((yr_2 - pred_fit) < -19)
```

It was Minnesota, who went from an average season in 2015 (83 wins) and dropped all the way to 59 wins in 2016 (-2.05 standard deviations below the mean), something we couldn’t have really predicted.

The average absolute change from year1 to year2 for these teams was 8 wins.

```mean(abs(dat\$yr_2 - dat\$yr_1))
```

Calculating Regression to the mean by hand

To explore the concept more, we can calculate regression toward the mean for the 2016 season by using the year-to-year correlation of team wins, the average wins for the league, and the z-score for each teams wins in 2015. The equation looks like this:

yr2.wins = avg.league.wins + sd_wins * predicted_yr2_z

Where predicted_yr2_z is calculated as:

predicted_yr2_z = (yr_1z * year.to.year.correlation)

We calculated the correlation coefficient above but let’s now store it as its own variable. Additionally we will store the league average team wins and standard deviation in 2015.

```avg_wins <- mean(dat\$yr_1)
sd_wins <- sd(dat\$yr_1)
r <- cor.test(dat\$yr_1, dat\$yr_2)\$estimate

avg_wins
sd_wins
r
```

On average teams won 81 games (which makes sense for a 162 season) with a standard deviation of about 10 games.

Let’s look at Pittsburgh (Pitt)

```dat %>%
filter(teamID == "PIT") %>%
select(yearID:z_yr1) %>%
mutate(predicted_yr2_z = z_yr1 * r,
predicted_yr2_wins = avg_wins + sd_wins * predicted_yr2_z)
```

• In 2015 Pittsburgh had 98 wins, a z-score of 1.63.
• We predict them in 2016 to regress to the mean and have a z-score of 0.881 (90 wins)

Add these by hand regression to the mean predictions to all teams.

```dat <- dat %>%
mutate(predicted_yr2_z = z_yr1 * r,
predicted_yr2_wins = avg_wins + sd_wins * predicted_yr2_z)

dat %>%
```

We see that our by hand calculation produces the same prediction, as it should.

Show the residuals for all seasons from 2000 – 2019

Here, we will filter out the data from the data base for the desired seasons, refit the model, and plot the residuals.

```dat2 <- Teams %>%
select(yearID, teamID, W) %>%
arrange(teamID, yearID) %>%
filter(between(x = yearID,
left = 2000,
right = 2019)) %>%
group_by(teamID) %>%
rename(yr_1 = W) %>%
filter(!is.na(yr_2)) %>%
ungroup()

## Correlation between year 1 and year 2
cor.test(dat2\$yr_1, dat2\$yr_2)

## linear model
fit2 <- lm(yr_2 ~ yr_1, data = dat2)
summary(fit2)

## residual plot
hist(resid(fit2),
main = "Residuals\n(Seasons 2000 - 2019")
```

Now that we have more than a two seasons of data we see a normal distribution of the residuals. The correlation between year1 and year2 in this larger data set was 0.54, the same correlation we saw with the two seasons data.

With this larger data set, the average change in wins from year1 to year2 was 9 (not far from what we saw in the smaller data set above).

```mean(abs(dat2\$yr_2 - dat2\$yr_1))
```

Wrapping Up

Regression to the mean is a common phenomenon in life. It can be difficult for practitioners in sports medicine and strength and conditioning to tease out the effects of regression to the mean when applying a specific training/rehab intervention. Often, regression to the mean fools us into believing that the intervention we did apply has some sort of causal relationship with the observed outcome. This phenomenon is also prevalent in sport when evaluating the performance of individual players and teams from one year to the next. With some simple calculations we can explore what regression to the mean could look like for data in our own setting, providing a compass and some base rates for us to evaluate observations going forward.

All of the code for this blog can be accessed on my GitHub page. If you notice any errors, please reach out!

References

1) Barnett, AG. van der Pols, JC. Dobson, AJ. (2005). Regression to the mean: what it is and how to deal with it. International Journal of Epidemiology; 34: 215-220.

2) Schall, T. Smith, G. Do baseball players regress toward the mean? The American Statistician; 54(4): 231-235.

# Mixed Models in Sport Science – Frequentist & Bayesian

Intro

Tim Newans and colleagues recently published a nice paper discussing the value of mixed models compared to repeated measures ANOVA in sport science research (1). I thought the paper highlighted some key points, though I do disagree slightly with the notion that sport science research hasn’t adopted mixed models, as I feel like they have been relatively common in the field over the past decade. That said, I wanted to do a blog that goes a bit further into mixed models because I felt like, while the aim of the paper was to show their value compared with repeated measures ANOVA, there are some interesting aspects of mixed models that weren’t touched upon in the manuscript. In addition to building up several mixed models, it might be fun to extend the final model to a Bayesian mixed model to show the parallels between the two and some of the additional things that we can learn with posterior distributions. The data used by Newans et al. had independent variables that were categorical, level of competition and playing position. The data I will use here is slightly different in that the independent variable is a continuous variable, but the concepts still apply.

Obtain the Data and Perform EDA

For this tutorial, I’ll use the convenient sleepstudy data set in the {lme4} package. This study is a series of repeated observations of reaction time on subjects that were being sleep deprived over 10 days. This sort of repeated nature of data is not uncommon in sport science as serial measurements (e.g., GPS, RPE, Wellness, HRV, etc) are frequently recorded in the applied environment. Additionally, many sport and exercise studies collect time point data following a specific intervention. For example, measurements of creatine-kinase or force plate jumps at intervals of 0, 24, and 48 hours following a damaging bout of exercise or competition.

```knitr::opts_chunk\$set(echo = TRUE)
library(tidyverse)
library(lme4)
library(broom)
library(gt)
library(patchwork)
library(arm)

theme_set(theme_bw())

dat <- sleepstudy dat %>% head()
```

The goal here is to obtain an estimate about the change in reaction time (longer reaction time means getting worse) following a number of days of sleep deprivation. Let’s get a plot of the average change in reaction time for each day of sleep deprivation.

```dat %>%
group_by(Days) %>%
summarize(N = n(),
avg = mean(Reaction),
se = sd(Reaction) / sqrt(N)) %>%
ggplot(aes(x = Days, y = avg)) +
geom_ribbon(aes(ymin = avg - 1.96*se, ymax = avg + 1.96*se),
fill = "light grey",
alpha = 0.8) +
geom_line(size = 1.2) +
scale_x_continuous(labels = seq(from = 0, to = 9, by = 1),
breaks = seq(from = 0, to = 9, by = 1)) +
labs(x = "Days of Sleep Deprivation",
y = "Average Reaction Time",
title = "Reaction Time ~ Days of Sleep Deprivation",
subtitle = "Mean ± 95% CI")
```

Okay, we can clearly see that something is going on here. As the days of sleep deprivation increase the reaction time in the test is also increasing, a fairly intuitive finding.

However, we also know that we are dealing with repeated measures on a number of subjects. As such, some subjects might have differences that vary from the group. For example, some might have a worse effect than the population average while others might not be that impacted at all. Let’s tease this out visually.

```dat %>%
ggplot(aes(x = Days, y = Reaction)) +
geom_line(size = 1) +
geom_point(shape = 21,
size = 2,
color = "black",
fill = "white") +
geom_smooth(method = "lm",
se = FALSE) +
facet_wrap(~Subject) +
scale_x_continuous(labels = seq(from = 0, to = 9, by = 3),
breaks = seq(from = 0, to = 9, by = 3)) +
labs(x = "Days of Sleep Deprivation",
y = "Average Reaction Time",
title = "Reaction Time ~ Days of Sleep Deprivation")
```

This is interesting. While we do see that many of the subjects exhibit an increase reaction time as the number of days of sleep deprivation increase there are a few subjects that behave differently from the norm. For example, Subject 309 seems to have no real change in reaction time while Subject 335 is showing a slight negative trend!

Linear Model

Before we get into modeling data specific to the individuals in the study, let’s just build a simple linear model. This is technically “the wrong” model because it violates assumptions of independence. After all, repeated measures on each individual means that there is going to be a level of correlation from one test to the next for any one individual. That said, building a simple model, even if it is the wrong model, is sometimes a good place to start just to help get your bearings, understand what is there, and have a benchmark to compare future models against. Additionally, the simple model is easier to interpret, so it is a good first step.

```fit_ols <- lm(Reaction ~ Days, data = dat)
summary(fit_ols)
```

We end up with a model that suggests that for every one unit increase in a day of sleep deprivation, on average, we see about a 10.5 second increase in reaction time. The coefficient for the Days variable also has a standard error of 1.238 and the model r-squared indicates that days of sleep deprivation explain about 28.7% of the variance in reaction time.

Let’s get the confidence intervals for the model intercept and Days coefficient.

```confint(fit_ols)
```

Summary Measures Approach

Before moving to the mixed model approach, I want to touch on simple approach that was written about in 1990 by Matthews and Altman (2). I was introduced to this approach by one of my former PhD supervisors, Matt Weston, as he used it to analyze data in 2011 for a paper with my former lead PhD supervisor, Barry Drust, where they were quantifying the intensity of Premier League match-play for players and referees. This approach is extremely simple and, while you may be asking yourself, “Why not just jump into the mixed model and get on with it?”, just bear with me for a moment because this will make sense later when we begin discussing pooling effects in mixed models and Bayesian mixed models.

The basic approach suggested by Matthews (2) for this type of serially collected data is to treat each individual as the unit of measure and identify a single number, which summarizes that subject’s response over time. For our analysis here, the variable that we are interested in for each subject is the Days slope coefficient, as this value tells us the rate of change in reaction time for every passing day of sleep deprivation. Let’s construct a linear regression for each individual subject in the study and place their slope coefficients into a table.

```ind_reg <- dat %>%
group_by(Subject) %>%
group_modify(~ tidy(lm(Reaction ~ Days, data = .x)))

ind_days_slope <- ind_reg %>%
filter(term == "Days")

ind_days_slope %>%
ungroup() %>%
gt() %>%
fmt_number(columns = estimate:statistic,
decimals = 2) %>%
fmt_number(columns = p.value,
decimals = 3)
```

As we noticed in our visual inspection, Subject 335 had a negative trend and we can confirm their negative slope coefficient (-2.88). Subject 309 had a relatively flat relationship between days of sleep deprivation and reaction time. Here we see that their coefficient is 2.26 however the standard error, 0.98, indicates rather large uncertainty [95% CI: 0.3 to 4.22].

Now we can look at how each Subject’s response compares to the population. If we take the average of all of the slopes we get basically the same value that we got from our OLS model above, 10.47. I’ll build two figures, one that shows each Subject’s difference from the population average of 10.47 and one that shows each Subject’s difference from the population centered at 0 (being no difference from the population average). Both tell the same story, but offer different ways of visualizing the subjects relative to the population.

```pop_avg

plt_to_avg <- ind_days_slope %>%
mutate(pop_avg = pop_avg,
diff = estimate - pop_avg) %>%
ggplot(aes(x = estimate, y = as.factor(Subject))) +
geom_vline(xintercept = pop_avg) +
geom_segment(aes(x = diff + pop_avg,
xend = pop_avg,
y = Subject,
yend = Subject),
size = 1.2) +
geom_point(size = 4) +
labs(x = NULL,
y = "Subject",
title = "Difference compared to population\naverage change in reaction time (10.47 sec)")

plt_to_zero <- ind_days_slope %>%
mutate(pop_avg = pop_avg,
diff = estimate - pop_avg) %>%
ggplot(aes(x = diff, y = as.factor(Subject))) +
geom_vline(xintercept = 0) +
geom_segment(aes(x = 0,
xend = diff,
y = Subject,
yend = Subject),
size = 1.2) +
geom_point(size = 4) +
labs(x = NULL,
y = "Subject",
title = "Difference compared to population\naverage reflected against a 0 change")

plt_to_avg | plt_to_zero
```

From the plots, it is clear to see how each individual behaves compared to the population average. Now let’s build some mixed models and compare the results.

Mixed Models

The aim of the mixed model here is to in someway acknowledge that we have these repeated measures on each individual and we need to account for that. In the simple linear model approach above, the Subjects shared the same variance, which isn’t accurate given that each subject behaves slightly differently, which we saw in our initial plot and in the individual linear regression models we constructed in the prior section. Our goal is to build a model that allows us to make an inference about the way in which the amount of sleep deprivation, measured in days, impacts a human’s reaction time performance. Therefore, we wouldn’t want to add each subject into a single regression model, creating a coefficient for each individual within the model, as that would be akin to making comparisons between each individual similar to what we would do in an ANOVA (and it will also use a lot of degrees of freedom). So, we want to acknowledge that there are individual subjects that may vary from the population, while also modeling our question of interest (Reaction Time ~ Days of Sleep Deprivation).

Intercept Only Model

We begin with a simple model that has an intercept only but allows that intercept to vary by subject. As such, this model is not accounting for days of sleep deprivation. Rather, it is simply telling us the average reaction time response for the group and how each subject varies from that response.

```fit1 <- lmer(Reaction ~ 1 + (1|Subject), data = dat)
display(fit1)
```

The output provides us with an average estimate for reaction time of 298.51. Again, this is simply the group average for reaction time. We can confirm this easily.

```mean(dat\$Reaction)
```

The Error terms in our output above provide us with the standard deviation around the population intercept, 35.75, which is the between-subject variability. The residual here represents our within-subject standard variability. We can extract the actual values for each subject. Using the ranef() function we get the difference between each subject and the population average intercept (the fixed effect intercept).

```ranef(fit1)
```

One Fixed Effect plus Random Intercept Model

We will now extend the above mixed model to reflect our simple linear regression model except with random effects specified for each subject. Again, the only thing we are allowing to vary here is the individual subject’s intercept for reaction time.

```fit2 <- lmer(Reaction ~ Days + (1|Subject), data = dat)
display(fit2)
```

Adding the independent variable, Days, has changed the intercept, decreasing it from 298.51 to 251.41. The random effects have also changed from the first model. In the intercept only model we had an intercept standard deviation of 35.75 with a residual standard deviation of 44.26. In this model, accounting for days of sleep deprivation, the random effects intercept is now 37.12 and the residual (within subject) standard deviation has dropped to 30.99. The decline in the residual standard deviation tells us the model is picking up some of the individual differences that we have observed in plot of each subjects’ data.

Let’s look at the random effects intercept for each individual relative to the fixed effect intercept.

```ranef(fit2)
```

For example, Subject 309 has an intercept that is 77.8 seconds below the fixed effect intercept, while Subject 308 is 40.8 seconds above the fixed effect intercept.

If we use the coef() function we get returned the actual individual linear regression equation for each subject. Their difference compared to the population will be added to the fixed effect intercept, to create their individual intercept, and we will see the days coefficient, which is the same for each subject because we haven’t specified a varying slope model (yet).

```coef(fit2)
```

So, for example, the equation for Subject 308 has an equation of:

292.19 + 10.47*Days

Let’s also compare the results of this model to our original simple regression model. I’ll put all of the model results so far into a single data frame that we can keep adding to as we go, allowing us to compare the results.

We can see that once we add the day variable to the model (in fit_ols, and fit2) the model intercept for reaction time changes to reflect its output being dependent on this information. Notice that the intercept and slope values between the simple linear regression model and the fit2 model are the same. The mean result hasn’t changed but notice that the standard error for the intercept and slope as been altered. In particular, notice how much the standard error of the slope variable has declined once we moved from a simple regression to a mixed model. This is due do the fact that the model now knows that there are individuals with repeated measurements being recorded.

Looking at the random effects changes between fit1 (intercept only model) and fit2, we see that we’ve increased in standard deviation for the intercept (35.75 to 37.12), which represents the between-subject variability. Additionally, we have substantially decreased the residual standard deviation (44.26 to 30.99), which represents our within-subject variability. Again, because we have not explicitly told the model that certain measurements are coming from certain individuals, there appears to be more variability between subjects and less variability within-subject. Basically, there is some level of autocorrelation to the individual’s reaction time performance whereby they are more similar to themselves than they are to other people. This makes intuitive sense — if you can’t agree with yourself, who can you agree with?!

Random Slope and Intercept Model

The final mixed model we will build will allow not only the intercept to vary randomly between subjects but also the slope. Recall above that the coefficient for Days was the same across all subjects. This is because that model allowed the subjects to have different model intercepts but it made the assumption that they had the same model slope. However, we may have reason to believe that the slopes also differ between subjects after looking at the individual subject plots in section 1.

```fit3 <- lmer(Reaction ~ Days + (1 + Days|Subject), data = dat)
display(fit3)
```

Let’s take a look at the regression model for each individual.

```coef(fit3)
```

Now we see that the coefficient for the Days variable is different for each Subject. Let’s add the results of this model to our results data frame and compare everything.

Looking at this latest model we see that the intercept and slope coefficients have remained unchanged relative to the other models. Again, the only difference in fixed effects comes at the standard error for the intercept and the slope. This is because the variance in the data is being partitioned between the population estimate fixed effects and the individual random effects. Notice that for the random effects in this model, fit3, we see a substantial decline in the between-subject intercept, down to 24.74 from the mid 30’s in the previous two models. We also see a substantial decline the random effect residual, because we are now seeing less within individual variability as we account for the random slope. The addition here is the random effect standard deviation for the slope coefficient, which is 5.92.

We can plot the results of the varying intercepts and slopes across subjects using the {lattice} package.

```lattice::dotplot(ranef(fit3, condVar = T))
```

Comparing the models

To make model comparisons, Newans and colleagues used Akaike Information Criterion (AIC) to determine which model fit the data best. With the anova() function we simply pass our three mixed models in as arguments and we get returned several model comparison metrics including AIC, BIC, and p-values. These values are lowest for fit3, indicating it is the better model at explaining the underlying data. Which shouldn’t be too surprising given how individual the responses looked in the initial plot of the data.

```anova(fit1, fit2, fit3)
```

We can also plot the residuals of fit3 to see whether they violate any assumptions of homoscedasticity or normality.

```plot(fit3)

par(mfrow = c(1,2))
hist(resid(fit3))
qqnorm(resid(fit3))
qqline(resid(fit3), col = "red", lwd = 3)
```

Pooling Effects

So what’s going on here? The mixed model is really helping us account for the repeated measures and the correlated nature of an individuals data. In doing so, it allows us to make an estimate of a population effect (fixed effect), while adjusting the standard errors to be more reasonable given the violation of independence assumption. The random effects allow us to see how much variance there is for each individual measured within the population. In a way, I tend to think about mixed models as a sort of bridge to Bayesian analysis. In my mind, the fixed effects seem to look a bit like priors and the random effects are the adjustment we make to those priors having seen some data for an individual. In our example here, each of the 18 subjects have 10 reaction time measurements. If we were dealing with a data set that had different numbers of observations for each individual, however, we would see that those with less observations are pulled closer towards the population average (the fixed effect coefficients) while those with a larger number of observations are allowed to deviate further from the population average, if they indeed are different. This is because with more observations we have more confidence about what we are observing for that individual. In effect, this is called pooling.

In their book, Data Analysis Using Regression and Multilevel/Hierarchical Models, Gelman and Hill (4) discuss no-, complete-, and partial-pooling models.

• No-pooling models occur when the data from each subject are analyzed separately. This model ignores information in the data and could lead to poor inference.
• Complete-pooling disregards any variation that might occur between subjects. Such suppression of between-subject variance is missing the point of conducting the analysis and looking at individual difference.
• Partial-pooling strikes a balance between the two, allowing for some pooling of population data (fixed effects) while honoring the fact that there are individual variations occurring within the repeated measurements. This type of analysis is what we obtain with a mixed model.

I find it easiest to understand this by visualizing it. We already have a complete-pooling model (fit_ols) and a partial-pooling model (fit3). We need to fit the no-pooling model, which is a regression equation with each subject entered as a fixed effect. As a technical note, I will also add to the model -1 to have a coefficient for each subject returned instead of a model intercept.

```fit_no_pool <- lm(Reaction ~ Days + Subject - 1, data = dat)
summary(fit_no_pool)
```

To keep the visual small, we will fit each of our three models to 4 subjects (308, 309, 335, and 331) and then plot the respective regression lines. In addition, I will also plot the regression line from the individualized regression/summary measures approach that we built first, just to show the difference.

```## build a data frame for predictions to be stored
Subject <- as.factor(c(308, 309, 335, 331))
Days <- seq(from = 0, to = 9, by = 1)
pred_df <- crossing(Subject, Days)
pred_df

## complete pooling predictions
pred_df\$complete_pool_pred <- predict(fit_ols, newdata = pred_df)

## no pooling predictions
pred_df\$no_pool_pred <- predict(fit_no_pool, newdata = pred_df)

## summary measures/individual regression
subject_coefs <- ind_reg %>%
filter(Subject %in% unique(pred_df\$Subject)) %>%
dplyr::select(Subject, term, estimate) %>%
mutate(term = ifelse(term == "(Intercept)", "Intercept", "Days")) %>%
pivot_wider(names_from = term,
values_from = estimate)

subject_coefs %>%
as.data.frame()

pred_df <- pred_df %>%
mutate(ind_reg_pred = ifelse(
Subject == 308, 244.19 + 21.8*Days,
ifelse(
Subject == 309, 205.05 + 2.26*Days,
ifelse(
Subject == 331, 285.74 + 5.27*Days,
ifelse(Subject == 335, 263.03 - 2.88*Days, NA)
)
)
))

## partial pooling predictions
pred_df\$partial_pool_pred <- predict(fit3,
newdata = pred_df,
re.form = ~(1 + Days|Subject))

## get original results and add to the predicted data frame
subject_obs_data <- dat %>%
filter(Subject %in% unique(pred_df\$Subject)) %>%
dplyr::select(Subject, Days, Reaction)

pred_df <- pred_df %>%
left_join(subject_obs_data)

## final predicted data set with original observations
pred_df %>%

## plot results
pred_df %>%
pivot_longer(cols = complete_pool_pred:partial_pool_pred,
names_to = "model_pred") %>%
arrange(model_pred) %>%
ggplot(aes(x = Days, y = Reaction)) +
geom_point(size = 4,
shape = 21,
color = "black",
fill = "white") +
geom_line(aes(y = value, color = model_pred),
size = 1.1) +
facet_wrap(~Subject)
```

Let’s unpack this a bit:

• The complete-pooling line (orange) has the same intercept and slope for each of the subjects. Clearly it does not fit the data well for each subject.
• The no-pooling line (dark green) has the same slope for each subject but we notice that the intercept varies as it is higher for some subject than others.
• The partial-pooling line (purple) comes from the mixed model and we can see that it has a slope and intercept that is unique to each subject.
• Finally, the independent regression/summary measures line (light green) is similar to the partial-pooling line, as a bespoke regression model was built for each subject. Note, that in this example the two lines have very little difference but, if we were dealing with subjects who had varying levels of sample size, this would not be the case. The reason is because those with lower sample size will have an independent regression line completely estimated from their observed data, however, their partial pooling line will be pulled more towards the population average, given their lower sample.

Bayesian Mixed Models

Since I mentioned that I tend to think of mixed models as sort of a bridge to the Bayesian universe, let’s go ahead at turn our mixed model into a Bayes model and see what else we can do with it.

I’ll keep things simple here and allow the {rstanarm} library to use the default, weakly informative priors.

```library(rstanarm)

fit_bayes <- stan_lmer(Reaction ~ Days + (1 + Days|Subject), data = dat)
fit_bayes

# If you want to extract the mean, sd and 95% Credible Intervals
# summary(fit_bayes,
#         probs = c(0.025, 0.975),
#         digits = 2)
```

Let’s add the model output to our comparison table.

We see some slight changes between fit3 and fit_bayes, but nothing that drastic.

Let’s see what the posterior draws for all of the parameters in our model look like.

```# Extract the posterior draws for all parameters
post_sims <- as.matrix(fit_bayes)
dim(post_sims)
```

Yikes! That is a large matrix of numbers (4000 rows x 42 columns). Each subject gets, by default, 4000 draws from the posterior and each column represents a posterior sample from each of the different parameters in the model.

What if we slim this down to see what is going on? Let’s see what possible parameters in our matrix are in our matrix we might be interested in extracting.

```colnames(post_sims)
```

The first two columns are the fixed effect intercept and slope. Following that, we see that each subject has an intercept value and a Days value, coming from the random effects. Finally, we see that column names 39 to 42 are specific to the sigma values of the model.

Let’s keep things simple and see what we can do with the individual subject intercepts. Basically, we want to extract the posterior distribution of the fixed effects intercept and the posterior distribution of the random effects intercept per subject and combine those to reflect the posterior distribution of reaction time by subject.

```## posterior draw from the fixed effects intercept
fixed_intercept_sims <- as.matrix(fit_bayes,
pars = "(Intercept)")

## posterior draws from the individual subject intercepts
subject_ranef_intercept_sims <- as.matrix(fit_bayes,
regex_pars = "b\\[\\(Intercept\\) Subject\\:")

## combine the posterior draws of the fixed effects intercept with each individual
posterior_intercept_sims <- as.numeric(fixed_intercept_sims) + subject_ranef_intercept_sims
```

After drawing our posterior intercepts, we can get the mean, standard deviation, median, and 95% Credible Interval of the 4000 simulations for each subject and then graph them in a caterpillar plot.

```## mean per subject
intercept_mean <- colMeans(posterior_intercept_sims)

## sd per subject
intercept_sd <- apply(X = posterior_intercept_sims,
MARGIN = 2,
FUN = sd)

## median and 95% credible interval per subject
intercept_ci <- apply(X = posterior_intercept_sims,
MARGIN = 2,
FUN = quantile,
probs = c(0.025, 0.50, 0.975))

## summary results in a single data frame
intercept_ci_df <- data.frame(t(intercept_ci))
names(intercept_ci_df) <- c("x2.5", "x50", "x97.5")

## Combine summary statistics of posterior simulation draws
bayes_df <- data.frame(intercept_mean, intercept_sd, intercept_ci_df)

## Create a column for each subject's ID
bayes_df\$subject <- rownames(bayes_df)
bayes_df\$subject <- extract_numeric(bayes_df\$subject)

## Catepillar plot
ggplot(data = bayes_df,
aes(x = reorder(as.factor(subject), intercept_mean),
y = intercept_mean)) +
geom_pointrange(aes(ymin = x2.5,
ymax = x97.5),
position = position_jitter(width = 0.1,
height = 0)) +
geom_hline(yintercept = mean(bayes_df\$intercept_mean),
size = 0.5,
col = "red") +
labs(x = "Subject",
y = "Reaction Time",
title = "Reaction Time per Subject",
subtitle = "Mean ± 95% Credible Interval")
```

We can also use the posterior distributions to compare two individuals. For example, let’s compare Subject 308 (column 1 of our posterior sim matrix) to Subject 309 (column 2 of our posterior sim matrix).

```## create a difference between distributions
compare_dist <- posterior_intercept_sims[, 1] - posterior_intercept_sims[, 2]

# summary statistics of the difference
mean_diff <- mean(compare_dist)
sd_diff <- sd(compare_dist)

quantile_diff <- quantile(compare_dist, probs = c(0.025, 0.50, 0.975))
quantile_diff <- data.frame(t(quantile_diff))
names(quantile_diff) <- c("x2.5", "x50", "x97.5")

diff_df <- data.frame(mean_diff, sd_diff, quantile_diff)
round(diff_df, 2)
```

Subject 308 exhibits a 38.7 second higher reaction time than Subject 309 [95% Credible Interval 2.9 to 74.4].

We can plot the posterior differences as well.

```# Histogram of the differences
hist(compare_dist,
main = "Posterior Distribution Comparison\n(Subject 308 - Subject 309)",
xlab = "Difference in 4000 posterior simulations")
abline(v = 0,
col = "red",
lty = 2,
lwd = 2)
abline(v = mean_diff,
col = "black",
lwd = 2)
abline(v = quantile_diff\$x2.5,
col = "black",
lty = 2,
lwd = 2)
abline(v = quantile_diff\$x97.5,
col = "black",
lty = 2,
lwd = 2)
```

We can also use the posterior distributions to make a probabilistic statement about how often, in the 4000 posterior draws, Subject 308 had a higher reaction time than Subject 309.

```prop.table(table(posterior_intercept_sims[, 1] > posterior_intercept_sims[, 2]))
```

Here we see that mean posterior probability that Subject 308 has a higher reaction time than Subject 309 is 98.3%.

Of course we could (and should) go through and do a similar work up for the slope coefficient for the Days variable. However, this is getting long, so I’ll leave you to try that out on your own.

Wrapping Up

Mixed models are an interesting way of handling data consisting of multiple measurements taken on different individuals, as common in sport science. Thanks to Newans et al (1) for sharing their insights into these models. Hopefully this blog was useful in explaining a little bit about how mixed models work and how extending them to a Bayesian framework offers some additional ways of looking at the data. Just note that if you run the code on your own, the Bayesian results might be slightly different than mine as the Monte Carlo component of the Bayesian model produces some randomness in each simulation (and I didn’t set a seed prior to running the model).

The entire code is available on my GitHub page.

If you notice any errors in the code, please reach out!

References

1. Newans T, Bellinger P, Drovandi C, Buxton S, Minahan C. (2022). The utility of mixed models in sport science: A call for further adoption in longitudinal data sets. Int J Sports Phys Perf. Published ahead of print.

2. Matthews JNS, Altman DG, Campbell MJ, Royston P. (1990). Analysis of serial measurements in medical research. Br Med J; 300: 230-235.

3. Weston M, Drust B, Gregson W. (2011). Intensities of exercise during match-play in FA Premier League Referees and players. J Sports Sc; 29(5): 527-532.

4. Gelman A, Hill J. (2009). Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press.

# Making Predictions with a Bayesian Regression Model

One of my favorite podcasts is Wharton Moneyball. I listen every week, usually during my weekly long run, and I never miss an episode. This past week the hosts were discussing an email they received from a listener, a medical doctor, who encouraged them add a disclaimer before their COVID discussions because he felt that some listeners may interpret their words as medical advice. This turned into a conversation amongst the hosts of the show about how they are reading and interpreting the stats within the COVID studies and an explanation of the difference between the average population effect and an effect for a single individual within a population are two very different things.

The discussion made me think a lot about the difference between nomothetic (group-based) research and idographic (individual person) research, which myself and some colleagues discussed in a 2017 paper in the International Journal of Sports Physiology and Performance, Putting the “I” back in team. It also made me think about something Gelman and colleagues discussed in their brilliant book, Regression and Other Stories. In Chapter 9, the authors’ discussion Prediction & Bayesian Inference and detail three types of predictions we may seek to make from our Bayesian regression model:

1. A point prediction
2. A point prediction with uncertainty
3. A predictive distribution for a new observation in the population

The first two points are directed at the population average and seek to answer the question, “What is the average prediction, y, in the population for someone exhibiting variables x?” and, “How much uncertainty is there around the average population prediction?” Point 3 is a little more interesting and also one of the valuable aspects of Bayesian analysis. Here, we are attempting to move away from the population and say something specific about an individual within the population. Of course, making a statement about an individual within a population will come with a large amount of uncertainty, which we can explore more specifically with our Bayes model by plotting a distribution of posterior predictions.

The Data

We will use the Palmer Penguins data, from the {palmerpenguis} package in R. To keep things simple, we will deal with the Adelie species and build a simple regression model with the independent variable bill_length_mm and dependent variable bill_depth_mm.

Let’s quickly look at the data.

```knitr::opts_chunk\$set(echo = TRUE)
library(tidyverse)
library(palmerpenguins)
library(rstanarm)

theme_set(theme_classic())

dat <- na.omit(penguins)
select(bill_length_mm, bill_depth_mm)

ggplot(aes(x = bill_length_mm, y = bill_depth_mm)) +
geom_smooth(method = "lm",
size = 2,
color = "black",
se = FALSE) +
geom_point(size = 5,
shape = 21,
fill = "grey",
color = "black") +
labs(x = "Bill Length (mm)",
y = "Bill Depth (mm)",
title = "Bill Depth ~ Bill Length",
```

The Model

Since I have no real prior knowledge about the bill length or bill depth of penguins, I’ll stick with the default priors provided by {rstanarm}.

```fit <- stan_glm(bill_depth_mm ~ bill_length_mm, data = adelie)
summary(fit)
```

Making predictions on a new penguin

Let’s say we observe a new Adelie penguin with a bill length of 41.3 and we want to predict what the bill depth would be. There are two ways to go about this using {rstanarm}. The first is to use the built in functions for the {rstanarm} package. The second is to extract posterior samples from the {rstanarm} model fit and build our distribution from there. We will take each of the above three prediction types in turn, using the built in functions, and then finish by extracting posterior samples and confirm what we obtained with the built in functions using the full distribution.

```new_bird <- data.frame(bill_length_mm = 41.3)
```

1. Point Prediction

Here, we want to know the average bill depth in the population for an Adelie penguin with a bill length of 41.3mm. We can obtain this with the predict() function or we can extract out the coefficients from our model and perform the linear equation ourselves. Let’s do both!

```# predict() function
predict(fit, newdata = new_bird)

# linear equation by hand
intercept <- broom.mixed::tidy(fit)[1, 2]
bill_length_coef <- broom.mixed::tidy(fit)[2, 2]

intercept + bill_length_coef * new_bird\$bill_length_mm
```

We predict an Adelie with a bill length of 41.3 to have, on average, a bill depth of 18.8. Let’s put that point in our plot to where it falls with the rest of the data.

```adelie %>%
ggplot(aes(x = bill_length_mm, y = bill_depth_mm)) +
geom_smooth(method = "lm",
size = 2,
color = "black",
se = FALSE) +
geom_point(size = 5,
shape = 21,
fill = "grey",
color = "black") +
geom_point(aes(x = 41.3, y = 18.8),
size = 5,
shape = 21,
fill = "palegreen",
color = "black") +
labs(x = "Bill Length (mm)",
y = "Bill Depth (mm)",
title = "Bill Depth ~ Bill Length",
```

There it is, in green! Our new point for a bill length of 41.3 falls smack on top of the linear regression line, the population average predicted bill depth given this bill length. That’s awful precise! Surely there has to be some uncertainty around this new point, right?

2. Point prediction with uncertainty

To obtain the uncertainty around the predicted point estimate we use the posterior_linpred() function.

```new_bird_pred_pop <- posterior_linpred(fit, newdata = new_bird)

hist(new_bird_pred_pop)

mean(new_bird_pred_pop)
sd(new_bird_pred_pop)
qnorm(p = c(0.025, 0.975), mean = mean(new_bird_pred_pop), sd(new_bird_pred_pop))
```

What posterior_linpred() produced is a vector of posterior draws (predictions) for our new bird. This allowed us to visualize a distribution of potential bill depths. Additionally, we can take that vector of posterior draws and find that we predict an Adelie penguin with a bill length of 41.3 mm to have a bill depth of 18.8 mm, the same value we obtained in our point estimate prediction, with a 95% credible interval between 18.5 and 19.0.

Both of these approaches are still working at the population level. What if we want to get down to an individual level and make a prediction of bill depth for a specific penguin in the population? Given that individuals within a population will have a number of factors that make them unique, we need to assume more uncertainty.

3. A predictive distribution for a new observation in the population

To obtain a prediction with uncertainty at the individual level, we use the posterior_predict() function. This function will produce a vector of uncertainty that is much larger than what we saw above, as it is using the model error in the prediction.

```new_bird_pred_ind <- posterior_predict(fit, newdata = new_bird)

hist(new_bird_pred_ind,
xlab = "Bill Depth (mm)",
main = "Distribution of Predicted Bill Depths\nfor a New Penguin with a Bill Length of 41.3mm")
abline(v = mean(new_bird_pred_ind[,1]),
col = "red",
lwd = 6,
lty = 2)

mean(new_bird_pred_ind)
sd(new_bird_pred_ind)
mean(new_bird_pred_ind) + qnorm(p = c(0.025, 0.975)) * sd(new_bird_pred_ind)

```

Similar to the prediction and uncertainty for the average in the population, we can extract the mean predicted value with 95% credible intervals for the new bird. As explained previously, the uncertainty is larger that estimating a population value. Here, we have a mean prediction for bill depth of 18.8 mm, the same as we obtained in the population example. Our 95% Credible, however, has increased to a range of potential values between 16.6 and 21.0 mm.

Let’s visualize this new point with its uncertainty together with the original data.

```new_df <- data.frame( bill_length_mm = 41.3, bill_depth_mm = 18.8, low = 16.6, high = 21.0 ) adelie %>%
ggplot(aes(x = bill_length_mm, y = bill_depth_mm)) +
geom_smooth(method = "lm",
size = 2,
color = "black",
se = FALSE) +
geom_point(size = 5,
shape = 21,
fill = "grey",
color = "black") +
geom_errorbar(aes(ymin = low, ymax = high),
data = new_df,
linetype = "dashed",
color = "red",
width = 0,
size = 2) +
geom_point(aes(x = bill_length_mm, y = bill_depth_mm),
data = new_df,
size = 5,
shape = 21,
fill = "palegreen",
color = "black") +
labs(x = "Bill Length (mm)",
y = "Bill Depth (mm)",
title = "Bill Depth ~ Bill Length",
```

Notice how much uncertainty we now have (red dashed errorbar) in our prediction!

Okay, so what is going on here? To unpack this, let’s pull out samples from the posterior distribution.

Extract samples from the posterior distribution

We extract our samples using the as.matrix() function. This produced 4000 random samples of the intercept, bill_length_mm coefficient, and the sigma (error). I’ve also summarize the mean for each of these three values below. Notice that the mean across all of the samples are the same values we obtained in the summary output of our model fit.

```posterior_samp <- as.matrix(fit)
nrow(posterior_samp)

colMeans(posterior_samp)
```

We can visualize uncertainty around all three model parameters and also plot the original data and the regression line from our samples.

```par(mfrow = c(2,2))
hist(posterior_samp[,1], main = 'intercept',
xlab = "Model Intercept")
hist(posterior_samp[,2], main = 'beta coefficient',
xlab = "Bill Length Coefficient")
hist(posterior_samp[,3], main = 'model sigma',
xlab = "sigma")
pch = 19,
col = 'grey',
xlab = "Bill Length (mm)",
ylab = "Bill Depth (mm)",
main = "Bill Depth ~ Bill Length")
abline(a = mean(posterior_samp[, 1]),
b = mean(posterior_samp[, 2]),
col = "red",
lwd = 3,
lty = 2)
```

Let’s make a point prediction for the average bill depth in the population based on the bill length of 41.3mm from our new bird and confirm those results with what we obtained with the predict() function.

```intercept_samp <- colMeans(posterior_samp)[1]
bill_length_coef_samp <- colMeans(posterior_samp)[2]

intercept_samp + bill_length_coef_samp * new_bird\$bill_length_mm

# confirm with the results from the predict() function
predict(fit, newdata = new_bird)
```

Next, we can make a population prediction with uncertainty, which will be the standard error around the population mean prediction. These results produce a mean and standard deviation for the predicted response. We confirm our results to those we obtained with the posterior_linpred() function above.

```intercept_vector <- posterior_samp[, 1]
beta_coef_vector <- posterior_samp[, 2]

pred_vector <- intercept_vector + beta_coef_vector * new_bird\$bill_length_mm

## Get summary statistics for the population prediction with uncertainty
mean(pred_vector)
sd(pred_vector)
qnorm(p = c(0.025, 0.975), mean = mean(pred_vector), sd(pred_vector))

## confirm with the results from the posterior_linpred() function
mean(new_bird_pred_pop)
sd(new_bird_pred_pop)
qnorm(p = c(0.025, 0.975), mean = mean(new_bird_pred_pop), sd(new_bird_pred_pop))
```

Finally, we can use the samples from our posterior distribution to predict the bill depth for an individual within the population, obtaining a full distribution to summarize our uncertainty. We will compare this with the results obtained from the posterior_predict() function.

To make this work, we use the intercept and beta coefficient vectors we produced above for the population prediction with uncertainty. However, in the above example the uncertainty was the standard error of the mean for bill depth. Here, we need to obtain a third vector, the vector of sigma values from our posterior distribution samples. Using that sigma vector we will add uncertainty to our predictions by taking a random sample from a normal distribution with a mean of 0 and a standard deviation of the sigma values.

```sigma_samples <- posterior_samp[, 3]
n_samples <- length(sigma_samples)

individual_pred <- intercept_vector + beta_coef_vector * new_bird\$bill_length_mm + rnorm(n = n_samples, mean = 0, sd = sigma_samples)

## summary statistics
mean(individual_pred)
sd(individual_pred)
mean(individual_pred) + qnorm(p = c(0.025, 0.975)) * sd(individual_pred)

## confirm results obtained from the posterior_predict() function
mean(new_bird_pred_ind)
sd(new_bird_pred_ind)
mean(new_bird_pred_ind) + qnorm(p = c(0.025, 0.975)) * sd(new_bird_pred_ind)
```

We obtain nearly the exact same results that we did with the posterior_predict() function aside form some rounding differences. This occurs because the error for the prediction is using a random number generator with mean 0 and standard deviation of the sigma values, so the results are not exactly the same every time.

Wrapping Up

So there you have it, three types of predictions we can obtain from a Bayesian regression model. You can easily obtain these from the designed functions from the {rstanarm} package or you can extract a sample from the posterior distribution and make the predictions yourself as well as create visualizations of model uncertainty.

You can access the full code on my GitHub Page. In addition to what is above, I’ve also added a section that recreates the above analysis using the {brms} package, which is another package available for fitting Bayesian models in R.