Category Archives: Bayesian Model Building

Frequentist & Bayesian Approaches to Regression Models by Hand – Compare and Contrast

One of the ways I try to learn things is to code them from first principles. It helps me see what is going on under the hood and also allows me to wrap my head around how things work. Building regression models in R is incredibly easy using the lm() function, and Bayesian regression models can be conveniently built with the same syntax in packages like {rstanarm} and {brms}. However, today I’m going to do everything by hand to get a grasp of how the Bayesian regression model works, at a very basic level. I put together the code using the mathematical presentation of these concepts in William Bolstad’s book, Introduction to Bayesian Statistics.

The entire script is available on my GITHUB page.

Loading Packages & Getting Data

I’m going to use the data from the {palmerpenguins} package and concentrate on a simple linear regression which will estimate flipper length from bill length.

### packages ------------------------------------------------------
library(tidyverse)
library(palmerpenguins)
library(patchwork)

theme_set(theme_classic())

### data ----------------------------------------------------------
data("penguins")
dat <- penguins %>%
  select(bill_length = bill_length_mm, flipper_length = flipper_length_mm) %>%
  na.omit()

head(dat)

 

EDA

These two variables share a relatively large correlation with each other.
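
Here is one quick way to visualize that relationship and check the correlation (a small sketch using the dat data frame created above):

## EDA -- scatter plot and correlation between bill length and flipper length
dat %>%
  ggplot(aes(x = bill_length, y = flipper_length)) +
  geom_point(size = 2, alpha = 0.6) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "Bill Length (mm)",
       y = "Flipper Length (mm)",
       title = "Flipper Length vs Bill Length")

cor(dat$bill_length, dat$flipper_length)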

Ordinary Least Squares Regression

First, I’ll start by building a regression model under the Frequentist paradigm, specifying no prior values on the parameters (even though in reality the OLS model is using a flat prior, where all values are equally plausible — a discussion for a different time), using the lm() function.

### Ordinary Least Squares Regression ------------------------------
fit_ols <- lm(flipper_length ~ I(bill_length - mean(dat$bill_length)), data = dat)
summary(fit_ols)

Technical Note: Since we don’t observe a bill length of 0 mm in penguins, I chose to transform the bill length data by grand mean centering it (subtracting the grand mean from each bill length). This won’t change the predictions from the model but it does change our interpretation of the coefficients themselves (and the intercept is now directly interpretable, which it wouldn’t be if we left the bill length data in its untransformed state).

OLS Regression by Hand

Okay, that was easy. In a single line we constructed a model and we are up and running. But, let’s calculate this by hand so we can see where the coefficients and their corresponding standard errors came from.

# Calculate the least squares regression line by hand now
dat_stats <- dat %>%
  summarize(x_bar = mean(bill_length),
            x_sd = sd(bill_length),
            y_bar = mean(flipper_length),
            y_sd = sd(flipper_length),
            r = cor(bill_length, flipper_length),
            x2 = x_bar^2,
            y2 = y_bar^2,
            xy_bar = x_bar * y_bar,
            .groups = "drop")

dat_stats

In the above code, I’m calculating summary statistics (mean and SD) for both the independent and dependent variables, extracting their correlation coefficient, and getting their squared means and the product of their means to be used in downstream calculations of the regression coefficients. Here is what the output looks like:

Because we grand mean centered bill length, the model intercept is simply the mean of our outcome variable (flipper length).

intercept <- dat_stats$y_bar

The coefficient for our independent variable is calculated as the correlation coefficient multiplied by the SD of the y variable divided by the SD of the x variable.

beta <- with(dat_stats,
             r * (y_sd / x_sd))

beta

Finally, I’ll store the grand mean of bill length so that we can use it when we need to center bill length and make predictions.

x_bar <- dat_stats$x_bar
x_bar

Now that we have the intercept and beta coefficient for bill length calculated by hand we can construct the regression equation:
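
Written out with the objects we just created, the by-hand regression equation takes the form:

flipper_length = intercept + beta * (bill_length - x_bar)

## Print the by-hand coefficients
intercept
beta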

Notice that these are the exact values we obtained from our model using the lm() function.

We can use the two models to make predictions and show that they are the exact same.

## Make predictions with the two models
dat %>%
  mutate(pred_model = predict(fit_ols),
         pred_hand = intercept + beta * (bill_length - x_bar))

In the model estimated from the lm() function we also get a sigma parameter (the residual standard error, RSE), which tells us, on average, how far off the model predictions are. We can build this by hand by first calculating the squared error between the observed values and our predictions and then calculating the RSE as the square root of the sum of squared residuals divided by the model degrees of freedom.

# Calculate the estimated variance around the line
N_obs <- nrow(dat)

dat %>%
  mutate(pred = intercept + beta * (bill_length - x_bar),
         resid = (flipper_length - pred),
         resid2 = resid^2) %>%
  summarize(n_model_params = 2,
            deg_freedom = N_obs - n_model_params,
            model_var = sum(resid2) / deg_freedom,
            model_sd = sqrt(model_var))

Again, our by-hand calculation matches exactly what we obtained from the lm() function, 10.6.

Bayesian Linear Regression by Hand

Now let’s turn our attention to building a Bayesian regression model.

First we need to calculate the sum of squares for X (bill length), which we do with this equation:

ss_x = N * (mean(x^2) - mean(x)^2)

N <- nrow(dat)
mean_x2 <- mean(dat$bill_length^2)
mu_x2 <- mean(dat$bill_length)^2

N
mean_x2
mu_x2

ss_x <- N * (mean_x2 - mu_x2)
ss_x

Next, we need to specify some priors on the model parameters. The priors can come from a number of sources. We could set them subjectively (usually people hate this idea). Or, we could look in the scientific literature and use prior research as a guide for plausible values of the relationship between flipper length and bill length. In this case, I’m just going to specify some priors that are within the range of the data but have enough variance around them to “let the data speak”. (NOTE: you could rerun the entire analysis with different priors and smaller or larger variances to see how the model changes. I hope to do a longer blog post about priors in the future).

  • For the slope coefficient we choose a normal prior, N(1, 1^2)
  • For the intercept coefficient we choose a normal prior, N(180, 10^2)
  • We don’t know the true variance so we use the estimated variance from the least squares regression line
prior_model_var <- sigma(fit_ols)^2

prior_slope_mu <- 1
prior_slope_var <- 1^2

prior_intercept_mu <- 180
prior_intercept_var <- 10^2

Next, we calculate the posterior precision (1/variance) for the slope coefficient. We do it with this equation:

1/prior_slope_var + (ss_x / prior_model_var)

And then convert it to a standard deviation (which is more useful to us and easier to interpret than precision, since it is on the scale of the data).

# 1/prior_slope_var + (ss_x / prior_model_var)
posterior_slope_precision <- 1 / prior_slope_var + ss_x / prior_model_var

# Convert to SD
posterior_slope_sd <- posterior_slope_precision^-(1/2)

Once we have the precision for the posterior slope calculated we can calculate the posterior regression coefficient for the slope using this equation:

(1/prior_slope_var) / posterior_slope_precision * prior_slope_mu + (ss_x / prior_model_var) / posterior_slope_precision * beta

# posterior slope
posterior_slope_mu <- (1/prior_slope_var) / posterior_slope_precision * prior_slope_mu + (ss_x / prior_model_var) / posterior_slope_precision * beta
posterior_slope_mu

We can plot the prior and posterior slope values, by first simulating the posterior slope using its mu and SD values.

## Plot prior and posterior for the slope
set.seed(4)
prior_sim <- rnorm(n = 1e4, mean = prior_slope_mu, sd = sqrt(prior_slope_var))
posterior_sim <- rnorm(n = 1e4, mean = posterior_slope_mu, sd = posterior_slope_sd)

plot(density(posterior_sim),
     col = "blue",
     lwd = 4,
     xlim = c(-2, 3),
     main = "Prior & Posterior\nfor\nBayesian Regression Slope Coefficient")
lines(density(prior_sim),
      col = "red",
      lty = 2,
      lwd = 4)
legend("topleft",
       legend = c("Prior", "Posterior"),
       col = c("red", "blue"),
       lty = c(2, 1),
       lwd = c(2,2))

We notice that the prior for the slope, N(1, 1^2), had a rather broad range of plausible values. However, after observing some data and combining that observed data with our prior, we find the posterior has settled into a reasonable range given the data. Again, if you had selected other priors, or perhaps a prior with a much narrower variance, these results would be different and potentially more influenced by your prior.

Now that we have a slope parameter we need to calculate the intercept. The equations are commented out in the code below.

## Posterior precision for the intercept
# 1/prior_intercept_var + N/prior_model_var

posterior_intercept_precision <- 1/prior_intercept_var + N/prior_model_var
posterior_intercept_precision

# Convert to SD
posterior_intercept_sd <- posterior_intercept_precision^-(1/2)
posterior_intercept_sd

## Posterior intercept mean
# (1/prior_intercept_var) / posterior_intercept_precision * prior_intercept_mu + (N/prior_model_var) / posterior_intercept_precision * intercept
posterior_intercept_mu <- (1/prior_intercept_var) / posterior_intercept_precision * prior_intercept_mu + (N/prior_model_var) / posterior_intercept_precision * intercept
posterior_intercept_mu

Comparing our Bayesian regression coefficients and standard errors to the ones we obtained from the lm() function, we find that they are nearly identical.
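
One way to line them up is to pull the coefficients and standard errors from the lm() fit and place them next to the posterior means and SDs we just calculated (a small sketch using the objects created above):

## Compare OLS coefficients/SEs with the by-hand Bayesian posterior means/SDs
data.frame(
  parameter = c("intercept", "slope"),
  ols_coef = as.vector(coef(fit_ols)),
  ols_se = as.vector(summary(fit_ols)$coefficients[, "Std. Error"]),
  bayes_mu = c(posterior_intercept_mu, posterior_slope_mu),
  bayes_sd = c(posterior_intercept_sd, posterior_slope_sd)
)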

Using the coefficients and their corresponding standard errors we can also compare the 95% Credible Interval from the Bayesian model to the 95% Confidence Interval from the OLS model (again, nearly identical).
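
Here is a sketch of that interval comparison, using confint() for the OLS model and a normal approximation (posterior mean plus/minus 1.96 posterior SDs) for the Bayesian model:

## 95% Confidence Intervals from the OLS model
confint(fit_ols, level = 0.95)

## 95% Credible Intervals from the by-hand Bayesian model (normal approximation)
posterior_intercept_mu + qnorm(p = c(0.025, 0.975)) * posterior_intercept_sd
posterior_slope_mu + qnorm(p = c(0.025, 0.975)) * posterior_slope_sd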

And writing out the Bayesian model we see that it is the same as the Frequentist model.

Using the posterior mean and SDs for the slope and intercept we can simulate a distribution and plot and summarize them using quantile intervals and credible intervals, to get a picture of our uncertainty.

## Use simulation for the slope and intercept
set.seed(413)
posterior_intercept_sim <- rnorm(n = 1e4, mean = posterior_intercept_mu, sd = posterior_intercept_sd)
posterior_slope_sim <- rnorm(n = 1e4, mean = posterior_slope_mu, sd = posterior_slope_sd)

par(mfrow = c(1, 2))
hist(posterior_intercept_sim)
hist(posterior_slope_sim)

Making Predictions

Now that we have our models specified it’s time to make some predictions!

## Add all predictions to the original data set
dat_final <- dat %>%
  mutate(pred_ols = predict(fit_ols),
         pred_ols_by_hand = intercept + beta * (bill_length - x_bar),
         pred_bayes_by_hand = posterior_intercept_mu + posterior_slope_mu * (bill_length - x_bar))

head(dat_final, 10)

Because the equations were pretty much identical we can see that all three models (OLS with lm(), OLS by hand, and Bayes by hand) produce the same predicted values for flipper length.

We can also visualize these predictions along with the prediction intervals.

## Predictions with uncertainty from the OLS model
ols_preds <- dat_final %>%
  bind_cols(predict(fit_ols, newdata = ., interval = "prediction")) %>%
  ggplot(aes(x = fit, y = flipper_length)) +
  geom_point(size = 2) +
  geom_ribbon(aes(ymin = lwr, ymax = upr),
              alpha = 0.4) +
  geom_smooth(method = "lm",
              se = FALSE) +
  ggtitle("OLS Predictions with 95% Prediction Intervals")

## Now predict with the Bayesian Model
# The error is calculated as:
# residual_variance = prior_model_var + posterior_intercept_sd^2 + posterior_slope_sd^2 * (x - x_bar)
bayes_preds <- dat %>%
  mutate(fit = posterior_intercept_mu + posterior_slope_mu * (bill_length-x_bar),
         error_var = prior_model_var + posterior_intercept_sd^2 + posterior_slope_sd^2 * (bill_length - x_bar),
         rse = sqrt(error_var),
         lwr = fit - qt(p = 0.975, df = N - 2) * rse,
         upr = fit + qt(p = 0.975, df = N - 2) * rse) %>%
  ggplot(aes(x = fit, y = flipper_length)) +
  geom_point(size = 2) +
  geom_ribbon(aes(ymin = lwr, ymax = upr),
              alpha = 0.4) +
  geom_smooth(method = "lm",
              se = FALSE) +
  ggtitle("Bayesian Predictions with 95% Prediction Intervals")


ols_preds | bayes_preds

Okay, I know what you are thinking. The equations were the same, the predictions are the same, and the prediction intervals are the same….What’s the big deal?!

The big deal is that the Bayesian model gives us more flexibility. Not only can we specify priors, allowing us to have a higher weighting on more plausible values (based on prior data, prior research, domain expertise, etc.); but, we can also produce an entire posterior distribution for the predictions.

Let’s take a single row of observation from the data and make a prediction on it, complete with 95% credible intervals, using our Bayesian model.

single_obs <- dat %>% slice(36)
single_obs

pred_bayes <- single_obs %>%
  mutate(fit = posterior_intercept_mu + posterior_slope_mu * (bill_length-x_bar),
         error_var = prior_model_var + posterior_intercept_sd^2 + posterior_slope_sd^2 * (bill_length - x_bar),
         rse = sqrt(error_var),
         lwr = fit - qt(p = 0.975, df = N - 2) * rse,
         upr = fit + qt(p = 0.975, df = N - 2) * rse) 

Next, we simulate 10,000 observations using the mean prediction (fit) and the rse from the table above and summarize the results.

set.seed(582)
pred_bayes_sim <- rnorm(n = 1e4, mean = pred_bayes$fit, sd = pred_bayes$rse)

mean(pred_bayes_sim)
quantile(pred_bayes_sim, probs = c(0.5, 0.025, 0.975))
mean(pred_bayes_sim) + qt(p = c(0.025, 0.975), df = N - 2) * sd(pred_bayes_sim)

Our mean value, the 95% quantile intervals, and the 95% credible intervals are similar to the table above (they should be, since we simulated with those parameters). We will also get similar values if we use our OLS model and make a prediction on this observation with prediction intervals, as shown below.
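
A quick sketch of that OLS comparison, using predict() on the single_obs row with a 95% prediction interval:

## OLS prediction with a 95% prediction interval for the same observation
predict(fit_ols, newdata = single_obs, interval = "prediction", level = 0.95)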

But, because we simulated a full distribution, from the Bayesian prediction we also get 10,000 plausible values for the estimate of flipper length based on the observed bill length.

This offers us flexibility to make more direct statements about the predictive distribution. For example, we might ask, “What is the probability that the predicted flipper length is greater than 210 mm?” and we could directly compute this from the simulated data! This provides us with a number of options if we were using this model to make decisions on and perhaps have threshold values or ranges of critical importance where we need to know how much of the density of our predicted distribution is below or above the threshold or falls outside of the region of importance.
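
For example, using the simulated predictive distribution created above, the probability that the predicted flipper length is greater than 210 mm is simply the proportion of simulated values above that threshold:

## p(predicted flipper length > 210 mm)
mean(pred_bayes_sim > 210)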

Wrapping Up

That’s a little bit on working with regression models by hand, from first principles. The Bayesian approach affords us a lot of control and flexibility over the model, via priors (more on that in a future blog post) and the predictions (via simulation). The Bayesian regression model calculated here was a simple approach and did not require Markov Chain Monte Carlo (MCMC), which would be run when using {rstanarm} or {brms}. If you are curious how that works, I wrote a previous blog on using a Gibbs sampler to estimate the posterior distributions for a regression model, CLICK HERE. But, even with this simple approach we are afforded the ability to specify priors on the model parameters of the deterministic/fixed component of the model. We did not specify a prior on the stochastic/error part of the model (we used the sigma value calculated from the OLS model, which we also calculated by hand). This was necessary in this simple conjugate approach because the normal distribution is a two-parameter distribution, so we needed to treat one of the parameters as fixed (in this case the SD). Had we used an MCMC approach, we could have further specified a prior distribution on the model error.

If you notice any errors, feel free to drop me an email.

The code is available on my GITHUB page.

Bayes Theorem: p(A|B) = p(A) * p(B|A) / p(B) — But why?

A student recently asked me if I could show them why Bayes Theorem looks the way it does. If we are trying to determine the probability A given B, how did we arrive at the theorem in the title?

Let’s create some data and see if we can sort it out. We will simulate a 2×2 table, similar to one we might find in sports medicine journals when looking at some type of test (positive or negative) and some type of outcome (disease or no disease).

dat_tbl <- data.frame(
  test = c("positive", "negative", "total"),
  disease = c(25, 5, 30),
  no_disease = c(3, 40, 43),
  total = c(28, 45, 73)
)

dat_tbl

A logical question we’d like to answer here is, “What is the probability of disease given a positive test.” Written in probability notation we are asking, p(Disease | Positive).

Using the data in the table, we can quickly compute this as 25 people who were positive and had the disease divided by 28 total positive tests. 25/28 = 89.3%

Of course, we could also compute this using Bayes Theorem:

p(Disease | Positive) = p(Disease) * p(Positive | Disease) / p(Positive)

We store the necessary values in R objects and then compute the result

### What we want to know: p(Disease | Positive)
# p(Disease | Positive) = p(Disease) * p(Positive | Disease) / p(Positive)
# p(A | B) = p(A) * p(B|A) / p(B)

p_disease <- 30/73
p_positive_given_disease <- 25/30
p_positive <- 28/73

p_no_disease <- 43/73
p_positive_given_no_disease <- 3/43

p_disease_given_positive <- (p_disease * p_positive_given_disease) / (p_disease * p_positive_given_disease + p_no_disease * p_positive_given_no_disease) 
p_disease_given_positive

This checks out. We get the exact same result as when we did 25/28.

Okay, how did we get here? Why does it work out like that?

The math works out because we start with two different joint probabilities, p(A n B) and p(B n A). Or, in our case, p(Disease n Positive) and p(Positive n Disease). Formally, we can write these as follows:
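
p(Positive n Disease) = p(Positive | Disease) * p(Disease)

p(Disease n Positive) = p(Disease | Positive) * p(Positive)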

We’ve already stored the necessary probabilities in specific elements above. But here is what they both look like using our 2×2 table. First I’ll calculate it with the counts directly from the table and then calculate it with the R elements that we stored. You’ll see the answers are the same.

## Joint Probability 1: p(Positive n Disease) = p(Positive | Disease) * p(Disease)

25/30 * 30/73

p_positive_given_disease * p_disease

## Joint Probability 2: p(Disease n Positive) = p(Disease | Positive) * p(Positive)

25/28 * 28/73

p_disease_given_positive * p_positive

Well, would you look at that! The two joint probabilities are equal to each other. We can formally test that they are equal to each other by setting up a logical equation in R.

## These two joint probabilities are equal!
#  p(Positive n Disease) = p(Disease n Positive)

(p_positive_given_disease * p_disease) == (p_disease_given_positive * p_positive)

So, if they are equal, what we are saying is this:

p(Disease | Positive) * p(Positive) = p(Positive | Disease) * p(Disease)

Now, with some algebra, we can divide both sides of the equation by p(Positive) and we are left with Bayes Theorem for our problem:

p(Disease | Positive) = p(Disease) * p(Positive | Disease) / p(Positive)

Putting it all together, it looks like this:
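
p(Disease | Positive) * p(Positive) = p(Positive | Disease) * p(Disease)

p(Disease | Positive) = p(Positive | Disease) * p(Disease) / p(Positive)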

So, all we’ve done is taken two joint probabilities and used some algebra to rearrange the terms in order to get to the conditional probability we were interested in, p(Disease | Positive), and we end up with Bayes Theorem.

How Bayesian Search Theory Can Help You Find More Deer

Introduction

I was talking with a friend the other day who was telling me about his brother, who leads guided deer hunts in Wyoming. Typically, clients will come out for a hunt over several days and rely on him to guide them to areas of the forest where there is a high probability of seeing deer. Of course, nothing is guaranteed! It’s entirely possible to go the length of the trip and not see any deer at all. So, my friend was explaining that his brother is very good at constantly keeping an eye on the environment and his surroundings and creating a mental map in his head of the areas that will maximize his clients’ chances of finding deer. (There is probably an actual name for this skill, but I’m not sure what it is.)

This is an interesting problem and reminds me of Bayesian Search Theory, which is used to help locate objects based on prior knowledge/information and the probability of seeing the object within a specific area given it would actually be there. This approach was most recently popularized for its use in the search for the wreckage of Malaysian Airlines Flight 370.

Let’s walk through an example of how Bayesian Search Theory works.

Setting up the search grid

Let’s say our deer guide has a map that he has broken up into a 4×4 grid of squares. He places prior probabilities of seeing a deer in each region given what he knows about the terrain (e.g., areas with water and deer food will have a higher probability of deer activity).

His priors grid looks like this:

library(tidyverse)

theme_set(theme_light())

# priors for each square region looks like this
matrix(data = c(0.01, 0.02, 0.01, 0.1, 0.1, 0.03, 0.03, 0.03, 
                0.2, 0.2, 0.17, 0.1, 0.01, 0.01, 0.02, 0.01), 
       byrow = TRUE, 
       nrow = 4, ncol = 4) %>%
  as.data.frame() %>%
  setNames(c("1", "2", "3", "4")) %>%
  pivot_longer(cols = everything(),
               names_to = 'x_coord') %>%
  mutate(y_coord = rep(1:4, each = 4)) %>%
  relocate(y_coord, .before = x_coord) %>%
  ggplot(aes(x = x_coord, y = y_coord)) +
  geom_text(aes(label = value, color = value),
            size = 10) +
  scale_color_gradient(low = "red", high = "green") +
  labs(x = "X Coorindates",
         y = "Y Coordinates",
         title = "Prior Probabilities of 4x4 Square Region",
         color = "Probability") +
  theme(axis.text = element_text(size = 11, face = "bold"))

We see that the y-coordinate of 3 has a high probability of deer activity. In particular, xy = (1, 3) and xy = (2, 3) seem to maximize the chances of seeing a deer.

The likelihood grid

The likelihood in this case describes the probability of seeing a deer in a specific square region given that a deer is actually there, p(Square Region | Deer).

To determine these likelihoods, our deer guide is constantly scouting the areas and making mental notes about what he sees. He documents certain things within each square region that would indicate deer are there, for example, deer droppings, footprints, actually seeing some deer, and previous successful hunts in a certain region. Using this information he creates a grid of the following likelihoods:

matrix(data = c(0.88, 0.82, 0.88, 0.85, 0.77, 0.65, 0.83, 0.95, 
                0.98, 0.97, 0.93, 0.94, 0.93, 0.79, 0.68, 0.80), 
       byrow = TRUE, 
       nrow = 4, ncol = 4) %>%
  as.data.frame() %>%
  setNames(c("1", "2", "3", "4")) %>%
  pivot_longer(cols = everything(),
               names_to = 'x_coord') %>%
  mutate(y_coord = rep(1:4, each = 4)) %>%
  relocate(y_coord, .before = x_coord) %>%
  ggplot(aes(x = x_coord, y = y_coord)) +
  geom_text(aes(label = value, color = value),
            size = 10) +
  scale_color_gradient(low = "red", high = "green") +
  labs(x = "X Coorindates",
         y = "Y Coordinates",
         title = "Likelihoods for each 4x4 Square Region",
         subtitle = "p(Region | Seeing Deer)",
         color = "Probability") +
  theme(axis.text = element_text(size = 11, face = "bold"))

Combining our prior knowledge and likelihoods

To make things easier, I’ll put both the priors and likelihoods into a single data frame.

dat <- data.frame(
  coords = c("1,1", "2,1", "3,1", "4,1", "1,2", "2,2", "3,2", "4,2", "1,3", "2,3", "3,3", "4,3", "1,4", "2,4", "3,4", "4,4"),
  priors = c(0.01, 0.02, 0.01, 0.1, 0.1, 0.03, 0.03, 0.03, 
                0.2, 0.2, 0.17, 0.1, 0.01, 0.01, 0.02, 0.01),
  likelihoods = c(0.88, 0.82, 0.88, 0.85, 0.77, 0.65, 0.83, 0.95, 
                0.98, 0.97, 0.93, 0.94, 0.93, 0.79, 0.68, 0.80))

dat

Now he multiplies the prior and likelihood together to summarize his belief of deer in each square region.

dat <- dat %>%
  mutate(posterior = round(priors * likelihoods, 2))

dat %>%
  ggplot(aes(x = coords, y = posterior)) +
  geom_col(color = "black",
           fill = "light grey") +
  scale_y_continuous(labels = scales::percent_format()) +
  theme(axis.text = element_text(size = 11, face = "bold"))

As expected, squares (1,3), (2,3), and (3,3) have the highest probability of observing a deer. Those would be the first areas that our deer hunt guide would want to explore when taking new clients out.

Updating Beliefs After Searching a Region

Since square (1,3) has the highest probability our deer hunt guide decides to take his new clients out there to search for deer. After one day of searching they don’t find any deer. To ensure success tomorrow, he needs to update his knowledge not only about square (1, 3) but about all of the other squares in his 4×4 map.

To update square (1, 3) we use the following equation:

update.posterior = prior * (1 – likelihood) / (1 – prior*likelihood)

coord1_3 <- dat %>%
  filter(coords == "1,3")

coord1_3$priors * (1 - coord1_3$likelihoods) / (1 - coord1_3$priors * coord1_3$likelihoods)

The new probability for region (1, 3) is 0.5%. Once we update that square region we can update the other regions using the following equation:

update.prior = prior * (1 / (1 – prior * likelihood))

We can do this for the entire data set all at once:

dat <- dat %>%
  mutate(posterior2 = round(priors * (1 / (1 - priors*likelihoods)), 2),
         posterior2 = ifelse(coords == "1,3", 0.005, posterior2)) %>%
  mutate(updated_posterior = round(posterior2 * likelihoods, 3))

dat %>%
  select(coords, posterior, updated_posterior) %>%
  pivot_longer(cols = -coords) %>%
  ggplot(aes(x = coords, y = value, fill = name)) +
  geom_col(position = "dodge") +
  scale_y_continuous(labels = scales::percent_format()) +
  theme(axis.text = element_text(size = 11, face = "bold"))


matrix(dat$updated_posterior, ncol = 4, nrow = 4, byrow = TRUE) %>%
  as.data.frame() %>%
  setNames(c("1", "2", "3", "4")) %>%
  pivot_longer(cols = everything(),
               names_to = 'x_coord') %>%
  mutate(y_coord = rep(1:4, each = 4)) %>%
  relocate(y_coord, .before = x_coord) %>%
  ggplot(aes(x = x_coord, y = y_coord)) +
  geom_text(aes(label = value, color = value),
            size = 10) +
  scale_color_gradient(low = "red", high = "green") +
  labs(x = "X Coorindates",
         y = "Y Coordinates",
         title = "Updated Posterior for each 4x4 Square Region after searching (1,3) and\nnot seeing deer",
         color = "Probability") +
  theme(axis.text = element_text(size = 11, face = "bold"))

We can see that after updating the posteriors following day 1, his best approach is to search grids (2,3) and (3,3) tomorrow, as the updated beliefs indicate that they have a higher probability of having a deer in them.

Conclusion

We can of course continue updating after searching each region until we finally find the deer, but we will stop here and let you play with the code and continue on if you’d like. This tutorial is just meant to provide a brief look into how to use Bayesian Search Theory to locate objects in various spaces, and hopefully you can apply this method to other aspects of your life.

The complete code is available on my GitHub page.

Bayesian Updating for Normally Distributed Data – A few different approaches for the normal-normal conjugate

Introduction

Bayesian updating provides a way of combining prior knowledge/belief with newly observed data to obtain an updated knowledge of the world (the posterior). Most Bayesian updating examples begin by observing a binomial outcome and combining those observations with a beta prior. While this is useful for understanding the basic crux of Bayesian updating, not all problems that we face in the real world will be binomial in nature, thus requiring a different likelihood distribution. For example, normally distributed data can be challenging to work with because there are two parameters (a mean and standard deviation) that have their own respective variances. To circumvent this issue, in a normal-normal conjugate we often accept one of the parameters as being known and fixed in the population (essentially treating it as a nuisance parameter and not something we are explicitly modeling). Often, because we care about updating our knowledge about the mean (center) of an observed value, the standard deviation is taken to be fixed for the population, allowing us to create an updated mean and a corresponding distribution around it.

In reading about various approaches to normal-normal conjugate, I’ve noted three methods that can be used for Bayesian updating of a normally distributed variable in a simple way. The difference between the three approaches appears to be with the amount of information we have available to us about the observed values. These approaches are easy to use and can be applied quickly by a practitioner, with just a calculator, offering a convenient way to make observations and rationalize about the world around us.

The information required for the three approaches is as follows (I’ll share references to where I got each approach in the sections below):

  1. We have a prior value for the population mean and the sample size that this mean was taken from. What we are lacking is information about the population standard deviation. Thus, we have no information about how the variable varies.
  2. We have a prior mean and standard deviation for the population but we are lacking sample size information that the mean and standard deviation was derived from. Thus, we know how the variable varies but we don’t know how confident we should be about the observations (a large sample means we’d be more confident while a smaller sample means we’d be less confident).
  3. We have all three pieces of population prior — mean, sd, and sample size.

The complete code and data are available on my GITHUB page.

Load Data

Reference: Data comes from basketball-reference.com advanced shooting stats.

We will load several seasons worth of NBA advanced shooting statistics and the stat we are interested in is Player Efficiency Rating (PER).

 


library(tidyverse)

theme_set(theme_light())

## load nba_advanced_shooting_stats.csv
d <- read.csv("nba_advanced_shooting_stats.csv", header = TRUE) %>%
  select(Season, Player, Pos, Age, Tm, G, MP, PER) %>%
  janitor::clean_names()

d %>%
  head() %>%
  knitr::kable()

We have 4 seasons worth of data (really about 3.5 given that the 2022-2023 season wasn’t complete when I scraped the original data).

Exploring Player Efficiency Rating & Minutes Played

Let’s focus on the Player Efficiency Rating (PER) and Minutes Played.
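
A quick look at both distributions (a small sketch using the d data frame loaded above):

## Distributions of PER and minutes played
par(mfrow = c(1, 2))
hist(d$per, main = "Player Efficiency Rating (PER)", xlab = "PER")
hist(d$mp, main = "Minutes Played", xlab = "Minutes Played")
par(mfrow = c(1, 1))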

It looks like, on average, players have a PER greater than 0, typically between 10 and 15. Minutes played is right-skewed, with the vast majority of players playing a low number of minutes and a few playing a lot of minutes.

We can look at the numbers explicitly by evaluating the quantiles.

summary(d[, c("mp", "per")])

Setting up our priors

Since the 2022 – 2023 season was not finished when I scraped this data, we will base our prior for PER on the previous 3 seasons. Additionally, we will set up our prior mean from players in the population who had over the median number of minutes played in those seasons.

d %>%
  filter(season != "2022-2023",
         mp > median(mp)) %>%
  summarize(n_players = n(),
            avg_mp = mean(mp),
            avg_per = mean(per),
            sd_per = sd(per))

We can store these variables in their own elements so that they can be called later in our calculations.

prior_mu <- 14.79
prior_n <- 1448
prior_df <- prior_n - 1

Recall that for our prior standard deviation we need to obtain a prior for the standard deviation around the mean (a standard error of the mean) and we also need to obtain a known population standard deviation (what I referred to as the nuisance parameter above, since we won’t be directly estimating it).

We will call the standard deviation for the mean PER, prior_sd, and the fixed standard deviation, prior_tau. To calculate the prior_sd we’ll take the standard deviation across the three seasons for each player and then take the mean of those player standard deviations. For prior_tau we’ll use the overall standard deviation of observed PER values for the three seasons (which was calculated in our summary function above). Again, we’ll store these values in their own elements for calculations later.

d %>%
  filter(season != "2022-2023",
         mp &gt; median(mp)) %>%
  group_by(player) %>%
	summarize(per_sd = sd(per),
	          .groups = "drop") %>%
  summarize(mean(per_sd, na.rm = TRUE))


prior_sd <- 1.57
prior_var <- prior_sd^2
prior_precision <- 1 / prior_var

prior_tau <- 4.53
prior_tau_var <- prior_tau^2
prior_tau_precision <- 1 / prior_tau_var

 

Selecting one player

Let’s start with one player and work through the three approaches explained above before applying them to the full data set.

We will select a player with a low number of minutes played so that we can see how their PER behaves when we combine it with our prior. I’ll select Thanasis Antetokounmpo from the 2022-2023 season.

ta <- d %>%
  filter(season == "2022-2023",
         player == "Thanasis Antetokounmpo")

ta

We don’t know Thanasis Antetokounmpo’s standard deviation of Player Efficiency Rating over the 21 games (91 minutes) he played. Therefore, we don’t have a standard deviation for his production.

Let’s store his observed values in separate elements.

obs_mu <- 1.9
obs_n <- 91

Method 1

I stumbled upon this method in the 2nd edition of Wayne Winston’s fantastic book, Mathletics.

Combining the observed PER and sample size (minutes played) for Antetokounmpo with the prior PER and prior sample size for the population, we see that Antetokounmpo’s estimated PER gets pulled up closer to the population mean, though still below average.
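
Winston’s exact bookkeeping may differ slightly, but a minimal sketch of the idea is a sample-size-weighted average of the prior mean and the observed mean, using the prior sample size and the observed minutes played as the weights (the bayes_v1 object created here is referenced again in the comparison below):

## Method 1 (sketch): sample-size-weighted average of the prior and observed means
## NOTE: weighting by prior_n and obs_n is an assumption of this sketch
bayes_v1 <- (prior_n * prior_mu + obs_n * obs_mu) / (prior_n + obs_n)
bayes_v1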

To get a sense for how much sample size affects the shrinkage towards the prior, let’s pretend that Antetokounmpo had 1200 minutes of observation with the same PER.
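
Using the same weighted-average sketch, but swapping in the hypothetical 1200 minutes:

## Same sketch with a hypothetical 1200 minutes played at the same PER
bayes_v1_1200 <- (prior_n * prior_mu + 1200 * obs_mu) / (prior_n + 1200)
bayes_v1_1200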

Notice that with 1200 minutes played we are much more certain that Antetokounmpo has a below average PER.

Method 2

Recall that for method 2 to work we require a mean and standard deviation for Antetokounmpo’s PER. Since we don’t have a standard deviation for his PER in the 2022-2023 season, we will get his PER from the previous 3 seasons and calculate a standard deviation. We will store that value in its own element.

This approach was discussed in Chapter 9 of Gelman and Hill’s Regression and Other Stories.

d %>%
  filter(season != "2022-2023",
         player == "Thanasis Antetokounmpo") %>%
  summarize(sd(per))

obs_sd <- 1.19

Applying method 2 we get the following result.

## Posterior
bayes_v2 <- ((1/prior_sd^2 * prior_mu) + (1 / obs_sd^2 * obs_mu))/((1/obs_sd^2) + (1/prior_sd^2))

bayes_v2

## Posterior SD
bayes_v2_sd <- sqrt(1/(1/obs_sd^2 + 1/prior_sd^2))
bayes_v2_sd

## Posterior 95% CI
bayes_v2 + qnorm(p = c(0.025, 0.975))*bayes_v2_sd

We could use a similar approach with just the mean and standard deviation (no sample size info) but use precision (1 / variance) as the parameter describing our spread in the data (instead of SD). We obtain the same results.

obs_precision <- 1 / obs_sd^2

posterior_precision <- prior_precision + obs_precision

bayes_v2.2 <- prior_precision/posterior_precision * prior_mu + obs_precision/posterior_precision * obs_mu

bayes_v2.2

bayes_v2.2_sd <- sqrt(1/posterior_precision)
bayes_v2.2_sd

## Posterior 95% CI
bayes_v2.2 + qnorm(p = c(0.025, 0.975))*bayes_v2.2_sd

This result is much more conservative than method 1. We see that Antetokounmpo is estimated to be well below average. Additionally, now that we have a standard deviation for Antetokounmpo’s PER we are also able to calculate a credible interval for his performance.

Method 3

For this final method we will use all of the observed info – mean, sd, and minutes play. This approach was presented in William Bolstad’s Introduction to Bayesian Statistics, 2nd Ed.

bayes_v3_precision <- prior_precision + obs_n/prior_tau_var
bayes_v3_precision

bayes_v3_sd <- sqrt(1/bayes_v3_precision)
bayes_v3_sd

bayes_v3 <- (prior_precision / (obs_n/prior_tau_var + prior_precision))*prior_mu + ((obs_n/prior_tau_var) / (obs_n/prior_tau_var + prior_precision)) * obs_mu

bayes_v3

## Posterior 95% CI
bayes_v3 + qnorm(p = c(0.025, 0.975))*bayes_v3_sd

Comparing the Results

data.frame(bayes_v1, bayes_v2, bayes_v2_sd, bayes_v3, bayes_v3_sd) %>%
  knitr::kable()

  • Method 1 has the largest pull towards the prior mean because it uses the least information. Since we don’t have an observed standard deviation for our observation, we also don’t have any information about the variability in the posterior mean.
  • Method 2 has less pull towards the prior mean than method 1 and also has a rather large standard deviation around the values.
  • Method 3 has the lowest pull towards the prior mean of the three approaches, and it uses the largest amount of information.

Antetokounmpo only had 91 minutes of observation time. To show how sample size affects our estimate, if we increase his sample size to 1000 minutes we end up with more confidence about his performance (an estimated PER closer to the observed 1.9 and a smaller standard deviation).

 

bayes_v3.3_precision <- prior_precision + 1000/prior_tau_var
bayes_v3.3_precision

bayes_v3.3_sd <- sqrt(1/bayes_v3.3_precision)
bayes_v3.3_sd

bayes_v3.3 <- (prior_precision / (1000/prior_tau_var + prior_precision))*prior_mu + ((1000/prior_tau_var) / (1000/prior_tau_var + prior_precision)) * obs_mu

bayes_v3.3

## Posterior 95% CI
bayes_v3.3 + qnorm(p = c(0.025, 0.975))*bayes_v3.3_sd

 

Let’s create a simulation using rnorm() and plot the estimates from the three methods. Since we don’t have a standard deviation for method 1, we will use the prior_sd. We notice that method 3, which uses the most information, gives us a much more conservative belief about the player’s true performance compared to the other two methods.

N <- 1e4

set.seed(9087)
v1_sim <- rnorm(n = N, mean = bayes_v1, sd = prior_sd)
v2_sim <- rnorm(n = N, mean = bayes_v2, sd = bayes_v2_sd)
v3_sim <- rnorm(n = N, mean = bayes_v3, sd = bayes_v3_sd)


plot(density(v1_sim), 
     col = "blue",
     lwd = 3,
     xlim = c(0, 20),
     ylim = c(0, 0.95),
     main = "Bayesian Normal Updating -- 3 Approaches\nDashed Line = Observed PER")
lines(density(v2_sim), 
     col = "red",
     lwd = 3)
lines(density(v3_sim), 
     col = "green",
     lwd = 3)
abline(v = obs_mu,
       col = "black",
       lty = 2,
       lwd = 2)
legend(x = 12,
       y = 0.8,
       c("Method 1", "Method 2", "Method 3"),
       col = c("blue", "red", "green"),
       lwd = 2)

Wrapping Up

The normal-normal conjugate can be a little tricky compared to a beta-binomial conjugate, but it is an important distribution to work with given how much of the data we deal with on a regular basis is approximately normally distributed. Without getting into complex modeling, we can use a few simple approaches for a normal-normal conjugate that allow us to quickly update our beliefs based on the various bits of information we have access to. Hopefully this article was useful in showing a few of these approaches (there are others!).

The full code for this article is available on my GITHUB page.

If you notice any errors or have additional thoughts, feel free to email me!

Bayesian Priors for Categorical Variables using rstanarm

Continuing with more Bayesian data analysis using the {rstanarm} package, today we will walk through the ways of setting priors on categorical variables.

NOTE: Priors are a pretty controversial piece of Bayesian statistics and one of the arguments people make against Bayesian data analysis. Thus, I’ll also show what happens when you are overly bullish with your priors.

The full code is accessible on my GITHUB page.

Load Packages & Data

We are going to use the mtcars data set from R. The cylinder variable (cyl) is read in as numeric but it only has three levels (4, 6, 8); therefore, we will convert it to a categorical variable and treat it as such for the analysis.

We are going to build a model that estimates miles per gallon (mpg) from the number of cylinders a  car has. So, we will start by looking at the mean and standard deviation of mpg for each level of cyl.

## Bayesian priors for categorical variables using rstanarm

library(rstanarm)
library(tidyverse)
library(patchwork)

### Data -----------------------------------------------------------------
d <- mtcars %>%
  select(mpg, cyl, disp) %>%
  mutate(cyl = as.factor(cyl),
         cyl6 = ifelse(cyl == "6", 1, 0),
         cyl8 = ifelse(cyl == "8", 1, 0))

d %>% 
  head()

d %>%
  group_by(cyl) %>%
  summarize(avg = mean(mpg),
            SD = sd(mpg))

Fit the model using Ordinary Least Squares regression

Before constructing our Bayesian model, we fit the model as a basic regression model to see what the output looks like.

## Linear regression ------------------------------------------------------
fit_lm <- lm(mpg ~ cyl, data = d)
summary(fit_lm)

  • The model suggests there is a relationship between mpg and cyl number
  • A 4 cyl car is represented as the intercept. Consequently, the intercept represents the average mpg we would expect from a 4 cylinder car.
  • The other two coefficients (cyl6 and cyl8) represent the difference in mpg for each of those cylinder types relative to a 4 cylinder car (the model intercept). So, a 6 cylinder car will, on average, get about 7 fewer mpg than a 4 cylinder car, while an 8 cylinder car will, on average, get about 12 fewer mpg than a 4 cylinder car.

Bayesian regression with rstanarm — No priors specified

First, let’s fit the model with no priors specified (using the functions default priors) to see what sort of output we get.

## setting no prior info
stan_glm(mpg ~ cyl, data = d) %>%
  summary(digits = 3)

  • The output is a little different than the OLS model. First we see that there are no p-values (in the spirit of Bayes analysis!).
  • We do find that the model coefficients are basically the same as those produced by the OLS model, and even the standard deviations are similar to the standard errors from above.
  • Instead of p-values for each coefficient we get 80% credible intervals.
  • The sigma value at the bottom corresponds to the residual standard error we got in our OLS model.

Basically, the default priors “let the data speak” and reported back the underlying relationship in the empirical data.

Setting Some Priors

Next, we can set some minimally informative priors. These priors won’t contain much information and, therefore, will be easily overwhelmed by even minimal amounts of evidence regarding the underlying relationship present in the data.

To set priors on independent variables in rstanarm we need to create an element to store them. We have two independent variables (cyl6 and cyl8), both requiring priors (we will set the prior for the intercept and the model sigma in the function directly). To set these priors we need to determine a distribution, a mean value (location), and a standard deviation (scale). We add these values into the distribution function in the order in which they will appear in the model. So, there will be a vector of location that is specific to cyl6 and cyl8 and then a vector of scale that is also specific to cyl6 and cyl8, in that order.

## Setting priors
ind_var_priors <- normal(location = c(0, 0), scale = c(10, 10))

Next, we run the model.

fit_rstan <- stan_glm(mpg ~ cyl, 
                      prior = ind_var_priors,
                      prior_intercept = normal(15, 8),
                      prior_aux = cauchy(0, 3),
                      data = d)

# fit_rstan
summary(fit_rstan, digits = 3)

Again, this model is not so different from the one that used the default priors (or from the findings of the OLS model). But, our priors were uninformative.

One note I’ll make, before proceeding on, is that you can do this a different way and simply dummy code the categorical variables and enter those dummies directly into the model, setting priors on each, and you will obtain the same result. The below code dummy codes cyl6 and cyl8 as booleans (1 = yes, 0 = no) so when both are 0 we effectively are left with cyl4 (the model intercept).

############################################################################################
#### Alternate approach to coding the priors -- dummy coding the categorical variables #####
############################################################################################

d2 <- d %>%
  mutate(cyl6 = ifelse(cyl == "6", 1, 0),
         cyl8 = ifelse(cyl == "8", 1, 0))

summary(lm(mpg ~ cyl6 + cyl8, data = d2))

stan_glm(mpg ~ cyl6 + cyl8, 
         prior = ind_var_priors,
         prior_intercept = normal(15, 8),
         prior_aux = cauchy(0, 3),
         data = d2) %>%
  summary()

###########################################################################################
###########################################################################################

Okay, back to our regularly scheduled programming…..

So what’s the big deal?? The model coefficients are relatively the same as with OLS. Why go through the trouble? Two reasons:

  1. Producing the posterior distribution of the model coefficients and the posterior predictive distribution for the dependent variable allows us to evaluate our uncertainty around each. I’ve talked a bit about this before (Making Predictions with a Bayesian Regression Model, Confidence & Prediction Intervals – Compare and Contrast Frequentist and Bayesian Approaches, and Approximating a Bayesian Posterior with OLS).
  2. If we have more information on the relationship between mpg and cylinders, we can code that in as information the model can use!

Let’s table point 2 for a second and extract out some posterior samples from our Bayesian regression and visualize the uncertainty in the coefficients.

# posterior samples
post_rstan <- as.matrix(fit_rstan) %>%
  as.data.frame() %>%
  rename('cyl4' = '(Intercept)')

post_rstan %>%
  head()

mu.cyl4 <- post_rstan$cyl4
mu.cyl6 <- post_rstan$cyl4 + post_rstan$cyl6
mu.cyl8 <- post_rstan$cyl4 + post_rstan$cyl8

rstan_results <- data.frame(mu.cyl4, mu.cyl6, mu.cyl8) %>%
  pivot_longer(cols = everything())


rstan_plt <- rstan_results %>%
  left_join(
    
    d %>%
      group_by(cyl) %>%
      summarize(avg = mean(mpg)) %>%
      rename(name = cyl) %>%
      mutate(name = case_when(name == "4" ~ "mu.cyl4",
                              name == "6" ~ "mu.cyl6",
                              name == "8" ~ "mu.cyl8"))
    
  ) %>%
  ggplot(aes(x = value, fill = name)) +
  geom_histogram(alpha = 0.4) +
  geom_vline(aes(xintercept = avg),
             color = "black",
             size = 1.2,
             linetype = "dashed") +
  facet_wrap(~name, scales = "free_x") +
  theme_light() +
  theme(strip.background = element_rect(fill = "black"),
        strip.text = element_text(color = "white", face = "bold")) +
  ggtitle("rstanarm")

rstan_plt

  • The above plot represents the posterior distribution (the prior combined with the observed data, the likelihood) of the estimated values for each of our cylinder types.
  • The dashed line is the observed mean mpg for each cylinder type in the data.
  • The distribution helps give us a good sense of the certainty (or uncertainty) we have in our estimates.

We can summarize this uncertainty with point estimates (e.g., mean and median) and measures of spread (e.g., standard deviation, credible intervals, quantile intervals).

 

# summarize posteriors
mean(mu.cyl4)
sd(mu.cyl4)
qnorm(p = c(0.05, 0.95), mean = mean(mu.cyl4), sd = sd(mu.cyl4))
quantile(mu.cyl4, probs = c(0.05, 0.25, 0.5, 0.75, 0.95))

mean(mu.cyl6)
sd(mu.cyl6)
qnorm(p = c(0.05, 0.95), mean = mean(mu.cyl6), sd = sd(mu.cyl6))
quantile(mu.cyl6, probs = c(0.05, 0.25, 0.5, 0.75, 0.95))

mean(mu.cyl8)
sd(mu.cyl8)
qnorm(p = c(0.05, 0.95), mean = mean(mu.cyl8), sd = sd(mu.cyl8))
quantile(mu.cyl8, probs = c(0.05, 0.25, 0.5, 0.75, 0.95))

For example, the below information tells us that cyl8 cars will, on average, provide us with ~15.2 mpg with a credible interval between 13.7 and 16.2. The median value is 15.2 with an interquartile range between 14.6 and 15.8 and a 90% quantile interval ranging between 13.7 and 16.6.

Bullish Priors

As stated earlier, priors are one of the most controversial aspects of Bayesian analysis. Most argue against Bayes because they feel that priors can be manipulated to fit the data. However, what many fail to recognize is that no analysis is void of human decision-making. All analysis is done by humans and thus there are a number of subjective decisions that need to be made along the way, such as deciding what to do with outliers, how to handle missing data, the alpha level or level of confidence that you want to test your data against, etc. As I’ve said before, science isn’t often as objective as we’d like it to be. That all said, selecting priors can be done in a variety of ways (aside from just using non-informative priors as we did above). You could get expert opinion, you could use data and observations gained from a pilot study, or you could use information about parameters from previously conducted studies (though be cautious, as these also might be biased due to publication issues such as the file drawer phenomenon, p-hacking, and researcher degrees of freedom).

When in doubt, it is probably best to be conservative with your priors. But, let’s say we sit down with a mechanic and inform him of a study where we are attempting to estimate the miles per gallon for 4, 6, and 8 cylinder cars. We ask him if he can help us with any prior knowledge about the decline in mpg when the number of cylinders increase. The mechanic is very bullish with his prior information and states,

“Of course I know the relationship between cylinders and miles per gallon!! Those 4 cylinder cars tend to be very economical and get around 50 mpg plus or minus 2. I haven’t seen too many 6 cylinder cars, but my hunch is that they are pretty similar to the 4 cylinder cars. Now 8 cylinder cars…I do a ton of work on those! Those cars get a bad rap. In my experience they actually get better gas mileage than the 4 or 6 cylinder cars. My guess would be that they can get nearly 20 miles per gallon more than a 4 cylinder car!”

Clearly our mechanic has been sniffing too many fumes in the garage! But, let’s roll with his beliefs and codify them as prior knowledge for our model and see how such bullish priors influence the model’s behavior.

  • We set the intercept to be normally distributed with a mean of 50 and a standard deviation of 2.
  • Because the mechanic felt like the 6 cylinder car was similar to the 4 cylinder car, we will suggest that the difference between 6 cylinders and 4 cylinders is normally distributed with a mean of 0 and standard deviation of 2.
  • Finally, we use the crazy mechanics belief that the 8 cylinder car gets roughly 20 more miles per gallon than the 4 cylinder car and we code its prior to be normally distributed with a mean of 20 and standard deviation of 5.

Fit the model…

 

## Use wildly different priors ---------------------------------------------------------
ind_var_priors2 <- normal(location = c(0, 20), scale = c(10, 5))

fit_rstan2 <- stan_glm(mpg ~ cyl, 
                       prior = ind_var_priors2,
                       prior_intercept = normal(50, 2),
                       prior_aux = cauchy(0, 10),
                       data = d)

summary(fit_rstan2, digits = 3)


Wow! Look how much the overly bullish/informative priors changed the model output.

  • Our new belief is that a 4 cylinder car gets approximately 39 mpg and the 6 cylinder car gets about 3 more mpg than that, on average.
  • The 8 cylinder car is now getting roughly 14 mpg more than the 4 cylinder car.

The bullish priors have overwhelmed the observed data. Notice that the results are not exactly the prior, as they are tugged a little bit closer to the observed data, though not by much. For example, we specified the 8 cylinder car to have about 20 mpg over a 4 cylinder car. The observed data doesn’t indicate this to be true (8 cylinder cars were on average 11 mpg LESS THAN 4 cylinder cars), so the coefficient is getting pulled down slightly, from our prior of 20 to 14.4.

Let’s plot the posterior distribution.

# posterior samples
post_rstan <- as.matrix(fit_rstan2) %>%
  as.data.frame() %>%
  rename('cyl4' = '(Intercept)')

post_rstan %>%
  head()

mu.cyl4 <- post_rstan$cyl4
mu.cyl6 <- post_rstan$cyl4 + post_rstan$cyl6
mu.cyl8 <- post_rstan$cyl4 + post_rstan$cyl8

rstan_results <- data.frame(mu.cyl4, mu.cyl6, mu.cyl8) %>%
  pivot_longer(cols = everything())


rstan_plt2 <- rstan_results %>%
  left_join(
    
    d %>%
      group_by(cyl) %>%
      summarize(avg = mean(mpg)) %>%
      rename(name = cyl) %>%
      mutate(name = case_when(name == "4" ~ "mu.cyl4",
                              name == "6" ~ "mu.cyl6",
                              name == "8" ~ "mu.cyl8"))
    
  ) %>%
  ggplot(aes(x = value, fill = name)) +
  geom_histogram(alpha = 0.4) +
  geom_vline(aes(xintercept = avg),
             color = "black",
             size = 1.2,
             linetype = "dashed") +
  facet_wrap(~name, scales = "free_x") +
  theme_light() +
  theme(strip.background = element_rect(fill = "black"),
        strip.text = element_text(color = "white", face = "bold")) +
  ggtitle("rstanarm 2")

rstan_plt2

Notice how different these posteriors are from the first Bayesian model. In every case, the predicted mpg for each number of cylinders overestimates the observed mpg by cylinder (dashed line).

Wrapping Up

Today we went through how to set priors on categorical variables using rstanarm. Additionally, we talked about some of the skepticism about priors and showed what can happen when the priors you select are too overconfident. The moral of the story is two-fold:

  1. All statistics (Bayes, Frequentist, Machine Learning, etc) have some component of subjectivity as the human doing the analysis has to make decisions about what to do with their data at various points.
  2. Don’t be overconfident with your priors. Minimally informative priors may be useful, allowing us to assert some level of knowledge of the outcome while letting that knowledge be influenced/updated by what we’ve just observed.

The full code is accessible on my GITHUB page.

If you notice any errors, please reach out!