Category Archives: Sports Analytics

Simulations in R Part 2: Bootstrapping & Simulating Bivariate and Multivariate Distributions

In Part 1 of this series we covered some of the basic functions that we will need to perform resampling and simulations in R.

In Part 2 we will now move to building bootstrap resamples and simulating distributions for both bivariate and multivariate relationships. Some stuff we will cover:

Coding bootstrap resampling by hand for both a single variable (mean) and regression coefficients.
Using the boot() function from the boot package to perform bootstrapping for both a single variable and regression coefficients.
Creating a simulation where two variables are in some way dependent on each other (correlate).
Creating a simulation where multiple variables are correlated with each other.
Finally, to understand what is going on with these simulated distributions we will also work through code that shows us the relationship between variables using both covariance and correlation matrices.

As always, all code is freely available in the Github repository.

Resampling

The resampling approach we will use here is bootstrapping. The general concept of bootstrapping is as follows:

Draw multiple random samples from observed data with replacement.
Draws must be independent and each observation must have an equal chance of being selected.
The bootstrap sample should be the same size as the observed data in order to use as much information from the sample as possible.
Calculate the mean of the resampled data and store it.
Repeat this process thousands of times and summarize the mean of resampled means and the standard deviation of resampled means to obtain summary statistics of your bootstrapped resamples.

Write the bootstrap resampling by hand

The code below creates some fake data and then writes a for() loop that produces 1000 bootstrap resamples. The for() loop was introduced in Part 1 of this series. In a nutshell, we take a random sample from the fake data (with replacement), calculate the mean of that random sample, and store it in the vector boot_dat. From there, we calculate the summary statistics of the original sample and the bootstrap resamples, which we store in a data frame for comparison purposes, and then produce a histogram of the original sample and the bootstrap resamples, which are visualized below the code.

library(tidyverse)

## create fake data
dat <- c(5, 10, 44, 3, 29, 10, 16.7, 22.3, 28, 1.4, 25)

### Bootstrap Resamples ###
# we want 1000 bootstrap resamples
n_boots <- 1000

## create an empty vector to store our bootstraps
boot_dat <- rep(NA, n_boots)

# set seed for reproducibility
set.seed(786)

# write for() loop for the resampling
for(i in 1:n_boots){
  # random sample of 1:n number of observations in our data, with replacement
  ind <- sample(1:length(dat), replace = TRUE)
  
  # Use the row indexes to select the given values from the vector and calculate the mean
  boot_dat[i] <- mean(dat[ind])
}

# Look at the first 6 bootstrapped means
head(boot_dat)

### Compare Bootstrap data to original data ###
## mean and standard deviation of the fake data
dat_mean <- mean(dat)
dat_sd <- sd(dat)

# standard error of the mean
dat_se <- sd(dat) / sqrt(length(dat))

# 95% confidence interval
dat_ci95 <- paste0(round(dat_mean - 1.96*dat_se, 1), ", ", round(dat_mean + 1.96*dat_se, 1))

# mean and SD of bootstrapped data
boot_mean <- mean(boot_dat)

# the vector is the mean of each bootstrap sample, so the standard deviation of these means represents the standard error
boot_se <- sd(boot_dat)

# to get the standard deviation we can convert the standard error back
boot_sd <- boot_se * sqrt(length(dat))

# 95% interval (normal approximation using the bootstrap standard error)
boot_ci95 <- paste0(round(boot_mean - 1.96*boot_se, 1), ", ", round(boot_mean + 1.96*boot_se, 1))

## Put everything together
data.frame(data = c("fake sample", "bootstrapped resamples"),
           N = c(length(dat), length(boot_dat)),
           mean = c(dat_mean, boot_mean),
           sd = c(dat_sd, boot_sd),
           se = c(dat_se, boot_se),
           ci95 = c(dat_ci95, boot_ci95)) %>%
  knitr::kable()

# plot the distributions
par(mfrow = c(1, 2))
hist(dat,
     xlab = "Observations",
     main = "Fake Data")
abline(v = dat_mean,
       col = "red",
       lwd = 3,
       lty = 2)
hist(boot_dat,
     xlab = "bootstrapped means",
     main = "1000 bootstrap resamples")
abline(v = boot_mean,
       col = "red",
       lwd = 3,
       lty = 2)
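
The interval computed in the code above uses the normal approximation (mean ± 1.96 × standard error). A percentile (quantile) interval is a common alternative that reads the bounds straight off the bootstrap distribution. The following is a minimal sketch rather than part of the original code: the names boot_means, pct_ci95, and norm_ci95 are introduced here only for illustration, and replicate() stands in for the for() loop.

```r
## same fake data as above
dat <- c(5, 10, 44, 3, 29, 10, 16.7, 22.3, 28, 1.4, 25)

## 1000 bootstrap means, each from a resample the same size as the data
set.seed(786)
boot_means <- replicate(
  1000,
  mean(sample(dat, size = length(dat), replace = TRUE))
)

## percentile interval: the 2.5th and 97.5th percentiles of the bootstrap means
pct_ci95 <- quantile(boot_means, probs = c(0.025, 0.975))

## normal-approximation interval, for comparison
norm_ci95 <- mean(boot_means) + c(-1.96, 1.96) * sd(boot_means)

pct_ci95
norm_ci95
```

With a small, skewed sample like this one the two intervals will usually differ a little, because the percentile interval does not force the bounds to be symmetric around the mean.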

R offers a bootstrap function from the boot package that allows you to do the same thing without writing out your own for() loop. Below is an example of coding the same procedure in the boot package and the outputs the function provides, which are similar to the output we get above, save for slight differences due to random sampling.

# write a function to calculate the mean of our sample data
sample_mean <- function(x, d){
     return(mean(x[d]))
}

# run the boot() function
library(boot)

# run the boot function
boot_func_output <- boot(dat, statistic = sample_mean, R = 1000)

# produce a plot of the output
plot(boot_func_output)

# get the mean and standard error
boot_func_output

# get 95% CI around the mean
boot.ci(boot_func_output, type = "basic", conf = 0.95)

We can bootstrap pretty much anything we want. We don’t have to limit ourselves to producing the distribution around the mean of a population. For example, let’s bootstrap regression coefficients to understand the uncertainty in them.

First, let’s use the boot() function to conduct our analysis. We will fit a simple linear regression predicting miles per gallon from vehicle weight using the mtcars data set. We will then write a function that uses a random sample of rows to fit the same linear regression, and store the coefficients from 1000 of these regressions so that we can plot a histogram of the slope coefficient from the resampled models and also summarize the distribution with confidence intervals.

# load the mtcars data
d <- mtcars

d %>%
  head()

# fit a regression model
fit_mpg <- lm(mpg ~ wt, data = d)
summary(fit_mpg)
coef(fit_mpg)
confint(fit_mpg)

# Write a function that can perform a bootstrap over the intercept and slope of the model
# bootstrap function
reg_coef_boot <- function(data, row_id){
  # resample the rows of the data passed in by boot()
  fit <- lm(mpg ~ wt, data = data[row_id, ])
  coef(fit)
}

# run this once on a small subset of the row ids to see how it works
reg_coef_boot(data = d, row_id = 1:20)

# run the boot() function 1000 times
coef_boot <- boot(data = d, reg_coef_boot, 1000) 

# check the output (coefficient and SE) 
coef_boot 

# get the confidence intervals 
boot.ci(coef_boot, index= 2) 

# all 1000 of the bootstrap resamples can be called
coef_boot$t %>%
  head()

# plot the first 20 bootstrapped intercepts and slopes over the original data
plot(x = d$wt,
     y = d$mpg,
     pch = 19)
for(i in 1:20){
  abline(a = coef_boot$t[i, 1],
       b = coef_boot$t[i, 2],
       lty = 2,
       lwd = 3,
       col = "light grey")
}

## histogram of the slope coefficient
hist(coef_boot$t[, 2])

We can do this by hand if we don’t want to use the built-in boot() function. (SIDE NOTE: I usually prefer to code my own resampling and simulations as it gives me more flexibility with respect to the things I’d like to add or the values I’d like to store from each iteration.)

Below, instead of producing a histogram of just the slope coefficient, I use both the resampled intercept and slope and add 20 (of the 1000) lines to a scatter plot to show how each of these lines represents a plausible regression line for the data. As you can see, the regression line confidence interval is starting to take shape, even with just 20 out of 1000 resamples. This gives us a good understanding of the variability in our possible regression line fits for the underlying data. It should perhaps also make us less overconfident in our research findings, knowing that there are many possible outcomes from the sample data we have obtained.

## 1000 resamples
n_samples <- 1000

## N observations
n_obs <- nrow(mtcars)

## empty storage data frame for the coefficients
coef_storage <- data.frame(
  intercept = rep(NA, n_samples),
  slope = rep(NA, n_samples)
)

for(i in 1:n_samples){
  
  ## sample dependent and independent variables
  row_ids <- sample(1:n_obs, size = n_obs, replace = TRUE)
  new_df <- d[row_ids, ]
  
  ## construct model
  model <- lm(mpg ~ wt, data = new_df)
  
  ## store coefficients
  # intercept
  coef_storage[i, 1] <- coef(model)[1]
  
  # slope
  coef_storage[i, 2] <- coef(model)[2]
  
}

## see results
head(coef_storage)
tail(coef_storage)

## Compare the results to those of the boot function
apply(X = coef_boot$t, MARGIN = 2, FUN = mean)
apply(X = coef_storage, MARGIN = 2, FUN = mean)

apply(X = coef_boot$t, MARGIN = 2, FUN = sd)
apply(X = coef_storage, MARGIN = 2, FUN = sd)

## plot first 20 lines
plot(x = d$wt,
     y = d$mpg,
     pch = 19)
for(i in 1:20){
  abline(a = coef_storage[i, 1],
       b = coef_storage[i, 2],
       lty = 2,
       lwd = 3,
       col = "light grey")
}
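
The resampled coefficients can also be summarized with a percentile interval and set beside the usual model-based interval from confint(). This is a hedged sketch, not the original analysis: it re-runs a compact version of the resampling with replicate() so the block stands on its own, and the names slopes, boot_ci, and model_ci are introduced here for illustration.

```r
## bootstrap the slope of mpg ~ wt by resampling rows of mtcars
set.seed(2024)
slopes <- replicate(1000, {
  row_ids <- sample(nrow(mtcars), replace = TRUE)
  coef(lm(mpg ~ wt, data = mtcars[row_ids, ]))[["wt"]]
})

## 95% percentile interval from the bootstrap distribution
boot_ci <- quantile(slopes, probs = c(0.025, 0.975))

## 95% model-based interval from the original fit
model_ci <- confint(lm(mpg ~ wt, data = mtcars))["wt", ]

boot_ci
model_ci
```

The two intervals need not agree exactly: the bootstrap interval makes no assumption about the residuals being normally distributed, while confint() relies on the t-distribution.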

Simulating a relationship between two variables

As discussed in Part 1, simulation differs from resampling in that we use the parameters of the observed data to compute a new distribution, versus sampling from the data we have on hand.

For example, using the mean and standard deviation of mpg from the mtcars data set, we can simulate 1000 random draws from a normal distribution.

## load the mtcars data set
d <- mtcars

## make random draws from the normal distribution for mpg
set.seed(5)
mpg_sim <- rnorm(n = 1000, mean = mean(d$mpg), sd = sd(d$mpg))

## plot and summarize
hist(mpg_sim)

mean(mpg_sim)
sd(mpg_sim)

Frequently, we are interested in the relationship between two variables (e.g., correlation, regression, etc.). Let’s simulate two variables, x and y, which are linearly related in some way. To do this, we first simulate the variable x and then simulate y to be x plus some level of random noise.

# simulate x and y
set.seed(1098)
x <- rnorm(n = 10, mean = 50, sd = 10)
y <- x + rnorm(n = length(x), mean = 0, sd = 10)

# put the results in a data frame
dat <- data.frame(x, y)
dat

# how correlated are the two variables
cor.test(x, y)

# fit a regression for the two variables
fit <- lm(y ~ x)
summary(fit)

# plot the two variables with the regression line
plot(x, y, pch = 19)
abline(fit, col = "red", lwd = 2, lty = 2)
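
It can help to know what correlation this recipe implies before looking at a small sample. Because y is x plus independent noise, cov(x, y) equals var(x) = 10^2 = 100 and var(y) = 10^2 + 10^2 = 200, so the expected correlation is 100 / (10 * sqrt(200)), roughly 0.71. The sketch below checks that value with a much larger simulated sample; n_big, theoretical_r, and the *_big vectors are names made up for this illustration.

```r
## standard deviations used in the simulation above
sd_x <- 10
sd_noise <- 10

## correlation implied by y = x + noise
theoretical_r <- sd_x / sqrt(sd_x^2 + sd_noise^2)

## check against a large simulated sample
set.seed(1098)
n_big <- 1e5
x_big <- rnorm(n = n_big, mean = 50, sd = sd_x)
y_big <- x_big + rnorm(n = n_big, mean = 0, sd = sd_noise)

theoretical_r
cor(x_big, y_big)
```

With only 10 observations, as in the example above, the sample correlation can land quite far from this value, which is a useful reminder of how noisy small samples are.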

Simulating a data set with multiple variables

Frequently, we might have a hypothesis regarding how correlated multiple variables are with each other. The example above produced a relationship of two variables with a direct relationship between them along with some noise. We might want to specify this relationship given a correlation coefficient or covariance between them. Additionally, we might have more than two variables that we want to simulate relationships between.

To do this in R we can take advantage of two packages:

MASS via mvrnorm()
mvtnorm via rmvnorm()

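Both packages have a function for simulating multivariate normal distributions, and in both cases the matrix argument (Sigma in MASS::mvrnorm() and sigma in mvtnorm::rmvnorm()) is treated as a covariance matrix. A correlation matrix is simply the covariance matrix of variables that each have a standard deviation of 1, so either function will also accept one; the simulated variables will then have the correlations you specified but unit variances. I’ll show both examples, passing a covariance matrix to mvrnorm() and a correlation matrix to rmvnorm(). I tend to stick with the mvtnorm package and correlation matrices because (at least for my brain) it is easier to think in terms of correlation coefficients instead of covariances.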

First we simulate some data:

## create fake data
set.seed(1234)
fake_dat <- data.frame(
  group = rep(c("a", "b", "c"), each = 5),
  x = rnorm(n = 15, mean = 10, sd = 2),
  y = rnorm(n = 15, mean = 30, sd = 10),
  z = rnorm(n = 15, mean = 75, sd = 20)
)

fake_dat

Look at the correlation and covariance between the three numeric variables.

# correlation
round(cor(fake_dat[, -1]), 3)

# variance-covariance matrix
round(var(fake_dat[, -1]), 3)

We can use this information to simulate new x, y, or z variables.

Simulating x and y with the MASS package

Remember, for the MASS package, the Sigma argument is a matrix of covariances for the variables you are simulating from a multivariate normal distribution.

## get a vector of the mean for each variable
variable_means <- apply(X = fake_dat[, c("x", "y")], MARGIN = 2, FUN = mean)

## Get a matrix of the covariance between x and y
variable_sigmas <- var(fake_dat[, c("x", "y")])

## simulate 1000 new x and y variables using the MASS package
set.seed(98)
new_sim <- MASS::mvrnorm(n = 1000, mu = variable_means, Sigma = variable_sigmas)
head(new_sim)

### look at the results relative to the original x and y
## column means
variable_means
apply(X = new_sim, MARGIN = 2, FUN = mean)

## covariance
var(fake_dat[, c("x","y")])
var(new_sim)

Notice that the covariance between x and y (the off diagonal of the matrix) is very similar between the fake data (the observed sample) and the simulated data (new_sim).

Simulating x and y with the mvtnorm package

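Unlike in the MASS example, here we pass a correlation matrix to the sigma argument of the rmvnorm() function from the mvtnorm package. The function treats sigma as a covariance matrix, so supplying a correlation matrix means the simulated variables will have a standard deviation of 1 while keeping the correlation we specify.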

Let’s repeat the above process with our fake_dat and simulate a relationship between x and y.

## get a vector of the mean for each variable
variable_means <- apply(X = fake_dat[, c("x", "y")], MARGIN = 2, FUN = mean)

## Get a matrix of the correlation between x and y
variable_cor <- cor(fake_dat[, c("x", "y")])

## simulate 1000 new x and y variables using the mvtnorm package
set.seed(98)
new_sim <- mvtnorm::rmvnorm(n = 1000, mean = variable_means, sigma = variable_cor)
head(new_sim)

### look at the results relative to the original x and y
## column means
variable_means
apply(X = new_sim, MARGIN = 2, FUN = mean)

## correlation
cor(fake_dat[, c("x","y")])
cor(new_sim)

Similar to the previous example, we notice that the off diagonal correlation coefficient between x and y is very similar when comparing the simulated data to the fake data.

So, what is happening here? Both simulations reproduce the relationship between x and y; one was given a covariance matrix and the other a correlation matrix. The kicker here is understanding the relationship between covariance and correlation. Covariance describes how two variables vary together; however, its units aren’t on a scale that is directly interpretable to us. We can convert the covariance between two variables to a correlation by dividing their covariance by the product of their individual standard deviations.

For example, here is the covariance matrix between x and y in the fake data set.

cov(fake_dat[, c("x", "y")])

The covariance between the two variables is on the off diagonal, 2.389. We can store this in its own element.

cov_xy <- cov(fake_dat[, c("x", "y")])[2,1]
cov_xy

Let’s store the standard deviation of both x and y in their own elements to make the equation easier to read.

sd_x <- sd(fake_dat$x)
sd_y <- sd(fake_dat$y)

Finally, we calculate the correlation by dividing the covariance by the product of the two standard deviations and check our results by calling the cor() function on the two variables.

## covariance to correlation
cov_to_cor <- cov_xy / (sd_x * sd_y)
cov_to_cor

## check results with the cor() function
cor(fake_dat[, c("x", "y")])

By dividing the covariance by the product of the standard deviations of the two variables we can see the relationship between covariance and correlation, and now understand why the simulations from the MASS and mvtnorm packages produce similar relationships between x and y. Understanding this relationship becomes valuable when we move on to simulating more complex relationships, for example when simulating mixed models.
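
The same conversion works on a whole matrix at once. Base R's cov2cor() turns a covariance matrix into a correlation matrix, and multiplying a correlation matrix on both sides by a diagonal matrix of standard deviations goes the other way. The sketch below rebuilds an x/y pair so that it runs on its own; S, R, sds, and S_rebuilt are illustrative names, not objects from the code above.

```r
## rebuild a small x/y data set so the block is self-contained
set.seed(1234)
xy <- data.frame(
  x = rnorm(n = 15, mean = 10, sd = 2),
  y = rnorm(n = 15, mean = 30, sd = 10)
)

## covariance matrix and its standard deviations
S <- cov(xy)
sds <- sqrt(diag(S))

## covariance -> correlation
R <- cov2cor(S)

## correlation -> covariance
S_rebuilt <- diag(sds) %*% R %*% diag(sds)

all.equal(R, cor(xy))
all.equal(S_rebuilt, S, check.attributes = FALSE)
```

Going from correlation back to covariance is the direction you need when you think in correlation coefficients but want the simulated variables to keep their original spread.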

What about three variables?

What if we want to simulate all three variables — x, y, and z?

All we need is a larger covariance or correlation matrix, depending on which of the above packages you’d like to use. Since we usually won’t be creating these matrices from a data set, as I did above, I’ll show how to create your own matrix and run the simulation.

First, let’s store a vector of plausible mean values for x, y, and z.

## Look at the mean values we had in the fake data
apply(X = fake_dat[, c("x", "y", "z")], MARGIN = 2, FUN = mean)

## create a vector of possible mean values for the simulation
mus <- c(9, 26, 63)

## look at the correlation matrix for the three variables in the fake data
cor(fake_dat[, c("x", "y", "z")])

## Create a matrix that stores plausible correlations between the variables you want to simulate
r_matrix <- matrix(c(1, 0.14, -0.24,
                    0.14, 1, -0.35,
                    -0.24, -0.35, 1), 
                   nrow = 3, ncol = 3,
       dimnames = list(c("x", "y", "z"),
                       c("x", "y", "z")))

r_matrix

Next, we create 1000 simulations from a multivariate normal distribution for x, y, and z. We then compare the correlation coefficients between the fake data, which is our observed sample, and the simulated data.

## simulate 1000 new x, y, and z variables using the mvtnorm package
set.seed(43)
new_sim <- mvtnorm::rmvnorm(n = 1000, mean = mus, sigma = r_matrix)
head(new_sim)

### look at the results relative to the original x, y, and z
## column means
apply(X = fake_dat[, c("x", "y", "z")], MARGIN = 2, FUN = mean)
apply(X = new_sim, MARGIN = 2, FUN = mean)

## correlation
cor(fake_dat[, c("x", "y", "z")])
cor(new_sim)

The results from the simulation are pretty similar to the fake_dat dataset. Recall that for the correlation matrix and the vector of means we didn’t use the exact values from the observed data; that, along with the random draws from the multivariate normal distribution, leads to the small differences.
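
Both mvrnorm() and rmvnorm() treat their matrix argument as a covariance matrix, so a correlation matrix on its own implies a standard deviation of 1 for every simulated variable. If you would also like the simulated variables to have realistic spreads, one option is to rescale the correlation matrix into a covariance matrix first. The sketch below assumes standard deviations of 2, 10, and 20 (the values used to generate fake_dat); sds, cov_matrix, and sim are names introduced here for illustration, and MASS::mvrnorm() is used so that the only dependency is a package that ships with most R installations.

```r
## plausible means and (assumed) standard deviations for x, y, and z
mus <- c(x = 9, y = 26, z = 63)
sds <- c(x = 2, y = 10, z = 20)

## plausible correlations between the three variables
r_matrix <- matrix(c(1, 0.14, -0.24,
                     0.14, 1, -0.35,
                     -0.24, -0.35, 1),
                   nrow = 3, ncol = 3,
                   dimnames = list(names(mus), names(mus)))

## rescale the correlation matrix into a covariance matrix
cov_matrix <- diag(sds) %*% r_matrix %*% diag(sds)
dimnames(cov_matrix) <- dimnames(r_matrix)

## simulate 1000 draws with the chosen means, spreads, and correlations
set.seed(43)
sim <- MASS::mvrnorm(n = 1000, mu = mus, Sigma = cov_matrix)

round(apply(X = sim, MARGIN = 2, FUN = sd), 1)
round(cor(sim), 2)
```

The simulated correlations should sit close to r_matrix and the column standard deviations close to sds, apart from sampling noise.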

Wrapping Up

In this second installment of the Simulations in R series we’ve walked through how to code our own bootstrap resampling for both means and regression coefficients. We then progressed to building simulations of both bivariate and multivariate normal distributions. This, along with the basic info in Part 1, will serve us well as we progress forward in our work and begin to explore using simulations for comparisons between group means (simulated t-tests) and building regression models.

As always, all of the code is available in the Github repository.

The High Performance Hockey Podcast Interview

This week, I had the great pleasure of being interviewed by my good friend and colleague Anthony Donskov for his High Performance Hockey Podcast.

Anthony has done a tremendous job for the sports science and strength and conditioning community in his teaching, writing, and podcasting. He brings a wealth of knowledge from both the applied strength coach realm all the way through to his PhD work.

In this podcast interview, Anthony and I discuss:

Data analysis
The PPDAC Framework for conducting research
My criticisms of applied sport science
The challenge of measuring hard things and things that matter in applied sport.

Check out the podcast HERE.

R Tips & Tricks: Normalizing test dates & calculating test differences

A friend of mine was downloading some force plate data from the software provider so that he could evaluate test data in a few of his athletes during return to play. The issue he was running into was that the different athletes all had different numbers of tests and different start and end testing times. The software exports the test outputs by date and he was wondering how he could normalize the dates to numeric values (e.g. Test 1, Test 2, etc.) so that he could model the date (since we can’t really use a Date in a regression model).

I’ll be the first to admit that working with dates and times can be an incredible pain in the butt. For reference, I covered the topic of converting Catapult GPS practice duration strings to actual training minutes, HERE. To help him out, I provided a few different solutions depending on the research question. I also add some code for calculating changes in test performance between tests and from each test to baseline.

The full code is available on my GITHUB page.

Loading Packages & Simulating Data

## load packages ----------------------------------------------
library(tidyverse)
library(lubridate)

## Simulate data ----------------------------------------------
set.seed(78)
dat <- tibble(
  
  athlete = rep(c("Tom", "Bob", "Franklin"), times = c(5, 10, 3)),
  test_dates = c(
    seq(as.Date("2023-01-01"), as.Date("2023-01-5"), by = "days"),
    seq(as.Date("2023-02-15"), as.Date("2023-02-24"), by = "days"),
    as.Date(c("2023-01-19", "2023-01-30", "2023-02-26"))
  ),
  jump_height = round(rnorm(n = 18, mean = 28, sd = 2.5), 1)
  
)

dat

We can see that Tom has 5 tests, Bob has 10, and Franklin has only 3. Additionally, Tom and Bob tested every day, consecutively, while Franklin was less compliant and has larger time frames between his tests.

Create a test number

First, let’s normalize the Dates so that they are numeric. Basically, instead of dates we want a value indicating whether the test was test 1, or test 5, or test N. We will do this by creating a row_number() id/counter for each individual athlete.

### Create a test number ------------------------------------------
dat <- dat %>%
  group_by(athlete) %>%
  mutate(test_day = row_number())

dat

Calculating the time between tests

Alternatively, we may want to know not just the test number of each test but also the number of days between tests.

The code to do this is a bit ugly looking so let’s unpack it.

Since we are dealing with dates we use the difftime() function which takes an argument for the two times you are looking to calculate the difference between. Here, we are trying to calculate the difference in time (days) between one date and the date preceding it for each individual athlete.
The difftime() function will produce a difftime (time difference) object. To make this numeric, we first convert it to a character with the as.character() function.
Once the variable is a character we use the as.numeric() function to convert it to a numeric value.
Finally, since the first value for each athlete will be an NA, since there is no date preceding the first test, we use the coalesce() function to fill in a 0 value for each of the NA’s, to indicate that this was the first test and thus there was no time between it and any other test.

### Calculate the time between tests -------------------------------
dat <- dat %>%
  group_by(athlete) %>%
  mutate(time_btw_tests = coalesce(as.numeric(as.character(difftime(test_dates, lag(test_dates)))), 0))

dat

Notice that Tom and Bob have 1 day between all of their tests while Franklin’s second test was 11 days after his first and his third test was 27 days after his second.
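
As a side note, the character round trip is not strictly required: as.numeric() can be applied to a difftime object directly, and fixing units = "days" guards against difftime() choosing a different unit on its own. Here is a small base R sketch using Franklin's three test dates; franklin_dates, days_btw, and days_from_first are names made up for this example.

```r
## Franklin's test dates from the simulated data
franklin_dates <- as.Date(c("2023-01-19", "2023-01-30", "2023-02-26"))

## days between consecutive tests, with 0 for the first test
days_btw <- c(
  0,
  as.numeric(difftime(franklin_dates[-1],
                      franklin_dates[-length(franklin_dates)],
                      units = "days"))
)
days_btw
# 0 11 27

## days since the first test
days_from_first <- as.numeric(difftime(franklin_dates, franklin_dates[1], units = "days"))
days_from_first
# 0 11 38
```

Inside the grouped mutate() above, the same idea would read as.numeric(difftime(test_dates, lag(test_dates), units = "days")), still wrapped in coalesce() to fill in the 0 for the first test.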

Calculate the difference in jump height from one test to the next

### Calculate difference in jump height from one day to the next -------------------
dat <- dat %>%
  group_by(athlete) %>%
  mutate(test_to_test_diff = jump_height - lag(jump_height))

dat

Here, we use the lag() function to calculate the difference between each value and the value before it within the same column. Since we grouped by athlete, which is what we want, each athlete’s first test will always have an NA in this new column, since there was no test preceding it.

Calculating the difference in jump height from the baseline test

Finally, we might also be interested to evaluate the performance on each test relative to the athlete’s baseline test. To do this we simply subtract the jump_height in row one for each athlete (the baseline) from each of that athlete’s jump_height values.

### Calculate difference in jump height from each test to the baseline test -------------

dat <- dat %>%
  group_by(athlete) %>%
  mutate(test_to_baseline_diff = jump_height - jump_height[1])

dat

Wrapping Up

Dates and times are always tricky to deal with. Most sports technology providers will provide data as dates (or unix timestamps), meaning that we have to do some cleaning of the data to codify the dates as numeric values representing the test number or the days between tests (depending on the research question). Additionally, lag functions can be helpful for calculating the difference from one test to the next or from each test to the baseline.

The entire code is available on my GITHUB page.

If you have any data cleaning issues that you are dealing with from various sports science technologies, feel free to reach out!

Different ways of calculating intervals of uncertainty

I’ve talked a lot in this blog about making predictions (see HERE, HERE, and HERE) as well as the difference between confidence intervals and prediction intervals and why you’d use one over the other (see HERE). Tonight I was having a discussion with a colleague about some models he was working on and he was building some confidence intervals around his predictions. That got me thinking about the various ways we can code confidence intervals, quantile intervals, and prediction intervals in R. So, I decided to put together this quick tutorial to provide a few different ways of constructing these values (after all, unless we can calculate the uncertainty in our predictions, point estimate predictions are largely useless on their own).

The full code is available on my GITHUB page.

Load packages, get data, and fit regression model

The only package we will need is {tidyverse}; the data will be the mtcars dataset and the model will be a linear regression that attempts to predict mpg from wt and carb.

## Load packages
library(tidyverse)

theme_set(theme_classic())

## Get data
d <- mtcars

d %>%
  head()

## fit model
fit_lm <- lm(mpg ~ wt + carb, data = d)
summary(fit_lm)

Get some data to make predictions on

We will just grab a random sample of 5 rows from the original data set and use that to make some predictions on.

## Get a few rows to make predictions on
set.seed(1234)
d_sample <- d %>%
  sample_n(size = 5) %>%
  select(mpg, wt, carb)

d_sample

Confidence Intervals with the predict() function

Using predict() we calculate the predicted value with 95% Confidence Intervals.

## 95% Confidence Intervals
d_sample %>%
  bind_cols(
    predict(fit_lm, newdata = d_sample, interval = "confidence", level = 0.95)
  )

Calculate confidence intervals by hand

Instead of using the R function, we can calculate the confidence intervals by hand (and obtain the same result).

## Calculate the 95% confidence interval by hand
level <- 0.95
alpha <- 1 - (1 - level) / 2
t_crit <- qt(p = alpha, df = fit_lm$df.residual) 

d_sample %>%
  mutate(pred = predict(fit_lm, newdata = .),
         se_pred = predict(fit_lm, newdata = ., se = TRUE)$se.fit,
         cl95 = t_crit * se_pred,
         lwr = pred - cl95,
         upr = pred + cl95)
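
For anyone curious where se.fit comes from, it can be reproduced from the model's coefficient covariance matrix: for a new row with design vector x, the standard error of the fitted value is sqrt(x' V x), where V is vcov(fit_lm). The sketch below is a hedged illustration on three arbitrarily chosen rows of mtcars; new_rows, X_new, se_by_hand, and se_predict are names introduced here.

```r
## same model as above
fit_lm <- lm(mpg ~ wt + carb, data = mtcars)

## a few rows to predict on (chosen arbitrarily for illustration)
new_rows <- mtcars[c(1, 10, 20), c("wt", "carb")]

## design matrix for the new rows (intercept, wt, carb)
X_new <- model.matrix(~ wt + carb, data = new_rows)

## standard error of each fitted value: sqrt of the diagonal of X V X'
se_by_hand <- sqrt(diag(X_new %*% vcov(fit_lm) %*% t(X_new)))

## the value predict() reports
se_predict <- predict(fit_lm, newdata = new_rows, se.fit = TRUE)$se.fit

all.equal(unname(se_by_hand), unname(se_predict))
```

This standard error reflects only the uncertainty in the estimated regression line, which is why it is the right ingredient for a confidence interval but not, on its own, for a prediction interval.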

Calculate confidence intervals with the qnorm() function

Above, we calculated a 95% t-critical value for the degrees of freedom of our model. Alternatively, we could calculate 95% confidence intervals using the standard z-critical value for 95%, 1.96, which we obtain with the qnorm() function.

d_sample %>%
  mutate(pred = predict(fit_lm, newdata = .),
         se_pred = predict(fit_lm, newdata = ., se = TRUE)$se.fit,
         lwr = pred + qnorm(p = 0.025, mean = 0, sd = 1) * se_pred,
         upr = pred + qnorm(p = 0.975, mean = 0, sd = 1) * se_pred)
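
The z-based intervals will be slightly narrower than the t-based ones because the t critical value is larger at small sample sizes. A quick sketch comparing the two for this model (t_crit and z_crit are illustrative names):

```r
## same model as above
fit_lm <- lm(mpg ~ wt + carb, data = mtcars)

## 29 residual degrees of freedom (32 cars minus 3 coefficients)
fit_lm$df.residual

## two-sided 95% critical values
t_crit <- qt(p = 0.975, df = fit_lm$df.residual)
z_crit <- qnorm(p = 0.975)

## the ratio shows how much wider the t-based interval is
c(t = t_crit, z = z_crit, ratio = t_crit / z_crit)
```

As the residual degrees of freedom grow, the two critical values converge and the choice matters less.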

Calculate quantile intervals via simulation

Finally, we can calculate quantile intervals by simulating predictions using the predicted value and standard error for each of the observations. We simulate 1000 times from a normal distribution and then use the quantile() function to get our quantile intervals.

If all we care about is a predicted value and the lower and upper intervals, we can use the rowwise() function to indicate that we are going to do a simulation for each row and then store the end result (our lower and upper quantile intervals) in a new column.

## 95% Quantile Intervals via Simulation
d_sample %>%
  mutate(pred = predict(fit_lm, newdata = .),
         se_pred = predict(fit_lm, newdata = ., se = TRUE)$se.fit) %>%
  rowwise() %>%
  mutate(lwr = quantile(rnorm(n = 1000, mean = pred, sd = se_pred), probs = 0.025),
         upr = quantile(rnorm(n = 1000, mean = pred, sd = se_pred), probs = 0.975))

While that is useful, there might be times where we want to extract the full simulated distribution. We can create a simulated distribution (1000 simulations) for each of the 5 observations using a for() loop.

## 95% quantile intervals via Simulation with full distribution
N <- 1000
pred_sim <- list()

set.seed(8945)
for(i in 1:nrow(d_sample)){
  
  pred <- predict(fit_lm, newdata = d_sample[i, ])
  se_pred <- predict(fit_lm, newdata = d_sample[i, ], se = TRUE)$se.fit
  
  pred_sim[[i]] <- rnorm(n = N, mean = pred, sd = se_pred)
  
}

sim_df <- tibble(
  sample_row = rep(1:5, each = N),
  pred_sim = unlist(pred_sim)
)

sim_df %>%
  head()

Next we summarize the simulation for each observation.

# get predictions and quantile intervals
sim_df %>%
  group_by(sample_row) %>%
  summarize(pred = mean(pred_sim),
         lwr = quantile(pred_sim, probs = 0.025),
         upr = quantile(pred_sim, probs = 0.975)) %>%
  mutate(sample_row = rownames(d_sample))

We can then plot the entire simulated distribution for each observation.

# plot the predicted distributions
sim_df %>%
  mutate(actual_value = rep(d_sample$mpg, each = N),
         sample_row = case_when(sample_row == 1 ~ "Hornet 4 Drive",
                                sample_row == 2 ~ "Toyota Corolla",
                                sample_row == 3 ~ "Honda Civic",
                                sample_row == 4 ~ "Ferrari Dino",
                                sample_row == 5 ~ "Pontiac Firebird")) %>%
  ggplot(aes(x = pred_sim)) +
  geom_histogram(color = "white",
                 fill = "light grey") +
  geom_vline(aes(xintercept = actual_value),
             color = "red",
             size = 1.2,
             linetype = "dashed") +
  facet_wrap(~sample_row, scales = "free_x") +
  labs(x = "Predicted Simulation",
       y = "count",
       title = "Predicted Simulation with actual observation (red line)",
       subtitle = "Note that the x-axes are specific to each simulation and not the same")

Prediction Intervals with the predict() function

Next we turn attention to prediction intervals, which will be wider than the confidence intervals because they are incorporating additional uncertainty.

The predict() function makes calculating prediction intervals very convenient.

## 95% Prediction Intervals
d_sample %>%
  bind_cols(
    predict(fit_lm, newdata = d_sample, interval = "predict", level = 0.95)
  )
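
The additional uncertainty in a prediction interval is the model's residual standard error, which is combined with the standard error of the fitted value as sqrt(se_fit^2 + sigma^2). The sketch below reproduces the predict() result by hand on three arbitrarily chosen rows of mtcars; new_rows, se_pi, by_hand, and built_in are names introduced for this illustration.

```r
## same model as above
fit_lm <- lm(mpg ~ wt + carb, data = mtcars)

## a few rows to predict on (chosen arbitrarily for illustration)
new_rows <- mtcars[c(1, 10, 20), ]

## fitted values, their standard errors, and the residual standard error
p <- predict(fit_lm, newdata = new_rows, se.fit = TRUE)
t_crit <- qt(p = 0.975, df = fit_lm$df.residual)

## combine the two sources of uncertainty
se_pi <- sqrt(p$se.fit^2 + p$residual.scale^2)

by_hand <- cbind(fit = p$fit,
                 lwr = p$fit - t_crit * se_pi,
                 upr = p$fit + t_crit * se_pi)

built_in <- predict(fit_lm, newdata = new_rows, interval = "prediction", level = 0.95)

all.equal(by_hand, built_in, check.attributes = FALSE)
```

Because sigma is usually much larger than se.fit, the prediction interval is dominated by the residual variation and barely narrows as more data are collected.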

Prediction Intervals from a simulated distribution

Similar to how we simulated a distribution for calculating quantile intervals, above, we will perform the same procedure here. The difference is that we need to get the residual standard error (RSE) from our model as we need to add this additional piece of uncertainty (on top of the predicted standard error) to each of the simulated predictions.

## 95% prediction intervals from a simulated distribution 
# store the model residual standard error
sigma <- summary(fit_lm)$sigma

# run simulation
N <- 1000
pred_sim2 <- list()

set.seed(85)
for(i in 1:nrow(d_sample)){
  
  pred <- predict(fit_lm, newdata = d_sample[i, ])
  se_pred <- predict(fit_lm, newdata = d_sample[i, ], se = TRUE)$se.fit
  
  pred_sim2[[i]] <- rnorm(n = N, mean = pred, sd = se_pred) + rnorm(n = N, mean = 0, sd = sigma)
  
}

# put results in a data frame
sim_df2 <- tibble(
  sample_row = rep(1:5, each = N),
  pred_sim2 = unlist(pred_sim2)
)

sim_df2 %>%
  head()

We summarize our predictions and their intervals.

# get predictions and intervals
sim_df2 %>%
  group_by(sample_row) %>%
  summarize(pred = mean(pred_sim2),
            lwr = quantile(pred_sim2, probs = 0.025),
            upr = quantile(pred_sim2, probs = 0.975)) %>%
  mutate(sample_row = rownames(d_sample))

Finally, we plot the simulated distributions for each of the observations.

Wrapping Up

Uncertainty is important to be aware of and to convey whenever you share your predictions. The point estimate prediction is just a single value among many plausible values given the data generating process. This article provided a few different approaches for calculating uncertainty intervals. The full code is available on my GITHUB page.

Plotting Mixed Model Outputs

This weekend I posted two new blog articles about building reports that contained both data tables and plots on the same canvas (see HERE and HERE). As a follow up, James Baker asked if I could do some plotting of mixed model outputs. That got me thinking, since I’ve done a few blog tutorials on mixed models (see HERE and HERE). Because he left it pretty wide open (“Do you have any guides on visualizing mixed models?”) I was trying to think about what aspects of the mixed models he’d like to visualize. R makes it relatively easy to plot random effects using the {lattice} package, but I figured we could go a little deeper and customize some of our own plots of the random effects as well as show how we might plot future predictions from a mixed model.

All of the code for this article is available on my GITHUB page.

Loading Packages & Data

As always we begin by loading some of the packages we require and the data. In this case, we will use the sleepstudy dataset, which is freely available from the {lme4} package.

## Load packages
library(tidyverse)
library(lme4)
library(lattice)
library(patchwork)

theme_set(theme_bw())

## load data
dat <- sleepstudy

dat %>%
  head()

Fit a mixed model

We will fit a mixed model that sets the dependent variable as Reaction time and the fixed effect as Days of sleep deprivation. We will also allow both the intercept and the slope for Days to vary randomly by Subject, so each subject gets their own baseline reaction time and their own rate of change across days.

## Fit mixed model
fit_lmer <- lmer(Reaction ~ Days + (1 + Days|Subject), data = dat)
summary(fit_lmer)

Inspect the random effects

We can see in the model output above that we have a random effect standard deviation for the Intercept (24.74) and for the slope, Days (5.92). We can extract the random effect intercept and slope for each subject with the code below. This tells us how much each subject’s slope and intercept vary from the population fixed effects (251.4 and 10.5 for the intercept and slope, respectively).

# look at the random effects
random_effects <- ranef(fit_lmer) %>%
  pluck(1) %>%
  rownames_to_column() %>%
  rename(Subject = rowname, Intercept = "(Intercept)") 

random_effects %>%
  knitr::kable()
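
To make the link between the random effects and the fixed effects concrete, here is a small sketch showing that each subject’s own intercept and slope is the fixed effect plus that subject’s random effect. It refits the same model under the hypothetical name fit_demo so that it can run on its own.

```r
library(lme4)

# Same model as above, refit so this sketch is self-contained
fit_demo <- lmer(Reaction ~ Days + (1 + Days | Subject), data = sleepstudy)

fixed_part <- fixef(fit_demo)            # population intercept and slope
random_part <- ranef(fit_demo)$Subject   # per-subject deviations from the fixed effects
subject_coefs <- coef(fit_demo)$Subject  # per-subject intercept and slope

# Rebuild the subject-level coefficients by adding each deviation to its fixed effect
rebuilt <- sweep(as.matrix(random_part), 2, fixed_part, FUN = "+")

all.equal(rebuilt, as.matrix(subject_coefs), check.attributes = FALSE)
head(subject_coefs)
```

So a subject with a random intercept of 0 would sit exactly on the population intercept, and positive or negative values shift them above or below it.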

Plotting the random effects

Aside from looking at a table of numbers, which can sometimes be difficult to draw conclusions from (especially if there are a large number of subjects) we can plot the data and make some observational inference.

The {lattice} package allows us to create caterpillar plots of the random effects for each subject with the dotplot() function.

## plot random effects
dotplot(ranef(fit_lmer))

That’s a pretty nice plot and easy to obtain with just a single line of code. But, we might want to create our own plot using {ggplot2} so that we have more control over the styling.

I’ll store the standard deviations of the random intercept and slope, from the model readout above, in their own elements. Then, I’ll use the random effects table we made above, which contains the intercept and slope of each subject, to plot them and add the standard deviations to them as error bars.

## Make one in ggplot2
subject_intercept_sd <- 24.74
subject_days_sd <- 5.92

int_plt <- random_effects %>%
  mutate(Subject = as.factor(Subject)) %>%
  ggplot(aes(x = Intercept, y = reorder(Subject, Intercept))) +
  geom_errorbar(aes(xmin = Intercept - subject_intercept_sd,
                    xmax = Intercept + subject_intercept_sd),
                width = 0,
                size = 1) +
  geom_point(size = 3,
             shape = 21,
             color = "black",
             fill = "white") +
  geom_vline(xintercept = 0,
             color = "red",
             size = 1,
             linetype = "dashed") +
  scale_x_continuous(breaks = seq(-60, 60, 20)) +
  labs(x = "Intercept",
       y = "Subject ID",
       title = "Random Intercepts")

slope_plt <- random_effects %>%
  mutate(Subject = as.factor(Subject)) %>%
  ggplot(aes(x = Days, y = reorder(Subject, Days))) +
  geom_errorbar(aes(xmin = Days - subject_days_sd,
                    xmax = Days + subject_days_sd),
                width = 0,
                size = 1) +
  geom_point(size = 3,
             shape = 21,
             color = "black",
             fill = "white") +
  geom_vline(xintercept = 0,
             color = "red",
             size = 1,
             linetype = "dashed") +
  xlim(-60, 60) +
  labs(x = "Slope",
       y = "Subject ID",
       title = "Random Slopes")

slope_plt / int_plt

We get a similar plot, but now we have more control. We can color the dots for specific subjects, display only specific subjects, flip the x- and y-axes, and so on.
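
The standard deviations typed in by hand above can also be pulled straight from the fitted model, which avoids copy errors if the model or data change. The sketch below is one way to do that with VarCorr(), refitting the model under the hypothetical name fit_demo so it runs on its own. It also pulls the per-subject conditional standard deviations, which is what dotplot() bases its intervals on, in case you want the {ggplot2} error bars to mirror the {lattice} version more closely.

```r
library(lme4)

# Same model as above, refit so this sketch is self-contained
fit_demo <- lmer(Reaction ~ Days + (1 + Days | Subject), data = sleepstudy)

# Variance components as a data frame; the sdcor column holds the standard deviations
vc <- as.data.frame(VarCorr(fit_demo))

# Keep the Subject rows that are standard deviations (var2 is NA), not the correlation row
sd_rows <- vc[vc$grp == "Subject" & is.na(vc$var2), ]

subject_intercept_sd <- sd_rows$sdcor[sd_rows$var1 == "(Intercept)"]
subject_days_sd <- sd_rows$sdcor[sd_rows$var1 == "Days"]
c(intercept = subject_intercept_sd, days = subject_days_sd)

# Per-subject conditional means (condval) and conditional SDs (condsd) in long format
cond_df <- as.data.frame(ranef(fit_demo, condVar = TRUE))
head(cond_df)
```

With cond_df, an error bar of condval plus or minus some multiple of condsd gives each subject their own interval width rather than one shared width for everyone.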

Plotting the model residuals

We can also plot the model residuals. Using the resid() function we can get the residuals directly from our mixed model, and the plot() function will automatically plot the residuals against the fitted values. These types of plots are useful for exploring assumptions such as normality of the residuals and homoscedasticity.

## Plot Residual
plot(fit_lmer)
hist(resid(fit_lmer))

As above, perhaps we want to have more control over the residual versus fitted plot, so that we can style it however we’d like. We can extract the fitted values and residuals and build our own plot using base R.

## Plotting our own residual ~ fitted
lmer_fitted <- predict(fit_lmer, newdata = dat, re.form = ~(1 + Days|Subject))
lmer_resid <- dat$Reaction - lmer_fitted

plot(x = lmer_fitted,
     y = lmer_resid,
     pch = 19,
     main = "Resid ~ Fitted",
     xlab = "Fitted",
     ylab = "Residuals")
abline(h = 0,
       col = "red",
       lwd = 3,
       lty = 2)
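
As a quick check, the hand-built residuals should match what resid() returns, since predicting on the original data with all of the random effects included gives the model’s fitted values. The sketch below confirms this and adds a normal Q-Q plot as another way to look at the normality assumption. The model is refit under the hypothetical name fit_demo so the example is self-contained.

```r
library(lme4)

# Same model as above, refit so this sketch is self-contained
fit_demo <- lmer(Reaction ~ Days + (1 + Days | Subject), data = sleepstudy)

# Residuals by hand: observed minus fitted (fitted values include the random effects)
manual_resid <- sleepstudy$Reaction - fitted(fit_demo)

# Should be TRUE: identical to the residuals extracted by resid()
all.equal(unname(manual_resid), unname(resid(fit_demo)))

# Normal Q-Q plot: points close to the line suggest roughly normal residuals
qqnorm(resid(fit_demo), pch = 19, main = "Normal Q-Q Plot of Residuals")
qqline(resid(fit_demo), col = "red", lwd = 2, lty = 2)
```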

Plotting Predictions

The final plot I’ll build shows the predictions of Reaction time as Days of sleep deprivation increase. This is time series data, so I’m going to extract the first 6 days of sleep deprivation for each subject and build the model using that data. Then, I’ll make predictions on the next 4 days of sleep deprivation for each subject and get both a predicted point estimate and a 90% prediction interval. In this way, we can observe the next 4 days of sleep deprivation for each subject and see how far outside of what we would expect (from our mixed model predictions) those values fall.

### Plotting the time series on new data
# training set
dat_train <- dat %>%
  group_by(Subject) %>%
  slice(head(row_number(), 6)) %>%
  ungroup()

# testing set
dat_test <- dat %>%
  group_by(Subject) %>%
  slice(tail(row_number(), 4)) %>%
  ungroup()

## Fit mixed model
fit_lmer2 <- lmer(Reaction ~ Days + (1 + Days|Subject), data = dat_train)
summary(fit_lmer2)

# Predict on training set
train_preds  <- merTools::predictInterval(fit_lmer2, newdata = dat_train, n.sims = 100, returnSims = TRUE, seed = 657, level = 0.9) %>%
  as.data.frame()

dat_train <- dat_train %>% bind_cols(train_preds)

dat_train$group <- "train"

# Predict on test set with 90% prediction intervals
test_preds  <- merTools::predictInterval(fit_lmer2, newdata = dat_test, n.sims = 100, returnSims = TRUE, seed = 657, level = 0.9) %>%
  as.data.frame()

dat_test <- dat_test %>% bind_cols(test_preds)

dat_test$group <- "test"

## Combine the data together
combined_dat <- bind_rows(dat_train, dat_test) %>%
  arrange(Subject, Days)

## Plot the time series of predictions and observed data
combined_dat %>%
  mutate(group = factor(group, levels = c("train", "test"))) %>%
  ggplot(aes(x = Days, y = Reaction)) +
  geom_ribbon(aes(ymin = lwr,
                  ymax = upr),
              fill = "light grey",
              alpha = 0.8) +
  geom_line(aes(y = fit),
            col = "red",
            size = 1) +
  geom_point(aes(fill = group),
             size = 3,
             shape = 21) +
  geom_line() +
  facet_wrap(~Subject) +
  theme(strip.background = element_rect(fill = "black"),
        strip.text = element_text(face = "bold", color = "white"),
        legend.position = "top") +
  labs(x = "Days",
       y = "Reaction Time",
       title = "Reaction Time based on Days of Sleep Deprivation")
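
Alongside the plot, it can help to count how many of the held-out observations actually fall outside their 90% prediction intervals. The sketch below is one way to do that. It assumes the {merTools} package is installed, rebuilds the train/test split in base R (Days 0 to 5 for training, Days 6 to 9 for testing), and uses hypothetical names (train, test, fit_demo, test_check) so that it runs independently of the code above.

```r
library(lme4)

# First 6 days per subject for training, last 4 days for testing
train <- subset(sleepstudy, Days <= 5)
test <- subset(sleepstudy, Days > 5)

fit_demo <- lmer(Reaction ~ Days + (1 + Days | Subject), data = train)

# 90% prediction intervals for the held-out days
test_pi <- merTools::predictInterval(fit_demo,
                                     newdata = test,
                                     n.sims = 1000,
                                     level = 0.9,
                                     seed = 657)

# Flag observations that fall outside their interval
test_check <- cbind(test, test_pi)
test_check$outside <- test_check$Reaction < test_check$lwr |
  test_check$Reaction > test_check$upr

# Proportion of held-out observations inside the interval
coverage <- mean(!test_check$outside)
coverage

# Number of out-of-interval observations for each subject
aggregate(outside ~ Subject, data = test_check, FUN = sum)
```

If the model extrapolates well, coverage should be somewhere near 0.9. Subjects with several flagged days are the ones whose later reaction times drift away from what their first 6 days suggested.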

Wrapping Up

Above are a few different plot options we have with mixed model outputs. I’m not sure what James was after or what he had in mind because he left the question very wide open. Hopefully this article provides some useful ideas for your own mixed model plotting. If there are other things you are hoping to see, or you have other ideas of things to plot from the mixed model output, feel free to reach out!

The complete code for this article is available on my GITHUB page.