Category Archives: Sports Analytics

Tidymodels Workflowsets Tutorials

Workflowsets in {tidymodels} provide a useful way for analysts to run several models, even tune several machine learning models on their dataset, simultaneously. You can then extract model fit information, predictions, and make comparisons to identify the most effective model type and the optimized tuning parameters.

Below are links to two workflowsets tutorials I’ve written. The first link is to an older tutorial that wrote and posted in the blog 3 months ago. This tutorial builds several machine learning models on NWSL data to solve a classification task. The second link is a more recent tutorial that walks through workflowsets and model tuning for outcomes that are both binary and continuous.

Both tutorials go over model building, hyperparameter tuning, model comparisons, making predictions, and storing the final model for deployment at a later time in {tidymodels} using workflowsets.

Hopefully both tutorials provide analysts with a clear overview of how to set up workflowsets and run several models in parallel, helping to make their work more efficient.

Tutorial 1

Tutorial 2

 

Simulations in R Part 4: Simulating Regression Models

We’ve been working on building simulations in R over the past few articles. We are now to the point where we want to simulate data using regression models. To recap what we’ve done so far:

  • Part 1 discussed the basic functions for simulating and sampling data in R.
  • Part 2 walked us through how to perform bootstrap resampling and then simulate bivariate and multivariate distributions.
  • Part 3 we worked making group comparisons by simulating thousands of t-tests

In an upcoming article we will use simulation to evaluate regression assumptions. But, before we do that, it might be useful to use a regression model reported in a research paper to simulate a data set. This can be useful if you are trying to better understand the paper (you can recalculate statistics and explore the data) or if you want to work on applying a different type of model approach to build hypotheses for future research.

As always, all code is freely available in the Github repository.

I’m going to take the regression model from a paper by Ferrari et al (2022), Performance and anthropometrics of classic powerlifters: Which characteristics matter? J Strength Cond Res., which looked to predict power lifting performance in male and female lifters.

The paper used several variables to try and predict Squat, Bench, Deadlift, and Powerlifting Total in raw power lifters. To keep this simple, I’ll focus on the model for predicting squat performance, which was:


The model had a standard error of estimate (SEE) of 20.3 and an r-squared of 0.83.

Getting the necessary info from the paper

Aside from the model coefficients, above, we also need to get the mean and standard deviation for each of the variables in the model (for both male and female lifters). We will use those parameters to simulate 1000 male and female lifters.

library(tidyverse)

# set the see for reproducibility
set.seed(82)

# sample size - simulate 1000 male and 1000 female lifters
n_males <- 1000
n_females <- 1000

# set the model coefficients for squats
intercept <- -145.7
years_exp_beta <- 4.3
pct_bf_beta <- -1.7
upper_arm_girth_beta <- 6
thigh_girth_beta <- 1.9

# Standard Error of Estimate (SEE) is sampled from a normal distribution with a mean = 0 and 
# standard deviation of 20.3
see <- rnorm(n = n_males + n_females, mean = 0, sd = 20.3)

# get mean and sd for each of the model variables
years_exp_male <- rnorm(n = n_males, mean = 2.6, sd = 2.4)
years_exp_male <- ifelse(years_exp_male < 0, 0, years_exp_male)
years_exp_female <- rnorm(n = n_females, mean = 1.6, sd = 1.5)
years_exp_female <- ifelse(years_exp_female < 0, 0, years_exp_female)
pct_bf_male <- rnorm(n = n_males, mean = 11.1, sd = 3.8)
pct_bf_female <- rnorm(n = n_females, mean = 21.7, sd = 5.4)
upper_arm_girth_male <- rnorm(n = n_males, mean = 35.6, sd = 2.8)
upper_arm_girth_female <- rnorm(n = n_females, mean = 29.5, sd = 3.1)
thigh_girth_male <- rnorm(n = n_males, mean = 61.1, sd = 5.5)
thigh_girth_female <- rnorm(n = n_females, mean = 56.1, sd = 4.9)

# put the simulated data into a data frame
dat <- data.frame( gender = c(rep("male", times = n_males), rep("female", times = n_females)), years_exp = c(years_exp_male, years_exp_female), pct_bf = c(pct_bf_male, pct_bf_female), upper_arm_girth = c(upper_arm_girth_male, upper_arm_girth_female), thigh_girth = c(thigh_girth_male, thigh_girth_female) ) dat %>%
  head()

As a sanity check, we can quickly check the mean and standard deviation of each variable to ensure the simulation appears as we intended it to.

## check means and standard deviations of the simulation
dat %>%
  group_by(gender) %>%
  summarize(across(.cols = years_exp:thigh_girth,
                   list(avg = mean, SD = sd)),
            .groups = "drop") %>%
  pivot_longer(cols = -gender)

Estimate squat performance using the model

Next, we use the values that we simulated and the model coefficients to simulate the outcome of interest (Squat performance).

# estimate squat performance
dat$squat <- with(dat, intercept + years_exp_beta*years_exp + pct_bf_beta*pct_bf + upper_arm_girth_beta*upper_arm_girth + thigh_girth_beta*thigh_girth + see) dat %>%
  head()

## summary statistics
dat %>%
  group_by(gender) %>%
  summarize(avg_squat = mean(squat), 
            sd_squat = sd(squat))

# plots
hist(dat$squat)

dat %>%
  ggplot(aes(x = squat, fill = gender)) +
  geom_density(alpha = 0.5)

Look at the regression model

Now we fit a regression model with our simulated data.

fit_lm <- lm(squat ~ years_exp + pct_bf + upper_arm_girth + thigh_girth, data = dat)
summary(fit_lm)

  • The coefficients are similar to those in the paper (which makes sense because we coded them this way.
  • The standard error (Residual standard error) is 20.3, as expected.
  • The r-squared for the model is 0.81, close to what was observed in the paper.

 

Making a better simulation

One thing to notice is the mean and standard deviation of the squat in our simulation is a bit less compared to what is reported for the population in the paper.

Our estimates are a little low relative to what was reported in the paper. The model above still works because we constructed it with the regression coefficients and variable parameters in the model, so we can still play with the data and learn something. But, the estimated squats might be a little low because the anthropometric variables in the model (BF%, Upper Arm Girth, and Thigh Girth) are in some way going to be correlated with each other. So, we could make this simulation more informative by simulating those variables from a multivariate normal distribution, as we did in Part 3.

To start, we load the mvtnorm package and set up a vector of mean values for each variable. We will construct a vector for males and females separately. We will use the mean values for each variable reported in the paper.

library(mvtnorm)

## Order of means is: BF%, Upper Arm Girth, Thigh Girth
male_means <- c(11.1, 35.6, 61.1)
female_means <- c(21.7, 29.5, 56.1)

Next, we need to construct a correlation matrix between these three variables. Again, we will create a correlation matrix for males and females, separately. I’m not sure of the exact correlation between these variables, so I’ll just estimate what I believe it to be. For example, the correlations probably aren’t 0.99 but they also probably aren’t below 0.6. I’m also unsure how these correlations might differ between the two genders. Thus, I’ll keep the same correlation matrix for both. To keep it simple, I’ll set the correlation between BF% and upper arm and thigh girth at 0.85 and the correlation between upper arm girth and thigh girth to be 0.9. In theory, we could consult some scientific literature on these things and attempt to construct more plausible correlations.

## Create correlation matrices
# males
male_r_matrix <- matrix(c(1, 0.85, 0.85,
                          0.85, 1, 0.9,
                          0.85, 0.9, 1), 
                   nrow = 3, ncol = 3,
       dimnames = list(c("bf_pct", "upper_arm_girth", "thigh_girth"),
                       c("bf_pct", "upper_arm_girth", "thigh_girth")))

male_r_matrix

# females
female_r_matrix <- matrix(c(1, 0.85, 0.85,
                          0.85, 1, 0.9,
                          0.85, 0.9, 1), 
                   nrow = 3, ncol = 3,
       dimnames = list(c("bf_pct", "upper_arm_girth", "thigh_girth"),
                       c("bf_pct", "upper_arm_girth", "thigh_girth")))

female_r_matrix

Now we will create 1000 simulations from a multivariate normal distribution for both males and females and then row bind them together into a single big data frame.

## simulate 1000 new x, y, and z variables using the mvtnorm package
set.seed(777)
male_sim <- rmvnorm(n = n_males, mean = male_means, sigma = male_r_matrix) %>%
  as.data.frame() %>%
  setNames(c("pct_bf", "upper_arm_girth", "thigh_girth"))

female_sim <- rmvnorm(n = n_females, mean = female_means, sigma = female_r_matrix) %>%
  as.data.frame() %>%
  setNames(c("pct_bf", "upper_arm_girth", "thigh_girth"))

head(male_sim)
head(female_sim)

## put the two simulated data frames together
multi_sims <- bind_rows(male_sim, female_sim) multi_sims %>%
  head()

Finally, one last thing we’ll change is our simulation of years of experience. This variable is not a normally distributed variable because it is truncated at 0. Above, in our first simulation, we attempted to solve this with the ifelse() expression to assign any simulated value less than 0 to 0. Here, I’ll just get the quantiles of the simulated years of experience so that I have an idea of a plausible upper end of experience for the population used in this paper. Then, instead of simulating from a normal distribution I’ll do a random draw from a uniform distribution from 0 to the respective upper end for the male and female groups.

Now take the newly simulated variables and create a new data frame.

## new data frame
new_dat <- data.frame( gender = c(rep("male", times = n_males), rep("female", times = n_females)), years_exp = c(years_exp_male, years_exp_female) ) %>%
  bind_cols(multi_sims)

new_dat %>%
  head()

Finally, go through the steps we did above to estimate the squat from our four variables and the beta coefficients from the paper.

# estimate squat performance
new_dat$squat <- with(new_dat, intercept + years_exp_beta*years_exp + pct_bf_beta*pct_bf + upper_arm_girth_beta*upper_arm_girth + thigh_girth_beta*thigh_girth + see) new_dat %>%
  head()

## summary statistics
new_dat %>%
  group_by(gender) %>%
  summarize(avg_squat = mean(squat), 
            sd_squat = sd(squat))

# plots
hist(new_dat$squat)

new_dat %>%
  ggplot(aes(x = squat, fill = gender)) +
  geom_density(alpha = 0.5)

 

Look at the new regression model

fit_lm_new <- lm(squat ~ years_exp + pct_bf + upper_arm_girth + thigh_girth, data = new_dat)
summary(fit_lm_new)

Compare the squat simulated data to that which was reported in the paper

new_dat %>%
  group_by(gender) %>%
  summarize(avg_squat_sim = mean(squat), 
            sd_squat_sim = sd(squat)) %>%
  mutate(avg_squat_paper = c(118.3, 196.1),
         sd_squat_paper = c(26.6, 37.9))

Now we have mean values much closer to those observed in the population of the study. Our standard deviations differ from the study because, if you recall, the standard error of the estimate was 20.3 for the regression model. The model in the study did not include gender as an independent variable. This to me is a little strange and in the discussion of the paper the authors’ also indicate that it was strange to them too. However, the approach the authors’ took to building this deemed gender unnecessary as a predictor variable. Consequently, our simulation has a similar standard deviation in estimate squat for both the male and female populations. However, we now have a data set that is relatively close to what was observed in the paper and can therefore proceed with conducting other statistical tests or building other models.

Wrapping Up

We’ve come a long way in our simulation journey, from randomly drawing single vectors of data to now building up entire data sets using regression models and simulating data from research studies. Next time we will use simulation to explore regression assumptions.

As always, all code is freely available in the Github repository.

Simulations in R Part 3: Group comparisons via simulation (simulating t-tests)

To review where we are at so far:

  • Part 1 discussed the basic functions for simulating and sampling data in R.
  • Part 2 walked us through how to perform bootstrap resampling and then simulate bivariate and multivariate distributions.

In Part 3, we are ready to put this info to use and start to simulate data and construct models. To start things off, this tutorial will focus on simulating data for comparison of group means — a t-test.

As always, all code is freely available in the Github repository.

Simulating Two Groups for Comparison

We begin by simulating two groups using the rnorm() function to make a random draw from a normal distribution. Group 1 will have a mean of 50 and a standard deviation of 10 while Group 2 will have a mean of 60 with a standard deviation of 6.25. Both groups have a sample size of 10 (so this is a small study!).

In addition to simulating the data, we will store summary statistics for each group so that we can use them later.

set.seed(1759)
grp1 <- rnorm(n = 10,
              mean = 50,
              sd = 10)

grp2 <- rnorm(n = 10,
              mean = 60,
              sd = 6.25)

## get the summary statistics for both groups
# sample size
n_grp1 <- length(grp1)
n_grp2 <- length(grp2)

# means
mu_grp1 <- mean(grp1)
mu_grp2 <- mean(grp2)

# variances
var_grp1 <- var(grp1)
var_grp2 <- var(grp2)

# standard deviation
sd_grp1 <- sd(grp1)
sd_grp2 <- sd(grp2)

Next, we calculate the t-statistic, which is the difference between the two groups means divided by the pooled standard deviation of the two groups.

## pooled SD
sd_pool <- sqrt(((n_grp1 - 1) * var_grp1 + (n_grp2 - 1) * var_grp2) / (n_grp1 + n_grp2 - 2))
sd_pool

## Compute t-statistic
t_stat <- (mu_grp1 - mu_grp2) / (sd_pool * sqrt(1/n_grp1 + 1/n_grp2))
t_stat

We can use this t-statistic to determine if the difference is significant or not at a desired alpha level, say p < 0.05, by using a t-distribution (the qt() function, which we were introduced to in Part 1 of this series). We then check our work against R’s build in t.test() function.

alpha <- 0.05
df <- n_grp1 + n_grp2 - 2 if(abs(t_stat) > qt(1 - alpha / 2, df, lower.tail = TRUE)){
  "significant difference"} else {"not a significant difference"} 

# Get p-value
p_value <- 2*pt(abs(t_stat), df, lower=FALSE)
p_value

# check work
t.test(grp1, grp2)

The group difference is barely significant at the p < 0.05 level. We calculated the t-statistic by hand and got the exact same value as that which was produced by the t.test() function.

To make this approach more streamlined, let’s create our own t-test function so that all we need to do in the future is pass it a vector of values representing each group’s data and get returned the t-statistic.

 

## t-test function
t_stat_func <- function(x, y){
  n_grp1 <- length(x)
  n_grp2 <- length(y)
  sd_pool <- sqrt(((n_grp1 - 1) * sd(x)^2 + (n_grp2 - 1) * sd(y)^2) / (n_grp1 + n_grp2 - 2))
  t_stat <- (mean(x) - mean(y)) / (sd_pool * sqrt(1/n_grp1 + 1/n_grp2))
  return(t_stat)
}

## try out the function
t_stat_func(x = grp1, y = grp2)

We only have 10 observations in each of our groups. With such a small sample size, it may be challenging to really know if the difference we observed is real or just some sort of chance/luck occurrence (we will explore sample size issues via simulation in a later blog post). What we can do is use the function we just created and build simulated distributions using the data generating process (mean and SD) of each group to run a Monte Carlo simulations and explore how often we might reject or accept the null hypothesis at an alpha level of p < 0.05.

# Let's set alpha to 0.05
alpha <- 0.05

# We will run 10,000 simulations
N_sims <- 1e4

# create an empty vector for storing t-statistics and p-valies
t_stat <- rep(NA, N_sims)
p_values <- rep(NA, N_sims)

for(i in 1:N_sims){
  
  # simulate population 1
  grp1_sim <- rnorm(n = n_grp1, mean = mu_grp1, sd = sd_grp1)
  
  # simulate group 2
  grp2_sim <- rnorm(n = n_grp2, mean = mu_grp2, sd = sd_grp2)
  
  # compute the t-statistic with our function
  t_stat[i] <- t_stat_func(grp1_sim, grp2_sim)

  # get degrees of freedom for calculating the p-value
  df <- n_grp1 + n_grp2 - 2
  
  # obtain the p-value
  p_value[i] <- 2*pt(abs(t_stat[i]), df, lower=FALSE)
}

par(mfrow = c(1, 2))
hist(abs(t_stat),
     main = "T-Statistic Distribution")
abline(v = mean(abs(t_stat)),
       col = "red",
       lwd = 3,
       lty = 2)
hist(p_value,
     main = "p-value Distribution")
abline(v = 0.05,
       col = "red",
       lwd = 3,
       lty = 2)

# What percentage of times did we reject the null?
mean(p_value < 0.05)

You will get slightly different results, since I didn’t set a seed, but you will find that we end up rejecting the null about 52-53% of the time, meaning we probably wouldn’t want to be too confident about our “statistically significant” finding.

We could have saved a lot of lines of code and instead used the replicate() function and run 10,000 simulations of the t-test between the two groups (additionally, replicate() will run faster than the for() loop).

 

t_test <- function(){
  grp1 <- rnorm(n = n_grp1, mean = mu_grp1, sd = sd_grp1)
  grp2 <- rnorm(n = n_grp2, mean = mu_grp2, sd = sd_grp2)
  t_stat_func(grp1, grp2)
}


# Instead of a for loop, we will use the replicate() function to run this 10,000 times
t_stat_vector <- replicate(n = 10000,
                           t_test())

head(t_stat_vector)
hist(abs(t_stat_vector),
     main = "T-Test Simulation using replicate()",
     xlab = "T-Statistic")

Simulating Data from Studies

Why just create fake data and play around when we can use a similar approach to simulating data to help us further explore data contained in studies we read?! Most studies do not provide full data sets but do provide necessary summary statistics (e.g., sample size, mean, standard deviation, standard errors, confidence intervals, etc.), allowing us to use the data generating processes to simulate the study.

I was reading a 2010 paper from Saenz et al., Knee isokinetic test-retest: A multicenter knee isokinetic test-retest study of a fatigue protocol (Eur J Phys Rehabil Med), where they authors were conducing a series of test-retest protocols using a biodex. The tables in the paper lay out the information required to simulate the study. For example, in Table 1, the authors’ provide the mean, median, and standard deviation for all of the extension tests that were performed. I’ll use the data provided for the test, WRepMax.

## Saenz (2010) - Knee isokinetic test-retest - simulation
## Table 1: Extension
# We will simulate the test scores provided for WrepMax

# number of subjects
N <- 90

## Get the mean and SD for the test and retest from Table 1 for WrepMax
test1_mu <- 99.3
test1_sd <- 18.34
test2_mu <- 104.44
test2_sd <- 24.90

Next, we calculate the mean difference between test 1 and test 2 and grab the standard deviation for the test, provided in Table 3 of the paper.

## Get the difference and standard deviation of the difference between test 1 and 2 from Table 3 (row 1)
diff <- test1_mu - test2_mu
sd_diff <- 17.37

First, we simulate Test 1 using the above parameters (sample size, mean, and standard deviation).

## Simulate the first test using the summary statistics from Table 1
set.seed(925)
w_rep_max_1 <- rnorm(n = N, mean = test1_mu, sd = test1_sd)

Next, we simulate Test 2. However, we have to remember consider that, because we are trying to simulate the performance of the 90 participant in the study, Test 2 has a relationship to the performance of Test 1 because the authors’ are analyzing the difference in performance between the two tests. Therefore, to create Test 2 we will use our simulation of Test 1 for each participant and include some random error, which is specific to the mean and standard deviation of the difference between Test 1 and Test 2 reported in the study.

## Simulate test 2 by taking test 1 and applying the observed difference between tests (Table 3) 
set.seed(479)
w_rep_max_2 <- w_rep_max_1 + rnorm(n = N, mean = abs(diff), sd = sd_diff)

Now that we have Test 1 and Test 2 simulated for each participant, we can calculate the summary statistics and see how well they compare to the observed data reported in the study.

## Get the mean and SD of the simulated test1 and test 2
test1_mu_sim <- mean(w_rep_max_1) 
test1_sd_sim <- sd(w_rep_max_1)

test2_mu_sim <- mean(w_rep_max_2)
test2_sd_sim <- sd(w_rep_max_2)

test1_mu_sim
test1_sd_sim

test2_mu_sim
test2_sd_sim

## Get the mean and SD of the difference between simulations
diff_mu_sim <- mean(w_rep_max_1 - w_rep_max_2)
diff_sd_sim <- sd(w_rep_max_1 - w_rep_max_2)

diff_mu_sim
diff_sd_sim

The results are nearly identical to those presented in the study (see the link, as the study is free). Now that we have a simulated data set of all of the participants we can do other things with the data, such as plot it, conduct additional reliability metrics that weren’t performed in the study, recreate the analysis in the study to get a better understanding of the analysis that was performed, or perhaps try and model the data in different ways to explore its properties and generate new ideas for future research.

For example, let’s run a paired t-test on the simulated data, as the author’s did and look at the results in comparison to simulating this study 1000 times. We re-write our t_test() function from above to re-simulate the data in the study and then conduct a paired t-test, storing the t-statistic so that we can investigate how often we would reject the null hypothesis.

## Run a paired t-test
t.test(w_rep_max_1, w_rep_max_2, paired = TRUE)

## create a function for the paired t-test and extract the t-statistic
t_test_retest <- function(){
  w_rep_max_1 <- rnorm(n = N, mean = test1_mu, sd = test1_sd)
  w_rep_max_2 <- w_rep_max_1 + rnorm(n = N, mean = abs(diff), sd = sd_diff)
  t.test(w_rep_max_1, w_rep_max_2, paired = TRUE)$statistic
}

# Instead of a for loop, we will use the replicate() function to run this 10,000 times
t_stat_test_retest <- replicate(n = 10000,
                           t_test_retest())

hist(t_stat_test_retest)
mean(t_stat_test_retest)

# turn the t-statistics into p-values
p <- 2*pt(abs(t_stat_test_retest), df = 90 - 2, lower=FALSE)

# histogram of p-values
hist(p,
     main = "p-values of all simulated data\nreject the null ~79% of the time")

# what percentage of times did we reject the null at the p < 0.05 level?
mean(p < 0.05)

We reject the result approximately 79% of the time at the alpha level of p < 0.05.

Wrapping Up

This tutorial worked through using simulation to understand the difference in group means, as we would commonly do with a t-test. Next, we progress on to simulation linear regression models.

As always, all code is freely available in the Github repository.

Simulations in R Part 2: Bootstrapping & Simulating Bivariate and Multivariate Distributions

In Part 1 of this series we covered some of the basic functions that we will need to perform resampling and simulations in R.

In Part 2 we will now move to building bootstrap resamples and simulating distributions for both bivariate and multivariate relationships. Some stuff we will cover:

  • Coding bootstrap resampling by hand for both a single variable (mean) and regression coefficients.
  • Using the boot() function from the boot package to perform bootstrapping for both a single variable and regression coefficients.
  • Creating a simulation where two variables are in some way dependent on each other (correlate).
  • Creating a simulation where multiple variables are correlated with each other.
  • Finally, to understand what is going on with these simulated distributions we will also work through code that shows us the relationship between variables using both covariance and correlation matrices.

As always, all code is freely available in the Github repository.

Resampling

The resampling approach we will use here is bootstrapping. The general concept of bootstrapping is as follows:

  • Draw multiple random samples from observed data with replacement.
  • Draws must be independent and each observation must have an equal chance of being selected.
  • The bootstrap sample should be the same size as the observed data in order to use as much information from the sample as possible.
  • Calculate the mean resampled data and store it.
  • Repeat this process thousands of times and summarize the mean of resampled means and the standard deviation of resampled means to obtain summary statistics of your bootstrapped resamples.

Write the bootstrap resampling by hand

The code below goes through the process of creating some fake data and then writing a for() loop that produces 1000 bootstrap resamples. The for() loop was introduced in Part 1 of this series. In a nutshell, we are taking a random sample from the fake data (with replacement), calculating the mean of that random sample, and storing it in the element boot_dat. From there, we calculate the summary statistics or the original sample and the bootstrap resample, which we store in a data frame for comparison purposes, and then produce a histogram of the original sample and the bootstrap resample, which are visualized below the code.

library(tidyverse)

## create fake data
dat <- c(5, 10, 44, 3, 29, 10, 16.7, 22.3, 28, 1.4, 25)

### Bootstrap Resamples ###
# we want 1000 bootstrap resamples
n_boots <- 1000

## create an empty vector to store our bootstraps
boot_dat <- rep(NA, n_boots)

# set seed for reproducibility
set.seed(786)

# write for() loop for the resampling
for(i in 1:n_boots){
  # random sample of 1:n number of observations in our data, with replacement
  ind <- sample(1:length(dat), replace = TRUE)
  
  # Use the row indexes to select the given values from the vector and calculate the mean
  boot_dat[i] <- mean(dat[ind])
}

# Look at the first 6 bootstrapped means
head(boot_dat)

### Compare Bootstrap data to original data ###
## mean and standard deviation of the fake data
dat_mean <- mean(dat)
dat_sd <- sd(dat)

# standard error of the mean
dat_se <- sd(dat) / sqrt(length(dat))

# 95% confidence interval
dat_ci95 <- paste0(round(dat_mean - 1.96*dat_se, 1), ", ", round(dat_mean + 1.96*dat_se, 1))

# mean an SD of bootstrapped data
boot_mean <- mean(boot_dat)

# the vector is the mean of each bootstrap sample, so the standard deviation of these means represents the standard error
boot_se <- sd(boot_dat)

# to get the standard deviation we can convert the standard error back
boot_sd <- boot_se * sqrt(length(dat))

# 95% quantile interval
boot_ci95 <- paste0(round(boot_mean - 1.96*boot_se, 1), ", ", round(boot_mean + 1.96*boot_se, 1)) ## Put everything together data.frame(data = c("fake sample", "bootstrapped resamples"), N = c(length(dat), length(boot_dat)), mean = c(dat_mean, boot_mean), sd = c(dat_sd, boot_sd), se = c(dat_se, boot_se), ci95 = c(dat_ci95, boot_ci95)) %&gt;%
  knitr::kable()

# plot the distributions
par(mfrow = c(1, 2))
hist(dat,
     xlab = "Obsevations",
     main = "Fake Data")
abline(v = dat_mean,
       col = "red",
       lwd = 3,
       lty = 2)
hist(boot_dat,
     xlab = "bootstrapped means",
     main = "1000 bootstrap resamples")
abline(v = boot_mean,
       col = "red",
       lwd = 3,
       lty = 2)

R offers a bootstrap function from the boot package that allows you to do the same thing without writing out your own for() loop. Below is an example of coding the same procedure in the boot package and the outputs the function provides, which are similar to the output we get above, save for slight differences due to random sampling.

# write a function to calculate the mean of our sample data
sample_mean <- function(x, d){
     return(mean(x[d]))
}

# run the boot() function
library(boot)

# run the boot function
boot_func_output <- boot(dat, statistic = sample_mean, R = 1000)

# produce a plot of the output
plot(boot_func_output)

# get the mean and standard error
boot_func_output

# get 95% CI around the mean
boot.ci(boot_func_output, type = "basic", conf = 0.95)

We can bootstrap pretty much anything we want. We don’t have to limit ourselves to producing the distribution around the mean of a population. For example, let’s bootstrap regression coefficients to understand the uncertainty in them.

First, let’s use the boot() function to conduct our analysis. We will fit a simple linear regression predicting miles per gallon from engine weight using the mtcars package. We will then write a function for that uses a random sample of rows to create the same linear regression and store those coefficients from 1000 linear regressions so that we can plot a histogram representing the slope coefficient from the resampled models and also summarize the distribution with confidence intervals.

# load the mtcars data
d <- mtcars d %>%
  head()

# fit a regression model
fit_mpg <- lm(mpg ~ wt, data = d)
summary(fit_mpg)
coef(fit_mpg)
confint(fit_mpg)

# Write a function that can perform a bootstrap over the intercept and slope of the model
# bootstrap function
reg_coef_boot <- function(data, row_id){
  # we want to resample the rows
  fit <- lm(mpg ~ wt, data = d[row_id, ])
  coef(fit)
}

# run this once on a small subset of the row ids to see how it works
reg_coef_boot(data = d, row_id = 1:20)

# run the boot() function 1000 times
coef_boot <- boot(data = d, reg_coef_boot, 1000) 

# check the output (coefficient and SE) 
coef_boot 

# get the confidence intervals 
boot.ci(coef_boot, index= 2) 

# all 1000 of the bootstrap resamples can be called coef_boot$t %&gt;%
  head()

# plot the first 20 bootstrapped intercepts and slopes over the original data
plot(x = d$wt,
     y = d$mpg,
     pch = 19)
for(i in 1:20){
  abline(a = coef_boot$t[i, 1],
       b = coef_boot$t[i, 2],
       lty = 2,
       lwd = 3,
       col = "light grey")
}

## histogram of the slope coefficient
hist(coef_boot$t[, 2])

We can do this by hand if we don’t want to use the built in boot() function. (SIDE NOTE: I usually prefer to code my own resampling and simulations as it gives me more flexibility with respect to the things I’d like to add or the values I’d like to store from each iteration.

Below, instead of producing a histogram of just the slope coefficient, I use both the resampled intercept and slope and add 20 (of the 1000) lines to a scatter plot to show the way in which each of these lines represents a plausible regression line for the data. As you can see, the regression line confidence interval is starting to take shape, even with just 20 out of 1000 resamples, and this gives us a good understanding of not only the variability in our possible regression line fit for the underlying data but also, perhaps should make us less overconfident in our research findings knowing that there are many possible outcomes from the sample data we have obtained.

## 1000 resamples
n_samples <- 1000

## N observations
n_obs <- nrow(mtcars)

## empty storage data frame for the coefficients
coef_storage <- data.frame(
  intercept = rep(NA, n_samples),
  slope = rep(NA, n_samples)
)

for(i in 1:n_samples){
  
  ## sample dependent and independent variables
  row_ids <- sample(1:n_obs, size = n_obs, replace = TRUE)
  new_df <- d[row_ids, ]
  
  ## construct model
  model <- lm(mpg ~ wt, data = new_df)
  
  ## store coefficients
  # intercept
  coef_storage[i, 1] <- coef(model)[1]
  
  # slope
  coef_storage[i, 2] <- coef(model)[2]
  
}

## see results
head(coef_storage)
tail(coef_storage)

## Compare the results to those of the boot function
apply(X = coef_boot$t, MARGIN = 2, FUN = mean)
apply(X = coef_storage, MARGIN = 2, FUN = mean)

apply(X = coef_boot$t, MARGIN = 2, FUN = sd)
apply(X = coef_storage, MARGIN = 2, FUN = sd)

## plot first 20 lines
plot(x = d$wt,
     y = d$mpg,
     pch = 19)
for(i in 1:20){
  abline(a = coef_storage[i, 1],
       b = coef_storage[i, 2],
       lty = 2,
       lwd = 3,
       col = "light grey")
}

Simulating a relationship between two variables

As discussed in Part 1, simulation differs from resampling in that we use the parameters of the observed data to compute a new distribution, versus sampling from the data we have on hand.

For example, using the mean and standard deviation of mpg from the mtcars data set, we can simulate 1000 random draws from a normal distribution.

## load the mtcars data set
d <- mtcars

## make a random draw from the normal distribution for mph
set.seed(5)
mpg_sim <- rnorm(n = 1000, mean = mean(d$mpg), sd = sd(d$mpg))

## plot and summarize
hist(mpg_sim)

mean(mpg_sim)
sd(mpg_sim)

Frequently, we are interested in the relationship between two variables (e.g., correlation, regression, etc.). Let’s simulate two variables, x and y, which are linearly related in some way. To do this, we first simulate the variable x and then simulate y to be x plus some level of random noise.

# simulate x and y
set.seed(1098)
x <- rnorm(n = 10, mean = 50, sd = 10)
y <- x + rnorm(n = length(x), mean = 0, sd = 10)

# put the results in a data frame
dat <- data.frame(x, y)
dat

# how correlated are the two variables
cor.test(x, y)

# fit a regression for the two variables
fit <- lm(y ~ x)
summary(fit)

# plot the two variables with the regression line
plot(x, y, pch = 19)
abline(fit, col = "red", lwd = 2, lty = 2)

Simulating a data set with multiple variables

Frequently, we might have a hypothesis regarding how correlated multiple variables are with each other. The example above produced a relationship of two variables with a direct relationship between them along with some noise. We might want to specify this relationship given a correlation coefficient or covariance between them. Additionally, we might have more than two variables that we want to simulate relationships between.

To do this in R we can take advantage of two packages:

  • MASS via the mvrnorm()
  • mvtnorm via the mvrnorm()

Both packages have a function for simulating multivariate normal distributions. The primary difference is that the Sigma argument in the MASS package function, mvrnorm(), accepts a covariance matrix while the sigma argument in the mvtnorm package, rmvnorm() accepts a correlation matrix. I’ll show both examples but I tend to stick with the mtvnorm package because (at least for my brain) it is easier for me to think in terms of correlation coefficients instead of covariances.

First we simulate some data:

## create fake data
set.seed(1234)
fake_dat <- data.frame(
  group = rep(c("a", "b", "c"), each = 5),
  x = rnorm(n = 15, mean = 10, sd = 2),
  y = rnorm(n = 15, mean = 30, sd = 10),
  z = rnorm(n = 15, mean = 75, sd = 20)
)

fake_dat

Look at the correlation and variance between the three numeric variables.

# correlation
round(cor(fake_dat[, -1]), 3)

# variance
round(var(fake_dat[, -1]), 3)

We can use this information to simulate new x, y, or z variables.

Simulating x and y with the MASS package

Remember, for the MASS package, the Sigma argument is a matrix of covariances for the variables you are simulating from a multivariate normal distribution.

## get a vector of the mean for each variable
variable_means <- apply(X = fake_dat[, c("x", "y")], MARGIN = 2, FUN = mean)

## Get a matrix of the covariance between x and y
variable_sigmas <- var(fake_dat[, c("x", "y")])

## simulate 1000 new x and y variables using the MASS package
set.seed(98)
new_sim <- MASS::mvrnorm(n = 1000, mu = variable_means, Sigma = variable_sigmas)
head(new_sim)

### look at the results relative to the original x and y
## column means
variable_means
apply(X = new_sim, MARGIN = 2, FUN = mean)

## covariance
var(fake_dat[, c("x","y")])
var(new_sim)

 

Notice that the variance between x and y (the off diagonal of the matrix) is very similar between the fake data (the observed sample) and the simulated data (new_sim).

Simulating x and y with the mtvnorm package

Different than the MASS package, The rmvnorm() function from the mtvnorm package requires the sigma argument to be a correlations matrix.

Let’s repeat the above process with our fake_dat and simulate a relationship between x and y.

## get a vector of the mean for each variable
variable_means <- apply(X = fake_dat[, c("x", "y")], MARGIN = 2, FUN = mean)

## Get a matrix of the correlation between x and y
variable_cor <- cor(fake_dat[, c("x", "y")])

## simulate 1000 new x and y variables using the mvtnorm package
set.seed(98)
new_sim <- mvtnorm::rmvnorm(n = 1000, mean = variable_means, sigma = variable_cor)
head(new_sim)

### look at the results relative to the original x and y
## column means
variable_means
apply(X = new_sim, MARGIN = 2, FUN = mean)

## correlation
cor(fake_dat[, c("x","y")])
cor(new_sim)

Similar to the previous example, we notice that the off diagonal correlation coefficient between x and y is very similar when comparing the simulated data to the fake data.

So, what is happening here? Both packages produce the same result, one uses a covariance matrix and the other uses a correlation matrix. The kicker here is understanding the relationship between covariance and correlation. Covariance is explaining how two variables vary together, however, its units aren’t on a scale that is directly interpretable to us. But, we can convert the covariance between two variables to a correlation by dividing their covariance by the product of their individual standard deviations.

For example, here is the covariance matrix between x and y in the fake data set.

cov(fake_dat[, c("x", "y")])

The covariance between the two variables is on the off diagonal, 2.389. We can store this in its own element.

cov_xy <- cov(fake_dat[, c("x", "y")])[2,1]
cov_xy

Let’s store the standard deviation of both `x` and `y` in their own elements to make the equation easier to read.

sd_x <- sd(fake_dat$x)
sd_y <- sd(fake_dat$y)

Finally, we calculate the correlation by dividing the covariance by the product of the two standard deviations and check our results by calling the cor() function on the two variables.

## covariance to correlation
cov_to_cor <- cov_xy / (sd_x * sd_y)
cov_to_cor

## check results with the corr() function
cor(fake_dat[, c("x", "y")])

By dividing the covariance by the product of the standard deviation of the two variables we can see the relationship between covariance and correlation and now understand why the results from the MASS and mtvnorm produce similar results. Understanding this relationship becomes valuable when we move onto simulating more complex relationships, for example when simulating mixed models.

What about three variables?

What if we want to simulate all three variables — x, y, and z?

All we need is a larger covariance or correlation matrix, depending on which of the above packages you’d like to use. Since we usually won’t be creating these matrices from a data set, as I did above, I’ll show how to create your own matrix and run the simulation.

First, let’s store a vector of plausible mean values for x, y, and z.

## Look at the mean values we had in the fake data
apply(X = fake_dat[, c("x", "y", "z")], MARGIN = 2, FUN = mean)

## create a vector of possible mean values for the simulation
mus <- c(9, 26, 63)

## look at the correlation matrix for the three variables in the fake data
cor(fake_dat[, c("x", "y", "z")])

## Create a matrix that stores plausible correlations between the variables you want to simulate
r_matrix <- matrix(c(1, 0.14, -0.24,
                    0.14, 1, -0.35,
                    -0.24, -0.35, 1), 
                   nrow = 3, ncol = 3,
       dimnames = list(c("x", "y", "z"),
                       c("x", "y", "z")))

r_matrix

Next, we create 1000 simulations of a multivariate normal distribution between x, y, and z. We then compare our correlation coefficients between the fake data, which is our observed sample, and the simulated data

## simulate 1000 new x, y, and z variables using the mvtnorm package
set.seed(43)
new_sim <- mvtnorm::rmvnorm(n = 1000, mean = mus, sigma = r_matrix)
head(new_sim)

### look at the results relative to the original x, y, and z
## column means
apply(X = fake_dat[, c("x", "y", "z")], MARGIN = 2, FUN = mean)
apply(X = new_sim, MARGIN = 2, FUN = mean)

## correlation
cor(fake_dat[, c("x", "y", "z")])
cor(new_sim)

The results from the simulation are pretty similar to the fake_dat dataset. If you recall from the correlation matrix and the vector of means, we didn’t use exact values from the observed data, so that, along with the random draws from the multivariate normal distribution, leads to the small amount of differences.

Wrapping Up

In this second installment of the Simulations in R series we’ve walked through how to code our own bootstrap resampling for both means and regression coefficients. We then progressed to building simulations of both bivariate and multivariate normal distributions. This, along with the basic info in Part 1, will serve us well as we progress forward in our work and begin to explore using simulations for comparisons between group means (simulated t-tests) and building regression models.

As always, all of the code is available in the Github repository.

The High Performance Hockey Podcast Interview

This week, I had the great pleasure of being interviewed by my good friend and colleague Anthony Donskov for his High Performance Hockey Podcast.

Anthony has done a tremendous job for the sports science and strength and conditioning community in his teaching, writing, and podcasting. He brings a wealth of knowledge from both the applied strength coach realm all the way through to his PhD work.

In this podcast interview, Anthony and I discuss:

  • Data analysis
  • The PPDAC Framework for conducting research
  • My criticisms of applied sport science
  • The challenge of measuring hard things and things that matter in applied sport.

Check out the podcast HERE.