Category Archives: Sports Analytics

Simulations in R Part 6: Multicollinearity Assumption in Regression

For the next installment of our simulation blog series we will use simulation to look at the Multicollinearity assumption in linear regression.

For a refresher, here are the previous 5 articles in this series:

  • Part 1 discussed the basic functions for simulating and sampling data in R.
  • Part 2 walked us through how to perform bootstrap resampling and then simulate bivariate and multivariate distributions.
  • Part 3 we worked making group comparisons by simulating thousands of t-tests
  • Part 4 building simulations for linear regression
  • Part 5 using simulation to investigate the homoskedasticity assumption in regression.

The entire code for this series is accessible on my GITHUB page.

Multicollinearity

Multicollinearity occurs when two independent variables in a regression model are highly correlated with each other. Such a situation can produce problems with interpretation of the beta coefficients of the model, may increase standard errors in the model, and can lead to over fitting of the data. We can simulate this issue in order to get a better understanding of how multicollinearity can influence a regression model.

Constructing the simulation

We will use the mvnorm package to help us construct a simulation where the two independent variables share a certain level of correlation between each other.

First, we will create the true model parameters: an intercept of 2, a beta1 of 5, and a beta2 of 10. We also create a vector of correlation coefficients from 0 to 0.99 and a few data frames to store the results of our model. We will also specify that at each correlation coefficient we want 200 random draws from the multivariate normal distribution.

## load packages
library(tidymodels)
library(patchwork)
library(mvtnorm)

set.seed(999)

# create the true model parameters
intercept <- 2
beta1 <- 5
beta2 <- 10

## Number of draws from a multivariate normal distribution
n <- 200

## Create a data frame to store model results
sim_params <- data.frame(intercept = NA,
                      intercept_se = NA,
                      beta1 = NA,
                      beta1_se = NA,
                      beta2 = NA,
                      beta2_se = NA,
                      model_rse = NA)

## create levels of multicollinearity between the two independent variables
cor_coefs <- c(seq(from = 0, to = 0.9, by = 0.1), 0.99)

# data frame to store the average beta coefficient and their standard deviations form the simulation
mean_betas <- data.frame(beta1 = NA,
                       sd_beta1 = NA,
                       beta2 = NA,
                       sd_beta2 = NA)

Next, we will create a nested for() loop to construct out simulations.

  •  The outer part of the loop begins by creating the multivariate normal distribution. We use the rmvnorm(), which means we first specify a correlation matrix using our vector of correlation coefficients we specified above. Once we have the two correlated variables we can put them into the inner loop.
  • The inner loop is where we create a regression equation for the given correlation between the two variables. We create 100 regression simulations for each correlation coefficient.
  • Once the inner loop is finished running we store the results at the bottom of the outer loop. Instead of storing all of the results, we take the average of the 100 beta coefficients and their respective standard errors for each correlation coefficient.
## loop
for(j in 1:length(cor_coefs)){
  
  ## Create a correlation matrix between beta1 and beta2
  beta_corr <- matrix(c(1, cor_coefs[j], cor_coefs[j], 1), nrow = 2, ncol = 2)
  
  ## create a multivariate normal distribution 
  cor_df <- rmvnorm(n = n, mean = c(0, 0), sigma = beta_corr)
  X1 <- cor_df[, 1]
  X2 <- cor_df[, 2]
  
  ## simulate 100 regression simulations
  for(i in 1:100){
    
    # set up the model
    y_hat <- intercept + beta1*X1 + beta2*X2 + rnorm(n = n, mean = 0, sd = 1)
    
    # construct a regression equation
    model <- lm(y_hat ~ X1 + X2)
    
    # store the variance-covariance matrix
    vcv <- vcov(model)
    
    # estimates for the intercept
    sim_params[i, 1] <- model$coef[1]
  
    # estimates for the beta1
    sim_params[i, 3] <- model$coef[2]
    
    # estimates for beta2
    sim_params[i, 5] <- model$coef[3]
  
    # SE for the intercept
    sim_params[i, 2] <- sqrt(diag(vcv)[1])
    
    # SE for beta1
    sim_params[i, 4] <- sqrt(diag(vcv)[2])
    
    # SE for beta2
    sim_params[i, 6] <- sqrt(diag(vcv)[3])
  
    # model RSE
    sim_params[i, 7] <- summary(model)$sigma
  }

  mean_betas[j, ] <- c(mean(sim_params[, 3]), mean(sim_params[, 4]), mean(sim_params[, 5]), mean(sim_params[, 6]))

}

The final result is a data frame with 10 rows, one for each correlation coefficient and the average of 100 regression simulations for the beta coefficients and their standard errors.

# Add the correlation coefficients to the final data frame
mean_betas$cor_coef <- cor_coefs mean_betas %>%
  knitr::kable()

Finally, we plot each of the results with the correlation coefficients on the x-axis.

# Plot the results
par(mfrow = c(2,2))
plot(x = mean_betas$cor_coef,
     y = mean_betas$beta1,
     main = "Beta1",
     lwd = 3,
     type = "b",
     ylim = c(2, 7),
     xlab = "Correlation betwen X1 and X2",
     ylab = "Beta1")
plot(x = mean_betas$cor_coef,
     y = mean_betas$sd_beta1,
     main = "Beta1 Standard Error",
     lwd = 3,
     type = "b",
     xlab = "Correlation betwen X1 and X2",
     ylab = "SE of B1")
plot(x = mean_betas$cor_coef,
     y = mean_betas$beta2,
     main = "Beta2",
     lwd = 3,
     type = "b",
     ylim = c(7, 13),
     xlab = "Correlation betwen X1 and X2",
     ylab = "Beta 2")
plot(x = mean_betas$cor_coef,
     y = mean_betas$sd_beta2,
     main = "Beta2 Standard Error",
     lwd = 3,
     type = "b",
     xlab = "Correlation betwen X1 and X2",
     ylab = "SE of B2")

The beta coefficients themselves remain relatively unchanged in our simulation across the various correlations levels. However, once the correlation between the two independent variables reaches about 0.7 the standard errors around the beta coefficients begin to increase exponentially, increasing our uncertainty about the true parameter values.

Wrapping Up

We’ve now covered using simulation to investigate two assumptions of linear regression. Our next installment will investigate another linear regression assumption before we proceed on to simulating other types of models.

For the full code, check out my GITHUB page.

Comparing Tidymodels in R to Scikit Learn in Python

I did a previous blog providing a side-by-side comparisons of R’s {tidyverse} to Python for various data manipulation techniques. I’ve also talked extensively about using {tidymodels} for model fitting (see HERE and HERE). Today, we will work through a tutorial on how to fit the same random forest model in {tidyverse} and Scikit Learn.

This will be a side-by-side view of both coding languages.The tutorial will cover:

  • Loading the data
  • Basic exploratory data analysis
  • Creating a train/test split
  • Hyperparameter tuning by creating cross-validated folds on the training data
  • Identifying the optimal hyperparameters and fitting the final model
  • Applying the final model to the test data and evaluating model performance
  • Saving the model for downstream use
  • Loading the saved model and applying it to new data

To get the full code for each language and follow along with the tutorial visit my GITHUB page.

The Data

The data comes the tidytuesday project from 4/4/2023. The data set is Premier League match data (2021 – 2022) that provides a series of features with the goal of predicting the final result (Full Time Result, FTR) as to whether the home team won, the away team won, or the match resulted in a draw.

Load Data & Packages

First, we load the data directly from the tidytuesday website in both languages.

Exploratory Data Analysis

Next, we perform some exploratory data analysis to understand the potential features for our model.

  • Check each column for NAs
  • Plot a count of the outcome variable across the three levels (H = home team wins, A = away team wins, D = draw)
  • Select a few features for our model and then create box plots for each feature relative to the 3 levels of our outcome variable

Train/Test Split

We being the model building process by creating a train/test split of the data.

Create a Random Forest Classifier Instance

This is basically telling R and python that we want to build a random forest classifier. In {tidymodels} this is referred to as “specifying the model engine”.

Hyperparameter Tuning on Cross Validated Folds

The two random forest hyperparameters we will tune are:

  1. The number of variables randomly selected for candidate model at each split (R calls this mtry while Python calls it max_features)
  2. The number of trees to grow (R calls this trees and Python calls it n_estimators)

In {tidymodels} we will specify 5 cross validated folds on the training set, set up a recipe, which explains the model we want (predicting FTR from all of the other variables in the data), put all of this into a single workflow and then set up our tuning parameter grid.

In Scikit Learn, we set up a dictionary of parameters (NOTE: they must be stored in list format) and we will pass them into a cross validation structure that performs 5-fold cross-validation in parallel (to speed up the process). We then pass this into the GridSearchCV() function where we specify the model we are fitting (random forest), the parameter grid that we’ve specified, and how we want to compare the random forest models (scoring). Additionally, we’ll set n_jobs = -1 to allow Python to use all of the cores on our machine.

While the code looks different, we’ve essentially set up the same process in both languages.

Tune the model on the training data

We can now tune the hyperparameters by applying the cross-validated folds procedure to the training data.

Above, we indicated to Python that we wanted some parallel processing, to speed up the process. In {tidyverse} we specify parallel processing by setting up the number of cores we’d like to use on our machine. Additionally, we will want to save the results of each cross-validated iteration, so we use the control_sample() function to do this. All of these steps were specified in Python, above, so we are ready to now apply cross-validation to our training dataset and tune the hyperparameters.

Get the best parameters

Both R and Python provide numerous objects to explore the output for each of the cross-validated folds. I’ve placed some examples in the respective codes in the GITHUB page. For our purposes, we are most interested in the optimal number of variables and trees. Both coding languages found 4 and 400 to be the optimal number of variables and trees, respectively.

Fitting the Final Model

Now that we have the optimal hyperparameter values, we can refit the model. In both {tidymodels} and Scikit learn, we’ll just refit a random forest with those optimal values specified.

Variable Importance Plot

It’s helpful to see which variables were the most important contributors to the model’s predictions.

Side Note: This takes more code in python than in R. This is one of the drawbacks I’ve found with python compared to R. I can do things more efficiently and with less code in R than in python. I often find I have to work a lot harder in Scikit Learn to get model outputs and information about the model fit. It’s all in there but it is not clearly accessible (to me at least) and plotting in matplotlib is not as clean as plotting in ggplot2.

Get Model Predictions on the Test Set

Both languages offer some out of the box options for describing the model fit info. If you want more than this (which you should, because this isn’t much to go off of), then you’ll have to extract the predicted probabilities and the actual outcomes and code some additional analysis (potentially a future blog article).

Save The Model

If we want to use this model for any downstream analysis we will need to save it.

Load the Model and Make Predictions

Once we have the model saved we can load it and apply it to any new data that comes in. Here, our new data will just be a selection of rows from the original data set (we will pretend it is new).

NOTE: Python is 0 indexed while R is indexed starting at 1. So keep that in mind if selecting rows from the original data to make the same comparison in both languages.

Wrapping Up

Both {tidymodels} and Scikit Learn provide users with powerful machine learning frameworks for conducting analysis. While the code syntax differs, the general concepts are the same, so bouncing between the two languages shouldn’t be to cumbersome. Hopefully this tutorial provided a nice overview of how to conduct the same analysis in both languages, offering a bridge for those trying to learn Python from R and vice versa.

All code is available on my GITHUB page.

Tidymodels Workflowsets Tutorials

Workflowsets in {tidymodels} provide a useful way for analysts to run several models, even tune several machine learning models on their dataset, simultaneously. You can then extract model fit information, predictions, and make comparisons to identify the most effective model type and the optimized tuning parameters.

Below are links to two workflowsets tutorials I’ve written. The first link is to an older tutorial that wrote and posted in the blog 3 months ago. This tutorial builds several machine learning models on NWSL data to solve a classification task. The second link is a more recent tutorial that walks through workflowsets and model tuning for outcomes that are both binary and continuous.

Both tutorials go over model building, hyperparameter tuning, model comparisons, making predictions, and storing the final model for deployment at a later time in {tidymodels} using workflowsets.

Hopefully both tutorials provide analysts with a clear overview of how to set up workflowsets and run several models in parallel, helping to make their work more efficient.

Tutorial 1

Tutorial 2

 

Simulations in R Part 4: Simulating Regression Models

We’ve been working on building simulations in R over the past few articles. We are now to the point where we want to simulate data using regression models. To recap what we’ve done so far:

  • Part 1 discussed the basic functions for simulating and sampling data in R.
  • Part 2 walked us through how to perform bootstrap resampling and then simulate bivariate and multivariate distributions.
  • Part 3 we worked making group comparisons by simulating thousands of t-tests

In an upcoming article we will use simulation to evaluate regression assumptions. But, before we do that, it might be useful to use a regression model reported in a research paper to simulate a data set. This can be useful if you are trying to better understand the paper (you can recalculate statistics and explore the data) or if you want to work on applying a different type of model approach to build hypotheses for future research.

As always, all code is freely available in the Github repository.

I’m going to take the regression model from a paper by Ferrari et al (2022), Performance and anthropometrics of classic powerlifters: Which characteristics matter? J Strength Cond Res., which looked to predict power lifting performance in male and female lifters.

The paper used several variables to try and predict Squat, Bench, Deadlift, and Powerlifting Total in raw power lifters. To keep this simple, I’ll focus on the model for predicting squat performance, which was:


The model had a standard error of estimate (SEE) of 20.3 and an r-squared of 0.83.

Getting the necessary info from the paper

Aside from the model coefficients, above, we also need to get the mean and standard deviation for each of the variables in the model (for both male and female lifters). We will use those parameters to simulate 1000 male and female lifters.

library(tidyverse)

# set the see for reproducibility
set.seed(82)

# sample size - simulate 1000 male and 1000 female lifters
n_males <- 1000
n_females <- 1000

# set the model coefficients for squats
intercept <- -145.7
years_exp_beta <- 4.3
pct_bf_beta <- -1.7
upper_arm_girth_beta <- 6
thigh_girth_beta <- 1.9

# Standard Error of Estimate (SEE) is sampled from a normal distribution with a mean = 0 and 
# standard deviation of 20.3
see <- rnorm(n = n_males + n_females, mean = 0, sd = 20.3)

# get mean and sd for each of the model variables
years_exp_male <- rnorm(n = n_males, mean = 2.6, sd = 2.4)
years_exp_male <- ifelse(years_exp_male < 0, 0, years_exp_male)
years_exp_female <- rnorm(n = n_females, mean = 1.6, sd = 1.5)
years_exp_female <- ifelse(years_exp_female < 0, 0, years_exp_female)
pct_bf_male <- rnorm(n = n_males, mean = 11.1, sd = 3.8)
pct_bf_female <- rnorm(n = n_females, mean = 21.7, sd = 5.4)
upper_arm_girth_male <- rnorm(n = n_males, mean = 35.6, sd = 2.8)
upper_arm_girth_female <- rnorm(n = n_females, mean = 29.5, sd = 3.1)
thigh_girth_male <- rnorm(n = n_males, mean = 61.1, sd = 5.5)
thigh_girth_female <- rnorm(n = n_females, mean = 56.1, sd = 4.9)

# put the simulated data into a data frame
dat <- data.frame( gender = c(rep("male", times = n_males), rep("female", times = n_females)), years_exp = c(years_exp_male, years_exp_female), pct_bf = c(pct_bf_male, pct_bf_female), upper_arm_girth = c(upper_arm_girth_male, upper_arm_girth_female), thigh_girth = c(thigh_girth_male, thigh_girth_female) ) dat %>%
  head()

As a sanity check, we can quickly check the mean and standard deviation of each variable to ensure the simulation appears as we intended it to.

## check means and standard deviations of the simulation
dat %>%
  group_by(gender) %>%
  summarize(across(.cols = years_exp:thigh_girth,
                   list(avg = mean, SD = sd)),
            .groups = "drop") %>%
  pivot_longer(cols = -gender)

Estimate squat performance using the model

Next, we use the values that we simulated and the model coefficients to simulate the outcome of interest (Squat performance).

# estimate squat performance
dat$squat <- with(dat, intercept + years_exp_beta*years_exp + pct_bf_beta*pct_bf + upper_arm_girth_beta*upper_arm_girth + thigh_girth_beta*thigh_girth + see) dat %>%
  head()

## summary statistics
dat %>%
  group_by(gender) %>%
  summarize(avg_squat = mean(squat), 
            sd_squat = sd(squat))

# plots
hist(dat$squat)

dat %>%
  ggplot(aes(x = squat, fill = gender)) +
  geom_density(alpha = 0.5)

Look at the regression model

Now we fit a regression model with our simulated data.

fit_lm <- lm(squat ~ years_exp + pct_bf + upper_arm_girth + thigh_girth, data = dat)
summary(fit_lm)

  • The coefficients are similar to those in the paper (which makes sense because we coded them this way.
  • The standard error (Residual standard error) is 20.3, as expected.
  • The r-squared for the model is 0.81, close to what was observed in the paper.

 

Making a better simulation

One thing to notice is the mean and standard deviation of the squat in our simulation is a bit less compared to what is reported for the population in the paper.

Our estimates are a little low relative to what was reported in the paper. The model above still works because we constructed it with the regression coefficients and variable parameters in the model, so we can still play with the data and learn something. But, the estimated squats might be a little low because the anthropometric variables in the model (BF%, Upper Arm Girth, and Thigh Girth) are in some way going to be correlated with each other. So, we could make this simulation more informative by simulating those variables from a multivariate normal distribution, as we did in Part 3.

To start, we load the mvtnorm package and set up a vector of mean values for each variable. We will construct a vector for males and females separately. We will use the mean values for each variable reported in the paper.

library(mvtnorm)

## Order of means is: BF%, Upper Arm Girth, Thigh Girth
male_means <- c(11.1, 35.6, 61.1)
female_means <- c(21.7, 29.5, 56.1)

Next, we need to construct a correlation matrix between these three variables. Again, we will create a correlation matrix for males and females, separately. I’m not sure of the exact correlation between these variables, so I’ll just estimate what I believe it to be. For example, the correlations probably aren’t 0.99 but they also probably aren’t below 0.6. I’m also unsure how these correlations might differ between the two genders. Thus, I’ll keep the same correlation matrix for both. To keep it simple, I’ll set the correlation between BF% and upper arm and thigh girth at 0.85 and the correlation between upper arm girth and thigh girth to be 0.9. In theory, we could consult some scientific literature on these things and attempt to construct more plausible correlations.

## Create correlation matrices
# males
male_r_matrix <- matrix(c(1, 0.85, 0.85,
                          0.85, 1, 0.9,
                          0.85, 0.9, 1), 
                   nrow = 3, ncol = 3,
       dimnames = list(c("bf_pct", "upper_arm_girth", "thigh_girth"),
                       c("bf_pct", "upper_arm_girth", "thigh_girth")))

male_r_matrix

# females
female_r_matrix <- matrix(c(1, 0.85, 0.85,
                          0.85, 1, 0.9,
                          0.85, 0.9, 1), 
                   nrow = 3, ncol = 3,
       dimnames = list(c("bf_pct", "upper_arm_girth", "thigh_girth"),
                       c("bf_pct", "upper_arm_girth", "thigh_girth")))

female_r_matrix

Now we will create 1000 simulations from a multivariate normal distribution for both males and females and then row bind them together into a single big data frame.

## simulate 1000 new x, y, and z variables using the mvtnorm package
set.seed(777)
male_sim <- rmvnorm(n = n_males, mean = male_means, sigma = male_r_matrix) %>%
  as.data.frame() %>%
  setNames(c("pct_bf", "upper_arm_girth", "thigh_girth"))

female_sim <- rmvnorm(n = n_females, mean = female_means, sigma = female_r_matrix) %>%
  as.data.frame() %>%
  setNames(c("pct_bf", "upper_arm_girth", "thigh_girth"))

head(male_sim)
head(female_sim)

## put the two simulated data frames together
multi_sims <- bind_rows(male_sim, female_sim) multi_sims %>%
  head()

Finally, one last thing we’ll change is our simulation of years of experience. This variable is not a normally distributed variable because it is truncated at 0. Above, in our first simulation, we attempted to solve this with the ifelse() expression to assign any simulated value less than 0 to 0. Here, I’ll just get the quantiles of the simulated years of experience so that I have an idea of a plausible upper end of experience for the population used in this paper. Then, instead of simulating from a normal distribution I’ll do a random draw from a uniform distribution from 0 to the respective upper end for the male and female groups.

Now take the newly simulated variables and create a new data frame.

## new data frame
new_dat <- data.frame( gender = c(rep("male", times = n_males), rep("female", times = n_females)), years_exp = c(years_exp_male, years_exp_female) ) %>%
  bind_cols(multi_sims)

new_dat %>%
  head()

Finally, go through the steps we did above to estimate the squat from our four variables and the beta coefficients from the paper.

# estimate squat performance
new_dat$squat <- with(new_dat, intercept + years_exp_beta*years_exp + pct_bf_beta*pct_bf + upper_arm_girth_beta*upper_arm_girth + thigh_girth_beta*thigh_girth + see) new_dat %>%
  head()

## summary statistics
new_dat %>%
  group_by(gender) %>%
  summarize(avg_squat = mean(squat), 
            sd_squat = sd(squat))

# plots
hist(new_dat$squat)

new_dat %>%
  ggplot(aes(x = squat, fill = gender)) +
  geom_density(alpha = 0.5)

 

Look at the new regression model

fit_lm_new <- lm(squat ~ years_exp + pct_bf + upper_arm_girth + thigh_girth, data = new_dat)
summary(fit_lm_new)

Compare the squat simulated data to that which was reported in the paper

new_dat %>%
  group_by(gender) %>%
  summarize(avg_squat_sim = mean(squat), 
            sd_squat_sim = sd(squat)) %>%
  mutate(avg_squat_paper = c(118.3, 196.1),
         sd_squat_paper = c(26.6, 37.9))

Now we have mean values much closer to those observed in the population of the study. Our standard deviations differ from the study because, if you recall, the standard error of the estimate was 20.3 for the regression model. The model in the study did not include gender as an independent variable. This to me is a little strange and in the discussion of the paper the authors’ also indicate that it was strange to them too. However, the approach the authors’ took to building this deemed gender unnecessary as a predictor variable. Consequently, our simulation has a similar standard deviation in estimate squat for both the male and female populations. However, we now have a data set that is relatively close to what was observed in the paper and can therefore proceed with conducting other statistical tests or building other models.

Wrapping Up

We’ve come a long way in our simulation journey, from randomly drawing single vectors of data to now building up entire data sets using regression models and simulating data from research studies. Next time we will use simulation to explore regression assumptions.

As always, all code is freely available in the Github repository.

Simulations in R Part 3: Group comparisons via simulation (simulating t-tests)

To review where we are at so far:

  • Part 1 discussed the basic functions for simulating and sampling data in R.
  • Part 2 walked us through how to perform bootstrap resampling and then simulate bivariate and multivariate distributions.

In Part 3, we are ready to put this info to use and start to simulate data and construct models. To start things off, this tutorial will focus on simulating data for comparison of group means — a t-test.

As always, all code is freely available in the Github repository.

Simulating Two Groups for Comparison

We begin by simulating two groups using the rnorm() function to make a random draw from a normal distribution. Group 1 will have a mean of 50 and a standard deviation of 10 while Group 2 will have a mean of 60 with a standard deviation of 6.25. Both groups have a sample size of 10 (so this is a small study!).

In addition to simulating the data, we will store summary statistics for each group so that we can use them later.

set.seed(1759)
grp1 <- rnorm(n = 10,
              mean = 50,
              sd = 10)

grp2 <- rnorm(n = 10,
              mean = 60,
              sd = 6.25)

## get the summary statistics for both groups
# sample size
n_grp1 <- length(grp1)
n_grp2 <- length(grp2)

# means
mu_grp1 <- mean(grp1)
mu_grp2 <- mean(grp2)

# variances
var_grp1 <- var(grp1)
var_grp2 <- var(grp2)

# standard deviation
sd_grp1 <- sd(grp1)
sd_grp2 <- sd(grp2)

Next, we calculate the t-statistic, which is the difference between the two groups means divided by the pooled standard deviation of the two groups.

## pooled SD
sd_pool <- sqrt(((n_grp1 - 1) * var_grp1 + (n_grp2 - 1) * var_grp2) / (n_grp1 + n_grp2 - 2))
sd_pool

## Compute t-statistic
t_stat <- (mu_grp1 - mu_grp2) / (sd_pool * sqrt(1/n_grp1 + 1/n_grp2))
t_stat

We can use this t-statistic to determine if the difference is significant or not at a desired alpha level, say p < 0.05, by using a t-distribution (the qt() function, which we were introduced to in Part 1 of this series). We then check our work against R’s build in t.test() function.

alpha <- 0.05
df <- n_grp1 + n_grp2 - 2 if(abs(t_stat) > qt(1 - alpha / 2, df, lower.tail = TRUE)){
  "significant difference"} else {"not a significant difference"} 

# Get p-value
p_value <- 2*pt(abs(t_stat), df, lower=FALSE)
p_value

# check work
t.test(grp1, grp2)

The group difference is barely significant at the p < 0.05 level. We calculated the t-statistic by hand and got the exact same value as that which was produced by the t.test() function.

To make this approach more streamlined, let’s create our own t-test function so that all we need to do in the future is pass it a vector of values representing each group’s data and get returned the t-statistic.

 

## t-test function
t_stat_func <- function(x, y){
  n_grp1 <- length(x)
  n_grp2 <- length(y)
  sd_pool <- sqrt(((n_grp1 - 1) * sd(x)^2 + (n_grp2 - 1) * sd(y)^2) / (n_grp1 + n_grp2 - 2))
  t_stat <- (mean(x) - mean(y)) / (sd_pool * sqrt(1/n_grp1 + 1/n_grp2))
  return(t_stat)
}

## try out the function
t_stat_func(x = grp1, y = grp2)

We only have 10 observations in each of our groups. With such a small sample size, it may be challenging to really know if the difference we observed is real or just some sort of chance/luck occurrence (we will explore sample size issues via simulation in a later blog post). What we can do is use the function we just created and build simulated distributions using the data generating process (mean and SD) of each group to run a Monte Carlo simulations and explore how often we might reject or accept the null hypothesis at an alpha level of p < 0.05.

# Let's set alpha to 0.05
alpha <- 0.05

# We will run 10,000 simulations
N_sims <- 1e4

# create an empty vector for storing t-statistics and p-valies
t_stat <- rep(NA, N_sims)
p_values <- rep(NA, N_sims)

for(i in 1:N_sims){
  
  # simulate population 1
  grp1_sim <- rnorm(n = n_grp1, mean = mu_grp1, sd = sd_grp1)
  
  # simulate group 2
  grp2_sim <- rnorm(n = n_grp2, mean = mu_grp2, sd = sd_grp2)
  
  # compute the t-statistic with our function
  t_stat[i] <- t_stat_func(grp1_sim, grp2_sim)

  # get degrees of freedom for calculating the p-value
  df <- n_grp1 + n_grp2 - 2
  
  # obtain the p-value
  p_value[i] <- 2*pt(abs(t_stat[i]), df, lower=FALSE)
}

par(mfrow = c(1, 2))
hist(abs(t_stat),
     main = "T-Statistic Distribution")
abline(v = mean(abs(t_stat)),
       col = "red",
       lwd = 3,
       lty = 2)
hist(p_value,
     main = "p-value Distribution")
abline(v = 0.05,
       col = "red",
       lwd = 3,
       lty = 2)

# What percentage of times did we reject the null?
mean(p_value < 0.05)

You will get slightly different results, since I didn’t set a seed, but you will find that we end up rejecting the null about 52-53% of the time, meaning we probably wouldn’t want to be too confident about our “statistically significant” finding.

We could have saved a lot of lines of code and instead used the replicate() function and run 10,000 simulations of the t-test between the two groups (additionally, replicate() will run faster than the for() loop).

 

t_test <- function(){
  grp1 <- rnorm(n = n_grp1, mean = mu_grp1, sd = sd_grp1)
  grp2 <- rnorm(n = n_grp2, mean = mu_grp2, sd = sd_grp2)
  t_stat_func(grp1, grp2)
}


# Instead of a for loop, we will use the replicate() function to run this 10,000 times
t_stat_vector <- replicate(n = 10000,
                           t_test())

head(t_stat_vector)
hist(abs(t_stat_vector),
     main = "T-Test Simulation using replicate()",
     xlab = "T-Statistic")

Simulating Data from Studies

Why just create fake data and play around when we can use a similar approach to simulating data to help us further explore data contained in studies we read?! Most studies do not provide full data sets but do provide necessary summary statistics (e.g., sample size, mean, standard deviation, standard errors, confidence intervals, etc.), allowing us to use the data generating processes to simulate the study.

I was reading a 2010 paper from Saenz et al., Knee isokinetic test-retest: A multicenter knee isokinetic test-retest study of a fatigue protocol (Eur J Phys Rehabil Med), where they authors were conducing a series of test-retest protocols using a biodex. The tables in the paper lay out the information required to simulate the study. For example, in Table 1, the authors’ provide the mean, median, and standard deviation for all of the extension tests that were performed. I’ll use the data provided for the test, WRepMax.

## Saenz (2010) - Knee isokinetic test-retest - simulation
## Table 1: Extension
# We will simulate the test scores provided for WrepMax

# number of subjects
N <- 90

## Get the mean and SD for the test and retest from Table 1 for WrepMax
test1_mu <- 99.3
test1_sd <- 18.34
test2_mu <- 104.44
test2_sd <- 24.90

Next, we calculate the mean difference between test 1 and test 2 and grab the standard deviation for the test, provided in Table 3 of the paper.

## Get the difference and standard deviation of the difference between test 1 and 2 from Table 3 (row 1)
diff <- test1_mu - test2_mu
sd_diff <- 17.37

First, we simulate Test 1 using the above parameters (sample size, mean, and standard deviation).

## Simulate the first test using the summary statistics from Table 1
set.seed(925)
w_rep_max_1 <- rnorm(n = N, mean = test1_mu, sd = test1_sd)

Next, we simulate Test 2. However, we have to remember consider that, because we are trying to simulate the performance of the 90 participant in the study, Test 2 has a relationship to the performance of Test 1 because the authors’ are analyzing the difference in performance between the two tests. Therefore, to create Test 2 we will use our simulation of Test 1 for each participant and include some random error, which is specific to the mean and standard deviation of the difference between Test 1 and Test 2 reported in the study.

## Simulate test 2 by taking test 1 and applying the observed difference between tests (Table 3) 
set.seed(479)
w_rep_max_2 <- w_rep_max_1 + rnorm(n = N, mean = abs(diff), sd = sd_diff)

Now that we have Test 1 and Test 2 simulated for each participant, we can calculate the summary statistics and see how well they compare to the observed data reported in the study.

## Get the mean and SD of the simulated test1 and test 2
test1_mu_sim <- mean(w_rep_max_1) 
test1_sd_sim <- sd(w_rep_max_1)

test2_mu_sim <- mean(w_rep_max_2)
test2_sd_sim <- sd(w_rep_max_2)

test1_mu_sim
test1_sd_sim

test2_mu_sim
test2_sd_sim

## Get the mean and SD of the difference between simulations
diff_mu_sim <- mean(w_rep_max_1 - w_rep_max_2)
diff_sd_sim <- sd(w_rep_max_1 - w_rep_max_2)

diff_mu_sim
diff_sd_sim

The results are nearly identical to those presented in the study (see the link, as the study is free). Now that we have a simulated data set of all of the participants we can do other things with the data, such as plot it, conduct additional reliability metrics that weren’t performed in the study, recreate the analysis in the study to get a better understanding of the analysis that was performed, or perhaps try and model the data in different ways to explore its properties and generate new ideas for future research.

For example, let’s run a paired t-test on the simulated data, as the author’s did and look at the results in comparison to simulating this study 1000 times. We re-write our t_test() function from above to re-simulate the data in the study and then conduct a paired t-test, storing the t-statistic so that we can investigate how often we would reject the null hypothesis.

## Run a paired t-test
t.test(w_rep_max_1, w_rep_max_2, paired = TRUE)

## create a function for the paired t-test and extract the t-statistic
t_test_retest <- function(){
  w_rep_max_1 <- rnorm(n = N, mean = test1_mu, sd = test1_sd)
  w_rep_max_2 <- w_rep_max_1 + rnorm(n = N, mean = abs(diff), sd = sd_diff)
  t.test(w_rep_max_1, w_rep_max_2, paired = TRUE)$statistic
}

# Instead of a for loop, we will use the replicate() function to run this 10,000 times
t_stat_test_retest <- replicate(n = 10000,
                           t_test_retest())

hist(t_stat_test_retest)
mean(t_stat_test_retest)

# turn the t-statistics into p-values
p <- 2*pt(abs(t_stat_test_retest), df = 90 - 2, lower=FALSE)

# histogram of p-values
hist(p,
     main = "p-values of all simulated data\nreject the null ~79% of the time")

# what percentage of times did we reject the null at the p < 0.05 level?
mean(p < 0.05)

We reject the result approximately 79% of the time at the alpha level of p < 0.05.

Wrapping Up

This tutorial worked through using simulation to understand the difference in group means, as we would commonly do with a t-test. Next, we progress on to simulation linear regression models.

As always, all code is freely available in the Github repository.