Category Archives: Model Building in R

Simulations in R Part 6: Multicollinearity Assumption in Regression

For the next installment of our simulation blog series we will use simulation to look at the Multicollinearity assumption in linear regression.

For a refresher, here are the previous 5 articles in this series:

  • Part 1 discussed the basic functions for simulating and sampling data in R.
  • Part 2 walked us through how to perform bootstrap resampling and then simulate bivariate and multivariate distributions.
  • Part 3 we worked making group comparisons by simulating thousands of t-tests
  • Part 4 building simulations for linear regression
  • Part 5 using simulation to investigate the homoskedasticity assumption in regression.

The entire code for this series is accessible on my GITHUB page.

Multicollinearity

Multicollinearity occurs when two independent variables in a regression model are highly correlated with each other. Such a situation can produce problems with interpretation of the beta coefficients of the model, may increase standard errors in the model, and can lead to over fitting of the data. We can simulate this issue in order to get a better understanding of how multicollinearity can influence a regression model.

Constructing the simulation

We will use the mvnorm package to help us construct a simulation where the two independent variables share a certain level of correlation between each other.

First, we will create the true model parameters: an intercept of 2, a beta1 of 5, and a beta2 of 10. We also create a vector of correlation coefficients from 0 to 0.99 and a few data frames to store the results of our model. We will also specify that at each correlation coefficient we want 200 random draws from the multivariate normal distribution.

## load packages
library(tidymodels)
library(patchwork)
library(mvtnorm)

set.seed(999)

# create the true model parameters
intercept <- 2
beta1 <- 5
beta2 <- 10

## Number of draws from a multivariate normal distribution
n <- 200

## Create a data frame to store model results
sim_params <- data.frame(intercept = NA,
                      intercept_se = NA,
                      beta1 = NA,
                      beta1_se = NA,
                      beta2 = NA,
                      beta2_se = NA,
                      model_rse = NA)

## create levels of multicollinearity between the two independent variables
cor_coefs <- c(seq(from = 0, to = 0.9, by = 0.1), 0.99)

# data frame to store the average beta coefficient and their standard deviations form the simulation
mean_betas <- data.frame(beta1 = NA,
                       sd_beta1 = NA,
                       beta2 = NA,
                       sd_beta2 = NA)

Next, we will create a nested for() loop to construct out simulations.

  •  The outer part of the loop begins by creating the multivariate normal distribution. We use the rmvnorm(), which means we first specify a correlation matrix using our vector of correlation coefficients we specified above. Once we have the two correlated variables we can put them into the inner loop.
  • The inner loop is where we create a regression equation for the given correlation between the two variables. We create 100 regression simulations for each correlation coefficient.
  • Once the inner loop is finished running we store the results at the bottom of the outer loop. Instead of storing all of the results, we take the average of the 100 beta coefficients and their respective standard errors for each correlation coefficient.
## loop
for(j in 1:length(cor_coefs)){
  
  ## Create a correlation matrix between beta1 and beta2
  beta_corr <- matrix(c(1, cor_coefs[j], cor_coefs[j], 1), nrow = 2, ncol = 2)
  
  ## create a multivariate normal distribution 
  cor_df <- rmvnorm(n = n, mean = c(0, 0), sigma = beta_corr)
  X1 <- cor_df[, 1]
  X2 <- cor_df[, 2]
  
  ## simulate 100 regression simulations
  for(i in 1:100){
    
    # set up the model
    y_hat <- intercept + beta1*X1 + beta2*X2 + rnorm(n = n, mean = 0, sd = 1)
    
    # construct a regression equation
    model <- lm(y_hat ~ X1 + X2)
    
    # store the variance-covariance matrix
    vcv <- vcov(model)
    
    # estimates for the intercept
    sim_params[i, 1] <- model$coef[1]
  
    # estimates for the beta1
    sim_params[i, 3] <- model$coef[2]
    
    # estimates for beta2
    sim_params[i, 5] <- model$coef[3]
  
    # SE for the intercept
    sim_params[i, 2] <- sqrt(diag(vcv)[1])
    
    # SE for beta1
    sim_params[i, 4] <- sqrt(diag(vcv)[2])
    
    # SE for beta2
    sim_params[i, 6] <- sqrt(diag(vcv)[3])
  
    # model RSE
    sim_params[i, 7] <- summary(model)$sigma
  }

  mean_betas[j, ] <- c(mean(sim_params[, 3]), mean(sim_params[, 4]), mean(sim_params[, 5]), mean(sim_params[, 6]))

}

The final result is a data frame with 10 rows, one for each correlation coefficient and the average of 100 regression simulations for the beta coefficients and their standard errors.

# Add the correlation coefficients to the final data frame
mean_betas$cor_coef <- cor_coefs mean_betas %>%
  knitr::kable()

Finally, we plot each of the results with the correlation coefficients on the x-axis.

# Plot the results
par(mfrow = c(2,2))
plot(x = mean_betas$cor_coef,
     y = mean_betas$beta1,
     main = "Beta1",
     lwd = 3,
     type = "b",
     ylim = c(2, 7),
     xlab = "Correlation betwen X1 and X2",
     ylab = "Beta1")
plot(x = mean_betas$cor_coef,
     y = mean_betas$sd_beta1,
     main = "Beta1 Standard Error",
     lwd = 3,
     type = "b",
     xlab = "Correlation betwen X1 and X2",
     ylab = "SE of B1")
plot(x = mean_betas$cor_coef,
     y = mean_betas$beta2,
     main = "Beta2",
     lwd = 3,
     type = "b",
     ylim = c(7, 13),
     xlab = "Correlation betwen X1 and X2",
     ylab = "Beta 2")
plot(x = mean_betas$cor_coef,
     y = mean_betas$sd_beta2,
     main = "Beta2 Standard Error",
     lwd = 3,
     type = "b",
     xlab = "Correlation betwen X1 and X2",
     ylab = "SE of B2")

The beta coefficients themselves remain relatively unchanged in our simulation across the various correlations levels. However, once the correlation between the two independent variables reaches about 0.7 the standard errors around the beta coefficients begin to increase exponentially, increasing our uncertainty about the true parameter values.

Wrapping Up

We’ve now covered using simulation to investigate two assumptions of linear regression. Our next installment will investigate another linear regression assumption before we proceed on to simulating other types of models.

For the full code, check out my GITHUB page.

Comparing Tidymodels in R to Scikit Learn in Python

I did a previous blog providing a side-by-side comparisons of R’s {tidyverse} to Python for various data manipulation techniques. I’ve also talked extensively about using {tidymodels} for model fitting (see HERE and HERE). Today, we will work through a tutorial on how to fit the same random forest model in {tidyverse} and Scikit Learn.

This will be a side-by-side view of both coding languages.The tutorial will cover:

  • Loading the data
  • Basic exploratory data analysis
  • Creating a train/test split
  • Hyperparameter tuning by creating cross-validated folds on the training data
  • Identifying the optimal hyperparameters and fitting the final model
  • Applying the final model to the test data and evaluating model performance
  • Saving the model for downstream use
  • Loading the saved model and applying it to new data

To get the full code for each language and follow along with the tutorial visit my GITHUB page.

The Data

The data comes the tidytuesday project from 4/4/2023. The data set is Premier League match data (2021 – 2022) that provides a series of features with the goal of predicting the final result (Full Time Result, FTR) as to whether the home team won, the away team won, or the match resulted in a draw.

Load Data & Packages

First, we load the data directly from the tidytuesday website in both languages.

Exploratory Data Analysis

Next, we perform some exploratory data analysis to understand the potential features for our model.

  • Check each column for NAs
  • Plot a count of the outcome variable across the three levels (H = home team wins, A = away team wins, D = draw)
  • Select a few features for our model and then create box plots for each feature relative to the 3 levels of our outcome variable

Train/Test Split

We being the model building process by creating a train/test split of the data.

Create a Random Forest Classifier Instance

This is basically telling R and python that we want to build a random forest classifier. In {tidymodels} this is referred to as “specifying the model engine”.

Hyperparameter Tuning on Cross Validated Folds

The two random forest hyperparameters we will tune are:

  1. The number of variables randomly selected for candidate model at each split (R calls this mtry while Python calls it max_features)
  2. The number of trees to grow (R calls this trees and Python calls it n_estimators)

In {tidymodels} we will specify 5 cross validated folds on the training set, set up a recipe, which explains the model we want (predicting FTR from all of the other variables in the data), put all of this into a single workflow and then set up our tuning parameter grid.

In Scikit Learn, we set up a dictionary of parameters (NOTE: they must be stored in list format) and we will pass them into a cross validation structure that performs 5-fold cross-validation in parallel (to speed up the process). We then pass this into the GridSearchCV() function where we specify the model we are fitting (random forest), the parameter grid that we’ve specified, and how we want to compare the random forest models (scoring). Additionally, we’ll set n_jobs = -1 to allow Python to use all of the cores on our machine.

While the code looks different, we’ve essentially set up the same process in both languages.

Tune the model on the training data

We can now tune the hyperparameters by applying the cross-validated folds procedure to the training data.

Above, we indicated to Python that we wanted some parallel processing, to speed up the process. In {tidyverse} we specify parallel processing by setting up the number of cores we’d like to use on our machine. Additionally, we will want to save the results of each cross-validated iteration, so we use the control_sample() function to do this. All of these steps were specified in Python, above, so we are ready to now apply cross-validation to our training dataset and tune the hyperparameters.

Get the best parameters

Both R and Python provide numerous objects to explore the output for each of the cross-validated folds. I’ve placed some examples in the respective codes in the GITHUB page. For our purposes, we are most interested in the optimal number of variables and trees. Both coding languages found 4 and 400 to be the optimal number of variables and trees, respectively.

Fitting the Final Model

Now that we have the optimal hyperparameter values, we can refit the model. In both {tidymodels} and Scikit learn, we’ll just refit a random forest with those optimal values specified.

Variable Importance Plot

It’s helpful to see which variables were the most important contributors to the model’s predictions.

Side Note: This takes more code in python than in R. This is one of the drawbacks I’ve found with python compared to R. I can do things more efficiently and with less code in R than in python. I often find I have to work a lot harder in Scikit Learn to get model outputs and information about the model fit. It’s all in there but it is not clearly accessible (to me at least) and plotting in matplotlib is not as clean as plotting in ggplot2.

Get Model Predictions on the Test Set

Both languages offer some out of the box options for describing the model fit info. If you want more than this (which you should, because this isn’t much to go off of), then you’ll have to extract the predicted probabilities and the actual outcomes and code some additional analysis (potentially a future blog article).

Save The Model

If we want to use this model for any downstream analysis we will need to save it.

Load the Model and Make Predictions

Once we have the model saved we can load it and apply it to any new data that comes in. Here, our new data will just be a selection of rows from the original data set (we will pretend it is new).

NOTE: Python is 0 indexed while R is indexed starting at 1. So keep that in mind if selecting rows from the original data to make the same comparison in both languages.

Wrapping Up

Both {tidymodels} and Scikit Learn provide users with powerful machine learning frameworks for conducting analysis. While the code syntax differs, the general concepts are the same, so bouncing between the two languages shouldn’t be to cumbersome. Hopefully this tutorial provided a nice overview of how to conduct the same analysis in both languages, offering a bridge for those trying to learn Python from R and vice versa.

All code is available on my GITHUB page.

Simulations in R Part 5: Homoskedasticity Assumption in Regression

We’ve worked through a number of tutorials on building simulations and in Part 4 we worked up to building simulations for linear regression. Here are the previous 4 parts:

  • Part 1 discussed the basic functions for simulating and sampling data in R.
  • Part 2 walked us through how to perform bootstrap resampling and then simulate bivariate and multivariate distributions.
  • Part 3 we worked making group comparisons by simulating thousands of t-tests
  • Part 4 building simulations for linear regression

There are a number of assumptions that underpin linear regression models. Simulation can be a useful way of exploring these assumptions and understanding how violating these assumptions can lead to bias, large variance in the regression coefficients, and/or poor predictions.

Some typical assumptions include:

  1. Homoskedasticity
  2. Multicollinearity of independent variables
  3. Measurement Error
  4. Serial correlation

Today, we will explore the assumption of homoskedasticity.

As always, all code is freely available in the Github repository.

Creating the baseline simulation

Before exploring how violations of the homoskedasticity assumption influence a regression model, we need a baseline model to compare it against. So, we will begin by simulating a simple linear regression with 1 predictor. Our model will look like this:

y = 2 + 5*x + e

Where e will be random error from a normal distribution with a mean of 0 and standard deviation of 1.

The code below should look familiar as we’ve been building up simulations like this in the previous 4 tutorials. We specify the intercept to be 2 and the slope to be 5. The independent variable, x, is drawn from a uniform distribution between -1 and 1. With each of the 500 iterations of the for() loop we store the simulated intercept, slope, and their corresponding standard errors, which we calculate using the variance-covariance matrix (which we discussed in the previous tutorial). Finally, we also store the residual standard error (RSE) of each of the simulated models.

library(tidymodels)
library(patchwork)

## set seed for reproducibility
set.seed(58968)

## create a data frame to store intercept values, slope values, their standard errors, and the model residual standard error, for each simulation
sim_params <- data.frame(intercept = NA,
                      slope = NA,
                      intercept_se = NA,
                      slope_se = NA,
                      model_rse = NA)

# true intercept value
intercept <- 2

# true slope value
slope <- 5

## Number of simulations to run
n <- 500

# random draw from a uniform distribution to simulate the independent variable
X <- runif(n = n, min = -1, max = 1)

## loop for regression model
for(i in 1:n){
  
  # create dependent variable, Y
  Y <- intercept + slope*X + rnorm(n = n, mean = 0, sd = 1)
  
  # build model
  model <- lm(Y ~ X)
  
  ## store predictions
  fitted_vals <- model$fitted.values

  ## store residuals
  # output_df[i, 2] &lt;- model$residuals
  
  # variance-covariance matrix for the model
  vcv <- vcov(model)
  
  # estimates for the intercept
  sim_params[i, 1] <- model$coef[1]
  
  # estimates for the slope
  sim_params[i, 2] <- model$coef[2]
  
  # SE for the intercept
  sim_params[i, 3] <- sqrt(diag(vcv)[1])
  
  # SE for the slope
  sim_params[i, 4] <- sqrt(diag(vcv)[2])
  
  # model RSE
  sim_params[i, 5] <- summary(model)$sigma
  
}

head(sim_params)

Now we summarize the data to see if we have values close to the specified model parameters.

sim_params %>%
  summarize(across(.cols = everything(),
                   ~mean(.x)))

The average intercept and slope of the 500 simulated models are pretty much identical to the specified intercept and slope of 2 and 5, respectively.

The final model of the 500 iterations is also stored from our for loop and we can look directly at it and create plots of the model fit.

# model summary
summary(model)

# model fit plots
par(mfrow = c(2,2))
plot(model)

We can also create a function that lets us evaluate how often the 95% confidence interval of our simulated beta coefficients cover the true beta coefficients that we specified for the simulation. From there, we can get a coverage probability and a 95% probability coverage interval.

 

### Create a coverage probability function
coverage_interval95 <- function(beta_coef, se_beta_coef, true_beta_val, model_df){
  
  level95 <- 1 - (1 - 0.95) / 2
  
  # lower 95
  lower95 <- beta_coef - qt(level95, df = model_df)*se_beta_coef
  
  # upper 95
  upper95 <- beta_coef + qt(level95, df = model_df)*se_beta_coef
  
  # what rate did we cover the true value (hits and misses)
  hits <- ifelse(true_beta_val >= lower95 &amp; true_beta_val <= upper95, 1, 0)
  prob_cover <- mean(hits)
  
  # create the probability coverage intervals
  low_coverage_interval <- prob_cover - 1.96 * sqrt((prob_cover * (1 - prob_cover)) / length(beta_coef))
  
  upper_coverage_interval <- prob_cover + 1.96 * sqrt((prob_cover * (1 - prob_cover)) / length(beta_coef))
  
  # results in a list
  return(list('Probability of Covering the True Value' = prob_cover,
              '95% Prob ability Coverage Intervals' = c(low_coverage_interval, upper_coverage_interval)))
  
}

Let’s apply the function to the intercept.

coverage_interval95(beta_coef = sim_params$intercept,
                    se_beta_coef = sim_params$intercept_se,
                    true_beta = intercept,
                    model_df = model$df.residual)

Now apply the function to the slope.

coverage_interval95(beta_coef = sim_params$slope,
                    se_beta_coef = sim_params$slope_se,
                    true_beta = slope,
                    model_df = model$df.residual)

In both cases we are covering the true betas around 95% of the time, with relatively small intervals.

Homoskedasticity

Linear models make an assumption that the variance of the residuals remain constant across the predicted values (homoskedastic). We can see what this looks like by plotting the fitted values relative to the residuals, which was the first plot in the model check plots we created for the 500th simulation above. We can see that the residuals exhibit relatively the same amount of variance across the fitted values.

plot(model, which = 1)

Let’s simulate a model with heteroskedastic residuals and see what it looks like. We will keep the same intercept and slope parameters as above. The only thing will we do is add an exponential parameter to the error term  of the model to create a heteroskedastic outcome in the residuals.

## parameter for heteroskedasticity 
heteroskedasticity_param <- 2

## set seed for reproducibility
set.seed(22)

## data frame for results
heteroskedastic_sim_params <- data.frame(intercept = NA,
                      slope = NA,
                      intercept_se = NA,
                      slope_se = NA,
                      model_rse = NA)

## for loop
for(i in 1:n ){
  
  # the error variance of Y is a function of X plus some random noise
  Y <- intercept + slope*X + rnorm(n = n, mean = 0, sd = exp(X*heteroskedasticity_param))
  
  # model
  heteroskedastic_model <- lm(Y ~ X)
  
  
  # variance-covariance matrix
  vcv <- vcov(heteroskedastic_model)
  
  # estimates for the intercept
  heteroskedastic_sim_params[i, 1] <- heteroskedastic_model$coef[1]
  
  # estimates for the slope
  heteroskedastic_sim_params[i, 2] <- heteroskedastic_model$coef[2]
  
  # SE for the intercept
  heteroskedastic_sim_params[i, 3] <- sqrt(diag(vcv)[1])
  
  # SE for the slope
  heteroskedastic_sim_params[i, 4] <- sqrt(diag(vcv)[2])
  
  # model RSE
  heteroskedastic_sim_params[i, 5] <- summary(heteroskedastic_model)$sigma
  
}

head(heteroskedastic_sim_params)


plot(X, Y, pch = 19)

The relationship between X and Y certainly looks weird given how it starts very tightly on the left side and then fans out on the right side.

Let’s take the average across all 500 simulations for each coefficient and their corresponding standard errors.

heteroskedastic_sim_params %>%
  summarize(across(.cols = everything(),
                   ~mean(.x)))

The coefficients of 2.0 for the intercept and 5 for the slope are exactly what we set them as for the simulation. However, notice how much larger the standard errors are for the intercept and slope compared to the original model above. Additionally, notice that the model residual standard error has increased substantially compared to the previous model.

Let’s get the 500th model again and check out the fitted vs residual plot.

# fitted vs residuals
plot(heteroskedastic_model, which = 1)

That looks like a large amount of heteroskedasticity as the residual variance is no longer homogenous across the range of fitted values. Notice the large fanning out towards the right side of the plot. As the predictions get larger so two does the variability in residuals, which we noticed when we plotted Y and X above.

What we’ve learned is that the estimate of intercept and slope is unbiased for both the heteroskedastic and homoskedastic models, as they both are centered on the parameters that we specified for the simulation (intercept = 2, slope = 5). However, the heteroskedastic model creates greater variance in our coefficients. We can visualize how much uncertainty there is under the heteroskedastic model relative to the homoskedastic model by visualizing the density of the coefficient estimates from our two model simulations.

plt_intercept <- sim_params %>%
  mutate(model = 'homoskedastic model') %>%
  bind_rows(
    heteroskedastic_sim_params %>%
      mutate(model = 'heteroskedastic model')
  ) %>%
  ggplot(aes(x = intercept, fill = model)) +
  geom_density(alpha = 0.6) +
  theme_classic() +
  theme(legend.position = "top")

plt_slope <- sim_params %>%
  mutate(model = 'homoskedastic model') %>%
  bind_rows(
    heteroskedastic_sim_params %>%
      mutate(model = 'heteroskedastic model')
  ) %>%
  ggplot(aes(x = slope, fill = model)) +
  geom_density(alpha = 0.6) +
  theme_classic() +
  theme(legend.position = "none")

plt_intercept | plt_slope

Finally, let’s see how often the 95% coverage interval is covering the true intercept and slope in the heteroskedastic model.

coverage_interval95(beta_coef = heteroskedastic_sim_params$intercept,
                    se_beta_coef = heteroskedastic_sim_params$intercept_se,
                    true_beta = intercept,
                    model_df = model$df.residual)


coverage_interval95(beta_coef = heteroskedastic_sim_params$slope,
                    se_beta_coef = heteroskedastic_sim_params$slope_se,
                    true_beta = slope,
                    model_df = model$df.residual)

Notice that we are no longer covering the true model values at the 95% level.

Wrapping Up

Simulations can be useful for evaluating model assumptions and seeing how a model may be have under different circumstances. In this tutorial we saw that a model suffering from severe heteroskedasticity is still able to return the true values for the intercept and slope. However, the variance around those values is large, meaning that we have a lot of uncertainty about what those true parameters actually are. In upcoming tutorials we will explore other linear regression assumptions. If this is a topic you enjoy, a book that I recommend (which heavily influenced today’s tutorial) is Monte Carlo Simulation and Resampling Methods for Social Sciences by Carsey & Harden.

As always, all code is freely available in the Github repository.

Tidymodels Workflowsets Tutorials

Workflowsets in {tidymodels} provide a useful way for analysts to run several models, even tune several machine learning models on their dataset, simultaneously. You can then extract model fit information, predictions, and make comparisons to identify the most effective model type and the optimized tuning parameters.

Below are links to two workflowsets tutorials I’ve written. The first link is to an older tutorial that wrote and posted in the blog 3 months ago. This tutorial builds several machine learning models on NWSL data to solve a classification task. The second link is a more recent tutorial that walks through workflowsets and model tuning for outcomes that are both binary and continuous.

Both tutorials go over model building, hyperparameter tuning, model comparisons, making predictions, and storing the final model for deployment at a later time in {tidymodels} using workflowsets.

Hopefully both tutorials provide analysts with a clear overview of how to set up workflowsets and run several models in parallel, helping to make their work more efficient.

Tutorial 1

Tutorial 2

 

Simulations in R Part 4: Simulating Regression Models

We’ve been working on building simulations in R over the past few articles. We are now to the point where we want to simulate data using regression models. To recap what we’ve done so far:

  • Part 1 discussed the basic functions for simulating and sampling data in R.
  • Part 2 walked us through how to perform bootstrap resampling and then simulate bivariate and multivariate distributions.
  • Part 3 we worked making group comparisons by simulating thousands of t-tests

In an upcoming article we will use simulation to evaluate regression assumptions. But, before we do that, it might be useful to use a regression model reported in a research paper to simulate a data set. This can be useful if you are trying to better understand the paper (you can recalculate statistics and explore the data) or if you want to work on applying a different type of model approach to build hypotheses for future research.

As always, all code is freely available in the Github repository.

I’m going to take the regression model from a paper by Ferrari et al (2022), Performance and anthropometrics of classic powerlifters: Which characteristics matter? J Strength Cond Res., which looked to predict power lifting performance in male and female lifters.

The paper used several variables to try and predict Squat, Bench, Deadlift, and Powerlifting Total in raw power lifters. To keep this simple, I’ll focus on the model for predicting squat performance, which was:


The model had a standard error of estimate (SEE) of 20.3 and an r-squared of 0.83.

Getting the necessary info from the paper

Aside from the model coefficients, above, we also need to get the mean and standard deviation for each of the variables in the model (for both male and female lifters). We will use those parameters to simulate 1000 male and female lifters.

library(tidyverse)

# set the see for reproducibility
set.seed(82)

# sample size - simulate 1000 male and 1000 female lifters
n_males <- 1000
n_females <- 1000

# set the model coefficients for squats
intercept <- -145.7
years_exp_beta <- 4.3
pct_bf_beta <- -1.7
upper_arm_girth_beta <- 6
thigh_girth_beta <- 1.9

# Standard Error of Estimate (SEE) is sampled from a normal distribution with a mean = 0 and 
# standard deviation of 20.3
see <- rnorm(n = n_males + n_females, mean = 0, sd = 20.3)

# get mean and sd for each of the model variables
years_exp_male <- rnorm(n = n_males, mean = 2.6, sd = 2.4)
years_exp_male <- ifelse(years_exp_male < 0, 0, years_exp_male)
years_exp_female <- rnorm(n = n_females, mean = 1.6, sd = 1.5)
years_exp_female <- ifelse(years_exp_female < 0, 0, years_exp_female)
pct_bf_male <- rnorm(n = n_males, mean = 11.1, sd = 3.8)
pct_bf_female <- rnorm(n = n_females, mean = 21.7, sd = 5.4)
upper_arm_girth_male <- rnorm(n = n_males, mean = 35.6, sd = 2.8)
upper_arm_girth_female <- rnorm(n = n_females, mean = 29.5, sd = 3.1)
thigh_girth_male <- rnorm(n = n_males, mean = 61.1, sd = 5.5)
thigh_girth_female <- rnorm(n = n_females, mean = 56.1, sd = 4.9)

# put the simulated data into a data frame
dat <- data.frame( gender = c(rep("male", times = n_males), rep("female", times = n_females)), years_exp = c(years_exp_male, years_exp_female), pct_bf = c(pct_bf_male, pct_bf_female), upper_arm_girth = c(upper_arm_girth_male, upper_arm_girth_female), thigh_girth = c(thigh_girth_male, thigh_girth_female) ) dat %>%
  head()

As a sanity check, we can quickly check the mean and standard deviation of each variable to ensure the simulation appears as we intended it to.

## check means and standard deviations of the simulation
dat %>%
  group_by(gender) %>%
  summarize(across(.cols = years_exp:thigh_girth,
                   list(avg = mean, SD = sd)),
            .groups = "drop") %>%
  pivot_longer(cols = -gender)

Estimate squat performance using the model

Next, we use the values that we simulated and the model coefficients to simulate the outcome of interest (Squat performance).

# estimate squat performance
dat$squat <- with(dat, intercept + years_exp_beta*years_exp + pct_bf_beta*pct_bf + upper_arm_girth_beta*upper_arm_girth + thigh_girth_beta*thigh_girth + see) dat %>%
  head()

## summary statistics
dat %>%
  group_by(gender) %>%
  summarize(avg_squat = mean(squat), 
            sd_squat = sd(squat))

# plots
hist(dat$squat)

dat %>%
  ggplot(aes(x = squat, fill = gender)) +
  geom_density(alpha = 0.5)

Look at the regression model

Now we fit a regression model with our simulated data.

fit_lm <- lm(squat ~ years_exp + pct_bf + upper_arm_girth + thigh_girth, data = dat)
summary(fit_lm)

  • The coefficients are similar to those in the paper (which makes sense because we coded them this way.
  • The standard error (Residual standard error) is 20.3, as expected.
  • The r-squared for the model is 0.81, close to what was observed in the paper.

 

Making a better simulation

One thing to notice is the mean and standard deviation of the squat in our simulation is a bit less compared to what is reported for the population in the paper.

Our estimates are a little low relative to what was reported in the paper. The model above still works because we constructed it with the regression coefficients and variable parameters in the model, so we can still play with the data and learn something. But, the estimated squats might be a little low because the anthropometric variables in the model (BF%, Upper Arm Girth, and Thigh Girth) are in some way going to be correlated with each other. So, we could make this simulation more informative by simulating those variables from a multivariate normal distribution, as we did in Part 3.

To start, we load the mvtnorm package and set up a vector of mean values for each variable. We will construct a vector for males and females separately. We will use the mean values for each variable reported in the paper.

library(mvtnorm)

## Order of means is: BF%, Upper Arm Girth, Thigh Girth
male_means <- c(11.1, 35.6, 61.1)
female_means <- c(21.7, 29.5, 56.1)

Next, we need to construct a correlation matrix between these three variables. Again, we will create a correlation matrix for males and females, separately. I’m not sure of the exact correlation between these variables, so I’ll just estimate what I believe it to be. For example, the correlations probably aren’t 0.99 but they also probably aren’t below 0.6. I’m also unsure how these correlations might differ between the two genders. Thus, I’ll keep the same correlation matrix for both. To keep it simple, I’ll set the correlation between BF% and upper arm and thigh girth at 0.85 and the correlation between upper arm girth and thigh girth to be 0.9. In theory, we could consult some scientific literature on these things and attempt to construct more plausible correlations.

## Create correlation matrices
# males
male_r_matrix <- matrix(c(1, 0.85, 0.85,
                          0.85, 1, 0.9,
                          0.85, 0.9, 1), 
                   nrow = 3, ncol = 3,
       dimnames = list(c("bf_pct", "upper_arm_girth", "thigh_girth"),
                       c("bf_pct", "upper_arm_girth", "thigh_girth")))

male_r_matrix

# females
female_r_matrix <- matrix(c(1, 0.85, 0.85,
                          0.85, 1, 0.9,
                          0.85, 0.9, 1), 
                   nrow = 3, ncol = 3,
       dimnames = list(c("bf_pct", "upper_arm_girth", "thigh_girth"),
                       c("bf_pct", "upper_arm_girth", "thigh_girth")))

female_r_matrix

Now we will create 1000 simulations from a multivariate normal distribution for both males and females and then row bind them together into a single big data frame.

## simulate 1000 new x, y, and z variables using the mvtnorm package
set.seed(777)
male_sim <- rmvnorm(n = n_males, mean = male_means, sigma = male_r_matrix) %>%
  as.data.frame() %>%
  setNames(c("pct_bf", "upper_arm_girth", "thigh_girth"))

female_sim <- rmvnorm(n = n_females, mean = female_means, sigma = female_r_matrix) %>%
  as.data.frame() %>%
  setNames(c("pct_bf", "upper_arm_girth", "thigh_girth"))

head(male_sim)
head(female_sim)

## put the two simulated data frames together
multi_sims <- bind_rows(male_sim, female_sim) multi_sims %>%
  head()

Finally, one last thing we’ll change is our simulation of years of experience. This variable is not a normally distributed variable because it is truncated at 0. Above, in our first simulation, we attempted to solve this with the ifelse() expression to assign any simulated value less than 0 to 0. Here, I’ll just get the quantiles of the simulated years of experience so that I have an idea of a plausible upper end of experience for the population used in this paper. Then, instead of simulating from a normal distribution I’ll do a random draw from a uniform distribution from 0 to the respective upper end for the male and female groups.

Now take the newly simulated variables and create a new data frame.

## new data frame
new_dat <- data.frame( gender = c(rep("male", times = n_males), rep("female", times = n_females)), years_exp = c(years_exp_male, years_exp_female) ) %>%
  bind_cols(multi_sims)

new_dat %>%
  head()

Finally, go through the steps we did above to estimate the squat from our four variables and the beta coefficients from the paper.

# estimate squat performance
new_dat$squat <- with(new_dat, intercept + years_exp_beta*years_exp + pct_bf_beta*pct_bf + upper_arm_girth_beta*upper_arm_girth + thigh_girth_beta*thigh_girth + see) new_dat %>%
  head()

## summary statistics
new_dat %>%
  group_by(gender) %>%
  summarize(avg_squat = mean(squat), 
            sd_squat = sd(squat))

# plots
hist(new_dat$squat)

new_dat %>%
  ggplot(aes(x = squat, fill = gender)) +
  geom_density(alpha = 0.5)

 

Look at the new regression model

fit_lm_new <- lm(squat ~ years_exp + pct_bf + upper_arm_girth + thigh_girth, data = new_dat)
summary(fit_lm_new)

Compare the squat simulated data to that which was reported in the paper

new_dat %>%
  group_by(gender) %>%
  summarize(avg_squat_sim = mean(squat), 
            sd_squat_sim = sd(squat)) %>%
  mutate(avg_squat_paper = c(118.3, 196.1),
         sd_squat_paper = c(26.6, 37.9))

Now we have mean values much closer to those observed in the population of the study. Our standard deviations differ from the study because, if you recall, the standard error of the estimate was 20.3 for the regression model. The model in the study did not include gender as an independent variable. This to me is a little strange and in the discussion of the paper the authors’ also indicate that it was strange to them too. However, the approach the authors’ took to building this deemed gender unnecessary as a predictor variable. Consequently, our simulation has a similar standard deviation in estimate squat for both the male and female populations. However, we now have a data set that is relatively close to what was observed in the paper and can therefore proceed with conducting other statistical tests or building other models.

Wrapping Up

We’ve come a long way in our simulation journey, from randomly drawing single vectors of data to now building up entire data sets using regression models and simulating data from research studies. Next time we will use simulation to explore regression assumptions.

As always, all code is freely available in the Github repository.