Category Archives: Sports Analytics

Validity, Reliability, & Responsiveness — A few papers on measurement in sport science

I had the pleasure of speaking at the National Strength and Conditioning Association's (NSCA) National Conference this summer, and while there I made it a point to attend the Sport Science & Performance Technology Special Interest Group meeting as well.

One thing that immediately stood out to me was the number of questions raised specific to what types of technologies to purchase (e.g. “Which brand of force plates should we buy?”, “Does anyone have a list comparing and contrasting different technologies so that we can determine what would be best for us?”, etc.).

While these are fine questions, I do feel they are a bit like putting the cart before the horse. Before thinking about what technology to purchase, we should spend a good bit of time gaining clarity on the question(s) we are attempting to answer. Once we have a firm understanding of the question we can then begin the process of evaluating whether a technology exists that can help us collect the necessary data to explore that question. In fact, this was the main crux of my lecture at the conference, as I spoke about using the PPDAC Framework in practice (I wrote an article about this framework a couple of years ago).

A force plate, a GPS unit, or an accelerometer won’t solve all of our problems. In fact, depending on our question, they might not solve any of our problems! Moreover, as sport scientists we need to concern ourselves not only with the research question but also with whether the desired technology is useful within our ecological setting. Just because something worked in a controlled lab environment or was valid in a different sport does not mean it will be useful for our sport, or in our setting, or with our athletes, or given our unique constraints.

So, I decided to share a few resources pertaining to measurement theory concepts such as validity, reliability, and responsiveness/sensitivity for those working in the sport science space who are interested in more critical approaches to evaluating the technology we use in practice.

Additionally, for those interested, several years ago I wrote a full R code blog for the last paper above (Swinton et al), which can be accessed HERE.

Happy reading!

Simulations in R Part 6: Multicollinearity Assumption in Regression

For the next installment of our simulation blog series, we will use simulation to examine the multicollinearity assumption in linear regression.

For a refresher, here are the previous 5 articles in this series:

  • Part 1 discussed the basic functions for simulating and sampling data in R.
  • Part 2 walked through how to perform bootstrap resampling and then simulate bivariate and multivariate distributions.
  • Part 3 worked through group comparisons by simulating thousands of t-tests.
  • Part 4 built simulations for linear regression.
  • Part 5 used simulation to investigate the homoskedasticity assumption in regression.

The entire code for this series is accessible on my GITHUB page.

Multicollinearity

Multicollinearity occurs when two (or more) independent variables in a regression model are highly correlated with each other. Such a situation can make the beta coefficients of the model difficult to interpret, can inflate the standard errors of those coefficients, and can lead to overfitting of the data. We can simulate this issue in order to get a better understanding of how multicollinearity influences a regression model.

Constructing the simulation

We will use the mvtnorm package to help us construct a simulation in which the two independent variables share a specified level of correlation with each other.

First, we will create the true model parameters: an intercept of 2, a beta1 of 5, and a beta2 of 10. We also create a vector of correlation coefficients from 0 to 0.99 and a few data frames to store the results of our model. We will also specify that at each correlation coefficient we want 200 random draws from the multivariate normal distribution.

## load packages
library(tidymodels)
library(patchwork)
library(mvtnorm)

set.seed(999)

# create the true model parameters
intercept <- 2
beta1 <- 5
beta2 <- 10

## Number of draws from a multivariate normal distribution
n <- 200

## Create a data frame to store model results
sim_params <- data.frame(intercept = NA,
                      intercept_se = NA,
                      beta1 = NA,
                      beta1_se = NA,
                      beta2 = NA,
                      beta2_se = NA,
                      model_rse = NA)

## create levels of multicollinearity between the two independent variables
cor_coefs <- c(seq(from = 0, to = 0.9, by = 0.1), 0.99)

# data frame to store the average beta coefficient and their standard deviations form the simulation
mean_betas <- data.frame(beta1 = NA,
                       sd_beta1 = NA,
                       beta2 = NA,
                       sd_beta2 = NA)

Next, we will create a nested for() loop to construct our simulations.

  •  The outer part of the loop begins by creating the multivariate normal distribution. We use rmvnorm() from the mvtnorm package, so we first build a correlation matrix from the vector of correlation coefficients specified above. Once we have the two correlated variables we can pass them into the inner loop.
  • The inner loop is where we create a regression equation for the given correlation between the two variables. We create 100 regression simulations for each correlation coefficient.
  • Once the inner loop is finished running we store the results at the bottom of the outer loop. Instead of storing all of the results, we take the average of the 100 beta coefficients and their respective standard errors for each correlation coefficient.
## loop
for(j in 1:length(cor_coefs)){
  
  ## Create a correlation matrix between beta1 and beta2
  beta_corr <- matrix(c(1, cor_coefs[j], cor_coefs[j], 1), nrow = 2, ncol = 2)
  
  ## create a multivariate normal distribution 
  cor_df <- rmvnorm(n = n, mean = c(0, 0), sigma = beta_corr)
  X1 <- cor_df[, 1]
  X2 <- cor_df[, 2]
  
  ## simulate 100 regression simulations
  for(i in 1:100){
    
    # simulate the outcome using the true model parameters plus random noise
    y_hat <- intercept + beta1*X1 + beta2*X2 + rnorm(n = n, mean = 0, sd = 1)
    
    # construct a regression equation
    model <- lm(y_hat ~ X1 + X2)
    
    # store the variance-covariance matrix
    vcv <- vcov(model)
    
    # estimates for the intercept
    sim_params[i, 1] <- model$coef[1]
  
    # estimates for the beta1
    sim_params[i, 3] <- model$coef[2]
    
    # estimates for beta2
    sim_params[i, 5] <- model$coef[3]
  
    # SE for the intercept
    sim_params[i, 2] <- sqrt(diag(vcv)[1])
    
    # SE for beta1
    sim_params[i, 4] <- sqrt(diag(vcv)[2])
    
    # SE for beta2
    sim_params[i, 6] <- sqrt(diag(vcv)[3])
  
    # model RSE
    sim_params[i, 7] <- summary(model)$sigma
  }

  mean_betas[j, ] <- c(mean(sim_params[, 3]), mean(sim_params[, 4]), mean(sim_params[, 5]), mean(sim_params[, 6]))

}

The final result is a data frame with 11 rows, one for each correlation coefficient, containing the average of the 100 regression simulations for the beta coefficients and their standard errors.

# Add the correlation coefficients to the final data frame
mean_betas$cor_coef <- cor_coefs

mean_betas %>%
  knitr::kable()

Finally, we plot each of the results with the correlation coefficients on the x-axis.

# Plot the results
par(mfrow = c(2,2))
plot(x = mean_betas$cor_coef,
     y = mean_betas$beta1,
     main = "Beta1",
     lwd = 3,
     type = "b",
     ylim = c(2, 7),
     xlab = "Correlation betwen X1 and X2",
     ylab = "Beta1")
plot(x = mean_betas$cor_coef,
     y = mean_betas$sd_beta1,
     main = "Beta1 Standard Error",
     lwd = 3,
     type = "b",
     xlab = "Correlation betwen X1 and X2",
     ylab = "SE of B1")
plot(x = mean_betas$cor_coef,
     y = mean_betas$beta2,
     main = "Beta2",
     lwd = 3,
     type = "b",
     ylim = c(7, 13),
     xlab = "Correlation betwen X1 and X2",
     ylab = "Beta 2")
plot(x = mean_betas$cor_coef,
     y = mean_betas$sd_beta2,
     main = "Beta2 Standard Error",
     lwd = 3,
     type = "b",
     xlab = "Correlation betwen X1 and X2",
     ylab = "SE of B2")

The beta coefficients themselves remain relatively unchanged in our simulation across the various correlation levels. However, once the correlation between the two independent variables reaches about 0.7, the standard errors around the beta coefficients begin to grow rapidly, increasing our uncertainty about the true parameter values.
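
This behavior follows from the variance inflation factor. With two predictors, VIF = 1 / (1 - r^2), and the standard error of a coefficient scales with the square root of the VIF. A quick sketch using the correlation coefficients from our simulation shows why things deteriorate near the top of the range:

## variance inflation factor at each correlation level
## (the SE of a coefficient scales with sqrt(VIF))
data.frame(r = cor_coefs,
           VIF = round(1 / (1 - cor_coefs^2), 1))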

Wrapping Up

We’ve now covered using simulation to investigate two assumptions of linear regression. Our next installment will investigate another linear regression assumption before we proceed on to simulating other types of models.

For the full code, check out my GITHUB page.

Comparing Tidymodels in R to Scikit Learn in Python

I did a previous blog providing a side-by-side comparison of R’s {tidyverse} to Python for various data manipulation techniques. I’ve also talked extensively about using {tidymodels} for model fitting (see HERE and HERE). Today, we will work through a tutorial on how to fit the same random forest model in {tidymodels} and Scikit Learn.

This will be a side-by-side view of both coding languages. The tutorial will cover:

  • Loading the data
  • Basic exploratory data analysis
  • Creating a train/test split
  • Hyperparameter tuning by creating cross-validated folds on the training data
  • Identifying the optimal hyperparameters and fitting the final model
  • Applying the final model to the test data and evaluating model performance
  • Saving the model for downstream use
  • Loading the saved model and applying it to new data

To get the full code for each language and follow along with the tutorial visit my GITHUB page.

The Data

The data comes from the tidytuesday project from 4/4/2023. The data set is Premier League match data (2021 – 2022) that provides a series of features with the goal of predicting the final result (Full Time Result, FTR): whether the home team won, the away team won, or the match ended in a draw.

Load Data & Packages

First, we load the data directly from the tidytuesday website in both languages.
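
The full code is on GITHUB; a minimal sketch of the R side might look like the following (the file name is an assumption based on the tidytuesday repository for that week):

## load packages
library(tidyverse)
library(tidymodels)

## read the Premier League match data straight from the tidytuesday repository
soccer <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2023/2023-04-04/soccer21-22.csv")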

Exploratory Data Analysis

Next, we perform some exploratory data analysis to understand the potential features for our model.

  • Check each column for NAs
  • Plot a count of the outcome variable across the three levels (H = home team wins, A = away team wins, D = draw)
  • Select a few features for our model and then create box plots for each feature relative to the 3 levels of our outcome variable
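
A minimal sketch of these checks in R (HST, home shots on target, is used as a hypothetical example feature):

## check each column for NAs
soccer %>%
  summarize(across(everything(), ~ sum(is.na(.x))))

## count of the outcome variable across its three levels
soccer %>%
  count(FTR)

## box plot of one candidate feature relative to the outcome levels
soccer %>%
  ggplot(aes(x = FTR, y = HST)) +
  geom_boxplot()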

Train/Test Split

We begin the model building process by creating a train/test split of the data.
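
In R, this step might be sketched with {rsample} as follows (the 75/25 proportion and stratification on the outcome are assumptions; we also convert the outcome to a factor for the classification task):

## make the outcome a factor for classification
soccer <- soccer %>%
  mutate(FTR = as.factor(FTR))

## 75/25 train/test split, stratified on the outcome
set.seed(1759)
soccer_split <- initial_split(soccer, prop = 0.75, strata = FTR)
train <- training(soccer_split)
test <- testing(soccer_split)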

Create a Random Forest Classifier Instance

This is basically telling R and Python that we want to build a random forest classifier. In {tidymodels} this is referred to as “specifying the model engine”.
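
A sketch of the {tidymodels} specification (the ranger engine and impurity-based importance are assumptions; mtry and trees are flagged for tuning in the next step):

## specify a random forest classifier with tunable hyperparameters
rf_spec <- rand_forest(mtry = tune(), trees = tune()) %>%
  set_engine("ranger", importance = "impurity") %>%
  set_mode("classification")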

Hyperparameter Tuning on Cross Validated Folds

The two random forest hyperparameters we will tune are:

  1. The number of variables randomly sampled as candidates at each split (R calls this mtry while Python calls it max_features)
  2. The number of trees to grow (R calls this trees and Python calls it n_estimators)

In {tidymodels} we will specify 5 cross-validated folds on the training set, set up a recipe that specifies the model formula (predicting FTR from all of the other variables in the data), put all of this into a single workflow, and then set up our tuning parameter grid, as sketched below.
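
A sketch of that setup (the specific grid values are assumptions, chosen to bracket the optimum reported later):

## 5 cross-validated folds on the training set
set.seed(86)
cv_folds <- vfold_cv(train, v = 5)

## recipe: predict FTR from all other variables in the data
rf_rec <- recipe(FTR ~ ., data = train)

## bundle the model spec and recipe into a single workflow
rf_wf <- workflow() %>%
  add_recipe(rf_rec) %>%
  add_model(rf_spec)

## tuning grid over mtry and trees
rf_grid <- expand.grid(mtry = c(2, 3, 4, 5),
                       trees = c(200, 300, 400, 500))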

In Scikit Learn, we set up a dictionary of parameters (NOTE: they must be stored in list format) and we will pass them into a cross validation structure that performs 5-fold cross-validation in parallel (to speed up the process). We then pass this into the GridSearchCV() function where we specify the model we are fitting (random forest), the parameter grid that we’ve specified, and how we want to compare the random forest models (scoring). Additionally, we’ll set n_jobs = -1 to allow Python to use all of the cores on our machine.

While the code looks different, we’ve essentially set up the same process in both languages.

Tune the model on the training data

We can now tune the hyperparameters by applying the cross-validated folds procedure to the training data.

Above, we indicated to Python that we wanted some parallel processing to speed things up. In {tidymodels} we specify parallel processing by registering the number of cores we’d like to use on our machine. Additionally, we want to save the results of each cross-validated iteration, which we can do through the control argument of tune_grid() (e.g., control_grid(save_pred = TRUE)). All of these steps were specified in Python above, so we are ready to apply cross-validation to our training dataset and tune the hyperparameters.
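
A sketch of the R side of this step:

## register cores for parallel processing
doParallel::registerDoParallel(cores = parallel::detectCores() - 1)

## tune the hyperparameters over the cross-validated folds,
## saving the predictions from each resample
set.seed(6)
rf_tune <- tune_grid(
  rf_wf,
  resamples = cv_folds,
  grid = rf_grid,
  control = control_grid(save_pred = TRUE)
)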

Get the best parameters

Both R and Python provide numerous objects for exploring the output of each cross-validated fold. I’ve placed some examples in the respective code on my GITHUB page. For our purposes, we are most interested in the optimal number of variables and trees. Both coding languages found 4 and 400 to be the optimal number of variables and trees, respectively.
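
On the R side, pulling the optimum might look like this (selecting on accuracy is an assumption; any metric computed during tuning works):

## resampled performance for every parameter combination
collect_metrics(rf_tune)

## best parameter combination
best_params <- select_best(rf_tune, metric = "accuracy")
best_params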

Fitting the Final Model

Now that we have the optimal hyperparameter values, we can refit the model. In both {tidymodels} and Scikit learn, we’ll just refit a random forest with those optimal values specified.
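
In {tidymodels}, a sketch of that refit:

## plug the optimal hyperparameters into the workflow and refit on the training set
final_wf <- finalize_workflow(rf_wf, best_params)
final_fit <- fit(final_wf, data = train)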

Variable Importance Plot

It’s helpful to see which variables were the most important contributors to the model’s predictions.
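
On the R side, a sketch using the {vip} package (this assumes the engine was given importance = "impurity", as in the model specification sketch above):

library(vip)

## variable importance from the fitted random forest
final_fit %>%
  extract_fit_parsnip() %>%
  vip()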

Side Note: This takes more code in Python than in R, which is one of the drawbacks I’ve found with Python compared to R. I can do things more efficiently, and with less code, in R. I often find I have to work a lot harder in Scikit Learn to get model outputs and information about the model fit. It’s all in there, but it is not as clearly accessible (to me, at least), and plotting in matplotlib is not as clean as plotting in ggplot2.

Get Model Predictions on the Test Set

Both languages offer some out of the box options for describing the model fit info. If you want more than this (which you should, because this isn’t much to go off of), then you’ll have to extract the predicted probabilities and the actual outcomes and code some additional analysis (potentially a future blog article).
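
A sketch of scoring the test set in R:

## class predictions and predicted probabilities on the test set
test_preds <- predict(final_fit, new_data = test) %>%
  bind_cols(predict(final_fit, new_data = test, type = "prob")) %>%
  bind_cols(test %>% select(FTR))

## out-of-the-box performance summary (accuracy and kappa)
test_preds %>%
  metrics(truth = FTR, estimate = .pred_class)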

Save The Model

If we want to use this model for any downstream analysis we will need to save it.
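
In R, one simple option is to save the fitted workflow as an .rds file (the file name is arbitrary):

## save the fitted workflow for later use
saveRDS(final_fit, "final_rf_model.rds")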

Load the Model and Make Predictions

Once we have the model saved we can load it and apply it to any new data that comes in. Here, our new data will just be a selection of rows from the original data set (we will pretend it is new).

NOTE: Python is 0 indexed while R is indexed starting at 1. So keep that in mind if selecting rows from the original data to make the same comparison in both languages.
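
A sketch of the R side, pretending the first five rows of the original data are new matches:

## load the saved model
rf_loaded <- readRDS("final_rf_model.rds")

## "new" data: rows 1 through 5 in R (rows 0 through 4 in Python)
new_matches <- soccer %>% slice(1:5)

## predictions on the new data
predict(rf_loaded, new_data = new_matches)
predict(rf_loaded, new_data = new_matches, type = "prob")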

Wrapping Up

Both {tidymodels} and Scikit Learn provide users with powerful machine learning frameworks for conducting analysis. While the code syntax differs, the general concepts are the same, so bouncing between the two languages shouldn’t be too cumbersome. Hopefully this tutorial provided a nice overview of how to conduct the same analysis in both languages, offering a bridge for those trying to learn Python from R and vice versa.

All code is available on my GITHUB page.

Tidymodels Workflowsets Tutorials

Workflowsets in {tidymodels} provide a useful way for analysts to run, and even tune, several machine learning models on their dataset simultaneously. You can then extract model fit information and predictions, and make comparisons to identify the most effective model type and the optimal tuning parameters.

Below are links to two workflowsets tutorials I’ve written. The first link is to an older tutorial that I wrote and posted on the blog 3 months ago. This tutorial builds several machine learning models on NWSL data to solve a classification task. The second link is a more recent tutorial that walks through workflowsets and model tuning for outcomes that are both binary and continuous.

Both tutorials go over model building, hyperparameter tuning, model comparisons, making predictions, and storing the final model for deployment at a later time in {tidymodels} using workflowsets.

Hopefully both tutorials provide analysts with a clear overview of how to set up workflowsets and run several models in parallel, helping to make their work more efficient.

Tutorial 1

Tutorial 2


Simulations in R Part 4: Simulating Regression Models

We’ve been working on building simulations in R over the past few articles. We are now to the point where we want to simulate data using regression models. To recap what we’ve done so far:

  • Part 1 discussed the basic functions for simulating and sampling data in R.
  • Part 2 walked us through how to perform bootstrap resampling and then simulate bivariate and multivariate distributions.
  • Part 3 we worked making group comparisons by simulating thousands of t-tests

In an upcoming article we will use simulation to evaluate regression assumptions. But, before we do that, it might be useful to use a regression model reported in a research paper to simulate a data set. This can be useful if you are trying to better understand the paper (you can recalculate statistics and explore the data) or if you want to apply a different type of modeling approach to build hypotheses for future research.

As always, all code is freely available in the Github repository.

I’m going to take the regression model from a paper by Ferrari et al. (2022), Performance and anthropometrics of classic powerlifters: Which characteristics matter? J Strength Cond Res., which looked to predict powerlifting performance in male and female lifters.

The paper used several variables to try to predict Squat, Bench, Deadlift, and Powerlifting Total in raw powerlifters. To keep this simple, I’ll focus on the model for predicting squat performance, which was:

Squat = -145.7 + 4.3*(Years of Experience) - 1.7*(Body Fat %) + 6.0*(Upper Arm Girth) + 1.9*(Thigh Girth)

The model had a standard error of estimate (SEE) of 20.3 and an r-squared of 0.83.

Getting the necessary info from the paper

Aside from the model coefficients, above, we also need to get the mean and standard deviation for each of the variables in the model (for both male and female lifters). We will use those parameters to simulate 1000 male and female lifters.

library(tidyverse)

# set the seed for reproducibility
set.seed(82)

# sample size - simulate 1000 male and 1000 female lifters
n_males <- 1000
n_females <- 1000

# set the model coefficients for squats
intercept <- -145.7
years_exp_beta <- 4.3
pct_bf_beta <- -1.7
upper_arm_girth_beta <- 6
thigh_girth_beta <- 1.9

# Standard Error of Estimate (SEE) is sampled from a normal distribution with a mean = 0 and 
# standard deviation of 20.3
see <- rnorm(n = n_males + n_females, mean = 0, sd = 20.3)

# get mean and sd for each of the model variables
years_exp_male <- rnorm(n = n_males, mean = 2.6, sd = 2.4)
years_exp_male <- ifelse(years_exp_male < 0, 0, years_exp_male)
years_exp_female <- rnorm(n = n_females, mean = 1.6, sd = 1.5)
years_exp_female <- ifelse(years_exp_female < 0, 0, years_exp_female)
pct_bf_male <- rnorm(n = n_males, mean = 11.1, sd = 3.8)
pct_bf_female <- rnorm(n = n_females, mean = 21.7, sd = 5.4)
upper_arm_girth_male <- rnorm(n = n_males, mean = 35.6, sd = 2.8)
upper_arm_girth_female <- rnorm(n = n_females, mean = 29.5, sd = 3.1)
thigh_girth_male <- rnorm(n = n_males, mean = 61.1, sd = 5.5)
thigh_girth_female <- rnorm(n = n_females, mean = 56.1, sd = 4.9)

# put the simulated data into a data frame
dat <- data.frame(
  gender = c(rep("male", times = n_males), rep("female", times = n_females)),
  years_exp = c(years_exp_male, years_exp_female),
  pct_bf = c(pct_bf_male, pct_bf_female),
  upper_arm_girth = c(upper_arm_girth_male, upper_arm_girth_female),
  thigh_girth = c(thigh_girth_male, thigh_girth_female)
)

dat %>%
  head()

As a sanity check, we can quickly check the mean and standard deviation of each variable to ensure the simulation appears as we intended it to.

## check means and standard deviations of the simulation
dat %>%
  group_by(gender) %>%
  summarize(across(.cols = years_exp:thigh_girth,
                   list(avg = mean, SD = sd)),
            .groups = "drop") %>%
  pivot_longer(cols = -gender)

Estimate squat performance using the model

Next, we use the values that we simulated and the model coefficients to simulate the outcome of interest (Squat performance).

# estimate squat performance
dat$squat <- with(dat, intercept + years_exp_beta*years_exp + pct_bf_beta*pct_bf + upper_arm_girth_beta*upper_arm_girth + thigh_girth_beta*thigh_girth + see)

dat %>%
  head()

## summary statistics
dat %>%
  group_by(gender) %>%
  summarize(avg_squat = mean(squat), 
            sd_squat = sd(squat))

# plots
hist(dat$squat)

dat %>%
  ggplot(aes(x = squat, fill = gender)) +
  geom_density(alpha = 0.5)

Look at the regression model

Now we fit a regression model with our simulated data.

fit_lm <- lm(squat ~ years_exp + pct_bf + upper_arm_girth + thigh_girth, data = dat)
summary(fit_lm)

  • The coefficients are similar to those in the paper (which makes sense, because we coded them this way).
  • The standard error (Residual standard error) is 20.3, as expected.
  • The r-squared for the model is 0.81, close to what was observed in the paper.


Making a better simulation

One thing to notice is that the mean and standard deviation of the squat in our simulation are a bit lower than what is reported for the population in the paper.

Our estimates are a little low relative to what was reported in the paper. The model above still works, because we constructed it with the regression coefficients and variable parameters from the paper, so we can still play with the data and learn something. But the estimated squats might be a little low because the anthropometric variables in the model (BF%, upper arm girth, and thigh girth) are going to be correlated with each other in some way. So, we can make this simulation more informative by simulating those variables from a multivariate normal distribution, as we did in Part 2.

To start, we load the mvtnorm package and set up a vector of mean values for each variable. We will construct a vector for males and females separately. We will use the mean values for each variable reported in the paper.

library(mvtnorm)

## Order of means is: BF%, Upper Arm Girth, Thigh Girth
male_means <- c(11.1, 35.6, 61.1)
female_means <- c(21.7, 29.5, 56.1)

Next, we need to construct a correlation matrix between these three variables. Again, we will create a correlation matrix for males and females, separately. I’m not sure of the exact correlation between these variables, so I’ll just estimate what I believe it to be. For example, the correlations probably aren’t 0.99 but they also probably aren’t below 0.6. I’m also unsure how these correlations might differ between the two genders. Thus, I’ll keep the same correlation matrix for both. To keep it simple, I’ll set the correlation between BF% and upper arm and thigh girth at 0.85 and the correlation between upper arm girth and thigh girth to be 0.9. In theory, we could consult some scientific literature on these things and attempt to construct more plausible correlations.

## Create correlation matrices
# males
male_r_matrix <- matrix(c(1, 0.85, 0.85,
                          0.85, 1, 0.9,
                          0.85, 0.9, 1), 
                   nrow = 3, ncol = 3,
       dimnames = list(c("bf_pct", "upper_arm_girth", "thigh_girth"),
                       c("bf_pct", "upper_arm_girth", "thigh_girth")))

male_r_matrix

# females
female_r_matrix <- matrix(c(1, 0.85, 0.85,
                          0.85, 1, 0.9,
                          0.85, 0.9, 1), 
                   nrow = 3, ncol = 3,
       dimnames = list(c("bf_pct", "upper_arm_girth", "thigh_girth"),
                       c("bf_pct", "upper_arm_girth", "thigh_girth")))

female_r_matrix
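
One caveat worth noting: rmvnorm() treats its sigma argument as a covariance matrix, so passing a correlation matrix directly (as we do below) simulates each variable with a standard deviation of 1 rather than the standard deviations reported in the paper. A minimal sketch of a possible refinement, rescaling the male correlation matrix with the standard deviations used earlier (3.8, 2.8, and 5.5 for BF%, upper arm girth, and thigh girth):

## rescale the correlation matrix into a covariance matrix so the simulated
## variables keep the paper's standard deviations (optional refinement)
male_sds <- c(3.8, 2.8, 5.5)
male_cov_matrix <- diag(male_sds) %*% male_r_matrix %*% diag(male_sds)
male_cov_matrix

The same rescaling could be applied to the female matrix with its respective standard deviations before passing the result to rmvnorm() as sigma.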

Now we will create 1000 simulations from a multivariate normal distribution for both males and females and then row bind them together into a single big data frame.

## simulate 1000 new BF%, upper arm girth, and thigh girth values for each group using the mvtnorm package
set.seed(777)
male_sim <- rmvnorm(n = n_males, mean = male_means, sigma = male_r_matrix) %>%
  as.data.frame() %>%
  setNames(c("pct_bf", "upper_arm_girth", "thigh_girth"))

female_sim <- rmvnorm(n = n_females, mean = female_means, sigma = female_r_matrix) %>%
  as.data.frame() %>%
  setNames(c("pct_bf", "upper_arm_girth", "thigh_girth"))

head(male_sim)
head(female_sim)

## put the two simulated data frames together
multi_sims <- bind_rows(male_sim, female_sim)

multi_sims %>%
  head()

Finally, one last thing we’ll change is our simulation of years of experience. This variable is not a normally distributed variable because it is truncated at 0. Above, in our first simulation, we attempted to solve this with the ifelse() expression to assign any simulated value less than 0 to 0. Here, I’ll just get the quantiles of the simulated years of experience so that I have an idea of a plausible upper end of experience for the population used in this paper. Then, instead of simulating from a normal distribution I’ll do a random draw from a uniform distribution from 0 to the respective upper end for the male and female groups.
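
The code for that step isn’t shown here, but a sketch consistent with the description might look like this (the upper limits of roughly 8 and 5 years are assumptions based on the quantiles of the earlier draws):

## plausible upper end of years of experience from the first simulation
quantile(years_exp_male, probs = 0.98)
quantile(years_exp_female, probs = 0.98)

## re-draw years of experience from a uniform distribution truncated at 0
set.seed(91)
years_exp_male <- runif(n = n_males, min = 0, max = 8)
years_exp_female <- runif(n = n_females, min = 0, max = 5)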

Now take the newly simulated variables and create a new data frame.

## new data frame
new_dat <- data.frame(
  gender = c(rep("male", times = n_males), rep("female", times = n_females)),
  years_exp = c(years_exp_male, years_exp_female)
) %>%
  bind_cols(multi_sims)

new_dat %>%
  head()

Finally, go through the steps we did above to estimate the squat from our four variables and the beta coefficients from the paper.

# estimate squat performance
new_dat$squat <- with(new_dat, intercept + years_exp_beta*years_exp + pct_bf_beta*pct_bf + upper_arm_girth_beta*upper_arm_girth + thigh_girth_beta*thigh_girth + see)

new_dat %>%
  head()

## summary statistics
new_dat %>%
  group_by(gender) %>%
  summarize(avg_squat = mean(squat), 
            sd_squat = sd(squat))

# plots
hist(new_dat$squat)

new_dat %>%
  ggplot(aes(x = squat, fill = gender)) +
  geom_density(alpha = 0.5)


Look at the new regression model

fit_lm_new <- lm(squat ~ years_exp + pct_bf + upper_arm_girth + thigh_girth, data = new_dat)
summary(fit_lm_new)

Compare the simulated squat data to what was reported in the paper

new_dat %>%
  group_by(gender) %>%
  summarize(avg_squat_sim = mean(squat), 
            sd_squat_sim = sd(squat)) %>%
  mutate(avg_squat_paper = c(118.3, 196.1),
         sd_squat_paper = c(26.6, 37.9))

Now we have mean values much closer to those observed in the population of the study. Our standard deviations differ from the study because, if you recall, the standard error of the estimate for the regression model was 20.3. The model in the study did not include gender as an independent variable. That seemed a little strange to me, and in the discussion of the paper the authors indicate it was strange to them too; however, the approach they took to building the model deemed gender unnecessary as a predictor variable. Consequently, our simulation has a similar standard deviation in estimated squat for both the male and female populations. Still, we now have a data set that is reasonably close to what was observed in the paper and can therefore proceed with conducting other statistical tests or building other models.

Wrapping Up

We’ve come a long way in our simulation journey, from randomly drawing single vectors of data to now building up entire data sets using regression models and simulating data from research studies. Next time we will use simulation to explore regression assumptions.

As always, all code is freely available in the Github repository.