Category Archives: Model Building in R

Tidymodels Workflow Sets Tutorial

Intro

The purpose of workflow sets is to allow you to seamlessly fit multiple different models (and even tune them) simultaneously. This provides an efficient approach to the model building process, as the models can then be compared to each other to determine which one is optimal for deployment. Therefore, the aim of this tutorial is to provide a simple walk-through of how to set up a workflow_set() and build multiple models simultaneously using the tidymodels framework.

The full code (which will include code not directly embedded in this tutorial) is available on my GITHUB page.

Load Packages & Data

Data comes from the nwslR package, which provides a lot of really nice National Women’s Soccer League data.

We will be using stats for field players to determine those who received the Best XI award (there will only be 10 players per season since we are dealing with field player stats, no goalies).

## packages
library(tidyverse)
library(tidymodels)
library(nwslR)
library(tictoc)

theme_set(theme_light() +
            theme(strip.background = element_rect(fill = "black"),
                  strip.text = element_text(face = "bold")))


## data sets required
data(player)
data(fieldplayer_overall_season_stats)
data(award)

## join all data sets to make a primary data set
d <- fieldplayer_overall_season_stats %>%
  left_join(player) %>% 
  left_join(award) %>% 
  select(-name_other) %>% 
  mutate(best_11 = case_when(award == "Best XI" ~ 1,
                             TRUE ~ 0)) %>% 
  select(-award)

d %>% 
  head()


Our features will be all of the playing stats: mp, starts, min, gls, ast, pk, p_katt and the position (pos) that the player played.

Exploratory Data Analysis

Let’s explore some of the variables that we will be modeling.

How many NAs are there in the data set?
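The counting code isn’t embedded in this post, but one quick way to check missing values per column is a simple dplyr summary:

## count NAs in each column
d %>%
  summarize(across(everything(), ~sum(is.na(.x))))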

  • It looks like there are some players with matches played (mp) and starts recorded, yet the number of minutes was not recorded. We will need to handle this in our pre-processing. The alternative approach would be to just remove those 79 players; however, I will add an imputation step in the recipe section of our model building process to show how it works.
  • There are also a number of players that played in games but never attempted a penalty kick. We will set these columns to 0 (the median value).

How many matches did those who have an NA for minutes play in?
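That code isn’t embedded here either; a minimal sketch of one way to look at it:

## matches played for players missing minutes
d %>%
  filter(is.na(min)) %>%
  count(mp)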


Let’s get a look at the relationship between matches played, `mp`, and `min` to see if maybe we can impute the value for those who have NA.

fit_min <- lm(min ~ mp, data = d)
summary(fit_min)

plot(x = d$mp, 
     y = d$min,
     main = "Minutes Played ~ Matches Played",
     xlab = "Matches Played",
     ylab = "Minutes Played",
     col = "light grey",
     pch = 19)
abline(fit_min,
       col = "red",
       lwd = 5,
       lty = 2)

  • There is a large amount of error in this model (residual standard error = 264) and the variance in the relationship appears to increase as matches played increases. This is all we have in this data set to really go on. It is probably best to figure out why no minutes were recorded for those players or see if there are other features in a different data set that can help us out. For now, we will stick with this simple model and use it in our model `recipe` below.

Plot the density of the continuous predictor variables based on the `best_11` award.

d %>%
  select(mp:p_katt, best_11) %>%
  pivot_longer(cols = -best_11) %>%
  ggplot(aes(x = value, fill = as.factor(best_11))) +
  geom_density(alpha = 0.6) +
  facet_wrap(~name, scales = "free") +
  labs(x = "Value",
       y = "Density",
       title = "Distribution of variables relative to Best XI designation",
       subtitle = "NOTE: axes are specific to the value in question")

How many field positions are there?

Some players appear to play multiple positions. Maybe they are more versatile? Have players with position versatility won more Best XI awards?
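The code for these counts isn’t embedded in the post; a quick sketch of how you might check both questions:

## number of players listed at each position (or position combination)
d %>%
  count(pos, sort = TRUE)

## Best XI awards by position
d %>%
  group_by(pos) %>%
  summarize(best_11_awards = sum(best_11))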

Data Splitting

First, I’ll create a data set of just the predictors and outcome variables (and get rid of the other variables in the data that we won’t be using). I’ll also convert our binary outcome variable from a number to a factor, for model fitting purposes.
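That code isn’t embedded here, but a minimal sketch of the step (assuming the feature names listed earlier) would look something like this, creating the d_model data set used in the split below:

d_model <- d %>%
  select(pos, mp, starts, min, gls, ast, pk, p_katt, best_11) %>%
  mutate(best_11 = as.factor(best_11))

d_model %>%
  head()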

Split the data into train/test splits.

## Train/Test
set.seed(398)
init_split <- initial_split(d_model, prop = 0.7, strata = "best_11")

train <- training(init_split)
test <- testing(init_split)

Further split the training set into 5 cross validation folds.

## Cross Validation Split of Training Data
set.seed(764)
cv_folds <- vfold_cv(
  data = train, 
  v = 5
  ) 


Prepare the data with a recipe

Recipes help us set up the data for modeling purposes. It is here that we can handle missing values, scale/normalize our features, and create dummy variables. More importantly, creating the recipe ensures that if we deploy our model for future predictions, the steps in the data preparation process will be consistent and standardized with what we did when we fit the model.

You can find all of the recipe options HERE.

The pre-processing steps we will use are:

  • Impute any NA minutes, `min` using the `mp` variable.
  • Create one hot encoded dummy variables for the player’s position
  • Impute the median (0) when penalty kicks made (pk), penalty kicks attempted (p_katt), or assists (ast) are NA
  • Normalize the numeric data to have a mean of 0 and standard deviation of 1
nwsl_rec <- recipe(best_11 ~ ., data = train) %>%
  step_impute_linear(min, impute_with = imp_vars(mp)) %>%
  step_dummy(pos, one_hot = TRUE) %>%
  step_impute_median(pk, p_katt, ast) %>%
  step_normalize(mp:p_katt)

nwsl_rec

Here is what the pre-processed training set looks like when we apply this recipe:
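The pre-processed output isn’t embedded here, but one way to inspect it is to prep the recipe and bake it on the training data (a quick sketch using the standard {recipes} helpers):

nwsl_rec %>%
  prep() %>%
  bake(new_data = NULL) %>%
  head()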

Specifying the models

We will fit three models at once:

  1. Random Forest
  2. XGBoost
  3. K-Nearest Neighbor

## Random forest
rf_model <- rand_forest(mtry = tune(), trees = tune()) %>%
  set_mode("classification") %>%
  set_engine("randomForest", importance = TRUE)

## XGBoost
xgb_model <- boost_tree(trees = tune(), mtry = tune(), tree_depth = tune(), learn_rate = 0.01) %>%
  set_mode("classification") %>% 
  set_engine("xgboost", importance = TRUE)

## K-Nearest Neighbor
knn_model <- nearest_neighbor(neighbors = 4) %>%
  set_mode("classification")

Workflow Set

We are now ready to combine the pre-processing recipes and the three models together in a workflow_set().

nwsl_wf <- workflow_set(
  preproc = list(nwsl_rec),
  models = list(rf_model, xgb_model, knn_model),
  cross = TRUE
  )

nwsl_wf

Tune & fit the 3 workflows

Once the models are set up we use workflow_map() to fit the workflows to the cross-validated folds we created. We will set up a few tuning parameters for the Random Forest and XGBOOST models so that during the fitting process we can determine which parameter pairings optimize model performance.

I also use the tic() and toc() functions from the tictoc package to determine how long the models take to fit, in case there are potential opportunities to optimize the fitting process.

doParallel::registerDoParallel(cores = 10)

tic()

fit_wf <- nwsl_wf %>%  
  workflow_map(
    seed = 44, 
    fn = "tune_grid",
    grid = 10,           ## parameters to pass to tune grid
    resamples = cv_folds
  )

toc()

# Took 1.6 minutes to fit

doParallel::stopImplicitCluster()

fit_wf


Evaluate each model’s performance on the train set

We can plot the model predictions across the range of models we fit using autoplot(), get a summary of the model predictions with the collect_metrics() function, and rank the results of the model using rank_results().

 

## plot each of the model's performance and ROC
autoplot(fit_wf)

## Look at the model metrics for each of the models
collect_metrics(fit_wf) 

## Rank the results based on model accuracy
rank_results(fit_wf, rank_metric = "accuracy", select_best = TRUE)


We see that the Random Forest models outperformed the XGBOOST and KNN models.

Extract the model with the best performance

Now that we know the Random Forest performed the best, we will grab the workflow ID for the best Random Forest model and its corresponding workflow.

## get the workflow ID for the best model
best_model_id <- fit_wf %>% 
  rank_results(
    rank_metric = "accuracy",
    select_best = TRUE
  ) %>% 
  head(1) %>% 
  pull(wflow_id)

best_model_id

## Extract the workflow for the best model
best_model <- extract_workflow(fit_wf, id = best_model_id)
best_model

Extract the tuned results from workflow of the best model

We know the best model was the Random Forest model so we can use the best_model_id to get all of the Random Forest models out and look at how each one did during the tuning process.

First we extract the Random Forest models.

## extract the Random Forest models
best_workflow <- fit_wf[fit_wf$wflow_id == best_model_id, "result"][[1]][[1]]

best_workflow
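Alternatively, {workflowsets} provides a helper that should return the same tuning results without the manual subsetting (a sketch; I’m assuming extract_workflow_set_result() is the intended accessor here):

## same tuning results via the workflowsets helper
best_workflow <- extract_workflow_set_result(fit_wf, id = best_model_id)
best_workflow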

With the collect_metrics() function we can see the combinations of mtry and trees that were evaluated in the tuning process. We can also use select_best() to get the parameters that performed the best of the Random Forest models.

collect_metrics(best_workflow)
select_best(best_workflow, "accuracy")

Fit the final model

We saw above that the best model had the following tuning parameters:

  • mtry = 1
  • trees = 944

We can extract this optimized workflow using the finalize_workflow() function and then fit that final workflow to the initial training split data.

## get the finalized workflow
final_wf <- finalize_workflow(best_model, select_best(best_workflow, "accuracy"))
final_wf

## fit the final workflow to the initial data split
doParallel::registerDoParallel(cores = 8)

final_fit <- final_wf %>% 
  last_fit(
    split = init_split
  )

doParallel::stopImplicitCluster()

final_fit

Extract Predictions on Test Data and evaluate model

First we can evaluate the variable importance plot for the random forest model.

library(vip)

final_fit %>%
  extract_fit_parsnip() %>%
  vip(geom = "col",
      aesthetics = list(
              color = "black",
              fill = "palegreen",
              alpha = 0.5)) +
  theme_classic()

Next we will look at the accuracy and ROC on the test set by using the collect_metrics() function on the final_fit. Additionally, if we use the collect_predictions() function we will get the predicted class and predicted probabilities for each row of the test set.

## Look at the accuracy and ROC on the test data
final_fit %>% 
  collect_metrics()

## Get the model predictions on the test data
fit_test <- final_fit %>% 
  collect_predictions()

fit_test %>%
  head()

Next, create a confusion matrix of the class of interest, best_11 and our predicted class, .pred_class.

fit_test %>% 
  count(.pred_class, best_11)

table(fit_test$best_11, fit_test$.pred_class)

We see that the model never actually predicted a person to be in class 1, which would indicate they were ranked as one of the Best XI for a given season. We have such substantial class imbalance that the model can basically guess that no one will win Best XI and still end up with a high accuracy.

The predicted class for a binary prediction ends up coming from a default threshold of 0.5, meaning that the predicted probability of being one of the Best XI needs to exceed 50% in order for that class to be predicted. This might be a bit high/extreme for our data! Additionally, in many instances we may not care so much about a specific predicted class but instead we want to just understand the probability of being predicted in one class or another.

Let’s plot the distribution of Best XI predicted probabilities colored by whether the individual was actually one of the Best XI players.

fit_test %>%
  ggplot(aes(x = .pred_1, fill = best_11)) +
  geom_density(alpha = 0.6)

We can see that those who were actually given the Best XI designation had a higher probability of being indicated as Best XI, just not high enough to exceed the 0.5 default threshold. What if we set the threshold for being classified as Best XI at 0.08?

fit_test %>%
  mutate(pred_best_11_v2 = ifelse(.pred_1 > 0.08, 1, 0)) %>%
  count(pred_best_11_v2, best_11)

Wrapping Up

In the final code output above we see that there are 20 total instances where the model predicted the individual would be a Best XI player. In some of those instances the model correctly identified one of the Best XI, and in other instances the prediction was a false positive (the model thought the person had a Best XI season, but it was incorrect). There is a lot to unpack here. Binary thresholds like this can often be messy, as classifications become ambiguous as you get close to the threshold line. Additionally, changing the threshold will change the classification outcome. This would need to be considered based on your tolerance for committing a Type I or Type II error, which may depend on the goal of your model, among other things. Finally, we often care more about the probability of being in one class or another than about a specific class outcome. All of these things need to be considered and thought through and are out of the scope of this tutorial, which had the aim of simply walking through how to set up a workflow_set() and fit multiple models simultaneously. Perhaps a future tutorial can cover such matters more in depth.

The complete code for this tutorial is available on my GITHUB page.

tidymodels: bootstrapping for coefficient uncertainty and prediction

Julia Silge recently posted a new #tidytuesday blog article using the {tidymodels} package to build bootstrapped samples of a data set and then fit a linear model to those bootstrapped samples as a means of exploring the uncertainty around the model coefficients.

I’ve written a few pieces on resampling methods here (See TidyX Episode 98 and THIS article I wrote about how to approximate a Bayesian Posterior Prediction). I enjoyed Julia’s article (and corresponding screen cast) so I decided to expand on what she shared, this time using Baseball data, and show additional ways of evaluating the uncertainty in model coefficients as well as extending out the approach to using the bootstrapped models for prediction uncertainty.

Side Note: We interviewed Julia on TidyX Episode 86.

Load Packages & Data

We will be using the {Lahman} package in R to obtain hitting statistics of all players with a minimum of 200 at bats, from the 2010 season or greater.

Our goal here is to work with a simple linear model that regresses the dependent variable, Runs, on the independent variable, Hits. (Note: Runs is really a count variable, so we could have modeled this differently, but we will stick with a simple linear model for purposes of simplicity and to show how bootstrapping can be used to understand uncertainty.)

## packages
library(tidyverse)
library(Lahman)
library(tidymodels)
library(broom)

theme_set(theme_light())

## data
d <- Batting %>%
  filter(yearID >= 2010) %>%
  select(playerID, yearID, AB, R, H) %>%
  group_by(playerID, yearID) %>%
  summarize(across(.cols = everything(),
                   ~sum(.x)),
            .groups = "drop") %>%
  filter(AB >= 200)

d %>%
  head() %>%
  knitr::kable()

Exploratory Data Analysis

Before we get into the model, we will just make a simple plot of the data and produce some basic summary statistics (all of the code for this is available on my GITHUB page).
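Since that code isn’t embedded here, a minimal sketch of what such an EDA step might look like (using the H and R columns we created above):

## quick look at the relationship between Hits and Runs
d %>%
  ggplot(aes(x = H, y = R)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm")

## simple summary statistics
d %>%
  summarize(across(.cols = c(H, R),
                   list(mean = mean, sd = sd)))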


Linear Regression

First, we produce a simple linear regression using all the data to see what the coefficients look like. I’m doing this to have something to compare the bootstrapped regression coefficients to.

## Model
fit_lm <- lm(R ~ H, data = d)
tidy(fit_lm)

It looks like, for every 1 extra hit that a player gets, their Run total increases, on average, by approximately 0.518 runs. The intercept here is not interpretable since 0 hits wouldn’t lead to negative runs. We could mean-scale the Hits variable to fix this problem, but we will leave it as is for this example since it isn’t the primary focus. For now, we can think of the intercept as a value that is just helping calibrate our Runs data to a fixed value on the y-axis.

{tidymodels} regression with bootstrapping

First, we create 1000 bootstrap resamples of the data.

### 1000 Bootstrap folds
set.seed(9183)
boot_samples <- bootstraps(d, times = 1000)
boot_samples


Next, we fit our linear model to each of the 1000 bootstrapped samples. We do this with the map() function, as we loop over each of the splits.

fit_boot <- boot_samples %>%
  mutate(
    model = map(
      splits,
      ~ lm(R ~ H,
           data = .x)
    ))

fit_boot


Notice that we have each of our bootstrap samples stored in a list (splits) with a corresponding bootstrap id. We’ve added a new column, which stores a list for each bootstrap id representing the linear model information for that bootstrap sample.

Again, with the power of the map() function, we will loop over the model lists and extract the model coefficients, their standard errors, t-statistics, and p-values for each of the bootstrapped samples. We do this using the tidy() function from the {broom} package.

boot_coefs <- fit_boot %>%
  mutate(coefs = map(model, tidy))

boot_coefs %>%
  unnest(coefs)


The estimate column is the coefficient value for each of the model terms (Intercept and H). Notice that the values bounce around a bit. This is because the bootstrapped resamples are each slightly different as we resample the data, with replacement. Thus, slightly different models are fit to each of those samples.

Uncertainty in the Coefficients

Now that we have 1000 different sets of model coefficients, one for each of the resampled data sets, we can begin to explore their uncertainty.

We start with a histogram of the 1000 model coefficients to show how large the uncertainty is around the slope and intercept.

boot_coefs %>%
  unnest(coefs) %>%
  select(term, estimate) %>%
  ggplot(aes(x = estimate)) +
  geom_histogram(color = "black",
                 fill = "grey") +
  facet_wrap(~term, scales = "free_x") +
  theme(strip.background = element_rect(fill = "black"),
        strip.text = element_text(color = "white", face = "bold"))

We can also calculate the mean and standard deviation of the 1000 model coefficients and compare them to what we obtained with the original linear model fit to all the data.

## bootstrapped coefficient's mean and SD
boot_coefs %>%
  unnest(coefs) %>%
  select(term, estimate) %>%
  group_by(term) %>%
  summarize(across(.cols = estimate,
                   list(mean = mean, sd = sd)))

# check results against linear model
tidy(fit_lm)

Notice that the values obtained by taking the mean and standard deviation of the 1000 bootstrap samples are very close to the model coefficients from the linear model. They aren’t exact because the bootstraps are unique resamples. If you were to change the seed, or not set the seed when producing bootstrap samples, you would get different coefficients yet again. However, the bootstrap coefficients will always be relatively close approximations of the linear regression coefficients, within some margin of error.

We can explore the coefficient for Hits, which was our independent variable of interest, by extracting its coefficients and calculating things like 90% Quantile Intervals and 90% Confidence Intervals.

beta_h <- boot_coefs %>%
  unnest(coefs) %>%
  select(term, estimate) %>%
  filter(term == "H")

beta_h %>%
  head()

## 90% Quantile Intervals
quantile(beta_h$estimate, probs = c(0.05, 0.5, 0.95))


## 90% Confidence Intervals
beta_mu <- mean(beta_h$estimate)
beta_se <- sd(beta_h$estimate)

beta_mu
beta_se

beta_mu + qnorm(p = c(0.05, 0.95))*beta_se

Of course, if we didn’t want to go through the trouble of coding all that, {tidymodels} provides us with a helper function called int_pctl() which will produce 95% Confidence Intervals by default and we can set the alpha argument to 0.1 to obtain 90% confidence intervals.

## can use the built in function from {tidymodels}
# defaults to a 95% Confidence Interval
int_pctl(boot_coefs, coefs)

# get 90% Confidence Interval
int_pctl(boot_coefs, coefs, alpha = 0.1)

Notice that the 90% Confidence Interval for the Hits coefficient is the same as I calculated above.

Using the Bootstrapped Samples for Prediction

To use these bootstrapped samples for prediction I will first extract the model coefficients and then structure them in a wide data frame.

boot_coefs_wide <- boot_coefs %>%
  unnest(coefs) %>%
  select(term, estimate) %>%
  mutate(term = case_when(term == "(Intercept)" ~ "intercept",
                          TRUE ~ term)) %>%
  pivot_wider(names_from = term,
                  values_from = estimate,
              values_fn = 'list') %>%
  unnest(cols = everything())

boot_coefs_wide %>%
  head()


In a previous blog I talked about three types of predictions (as indicated in Gelman & Hill’s Regression and Other Stories) we might choose to make from our models:

  1. Point prediction
  2. Point prediction with uncertainty
  3. A predictive distribution for a new observation in the population

Let’s say we observe a new batter with 95 Hits on the season. How many Runs would we expect this batter to have?

To do this, I will apply the new batter’s 95 hits to the coefficients for each of the bootstrapped regression models, producing 1000 estimates of Runs for this hitter.

new_H <- 95

new_batter <- boot_coefs_wide %>%
  mutate(pred_R = intercept + H * new_H)

new_batter

 

We can plot the distribution of these estimates.

## plot the distribution of predictions
new_batter %>%
  ggplot(aes(x = pred_R)) +
  geom_histogram(color = "black",
                 fill = "light grey") +
  geom_vline(aes(xintercept = mean(pred_R)),
             color = "red",
             linetype = "dashed",
             size = 1.4)

Next, we can get our point prediction by taking the average and standard deviation over the 1000 estimates.

## mean and standard deviation of bootstrap predictions
new_batter %>%
  summarize(avg = mean(pred_R),
            SD = sd(pred_R))

We can compare this to the predicted value and standard error from the original linear model.

## compare to linear model
predict(fit_lm, newdata = data.frame(H = new_H), se = TRUE)

Pretty similar!

For making predictions about uncertainty we can make predictions either at the population level, saying something about the average person in the population (point 2 above) or at the individual level (point 3 above). The former would require us to calculate the Confidence Interval while the latter would require the Prediction Interval.

(NOTE: If you’d like to read more about the difference between Confidence and Prediction Intervals, check out THIS BLOG I did, discussing both from a Frequentist and Bayesian perspective).

We’ll start by extracting the vector of estimated runs and then calculating 90% Quantile Intervals and 90% Confidence Intervals.

## get a vector of the predicted runs
pred_runs <- new_batter %>% 
  pull(pred_R)

## 90% Quantile Intervals
quantile(pred_runs, probs = c(0.05, 0.5, 0.95))

## 90% Confidence Interval
mean(pred_runs) + qnorm(p = c(0.05, 0.95)) * sd(pred_runs)


We can compare the 90% Confidence Interval of our bootstrapped samples to that of the linear model.

## Compare to 90% confidence intervals from linear model
predict(fit_lm, newdata = data.frame(H = new_H), interval = "confidence", level = 0.90)


Again, pretty close!

Now we are ready to create prediction intervals. This is a little tricky because we need the model sigma from each of the bootstrapped models. The model sigma is represented as the Residual Standard Error in the original linear model output. Basically, this informs us about how much error there is in our model, indicating how far off our predictions might be. In this case, our predictions are, on average, off by about 10.87 Runs.

To extract this residual standard error value for each of the bootstrapped resamples, we will use the glance() function from the {broom} package, which produces model fit variables. Again, we use the map() function to loop over each of the bootstrapped models, extracting sigma.

boot_sigma <- fit_boot %>%
  mutate(coefs = map(model, glance)) %>%
  unnest(coefs) %>%
  select(id, sigma)

Next, we’ll recreate the previous wide data frame of the model coefficients but this time we retain the bootstrap id column so that we can join the sigma value of each of those models to it.

## Get the bootstrap coefficients and the bootstrap id to join the sigma with it
boot_coefs_sigma <- boot_coefs %>%
  unnest(coefs) %>%
  select(id, term, estimate) %>%
  mutate(term = case_when(term == "(Intercept)" ~ "intercept",
                          TRUE ~ term)) %>%
  pivot_wider(names_from = term,
                  values_from = estimate,
              values_fn = 'list') %>%
  unnest(everything()) %>%
  left_join(boot_sigma)


Now we have 4 columns for each of the 1000 bootstrapped samples: An Id, an intercept, a coefficient for Hits, and a residual standard error (sigma).

We make a prediction for Runs for the new batter with 95 Hits. This time, we add in some model error by drawing a random number with a mean of 0 and a standard deviation equal to the model’s sigma value.

 

## Now make prediction using a random draw with mean = 0 and sd = sigma for model uncertainty
new_H <- 95

# set seed so that the random draw for model error is replicable
set.seed(476)
new_batter2 <- boot_coefs_sigma %>%
  mutate(pred_R = intercept + H * new_H + rnorm(n = nrow(.), mean = 0, sd = sigma))

new_batter2 %>%
  head()


Again, we can see that we have some predicted estimates for the number of runs we might expect for this new hitter. We can take those values and produce a histogram as well as extract the mean, standard deviation, and 90% Prediction Intervals.

## plot the distribution of predictions
new_batter2 %>%
  ggplot(aes(x = pred_R)) +
  geom_histogram(color = "black",
                 fill = "light grey") +
  geom_vline(aes(xintercept = mean(pred_R)),
             color = "red",
             linetype = "dashed",
             size = 1.4)

## mean and standard deviation of bootstrap predictions
new_batter2 %>%
  summarize(avg = mean(pred_R),
            SD = sd(pred_R),
            Low_CL90 = avg - 1.68 * SD,
            High_CL90 = avg + 1.68 * SD)


Notice that while the average number of Runs is relatively unchanged, the Prediction Intervals are much larger! That’s because we are incorporating more uncertainty into our Prediction Interval to try and say something about someone specific in the population versus just trying to make a statement about the population on average (again, check out THIS BLOG for more details).

Finally, we can compare the results from the bootstrapped models to the prediction intervals from our original linear model.

## compare to linear model
predict(fit_lm, newdata = data.frame(H = new_H), interval = "prediction", level = 0.9)


Again, pretty close to the same results!

Wrapping Up

Using things like bootstrap resampling and simulation (NOTE: They are different!) is a great way to explore uncertainty in your data and in your models. Additionally, such techniques become incredibly useful when making predictions because every prediction about the future is riddled with uncertainty and point estimates rarely ever do us any good by themselves. Finally, {tidymodels} offers a nice framework for building models and provides a number of helper functions that take a lot of the heavy lifting and coding out of your hands, allowing you to think harder about the models you are creating and less about writing for() loops and vectors.

(THIS BLOG contains my simple {tidymodels} template if you are looking to get started using the package).

If you notice any errors, please let me know!

To access the entire code for this article, please see my GITHUB page.

Bayesian Priors for Categorical Variables using rstanarm

Continuing with more Bayesian data analysis using the {rstanarm} package, today we walk through ways of setting priors on categorical variables.

NOTE: Priors are a pretty controversial piece of Bayesian statistics and one of the arguments people make against Bayesian data analysis. Thus, I’ll also show what happens when you are overly bullish with your priors.

The full code is accessible on my GITHUB page.

Load Packages & Data

We are going to use the mtcars data set from R. The cylinder variable (cyl) is read in as a numeric but it only has three levels (4, 6, 8); therefore, we will convert it to a categorical variable and treat it as such for the analysis.

We are going to build a model that estimates miles per gallon (mpg) from the number of cylinders a car has. So, we will start by looking at the mean and standard deviation of mpg for each level of cyl.

## Bayesian priors for categorical variables using rstanarm

library(rstanarm)
library(tidyverse)
library(patchwork)

### Data -----------------------------------------------------------------
d <- mtcars %>%
  select(mpg, cyl, disp) %>%
  mutate(cyl = as.factor(cyl),
         cyl6 = ifelse(cyl == "6", 1, 0),
         cyl8 = ifelse(cyl == "8", 1, 0))

d %>% 
  head()

d %>%
  group_by(cyl) %>%
  summarize(avg = mean(mpg),
            SD = sd(mpg))

Fit the model using Ordinary Least Squares regression

Before constructing our Bayesian model, we fit the model as a basic regression model to see what the output looks like.

## Linear regression ------------------------------------------------------
fit_lm <- lm(mpg ~ cyl, data = d)
summary(fit_lm)

  • The model suggests there is a relationship between mpg and cyl number
  • A 4 cyl car is represented as the intercept. Consequently, the intercept represents the average mpg we would expect from a 4 cylinder car.
  • The other two coefficients (cyl6 and cyl8) represent the difference in mpg for each of those cylinder counts relative to a 4 cylinder car (the model intercept). So, a 6 cylinder car will, on average, get about 7 fewer mpg than a 4 cylinder car, while an 8 cylinder car will, on average, get about 12 fewer mpg than a 4 cylinder car.

Bayesian regression with rstanarm — No priors specified

First, let’s fit the model with no priors specified (using the functions default priors) to see what sort of output we get.

## setting no prior info
stan_glm(mpg ~ cyl, data = d) %>%
  summary(digits = 3)

  • The output is a little different than the OLS model. First we see that there are no p-values (in the spirit of Bayes analysis!).
  • We do find that the model coefficients are basically the same as those produced with the OLS model, and even the standard deviations are similar to the standard errors from above.
  • Instead of p-values for each coefficient we get 80% credible intervals.
  • The sigma value at the bottom corresponds to the residual standard error we got in our OLS model.

Basically, the default priors “let the data speak” and reported back the underlying relationship in the empirical data.

Setting Some Priors

Next, we can set some minimally informative priors. These priors won’t contain much information and, therefore, will be heavily influenced by even small amounts of evidence about the underlying relationship present in the data.

To set priors on independent variables in rstanarm we need to create an object to store them. We have two independent variables (cyl6 and cyl8), both requiring priors (we will set the prior for the intercept and the model sigma in the function directly). To set these priors we need to specify a distribution, a mean value (location), and a standard deviation (scale). We add these values into the distribution function in the order in which they will appear in the model. So, there will be a vector of locations specific to cyl6 and cyl8 and then a vector of scales that is also specific to cyl6 and cyl8, in that order.

## Setting priors
ind_var_priors <- normal(location = c(0, 0), scale = c(10, 10))

Next, we run the model.

fit_rstan <- stan_glm(mpg ~ cyl, 
                      prior = ind_var_priors,
                      prior_intercept = normal(15, 8),
                      prior_aux = cauchy(0, 3),
                      data = d)

# fit_rstan
summary(fit_rstan, digits = 3)

Again, this model is not so different from the one that used the default priors (or from the findings of the OLS model). But, our priors were uninformative.

One note I’ll make, before proceeding on, is that you can do this a different way and simply dummy code the categorical variables and enter those dummies directly into the model, setting priors on each, and you will obtain the same result. The below code dummy codes cyl6 and cyl8 as booleans (1 = yes, 0 = no) so when both are 0 we effectively are left with cyl4 (the model intercept).

############################################################################################
#### Alternate approach to coding the priors -- dummy coding the categorical variables #####
############################################################################################

d2 <- d %>%
  mutate(cyl6 = ifelse(cyl == "6", 1, 0),
         cyl8 = ifelse(cyl == "8", 1, 0))

summary(lm(mpg ~ cyl6 + cyl8, data = d2))

stan_glm(mpg ~ cyl6 + cyl8, 
         prior = ind_var_priors,
         prior_intercept = normal(15, 8),
         prior_aux = cauchy(0, 3),
         data = d2) %>%
  summary()

###########################################################################################
###########################################################################################

Okay, back to our regularly scheduled programming…..

So what’s the big deal? The model coefficients are relatively the same as with OLS. Why go through the trouble? Two reasons:

  1. Producing the posterior distribution of the model coefficients and the posterior predictive distribution for the dependent variable allows us to evaluate our uncertainty around each. I’ve talked a bit about this before (Making Predictions with a Bayesian Regression Model, Confidence & Prediction Intervals – Compare and Contrast Frequentist and Bayesian Approaches, and Approximating a Bayesian Posterior with OLS).
  2. If we have more information on relationship between mpg and cylinders we can code that in as information the model can use!

Let’s table point 2 for a second and extract some posterior samples from our Bayesian regression so we can visualize the uncertainty in the coefficients.

# posterior samples
post_rstan <- as.matrix(fit_rstan) %>%
  as.data.frame() %>%
  rename('cyl4' = '(Intercept)')

post_rstan %>%
  head()

mu.cyl4 <- post_rstan$cyl4
mu.cyl6 <- post_rstan$cyl4 + post_rstan$cyl6
mu.cyl8 <- post_rstan$cyl4 + post_rstan$cyl8

rstan_results <- data.frame(mu.cyl4, mu.cyl6, mu.cyl8) %>%
  pivot_longer(cols = everything())


rstan_plt <- rstan_results %>%
  left_join(
    
    d %>%
      group_by(cyl) %>%
      summarize(avg = mean(mpg)) %>%
      rename(name = cyl) %>%
      mutate(name = case_when(name == "4" ~ "mu.cyl4",
                              name == "6" ~ "mu.cyl6",
                              name == "8" ~ "mu.cyl8"))
    
  ) %>%
  ggplot(aes(x = value, fill = name)) +
  geom_histogram(alpha = 0.4) +
  geom_vline(aes(xintercept = avg),
             color = "black",
             size = 1.2,
             linetype = "dashed") +
  facet_wrap(~name, scales = "free_x") +
  theme_light() +
  theme(strip.background = element_rect(fill = "black"),
        strip.text = element_text(color = "white", face = "bold")) +
  ggtitle("rstanarm")

rstan_plt

  • The above plot represents the posterior distribution (the prior combined with the observed data, the likelihood) of the estimated values for each of our cylinder types.
  • The dashed line is the observed mean mpg for each cylinder type in the data.
  • The distribution helps give us a good sense of the certainty (or uncertainty) we have in our estimates.

We can summarize this uncertainty with point estimates (e.g., mean and median) and measures of spread (e.g., standard deviation, credible intervals, quantile intervals).

 

# summarize posteriors
mean(mu.cyl4)
sd(mu.cyl4)
qnorm(p = c(0.05, 0.95), mean = mean(mu.cyl4), sd = sd(mu.cyl4))
quantile(mu.cyl4, probs = c(0.05, 0.25, 0.5, 0.75, 0.95))

mean(mu.cyl6)
sd(mu.cyl6)
qnorm(p = c(0.05, 0.95), mean = mean(mu.cyl6), sd = sd(mu.cyl6))
quantile(mu.cyl6, probs = c(0.05, 0.25, 0.5, 0.75, 0.95))

mean(mu.cyl8)
sd(mu.cyl8)
qnorm(p = c(0.05, 0.95), mean = mean(mu.cyl8), sd = sd(mu.cyl8))
quantile(mu.cyl8, probs = c(0.05, 0.25, 0.5, 0.75, 0.95))

For example, the below information tells us that cyl8 cars will, on average, provide us with ~15.2 mpg with a credible interval between 13.7 and 16.2. The median value is 15.2 with an interquartile range between 14.6 and 15.8 and a 90% quantile interval ranging between 13.7 and 16.6.

Bullish Priors

As stated earlier, priors are one of the most controversial aspects of Bayesian analysis. Most argue against Bayes because they feel that priors can be manipulated to fit the data. However, what many fail to recognize is that no analysis is void of human decision-making. All analysis is done by humans and thus there are a number of subjective decisions that need to be made along the way, such as deciding what to do with outliers, how to handle missing data, the alpha level or level of confidence that you want to test your data against, etc. As I’ve said before, science isn’t often as objective as we’d like it to be. That all said, selecting priors can be done in a variety of ways (aside from just using non-informative priors as we did above). You could get expert opinion, you could use data and observations gained from a pilot study, or you could use information about parameters from previously conducted studies (though be cautious, as these also might be biased due to publication issues such as the file drawer phenomenon, p-hacking, and researcher degrees of freedom).

When in doubt, it is probably best to be conservative with your priors. But, let’s say we sit down with a mechanic and inform him of a study where we are attempting to estimate the miles per gallon for 4, 6, and 8 cylinder cars. We ask him if he can help us with any prior knowledge about the decline in mpg when the number of cylinders increase. The mechanic is very bullish with his prior information and states,

“Of course I know the relationship between cylinders and miles per gallon!! Those 4 cylinder cars tend to be very economical and get around 50 mpg, plus or minus 2. I haven’t seen too many 6 cylinder cars, but my hunch is that they’re pretty similar to the 4 cylinder cars. Now 8 cylinder cars…I do a ton of work on those! Those cars get a bad rap. In my experience they actually get better gas mileage than the 4 or 6 cylinder cars. My guess would be that they can get nearly 20 miles per gallon more than a 4 cylinder car!”

Clearly our mechanic has been sniffing too many fumes in the garage! But, let’s roll with his beliefs and codify them as prior knowledge for our model and see how such bullish priors influence the model’s behavior.

  • We set the intercept to be normally distributed with a mean of 50 and a standard deviation of 2.
  • Because the mechanic felt like the 6 cylinder car was similar to the 4 cylinder car, we will suggest that the difference between 6 cylinders and 4 cylinders is normally distributed with a mean of 0 and a standard deviation of 10.
  • Finally, we use the crazy mechanics belief that the 8 cylinder car gets roughly 20 more miles per gallon than the 4 cylinder car and we code its prior to be normally distributed with a mean of 20 and standard deviation of 5.

Fit the model…

 

## Use wildly different priors ---------------------------------------------------------
ind_var_priors2 <- normal(location = c(0, 20), scale = c(10, 5))

fit_rstan2 <- stan_glm(mpg ~ cyl, 
                       prior = ind_var_priors2,
                       prior_intercept = normal(50, 2),
                       prior_aux = cauchy(0, 10),
                       data = d)

summary(fit_rstan2, digits = 3)


Wow! Look how much the overly bullish/informative priors changed the model output.

  • Our new belief is that a 4 cylinder car gets approximately 39 mpg and the 6 cylinder car gets about 3 more mpg than that, on average.
  • The 8 cylinder car is now getting roughly 14 mpg more than the 4 cylinder car.

The bullish priors have overwhelmed the observed data. Notice that the results are not exactly equal to the priors, as they are tugged a little closer to the observed data, though not by much. For example, we specified the 8 cylinder car to have about 20 mpg over a 4 cylinder car. The observed data doesn’t indicate this to be true (8 cylinder cars were on average 11 mpg LESS THAN a 4 cylinder car), so the coefficient gets pulled down slightly, from our prior of 20 to 14.4.

Let’s plot the posterior distribution.

# posterior samples
post_rstan <- as.matrix(fit_rstan2) %>%
  as.data.frame() %>%
  rename('cyl4' = '(Intercept)')

post_rstan %>%
  head()

mu.cyl4 <- post_rstan$cyl4
mu.cyl6 <- post_rstan$cyl4 + post_rstan$cyl6
mu.cyl8 <- post_rstan$cyl4 + post_rstan$cyl8

rstan_results <- data.frame(mu.cyl4, mu.cyl6, mu.cyl8) %>%
  pivot_longer(cols = everything())


rstan_plt2 <- rstan_results %>%
  left_join(
    
    d %>%
      group_by(cyl) %>%
      summarize(avg = mean(mpg)) %>%
      rename(name = cyl) %>%
      mutate(name = case_when(name == "4" ~ "mu.cyl4",
                              name == "6" ~ "mu.cyl6",
                              name == "8" ~ "mu.cyl8"))
    
  ) %>%
  ggplot(aes(x = value, fill = name)) +
  geom_histogram(alpha = 0.4) +
  geom_vline(aes(xintercept = avg),
             color = "black",
             size = 1.2,
             linetype = "dashed") +
  facet_wrap(~name, scales = "free_x") +
  theme_light() +
  theme(strip.background = element_rect(fill = "black"),
        strip.text = element_text(color = "white", face = "bold")) +
  ggtitle("rstanarm 2")

rstan_plt2

Notice how different these posteriors are from the first Bayesian model. In every case, the predicted mpg values are overestimating the observed mpg by cylinder (dashed line).

Wrapping Up

Today we went through how to set priors on categorical variables using rstanarm. Additionally, we talked about some of the skepticism about priors and showed what can happen when the priors you select are too overconfident. The moral of the story is two-fold:

  1. All statistics (Bayes, Frequentist, Machine Learning, etc) have some component of subjectivity as the human doing the analysis has to make decisions about what to do with their data at various points.
  2. Don’t be overconfident with your priors. Minimally informative priors may be useful, allowing us to assert some level of knowledge about the outcome while letting that knowledge be influenced/updated by what we’ve just observed.

The full code is accessible on my GITHUB page.

If you notice any errors, please reach out!

 

Making predictions from a mixed model using R

Models are built for different reasons. Some models are built to make inferences and explore an underlying phenomenon while other models are built for making predictions and forecasting future outcomes. Often, in the applied sport science setting, the latter is a key goal as we are interested in future predictions that can assist practitioners in making decisions about training or treatment interventions.

I’ve previously discussed building and interpreting mixed models using R (see HERE and HERE). Therefore, today, I wanted to talk about making predictions with mixed models. We will do this in R using the mixed model package {lme4} and walk through how to use the predict() function and other helper functions to extract confidence intervals and prediction intervals. First, I’ll briefly cover the predict() function using a simple linear regression so that we have a general understanding of how it works and what it does before we get into the mixed models (as a side note, I’ve previously talked about making predictions using Bayesian models here).

Data

As with my previous article on mixed models, I’ll be using the sleepstudy data set, which is freely available in the {lme4} package. It’s a convenient data set to use for this project because it has repeated measurements of reaction time during increasing days of sleep deprivation for multiple subjects. Additionally, the role that sleep plays in sport performance is an important topic that is frequently discussed on social media. Therefore, practitioners might be handling similar wearable tracking data for the groups of athletes they work with.

## Load Packages & Data
library(tidyverse)

dat <- lme4::sleepstudy

Simple Linear Regression

For the purposes of building the model from the ground up, we will create a simple linear regression that acts as if we don’t have repeated measurements on individuals. This model will regress the dependent variable, Reaction, on the independent variable, Days.

Below we see a plot of the data as well as the output of the regression model.

# plot of the data
dat %>%
  ggplot(aes(x = Days, y = Reaction)) +
  geom_point(size = 3,
              shape = 21,
              fill = "grey",
              color = "black") +
  geom_smooth(method = "lm",
              color = "red",
              size = 1.2)

# regression
fit_ols <- lm(Reaction ~ Days, data = dat)
summary(fit_ols)

Visually we can see a linear relationship indicating that for every additional day of sleep deprivation there is a corresponding increase in Reaction time (i.e., people react more slowly and become more impaired). The model output tells us that this increase is on the magnitude of roughly 10.5 milliseconds (the beta coefficient for the Days variable). Thus, for every one day increase in sleep deprivation we see a 10.5 millisecond increase in Reaction time, on average (for the population that was tested).

Making a prediction with the linear model

What if we were interested in predicting the expected reaction time following 5 days of sleep deprivation? We could do this with the predict() function and we find that we expect the reaction time to be about 304 milliseconds, on average.
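The call itself isn’t embedded above; it would look something like this:

predict(fit_ols, newdata = data.frame(Days = 5))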

This is the same as if we used the coefficients from the above model and multiplied the Days coefficient (10.5) by 5 days:

251.4 + 10.5 * 5

While this may be the average value of reaction time after 5 days of sleep deprivation we can clearly see from our plot above that there is a considerable amount of dispersion around the regression line. If we are interested in understanding the dispersion around this average estimate at the population level, we might use a confidence interval. Alternatively, if we’d like to understand this dispersion at an individual participant level, we might instead use a prediction interval. I’ve discussed the differences between these two in a prior blog article (see here).

Within the predict function we can pass the interval argument to indicate whether we want confidence or prediction intervals. Additionally, the level argument specifies what level of interval we are interested (e.g., 90%, 95%, 99%, etc).

Below we see the output for a 90% Confidence Interval and a 90% Prediction Interval. Notice that the average value (fit) is the same for both. However, the 90% prediction interval is much wider than the 90% confidence interval. For all the details about why this is and the gory calculations behind them, see my previous blog post.
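The calls that produce those intervals aren’t embedded above; they would look something like this:

## 90% Confidence Interval
predict(fit_ols, newdata = data.frame(Days = 5), interval = "confidence", level = 0.90)

## 90% Prediction Interval
predict(fit_ols, newdata = data.frame(Days = 5), interval = "prediction", level = 0.90)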


Finally, if we use the predict() function and set the argument se.fit to TRUE, we obtain the fitted value and 90% confidence interval (as seen above) along with the standard error and the model degrees of freedom.
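Again, the call isn’t embedded above; a sketch of what it would look like:

predict(fit_ols,
        newdata = data.frame(Days = 5),
        interval = "confidence",
        level = 0.90,
        se.fit = TRUE)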

The results are returned in a named list, thus we can extract the standard error and calculate the 90% confidence interval directly. We determine the upper and lower 90% confidence limits by multiplying the t-critical value by the standard error. The t-critical value is determined using a t-distribution with 178 degrees of freedom (180 observations minus 2, since our model has 2 parameters: an intercept and a slope). We then add/subtract the confidence limits from our average predicted value.

# Build the 90% confidence interval with the standard error
predict(fit_ols, newdata = data.frame(Days = 5)) +
  qt(p = c(0.05, 0.95), df = nrow(dat) - 2) *
  predict(fit_ols,
          newdata = data.frame(Days = 5),
          interval = 'confidence',
          level = 0.90,
          se.fit = TRUE)$se.fit


We find that we obtain the exact same results. Fortunately, R makes it relatively easy to build a model and make predictions, so we don’t need to go through all those steps every time. I was mainly just providing them in the instance that you might need to extract values specifically from the model predictions.

Now that we have the basics down, let’s extend this to a mixed model and see what happens when we account for repeated measurements on each participant and what that does to our model predictions.

Mixed Model

To begin, we can plot the data for each individual to show that there are a variety of individual responses to days of sleep deprivation. Then, we construct a mixed model (the same one built in my previous article) where we extend the linear regression above to incorporate random intercepts for each individual. Basically, what this means is that we retain the same linear relationship from the model above; however, the random effect now recognizes that there are several observations for each participant, meaning that those observations have some correlation to each other (within individual). So, we allow the intercept of the linear regression model to vary by participant based on the information we can glean from their data.
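The model fit itself isn’t embedded in this excerpt, so here is a minimal sketch of the random intercept model referenced throughout the rest of this section (the object name fit_lme matches the prediction code below):

library(lme4)

fit_lme <- lmer(Reaction ~ Days + (1 | Subject), data = dat)
summary(fit_lme)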

Similar to the linear regression model, for every 1 day increase in sleep deprivation we see, on average, a 10.5 millisecond increase in reaction time. However, notice the standard error around the Days coefficient! In the linear regression model it was 1.24 and we have now shrunk it to 0.804. This is due to accounting for the repeated measures of the data. We can see in the Random Effects portion of the model output how much of the variance is contained between subjects, around the (Intercept), and how much of the variance is left unexplained (Residual).

We can look explicitly at how each individual behaves within the model with the ranef() function, which gives us the difference between the individual and the model intercept (251.4). Alternatively, we can use the coef() function to indicate the linear equation for each participant. This is the model intercept plus the random effect, which we found with the ranef() function. Also notice that since this is not a random slope model, the Days coefficient is the same for each participant.
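A quick sketch of those two calls:

## difference between each subject and the model intercept
ranef(fit_lme)

## per-subject equations (intercept + random effect; the Days slope is the same for everyone)
coef(fit_lme)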

Making Predictions with a Mixed Model

Making predictions with the mixed model can be a little tricky. We don’t only have the linear regression component of the model (often referred to as the fixed effects), but we also have a component that references specific individuals. Additionally, it’s not as easy as just extracting confidence or prediction intervals, as we did before.

Making point estimate predictions

The easiest thing to do is to simply disregard the random effects in the model by passing the predict() function the argument re.form = NA. Doing so indicates that at five days of sleep deprivation we expect, on average, a Reaction time of about 304 milliseconds (similar to the linear model above).
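A sketch of that call:

predict(fit_lme, newdata = data.frame(Days = 5), re.form = NA)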


If we are making a prediction for one of the specific participants in our data, for example Subject 308, we can include this in the predict() function by specifying that we want to use the random effects that were previously set. We do this with the re.form argument and pass it the exact random effect that we specified when we built the model.
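A sketch of that call (Subject is stored as a factor in the sleepstudy data, so the id is passed as a character string here):

predict(fit_lme,
        newdata = data.frame(Days = 5, Subject = "308"),
        re.form = ~(1 | Subject))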


Now we see the point estimate prediction is different than the one above it because the prediction contains some information that is known about Subject 308.

What if we have a new participant that the model has no information on?

Using the allow.new.levels argument we can tell the model that we want to allow new random effect levels. In doing so, since the model has no information about the new participant, it will simply return the fixed effects point estimate, just as we got when we specified no random effects in the prediction. This is because the model’s best guess for a new participant is the population average for 5 days of sleep deprivation.
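A sketch of that call, using a made-up subject id that isn’t in the data:

predict(fit_lme,
        newdata = data.frame(Days = 5, Subject = "new_participant"),
        re.form = ~(1 | Subject),
        allow.new.levels = TRUE)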


Finally, we can make point estimate predictions using only the fixed effects, and then using the random effects for every participant in our data, and plot the results.

# Make predictions using fixed effect only and then random effects and plot the results
dat %>%
  mutate(pred_fixef = predict(fit_lme, newdata = ., re.form = NA),
         pred_ranef = predict(fit_lme, newdata = ., re.form = ~(1|Subject))) %>%
  ggplot(aes(x = Days, y = Reaction)) +
  geom_point(shape = 21,
             size = 2,
             color = "black",
             fill = "grey") +
  geom_line(aes(y = pred_fixef),
            color = "blue",
            size = 1.2) +
  geom_line(aes(y = pred_ranef),
            color = "green",
            size = 1.2) +
  facet_wrap(~Subject) +
  scale_x_continuous(labels = seq(from = 0, to = 9, by = 3),
                     breaks = seq(from = 0, to = 9, by = 3)) +
  labs(x = "Days of Sleep Deprivation",
       y = "Average Reaction Time",
       title = "Reaction Time ~ Days of Sleep Deprivation",
       subtitle = "Blue = Fixed Effects Prediction | Green = Random Effects Prediction")


We see our fixed effects predictions in blue (which are the same for every participant) and our predictions incorporating random effects in green, which move up and down based on the participant. Also notice that the slope is exactly the same for each participant and each prediction. This is because we only specified a random intercept for this model, which is why the green line moves up and down relative to the fixed effects blue line, given that some participants are higher or lower than the population.

Confidence Intervals

To obtain confidence intervals, we can’t simply specify the interval argument in the predict() function as we did with linear regression. Instead, we need to bootstrap the predictions using the bootMer() function. (NOTE: If you want to always be able to replicate your confidence interval results be sure to set.seed() prior to running the function).

boot_ci <- bootMer(fit_lme,
                   nsim = 100,
                   FUN = function(x) { predict(x, newdata = data.frame(Days = 5), re.form = NA) })

boot_ci

The boot_ci object that we created has a number of different components contained within it. The t component contains the 100 bootstrapped estimates that we created.

With these bootstrapped estimates we can do a number of things, such as plotting a histogram, extracting quantiles and quantile intervals, calculating their standard deviation, and using the mean and standard deviation to build a 90% confidence interval around the point estimate prediction (a sketch of these summaries follows below).
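A minimal sketch of those summaries, working with the t element of the boot_ci object:

## pull the bootstrapped estimates into a vector
boot_preds <- as.vector(boot_ci$t)

## histogram of the bootstrapped predictions
hist(boot_preds)

## quantiles / 90% quantile interval
quantile(boot_preds, probs = c(0.05, 0.5, 0.95))

## normal-approximation 90% confidence interval
mean(boot_preds) + qnorm(p = c(0.05, 0.95)) * sd(boot_preds)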

Prediction Intervals

We can create prediction intervals around our point estimate predictions using the merTools package and the predictInterval() function.

With the predictInterval() function we need to specify a Subject, since that is what the original mixed model also had in its equation (as a random effect). Here I’ll specify a new subject that the model has never seen. The model will return prediction intervals for the point estimate prediction using only fixed effects given that it doesn’t have data on this subject (it will also let me know this in the warning).
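A sketch of that call (loading {merTools} here; the made-up subject id is just for illustration):

library(merTools)

predictInterval(fit_lme,
                newdata = data.frame(Days = 5, Subject = "new_participant"),
                n.sims = 100,
                level = 0.90,
                seed = 123)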


Of course, if we are making a prediction on a subject that the model is familiar with, we can use that information. Here I make a prediction on Subject 308 and we see that the results differ from above.
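And the corresponding sketch for Subject 308:

predictInterval(fit_lme,
                newdata = data.frame(Days = 5, Subject = "308"),
                n.sims = 100,
                level = 0.90,
                seed = 123)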


Finally, just for fun and to show we can do this at scale, we will make a prediction on every single participant in our data across all of the days of sleep deprivation. We will bind those predictions and their corresponding 90% prediction intervals to the original data and then plot the results.

pred_ints <- predictInterval(fit_lme, 
                newdata = dat, 
                n.sims = 100,
                returnSims = TRUE, 
                seed = 123, 
                level = 0.90)

dat_new <- cbind(dat, pred_ints)

dat_new %>%
  head()

# plot predictions
dat_new %>%
  ggplot(aes(x = Days, y = Reaction)) +
  geom_point(shape = 21,
             size = 2,
             color = "black",
             fill = "grey") +
  geom_ribbon(aes(ymin = lwr, ymax = upr),
            fill = 'light grey',
            alpha = 0.4) +
  geom_line(aes(y = fit),
            color = 'red',
            size = 1.2) +
  facet_wrap(~Subject) +
  scale_x_continuous(labels = seq(from = 0, to = 9, by = 3),
                     breaks = seq(from = 0, to = 9, by = 3)) +
  labs(x = "Days of Sleep Deprivation",
       y = "Average Reaction Time",
       title = "Reaction Time ~ Days of Sleep Deprivation")

Wrapping Up

Mixed models are incredibly valuable analysis tools. In sport medicine and sport science, they often help practitioners make forecasts about how an athlete is progressing or understand how an athlete is performing relative to what might be expected. Hopefully this article was useful in walking through how to make point estimate predictions and obtain confidence and prediction intervals from mixed models. If you notice any errors, feel free to reach out!

As always, all of the code is available on my GITHUB page.

The Role of Skill vs Luck in Team Sport Winning

This week on the Wharton Moneyball Podcast, the hosts were discussing World Cup results following the group play stage.

At one point, they talked about the variance in performance and the role that luck can play in winning or losing. They felt they didn’t have a good intuition about how variable the games were, and one of the hosts, Eric Bradlow, said that he’d try to look into it and have an answer for next week’s episode.

The discussion reminded me of a chapter in one of my favorite books, The Success Equation by Michael Mauboussin. The book goes into nice detail about the role of skill and luck in sports, investing, and life. On page 78, Mauboussin provides the following equation:

Skill = Observed Outcome – Luck

From there, he explains the steps for determining the contribution of luck to winning in a variety of team sports. Basically, the contribution of luck is represented as the ratio of the variance due to luck to the variance of the observed outcomes.

To calculate the variance of observed outcomes, we find the win% of each team in a given season and calculate the standard deviation of that win%. Squaring this value gives us the variance of observed outcomes.

When calculating the variance of luck, we first find the average win% across all of the teams and, treating this as a binomial, calculate the standard deviation as sqrt((win% * (1 – win%)) / n_games), where n_games is the number of games in a season. Squaring this value gives us the variance of luck.
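As a quick sketch of these two calculations in R, here is what they might look like for a hypothetical league (the 0.500 average win%, 0.12 observed standard deviation, and 82-game season are made-up values for illustration only):

## hypothetical league: 82-game season, average win% of 0.500,
## observed standard deviation of team win% of 0.12
avg_win_pct <- 0.500
obs_sd <- 0.12
n_games <- 82

## variance of observed outcomes
var_obs <- obs_sd^2

## variance of luck (binomial standard deviation, squared)
luck_sd <- sqrt((avg_win_pct * (1 - avg_win_pct)) / n_games)
var_luck <- luck_sd^2

## contribution of luck to winning
var_luck / var_obs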

I scraped data for the previous 3 complete seasons for the Premier League, NHL, NBA, NFL, and MLB from sports-reference.com. All of the data and code for calculating the contribution of luck to winning for these sports is available on my GitHub page.

Let’s walk through an example

Since the guys on Wharton Moneyball Podcast were talking about Soccer, I’ll use the Premier League as an example.

First, we create a function that does all the calculations for us. All we need to do is obtain the relevant summary statistics to feed into the function and then let it do all the work.

league_perf <- function(avg_win_pct, obs_sd, luck_sd, league){
  
  ## convert standard deviations to variances
  var_obs <- obs_sd^2
  var_luck <- luck_sd^2
  
  ## variance due to skill (Skill = Observed Outcome - Luck)
  var_skill <- var_obs - var_luck
  
  ## contribution of luck to winning
  luck_contribution <- var_luck / var_obs
  
  ## results table
  output <- tibble(
    league = {{league}},
    avg_win_pct = avg_win_pct,
    luck_contribution = luck_contribution
  )
  
  return(output)
  
}

 

Using the Premier League data set we calculate the games per season, league average win%, the observed standard deviation, and the luck standard deviation.

NOTE: I’m not much of a soccer guy so I wasn’t sure how to handle draws. I read somewhere that they equate to about 1/3 of a win, so that is what I use here to credit a team for a draw.

## get info for function
# NOTE: Treating Draws as 1/3 of a win, since that reflects the points the team is awarded in the standings
premier_lg %>%
  select(Squad, MP, W, D) %>%
  mutate(win_pct = (W + 1/3*D) / MP) %>%
  summarize(games = max(MP),
            avg_win_pct = mean(win_pct),
            obs_sd = sd(win_pct),
            luck_sd = sqrt((avg_win_pct * (1 - avg_win_pct)) / games))

## Run the function
premier_lg_output <- league_perf(avg_win_pct = 0.462,
                                 obs_sd = 0.155,
                                 luck_sd = 0.0809,
                                 league = "Premier League")

premier_lg_output

Looking at our summary table, it appears that teams have had an average win% over the past 3 seasons of 46.2% (not quite 50%, as we will see in NBA, MLB, or NFL, since draws happen so frequently). The standard deviation of team win% was 15.5% (this is the observed standard deviation), while the luck standard deviation, 8.1%, is the binomial standard deviation using the average win percentage and 38 games in a season.

Feeding these values into the function created above, we find that luck contributes approximately 27.2% to winning in the Premier League. In Mauboussin’s book, he found that the contribution of luck to winning in the Premier League was 31% from 2007 – 2011. The value here isn’t too far off of that, though I don’t have a good sense for what magnitude of change would be surprising. That said, the difference could be due to improved skill in the Premier League (already one of the most skillful leagues in the world) or perhaps to how he handled draws, which was not discussed in the book and may differ from the approach I took here.
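As a quick sanity check on that 27.2% figure, the ratio can be computed directly from the two standard deviations reported above:

## contribution of luck = variance of luck / variance of observed outcomes
0.0809^2 / 0.155^2
# ~0.272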

Let’s look at the table of results for all sports

Again, you can get the code for calculating the contribution of luck to winning in these sports from my GitHub page, so no need to rehash it all here. Instead, let’s go directly to the results.

  • The NBA appears to be the most skill-driven league, with only about 15% of the contribution to winning coming from luck. Mauboussin found this value to be 12% from 2007 – 2011.
  • The NFL appears to be the league most driven by luck, with luck contributing 39% to winning. This is identical to what Mauboussin observed using NFL data from 2007 – 2011.
  • The two most surprising outcomes here are the MLB and NHL. Mauboussin found the MLB to have a 34% contribution of luck (2007 – 2011) and the NHL a 53% contribution of luck (2008 – 2012). However, in my table below it would appear that the contribution of luck has decreased substantially in these two sports using data from the previous 3 seasons (NOTE: throwing out the COVID year doesn’t alter this drastically enough to make it close to what Mauboussin showed).

Digging Deeper into MLB and NHL

I pulled data from the exact same seasons that Mauboussin used in his book and obtained the following results.

  • The results I observed for the MLB are identical to what Mauboussin had in his book (pg. 80).
  • The results for the NHL are slightly off (47.7% compared to 53%), but this might have to do with how I am handling overtime wins and losses (which award teams points in the standings), as I don’t know enough about hockey to determine what value to assign to them. Perhaps Mauboussin addressed these two outcomes in his calculations in a more specific way.

Additional hypotheses:

  • Maybe skill has improved in hockey over the past decade and a half?
  • Maybe a change in tactics (e.g., the shift) or strategy (e.g., hitters trying to hit more home runs instead of just making contact or pitchers trying to explicitly train to increase max velocity) has taken some of the luck events out of baseball and turned it into more of a zero-sum duel between the batter and pitcher, making game outcomes more dependent on the skill of both players?
  • Maybe I have an error somewhere…let me know if you spot one!

Wrapping Up

Although we marvel at the skill of the athletes we watch in their arena of competition, it’s important to recognize that luck plays a role in the outcomes, and for some sports it plays more of a role than others. But luck is also what keeps us watching and makes things fun! Sometimes the ball bounces your way and you can steal a win! The one thing I always appreciate from the discussions on the Wharton Moneyball Podcast is that the hosts don’t shy away from explaining things like luck, randomness, variance, regression to the mean, and weighting observed outcomes with prior knowledge. This way of thinking isn’t just about sport; it’s about life. In this sense, we may consider sport to be the vehicle through which these hosts are teaching us about our world.