Author Archives: Patrick

Crossed vs Nested Random Effects

I was reading a paper this weekend and the researchers used a mixed model to for their data given that they were dealing with repeated measures of individuals and teams. Mixed models have become common in sport science research and I’ve written about them a few times in this blog:

As I was reading the paper I was wondering, because it wasn’t specified in the methods section, if the researchers had given any thought to whether their random effects were crossed or nested. Then I thought, maybe people aren’t thinking about this and perhaps a short blog article with an example would be useful. So here we are…

For brevity, I’ll only present short snippets of code in the blog article. The entire code is available on my Github Page.

Data

We need to simulate some data to work with. To keep things simple, we will create a data set of a league that has 6 teams and 60 players. Each player will have a player value metric, the outcome of interest, for every team they played on during the season.

suppressPackageStartupMessages({
  suppressMessages({
    suppressWarnings({
    library(tidyverse)
    library(lme4)
    library(arm)
    })
  })
})


theme_set(theme_bw())

set.seed(225544)
teams <- sample(c("Sharks", "Bears", "Turtles", "Bisons", "Jaguars", "Pythons"), size = 60, replace = TRUE)
players <- as.factor(sample(1:60, size = 60, replace = TRUE))
intercept <- 50
team_effect <- case_when(teams == "Sharks" ~ rnorm(1, mean = 0, sd = 7), 
                         teams == "Bears" ~ rnorm(1, mean = 0, sd = 2),
                         teams == "Turtles" ~ rnorm(1, mean = 0, sd = 1),
                         teams == "Bisons" ~ rnorm(1, mean = 0, sd = 13),
                         teams == "Jaguars" ~ rnorm(1, mean = 0, sd = 5),
                         teams == "Pythons" ~ rnorm(1, mean = 0, sd = 3))
player_effect <- rnorm(length(levels(players)), mean = 0, sd = 10)
random_noise <- 2



dat <- data.frame( team = teams, player_id = players ) %>%
  mutate(
    mu = intercept + team_effect + player_effect[players],
    player_value = round(mu + rnorm(n(), mean = 0, sd = random_noise), 1)
  ) %>%
  dplyr::select(-mu)

dat %>%
  head()

EDA

How many unique players were on each team during the season?

Which players played on multiple teams?

Team Summary Stats

Crossed vs Nested Effects

We have 14 players that played on multiple teams (Player 11 ended up playing for all 6 teams!) and, as a consequence, different teams had different numbers of individual players. For example, the Bears had 13 different players while the Pythons had 6 different players. This small detail is important to consider with respect to our random effects because crossed and nested effects behave differently with respect to how they partition variance.

When constructing mixed models we usually have a data hierarchy where observations are grouped or nested within specific categories. When the observations are nested the observations are exclusive to specific groups. Some examples:

Classes are nested within schools but classes in one school cannot also be grouped within a different school.
Athletes competing in the Olympics are nested within their respective countries but an individual cannot compete for two different countries in the same Olympic Games.

By contrast, crossed effects can occur when observations move between groups during the same study/observation period. A few examples of crossed effects are:

A student might start the year at one school and then move half way through the semester and attend a different school.
Players in team sport might be traded mid-season and play for different teams.
In a study where the experimental and placebo groups contain both male and female participants, we would have crossed random effects to account for sex being in both groups.
Looking at sports teams, we may be analyzing the behavior of position groups on different teams. The position groups are the same for each team (e.g., Forwards, Guards, and Centers on Basketball teams) so the model has crossed effects.

To help solidify this concept a bit more, I’ve created the below drawing.

We see that in the Completely Nested example there are individual players that only player for their respective teams. Contrast that to the Completely Crossed example, where every player ends up playing on each team. This would, obviously, be an incredibly weird scenario and not one we’d see in sport but probably one that you’d see in researched where there is a crossover of treatments. Finally, the Partially Nested & Crossed example has a few players playing for multiple teams (for example, Player 1 played for both Team 1 and Team 2) and some players that stay with a single team during the season. This version is the one that we most frequently encounter in team sport.

Now that we have a general understanding of the difference between crossed and random effects, let’s see how this looks when we build a model on our simulated data and (most importantly) let’s see how specifying the model to have crossed or nested random effects changes the model output.

Model Examples

Crossed

Since we commonly deal with crossed (or partially nested/crossed) data in team sport, I’ll start with modeling the data to have crossed random effects. As a side note, most of the time when I look at people’s models in sport science they are creating the model structure like this, by default.

We will build random intercept model (since we didn’t simulate any fixed effects) and allow a separate intercept variance for both team and player_id, keeping in mind that this specifies that there are players who are potentially playing on different teams (crossed observations) across the season.

lme_crossed <- lmer(player_value ~ 1 + (1|team) + (1|player_id), data = dat)

Nested

Next, we fit the random intercept model but this time we treat our random effects of player_id and team as nested. That is to say that we are allowing the players to be nested within their respective teams. To review, this means that we are treating this data as if each player only played on a single team during the season and never changed teams.

lme_nested <- lmer(player_value ~ 1 + (1|team) + (1|team:player_id), data = dat)

Comparing the two models

First, we notice that the fixed effects are slightly different between the two models, but the difference in small in this simulated data.

The random effects look very different between the two model structures. In the crossed effects model we get two different random effects — one for the player_id and one for the team. However, in the nested effects model we have a random effect for team but then we have a random effect for player_id nested within each team. So, in the nested model a player that played for multiple teams will have different random intercept for each of the teams they were nested in. Both models, of course, also have a residual in the random effects, however the values are very different.

Let’s compile the random effects side-by-side to get a better appreciation for the differences. We will start with the `team` random effect since it is present in both models.

There are pretty stark differences between the random intercepts of the nested and crossed models! Recall from our EDA that the Bears had the highest average value of all the teams and they also had the fewest number of different players (5).

The player random effects get a little wonky because players who played for multiple teams will end up having different random effects depending on which team we nested them in. Conversely, in the crossed model we get one random effect estimate per player.

To join the random effects from both models we have to do a bit of work on the nested model to extract the info. (NOTE: Since the table is really long I’ll only show the first 10 rows here but you can got to the Github page to see the full output.)

Joining the two model random effects on player_id we see that player’s have some very different random effects between the two models. In particular, players that played on different teams can differ greatly between the models.

For example, let’s look at Player 21.

In the crossed model we find that Player 21 has a random intercept that is 1.6 points below league average (the fixed effect intercept). However, in the nested model, when Player 21 was on the Bears he was 1.15 points above the population intercept and 0.74 points below the population intercept when he was on the Turtles. This is partially due to the team effect, where the the Bears were the best team in the league (highest average value) and the Turtles were the worst team in the league (lowest average value).

Wrapping Up

Hopefully this tutorial provided a brief, but useful, overview of crossed versus random effects during mixed model construction. While most people seem to gravitate towards crossed effects by default there may be times where nested effects are indeed required for the structure of the data. As the example shows, how we specify the model can dramatically change the outputs and thus the inference that we can derive from these outputs.

To access the full code please see the Github page.

K-Nearest Neighbor: {tidymodels} tutorial

When working on data science or research teams, it often helps to have a workflow that makes it easy for teammates to review your work and add additional components to the data cleaning or model structure. Additionally, it ensures that the steps in the process are clear so that debugging is easy.

In R, the {tidymodels} package offers one such workflow and in python, folks seem to prefer scikit learn. This week, I’m going to walk through a full workflow in tidymodels using the K-nearest neighbor algorithm. Some of the things I’ll cover include:

Splitting the data in test/train sets and cross validation folds
Setting up the model structure (what tidymodels refers to as a recipe)
Creating preprocessing steps
Compiling the preprocessing and model structure together into a single workflow
Tuning the KNN model
Identifying the best tuned model
Evaluating the model on the test set
Saving the entire worflow and model to be used later on with new data
Making predictions with the saved workflow and model on new data

I’ve provided a number of tidymodels tutorials on this blog so feel free to search the blog for those. Additionally, all of my tidymodels tutorials and templates are available in a GitHub Repo. Alternatively, if Python is more you jam, I did do a previous blog comparing the workflow in tidymodels to Scikit Learn, which can be found HERE.

This entire tutorial and data are available on my GitHub page if you’d like to code along.

Load & Clean Data

For this tutorial I’m going to use the 2021 FIFA Soccer Ratings, which are available on Kaggle. There are several years of ratings there, but we will concentrate on 2021 with the goal being to estimate a player’s contract value based on the various FIFA ratings provided and the position they play.

First, I load the tidyverse and tidymodels libraries and read in the data. I’m going to drop goalies so that we only focus on players who play the field and, since the data has a large number of columns, I’m going to get only the columns we care about: Information about the player (playerID, name, height, weight, club, league, position, and contract value) and then the various FIFA ratings.

The only column with missing data that impacts our analysis is the team position column, as this is a feature in the data set. Additionally, there are 207 players with a contract value of 0 (all 195 players missing a team position have a 0 contract value). So, perhaps these are players who weren’t on a club at the time the ratings were assigned.

I’m going to remove these players from our analysis data set, but I’m going to place them in their own data set because we will use them later to make estimates of their contract value.

Visualizing the Data

Let’s take a look at a visual of the contract value for all of the players in our data set. Since this is severely right skewed, we will plot it on the log scale.

We also notice there are a variety of positions (way more than I would have guessed in soccer!).

Finally, we want to explore how playing position might influence contract value.

tidymodels set up

Now that we have our data organized we want to set up a tidymodels workflow to estimate the contract value of every player using a K-nearest neighbor model.

First, we split the data in train and test splits and then further split the training data into 5 cross validation folds so that we can tune the model to find the best number of K-neighbors.

Next, we specify that we want to use a KNN model and the outcome variable as regression, since it is continuous (as opposed to classification).

Now that the model is specified we need to set up our preprocessing steps. This is one thing that tidymodels is super useful for. Normally, we’d need to do all this preprocessing before fitting the model and, as in the case of KNN and most machine learning models, we’d need to standardize the variables in the training set and then store the means and standard deviations of those variables so that we can use them to standardize the test set or future data. In tidymodels, we set up our preprocessing workflow and store it and it contains all that information for us! We can simply load it and use it for any downstream analysis we want to do. There are three preprocessing steps we want to add to our workflow:

log transform the dependent variable (contract value)
Dummy code the positions
Standardize (z-score) all off the numeric variable in the model (the ratings)

We can view the preprocessing steps using the prep() function.

With the model specified and the preprocessing recipe set up we are ready to compile everything into a single workflow.

We can look at the workflow to make sure everything makes sense before we start fitting the model.

Next, we tune the model using the cross validated folds that we set up on our training data set. We are trying to find the optimal number of k-neighbors that minimizes the RMSE of the outcome variable (the log of contract value).

We see the results are stored in a list for every one of our 5 cross validation splits. We can view the model metrics for each of the folds.

The model with 10 neighbors appears to have the lowest RMSE. So, we want to pull that value directly from this table so that we can store it and use it later when we go to fit the final model.

Fitting the Final Model Version 1: Using finalize_workflow()

Version 1 that I’ll show for fitting the final model uses the finalize_workflow() function. I like this approach if I’m building my model in a local session and then want to see how well it performs on the test session (which is what this function does). This isn’t the approach I take when I need to save everything for downstream use (which we will cover in Version 2).

First, fit the best model using the finalize_workflow() function.

Then, we get the model predictions for this final workflow on the the test data set.

We can plot the predicted contract value compared to the actual contract value and calculate the test set RMSE.

Fitting the Final Model Version 2: Re-sepcify the model on the training set, save it, and use it later

In this second version of fitting the final model, we will take the best neighbors from the tuning phase, re-specify the model to the training set, reset the workflow, and then save these components so that we can use them later.

First, re-specify the model and notice I set the neighbors argument to best_neighbors, which we saved above when we tuned the model.

We then use this finalized/re-specified model to reset the workflow. Notice it is added to the add_model() function however the recipe, with our preprocessing steps, does not change.

With the workflow set up we now need to do the final two sets:

Fit the model to the entire data set and extract the recipe using extract_recipe() since we will need to save this to preprocess any new data before making predictions.
Fit the model to the final data and extract the model itself, using extract_fit_parsnip() so we can use the model to make future predictions.

Now we save the recipe and model as .rda files. We can then load these two structures and use them on new data!

We have 207 players with 0 contract value, of which 195 had no team position. So, that leaves us with 12 players that we can estimate contract value for with our saved model.

First, we use the bake() function to apply the saved recipe to the new data so that we can preprocess everything appropriately.

With the new data preprocessed we can now predict the log of contract value with our saved model.

We can add these predictions back to the new data set, exponentiate the predicted contract value to get it back to the normal scale and then create a visual of our estimates for the 12 players.

Wrapping Up

That’s a quick walk through on how to set up a complete tdymodels workflow. This approach would work for any type of model you want to build, not just KNN! As always, the code and data are available on my GitHub page.

TidyX Episode 186: {camcorder}

This week, Ellis Hughes and I go through a new project that Ellis has worked on called {camcorder}. Camcorder is an R package that allows for a very simple way to take your {ggplot2} visualizations and animate them. This was also the topic of Ellis’ recent Posit Conference Seattle talk.

To watch the screen cast, CLICK HERE.

To access our code, CLICK HERE.

TidyX Episode 184: Intro to {purrr}

In this episode, Ellis Hughes and I follow up our work from Episode 183, where we introduced using {purrr} to do within group statistical modeling. Here, we go over the basics of {purrr} and show how valuable this package can be in your data analysis work.

To watch the screen cast, CLICK HERE.

To access the code, CLICK HERE.

Learning Bayesian Statistics Podcast

I recently had the please of speaking with Alex Andorra on his amazing Learning Bayesian Statistics Podcast. This was an amazing experience given the quality of people he has interviewed in the past and the fact that this is one of my favorite podcasts. We talked a lot about Bayesian statistics in sport. Hopefully others find it interesting.

Listen here << Learning Bayesian Statistics Podcast >>

Patrick Ward, PhD

Patrick Ward, PhD | Sports Science & Analytics