Author Archives: Patrick

Bayesian Modeling with STAN Part 3 – Handling Categorical Predictors

We’ve been discussing how to code STAN models and Part 1 and Part 2 served as a basic intro for simple linear regression, multiple linear regression, and model checking. Today, we cover how to code categorical variables into your models. STAN doesn’t handle categorical variables so, it’s on the user to properly index these variables as integers in the model. This can take a bit of practice and can get confusing. In this blog article I offer a few different approaches to accomplish the same task.

Click here to read the RMarkdown Blog Article >> STAN-Part-3—Handling-Categorical-Predictors

Click here to access the code >> STAN Part 3 – Handling Categorical Predictors

Bayesian Modeling with STAN Part 2 – Multiple Linear Regression & Model Checking

In Part 1 of this series I went over some basics of STAN coding and we built our first linear regression. In Part 2, we will extend the simple linear model to a multiple linear regression and discuss some model checking methods for convergence and diagnosis of the Markov Chain.

Click here to read the RMarkdown Blog Article >> STAN Part-2 – Multiple Linear Regression & Model Checking

Click here to access the code >> STAN Part 2 – Multiple Linear Regression & Model Checking

What’s the Plan Stan? Bayesian Modeling with STAN Part 1

For those interested in Bayesian models and probabilistic coding, I’m starting a new series on coding in STAN. We’ve done a lot of Bayesian coding in this blog, often with {rstanarm}, {bmrs}, and coding empirical Bayesian models by hand. We’ve also done a little bit with {PyMC3}, from the Python coding language.

STAN offers us a lot more flexibility to specify whatever model we want and full autonomy of the priors we can place on the various parameters. It is a little challenging given it is coming out of the C++ coding language. So, we will start slow and build up. This first post will simply be about how to code a STAN model and extract information from it. We will progress to building more complex models, modeling different outcome distributions, building hierarchical models, evaluating model diagnostics, and making predictions from the model.

I’m going to work with the {rstan} and {cmdstanr} packages to allow us to use R as our coding language and then call the C++ compiler when we need to fit the model. Along the way I’ll also provide some Jupyter Notebooks and show how to code stand models in Python using {cmdstanpy}, a Python analog for {cmdstanr} that allows us to run our same STAN models in Python.

In order to not repeat everything here in the blog, I’ll simply update the blog with links to my github repo for each new section so that all of the code is accessible and housed in one place.

Click here to read the RMarkdown Blog Article >> STAN Part 1 – Intro to STAN Code

Click here to access the code >> STAN Part 1 – Intro to STAN Code

 

XGBOOST Tuning – Tutorial in {xgboost} and {tidymodels}

A colleague recently asked me about XGBOOST (Extreme Gradient Boosting) models so I figured I’d put together a short tutorial of using XGBOOST both with the {xgboost} package and within the {tidymodels} environment.

Like all machine learning models, XGBOOST, has a number of different knobs and levers we can twist, push, and pull in order to tune the model and identify the best parameters for learning from the data. One can get a look at these parameters by loading the {xgboost} package and then typing ?xgboost into the console.

XGBOOST falls in the class of tree-based models. It is an effective machine learning algorithm for identifying patterns in data and it can be used for regression or classification problems.

When fitting the model in {tidymodels} it is pretty straight forward as far as the {tidymodels} syntax goes. However, if fitting models with the {xgboost} package, there are a number of considerations we need to make with the data. For example. XGBOOST can only use numeric data, so any categorical features will need to be one-hot-encoded. Additionally, the data cannot be in a data frame form, but rather needs to be turned into a matrix.

Let’s jump in!

(NOTE: All code is available on my Github page.)

Load Data & Packages

We will keep the data simple and use the mtcars data from base R. XGBOOST is well suited for complex data with a lot of interactions and non-linearity, but I’d like to use a simple data set so that the models can run fast for you and you get the gist of fitting them. Tuning all of the parameters can take a lot of time if you are dealing with a big data set.

Our goal is to predict mpg using the other features in the data.

{xgboost} for a regression problem

First we split our data into a training and testing data set.

set.seed(2025)
N <- nrow(mtcars)
ids <- sample(x = 1:N, size = floor(N * 0.7), replace = FALSE)
train <- mtcars[ids, ]
test <- mtcars[-ids, ]

Prepare the data

For the {xgboost} package to run we need the predictor variables to be in a data matrix and the outcome variables to be a vector. We then place the prepared data into a dense matrix so we can run our XGBOOST model.

train_x <- data.matrix(train[, -1])
train_y <- train$mpg

test_x <- data.matrix(test[, -1])
test_y <- test$mpg

xgb_train <- xgb.DMatrix(data = train_x, label = train_y)
xgb_test <- xgb.DMatrix(data = test_x, label = test_y)

Tune the model

First, we walk through an example of of how the xgb.cv function works and what it outputs. Then we can write a loop to extract the relevant information for the purposes of the tuning the hyperparameters.

Notice that we specified a variety of hyperparameters (e.g., max.depth, eta, subsample, and colsample_bytree). These aren’t all of the possible hyperparameters that can be tuned, though. Rather than hash out them all out here, you can find a list of their definitions by either using the help function in the R console, ?xgb.cv() or checking out THIS webpage.

fit_example <- xgb.cv(data = xgb_train,
                      params = list(booster = "gbtree",
                        objective = "reg:squarederror",
                        max.depth = 15,
                        eta = 0.1,
                        subsample = 0.1,
                        colsample_bytree = 0.5),
                      eval_metric = "rmse",
                      nfold = 5,
                      watchlist = list(train = xgb_train, test = xgb_test),
                      early_stopping_rounds = TRUE,
                      nrounds = 1000)

There are a number of features within the model that we can extract:

  • best_iteration indicates the best iteration of the model fit, represented as the nrounds argument in our function.

  • evaluation_log provides us with information about the RMSE for all iterations of model fitting.

We can directly select the best iteration from the evaluation log.

Here we see the RMSE for the training and testing data at the best iteration.

Now that we know what we are looking at with the xgb.cv() output, let’s write a grid of possible values for the hyperparameters and create a data frame, grid_params that stores this information.

eta <- seq(0.1, 0.9, 0.1)
max_depth <- c(5, 10, 15)
subsample <- c(0.5, 0.75, 1)
colsample_bytree <- c(0.5, 0.75, 1)
nrounds <- seq(100, 500, 100)

grid_params <- expand.grid(nrounds = nrounds,
                           max_depth = max_depth, 
                           subsample = subsample,
                           colsample_bytree = colsample_bytree,
                           eta = eta)

head(grid_params)

n <- nrow(grid_params)

Next, we need an empty data frame to store the best iteration for each row of our grid parameters data frame.

output_df <- data.frame("iter" = rep(NA, n), 
                           "train_rmse_mean" = rep(NA, n), 
                           "train_rmse_std" = rep(NA, n), 
                           "test_rmse_mean" = rep(NA, n), 
                           "test_rmse_std" = rep(NA, n))

Now run a for() loop to find the best parameters for the data!

 

for(i in 1:n){
  
  fit <- xgb.cv(data = xgb_train,
                params = list(booster = "gbtree",
                      objective = "reg:squarederror",
                      max.depth = grid_params$max_depth[i],
                      eta = grid_params$eta[i],
                      subsample = grid_params$subsample[i],
                      colsample_bytree = grid_params$colsample_bytree[i]),
                      eval_metric = "rmse",
                      nfold = 5,
                      nrounds = grid_params$nrounds[i],
                      watchlist = list(train = xgb_train, test = xgb_test),
                      early_stopping_rounds = TRUE,
                      grid_params$nrounds[i],
                      verbose = FALSE)
  
  best_fit <- fit$best_iteration
  
  output_df[i, 1] <- fit$evaluation_log %>% as.data.frame() %>% filter(iter == best_fit) %>% pull(iter)
  output_df[i, 2] <- fit$evaluation_log %>% as.data.frame() %>% filter(iter == best_fit) %>% pull(train_rmse_mean)
  output_df[i, 3] <- fit$evaluation_log %>% as.data.frame() %>% filter(iter == best_fit) %>% pull(train_rmse_std)
  output_df[i, 4] <- fit$evaluation_log %>% as.data.frame() %>% filter(iter == best_fit) %>% pull(test_rmse_mean)
  output_df[i, 5] <- fit$evaluation_log %>% as.data.frame() %>% filter(iter == best_fit) %>% pull(test_rmse_std)

}

Join the grid_params data set with the output_df` of our for() loop and select the output that had the lowest train set RMSE.

Fit the model

After tuning, we want to fit the model to the optimized hyper-parameters stored in the best_params element.

## fit model
fit_mpg <- xgboost(data = xgb_train,
                   params = list(booster = "gbtree",
                      objective = "reg:squarederror",
                      max.depth = best_params$max_depth,
                      eta = best_params$eta,
                      subsample = best_params$subsample,
                      colsample_bytree = best_params$colsample_bytree),
                      eval_metric = "rmse",
                      nrounds = best_params$nrounds)

Variables of Importance

Plot the variables of importance from the tuned model.

Predictions

We can plot the predictions and obtain the RMSE on our test set.

train_pred_y <- predict(fit_mpg, xgb_train)
test_pred_y <- predict(fit_mpg, xgb_test) test_x %>%
  bind_cols(mpg = test_y) %>%
  mutate(pred_mpg = test_pred_y) %>%
  ggplot(aes(x = pred_mpg, y = mpg)) +
  geom_point() +
  geom_abline()

sqrt(mean((test_y - test_pred_y)^2))

{tidymodels} XGBOOST for regression

We can do the same type of XGBOOST model fitting in {tidymodels}.

Train/Test Split

set.seed(3333)
car_split <- initial_split(mtcars)
car_split

train <- training(car_split)
test <- testing(car_split)

Cross Validation Folds

set.seed(34)
cv_folds <- vfold_cv(
  data = train, 
  v = 5
  ) 

Model Specification

In the boost_tree() function we specify which hyperparameters we want to tune.

boost_spec <- boost_tree( trees = tune(), 
      tree_depth = tune(), 
      min_n = tune(), 
      loss_reduction = tune(), 
      sample_size = tune(), 
      mtry = tune(), 
      learn_rate = tune() ) %>%
  set_engine("xgboost") %>%
  set_mode("regression")

Workflow

Create the workflow, add the formula we want for the model, and add the model specifications.

xgb_wf <- workflow() %>%
  add_formula(mpg ~ .) %>%
  add_model(boost_spec)

Set up tuning grid

Instead of specifying specific values for each of the tuning parameters, as we did in the {xgboost} package example, we will use the grid_latin_hypercube() function to construct the parameter grid for us. We pass it a size argument to indicate how large (how many rows) we want the grid to be.

 

xgb_grid <- grid_latin_hypercube(
  tree_depth(),
  trees(),
  min_n(),
  loss_reduction(),
  sample_size = sample_prop(),
  finalize(mtry(), train),
  learn_rate(),
  size = 40
)

Hyper parameter tuning on cross validated folds

doParallel::registerDoParallel()

set.seed(66)
xgb_res <- tune_grid(
  xgb_wf,
  resamples = cv_folds,
  grid = xgb_grid,
  control = control_grid(save_pred = TRUE)
)

Obtaining cross validated outputs and identify the best model


Fit the final model

boost_spec <- boost_tree(mtry = best_xgb %>% pull(mtry),
                         trees = best_xgb %>% pull(trees),
                         min_n = best_xgb %>% pull(min_n),
                         tree_depth = best_xgb %>% pull(tree_depth),
                         learn_rate = best_xgb %>% pull(learn_rate),
                         loss_reduction = best_xgb %>% pull(loss_reduction),
                         sample_size = best_xgb %>% pull(sample_size)) %>%
  set_engine("xgboost") %>%
  set_mode("regression")


boost_fit <- boost_spec %>%
  fit(mpg ~ ., data = train)

Evaluate Test Set Performance

augment(boost_fit,
        new_data = test) %>%
  rmse(truth = mpg, estimate = .pred)

augment(boost_fit,
        new_data = test) %>%
  rsq(truth = mpg, estimate = .pred)


augment(boost_fit,
        new_data = test) %>%
  ggplot(aes(x = .pred, y = mpg)) +
  geom_jitter() +
  geom_abline(intercept = 0, slope = 1)

Wrapping up

XGBOOST is a flexible model that can be used for continuous outcomes, binary outcomes, and multi-class classification tasks. The model does have a number of tuning parameters that can help to optimize performance. However, it is relatively easy to create a grid of potential values and either write a for() loop or use the {tidymodels} framework to do the heavy lifting for you.

To obtain the full code and RMarkdown file check out my Github page.

Crossed vs Nested Random Effects

I was reading a paper this weekend and the researchers used a mixed model to for their data given that they were dealing with repeated measures of individuals and teams. Mixed models have become common in sport science research and I’ve written about them a few times in this blog:

As I was reading the paper I was wondering, because it wasn’t specified in the methods section, if the researchers had given any thought to whether their random effects were crossed or nested. Then I thought, maybe people aren’t thinking about this and perhaps a short blog article with an example would be useful. So here we are…

For brevity, I’ll only present short snippets of code in the blog article. The entire code is available on my Github Page

Data

We need to simulate some data to work with. To keep things simple, we will create a data set of a league that has 6 teams and 60 players. Each player will have a player value metric, the outcome of interest, for every team they played on during the season.

suppressPackageStartupMessages({
  suppressMessages({
    suppressWarnings({
    library(tidyverse)
    library(lme4)
    library(arm)
    })
  })
})


theme_set(theme_bw())

set.seed(225544)
teams <- sample(c("Sharks", "Bears", "Turtles", "Bisons", "Jaguars", "Pythons"), size = 60, replace = TRUE)
players <- as.factor(sample(1:60, size = 60, replace = TRUE))
intercept <- 50
team_effect <- case_when(teams == "Sharks" ~ rnorm(1, mean = 0, sd = 7), 
                         teams == "Bears" ~ rnorm(1, mean = 0, sd = 2),
                         teams == "Turtles" ~ rnorm(1, mean = 0, sd = 1),
                         teams == "Bisons" ~ rnorm(1, mean = 0, sd = 13),
                         teams == "Jaguars" ~ rnorm(1, mean = 0, sd = 5),
                         teams == "Pythons" ~ rnorm(1, mean = 0, sd = 3))
player_effect <- rnorm(length(levels(players)), mean = 0, sd = 10)
random_noise <- 2



dat <- data.frame( team = teams, player_id = players ) %>%
  mutate(
    mu = intercept + team_effect + player_effect[players],
    player_value = round(mu + rnorm(n(), mean = 0, sd = random_noise), 1)
  ) %>%
  dplyr::select(-mu)

dat %>%
  head()

EDA

How many unique players were on each team during the season?

Which players played on multiple teams?

Team Summary Stats

Crossed vs Nested Effects

We have 14 players that played on multiple teams (Player 11 ended up playing for all 6 teams!) and, as a consequence, different teams had different numbers of individual players. For example, the Bears had 13 different players while the Pythons had 6 different players. This small detail is important to consider with respect to our random effects because crossed and nested effects behave differently with respect to how they partition variance.

When constructing mixed models we usually have a data hierarchy where observations are grouped or nested within specific categories. When the observations are nested the observations are exclusive to specific groups. Some examples:

  • Classes are nested within schools but classes in one school cannot also be grouped within a different school.
  • Athletes competing in the Olympics are nested within their respective countries but an individual cannot compete for two different countries in the same Olympic Games.

By contrast, crossed effects can occur when observations move between groups during the same study/observation period. A few examples of crossed effects are:

  • A student might start the year at one school and then move half way through the semester and attend a different school.
  • Players in team sport might be traded mid-season and play for different teams.
  • In a study where the experimental and placebo groups contain both male and female participants, we would have crossed random effects to account for sex being in both groups.
  • Looking at sports teams, we may be analyzing the behavior of position groups on different teams. The position groups are the same for each team (e.g., Forwards, Guards, and Centers on Basketball teams) so the model has crossed effects.

To help solidify this concept a bit more, I’ve created the below drawing.

We see that in the Completely Nested example there are individual players that only player for their respective teams. Contrast that to the Completely Crossed example, where every player ends up playing on each team. This would, obviously, be an incredibly weird scenario and not one we’d see in sport but probably one that you’d see in researched where there is a crossover of treatments. Finally, the Partially Nested & Crossed example has a few players playing for multiple teams (for example, Player 1 played for both Team 1 and Team 2) and some players that stay with a single team during the season. This version is the one that we most frequently encounter in team sport.

Now that we have a general understanding of the difference between crossed and random effects, let’s see how this looks when we build a model on our simulated data and (most importantly) let’s see how specifying the model to have crossed or nested random effects changes the model output.

Model Examples

Crossed

Since we commonly deal with crossed (or partially nested/crossed) data in team sport, I’ll start with modeling the data to have crossed random effects. As a side note, most of the time when I look at people’s models in sport science they are creating the model structure like this, by default.

We will build random intercept model (since we didn’t simulate any fixed effects) and allow a separate intercept variance for both team and player_id, keeping in mind that this specifies that there are players who are potentially playing on different teams (crossed observations) across the season.

lme_crossed <- lmer(player_value ~ 1 + (1|team) + (1|player_id), data = dat)

Nested

Next, we fit the random intercept model but this time we treat our random effects of player_id and team as nested. That is to say that we are allowing the players to be nested within their respective teams. To review, this means that we are treating this data as if each player only played on a single team during the season and never changed teams.

lme_nested <- lmer(player_value ~ 1 + (1|team) + (1|team:player_id), data = dat)

Comparing the two models

First, we notice that the fixed effects are slightly different between the two models, but the difference in small in this simulated data.

The random effects look very different between the two model structures. In the crossed effects model we get two different random effects — one for the player_id and one for the team. However, in the nested effects model we have a random effect for team but then we have a random effect for player_id nested within each team. So, in the nested model a player that played for multiple teams will have different random intercept for each of the teams they were nested in. Both models, of course, also have a residual in the random effects, however the values are very different.

Let’s compile the random effects side-by-side to get a better appreciation for the differences. We will start with the `team` random effect since it is present in both models.

There are pretty stark differences between the random intercepts of the nested and crossed models! Recall from our EDA that the Bears had the highest average value of all the teams and they also had the fewest number of different players (5).

The player random effects get a little wonky because players who played for multiple teams will end up having different random effects depending on which team we nested them in. Conversely, in the crossed model we get one random effect estimate per player.

To join the random effects from both models we have to do a bit of work on the nested model to extract the info. (NOTE: Since the table is really long I’ll only show the first 10 rows here but you can got to the Github page to see the full output.)

Joining the two model random effects on player_id we see that player’s have some very different random effects between the two models. In particular, players that played on different teams can differ greatly between the models.

For example, let’s look at Player 21.

In the crossed model we find that Player 21 has a random intercept that is 1.6 points below league average (the fixed effect intercept). However, in the nested model, when Player 21 was on the Bears he was 1.15 points above the population intercept and 0.74 points below the population intercept when he was on the Turtles. This is partially due to the team effect, where the the Bears were the best team in the league (highest average value) and the Turtles were the worst team in the league (lowest average value).

Wrapping Up

Hopefully this tutorial provided a brief, but useful, overview of crossed versus random effects during mixed model construction. While most people seem to gravitate towards crossed effects by default there may be times where nested effects are indeed required for the structure of the data. As the example shows, how we specify the model can dramatically change the outputs and thus the inference that we can derive from these outputs.

To access the full code please see the Github page.