Author Archives: Patrick

Bayesian Updating for Normally Distributed Data – A few different approaches for the normal-normal conjugate

Introduction

Bayesian updating provides a way of combining prior knowledge/belief with newly observed data to obtain an updated knowledge of the world (the posterior). Most Bayesian updating examples begin by observing a binomial outcome and combining those observations with a beta prior. While this is useful for understanding the basic crux of Bayesian updating, not all problems we face in the real world are binomial in nature, and so they require a different likelihood distribution. For example, normally distributed data can be challenging to work with because there are two parameters (a mean and a standard deviation) that have their own respective variances. To circumvent this issue in a normal-normal conjugate, we often accept one of the parameters as known and fixed in the population (essentially treating it as a nuisance parameter rather than something we are explicitly modeling). Often, because we care about updating our knowledge of the mean (center) of an observed value, the standard deviation is taken to be fixed for the population, allowing us to create an updated mean and a corresponding distribution around it.

In reading about various approaches to the normal-normal conjugate, I’ve noted three methods that can be used for Bayesian updating of a normally distributed variable in a simple way. The difference between the three approaches lies in the amount of information we have available about the observed values. These approaches are easy to use and can be applied quickly by a practitioner with just a calculator, offering a convenient way to make observations and reason about the world around us.

The information required for the three approaches is as follows (I’ll share references to where I got each approach in the sections below):

  1. We have a prior value for the population mean and the sample size that this mean was taken from. What we are lacking is information about the population standard deviation. Thus, we have no information about how the variable varies.
  2. We have a prior mean and standard deviation for the population but we are lacking the sample size that the mean and standard deviation were derived from. Thus, we know how the variable varies but we don’t know how confident we should be in the observations (a large sample means we’d be more confident, while a smaller sample means we’d be less confident).
  3. We have all three pieces of prior information about the population: mean, standard deviation, and sample size.

The complete code and data are available on my GITHUB page.

Load Data

Reference: Data comes from basketball-reference.com advanced shooting stats.

We will load several seasons worth of NBA advanced shooting statistics and the stat we are interested in is Player Efficiency Rating (PER).

library(tidyverse)

theme_set(theme_light())

## load nba_advanced_shooting_stats.csv
d <- read.csv("nba_advanced_shooting_stats.csv", header = TRUE) %>%
  select(Season, Player, Pos, Age, Tm, G, MP, PER) %>%
  janitor::clean_names()

d %>%
  head() %>%
  knitr::kable()

We have four seasons’ worth of data (really about 3.5, given that the 2022-2023 season wasn’t complete when I scraped the original data).

Exploring Player Efficiency Rating & Minutes Played

Let’s focus on the Player Efficiency Rating (PER) and Minutes Played.
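The plot itself doesn’t carry over to this write-up, but a minimal sketch of those two distributions, using the packages already loaded, could look like this:

d %>%
  select(mp, per) %>%
  pivot_longer(cols = everything()) %>%
  ggplot(aes(x = value)) +
  geom_histogram(bins = 30, fill = "light grey", color = "black") +
  facet_wrap(~name, scales = "free") +
  labs(x = "Value",
       y = "Count",
       title = "Distributions of Minutes Played (mp) & PER (per)")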

It looks like, on average, players have a PER greater than 0, mostly between 10 and 15. Minutes played is right skewed, with the vast majority of players playing a low number of minutes and a few players playing a lot of minutes.

We can look at the numbers explicitly by evaluating the quantiles.

summary(d[, c("mp", "per")])

Setting up our priors

Since the 2022 – 2023 season was not finished when I scraped this data, we will base our prior for PER on the previous 3 seasons. Additionally, we will set up our prior mean from players in the population who had over the median number of minutes played in those seasons.

d %>%
  filter(season != "2022-2023",
         mp > median(mp)) %>%
  summarize(n_players = n(),
            avg_mp = mean(mp),
            avg_per = mean(per),
            sd_per = sd(per))

We can store these variables in their own elements so that they can be called later in our calculations.

prior_mu <- 14.79
prior_n <- 1448
prior_df <- prior_n - 1

Recall that for our prior standard deviation we need to obtain a prior for the standard deviation around the mean (a standard error of the mean), and we also need a known population standard deviation (what I referred to as the nuisance parameter above, since we won’t be directly estimating it).

We will call the standard deviation for the mean PER, prior_sd, and the fixed standard deviation, prior_tau. To calculate the prior_sd we’ll take the standard deviation across the three seasons for each player and then take the mean of those player standard deviations. For prior_tau we’ll use the overall standard deviation of observed PER values for the three seasons (which was calculated in our summary function above). Again, we’ll store these values in their own elements for calculations later.

d %>%
  filter(season != "2022-2023",
         mp > median(mp)) %>%
  group_by(player) %>%
  summarize(per_sd = sd(per),
            .groups = "drop") %>%
  summarize(mean(per_sd, na.rm = TRUE))


prior_sd <- 1.57
prior_var <- prior_sd^2
prior_precision <- 1 / prior_var

prior_tau <- 4.53
prior_tau_var <- prior_tau^2
prior_tau_precision <- 1 / prior_tau_var


Selecting one player

Let’s start with one player and work through the three approaches explained above before applying them to the full data set.

We will select a player with a low number of minutes played so that we can see how their PER behaves when we combine it with our prior. I’ll select Thanasis Antetokounmpo from the 2022-2023 season.

ta <- d %>%
  filter(season == "2022-2023",
         player == "Thanasis Antetokounmpo")

ta

We don’t know Thanasis Antetokounmpo’s standard deviation of player efficiency rating over the 21 games (91 minutes) he played. Therefore, we don’t have a standard deviation for his production.

Let’s store his observed values in separate elements.

obs_mu <- 1.9
obs_n <- 91

Method 1

I stumbled upon this method in the 2nd edition of Wayne Winston’s fantastic book, Mathletics.
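The original code block isn’t reproduced here, but the method amounts to a sample-size-weighted average of the prior mean and the observed mean. A minimal sketch, treating the prior sample size and Antetokounmpo’s observed minutes as the weights:

## Method 1 sketch: weight each mean by its sample size
bayes_v1 <- (prior_n * prior_mu + obs_n * obs_mu) / (prior_n + obs_n)
bayes_v1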

Combining the observed PER and sample size (minutes played) for Antetokounmpo with the prior PER and prior sample size for the population, we see that Antetokounmpo’s estimated PER gets pulled up closer to the population mean, though it remains below average.

To get a sense for how much sample size affects the shrinkage towards the prior, let’s pretend that Antetokounmpo had 1200 minutes of observation with the same PER.
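A sketch of the same weighted update with a hypothetical 1200 minutes:

bayes_v1_1200 <- (prior_n * prior_mu + 1200 * obs_mu) / (prior_n + 1200)
bayes_v1_1200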

Notice that with 1200 minutes played we are much more certain that Antetokounmpo has a below average PER.

Method 2

Recall that for method 2 to work we require a mean and standard deviation for Antetokounmpo’s PER. Since we don’t have a standard deviation for his PER in the 2022-2023 season, we will get his PER from the previous three seasons and calculate a standard deviation. We will store that value in its own element.

This approach was discussed in Chapter 9 of Gelman, Hill, and Vehtari’s Regression and Other Stories.

d %>%
  filter(season != "2022-2023",
         player == "Thanasis Antetokounmpo") %>%
  summarize(sd(per))

obs_sd <- 1.19

Applying method 2 we get the following result.

## Posterior
bayes_v2 <- ((1/prior_sd^2 * prior_mu) + (1 / obs_sd^2 * obs_mu))/((1/obs_sd^2) + (1/prior_sd^2))

bayes_v2

## Posterior SD
bayes_v2_sd <- sqrt(1/(1/obs_sd^2 + 1/prior_sd^2))
bayes_v2_sd

## Posterior 95% CI
bayes_v2 + qnorm(p = c(0.025, 0.975))*bayes_v2_sd

We could use a similar approach with just the mean and standard deviation (no sample size info) but use precision (1 / variance) as the parameter describing our spread in the data (instead of SD). We obtain the same results.

obs_precision <- 1 / obs_sd^2

posterior_precision <- prior_precision + obs_precision

bayes_v2.2 <- prior_precision/posterior_precision * prior_mu + obs_precision/posterior_precision * obs_mu

bayes_v2.2

bayes_v2.2_sd <- sqrt(1/posterior_precision)
bayes_v2.2_sd

## Posterior 95% CI
bayes_v2.2 + qnorm(p = c(0.025, 0.975))*bayes_v2.2_sd

This result is much more conservative than method 1. We see that Antetokounmpo is estimated to be well below average. Additionally, now that we have a standard deviation for Antetokounmpo’s PER we are also able to calculate a credible interval for his performance.

Method 3

For this final method we will use all of the observed info – mean, sd, and minutes played. This approach was presented in William Bolstad’s Introduction to Bayesian Statistics, 2nd Ed.

bayes_v3_precision <- prior_precision + obs_n/prior_tau_var
bayes_v3_precision

bayes_v3_sd <- sqrt(1/bayes_v3_precision)
bayes_v3_sd

## weights on the prior and observed means share a denominator so they sum to 1
bayes_v3 <- (prior_precision / (obs_n/prior_tau_var + prior_precision)) * prior_mu + ((obs_n/prior_tau_var) / (obs_n/prior_tau_var + prior_precision)) * obs_mu

bayes_v3

## Posterior 95% CI
bayes_v3 + qnorm(p = c(0.025, 0.975))*bayes_v3_sd

Comparing the Results

data.frame(bayes_v1, bayes_v2, bayes_v2_sd, bayes_v3, bayes_v3_sd) %>%
  knitr::kable()

  • Method 1 has the largest pull towards the prior mean because it uses the least information. Since we don’t have an observed standard deviation, we also don’t have any information about the variability around the posterior mean.
  • Method 2 has less pull towards the prior mean than method 1, but it also has a rather large standard deviation around its estimate.
  • Method 3 has the lowest pull towards the prior mean of the three approaches, and it uses the largest amount of information.

Antetokounmpo only had 91 minutes of observation time. To show how sample size affects our estimate, if we increase his sample size to 1000 we end up with more confidence about his performance (an estimated PER closer to the observed 1.9 and a smaller standard deviation).


bayes_v3.3_precision <- prior_precision + 1000/prior_tau_var
bayes_v3.3_precision

bayes_v3.3_sd <- sqrt(1/bayes_v3.3_precision)
bayes_v3.3_sd

bayes_v3.3 <- (prior_precision / (1000/prior_tau_var + prior_precision)) * prior_mu + ((1000/prior_tau_var) / (1000/prior_tau_var + prior_precision)) * obs_mu

bayes_v3.3

## Posterior 95% CI
bayes_v3.3 + qnorm(p = c(0.025, 0.975))*bayes_v3.3_sd


Let’s create a simulation using rnorm() and plot the estimates from the three methods. Since we don’t have a standard deviation for method 1, we will use the prior_sd. We notice that method 3, which uses the most information, gives us a much more conservative belief about the player’s true performance compared to the other two methods.

N <- 1e4

set.seed(9087)
v1_sim <- rnorm(n = N, mean = bayes_v1, sd = prior_sd)
v2_sim <- rnorm(n = N, mean = bayes_v2, sd = bayes_v2_sd)
v3_sim <- rnorm(n = N, mean = bayes_v3, sd = bayes_v3_sd)


plot(density(v1_sim), 
     col = "blue",
     lwd = 3,
     xlim = c(0, 20),
     ylim = c(0, 0.95),
     main = "Bayesian Normal Updating -- 3 Approaches\nDashed Line = Observed PER")
lines(density(v2_sim), 
     col = "red",
     lwd = 3)
lines(density(v3_sim), 
     col = "green",
     lwd = 3)
abline(v = obs_mu,
       col = "black",
       lty = 2,
       lwd = 2)
legend(x = 12,
       y = 0.8,
       c("Method 1", "Method 2", "Method 3"),
       col = c("blue", "red", "green"),
       lwd = 2)

Wrapping Up

The normal-normal conjugate can be a little tricky compared to the beta-binomial conjugate, but it is an important one to work with given how often the data we deal with on a regular basis is approximately normal. Without getting into complex modeling, we can use a few simple approaches to the normal-normal conjugate that allow us to quickly update our beliefs based on the various bits of information we have access to. Hopefully this article was useful for showing a few of these approaches (there are others!).

The full code for this article is available on my GITHUB page.

If you notice any errors or have additional thoughts, feel free to email me!

TidyX Episode 143: Four in for(loops)

This week, Ellis Hughes and I continue with last week’s example and discuss for() loops in more detail. Specifically, we discuss four (plus one bonus — so really five) different ways of specifying the for() loop depending on the data you are dealing with and the problem you are trying to solve. We talk through each type of specification approach, its advantages and disadvantages, and when you may choose one option over another.
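The episode’s code is linked below; as a generic flavor of the idea (not the episode’s code), here are a few common ways to specify the sequence a for() loop iterates over:

x <- c(a = 10, b = 20, c = 30)

for(i in 1:length(x)) print(x[i])    ## index with 1:length() (misbehaves when x is empty)
for(i in seq_along(x)) print(x[i])   ## index with seq_along() (safe when x is empty)
for(val in x) print(val)             ## loop over the values directly
for(nm in names(x)) print(x[nm])     ## loop over the names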

To listen to our screen cast, CLICK HERE.

To access our code, CLICK HERE.

TidyX Episode 142: Storing data in for loops – An example using k-fold cross validation

for() loops are a necessary evil – you probably won’t love writing them, but they come in handy. I often find myself writing a lot of for() loops, and viewers of our screen cast often have questions about writing them. This week, Ellis Hughes and I discuss four different approaches to storing your data from a for() loop so that you can use it in downstream projects (e.g., data models, visualizations, etc.). We do this with an example of creating a for() loop for k-fold cross validation.

K-fold cross validation is a useful component of the model building process, allowing you to split the data into k folds. Each fold serves as a one-time hold-out set: a model is constructed on the rest of the data and then used to predict on the held-out fold. This continues until each of the k folds has been held out and predicted on. Model performance is then compared across the folds. This type of iterative process makes k-fold cross validation a good candidate for a for() loop.
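The episode’s code is linked below; as a generic illustration only (using the built-in mtcars data, not the episode’s example), a k-fold loop that stores a result per fold might look like:

set.seed(1)
k <- 5
fold_id <- sample(rep(1:k, length.out = nrow(mtcars)))
rmse <- vector("numeric", length = k)

for(i in 1:k){
  train_i <- mtcars[fold_id != i, ]    ## build the model on everything but fold i
  hold_out <- mtcars[fold_id == i, ]   ## fold i is the one-time hold-out set
  fit <- lm(mpg ~ wt, data = train_i)
  pred <- predict(fit, newdata = hold_out)
  rmse[i] <- sqrt(mean((hold_out$mpg - pred)^2))
}

rmse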

To watch our screen cast, CLICK HERE.

To access our code, CLICK HERE.

TidyX Episode 141: Function Factories

Two episodes ago, Ellis Hughes and I briefly discussed function factories — a function that returns a function — and we got a few questions about how to use them and how to construct them. The good news is that you’ve probably been using function factories in your normal R coding and never knew it! This week we go a little deeper into function factories and answer some viewer questions about them.
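As a quick generic example (not the episode’s code), a function factory is just a function whose return value is itself a function:

power <- function(exponent) {
  function(x) x^exponent   ## the returned function remembers `exponent`
}

square <- power(2)
cube <- power(3)

square(4)   ## 16
cube(2)     ## 8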

To watch our screen cast, CLICK HERE.

To access our code, CLICK HERE.

Tidymodels Workflow Sets Tutorial

Intro

The purpose of workflow sets is to allow you to seamlessly fit multiple different models (and even tune them) simultaneously. This provides an efficient approach to the model building process, as the models can then be compared to each other to determine which is the optimal model for deployment. Therefore, the aim of this tutorial is to provide a simple walk-through of how to set up a workflow_set() and build multiple models simultaneously using the tidymodels framework.

The full code (which will include code not directly embedded in this tutorial) is available on my GITHUB page.

Load Packages & Data

Data comes from the nwslR package, which provides a lot of really nice National Women’s Soccer League data.

We will be using stats for field players to determine those who received the Best XI award (there will be only 10 players per season since we are dealing with field player stats, no goalies).

## packages
library(tidyverse)
library(tidymodels)
library(nwslR)
library(tictoc)

theme_set(theme_light() +
            theme(strip.background = element_rect(fill = "black"),
                  strip.text = element_text(face = "bold")))


## data sets required
data(player)
data(fieldplayer_overall_season_stats)
data(award)

## join all data sets to make a primary data set
d <- fieldplayer_overall_season_stats %>%
  left_join(player) %>% 
  left_join(award) %>% 
  select(-name_other) %>% 
  mutate(best_11 = case_when(award == "Best XI" ~ 1,
                             TRUE ~ 0)) %>% 
  select(-award)

d %>% 
  head()


Our features will be all of the playing stats (mp, starts, min, gls, ast, pk, p_katt) and the position (pos) that the player played.

Exploratory Data Analysis

Let’s explore some of the variables that we will be modeling.

How many NAs are there in the data set?
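One way to count the NAs in each column (a sketch, not necessarily the post’s original code):

d %>%
  summarize(across(everything(), ~ sum(is.na(.x))))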

  • It looks like there are some players with matches played (mp) and starts recorded, yet the number of minutes was not recorded. We will need to handle this in our pre-processing. The alternative approach would be to just remove those 79 players; however, I will add an imputation step in the recipe section of our model building process to show how it works.
  • There are also a number of players who played in games but never attempted a penalty kick. We will set these columns to 0 (the median value).

How many matches did those who have an NA for minutes play in?
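A quick way to check (again a sketch):

d %>%
  filter(is.na(min)) %>%
  count(mp, sort = TRUE)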


Let’s take a look at the relationship between matches played, `mp`, and `min` to see if we can impute the value for those who have NA.

fit_min <- lm(min ~ mp, data = d)
summary(fit_min)

plot(x = d$mp, 
     y = d$min,
     main = "Minutes Played ~ Matches Played",
     xlab = "Matches Played",
     ylab = "Minutes Played",
     col = "light grey",
     pch = 19)
abline(fit_min,
       col = "red",
       lwd = 5,
       lty = 2)

  • There is a large amount of error in this model (residual standard error = 264) and the variance in the relationship appears to increase as matches played increases. This is all we have in this data set to really go on. It is probably best to figure out why no minutes were recorded for those players or see if there are other features in a different data set that can help us out. For now, we will stick with this simple model and use it in our model `recipe` below.

Plot the density of the continuous predictor variables based on the `best_11` award.

d %>%
  select(mp:p_katt, best_11) %>%
  pivot_longer(cols = -best_11) %>%
  ggplot(aes(x = value, fill = as.factor(best_11))) +
  geom_density(alpha = 0.6) +
  facet_wrap(~name, scales = "free") +
  labs(x = "Value",
       y = "Density",
       title = "Distribution of variables relative to Best XI designation",
       subtitle = "NOTE: axes are specific to the value in question")

How many field positions are there?

Some players appear to play multiple positions. Maybe they are more versatile? Have players with position versatility won more Best XI awards?
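One way to tabulate the position values (a sketch; multi-position players show up as their own level):

d %>%
  count(pos, sort = TRUE)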

Data Splitting

First, I’ll create a data set of just the predictors and outcome variables (and get rid of the other variables in the data that we won’t be using). I’ll also convert our binary outcome variable from a number to a factor, for model fitting purposes.
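The selection step itself wasn’t captured in this write-up, so here is a sketch consistent with the features listed earlier; the object name d_model matches what the split below expects, but the exact column handling is an assumption.

## Keep the position, playing stats, and outcome; make the outcome a factor (assumed selection)
d_model <- d %>%
  select(pos, mp, starts, min, gls, ast, pk, p_katt, best_11) %>%
  mutate(best_11 = as.factor(best_11))

d_model %>%
  head()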

Split the data into training and test sets.

## Train/Test
set.seed(398)
init_split <- initial_split(d_model, prop = 0.7, strata = "best_11")

train <- training(init_split)
test <- testing(init_split)

Further split the training set into 5 cross validation folds.

## Cross Validation Split of Training Data
set.seed(764)
cv_folds <- vfold_cv(
  data = train, 
  v = 5
  ) 


Prepare the data with a recipe

Recipes help us set up the data for modeling purposes. It is here that we can handle missing values, scale/normalize our features, and create dummy variables. More importantly, creating the recipe ensures that if we deploy our model for future predictions, the steps in the data preparation process will be consistent and standardized with what we did when we fit the model.

You can find all of the recipe options HERE.

The pre-processing steps we will use are:

  • Impute any NA minutes (`min`) using the `mp` variable
  • Create one-hot encoded dummy variables for the player’s position
  • Impute the median (0) when penalty kicks attempted and penalty kicks made are NA
  • Normalize the numeric data to have a mean of 0 and standard deviation of 1

nwsl_rec <- recipe(best_11 ~ ., data = train) %>%
  step_impute_linear(min, impute_with = imp_vars(mp)) %>%
  step_dummy(pos, one_hot = TRUE) %>%
  step_impute_median(pk, p_katt, ast) %>%
  step_normalize(mp:p_katt)

nwsl_rec

Here is what the pre-processed training set looks like when we apply this recipe:
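A sketch of how to inspect it, using recipes’ prep() and bake() (new_data = NULL returns the processed training set):

nwsl_rec %>%
  prep() %>%
  bake(new_data = NULL) %>%
  head()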

Specifying the models

We will fit three models at once:

  1. Random Forest
  2. XGBoost
  3. K-Nearest Neighbor

## Random forest
rf_model <- rand_forest(mtry = tune(), trees = tune()) %>%
  set_mode("classification") %>%
  set_engine("randomForest", importance = TRUE)

## XGBoost
xgb_model <- boost_tree(trees = tune(), mtry = tune(), tree_depth = tune(), learn_rate = 0.01) %>%
  set_mode("classification") %>% 
  set_engine("xgboost", importance = TRUE)

## K-Nearest Neighbor
knn_model <- nearest_neighbor(neighbors = 4) %>%
  set_mode("classification")

Workflow Set

We are now ready to combine the pre-processing recipes and the three models together in a workflow_set().

nwsl_wf <- workflow_set(
  preproc = list(nwsl_rec),
  models = list(rf_model, xgb_model, knn_model),
  cross = TRUE
  )

nwsl_wf

Tune & fit the 3 workflows

Once the models are set up, we use workflow_map() to fit the workflows to the cross-validated folds we created. We will set up a few tuning parameters for the Random Forest and XGBoost models so that during the fitting process we can determine which parameter pairings optimize model performance.

I also use the tic() and toc() functions from the tictoc package to time how long the models take to fit, in case there are potential opportunities to optimize the fitting process.

doParallel::registerDoParallel(cores = 10)

tic()

fit_wf <- nwsl_wf %>%  
  workflow_map(
    seed = 44, 
    fn = "tune_grid",
    grid = 10,           ## parameters to pass to tune grid
    resamples = cv_folds
  )

toc()

# Took 1.6 minutes to fit

doParallel::stopImplicitCluster()

fit_wf


Evaluate each model’s performance on the train set

We can plot the model predictions across the range of models we fit using autoplot(), get a summary of the model predictions with the collect_metrics() function, and rank the results of the model using rank_results().


## plot each of the model's performance and ROC
autoplot(fit_wf)

## Look at the model metrics for each of the models
collect_metrics(fit_wf) 

## Rank the results based on model accuracy
rank_results(fit_wf, rank_metric = "accuracy", select_best = TRUE)


We see that the Random Forest models outperformed the XGBoost and KNN models.

Extract the model with the best performance

Now that we know the Random Forest performed best, we will grab the workflow ID for the Random Forest models and their corresponding workflows.

## get the workflow ID for the best model
best_model_id <- fit_wf %>% 
  rank_results(
    rank_metric = "accuracy",
    select_best = TRUE
  ) %>% 
  head(1) %>% 
  pull(wflow_id)

best_model_id

## Extract the workflow for the best model
best_model <- extract_workflow(fit_wf, id = best_model_id)
best_model

Extract the tuned results from workflow of the best model

We know the best model was the Random Forest model so we can use the best_model_id to get all of the Random Forest models out and look at how each one did during the tuning process.

First we extract the Random Forest models.

## extract the Random Forest models
best_workflow <- fit_wf[fit_wf$wflow_id == best_model_id, "result"][[1]][[1]]

best_workflow

With the collect_metrics() function we can see the iterations of mtry, trees, and tree_depth that were evaluated in the tuning process. We can also use select_best() to get the model parameters that performed the best of the Random Forest models.

collect_metrics(best_workflow)
select_best(best_workflow, "accuracy")

Fit the final model

We saw above that the best model had the following tuning parameters:

  • mtry = 1
  • trees = 944

We can extract this optimized workflow using the finalize_workflow() function and then fit that final workflow to the initial training split data.

## get the finalized workflow
final_wf <- finalize_workflow(best_model, select_best(best_workflow, "accuracy"))
final_wf

## fit the final workflow to the initial data split
doParallel::registerDoParallel(cores = 8)

final_fit <- final_wf %>% 
  last_fit(
    split = init_split
  )

doParallel::stopImplicitCluster()

final_fit

Extract Predictions on Test Data and evaluate model

First we can evaluate the variable importance plot for the random forest model.

library(vip)

final_fit %>%
  extract_fit_parsnip() %>%
  vip(geom = "col",
      aesthetics = list(
              color = "black",
              fill = "palegreen",
              alpha = 0.5)) +
  theme_classic()

Next we will look at the accuracy and ROC on the test set by using the collect_metrics() function on the final_fit. Additionally, if we use the collect_predictions() function we will get the predicted class and predicted probabilities for each row of the test set.

## Look at the accuracy and ROC on the test data
final_fit %>% 
  collect_metrics()

## Get the model predictions on the test data
fit_test <- final_fit %>% 
  collect_predictions()

fit_test %>%
  head()

Next, create a confusion matrix of the class of interest, best_11 and our predicted class, .pred_class.

fit_test %>% 
  count(.pred_class, best_11)

table(fit_test$best_11, fit_test$.pred_class)

We see that the model never actually predicted a person to be in class 1, indicating that they would be ranked as one of the Best XI for a given season. We have such substantial class imbalance that the model can basically guess that no one will win Best XI and still end up with a high accuracy.

The predicted class for a binary prediction ends up coming from a default threshold of 0.5, meaning that the predicted probability of being one of the Best XI needs to exceed 50% in order for that class to be predicted. This might be a bit high/extreme for our data! Additionally, in many instances we may not care so much about a specific predicted class but instead we want to just understand the probability of being predicted in one class or another.

Let’s plot the distribution of Best XI predicted probabilities colored by whether the individual was actually one of the Best XI players.

fit_test %>%
  ggplot(aes(x = .pred_1, fill = best_11)) +
  geom_density(alpha = 0.6)

We can see that those who were actually given the Best XI designation had a higher probability of being indicated as Best XI, just not high enough to exceed the 0.5 default threshold. What if we set the threshold for being classified as Best XI at 0.08?

fit_test %>%
  mutate(pred_best_11_v2 = ifelse(.pred_1 > 0.08, 1, 0)) %>%
  count(pred_best_11_v2, best_11)

Wrapping Up

In the final code output above, we see 20 total instances where the model predicted the individual would be a Best XI player. In some of those instances the model correctly identified one of the Best XI; in others the prediction was a false positive (the model thought the person had a Best XI season when they did not). There is a lot to unpack here. Binary thresholds like this can be messy, since classifications near the threshold line are inherently uncertain, and changing the threshold will change the classification outcomes. Where to set the threshold depends on your tolerance for committing a Type I or Type II error, which may in turn depend on the goal of your model, among other things. Finally, we often care more about the probability of being in one class or another than about a specific class label. All of these things need to be considered and thought through, and they are out of the scope of this tutorial, which aimed simply to walk through how to set up a workflow_set() and fit multiple models simultaneously. Perhaps a future tutorial can cover such matters in more depth.

The complete code for this tutorial is available on my GITHUB page.