Category Archives: R Tips & Tricks

Visualizing Vald Force Frame Data

Recently, in a sport science chat group on Facebook, someone asked for an example of how other practitioners are visualizing their Vald Force Frame data. For those who are unfamiliar, Vald is the company that originally pioneered the Nordbord for eccentric hamstring strength testing. The Force Frame is their latest technological offering, designed to help practitioners test abduction and adduction of the hips for performance and return-to-play purposes.

The Data

I've never used the Force Frame personally, so the person asking provided a screenshot of what the data looks like. Using that example, I created a small simulation of data to try and build a visual that might be useful for practitioners. Briefly, the data is structured as two rows per athlete: one row for the squeeze (adduction) force output and one row for the pull (abduction) force output. My simulated data looks like this:

### Vald Force Frame Visual
library(tidyverse)

set.seed(678)
dat <- tibble(
  player = rep(c("Tom", "Alan", "Karl", "Sam"), each = 2),
  test = rep(c("Pull", "Squeeze"), times = 4)
  ) %>%
  mutate(l_force = ifelse(test == "Pull", rnorm(n = nrow(.), mean = 305, sd = 30),
                          rnorm(n = nrow(.), mean = 360, sd = 30)),
         r_force = ifelse(test == "Pull", rnorm(n = nrow(.), mean = 305, sd = 30),
                          rnorm(n = nrow(.), mean = 360, sd = 30)),
         pct_imbalance = ifelse(l_force > r_force, ((l_force - r_force) / l_force) * -1,
                                (r_force - l_force) / r_force))

In this simplified data frame we see the two rows per athlete with left and right force outputs. I’ve also calculated the Bilateral Strength Asymmetry (BSA), indicated in the percent imbalance column. This measure, as well as several other measures of asymmetry, was reported by Bishop et al (2016) in their paper on Calculating Asymmetries. The equation is as follows:

BSA = (Stronger Limb – Weaker Limb) / Stronger Limb

Additionally, if the left leg was stronger than the right I multiplied the BSA by -1, so that the direction of asymmetry (the stronger limb) can be reflected in the plot.
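As a quick worked example of that sign convention (using made-up force values rather than the simulated data above):

## Worked example: left = 320, right = 290 (made-up values for illustration)
l_force <- 320
r_force <- 290

ifelse(l_force > r_force,
       ((l_force - r_force) / l_force) * -1,   # left stronger -> negative
       (r_force - l_force) / r_force)          # right stronger -> positive
## returns roughly -0.094, i.e., a ~9.4% asymmetry favoring the left limb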

The Plot

Prior to plotting, I set some theme elements that allow me to style the axis labels, axis text, plot title and subtitle, and the headers of the two facets of tests that we have (one for pull and one for squeeze). Additionally, in order to make the plot look cleaner, I get rid of the legend since it didn’t offer any new information that couldn’t be directly interpreted by looking at the visual.

Before creating the visual, I first add a few variables that will help give more context to the numbers. The goal of this visual is to help practitioners look at the test results for a larger group of athletes and quickly identify those who may require specific consideration. Therefore, I create a text string that captures the left and right force outputs, so that they can be displayed directly on the plot, and a corresponding “flag” variable that indicates when an athlete may be below some normalized benchmark of strength in the respective test. Finally, I create an asymmetry flag to indicate when an athlete has a BSA that exceeds 10% in either direction. This threshold can (and should) be whatever is meaningful and important for your athletes and your sport.

For the plot itself, I decided to plot the BSA values for both tests, since practitioners are likely to find them valuable and the direction of asymmetry is easy to read directly off the plot. Remember, the direction that the bar points represents the stronger limb. To provide context for the asymmetry direction, I created a shaded normative range; whenever the bar falls outside of this range it turns red, and when it is inside the range it remains green. To show the raw force values, I add them as labels in the middle of each plot region for each athlete. If an athlete is flagged as having a force output for either leg that is below the predetermined threshold, the text turns red.

 

theme_set(theme_classic() +
          theme(strip.background = element_rect(fill = "black"),
                strip.text = element_text(size = 13, face = "bold", color = "white"),
                axis.text = element_text(size = 13, face = "bold"),
                axis.title.x = element_text(size = 14, face = "bold"),
                plot.title = element_text(size = 18),
                plot.subtitle = element_text(size = 14),
                legend.position = "none"))

dat %>%
  mutate(left = paste("Left =", round(l_force, 1), sep = " "),
         right = paste("Right =", round(r_force, 1), sep = " "),
         l_r = paste(left, right, sep = "\n"),
         asym_flag = ifelse(abs(pct_imbalance) > 0.1, "flag", "no flag"),
         weakness_flag = ifelse((test == "Pull" & (l_force < 250 | r_force < 250)) |
                                 (test == "Squeeze" & (l_force < 330 | r_force < 330)), "flag", "no flag")) %>%
  ggplot(aes(x = pct_imbalance, y = player)) +
  geom_rect(aes(xmin = -0.1, xmax = 0.1),
            ymin = 0,
            ymax = Inf,
            fill = "light grey",
            alpha = 0.3) +
  geom_col(aes(fill = asym_flag),
           alpha = 0.6,
           color = "black") +
  geom_vline(xintercept = 0, 
             size = 1.3) +
  annotate(geom = "text",
           x = -0.2, 
           y = 4.5,
           size = 6,
           label = "Left") +
  annotate(geom = "text",
           x = 0.2, 
           y = 4.5,
           size = 6,
           label = "Right") +
  geom_label(aes(x = 0, y = player, label = l_r, color = weakness_flag)) +
  scale_color_manual(values = c("flag" = "red", "no flag" = "black")) +
  scale_fill_manual(values = c("flag" = "red", "no flag" = "palegreen")) +
  labs(x = "% Imbalance",
       y = NULL,
       title = "Force Frame Team Testing",
       subtitle = "Pull Weakness < 250 | Squeeze Weakness < 330") +
  facet_grid(~test) +
  scale_x_continuous(labels = scales::percent_format(accuracy = 0.1),
                     limits = c(-0.3, 0.3))

 

At a quick glance we can see that three of the athletes exhibit a strength asymmetry for Pull and two exhibit one for Squeeze. Additionally, one of the athletes, Karl, is also below the strength threshold for both Pull and Squeeze, while Sam is below the threshold for Squeeze only.

Wrapping Up

There are a lot of ways to visualize this sort of single day testing data. This is just one example and would be different if we were trying to visualize serial measurements, where we are tracking changes over time. Hopefully this short example provides some ideas. If you’d like to play around with the code and adapt it to your own athletes, you can access it on my GitHub page.

Two Group Comparison – Frequentist vs Bayes – Part 1

One of the more common types of analysis conducted in science is the comparison of two groups (e.g., a treatment group and a control group) to identify whether an intervention had a desired effect. The problem, in sport science and other fields, is often a lack of participants (small sample sizes) and an inability to combine prior knowledge (e.g., outcomes from previous research or domain expertise) with the observed data in order to make a broader statement about our findings (i.e., a Bayesian approach). Consequently, this leaves us analyzing the data on hand and reporting whatever potential findings shake out.

This tutorial will step through a two group comparison of some simulated data from both a frequentist (t-test) and Bayesian approach.

All code is accessible on my GITHUB page.

Two Groups – Simulated Data

We are conducting a study on the effect of a new strength training program. Participants are randomly allocated into a control group (n = 17), receiving a normal training program, or an experimental group (n = 22), receiving the new training program.

The data consists of the change in strength score observed for each participant in their respective group. (NOTE: For the purposes of this example, the strength score is a made-up number describing a participant's strength).

Simulate the data

library(tidyverse)
library(patchwork)

theme_set(theme_classic())

set.seed(6677)
df <- data.frame(
  subject = 1:39,
  group = c(rep("control", times = 17), rep("experimental", times = 22)),
  strength_score_change = c(round(rnorm(n = 17, mean = 0, sd = 0.5), 1),
                            round(rnorm(n = 22, mean = 0.5, sd = 0.5), 1))
  ) %>%
  mutate(group = factor(group, levels = c("experimental", "control")))

 

Summary Statistics

df %>%
  group_by(group) %>%
  summarize(N = n(),
            Avg = mean(strength_score_change),
            SD = sd(strength_score_change),
            SE = SD / sqrt(N))


It looks like the new strength training program led to a greater improvement in strength, on average, than the normal strength training program.

Plot the data

Density plots of the two distributions

df %>%
  ggplot(aes(x = strength_score_change, fill = group)) +
  geom_density(alpha = 0.2) +
  xlim(-2, 2.5)


Plot the means and 95% CI to compare visually

df %>%
  group_by(group) %>%
  summarize(N = n(),
            Avg = mean(strength_score_change),
            SD = sd(strength_score_change),
            SE = SD / sqrt(N)) %>%
  ggplot(aes(x = group, y = Avg)) +
  geom_hline(yintercept = 0,
             size = 1,
             linetype = "dashed") +
  geom_point(size = 5) +
  geom_errorbar(aes(ymin = Avg - 1.96 * SE, ymax = Avg + 1.96 * SE),
                width = 0,
                size = 1.4) +
  theme(axis.text = element_text(face = "bold", size = 13),
        axis.title = element_text(face = "bold", size = 17))

Plot the 95% CI for the difference in means

df %>%
  mutate(group = factor(group, levels = c("control", "experimental"))) %>%
  group_by(group) %>%
  summarize(N = n(),
            Avg = mean(strength_score_change),
            SD = sd(strength_score_change),
            SE = SD / sqrt(N),
            SE2 = SE^2,
            .groups = "drop") %>%
  summarize(diff = diff(Avg),
            se_diff = sqrt(sum(SE2))) %>%
  mutate(group = 'Group\nDifference') %>%
  ggplot(aes(x = diff, y = 'Group\nDifference')) +
  geom_vline(aes(xintercept = 0),
             linetype = "dashed",
             size = 1.2) +
  geom_point(size = 5) +
  geom_errorbar(aes(xmin = diff - 1.96 * se_diff,
                    xmax = diff + 1.96 * se_diff),
                width = 0,
                size = 1.3) +
  xlim(-0.8, 0.8) +
  labs(y = NULL,
       x = "Average Difference in Strengh Score",
       title = "Difference in Strength Score",
       subtitle = "Experimental - Control (Mean Difference ± 95% CI)") +
  theme(axis.text = element_text(face = "bold", size = 13),
        axis.title = element_text(face = "bold", size = 17),
        plot.title = element_text(size = 20),
        plot.subtitle = element_text(size = 18))


Initial Observations

  • We can see that the two distributions have a decent amount of overlap
  • The mean change in strength for the experimental group appears to be larger than the mean change in strength for the control group
  • When visually comparing the group means and their associated confidence intervals, there is an observable difference. Additionally, looking at the mean difference plot, that difference appears to be significant based on the fact that the confidence interval does not cross zero.

So, it seems like something is going on here with the new strength training program. Let’s do some testing to see just how large the difference is between our two groups.

Group Comparison: T-test

First, let’s use a frequentist approach to evaluating the difference in strength score change between the two groups. We will use a t-test to compare the mean differences.

diff_t_test <- t.test(strength_score_change ~ group,
       data = df,
       alternative = "two.sided")

## store some of the main values in their own elements to use later
t_test_diff  <- as.vector(round(diff_t_test$estimate[1] - diff_t_test$estimate[2], 2))

se_diff <- 0.155    # calculated above when creating the plot

diff_t_test

  • The test is statistically significant, as the p-value is less than the magic 0.05 (p = 0.01) with a mean difference in strength score of 0.41 and a 95% Confidence Interval of 0.1 to 0.73.

Here, the null hypothesis was that the difference in strength improvement between the two groups is 0 (no difference) while the alternative hypothesis is that the difference is not 0. In this case, we did a two sided test so we are stating that we want to know if the change in strength from the new training program is either better or worse than the normal training program. If we were interested in testing the difference in a specific direction, we could have done a one sided test in the direction of our hypothesis, for example, testing whether the new training program is better than the normal program.
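As a sketch of what that one-sided version might look like (not something run in the original analysis; with the factor levels set above, group 1 is the experimental group, so alternative = "greater" tests whether its mean change is larger than the control group's):

## Hedged sketch: one-sided version of the same test
t.test(strength_score_change ~ group,
       data = df,
       alternative = "greater")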

But, what if we just got lucky with our sample in the experimental group, and they happened to respond really well to the new program, and got unlucky with our control group, and they happened to not respond as well to the normal training program? This could certainly happen. Also, had we collected data on a few more participants, it is possible that the group means would have changed and no effect would have been observed.

The t-test only allows us to compare the group means. It doesn’t tell us anything about the differences in variances between groups (we could do an F-test for that). The other issue is that rarely is the difference between two groups exactly 0. It makes more sense for us to compare a full distribution of the data and make an inference about how much difference there is between the two groups. Finally, we are only dealing with the data we have on hand (the data collected for our study). What if we want to incorporate prior information/knowledge into the analysis? What if we collect data on a few more participants and want to add that to what we have already observed to see how it changes the distribution of our data? For that, we can use a Bayesian approach.
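As a quick aside before moving on, if we did want to formally compare the group variances, the F-test mentioned above could be run as a one-liner (again, not part of the original analysis):

## Hedged sketch: F-test comparing the two group variances
var.test(strength_score_change ~ group, data = df)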

Group Comparison: Bayes

The two approaches we will use are:

  1. Group comparison with a known variance, using conjugate priors
  2. Group comparison without a known variance, estimating the mean and SD distributions jointly

Part 1: Group Comparison with known variance

First, we need to set some priors. Let’s say that we are skeptical about the effect of the new strength training program. It is a new training stimulus that the subjects have never been exposed to, so perhaps we believe that it will improve strength but not much more than the traditional program. We will set our prior belief about the mean improvement of any strength training program to be 0.1 with a standard deviation of 0.3, for our population.  Thus, we will look to combine this average/expected improvement (prior knowledge) with the observations in improvement that we see in our respective groups. Finally, we convert the standard deviation of the mean to precision, calculated as 1/sd^2, for our equations going forward.

prior_mu <- 0.1
prior_sd <- 0.3
prior_var <- prior_sd^2
prior_precision <- 1 / prior_var

Plot the prior distribution for the mean difference

hist(rnorm(n = 1e6, mean = prior_mu, sd = prior_sd),
     main = "Prior Distribution for difference in strength change\nbetween control and experimental groups",
     xlab = NULL)
abline(v = 0,
       col = "red",
       lwd = 5,
       lty = 2)

Next, to make this conjugate approach work we need to have a known standard deviation. In this example, we will not be estimating the joint posterior distributions (mean and SD). Rather, we are saying that we are interested in knowing the mean and the variability around it but we are going to have a fixed/known SD.

Let’s say we looked at some previously published scientific literature and also tried out the new program in a small pilot study of athletes, and we found that the improvements from a strength training program are normally distributed with a known SD of 0.6.

known_sd <- 0.6
known_var <- known_sd^2

Finally, let’s store the summary statistics for each group in their own elements. We can type these directly in from our summary table.

df %>%
  group_by(group) %>%
  summarize(N = n(),
            Avg = mean(strength_score_change),
            SD = sd(strength_score_change),
            SE = SD / sqrt(N))

experimental_N <- 22
experimental_mu <- 0.423
experimental_sd <- 0.494
experimental_var <- experimental_sd^2
experimental_precision <- 1 / experimental_var

control_N <- 17
control_mu <- 0.0118
control_sd <- 0.469
control_var <- control_sd^2
control_precision <- 1 / control_var

Now we are ready to update the observed study data for each group with our prior information and obtain posterior estimates. We will use the updating rules provided by William Bolstad in Chapter 13 of his book, Introduction to Bayesian Statistics, 2nd Ed.
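Written out, the updating rules that the code below applies to each group are:

posterior_precision = prior_precision + N / known_var
posterior_mu = (prior_precision / posterior_precision) * prior_mu + ((N / known_var) / posterior_precision) * observed_mu

where N and observed_mu are the group's sample size and observed mean, and known_var is the fixed variance we specified above.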

##### Update the control group observations ######
## SD
posterior_precision_control <- prior_precision + control_N / known_var
posterior_var_control <- 1 / posterior_precision_control
posterior_sd_control <- sqrt(posterior_var_control)

## mean
posterior_mu_control <- prior_precision / posterior_precision_control * prior_mu + (control_N / known_var) / posterior_precision_control * control_mu

posterior_mu_control
posterior_sd_control

##### Update the experimental group observations ######
## SD
posterior_precision_experimental <- prior_precision + experimental_N / known_var
posterior_var_experimental <- 1 / posterior_precision_experimental
posterior_sd_experimental <- sqrt(posterior_var_experimental)

## mean
posterior_mu_experimental <- prior_precision / posterior_precision_experimental * prior_mu + (experimental_N / known_var) / posterior_precision_experimental * experimental_mu

posterior_mu_experimental
posterior_sd_experimental

Compare the posterior difference in strength change

# mean
mu_diff <- posterior_mu_experimental - posterior_mu_control
mu_diff

# sd = sqrt(var1 + var2)
sd_diff <- sqrt(posterior_var_experimental + posterior_var_control)
sd_diff

# 95% Credible Interval
mu_diff + qnorm(p = c(0.025, 0.975)) * sd_diff

  • Combining the observations for each group with our prior, it appears that the new program was more effective than the normal program on average (0.34 difference in change score between experimental and control) but the credible interval suggests the data is consistent with the new program having no extra benefit [95% Credible Interval: -0.0003 to 0.69].

Next, we can take the posterior parameters and perform a Monte Carlo Simulation to compare the posterior distributions.

Monte Carlo Simulation is an approach that uses random sampling from a defined distribution to solve a problem. Here, we will use Monte Carlo Simulation to sample from the normal distributions of the control and experimental groups as well as a simulation for the difference between the two. To do this, we will create a random draw of 10,000 values with the mean and standard deviation being defined as the posterior mean and standard deviation from the respective groups.

## Number of simulations
N <- 10000

## Monte Carlo Simulation
set.seed(9191)
control_posterior <- rnorm(n = N, mean = posterior_mu_control, sd = posterior_sd_control)
experimental_posterior <- rnorm(n = N, mean = posterior_mu_experimental, sd = posterior_sd_experimental)
diff_posterior <- rnorm(n = N, mean = mu_diff, sd = sd_diff)

## Put the control and experimental groups into a data frame
posterior_df <- data.frame(
  group = c(rep("control", times = N), rep("experimental", times = N)),
  posterior_sim = c(control_posterior, experimental_posterior)
)

Plot the mean and 95% CI for both simulated groups

posterior_df %>%
  group_by(group) %>%
  summarize(Avg = mean(posterior_sim),
            SE = sd(posterior_sim)) %>%
  ggplot(aes(x = group, y = Avg)) +
  geom_hline(yintercept = 0,
             size = 1,
             linetype = "dashed") +
  geom_point(size = 5) +
  geom_errorbar(aes(ymin = Avg - 1.96 * SE, ymax = Avg + 1.96 * SE),
                width = 0,
                size = 1.4) +
  labs(x = NULL,
       y = "Mean Diff",
       title = "Difference in Strength Score",
       subtitle = "Monte Carlo Simulation of Posterior Mean and SD") +
  theme(axis.text = element_text(face = "bold", size = 13),
        axis.title = element_text(face = "bold", size = 17),
        plot.title = element_text(size = 20),
        plot.subtitle = element_text(size = 17))

  • We see more overlap here than we did when we were just looking at the observed data. Part of this is due to the priors that we set.

Plot the posterior simulation for the difference

hist(diff_posterior,
     main = "Posterior Simulation of the Difference between groups",
     xlab = "Strength Score Difference")
abline(v = 0,
       col = "red",
       lwd = 5,
       lty = 2)

Compare these results to the observed data and the t-test results

data.frame(
  group = c("control", "experimental", "difference"),
  observed_avg = c(control_mu, experimental_mu, t_test_diff),
  posterior_sim_avg = c(posterior_mu_control, posterior_mu_experimental, mu_diff),
  observed_standard_error = c(control_sd / sqrt(control_N), experimental_sd / sqrt(experimental_N), se_diff),
  posterior_sim_standard_error = c(posterior_sd_control, posterior_sd_experimental, sd_diff)
  )

  • Recall that we set our prior mean to 0.1 and our prior SD to 0.3
  • Also notice that, because this conjugate approach is directed at the means, I convert the observed standard deviations into standard errors (the standard deviation of the mean given the sample size). Remember, we set the SD to be known for this approach to work, so we are not able to say anything about it following our Bayesian procedure. To do so, we’d need to estimate both parameters (mean and SD) jointly. We will do this in part 2.
  • Above, we see that, following the Bayesian approach, the mean value of the control group gets pulled up closer to our prior while the mean value of the experimental group gets pulled down closer to that prior.
  • The posterior for the difference between the two groups, which is what we are interested in, has come down towards the prior, considerably. Had we had a larger sample of data in our two groups, the movement towards the prior would have been less extreme. But, since this was a relatively small study we get more movement towards the prior, placing less confidence in the observed sample of participants we had. We also see that the standard error around the mean difference increased, since our prior SD was 0.3. So, the mean difference decreased a bit and the standard error around that mean increased a bit. Both occurrences lead to less certainty in our observed outcomes.

Compare the Observed Difference to the Bayesian Posterior Difference

  • Start by creating a Monte Carlo Simulation of the observed difference from the original data.
  • Notice that the Bayesian Posterior Difference (blue) is pulled down towards our prior, slightly.
  • We also see that the Bayesian Posterior has more overlap of 0 than the Observed Difference (red) from the original data, which had a small sample size and hadn’t combined any prior knowledge that we might have had.
N <- 10000

set.seed(8945)
observed_diff <- rnorm(n = N, mean = t_test_diff, sd = se_diff)

plot(density(observed_diff),
     lwd = 5,
     col = "red",
     main = "Comparing the Observed Difference (red)\nto the\nBayes Posterior Difference (blue)",
     xlab = "Difference in Strength Score between Groups")
lines(density(diff_posterior),
      lwd = 5,
      col = "blue")
abline(v = 0,
       lty = 2,
       lwd = 3,
       col = "black")

Wrapping Up

That’s it for part 1. We’ve computed a frequentist t-test of the observed data in our study and compared those results to Bayesian estimation, where we combined the observed data with some prior knowledge drawn from previous research and domain expertise. In this example, we used a conjugate approach and therefore were required to specify a known standard deviation for the data. Unfortunately, this may not always make sense to do, and we may instead need to estimate both the mean and standard deviation jointly. In part 2, we will discuss how to do this.

The code for this article is accessible on my GITHUB page.

If you see any code or mathematical errors, please email me.

Neural Networks….It’s regression all the way down!

Yesterday, I talked about how t-test and ANOVA are fundamentally just linear regression. But what about something more complex? What about something like a neural network?

Whenever people bring up neural networks I always say, “The most basic neural network is a sigmoid function. It’s just logistic regression!!” Of course, neural networks can get very complex and there is a lot more that can be added to them to maximize their ability to do the job. But fundamentally, they look like regression models, and when you add several hidden layers (deep learning) you end up stacking a bunch of regression models on top of each other (I know I’m oversimplifying this a little bit).
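To make that concrete, here is a tiny sketch (with made-up intercept and coefficient values) showing that passing a linear predictor through the sigmoid is exactly the logistic regression prediction:

## A single "neuron" is just a logistic regression prediction
## (made-up intercept, coefficient, and input, purely for illustration)
b0 <- -1.5
b1 <- 0.8
x  <- 2

1 / (1 + exp(-(b0 + b1 * x)))   # sigmoid applied to a linear predictor
plogis(b0 + b1 * x)             # identical value from R's logistic CDF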

Let’s see if we can build a simple neural network to prove it. As always, you can access the full code on my GITHUB page.

Loading packages, functions, and data

We will load {tidyverse} for data cleaning, {neuralnet} for building our neural network, and {mlbench} to access the Boston housing data.

I create a z-score function that will be used to standardize the features for our model. We will keep this simple and attempt to predict Boston housing prices (medv) using three features (rm, dis, indus). To read more about what these features are, type ?BostonHousing in your R console. Once we’ve selected those features out of the data set, we apply our z-score function to them.

## Load packages
library(tidyverse)
library(neuralnet)
library(mlbench)

## z-score function
z_score <- function(x){
  z <- (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
  return(z)
}

## get data
data("BostonHousing")

## z-score features
d <- BostonHousing %>%
  select(medv, rm, dis, indus) %>%
  mutate(across(.cols = rm:indus,
                ~z_score(.x),
                .names = "{.col}_z"))
  
d %>% 
  head()

Train/Test Split

There isn’t much of a train/test split here because I’m not building a full model to be tested. I’m really just trying to show how a neural network works. Thus, I’ll select the first row of data as my “test” set and retain the rest of the data for training the model.

## remove first observation for making a prediction on after training
train <- d[-1, ]
test <- d[1, ]

Neural Network Model

We build a simple model with 1 hidden layer and then plot the output. In the plot we see various numbers. The numbers in black refer to our weights and the numbers in blue refer to the biases.

## Simple neural network with one hidden layer
set.seed(9164)
fit_nn <- neuralnet(medv ~ rm_z + dis_z + indus_z,
                    data = train,
                    hidden = 1,
                    err.fct = "sse",
                    linear.output = TRUE)

## plot the neural network
plot(fit_nn)

Making Predictions — It’s linear regression all the way down!

As stated above, we have weights (black numbers) and biases (blue numbers). If we are trying to frame up the neural network as being a bunch of stacked together linear regressions then we can think about the weights as functioning like regression coefficients and the biases functioning like the linear model intercept.

Let’s take each variable from the plot and store them in their own elements so that we can apply them directly to our test observation and write out the equation by hand.

## Predictions are formed using the weights (black) and biases
# Store the weights and biases from the plot and put them into their own elements
rm_weight <- 1.09872
dis_weight <- -0.05993
indus_weight <- -0.49887

# Weight from the hidden node to the output node
hidden_weight <- 35.95032

# Biases: bias_1 feeds the hidden node, bias_2 feeds the output node
bias_1 <- -1.68717
bias_2 <- 14.85824

With everything stored, we are ready to make a prediction on the test observation.

We begin at the input layer by multiplying each z-scored value by the corresponding weight from the plot above. We sum those together and add in the first bias — just like we would with a linear regression.

# Start by applying the weights to their z-scored values, sum them together and add
# in the first bias
input <- bias_1 + test$rm_z * rm_weight + test$dis_z * dis_weight + test$indus_z * indus_weight
input

One regression down, one more to go! But before we can move to the next regression, we need to transform this input value. The neural network is using a sigmoid function to make this transformation as the input value moves through the hidden layer. So, we should apply the sigmoid function before moving on.

# transform input -- the neural network is using a sigmoid function
input_sig <- 1/(1+exp(-input))
input_sig


We take this transformed input and multiply it by the hidden weight and then add it to the second bias. This final regression equation produces the predicted value of the home.

prediction <- bias_2 + input_sig * hidden_weight
prediction


The prediction here is in the thousands, relative to census data from the 1970’s.

Let’s compare the prediction we got by hand to what we get when we run the predict() function.

## Compare the output to what we would get if we used the predict() function
predict(fit_nn, newdata = test)

Same result!

Again, if you’d like the full code, you can access it on my GITHUB page.

Wrapping Up

Yesterday we talked about always thinking regression whenever you see a t-test or ANOVA. Today, we learn that we can think regression whenever we see a neural network, as well! By stacking two regression-like equations together we produced a neural network prediction. Imagine if we stacked 20 hidden layers together!

The big take home, though, is that regression is super powerful. Fundamentally, it is the workhorse that helps to drive a number of other machine learning approaches. Imagine if you spent a few years really studying regression models? Imagine what you could learn about data analysis? If you’re up for it, one of my all time favorite books on the topic is Gelman, Hill, and Vehtari’s Regression and Other Stories. I can’t recommend it enough!

Confidence Intervals vs Prediction Intervals – A Frequentist & Bayesian Example

We often construct models for the purpose of estimating some future value or outcome. While point estimates for forecasting future outcomes are interesting, it is important to remember that the future contains a lot of uncertainty. As such, a reflection of this uncertainty is often conveyed using confidence intervals (which make a statement about the uncertainty of a population mean) or prediction intervals (which make a statement about a new/future observation for an individual within the population).

This tutorial will walk through how to calculate confidence intervals and prediction intervals by hand and then show the corresponding R functions for obtaining these values. We will finish by building the same model using a Bayesian framework and calculate the highest posterior density intervals and prediction intervals.

All of the code is accessible through my GITHUB page.

Loading Packages & Data

We will use the Lahman baseball data set. For the purposes of this example, we will constrain our data to all MLB seasons after 2009 and players who had at least 250 At Bats in those seasons.

library(tidyverse)
library(broom)
library(Lahman)

theme_set(theme_minimal())

data(Batting)

df <- Batting %>%
  filter(yearID >= 2010,
         AB >= 250)

 

EDA

Let’s explore the relationship between Hits (H) and Runs Batted In (RBI) in our data set.

 

cor.test(df$H, df$RBI)

We see a strong positive correlation, suggesting that as a player gets more hits they tend to also have more RBI’s.

A Simple Regression Model

We will build a simple model that regresses RBI’s on Hits.

rbi_fit <- df %>%
  lm(RBI ~ H, data = .)

tidy(rbi_fit)
confint(rbi_fit)

  • Using tidy() from the broom package, we see that for every additional Hit we estimate a player to increase their RBI, on average, by 0.482.
  • We use the confint() function to produce the confidence intervals around the model coefficients.

 

Estimating Uncertainty

We can use the above model to forecast the expected number of RBI’s for a player given a number of hits.

For example, a player that has 100 hits would be estimated to produce approximately 49 RBI’s.

 

hits <- 100
pred_rbi <-  round(rbi_fit$coef[1] + rbi_fit$coef[2]*hits, 1)
pred_rbi

49 RBI’s is rather precise. There is always going to be some uncertainty around a number like this. For example, a player could be a good hitter but play on a team with poor hitters who don’t get on base thus limiting the possibility for all the hits he is getting to lead to RBI’s.

There are two types of questions we may choose to answer around our prediction:

  1. Predict the average RBI’s for players who get 100 hits.
  2. Predict the average RBI’s for a single player who gets 100 hits.

The point estimate, which we calculated above, will be the same for these two questions. Where they will differ is in the uncertainty we have around that estimate. Question 1 is a statement about the population (the average RBIs for all players who had 100 hits) while question 2 is a statement about an individual (the average RBI’s for a single player). Consequently, the confidence interval that we calculate to estimate our uncertainty for question 1 will be smaller than the prediction interval that we calculate to estimate our uncertainty for question 2.

Necessary Information for Computing Confidence Intervals & Prediction Intervals

In order to calculate the confidence interval and prediction interval by hand we need the following pieces of data:

  1. The model degrees of freedom.
  2. A t-critical value corresponding to the level of uncertainty we are interested in (here we will use 95%).
  3. The average number of hits (our independent variable) observed in our data.
  4. The standard deviation of hits observed in our data.
  5. The residual standard error of our model.
  6. The total number of observations in our data set.

 

## Model degrees of freedom
model_df <- rbi_fit$df.residual

## t-critical value for a 95% level of certainty
level_of_certainty <- 0.95
alpha <- 1 - (1 - level_of_certainty) / 2
t_crit <- qt(p = alpha, df = model_df)
t_crit

## Average &amp; Standard Deviation of Hits
avg_h <- mean(df$H)
sd_h <- sd(df$H)

## Residual Standard Error of the model and Total Number of Observations
rse <- sqrt(sum(rbi_fit$residuals^2) / model_df)
N <- nrow(df)

 

Calculating the Confidence Interval

We can calculate the 95% confidence level by hand using the following equation:

CL95 = t_crit * rse * sqrt(1/N + ((hits - avg_h)^2) / ((N - 1) * sd_h^2))

 

## Calculate the 95% Confidence Limits
cl_95 <- t_crit * rse * sqrt(1/N + ((hits - avg_h)^2) / ((N-1) * sd_h^2))
cl_95

## 95% Confidence Interval
low_cl_95 <- round(pred_rbi - cl_95, 1)
high_cl_95 <- round(pred_rbi + cl_95, 1)

paste("100 Hits =", pred_rbi, "±", low_cl_95, "to", high_cl_95, sep = " ")

If we don’t want to calculate the Confidence Interval by hand we can simply use the predict() function in R. We obtain the same result.

 

round(predict(rbi_fit, newdata = data.frame(H = hits), interval = "confidence", level = 0.95), 1)

Calculating the Prediction Interval

Notice below that the prediction interval uses the same equation as the confidence interval with the exception of adding 1 before 1/N. As such, the prediction interval will be wider as there is more uncertainty when attempting to make a prediction for a single individual within the population.

PI95 = t_crit * rse * sqrt(1 + 1/N + ((hits - avg_h)^2) / ((N - 1) * sd_h^2))

 

## Calculate the 95% Prediction Limit
pi_95 <- t_crit * rse * sqrt(1 + 1/N + ((hits - avg_h)^2) / ((N-1) * sd_h^2))
pi_95

## Calculate the 95% Prediction interval
low_pi_95 <- round(pred_rbi - pi_95, 1)
high_pi_95 <- round(pred_rbi + pi_95, 1)

paste("100 Hits =", pred_rbi, "±", low_pi_95, "to", high_pi_95, sep = " ")

Again, if we don’t want to do the math by hand, we can simply use the predict() function and change the interval argument from confidence to prediction.

round(predict(rbi_fit, newdata = data.frame(H = hits), interval = "prediction", level = 0.95), 1)

Visualizing the 95% Confidence Interval and Prediction Interval

Instead of deriving the point estimate, confidence interval, and prediction interval for a single observation of hits, let’s derive them across the full range of hits and then plot the intervals over our data to see what they look like.

We could simply use ggplot2 to draw a regression line with 95% confidence intervals over top of our data.
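The plotting code lives in the linked GitHub repository; a minimal sketch of this first figure (my own reconstruction, not the exact repository code) might look like:

## Hedged sketch: regression line with the default 95% confidence band
df %>%
  ggplot(aes(x = H, y = RBI)) +
  geom_point(color = "lightgrey", alpha = 0.5) +
  geom_smooth(method = "lm", level = 0.95)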

Notice how narrow the 95% confidence interval is around the regression line. The light grey shaded region is barely visible because the interval is so tight.

While plotting the data like this is simple, it might help us understand what is going on better if we constructed our own data frame of predictions and intervals and then overlaid those onto the original data set.

We begin by creating a data frame that has a single column, H, representing the range of hits observed in our data set. We then can create predictions and our 95% Prediction and Confidence Intervals across the vector of hits data that we just created.
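A sketch of that construction (again my own reconstruction, with green for the confidence ribbon and light blue for the prediction ribbon to match the figure described below):

## Hedged sketch: predictions plus 95% confidence and prediction intervals across the range of hits
hit_seq <- data.frame(H = seq(from = min(df$H), to = max(df$H), by = 1))

conf_int <- as.data.frame(predict(rbi_fit, newdata = hit_seq, interval = "confidence", level = 0.95))
pred_int <- as.data.frame(predict(rbi_fit, newdata = hit_seq, interval = "prediction", level = 0.95))

plot_df <- hit_seq %>%
  mutate(fit = conf_int$fit,
         cl_low = conf_int$lwr, cl_high = conf_int$upr,
         pi_low = pred_int$lwr, pi_high = pred_int$upr)

plot_df %>%
  ggplot(aes(x = H)) +
  geom_ribbon(aes(ymin = pi_low, ymax = pi_high), fill = "lightblue", alpha = 0.5) +
  geom_ribbon(aes(ymin = cl_low, ymax = cl_high), fill = "green", alpha = 0.5) +
  geom_point(data = df, aes(x = H, y = RBI), color = "lightgrey") +
  geom_line(aes(y = fit), size = 1.2) +
  labs(x = "Hits", y = "RBI")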

Similar to our first plot, we see that the 95% confidence interval (green) is very tight to the regression line (black). Also, the prediction interval (light blue) is substantially larger than the confidence interval across the entire range of values.

 

A Bayesian Perspective

We now take a Bayesian approach to constructing intervals. First, we need to specify our regression model. To do so I will be using the rethinking package which serves as a supplement to Richard McElreath’s brilliant textbook, Statistical Rethinking: A Bayesian Course with Examples in R and Stan.

To create our Bayesian model we will need to specify a few priors. Before specifying these, however, I am going to mean center the independent variable, Hits, and create a new variable called hits_c. Doing so helps with interpretation of the model output and it should be noted that the intercept will now reflect the expected value of RBIs when there is an average number of hits (which would be hits_c = 0).

For my priors:

  • Intercept = N(70, 25) — represents my prior belief of the number of RBI’s when a batter has an average number of hits.
  • Beta Coefficient = N(0, 20) — This prior represents the amount of change in RBI for a unit change in a batter’s hits relative to the average number of hits for the population.
  • Sigma = Unif(0, 20) — This represents my prior for the model standard error.

 

library(rethinking)

## What is the average number of hits?
mean(df$H)

## Mean center hits
df <- df %>%
  mutate(hits_c = H - mean(H))

## Fit the linear model and specify the priors
rbi_bayes <- map(
  alist(
    RBI ~ dnorm(mu, sigma),     # dependent variable (RBI)
    mu <- a + b*hits_c,				  # linear model regressing RBI on H
    a ~ dnorm(70, 25),          # prior for the intercept
    b ~ dnorm(0, 20),           # prior for the beta coefficient
    sigma ~ dunif(0, 20)),      # prior for the model standard error
  data = df)

## model output
precis(rbi_bayes)

Predict the number of RBI for our batter with 100 hits

The nice part of the Bayesian framework is that we can sample from the posterior distribution of our model and build simulations.

We start by extracting samples from our model fit. We can also then plot these samples and get a clearer understanding of the certainty we have around our model output.

 

coef_sample <- extract.samples(rbi_bayes)
coef_sample[1:10, ]

## plot the beta coefficient
hist(coef_sample$b,
     main = "Posterior Samples of Hits Centered Coefficient\nFrom Bayesian Model",
     xlab = "Hits Centered Coefficient")

 

Our batter had 100 hits. Remember, since our model independent variable is now centered to the population mean we will need to center the batter’s 100 hits to that value prior to using our model to predict the number of RBI’s we’d expect. We will then plot the full sample of the predicted number of RBI’s to reflect our uncertainty around the batter’s performance.

 

## Hypothetical batter with 100 hits
hits <- 100
hits_centered <- hits - mean(df$H)

## Predict RBI total from the model
rbi_sample <- coef_sample$a + coef_sample$b * hits_centered

## Plot the posterior sample
hist(rbi_sample,
     main = "Posterior Prediction of RBI for batter with 100 hits",
     xlab = "Predicted RBI Total")

 

Highest Posterior Density Interval (HPDI)

We now create an interval around the RBI samples. Bayesians discuss intervals around a point estimate under a variety of names (e.g., credible intervals, quantile intervals, highest posterior density intervals, etc.). Here, we will use Richard McElreath’s HPDI() function to calculate the highest posterior density interval. I’ll also use qnorm() to show how to obtain the 95% interval of the sample.

 

## HPDI
round(HPDI(sample = rbi_sample, prob = 0.95), 1)

## 95% Interval
round(qnorm(p = c(0.025, 0.975), mean = mean(rbi_sample), sd = sd(rbi_sample)), 1)

Prediction Interval

We can also use the samples to create a prediction interval for a single batter (remember, this will be larger than the confidence interval). To do so, we first simulate from our Bayesian model for the number of hits the batter had (100) and center that to the population average for hits.

 

## simulate from the posterior
rbi_sim <- sim(rbi_bayes, data = data.frame(hits_c = hits_centered))

## 95% Prediction interval
PI(rbi_sim, prob = 0.95)

Plotting Intervals Across All of the Data

Finally, we will plot our confidence intervals and prediction intervals across the range of our entire data. We need to make sure to center the vector of hits that we create to the mean of hits in our original data set.

 

## Create a vector of hits
hit_df <- data.frame(H = seq(from = min(df$H), to = max(df$H), by = 1))

## Center the hits to the average number of hits in the original data
hit_df <- hit_df %>%
  mutate(hits_c = H - mean(df$H))

## Use link() to calculate a mean value across all simulated values from the posterior distribution of the model
mus <- link(rbi_bayes, data = hit_df)

## get the average and highest density interval from our mus
mu_avg <- apply(X = mus, MARGIN = 2, FUN = mean)
mu_hdpi <- apply(mus, MARGIN = 2, FUN = HPDI, prob = 0.95)

## simulate from the posterior across our vector of centered hits
sim_rbi_range <- sim(rbi_bayes, data = hit_df)

## calculate the prediction interval over this simulation
rbi_pi <- apply(sim_rbi_range, 2, PI, prob = 0.95)


## create a plot of confidence intervals and prediction intervals
plot(df$H, df$RBI, pch = 19, 
     col = "light grey",
     main = "RBI ~ H",
     xlab = "Hits",
     ylab = "RBI")
shade(rbi_pi, hit_df$H, col = col.alpha("light blue", 0.5))
shade(mu_hdpi, hit_df$H, col = "green")
lines(hit_df$H, mu_avg, col = "red", lwd = 3)

 

Similar to our frequentist intervals, we see that the 95% HPDI (green) is very narrow and close to the regression line (red). We also see that the 95% prediction interval (light blue) is much larger than the HPDI, as it accounts for more uncertainty when attempting to make a prediction for a single individual within the population.

Wrapping Up

When making predictions it is always important to reflect the uncertainty that you have around your model’s outcome. Confidence intervals and prediction intervals are two ways to present your uncertainty for two different objectives. Confidence intervals say something about the uncertainty for the population average while prediction intervals attempt to provide uncertainty for an individual within the population. As a consequence, prediction intervals end up being wider than confidence intervals. The approach to sharing your model’s uncertainty can be done from either a frequentist or Bayesian perspective.

To access all of the code for this blog article, CLICK HERE.

 

Using Beta-Binomial Regression to Set Priors for Different Sample Sizes

In a prior post I explained an approach for using Bayes to estimate a player’s 3pt% based on prior knowledge of 3pt success in NBA players. This approach took advantage of the beta-binomial conjugate.

In that post, I constrained our analysis to only players that had 200 or more 3pt attempts during the course of a season. But, what if we don’t want to only focus on players that obtained a certain number of 3 point attempts? What about players who only took 100 attempts? 10 attempts? 2 attempts?! What can be said about their performance?

Today, we will discuss the approach of using beta-binomial regression to first establish a prior 3pt%, based on the number of 3pt attempts the player shot (sample size) and then update that prior based on the success that the player had in those attempts.

The code for this article is available on my GITHUB page.

A few references of books that I’ve found useful for building Bayesian models are at the end of this post. The approach here was inspired by Chapter 7 of David Robinson’s fantastic book, Introduction to Empirical Bayes: Examples from Baseball Statistics.

The Data

We will use the three point attempts data for all players in the 2022 NBA season. Data was scraped from basketball-reference.com. In total, there are 740 rows of data. Here is what the first four look like:

Plotting the Data

To help wrap our heads around the relationship between 3pt% and 3pt attempts, we will build a simple plot.
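A minimal sketch of that plot (my reconstruction; it assumes the scraped data live in a data frame called tbl2022 with columns three_pt_att and three_pt_made, the names used later in the post):

library(tidyverse)

## Hedged sketch of 3pt% vs. 3pt attempts with a simple regression line
tbl2022 %>%
  mutate(three_pt_pct = three_pt_made / three_pt_att) %>%
  ggplot(aes(x = three_pt_att, y = three_pt_pct)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm") +
  labs(x = "3pt Attempts", y = "3pt%")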

Notice that the number of 3 point attempts has some influence on 3pt%. First, as 3pt attempts increase, so does 3pt%. This is because better 3pt shooters will take more 3pt shots and, because they are good, their teams will also try to put them in position to take these shots. Additionally, we can see that as 3pt attempts increase, the amount of variance in performance decreases. Players with under 100 3pt attempts are spread relatively widely around the regression line.

Bayesian Shrinkage using the Beta-Binomial Conjugate

As a review from the prior blog article on this topic, we can use our knowledge of the 3pt% success of NBA players as a prior to help shrink observations towards the “expected” outcome. The amount of shrinkage a player exhibits will be dependent on the number of 3pt attempts they have. A smaller sample size means we have less confidence in the observed performance and thus greater shrinkage towards the population prior. Conversely, a large sample size means more confidence in the observed performance and therefore much less shrinkage.

From the prior article we estimated the alpha and beta parameters for our beta distribution to be 61.8 and 106.2, respectively. Recall that these values were estimated from the prior 2 seasons and using only those players with 200 or more 3pt attempts. The alpha and beta parameters provide us with a prior mean for NBA 3pt% of 36.8%.

We will apply this prior knowledge to the observations of all players in our 2022 data set by using the beta-binomial conjugate.

alpha <- 61.8
beta <- 106.2

prior_mu <- alpha / (alpha + beta)
prior_mu

tbl2022 <- tbl2022 %>%
  mutate(three_pt_missed = three_pt_att - three_pt_made,
         posterior_alpha = three_pt_made + alpha,
         posterior_beta = three_pt_missed + beta,
         posterior_three_pt_pct = posterior_alpha / (posterior_alpha + posterior_beta),
         posterior_three_pt_sd = sqrt((posterior_alpha * posterior_beta) / ((posterior_alpha + posterior_beta)^2 * (posterior_alpha + posterior_beta + 1))))

 

Next, we create a plot to see how the beta-binomial posterior for each player looks relative to the raw data (plotted on the left):
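A sketch of that side-by-side figure (my reconstruction, using {patchwork} to arrange the panels; the dashed line marks the prior mean of 36.8%):

library(patchwork)

## Hedged sketch: raw 3pt% (left) vs. beta-binomial posterior 3pt% (right)
p_raw <- tbl2022 %>%
  mutate(three_pt_pct = three_pt_made / three_pt_att) %>%
  ggplot(aes(x = three_pt_att, y = three_pt_pct)) +
  geom_point(alpha = 0.5) +
  geom_hline(yintercept = prior_mu, linetype = "dashed") +
  labs(title = "Raw 3pt%", x = "3pt Attempts", y = "3pt%")

p_posterior <- tbl2022 %>%
  ggplot(aes(x = three_pt_att, y = posterior_three_pt_pct)) +
  geom_point(alpha = 0.5) +
  geom_hline(yintercept = prior_mu, linetype = "dashed") +
  labs(title = "Posterior 3pt%", x = "3pt Attempts", y = "3pt%")

p_raw | p_posterior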

 

Combining our prior and observed values, we can see (on the right) that the data are now constrained around the prior (36.8%). Players with a small number of attempts (toward the left of the x-axis) are pulled nearest to the prior while players with a larger number of attempts (toward the right) are less influenced by the prior and tend to remain closer to their observed performance.

The problem is that the prior mean (36.8%) is too high for the players with a small number of 3pt attempts. Surely we wouldn’t want to assume that their performance is close to the prior for players that had over 200 attempts! For example, the players with under 50 attempts have an observed average 3pt% of 30% and a median 3pt% of 25% (just over 10% less than those who had 200 or more attempts!).
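Those numbers for the under-50-attempt group can be checked with a quick summary (a small sketch using the same column names as above):

## Observed 3pt% for players with fewer than 50 attempts
tbl2022 %>%
  filter(three_pt_att < 50) %>%
  summarize(N = n(),
            avg_3pt_pct = mean(three_pt_made / three_pt_att),
            median_3pt_pct = median(three_pt_made / three_pt_att))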

To deal with this issue we need to account for shot attempts first, so that we can estimate players with smaller sample sizes relative to a prior performance that is more appropriate for them (i.e., a prior that is lower than the one currently being assumed by our alpha and beta parameters).

Accounting for 3pt Shot Attempts

Our outcome variable is binomial (success and failures) so we will use a beta-binomial regression to estimate a prior for 3pt% while controlling for 3pt shot attempts.

suppressPackageStartupMessages({
  suppressWarnings({
    library(gamlss)
  })
})

fit_3pt <- gamlss(cbind(three_pt_made, three_pt_missed) ~ log(three_pt_att),
                  data = tbl2022,
                  family = BB(mu.link = "identity"))

## extract model coefficients
fit_3pt$mu.coefficients
fit_3pt$sigma.coefficients

 

Now we can use these model coefficients to fit an estimated 3pt% for each player based on their number of 3pt attempts. This estimate will serve as our prior, which we will then turn into a new posterior 3pt% for each player using our beta-binomial approach. Note that while the prior mean 3pt% for each player will vary depending on shot attempts, the population sigma will be constant for all athletes, representing the variance that we expect everyone in the population to similarly exhibit.

tbl2022 <- tbl2022 %>%
  mutate(mu = fitted(fit_3pt, parameter = "mu"),
         sigma = fitted(fit_3pt, parameter = "sigma"),
         prior_alpha_reg = mu / sigma,
         prior_beta_reg = (1 - mu) / sigma,
         posterior_alpha_reg = prior_alpha_reg + three_pt_made,
         posterior_beta_reg = prior_beta_reg + three_pt_missed,
         posterior_mu_reg = posterior_alpha_reg / (posterior_alpha_reg + posterior_beta_reg))

 

We now have two estimates of each player’s 3pt%. One that was calculated using the beta-binomial conjugate with a prior of 36.8% (the average 3pt% for all shooters with 200 or more 3pt shots). The second estimate first establishes a prior based on the number of 3pt shots the player has taken and then updates that prior based on the individual player’s performance in those shots. We can plot the relationship between these two.
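A sketch of that comparison (my reconstruction; the red identity line is referenced in the next paragraph):

## Hedged sketch: comparing the two posterior estimates for each player
tbl2022 %>%
  ggplot(aes(x = posterior_three_pt_pct, y = posterior_mu_reg)) +
  geom_point(alpha = 0.5) +
  geom_abline(intercept = 0, slope = 1, color = "red", size = 1.2) +
  labs(x = "Posterior 3pt% (single 36.8% prior)",
       y = "Posterior 3pt% (beta-binomial regression prior)")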

 


Notice that those with more 3 point attempts are close to the red line (intercept = 0, slope = 1), representing perfect agreement between the two estimates, while those with fewer attempts are pulled further down, indicating that we estimate them to be poorer 3pt shooters.

Finally, we can compare the results from our beta-binomial regression prior with our other two estimates of performance (raw observations and our prior of 36.8%).

 

In the right most plot (beta-binomial regression prior) we see those with a small number of 3pt attempts are shrunk to a smaller prior 3pt% than those with a larger number of 3pt attempts.

 

Making an estimation for a new player

We can use this approach to estimate the performance of a new player, as well.

new_player <- data.frame(
  three_pt_att = 10,
  three_pt_made = 2,
  three_pt_missed = 10 - 2,
  three_pt_pct = 2 / 10
)

new_player %>%
  mutate(mu = predict(fit_3pt, newdata = new_player),
         sigma = exp(fit_3pt$sigma.coefficients),
         prior_alpha = mu / sigma,
         prior_beta = (1 - mu) / sigma,
         posterior_alpha = prior_alpha + three_pt_made,
         posterior_beta = prior_beta + three_pt_missed,
         posterior_mu = posterior_alpha / (posterior_alpha + posterior_beta)) %>%
  pivot_longer(cols = everything())

 

The new player took 10 three point shots, made 2, and has an observed 3pt% of 20%. Using our beta-binomial regression model we estimate the prior for a player with 10 attempts to be 0.274. Combining the prior with the 10 attempts we get a posterior 3pt% for the player of 0.272 (slightly below the average for the population of players who had 10 attempts).

Useful Resources

Some textbooks that I’ve found useful for exploring this type of work: