Category Archives: Sports Analytics

Visualizing Group Changes

CJ Mayes recently posted some really nice plots for visualizing group differences to Twitter.

My personal favorite was the bottom right plot, which can be a nice way of visualizing pre and post changes in a study. I believe the original plots were done in Tableau, so I’ve gone ahead and reproduced that bottom right plot in R.

You can grab the full code, all in one piece, from my GitHub page.

Simulate Some Data

library(tidyverse)
library(gghalves)
theme_set(theme_light())

set.seed(1234)
dat <- tibble(subject = LETTERS[1:26],
              pre = rnorm(n = 26, mean = 10, sd = 3)) %>%
  mutate(post = pre + rnorm(n = 26, mean = 0, sd = 2))

dat_long <- dat %>%
  pivot_longer(cols = -subject) %>%
  mutate(name = factor(name, levels = c("pre", "post"))) 

Create the plot

dat_long %>%
  ggplot(aes(x = name, y = value)) +
  geom_line(aes(group = subject),
            color = "light grey",
            size = 0.7) +
  geom_point(aes(group = subject,
                 color = name),
             alpha = 0.7,
             size = 2) +
  geom_half_violin(aes(x = name),
                   data = dat_long %>% filter(name == 'pre'),
                   fill = "light grey",
                   color = "light grey",
                   side = 'l',
                   alpha = 0.7) +
  geom_half_violin(aes(x = name),
                   data = dat_long %>% filter(name == 'post'),
                   fill = "palegreen",
                   color = "palegreen",
                   side = 'r',
                   alpha = 0.7) +
  scale_color_manual(values = c("pre" = "light grey", "post" = "palegreen")) +
  labs(x = "Test Time",
       y = NULL,
       title = "Changes from Pre to Post") +
  theme(axis.text = element_text(face = "bold", size = 12),
        axis.title = element_text(face = "bold", size = 15),
        plot.title = element_text(size = 18),
        legend.position = "none")


Visualizing Vald Force Frame Data

Recently, in a sport science chat group on Facebook, someone asked for an example of how other practitioners are visualizing their Vald Force Frame data. For those who are unfamiliar, Vald is the company that originally pioneered the Nordbord for eccentric hamstring strength testing. The Force Frame is their latest technological offering, designed to help practitioners test abduction and adduction of the hips for performance and return-to-play purposes.

The Data

I’ve never used the Force Frame personally, so the person asking provided a screenshot of what the data looks like. Using that example, I simulated a small data set to try and create a visual that might be useful for practitioners. Briefly, the data is structured as two rows per athlete: one row for the squeeze (adduction) force output and one row for the pull (abduction) force output. My simulated data looks like this:

### Vald Force Frame Visual
library(tidyverse)

set.seed(678)
dat <- tibble(player = rep(c("Tom", "Alan", "Karl", "Sam"), each = 2),
              test = rep(c("Pull", "Squeeze"), times = 4)) %>%
  mutate(l_force = ifelse(test == "Pull", rnorm(n = nrow(.), mean = 305, sd = 30),
                          rnorm(n = nrow(.), mean = 360, sd = 30)),
         r_force = ifelse(test == "Pull", rnorm(n = nrow(.), mean = 305, sd = 30),
                          rnorm(n = nrow(.), mean = 360, sd = 30)),
         pct_imbalance = ifelse(l_force > r_force, ((l_force - r_force) / l_force) * -1,
                                (r_force - l_force) / r_force))

In this simplified data frame we see the two rows per athlete with left and right force outputs. I’ve also calculated the Bilateral Strength Asymmetry (BSA), indicated in the percent imbalance column. This measure, as well as several other measures of asymmetry, was reported by Bishop et al. (2016) in their paper on calculating asymmetries. The equation is as follows:

BSA = (Stronger Limb – Weaker Limb) / Stronger Limb

Additionally, if the left leg was stronger than the right I multiplied the BSA by -1, so that the direction of asymmetry (the stronger limb) can be reflected in the plot.
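
As a quick worked example, with made-up numbers: if an athlete produces 320 on the left and 300 on the right, the left side is stronger, so the signed BSA works out to -(320 - 300) / 320 = -0.0625, and the bar will point toward the left side of the plot.

## Worked example with made-up force values
l_force <- 320
r_force <- 300

ifelse(l_force > r_force, ((l_force - r_force) / l_force) * -1,
       (r_force - l_force) / r_force)   # -0.0625, a 6.25% asymmetry favoring the left side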

The Plot

Prior to plotting, I set some theme elements that allow me to style the axis labels, axis text, plot title and subtitle, and the headers of the two facets of tests that we have (one for pull and one for squeeze). Additionally, in order to make the plot look cleaner, I get rid of the legend since it didn’t offer any new information that couldn’t be directly interpreted by looking at the visual.

Before creating the visual, I add a few variables that give more context to the numbers. The goal of this visual is to help practitioners look at the test results for a larger group of athletes and quickly identify those who may require specific consideration. Therefore, I create a text string that captures the left and right force outputs, so they can be printed directly onto the plot, along with a corresponding “flag” variable that indicates when an athlete may be below some normalized strength benchmark for the respective test. Finally, I create an asymmetry flag to indicate when an athlete has a BSA that exceeds 10% in either direction. This threshold can (and should) be whatever is meaningful and important for your athletes and your sport.

For the plot itself, I decided to plot the BSA values for both tests, since practitioners would find them valuable and the direction of asymmetry is easy to read at a glance. Remember, the direction the bar points represents the stronger limb. To provide context for the asymmetry direction, I created a shaded normative range; whenever a bar extends outside this range it turns red, and when it stays inside the range it remains green. To show the raw force values, I add them to the plot as labels in the middle of each plot region for each athlete. If an athlete is flagged as having a force output for either leg below the predetermined threshold, the label text turns red.


theme_set(theme_classic() +
          theme(strip.background = element_rect(fill = "black"),
                strip.text = element_text(size = 13, face = "bold", color = "white"),
                axis.text = element_text(size = 13, face = "bold"),
                axis.title.x = element_text(size = 14, face = "bold"),
                plot.title = element_text(size = 18),
                plot.subtitle = element_text(size = 14),
                legend.position = "none"))

dat %>%
  mutate(left = paste("Left =", round(l_force, 1), sep = " "),
         right = paste("Right =", round(r_force, 1), sep = " "),
         l_r = paste(left, right, sep = "\n"),
         asym_flag = ifelse(abs(pct_imbalance) > 0.1, "flag", "no flag"),
         weakness_flag = ifelse((test == "Pull" & (l_force < 250 | r_force < 250)) |
                                 (test == "Squeeze" & (l_force < 330 | r_force < 330)), "flag", "no flag")) %>%
  ggplot(aes(x = pct_imbalance, y = player)) +
  geom_rect(aes(xmin = -0.1, xmax = 0.1),
            ymin = 0,
            ymax = Inf,
            fill = "light grey",
            alpha = 0.3) +
  geom_col(aes(fill = asym_flag),
           alpha = 0.6,
           color = "black") +
  geom_vline(xintercept = 0, 
             size = 1.3) +
  annotate(geom = "text",
           x = -0.2, 
           y = 4.5,
           size = 6,
           label = "Left") +
  annotate(geom = "text",
           x = 0.2, 
           y = 4.5,
           size = 6,
           label = "Right") +
  geom_label(aes(x = 0, y = player, label = l_r, color = weakness_flag)) +
  scale_color_manual(values = c("flag" = "red", "no flag" = "black")) +
  scale_fill_manual(values = c("flag" = "red", "no flag" = "palegreen")) +
  labs(x = "% Imbalance",
       y = NULL,
       title = "Force Frame Team Testing",
       subtitle = "Pull Weakness < 250 | Squeeze Weakness < 330") +
  facet_grid(~test) +
  scale_x_continuous(labels = scales::percent_format(accuracy = 0.1),
                     limits = c(-0.3, 0.3))


At a quick glance, we notice that three of the athletes exhibit a strength asymmetry for Pull and two exhibit a strength asymmetry for Squeeze. Additionally, Karl is below the strength threshold for both Pull and Squeeze, while Sam is below the threshold for Squeeze only.

Wrapping Up

There are a lot of ways to visualize this sort of single day testing data. This is just one example and would be different if we were trying to visualize serial measurements, where we are tracking changes over time. Hopefully this short example provides some ideas. If you’d like to play around with the code and adapt it to your own athletes, you can access it on my GitHub page.

Two Group Comparison – Frequentist vs Bayes – Part 1

One of the more common types of analysis conducted in science is the comparison of two groups (e.g., a treatment group and a control group) to identify whether an intervention had a desired effect. The problem, in sport science and other fields, is often a lack of participants (small sample sizes) and an inability to combine prior knowledge (e.g., outcomes from previous research or domain expertise) with the observed data in order to make a broader statement about our findings (i.e., a Bayesian approach). Consequently, this leaves us analyzing the data on hand and reporting whatever potential findings shake out.

This tutorial will step through a two group comparison of some simulated data from both a frequentist (t-test) and Bayesian approach.

All code is accessible on my GITHUB page.

Two Groups – Simulated Data

We are conducting a study on the effect of a new strength training program. Participants are randomly allocated into a control group (n = 17), receiving a normal training program, or an experimental group (n = 22), receiving the new training program.

The data consists of the change in strength score observed for each participant in their respective groups. (NOTE: For the purposes of this example, the strength score is a made-up number describing a participant’s strength.)

Simulate the data

library(tidyverse)
library(patchwork)

theme_set(theme_classic())

set.seed(6677)
df <- data.frame(subject = 1:39,
                 group = c(rep("control", times = 17), rep("experimental", times = 22)),
                 strength_score_change = c(round(rnorm(n = 17, mean = 0, sd = 0.5), 1),
                                           round(rnorm(n = 22, mean = 0.5, sd = 0.5), 1))) %>%
  mutate(group = factor(group, levels = c("experimental", "control")))


Summary Statistics

df %>%
  group_by(group) %>%
  summarize(N = n(),
            Avg = mean(strength_score_change),
            SD = sd(strength_score_change),
            SE = SD / sqrt(N))


It looks like the new strength training program led to a greater improvement in strength, on average, than the normal strength training program.

Plot the data

Density plots of the two distributions

df %>%
  ggplot(aes(x = strength_score_change, fill = group)) +
  geom_density(alpha = 0.2) +
  xlim(-2, 2.5)


Plot the means and 95% CI to compare visually

df %>%
  group_by(group) %>%
  summarize(N = n(),
            Avg = mean(strength_score_change),
            SD = sd(strength_score_change),
            SE = SD / sqrt(N)) %>%
  ggplot(aes(x = group, y = Avg)) +
  geom_hline(yintercept = 0,
             size = 1,
             linetype = "dashed") +
  geom_point(size = 5) +
  geom_errorbar(aes(ymin = Avg - 1.96 * SE, ymax = Avg + 1.96 * SE),
                width = 0,
                size = 1.4) +
  theme(axis.text = element_text(face = "bold", size = 13),
        axis.title = element_text(face = "bold", size = 17))

Plot the 95% CI for the difference in means

df %>%
  mutate(group = factor(group, levels = c("control", "experimental"))) %>%
  group_by(group) %>%
  summarize(N = n(),
            Avg = mean(strength_score_change),
            SD = sd(strength_score_change),
            SE = SD / sqrt(N),
            SE2 = SE^2,
            .groups = "drop") %>%
  summarize(diff = diff(Avg),
            se_diff = sqrt(sum(SE2))) %>%
  mutate(group = 'Group\nDifference') %>%
  ggplot(aes(x = diff, y = 'Group\nDifference')) +
  geom_vline(aes(xintercept = 0),
             linetype = "dashed",
             size = 1.2) +
  geom_point(size = 5) +
  geom_errorbar(aes(xmin = diff - 1.96 * se_diff,
                    xmax = diff + 1.96 * se_diff),
                width = 0,
                size = 1.3) +
  xlim(-0.8, 0.8) +
  labs(y = NULL,
       x = "Average Difference in Strength Score",
       title = "Difference in Strength Score",
       subtitle = "Experimental - Control (Mean Difference ± 95% CI)") +
  theme(axis.text = element_text(face = "bold", size = 13),
        axis.title = element_text(face = "bold", size = 17),
        plot.title = element_text(size = 20),
        plot.subtitle = element_text(size = 18))


Initial Observations

  • We can see that the two distributions have a decent amount of overlap
  • The mean change in strength for the experimental group appears to be larger than the mean change in strength for the control group
  • When visually comparing the group means and their associated confidence intervals, there is an observable difference. Additionally, looking at the mean difference plot, that difference appears to be significant based on the fact that the confidence interval does not cross zero.

So, it seems like something is going on here with the new strength training program. Let’s do some testing to see just how large the difference is between our two groups.

Group Comparison: T-test

First, let’s use a frequentist approach to evaluating the difference in strength score change between the two groups. We will use a t-test to compare the mean differences.

diff_t_test <- t.test(strength_score_change ~ group,
       data = df,
       alternative = "two.sided")

## store some of the main values in their own elements to use later
t_test_diff  <- as.vector(round(diff_t_test$estimate[1] - diff_t_test$estimate[2], 2))

se_diff <- 0.155    # calculated above when creating the plot

diff_t_test

  • The test is statistically significant, as the p-value is less than the magic 0.05 (p = 0.01) with a mean difference in strength score of 0.41 and a 95% Confidence Interval of 0.1 to 0.73.

Here, the null hypothesis was that the difference in strength improvement between the two groups is 0 (no difference), while the alternative hypothesis is that the difference is not 0. In this case, we did a two-sided test, so we are stating that we want to know whether the change in strength from the new training program is either better or worse than the normal training program. If we were interested in testing the difference in a specific direction, we could have done a one-sided test in the direction of our hypothesis, for example, testing whether the new training program is better than the normal program.
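
For example, a one-sided version of the test is a one-line change (a quick sketch; note that the direction of the alternative depends on the factor level ordering of group, which we set with experimental listed first):

## One-sided test: is the mean change in the experimental group greater than in the control group?
t.test(strength_score_change ~ group,
       data = df,
       alternative = "greater")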

But what if we just got lucky with our sample in the experimental group, and they happened to respond really well to the new program, and got unlucky in our control group, and they happened to not change as much in response to the normal training program? This could certainly happen. Also, had we collected data on a few more participants, it is possible that the group means could have changed and no effect would have been observed.

The t-test only allows us to compare the group means. It doesn’t tell us anything about the differences in variances between groups (we could do an F-test for that). The other issue is that rarely is the difference between two groups exactly 0. It makes more sense for us to compare a full distribution of the data and make an inference about how much difference there is between the two groups. Finally, we are only dealing with the data we have on hand (the data collected for our study). What if we want to incorporate prior information/knowledge into the analysis? What if we collect data on a few more participants and want to add that to what we have already observed to see how it changes the distribution of our data? For that, we can use a Bayesian approach.
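
As a quick aside, base R’s var.test() runs the F-test for comparing the two group variances mentioned above. A minimal sketch:

## F-test comparing the variances of the two groups
var.test(strength_score_change ~ group, data = df)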

Group Comparison: Bayes

The two approaches we will use are:

  1. Group comparison with a known variance, using conjugate priors
  2. Group comparison without a known variance, estimating the mean and SD distributions jointly

Part 1: Group Comparison with known variance

First, we need to set some priors. Let’s say that we are skeptical about the effect of the new strength training program. It is a new training stimulus that the subjects have never been exposed to, so perhaps we believe that it will improve strength, but not much more than the traditional program. We will set our prior belief about the mean improvement of any strength training program to be 0.1, with a standard deviation of 0.3, for our population. Thus, we will look to combine this average/expected improvement (prior knowledge) with the improvements we observe in our respective groups. Finally, we convert the standard deviation of the mean to precision, calculated as 1/sd^2, for our equations going forward.

prior_mu <- 0.1
prior_sd <- 0.3
prior_var <- prior_sd^2
prior_precision <- 1 / prior_var

Plot the prior distribution for the mean difference

hist(rnorm(n = 1e6, mean = prior_mu, sd = prior_sd),
     main = "Prior Distribution for difference in strength change\nbetween control and experimental groups",
     xlab = NULL)
abline(v = 0,
       col = "red",
       lwd = 5,
       lty = 2)

Next, to make this conjugate approach work we need to have a known standard deviation. In this example, we will not be estimating the joint posterior distributions (mean and SD). Rather, we are saying that we are interested in knowing the mean and the variability around it but we are going to have a fixed/known SD.

Let’s say we looked at some previously published scientific literature, and also tried out the new program in a small pilot study of athletes, and found that the improvements from a strength training program are normally distributed with a known SD of 0.6.

known_sd <- 0.6
known_var <- known_sd^2

Finally, let’s store the summary statistics for each group in their own elements. We can type these directly in from our summary table.

df %>%
  group_by(group) %>%
  summarize(N = n(),
            Avg = mean(strength_score_change),
            SD = sd(strength_score_change),
            SE = SD / sqrt(N))

experimental_N <- 22
experimental_mu <- 0.423
experimental_sd <- 0.494
experimental_var <- experimental_sd^2
experimental_precision <- 1 / experimental_var

control_N <- 17
control_mu <- 0.0118
control_sd <- 0.469
control_var <- control_sd^2
control_precision <- 1 / control_var

Now we are ready to update the observed study data for each group with our prior information and obtain posterior estimates. We will use the updating rules provided by William Bolstad in Chapter 13 of his book, Introduction to Bayesian Statistics, 2nd Ed.
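
Written out, the updating rules that the code below applies to each group are:

Posterior Precision = Prior Precision + n / Known Variance

Posterior Mean = (Prior Precision / Posterior Precision) * Prior Mean + ((n / Known Variance) / Posterior Precision) * Observed Group Mean

where n is the group’s sample size and the posterior SD is sqrt(1 / Posterior Precision).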

##### Update the control group observations ######
## SD
posterior_precision_control <- prior_precision + control_N / known_var
posterior_var_control <- 1 / posterior_precision_control
posterior_sd_control <- sqrt(posterior_var_control)

## mean
posterior_mu_control <- prior_precision / posterior_precision_control * prior_mu + (control_N / known_var) / posterior_precision_control * control_mu

posterior_mu_control
posterior_sd_control

##### Update the experimental group observations ######
## SD
posterior_precision_experimental <- prior_precision + experimental_N / known_var
posterior_var_experimental <- 1 / posterior_precision_experimental
posterior_sd_experimental <- sqrt(posterior_var_experimental)

## mean
posterior_mu_experimental <- prior_precision / posterior_precision_experimental * prior_mu + (experimental_N / known_var) / posterior_precision_experimental * experimental_mu

posterior_mu_experimental
posterior_sd_experimental

Compare the posterior difference in strength change

# mean
mu_diff <- posterior_mu_experimental - posterior_mu_control
mu_diff

# sd = sqrt(var1 + var2)
sd_diff <- sqrt(posterior_var_experimental + posterior_var_control)
sd_diff

# 95% Credible Interval
mu_diff + qnorm(p = c(0.025, 0.975)) * sd_diff

  • Combining the observations for each group with our prior, it appears that the new program was more effective than the normal program on average (0.34 difference in change score between experimental and control) but the credible interval suggests the data is consistent with the new program having no extra benefit [95% Credible Interval: -0.0003 to 0.69].

Next, we can take the posterior parameters and perform a Monte Carlo Simulation to compare the posterior distributions.

Monte Carlo Simulation is an approach that uses random sampling from a defined distribution to solve a problem. Here, we will use Monte Carlo Simulation to sample from the normal distributions of the control and experimental groups as well as a simulation for the difference between the two. To do this, we will create a random draw of 10,000 values with the mean and standard deviation being defined as the posterior mean and standard deviation from the respective groups.

## Number of simulations
N <- 10000

## Monte Carlo Simulation
set.seed(9191)
control_posterior <- rnorm(n = N, mean = posterior_mu_control, sd = posterior_sd_control)
experimental_posterior <- rnorm(n = N, mean = posterior_mu_experimental, sd = posterior_sd_experimental)
diff_posterior <- rnorm(n = N, mean = mu_diff, sd = sd_diff)

## Put the control and experimental groups into a data frame
posterior_df <- data.frame(
  group = c(rep("control", times = N), rep("experimental", times = N)),
  posterior_sim = c(control_posterior, experimental_posterior)
)

Plot the mean and 95% CI for both simulated groups

posterior_df %>%
  group_by(group) %>%
  summarize(Avg = mean(posterior_sim),
            SE = sd(posterior_sim)) %>%
  ggplot(aes(x = group, y = Avg)) +
  geom_hline(yintercept = 0,
             size = 1,
             linetype = "dashed") +
  geom_point(size = 5) +
  geom_errorbar(aes(ymin = Avg - 1.96 * SE, ymax = Avg + 1.96 * SE),
                width = 0,
                size = 1.4) +
  labs(x = NULL,
       y = "Mean Diff",
       title = "Difference in Strength Score",
       subtitle = "Monte Carlo Simulation of Posterior Mean and SD") +
  theme(axis.text = element_text(face = "bold", size = 13),
        axis.title = element_text(face = "bold", size = 17),
        plot.title = element_text(size = 20),
        plot.subtitle = element_text(size = 17))

  • We see more overlap here than we did when we were just looking at the observed data. Part of this is due to the priors that we set.

Plot the posterior simulation for the difference

hist(diff_posterior,
     main = "Posterior Simulation of the Difference between groups",
     xlab = "Strength Score Difference")
abline(v = 0,
       col = "red",
       lwd = 5,
       lty = 2)
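
Because we have a full posterior for the difference, we can also directly estimate the probability that the new program produced any improvement over the normal program. A quick sketch using the simulated values from above:

## Proportion of simulated posterior differences favoring the experimental group
mean(diff_posterior > 0)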

Compare these results to the observed data and the t-test results

data.frame(
  group = c("control", "experimental", "difference"),
  observed_avg = c(control_mu, experimental_mu, t_test_diff),
  posterior_sim_avg = c(posterior_mu_control, posterior_mu_experimental, mu_diff),
  observed_standard_error = c(control_sd / sqrt(control_N), experimental_sd / sqrt(experimental_N), se_diff),
  posterior_sim_standard_error = c(posterior_sd_control, posterior_sd_experimental, sd_diff)
  )

  • Recall that we set our prior mean to 0.1 and our prior SD to 0.3
  • Also notice that, because this conjugate approach is directed at the means, I convert the observed standard deviations into standard errors (the standard deviation of the mean given the sample size). Remember, we set the SD to be known for this approach to work, so we are not able to say anything about it following our Bayesian procedure. To do so, we’d need to estimate both parameters (mean and SD) jointly. We will do this in part 2.
  • Above, we see that, following the Bayesian approach, the mean value of the control group gets pulled up closer to our prior while the mean value of the experimental group gets pulled down closer to that prior.
  • The posterior for the difference between the two groups, which is what we are interested in, has come down towards the prior, considerably. Had we had a larger sample of data in our two groups, the movement towards the prior would have been less extreme. But, since this was a relatively small study we get more movement towards the prior, placing less confidence in the observed sample of participants we had. We also see that the standard error around the mean difference increased, since our prior SD was 0.3. So, the mean difference decreased a bit and the standard error around that mean increased a bit. Both occurrences lead to less certainty in our observed outcomes.

Compare the Observed Difference to the Bayesian Posterior Difference

  • Start by creating a Monte Carlo Simulation of the observed difference from the original data.
  • Notice that the Bayesian Posterior Difference (blue) is pulled down towards our prior, slightly.
  • We also see that the Bayesian Posterior has more overlap with 0 than the Observed Difference (red) from the original data, which came from a small sample and didn’t incorporate any prior knowledge that we might have had.
N <- 10000

set.seed(8945)
observed_diff <- rnorm(n = N, mean = t_test_diff, sd = se_diff)

plot(density(observed_diff),
     lwd = 5,
     col = "red",
     main = "Comparing the Observed Difference (red)\nto the\nBayes Posterior Difference (blue)",
     xlab = "Difference in Strength Score between Groups")
lines(density(diff_posterior),
      lwd = 5,
      col = "blue")
abline(v = 0,
       lty = 2,
       lwd = 3,
       col = "black")

Wrapping Up

That’s it for part 1. We’ve computed a frequentist t-test of the observed data in our study and compared those results to a Bayesian estimation where we combined the observed data with some prior knowledge drawn from previous research and domain expertise. In this example, we used a conjugate approach and were therefore required to specify a known standard deviation for the data. Unfortunately, this may not always make sense to do, and we may instead need to estimate both the mean and standard deviation jointly. In part 2, we will discuss how to do this.

The code for this article is accessible on my GITHUB page.

If you see any code or mathematical errors, please email me.

Neural Networks….It’s regression all the way down!

Yesterday, I talked about how t-tests and ANOVA are fundamentally just linear regression. But what about something more complex? What about something like a neural network?

Whenever people bring up neural networks I always say, “The most basic neural network is a sigmoid function. It’s just logistic regression!!” Of course, neural networks can get very complex, and there is a lot more that can be added to them to maximize their ability to do the job. But fundamentally, they look like regression models, and when you add several hidden layers (deep learning) you end up just stacking a bunch of regression models on top of each other (I know I’m oversimplifying this a little bit).
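
To make the “one sigmoid neuron is just logistic regression” point concrete, here’s a minimal sketch using R’s built-in mtcars data (purely for illustration): the model’s predicted probability is simply a sigmoid applied to a linear combination of the inputs.

## Logistic regression = a linear model passed through a sigmoid
fit_glm <- glm(vs ~ wt, data = mtcars, family = binomial)

## Model prediction for a car weighing 3 (in 1000 lbs)...
predict(fit_glm, newdata = data.frame(wt = 3), type = "response")

## ...matches applying the sigmoid by hand to intercept + coefficient * input
b <- coef(fit_glm)
1 / (1 + exp(-(b[1] + b[2] * 3)))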

Let’s see if we can build a simple neural network to prove it. As always, you can access the full code on my GITHUB page.

Loading packages, functions, and data

We will load {tidyverse} for data cleaning, {neuralnet} for building our neural network, and {mlbench} to access the Boston housing data.

I create a z-score function that will be used to standardize the features for our model. We will keep this simple and attempt to predict Boston housing prices (medv) using three features (rm, dis, indus). To read more about what these features are, type ?BostonHousing in your R console. Once we’ve selected those features out of the data set, we apply our z-score function to them.

## Load packages
library(tidyverse)
library(neuralnet)
library(mlbench)

## z-score function
z_score <- function(x){
  z <- (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
  return(z)
}

## get data
data("BostonHousing")

## z-score features
d <- BostonHousing %>%
  select(medv, rm, dis, indus) %>%
  mutate(across(.cols = rm:indus,
                ~z_score(.x),
                .names = "{.col}_z"))
  
d %>% 
  head()

Train/Test Split

There isn’t much of a train/test split here because I’m not building a full model to be tested. I’m really just trying to show how a neural network works. Thus, I’ll select the first row of data as my “test” set and retain the rest of the data for training the model.

## remove first observation for making a prediction on after training
train <- d[-1, ]
test <- d[1, ]

Neural Network Model

We build a simple model with 1 hidden layer and then plot the output. In the plot we see various numbers. The numbers in black refer to our weights and the numbers in blue refer to the biases.

## Simple neural network with one hidden layer
set.seed(9164)
fit_nn <- neuralnet(medv ~ rm_z + dis_z + indus_z,
                    data = train,
                    hidden = 1,
                    err.fct = "sse",
                    linear.output = TRUE)

## plot the neural network
plot(fit_nn)

Making Predictions — It’s linear regression all the way down!

As stated above, we have weights (black numbers) and biases (blue numbers). If we frame the neural network as a stack of linear regressions, then we can think of the weights as regression coefficients and the biases as the model intercepts.

Let’s take each variable from the plot and store them in their own elements so that we can apply them directly to our test observation and write out the equation by hand.

## Predictions are formed using the weights (black) and biases
# Store the weights and biases from the plot and put them into their own elements
rm_weight <- 1.09872
dis_weight <- -0.05993
indus_weight <- -0.49887

# weight from the hidden node to the output node
hidden_weight <- 35.95032

# biases for the hidden node and the output node
bias_1 <- -1.68717
bias_2 <- 14.85824

With everything stored, we are ready to make a prediction on the test observation.

We begin at the input layer by multiplying each z-scored value by the corresponding weight from the plot above. We sum those together and add in the first bias — just like we would with a linear regression.

# Start by applying the weights to their z-scored values, sum them together and add
# in the first bias
input <- bias_1 + test$rm_z * rm_weight + test$dis_z * dis_weight + test$indus_z * indus_weight
input

One regression down, one more to go! But before we can move to the next regression, we need to transform this input value. The neural network is using a sigmoid function to make this transformation as the input value moves through the hidden layer. So, we should apply the sigmoid function before moving on.

# transform input -- the neural network is using a sigmoid function
input_sig <- 1/(1+exp(-input))
input_sig


We take this transformed input and multiply it by the hidden weight and then add it to the second bias. This final regression equation produces the predicted value of the home.

prediction <- bias_2 + input_sig * hidden_weight
prediction


The prediction here is in thousands of dollars, since medv comes from census data from the 1970s.

Let’s compare the prediction we got by hand to what we get when we run the predict() function.

## Compare the output to what we would get if we used the predict() function
predict(fit_nn, newdata = test)

Same result!

Again, if you’d like the full code, you can access it on my GITHUB page.

Wrapping Up

Yesterday we talked about always thinking regression whenever you see a t-test or ANOVA. Today, we learn that we can think regression whenever we see a neural network, as well! By stacking two regression-like equations together we produced a neural network prediction. Imagine if we stacked 20 hidden layers together!
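
If you want to experiment with the stacking idea yourself, {neuralnet} accepts a vector for the hidden argument, one element per hidden layer. A quick sketch reusing the training data from above (the layer sizes here are arbitrary):

## Two hidden layers: 3 neurons in the first, 2 in the second
set.seed(9164)
fit_nn_deep <- neuralnet(medv ~ rm_z + dis_z + indus_z,
                         data = train,
                         hidden = c(3, 2),
                         err.fct = "sse",
                         linear.output = TRUE)
plot(fit_nn_deep)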

The big take home, though, is that regression is super powerful. Fundamentally, it is the workhorse that helps to drive a number of other machine learning approaches. Imagine if you spent a few years really studying regression models? Imagine what you could learn about data analysis? If you’re up for it, one of my all time favorite books on the topic is Gelman, Hill, and Vehtari’s Regression and Other Stories. I can’t recommend it enough!

t-test…ANOVA…It’s linear regression all the way down!

I had someone ask me a question the other day about t-tests. The question was regarding how to get the residuals from a t-test. The thing we need to remember about t-tests and ANOVA is that they are general linear models. As such, an easier way of thinking about them is that they are a different way of looking at a regression output. In this sense, a t-test is just a simple linear regression with a single categorical predictor (independent) variable that has two levels (e.g., Male & Female), while ANOVA is a simple linear regression with a single categorical predictor variable that has more than two levels (e.g., Cat, Dog, Fish).
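
For those who want to see this in R first (the rest of the blog uses R, while this post walks through Python), here is a minimal sketch using R’s built-in copy of the iris data: the regression slope is exactly the difference in group means that the t-test reports.

## t-test vs. linear regression on two iris species
two_species <- subset(iris, Species %in% c("setosa", "versicolor"))
two_species$Species <- droplevels(two_species$Species)

## Equal-variance t-test (matching the linear model's assumption)
t.test(Sepal.Length ~ Species, data = two_species, var.equal = TRUE)

## Same comparison as a regression: the Speciesversicolor coefficient is the mean difference
summary(lm(Sepal.Length ~ Species, data = two_species))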

Let’s look at an example!

Complete code is available on my GITHUB page.

Load Data

The data we will use is the classic iris data set (available in Python through, for example, the seaborn library).

Exploratory Data Analysis

The Jupyter Notebook I’ve made available on GITHUB has a number of EDA steps. For this tutorial, the variable we will look at is Sepal Length, which appears to differ between species.

T-test

We will start by conducting a t-test. Since a t-test is a comparison of means between two groups, I’ll create a data set with only the setosa and versicolor species.

## get two groups to compare
two_groups = ["setosa", 'versicolor']

## create a data frame of the two groups
df2 = df[df['species'].isin(two_groups)]

First, I run the t-test using two common stats libraries in Python: statsmodels and scipy.

## t-test in statsmodels.api
smf.stats.ttest_ind(x1 = df2[df2['species'] == 'setosa']['sepal_length'],
                    x2 = df2[df2['species'] == 'versicolor']['sepal_length'],
                    alternative="two-sided")

## t-test in scipy
stats.ttest_ind(a = df2[df2['species'] == 'setosa']['sepal_length'],
                b = df2[df2['species'] == 'versicolor']['sepal_length'],
                alternative="two-sided")

Unfortunately, the output of both of these approaches leaves a lot to be desired. They simply return the t-statistic, p-value, and degrees of freedom.

To get a better look at the underlying comparison, I’ll instead fit the t-test using the researchpy library.

## t-test in researchpy
rp.ttest(group1 = df2[df2['species'] == 'setosa']['sepal_length'],
         group2 = df2[df2['species'] == 'versicolor']['sepal_length'])

This output is more informative. We can see the summary stats for both groups at the top, and the t-test results follow below. We see the observed difference: versicolor’s sepal length is, on average, 0.93 longer than setosa’s (5.936 – 5.006). We also get the degrees of freedom, t-statistic, and p-value, along with several measures of effect size.

Linear Regression

Now that we see what the output looks like, let’s confirm that this is indeed just linear regression!

We fit our model using the statsmodels library.

## Linear model to compare results with the t-test (convert the species types to dummy variables)
X = df2[['species']]
X = pd.get_dummies(X['species'], drop_first = True)

y = df2[['sepal_length']]

## add an intercept constant, since it isn't done automatically
X = smf.add_constant(X)

# Build regression model
fit_lm = smf.OLS(y, X).fit()

# Get model output

fit_lm.summary()

Notice that the slope coefficient for versicolor is 0.93, indicating its sepal length is, on average, 0.93 greater than setosa’s sepal length. This is the same result we obtained with our t-test above.

The intercept coefficient is 5.006, which means that when versicolor is set to “0” in the model (0 * 0.93 = 0) all we are left with is the intercept, which is the mean value for setosa’s sepal length, the same as we saw in our t-test.

What about the residuals?

The original question was about residuals from the t-test. Recall that the residuals are the differences between the observed values and the predicted values. When we have a simple linear regression with a single two-level predictor (a t-test), the predicted value for each observation is simply the mean value of that observation’s group.

We can add predictions from the linear regression model into our data set and calculate the residuals, plot the residuals, and then calculate the mean squared error.

## Add the predictions back to the data set
df2['preds'] = fit_lm.predict(X)

## Calculate the residual
resid = df2['sepal_length'] - df2['preds']

## plot the residuals
sns.kdeplot(resid, shade = True)
plt.axvline(x = 0, linewidth = 4, linestyle = '--', color = 'r')

In the full code, you will find the same approach taken by simply using each group’s mean as the “predicted” value, which is exactly the value the model predicts. At the bottom of the code, you will see that the resulting MSE is the same.

Wrapping Up

In summary, whenever you think t-test or ANOVA, just think linear regression. The intercept will end up reflecting the mean value for the reference class while the coefficient(s) for the other classes of that variable will represent the difference between their mean value and the reference class. If you want to make a comparison to a different reference class, you can change the reference class before specifying your model and you will obtain a different set of coefficients (given that they are compared to a new reference class) but the predicted values, residuals, and MSE will end up being the same.
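
As a quick sketch of that last point (in R, keeping with the iris example from above; the same idea applies to the Python model by changing which dummy level gets dropped): changing the reference class changes the coefficients but leaves the predictions, residuals, and MSE untouched.

## Changing the reference class changes the coefficients, not the predictions
dat_iris <- subset(iris, Species %in% c("setosa", "versicolor"))
dat_iris$Species <- droplevels(dat_iris$Species)

fit_a <- lm(Sepal.Length ~ Species, data = dat_iris)   # reference class = setosa

dat_iris$Species <- relevel(dat_iris$Species, ref = "versicolor")
fit_b <- lm(Sepal.Length ~ Species, data = dat_iris)   # reference class = versicolor

coef(fit_a)
coef(fit_b)

## Fitted values (and therefore residuals and MSE) are identical
all.equal(fitted(fit_a), fitted(fit_b))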

Again, if you’d like the full code, you can access it HERE.