Category Archives: Sports Analytics

Displaying Performance Outcomes on a Test

Introduction

I recently had a discussion with some colleagues about displaying performance outcomes on a test for a group of athletes. The discussion was centered around percentile ranking the athletes on a team within a given season. While is one way to display such information we could alternatively display the data as a percentile using a known mean and standard deviation for the population. This latter approach works by standardizing the data (z-score) and using properties of the normal distribution. Similarly, we could take the z-score and convert it to a t-score, on a 1-100 score.

Given these different options, I figured I’d throw together a quick article to show what they look like and how to calculate them in R. The discussion is right in line with the last 2 blog articles about using boxplots and dotplots to visualize athlete testing data (Part 1 and Part 2).

Simulate Data

We will simulate performance test results for 22 different athletes. To do this, we take advantage of the rnorm() function in R and draw from 3 different normal distributions to produce 20 tests results. Since I used set.seed() you will be able to reproduce my results exactly. After creating 20 simulations I added 2 additional athletes to the data set and gave them test scores that were exactly the same as two other athletes in the data so that we had some athletes with the same performance outcome.

Percentile Rank

The percentile rank reflects the percentage of observations that are below a certain score. This value is displayed in 100 theoretical divisions of the observed data. Thus, the top score in the data represents 100 and every value falls below that.

To calculate the percentile rank we simply rank the observed performance values and then divide by the number of observations.

Let’s start by sorting the performance scores so that they are in order from lowest to highest.

Next, we rank these values.

Notice that when we sort the data we see that the values 58.5 and 46.2 are repeated twice. Once we rank them we see that the rank values are also correctly repeated. We can get rid of the half points for these repeated observation by using the trunc() function, which will truncate the values.

Finally, to get the percentile rank, we divide by the total number of observations.

Instead of always having to walk through these steps, we can create a function to do the steps for us in one line of code. This will come in handy when we compare all of these methods later on.

perc.rank <- function(x){
  trunc(rank(x))/length(x)
}

perc.rank(sort(df$performance))

Percentiles

A percentile value is different than a percentile rank in that the percentile value reflects the observed score relative to a population mean and standard deviation. Often, this type of value has been used to represent how well a student has performed on a standardized test (e.g., SAT, ACT, GRE, etc.). The percentile value tells us the density of values below our observation. Thus, the percentile value represents a cumulative distribution under the normal curve, below the point of interest. For example, let’s say we have a bunch of normally distributed data with a mean of 100 and standard deviation of 10. If we plot the distribution of the data and drop a line at 100 (the mean), 50% of the data will fall below and it 50% above it.

set.seed(1)
y <- rnorm(n = 10000, mean = 100, sd = 10)

plot(density(y), col = 'black',
  main = 'Mean = 100, SD = 10')
polygon(density(y), col = 'grey')
abline(v = 100, col = 'red', lty = 2, lwd = 3)

Instead, if we place the line at an observation of 85 we will see that approximately 7% of the data falls below this point (conversely, 93% of the data is above it).

To find the cumulative distribution below a specific observation we can use the pnorm() function and pass it the observation of interest, the population mean, and the standard deviation.

Alternatively, we can obtain the same value by first calculating the z-score of the point of interest and simply passing that into the pnorm() function.

z = (observation – mean) / sd

We find that the z-score for 85 is -1.5 standard deviations below the mean.

We will write a z-score function to use later on.

z_score <- function(x, avg, SD){
  z = (x - avg) / SD
  return(z)
}

T-score

As we saw above, the score of 85 led to a z-score of -1.5. Sometimes having the data scaled to a mean of 0 with values above and below it can difficult for decision-makers to interpret. As such, we can take the z-score and turn it into a a t-score, ranging from 0-100, where 50 represents average, 40 and 60 represent ± 1 standard deviation, 30 and 70 represent ± 2 standard deviation, and 20 and 80 represent ± 3 standard deviations from the mean.

t = observation*10 + 50

Therefore, using the z-score value of -1.5 we end up with a t-score of 35.

We will make a t-score function to use on our athlete simulated data.

t_score <- function(z){
  t = z * 10 + 50
  return(t)
}

Returning to the athlete simulated data

We now return to our athlete simulated data and apply all of these approaches to the performance data. For the z-score, t-score, and percentile values, I’ll start by using the mean and standard deviation of the observed data we have.

df_ranks_v1 <- df %>%
  mutate(percentile_rank = perc.rank(performance),
         percentile_value = pnorm(performance, mean = mean(performance), sd = sd(performance)),
         z = z_score(x = performance, avg = mean(performance), SD = sd(performance)),
         t = t_score(z)) %>%
  mutate(across(.cols = percentile_rank:t,
                ~round(.x, 2)))

df_ranks_v1 %>% 
  arrange(desc(percentile_rank)) %>%
  knitr::kable()

We can also plot these values to provide ourselves a visual to compare them.

We can see that the order of the athletes doesn’t change based on the method. This makes sense given that the best score for this group of athletes is always going to be the best score and the worst will always be the worst. We do see that the percentile rank approach assigns the top performance as 100%; however, the percentile value assigns the top performance a score of 98%. This is because the percent value is based on the parameters of the normal distribution (mean and standard deviation) and doesn’t rank the observations from best to worse as the percentile rank does. Similarly, the other two scores (z-score and t-score) also use the distribution parameters and thus follow the same pattern as the percentile value.

Why does this matter? The original discussion was about athletes within a given season, on one team. If all we care about is the performance of that group of athletes, on that team, in that given season, then maybe it doesn’t matter which approach we use. However, what if we want to compare the group of athletes to previous teams that we’ve had or to a population mean and standard deviation that we’ve obtained from the league (or from scientific literature)? In this instance, the percentile rank value will remain unchanged but it will end up looking different than the other three scores because it doesn’t depend on the mean and standard deviation of the population.

For example, the mean and standard deviation of our current team is 48.9 ± 13.9.

Perhaps our team is currently below average for what we expect from the population. Let’s assume that the population we want to compare our team to has a mean and standard of 55 ± 10.

df_ranks_v2 <- df %>%
  mutate(percentile_rank = perc.rank(performance),
         percentile_value = pnorm(performance, mean = 55, sd = 10),
         z = z_score(x = performance, avg = 55, SD = 10),
         t = t_score(z)) %>%
  mutate(across(.cols = percentile_rank:t,
                ~round(.x, 2)))

df_ranks_v2 %>% 
  arrange(desc(percentile_rank)) %>%
  knitr::kable()

Again, the order of the athletes’ performance doesn’t change and thus the percentile rank of the athletes also doesn’t change. However, the percentile values, z-scores, and t-scores now tell a different story. For example, el-Azer, Ariyya scored 47.9 which has a percentile rank of 50% for the observed performance scores of this specific team. However, this value relative to our population of interest produces a z-score of -0.71, a t-score of 42.9, and a percentile value indicating that only 24% of those in the population who are taking this test are below this point. The athlete looks to be average for the team but when compared to the population they look to be below average.

Wrapping Up

There are a number of ways to display the outcomes on a test for athletes. Using percentile rank, we are looking specifically at the observations of the group that took the given test. If we use percentile value, z-scores, and t-scores, we are using properties of the normal distribution and, often comparing the observed performance to some known population norms. There probably isn’t a right or wrong approach here. Rather, it comes down to the type of story you are looking to tell with your data.

The full code for this article is available on my GITHUB page.

Box & Dotplots for Performance Visuals – Creating an Interactive Plot with plotly

Yesterday, I provided some code to make a simple boxplot with doplot in order to visualize an athlete’s performance relative to their peers.

Today, we will try and make this plot interactive. To do so, we will use the {plotly} package and save the plots as html files that can be sent to our coworkers or decision-makers so that they can interact directly with the data.

Data

We will use the same simulated data from yesterday and also load the {plotly} library.

### Load libraries -----------------------------------------------
library(tidyverse)
library(randomNames)
library(plotly)

### Data -------------------------------------------------------
set.seed(2022)
dat <- tibble( participant = randomNames(n = 20), performance = rnorm(n = 20, mean = 100, sd = 10))

Interactive plotly plot

Our first two plots will be the same, one vertical and one horizontal. Plotly is a little different than ggplot2 in syntax.

I start by creating a base plot, indicating I want the plot to be a boxplot. I also tell plotly that I want to group by participant, so that the dots will show up alongside the boxplot, as they did in yesterday’s visual.
Once I’ve specified the base plot, I indicate that I want boxpoints to add the points next to the boxplot and set some colors (again, using a colorblind friendly palette)
Finally, I add axis labels and a title.
For easy, I use the subplot() function to place the two plots next to each other so that you can compare the vertical and horizontal plot and see which you prefer.

### Build plotly plot -------------------------------------------
# Set plot base
perf_plt <- plot_ly(dat, type = "box") %>%
  group_by(participant)

# Vertical plot
vert_plt <- perf_plt %>%
  add_boxplot(y = ~performance,
              boxpoints = "all",
              line = list(color = 'black'),
              text = ~participant,
              marker = list(color = '#56B4E9',
                            size = 15)) %>% 
  layout(xaxis = list(showticklabels = FALSE)) %>%
  layout(yaxis = list(title = "Performance")) %>%
  layout(title = "Team Performance")       

# Horizontal plot
hz_plt <- perf_plt %>%
  add_boxplot(x = ~performance,
              boxpoints = "all",
              line = list(color = 'black'),
              text = ~participant,
              marker = list(color = '#E69F00',
                            size = 15)) %>% 
  layout(yaxis = list(showticklabels = FALSE)) %>%
  layout(xaxis = list(title = "Performance")) %>%
  layout(title = "Team Performance") 

## put the two plots next to each other
subplot(vert_plt, hz_plt)

Statically, we can see the plot below and if you click on the red link beneath it you will be taken to the interactive version, where you can hover over the points and see the individual athlete’s name and performance on the test.

interactive_plt1

Interactive plotly with selector option

Next, we build the same plot but add a selector box so that the user can select the individual of interest and see their point relative to the boxplot (the population data).

This approach requires a few steps:

I have to create a highlight key to explicitly tell plotly that I want to be able to highlight the participants.
Next I create the base plot but this time instead of using the original data, I pass in the highlight key that I created in step 1.
I build the plot just like before.
Once the plot has been built I use the highlight() function to tell plotly how I want the plot to behave.

NOTE: This approach is super useful and easy and doesn’t require a shiny server to share the results. That said, I find this aspect of plotly to be a bit clunky and, when given the choice between this or using shiny, I take shiny because it has a lot more options to customize it exactly how you want. The downside is that you’d need a shiny server to share your results with colleagues or decision-makers, so there is a trade off.

### plotly with selection box -------------------------------------------
# set 'particpant' as the group to select
person_of_interest <- highlight_key(dat, ~participant)

# create a new base plotly plot using the person_of_interest_element
selection_perf_plt <- plot_ly(person_of_interest, type = "box") %>%
  group_by(participant)

# build the plot
plt_selector <- selection_perf_plt %>%
  group_by(participant) %>%
  add_boxplot(x = ~performance,
              boxpoints = "all",
              line = list(color = 'black'),
              text = ~participant,
              marker = list(color = '#56B4E9',
                            size = 15)) %>% 
  layout(yaxis = list(showticklabels = FALSE)) %>%
  layout(xaxis = list(title = "Performance")) %>%
  layout(title = "Team Performance")   

# create the selector tool
plt_selector %>%
  highlight(on = 'plotly_click',
              off = 'plotly_doubleclick',
              selectize = TRUE,
              dynamic = TRUE,
              persistent = TRUE)

Statically, we can see what the plot looks like below.
Below the static image you can click the red link to see me walk through the interactive plot. Notice that as I select participants it selects them out. I can add as many as I want and change color to highlight certain participants over others. Additionally, once I begin to remove participants you’ll notice that plotly will create a boxplot for the selected sub population, which may be useful when communicating performance results.
Finally, the last red link will allow you to open the interactive tool yourself and play around with it.

interactive_plt2_video

interactive_plt2

Wrapping Up

Today’s article provided some interactive options for the static plots that were created in yesterday’s blog article.

As always, the complete code for this article is available on my GITHUB page.

Box & Dotplots for Performance Visuals

A colleague recently asked me about visualizing athlete performance of athletes relative to their teammates. More specifically, they wanted something that showed some sort of team average and normal range and then a way to highlight where the individual athlete of interest resided within the population.

Immediately, my mind went to some type of boxplot visualization combined with a dotplot where the athlete can clearly identified. Here are a few examples I quickly came up with.

Simulate Performance Data

First we will simulate some performance data for a group of athletes.

### Load libraries -----------------------------------------------
library(tidyverse)
library(randomNames)

### Data -------------------------------------------------------
set.seed(2022)
dat <- tibble(
  participant = randomNames(n = 20),
  performance = rnorm(n = 20, mean = 100, sd = 10))

Plot 1 – Boxplot with Points

The first plot is a simple boxplot plot with dots below it.

A couple of notes:

I’ve selected a few athletes to be our “of_interest” players for the plot.
This plot doesn’t have a y-axis, since all I am doing is plotting the boxplot for the distribution of performance. Therefore, I set the y-axis variable to a factor, so that is simply identifies a space within the grid to organize my plot.
I’m using a colorblind friendly palette to ensure that the colors are viewable to a broad audience.
Everything else after that is basic {ggplot2} code with some simple theme styling for the plot space and the legend position.

dat %>%
  mutate(of_interest = case_when(participant %in% c("Gallegos, Dennis", "Vonfeldt, Mckenna") ~ participant,
                                 TRUE ~ "everyone else")) %>%
  ggplot(aes(x = performance, y = factor(0))) +
  geom_boxplot(width = 0.2) +
  geom_point(aes(x = performance, y = factor(0),
                  fill = of_interest),
              position = position_nudge(y = -0.2),
              shape = 21,
              size = 8,
              color = "black",
              alpha = 0.6) +
  scale_fill_manual(values = c("Gallegos, Dennis" = "#E69F00", "Vonfeldt, Mckenna" = "#56B4E9", "everyone else" = "#999999")) +
  labs(x = "Performance",
       title = "Team Performance",
       fill = "Participants") +
  theme_classic() +
  theme(axis.text.y = element_blank(),
        axis.title.y = element_blank(),
        axis.text.x = element_text(size = 10, face = "bold"),
        axis.title.x = element_text(size = 12, face = "bold"),
        plot.title = element_text(size = 18),
        legend.position = "top")

Not too bad! You might have more data than we’ve simulated and thus the inline dots might start to get busy. We can use geom_dotplot() to create separation between dots in areas where there is more density and expose that density of performance scores a bit better.

Plot 2 – Boxplot with Dotplots

Swapping out geom_point() with geom_dotplot() allows us to produce the same plot above just with a dotplot.

Here, I set “y = positional” since, as before, we don’t have a true y-axis. Doing so allows me to specify where on the y-axis I want to place by boxplot and dotplot in their respective aes().

# Horizontal
dat %>%
  mutate(of_interest = case_when(participant %in% c("Gallegos, Dennis", "Vonfeldt, Mckenna") ~ participant,
                                 TRUE ~ "everyone else")) %>%
  ggplot(aes(x = performance, y = positional)) +
  geom_boxplot(aes(y = 0.2),
    width = 0.2) +
  geom_dotplot(aes(y = 0, 
                   fill = of_interest),
              color = "black",
              stackdir="center",
              binaxis = "x",
              alpha = 0.6) +
  scale_fill_manual(values = c("Gallegos, Dennis" = "#E69F00", "Vonfeldt, Mckenna" = "#56B4E9", "everyone else" = "#999999")) +
  labs(x = "Performance",
       title = "Team Performance",
       fill = "Participants") +
  theme_classic() +
  theme(axis.text.y = element_blank(),
        axis.title.y = element_blank(),
        axis.text.x = element_text(size = 10, face = "bold"),
        axis.title.x = element_text(size = 12, face = "bold"),
        plot.title = element_text(size = 18),
        legend.position = "top",
        legend.title = element_text(size = 13),
        legend.text = element_text(size = 11),
        legend.key.size = unit(2, "line"))

If you don’t like the idea of a horizontal plot you can also do it in vertical.

# vertical
dat %>%
  mutate(of_interest = case_when(participant %in% c("Gallegos, Dennis", "Vonfeldt, Mckenna") ~ participant,
                                 TRUE ~ "everyone else")) %>%
  ggplot(aes(x=positional, y= performance)) +
  geom_dotplot(aes(x = 1.75, 
                   fill = of_interest), 
               binaxis="y", 
               stackdir="center") +
  geom_boxplot(aes(x = 2), 
               width=0.2) +
  scale_fill_manual(values = c("Gallegos, Dennis" = "#E69F00", "Vonfeldt, Mckenna" = "#56B4E9", "everyone else" = "#999999")) +
  labs(y = "Performance",
       title = "Team Performance",
       fill = "Participants") +
  theme_classic() +
  theme(axis.text.x = element_blank(),
        axis.title.x = element_blank(),
        axis.text.y = element_text(size = 10, face = "bold"),
        axis.title.y = element_text(size = 12, face = "bold"),
        plot.title = element_text(size = 18),
        legend.position = "top",
        legend.title = element_text(size = 13),
        legend.text = element_text(size = 11),
        legend.key.size = unit(2, "line")) +
  xlim(1.5, 2.25)

Wrapping Up

Visualizing an athlete’s performance relative to their team or population can be a useful for communicating data. Boxplots with dotplots can be a compelling way to show where an athlete falls when compared to their peers. Other options could have been to show a density plot with points below it (like a Raincloud plot). However, I often feel like people have a harder time grasping a density plot whereas the the boxplot clearly gives them an average (line inside of the box) and some “normal” range (interquartile range, referenced by the box) to anchor themselves to. Finally, this plot can easily be built into an interactive plot using Shiny.

All of the code for this article is available on my GITHUB page.

The Role of Skill vs Luck in Team Sport Winning

This week on the Wharton Moneyball Podcast, the hosts were discussing World Cup results following the group play stage.

At one point, they talked about the variance in performance and the role that luck can play in winning or losing. They felt like they didn’t have as good an intuition about how variable the games were and one of the hosts, Eric Bradlow, said that he’d try and look into it and have an answer for next week’s episode.

The discussion reminded me of a chapter in one of my favorite books, The Success Equation by Michael Mauboussin. The book goes into nice detail about the role of skill and luck in sports, investing, and life. On page 78, Mauboussin provides the following equation:

Skill = Observed Outcome – Luck

From there, he explains the steps for determining the contribution of luck to winning in a variety of team sports. Basically, the amount of luck plays is represented as the ratio of the variance of luck to the variance of observed outcomes.

To calculate the variance of observed outcomes, we find the win% of each team in a given season and calculate the standard deviation of that win%. Squaring this value gives us the variance of observed outcomes.

When calculating the variance of luck we first find the average win% of all of the teams and, treating this as a binomial, we calculate the standard deviation as sqrt((win% * (1 – win%)) / n_games), where n_games is the number of games in a season.

I scraped data for the previous 3 complete seasons for the Premier League, NHL, NBA, NFL, and MLB from sports-reference.com. All of the data and code for calculating the contribution of luck to winning for these sports is available on my GitHub page.

Let’s walk through an example

Since the guys on Wharton Moneyball Podcast were talking about Soccer, I’ll use the Premier League as an example.

First, we create a function that does all the calculations for us. All we need to do is obtain the relevant summary statistics to feed into the function and then let it do all the work.

league_perf <- function(avg_win_pct, obs_sd, luck_sd, league){
  
  ## convert standard deviations to variance
  var_obs <- obs_sd^2
  var_luck <- luck_sd^2
  
  ## calculate the variation due to skill
  var_skill <- var_obs + var_luck
  
  ## calculate the contribution of luck to winning
  luck_contribution <- var_luck / var_obs
  
  
  ## Results table
  output <- tibble(
    league = {{league}},
    avg_win_pct = avg_win_pct,
    luck_contribution = luck_contribution,
  )
  
  return(output)
  
}

Using the Premier League data set we calculate the games per season, league average win%, the observed standard deviation, and the luck standard deviation.

NOTE: I’m not much of a soccer guy so I wasn’t sure how to handle draws. I read somewhere that they equate to about 1/3 of a win, so that is what I use here to credit a team for a draw.

## get info for function
# NOTE: Treating Draws as 1/3 of a win, since that reflects the points the team is awarded in the standings
premier_lg %>%
  select(Squad, MP, W, D) %>%
  mutate(win_pct = (W + 1/3*D) / MP) %>%
  summarize(games = max(MP),
            avg_win_pct = mean(win_pct),
            obs_sd = sd(win_pct),
            luck_sd = sqrt((avg_win_pct * (1 - avg_win_pct)) / games))

## Run the function
premier_lg_output <- league_perf(avg_win_pct = 0.462,
                                 obs_sd = 0.155,
                                 luck_sd = 0.0809,
                                 league = "Premier League")

premier_lg_output

Looking at our summary table, it appears that teams have had an average win% over the past 3 seasons of 46.2% (not quite 50%, as we will see in NBA, MLB, or NFL, since draws happen so frequently). The standard deviation of team win% was 15.5% (this is the observed standard deviation), while the luck standard deviation, 8.1%, is the binomial standard deviation using the average win percentage and 38 games in a season.

Feeding these values into the function created above we find that luck contributes approximately 27.2% to winning in the Premier League. In Mauboussin’s book he found that the contribution of luck to winning was 31% in the Premier League from 2007 – 2011. I don’t feel like the below value is too far off of that, though I don’t have a good sense for what sort of magnitude of change would be surprising. That said, the change could be due to improved skill in the Premier League (already one of the most skillful leagues in the world) or perhaps how he handled draws, which was not discussed in the book and may differ from the approach I took here.

Let’s look at the table of results for all sports

Again, you can get the code for calculating the contribution of luck to winning in these sports from my GitHub page, so no need to rehash it all here. Instead, let’s go directly to the results.

NBA appears to be the most skill driven league with only about 15% of the contribution to winning coming from luck. Mauboussin found this value to be 12% from 2007 – 2011.
The NFL appears to be drive most by luck, contributing 39% to winning. This is identical to what Mauboussin observed using NFL data from 2007 – 2011.
The two most surprising outcomes here are the MLB and NHL. Mauboussin found the MLB to have a 34% contribution of luck (2007 – 2011) and the NHL a 53% contribution of luck (2008 – 2012). However, in my table below it would appear that the contribution of luck has decreased substantially in these two sports, using data from previous 3 seasons (NOTE: throwing out the COVID year doesn’t alter this drastically enough to make it close to what Mauhoussin showed).

Digging Deeper into MLB and NHL

I pulled data from the exact same seasons that Mauboussin used in his book and obtained the following results.

The results I observed for the MLB are identical to what Mauboussin had in his book (pg. 80)
The results for the NHL are slightly off (47.7% compared to 53%) but this might have to do with how I am handling overtime wins and losses (which awards teams points in the standings), as I don’t know enough about hockey to determine what value to assign to them. Perhaps Mauboussin addressed these two outcomes in his calculations in a more specific way.

Additional hypotheses:

Maybe skill has improved in hockey over the past decade and a half?
Maybe a change in tactics (e.g., the shift) or strategy (e.g., hitters trying to hit more home runs instead of just making contact or pitchers trying to explicitly train to increase max velocity) has taken some of the luck events out of baseball and turned it into more of a zero-sum duel between the batter and pitcher, making game outcomes more dependent on the skill of both players?
Maybe I have an error somewhere…let me know if you spot one!

Wrapping Up

Although we marvel at the skill of the athletes we watch in their arena of competition, it’s important to recognize that luck plays a role in the outcomes, and for some sports it plays more of a role than others. But, luck is also what keeps us watching and makes things fun! Sometimes the ball bounces your way and you can steal a win! The one thing I always appreciate form the discussions on the Wharton Moneyball Podcast is that the hosts don’t shy away from explaining things like luck, randomness, variance, regression to the mean, and weighting observed outcomes with prior knowledge. This way of thinking isn’t just about sport, it’s about life. In this sense, we may consider sport to be the vehicle through which these hosts are teaching us about our world.

Approximating a Bayesian Posterior Prediction

This past week on the Wharton Moneyball Podcast, during Quarter 2 of the show the hosts got into a discussion about Bayesian methods, simulating uncertainty, and frequentist approaches to statistical analysis. The show hosts are all strong Bayesian advocates but at one point in the discussion, Eric Bradlow mentioned that “frequenstist can answer similar questions by building distributions from their model predictions.” (paraphrasing)

This comment reminded me of Chapter 7 in Gelman and Hill’s brilliant book, Data Analysis Using Regression and Multilevel/Hierarchical Models. In this chapter, the authors do what they call an informal Bayesian approach by simulating the predictions from a linear regression model. It’s an interesting (and easy) approach that can be helpful for getting a sense for how Bayesian posterior predictions work without building a full Bayesian model. I call it, “building a bridge to Bayes”.

Since the entire book is coded in R, I decided to code an example in Python.

The Jupyter notebook is accessible on my GITHUB page.

(Special Material: In the Jupyter Notebook I also include additional material on how to calculate prediction intervals and confidence intervals by hand in python. I wont go over those entire sections here as I don’t want the post to get too long.)

Libraries & Data

We will start by loading up the libraries that we need and the data. I’ll use the iris data set, since it is conveniently available from the sklearn library. We will build the regression model using the statsmodels.api library.

## import libraries

from sklearn import datasets
import pandas as pd
import numpy as np
import statsmodels.api as smf
from scipy import stats
import matplotlib.pyplot as plt

## iris data set
iris = datasets.load_iris()

## convert to pandas sata frame
data = iris['data']
target = iris['target']

iris_df = pd.DataFrame(data)
iris_df.columns = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
iris_df.head(3)

Build a linear regression model

Next, we build a simple ordinary least squares regression model to predict petal_width from petal_length.

## Set up our X and y variables
X = iris_df['petal_length']
y = iris_df['petal_width']

# NOTE: statsmodels does not automatically add an intercept, so you need to do that manually

X = smf.add_constant(X)

# Build regression model
# NOTE: the X and y variables are reversed in the function compared to sklearn

fit_lm = smf.OLS(y, X).fit()

# Get an R-like output of the model

fit_lm.summary()

Simulate a distribution around the slope coefficient

The slope coefficient for the petal_length variable, from the model output above, is 0.4158 with a standard error of 0.01. We will store these two values in their own variables and then use them to create a simulation of 10,000 samples and plot the distribution.

## get summary stats
mu_slope = 0.4158
se_slope = 0.01

## create simulation
n_samples = 10000
coef_sim = np.random.normal(loc = mu_slope, scale = se_slope, size = n_samples)

## plot simulated distribution

plt.hist(coef_sim, bins = 60)
plt.show()

We can also grab the summary statistics from the simulated distribution. We will snag the mean and the 90% quantile interval.

## get summary stats from our simulation
summary_stats = {
    'Mean': coef_sim.mean(),
    'Low90': np.quantile(coef_sim, 0.05),
    'High90': np.quantile(coef_sim, 0.95)
}

summary_stats

Making a prediction on a new observation and building a posterior predictive distribution

Now that we’ve gotten a good sense for how to create a simulation in python, we can create a new observation of petal_length and make a prediction about what the petal_width would be based on our model. In addition, we will get the prediction intervals from the output and use them to calculate a standard error for the prediction, which we will use for the posterior predictive simulation.

Technical Note: In the prediction output you will see that python returns a mean_ci_lower and mean_ci_upper and an obs_ci_lower and obs_ci_higher. The latter two are the prediction intervals while the former two are the confidence intervals. I previously discussed the difference HERE and this is confirmed in the Jupyter Notebook where I calculate these values by hand.

# create a new petal_length observation
new_dat = np.array([[1, 1.7]])    # add a 1 upfront to indicate the model intercept

# make prediction of petal_width using the model
prediction = fit_lm.get_prediction(new_dat)

# get summary of the prediction
prediction.summary_frame()

Store the predicted value (0.343709) and then calculate the standard error from the lower and upper prediction intervals. Run a simulation and then plot the distribution of predictions.

mu_pred = 0.343709
se_pred = 0.754956 - 0.343709     # subtract the upper prediction interval from the mean to get the variability
n_sims = 10000

pred_obs = np.random.normal(loc = mu_pred, scale = se_pred, size = n_sims)

plt.hist(pred_obs, bins = 60)
plt.show()

Just as we did for the simulation of the slope coefficient we can extract our summary statistics (mean and 90% quantile intervals).

## get summary stats from our simulation
summary_stats = {
    'Mean': pred_obs.mean(),
    'Low90': np.quantile(pred_obs, 0.05),
    'High90': np.quantile(pred_obs, 0.95)
}

summary_stats

Wrapping Up

That is a pretty easy way to get a sense for approximating a Bayesian posterior predictive distribution. Rather than simply reporting the predicted value and a confidence interval or prediction interval, it is sometimes nice to build an entire distribution. Aside from it being visually appealing, it allows us to answer other questions we might have, for example, what percentage of the data is greater or less than 0.5 (or any other threshold value you might be interested in)?

As stated earlier, all of this code is accessible on my GITHUB page and the Jupyter notebook also has additional sections on how to calculate confidence intervals and prediction intervals by hand.

If you notice any errors, let me know!