Bayesian Priors for Categorical Variables using rstanarm

Continuing with more Bayesian data analysis using the {rstanarm} package, today walk through the ways of setting priors on categorical variables.

NOTE: Priors are a pretty controversial piece in Bayesian statistics and one of the arguments people make against Bayesian data analysis. Thus, I’ll also show what happens when you are overly bullish with your priors/

The full code is accessible on my GITHUB page.

Load Packages & Data

We are going to use the mtcars data set from R. The cylinder variable (cyl) is read in as a numeric but it only have three levels (4, 6, 8), therefore, we will convert it to a categorical variable and treat it as such for the analysis.

We are going to build a model that estimates miles per gallon (mpg) from the number of cylinders a car has. So, we will start by looking at the mean and standard deviation of mpg for each level of cyl.

## Bayesian priors for categorical variables using rstanarm

library(rstanarm)
library(tidyverse)
library(patchwork)

### Data -----------------------------------------------------------------
d <- mtcars %>%
  select(mpg, cyl, disp) %>%
  mutate(cyl = as.factor(cyl),
         cyl6 = ifelse(cyl == "6", 1, 0),
         cyl8 = ifelse(cyl == "8", 1, 0))

d %>% 
  head()

d %>%
  group_by(cyl) %>%
  summarize(avg = mean(mpg),
            SD = sd(mpg))

Fit the model using Ordinary Least Squares regression

Before constructing our Bayesian model, we fit the model as a basic regression model to see what the output looks like.

## Linear regression ------------------------------------------------------
fit_lm <- lm(mpg ~ cyl, data = d)
summary(fit_lm)

The model suggests there is a relationship between mpg and cyl number
A 4 cyl car is represented as the intercept. Consequently, the intercept represents the average mpg we would expect from a 4 cylinder car.
The other two coefficients (cyl6 and cyl8) represent the difference in mpg for each of those cylinder cars relative to a 4 cylinder car (the model intercept). So, a 6 cylinder can, on average, will get 7 less mpg than a 4 cylinder car while an 8 cylinder car will, on average, get about 12 less mpg’s than a 4 cylinder car.

Bayesian regression with rstanarm — No priors specified

First, let’s fit the model with no priors specified (using the functions default priors) to see what sort of output we get.

## setting no prior info
stan_glm(mpg ~ cyl, data = d) %>%
  summary(digits = 3)

The output is a little different than the OLS model. First we see that there are no p-values (in the spirit of Bayes analysis!).
We do find that the model coefficients are basically the same as those produce with the OLS model and even the standard deviation is similar to the standard errors from above.
Instead of p-values for each coefficient we get 80% credible intervals.
The sigma value at the bottom corresponds to the residual standard error we got in our OLS model.

Basically, the default priors “let the data speak” and reported back the underlying relationship in the empirical data.

Setting Some Priors

Next, we can set some minimally informative priors. These priors wont contain much information and, therefore, will be highly influenced by minimal amounts of evidence regarding the underlying relationship that is present in the data.

To set priors on independent variables in rstanarm we need to create an element to store them. We have two independent variables (cyl6 and cyl8), both requiring priors (we will set the prior for the intercept and the model sigma in the function directly). To set these priors we need to determine a distribution, a mean value (location), and a standard deviation (scale). We add these values into the distribution function in the order in which they will appear in the model. So, there will be a vector of location that is specific to cyl6 and cyl8 and then a vector of scale that is also specific to cyl6 and cyl8, in that order.

## Setting priors
ind_var_priors <- normal(location = c(0, 0), scale = c(10, 10))

Next, we run the model.

fit_rstan <- stan_glm(mpg ~ cyl, 
                      prior = ind_var_priors,
                      prior_intercept = normal(15, 8),
                      prior_aux = cauchy(0, 3),
                      data = d)

# fit_rstan
summary(fit_rstan, digits = 3)

Again, this model is not so different from the one that used the default priors (or from the findings of the OLS model). But, our priors were uninformative.

One note I’ll make, before proceeding on, is that you can do this a different way and simply dummy code the categorical variables and enter those dummies directly into the model, setting priors on each, and you will obtain the same result. The below code dummy codes cyl6 and cyl8 as booleans (1 = yes, 0 = no) so when both are 0 we effectively are left with cyl4 (the model intercept).

############################################################################################
#### Alternate approach to coding the priors -- dummy coding the categorical variables #####
############################################################################################

d2 <- d %>%
  mutate(cyl6 = ifelse(cyl == "6", 1, 0),
         cyl8 = ifelse(cyl == "8", 1, 0))

summary(lm(mpg ~ cyl6 + cyl8, data = d2))

stan_glm(mpg ~ cyl, 
         prior = ind_var_priors,
         prior_intercept = normal(15, 8),
         prior_aux = cauchy(0, 3),
         data = d2) %>%
  summary()

###########################################################################################
###########################################################################################

Okay, back to our regularly scheduled programming…..

So what’s the big deal?? The model coefficients are relatively the same as with OLS. Why go through the trouble? Two reasons:

Producing the posterior distribution of model coefficients posterior predictive distribution for the dependent variable allows us to evaluate our uncertainty around each. I’ve talked a bit about this before (Making Predictions with a Bayesian Regression Model, Confidence & Prediction Intervals – Compare and Contrast Frequentist and Bayesian Approaches, and Approximating a Bayesian Posterior with OLS).
If we have more information on relationship between mpg and cylinders we can code that in as information the model can use!

Let’s table point 2 for a second and extract out some posterior samples from our Bayesian regression and visualize the uncertainty in the coefficients.

# posterior samples
post_rstan <- as.matrix(fit_rstan) %>%
  as.data.frame() %>%
  rename('cyl4' = '(Intercept)')

post_rstan %>%
  head()

mu.cyl4 <- post_rstan$cyl4
mu.cyl6 <- post_rstan$cyl4 + post_rstan$cyl6
mu.cyl8 <- post_rstan$cyl4 + post_rstan$cyl8

rstan_results <- data.frame(mu.cyl4, mu.cyl6, mu.cyl8) %>%
  pivot_longer(cols = everything())


rstan_plt <- rstan_results %>%
  left_join(
    
    d %>%
      group_by(cyl) %>%
      summarize(avg = mean(mpg)) %>%
      rename(name = cyl) %>%
      mutate(name = case_when(name == "4" ~ "mu.cyl4",
                              name == "6" ~ "mu.cyl6",
                              name == "8" ~ "mu.cyl8"))
    
  ) %>%
  ggplot(aes(x = value, fill = name)) +
  geom_histogram(alpha = 0.4) +
  geom_vline(aes(xintercept = avg),
             color = "black",
             size = 1.2,
             linetype = "dashed") +
  facet_wrap(~name, scales = "free_x") +
  theme_light() +
  theme(strip.background = element_rect(fill = "black"),
        strip.text = element_text(color = "white", face = "bold")) +
  ggtitle("rstanarm")

rstan_plt

The above plot represents the posterior distribution (the prior combined with the observed data, the likelihood) of the estimated values for each of our cylinder types.
The dashed line is the observed mean mpg for each cylinder type in the data.
The distribution helps give us a good sense of the certainty (or uncertainty) we have in our estimates.

We can summarize this uncertainty with point estimates (e.g., mean and median) and measures of spread (e.g., standard deviation, credible intervals, quantile intervals).

# summarize posteriors
mean(mu.cyl4)
sd(mu.cyl4)
qnorm(p = c(0.05, 0.95), mean = mean(mu.cyl4), sd = sd(mu.cyl4))
quantile(mu.cyl4, probs = c(0.05, 0.25, 0.5, 0.75, 0.95))

mean(mu.cyl6)
sd(mu.cyl6)
qnorm(p = c(0.05, 0.95), mean = mean(mu.cyl6), sd = sd(mu.cyl6))
quantile(mu.cyl6, probs = c(0.05, 0.25, 0.5, 0.75, 0.95))

mean(mu.cyl8)
sd(mu.cyl8)
qnorm(p = c(0.05, 0.95), mean = mean(mu.cyl8), sd = sd(mu.cyl8))
quantile(mu.cyl8, probs = c(0.05, 0.25, 0.5, 0.75, 0.95))

For example, the below information tells us that cyl8 cars will, on average, provide us with ~15.2 mpg with a credible interval between 13.7 and 16.2. The median value is 15.2 with an interquartile range between 14.6 and 15.8 and a 90% quantile interval ranging between 13.7 and 16.6.

Bullish Priors

As stated earlier, priors are one of the most controversial aspects of Bayesian analysis. Most argue against Bayes because they feel that priors can be be manipulated to fit the data. However, what many fail to recognize is that no analysis is void of human decision-making. All analysis is done by humans and thus there are a number of subjective decisions that need to be made along the way, such as deciding on what to do with outliers, how to handle missing data, the alpha level or level of confidence that you want to test your data against, etc. As I’ve said before, science isn’t often as objective as we’d like it to be. That all said, selecting priors can be done in a variety of ways (aside from just using non-informative priors as we did above). You could get expert opinion, you could use data and observations gained from a pilot study, or you can use information about parameters from previously conducted studies (though be cautious as these also might be bias due to publication issues such as the file drawer phenomenon, p-hacking, and researcher degrees of freedom).

When in doubt, it is probably best to be conservative with your priors. But, let’s say we sit down with a mechanic and inform him of a study where we are attempting to estimate the miles per gallon for 4, 6, and 8 cylinder cars. We ask him if he can help us with any prior knowledge about the decline in mpg when the number of cylinders increase. The mechanic is very bullish with his prior information and states,

“Of course I know the relationship between cylinders and miles per gallon!! Those 4 cylinder cars tend to be very economical and get around 50 mpg plus or minus 2. I haven’t seen too many 6 cylinder cars, but my hunch is that there are pretty similar to the 4 cylinder cars. Now 8 cylinder cars…I do a ton of work on those! Those cars get a bad wrap. In my experience they actually get better gas mileage than the 4 or 6 cylinder cars. My guess would be that they can get nearly 20 miles per gallon more than a 4 cylinder car!”

Clearly our mechanic has been sniffing too many fumes in the garage! But, let’s roll with his beliefs and codify them as prior knowledge for our model and see how such bullish priors influence the model’s behavior.

We set the intercept to be normally distributed with a mean of 50 and a standard deviation of 2.
Because the mechanic felt like the 6 cylinder car was similar to the 4 cylinder car we will stick suggest that the difference between 6 cylinders and 4 cylinders is normally distributed with a mean of 0 and standard deviation of 2.
Finally, we use the crazy mechanics belief that the 8 cylinder car gets roughly 20 more miles per gallon than the 4 cylinder car and we code its prior to be normally distributed with a mean of 20 and standard deviation of 5.

Fit the model…

## Use wildly different priors ---------------------------------------------------------
ind_var_priors2 <- normal(location = c(0, 20), scale = c(10, 5))

fit_rstan2 <- stan_glm(mpg ~ cyl, 
                       prior = ind_var_priors2,
                       prior_intercept = normal(50, 2),
                       prior_aux = cauchy(0, 10),
                       data = d)

summary(fit_rstan2, digits = 3)

Wow! Look how much the overly bullish/informative priors changed the model output.

Our new belief is that a 4 cylinder car gets approximately 39 mpg and the 6 cylinder car gets about 3 more mpg than that, on average.
The 8 cylinder car is now getting roughly 14 mpg more than the 4 cylinder car.

The bullish priors have overwhelmed the observed data. Notice that the results are not exact to the prior but the prior, as they are tugged a little bit closer to the observed data, though not by much. For example, we specified the 8 cylinder car to have about 20 mpg over a 4 cylinder car. The observed data doesn’t indicate this to be true (8 cylinder cars were on average 11 mpg LESS THAN a 4 cylinder car) so the coefficient is getting pulled down slightly, from our prior of 20 to 14.4.

Let’s plot the posterior distribution.

# posterior samples
post_rstan <- as.matrix(fit_rstan2) %>%
  as.data.frame() %>%
  rename('cyl4' = '(Intercept)')

post_rstan %>%
  head()

mu.cyl4 <- post_rstan$cyl4
mu.cyl6 <- post_rstan$cyl4 + post_rstan$cyl6
mu.cyl8 <- post_rstan$cyl4 + post_rstan$cyl8

rstan_results <- data.frame(mu.cyl4, mu.cyl6, mu.cyl8) %>%
  pivot_longer(cols = everything())


rstan_plt2 <- rstan_results %>%
  left_join(
    
    d %>%
      group_by(cyl) %>%
      summarize(avg = mean(mpg)) %>%
      rename(name = cyl) %>%
      mutate(name = case_when(name == "4" ~ "mu.cyl4",
                              name == "6" ~ "mu.cyl6",
                              name == "8" ~ "mu.cyl8"))
    
  ) %>%
  ggplot(aes(x = value, fill = name)) +
  geom_histogram(alpha = 0.4) +
  geom_vline(aes(xintercept = avg),
             color = "black",
             size = 1.2,
             linetype = "dashed") +
  facet_wrap(~name, scales = "free_x") +
  theme_light() +
  theme(strip.background = element_rect(fill = "black"),
        strip.text = element_text(color = "white", face = "bold")) +
  ggtitle("rstanarm 2")

rstan_plt2

Notice how different these posteriors are than the first Bayesian model. In every case, the predicted mpg from the number of cylinders are all over estimating the observed mpg by cylinder (dashed line).

Wrapping Up

Today we went through how to set priors on categorical variables using rstanarm. Additionally, we talked about some of the skepticism about priors and showed what can happen when the priors you select are too overconfident. The morale of the story is two-fold:

All statistics (Bayes, Frequentist, Machine Learning, etc) have some component of subjectivity as the human doing the analysis has to make decisions about what to do with their data at various points.
Don’t be overconfident with your priors. Minimally informative priors maybe be useful to allowing us to assert some level of knowledge of the outcome while letting that knowledge be influenced/updated by what we’ve just observed.

The full code is accessible on my GITHUB page.

If you notice any errors, please reach out!

TidyX Episode 134: Conditional Formatting DT tables

Last week, a viewer sent us a request via the YouTube channel asking if we could go over various ways of conditioning formatting DT tables.

We use DT a lot, especially in shiny apps, flexdahboards, and Rmarkdown reports. DT is my preferred table package when I need to make interactive tables that allow the user to sort and filter data. Conditional formatting is pretty easy in DT and this week, Ellis and I talk through using the basic functions, styleEqual() and styleInterval() to create conditionally formatted columns. We also go through more advanced conditional formatting features such as gradient conditional formatting and conditionally formatting entire rows based on the conditions of a single column.

To watch our screen cast, CLICK HERE.

To access our code, CLICK HERE.

Removing columns with NA for fluid table building in shiny

One of the most frustrating aspects of building {shiny} apps is dealing with columns that have NAs when outputting tables. This is common in sport when dealing with players from different position groups who may have different stats that describe performance for those positions. Rather than writing a long series of if/else statements, I prefer to streamline the process by dropping those columns prior to returning the table of data. Not only does this make the app run smoothly but it also is easier to debug or add additional table information without having to deal with a lot of nested if/else statements.

The full code is accessible on my GITHUB page.

Load Packages & Simulate Data

## Removing columns with NA for fluid table building in shiny

## packages ---------------------------------------------------
library(tidyverse)
library(shiny)

## simulate data ----------------------------------------------
d <- tribble(
  ~player, ~position, ~stat1, ~stat2, ~stat3,
  'Frank', 'Pitcher', 10, NA, 33,
  'Tom', 'Batter', NA, 14, 12,
  'Jeff', 'Batter', NA, 5, NA,
  'Harold', 'Pitcher/Batter', 12, 33, 9
)

d

We can see from our little data set that different players have different stats populated. We really don’t want our users to deal with having to see NA in the table output. So, we need to devise a way to drop the columns with NA’s once a specific player has been selected.

Dropping Columns with NA in Base R

Let’s select on player and attempt to drop their columns with NA. In base R we will use the colSums() function to produce a count of the number of NA’s in each column.

## remove columns with NA for Frank, using Base R --------------------------

frank <- d %>% 
  filter(player == 'Frank')

# colSums() can be used to count the NA's in each column
colSums(is.na(frank))

We can see that stat2 has 1 NA while the other 4 columns are complete. We can use this information to retain those four columns and drop stat2.

frank[ , colSums(is.na(frank)) == 0]

Dropping Columns with NA in {dplyr}

We can perform a similar task using the select_if() function within the {dplyr} package and indicating that we want to select all columns without an NA.

frank %>%
  select_if(~!is.na(.x))

Build a shiny app that fluidly retains the columns without NA

Now that we have a few strategies for removing columns with NA, we can build a {shiny} app that allows the user to select a player and then the server fluidly will drop the columns with NA so that we don’t need to use a messy if/else chain.

Notice that prior to dropping columns with NA I set the names of all of the columns in the table so that they look nicer when the table gets rendered. We can see from the figures that no matter which player is selected, the server intelligently drops columns with missing data, allowing the user to see only the statistics that are meaningful for the individual.

# UI
ui <- fluidPage(
  
  sidebarPanel(
    
    selectInput(inputId = 'player',
                label = "Select Player",
                choices = sort(unique(d$player)),
                selected = FALSE,
                multiple = FALSE)
    
  ),
  
  mainPanel(
    
    tableOutput(outputId = 'tbl')
  )
)

# Server
server <- function(input, output){
  
  # get selected player
  dat_tbl <- reactive({ d %>%
      filter(player == input$player)
    
  })
  
  # build table
  output$tbl <- renderTable({ dat_tbl() %>%
      setNames(c("Player", "Position", "Stat 1", "Stat 2", "Stat 3")) %>%
      select_if(~!is.na(.x))
    
  })
  
}


# deploy
shinyApp(ui, server)

Wrapping Up

Instead of having users see columns with NA, make your renderTable() function fluid and automatically drop columns with missing values to improve the user experience.

The full code is accessible on my GITHUB page.

TidyX Episode 133: Intro to Flexdashboard

Somehow we made it through 132 episodes WITHOUT talking about {Flexdashboard}!! This week, Ellis Hughes and I remedy this and go through creating a basic Flexdashboard for training load monitoring.

In the sports science and sports medicine fields building dashboards for reporting training outcomes is incredibly popular. Most are doing so in excel, which can get messy given the amount of daily work one needs to do to keep it up to date (copying, pasting, pivot tabling, etc.). Let us help you simply your life and teach you how to build an automated report in {Flexdashboard}.

To watch our screen cast, CLICK HERE.

To access our code, CLICK HERE.

Bayesian Linear Regression: Getting started with PyMC3

Previously I’ve used {rstanarm}, {brms}, and Stan for fitting Bayesian models. However, as I continue to work on improving my Python skills, I figured I’d try and delve into the PyMC3 framework for fitting such models. This article will go through the following steps:

Fitting the model
Making a point estimate prediction
Making a point estimate prediction with uncertainty
Calculating a posterior predictive distribution

I’ve covered the last three steps in a prior blog on making predictions with a Bayesian model. I know there are probably functions available in PyMC3 that can do these things automatically (just as there are in {rstanarm}) but instead of falling back on those, I create the posterior distributions here using numpy and build them myself.

The entire code and data are available on my GITHUB page, where I also have the model coded in {rstanarm}, for anyone interested in seeing the steps in a different code language.

Loading Libraries & Data

The data I’ll be using is the {mtcars} data set, which is available in R. I’ve saved a copy in .csv format so that I can load it into my Jupyter notebook.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt 
import seaborn as sns
import pymc3 as pm
import arviz as az 
import os

# load mtcars
d = pd.read_csv('mtcars.csv')
d.head()

Exploratory Data Analysis

The model will regress mpg on engine weight (wt). Let’s plot and describe those two variables so that we have a sense for what we might be working with.

Linear Regression

Before fitting the Bayesian model, I want to fit a simple regression model to see what the coefficients look like.

import statsmodels.api as sm

x = d['wt']
y = d['mpg']

x = sm.add_constant(x)

fit = sm.OLS(y, x).fit()

fit.summary()

We can see that for every one unit increase in engine weight the miles per gallon decrease, on average, by about 5.3.

Bayesian Regression (PyMC3)

Fitting a Bayesian regression model in PyMC3 requires us to specify some priors. For this model I’ll use a prior intercept of 40 ± 10 and a prior beta for the wt variable of 0 ± 10. The thing to note here is that the priors I’m specifying priors were created by first looking at the data that has been collected (which is technically cheating). Normally we would have priors BEFORE collecting our data (using prior published research, data from a pilot study, prior intuition, etc) and then combine the prior with the observations to obtain a posterior distribution. However, the aim here is to understand how to code the model, so I’ll use these priors. I’ll write a longer blog on priors and what they do to a model in the coming weeks.

Some notes on fitting the model in PyMC3:

The model is named ‘fit_b‘
We specify the intercept as variable ‘a‘
The beta coefficient for wt is called ‘b‘
Both the intercept and slope re fit with normally distributed priors
Finally, ‘e‘ represents the model error and it is fit with a a Half Cauchy prior
Once the priors are set, the model is specified (y_pred) as mu = a + b * wt + e
The trace_b object stores our posterior samples, 2000 of them of which the first 1000 will be discarded because they are there to allow the model to tune itself

, sd = e, observed = d['mpg'])
    
    trace_b = pm.sample(2000, tune = 1000)

Once the model has been fit we plot the trace plots to see how well it performed.

We can also directly call the mean and standard deviation values of our fitted model, which are relatively similar to what we saw with the linear regression model above.

Point Predictions

Next, we want to make a single point prediction for the mpg would expect, on average, when wt is a specific value (in this example we will use wt = 3.3).

To do this, we simply store the average value of the posterior coefficients from our Bayesian regression and apply the specified model:

mu = a + b * new_wt

A car with an engine weight of 3.3 would get, on average, 19.7 mpg.

Point Prediction with Uncertainty

The point estimate is interesting (I guess), but there is uncertainty around that estimate as point predictions are never exact. We can compliment this point estimate by unveiling the uncertainty around it. The point prediction ± uncertainty interval informs us of the average value of mpg along with the uncertainty of the coefficients in our model.

To do this, we create a random sample of 1000 values from the posterior distributions for our model intercept and beta coefficient. Each of these 1000 values represent a potential intercept and slope that would be consistent with our data, which shows us the uncertainty that we have in our estimates. When we use the model equation, multiplying each of these 1000 values by the new_wt value we obtain 1000 possible predicted values of mpg given a weight of 3.3.

With this posterior distribution we can then plot a histogram of the results and obtain summary statistics such as the mean, standard deviation, and 90% credible interval.

Posterior Predictive Distribution

Finally, instead of just knowing the average predicted value of mpg ± uncertainty for the population, we might be interested in knowing what the predicted value of mpg would be for a new car in the population with a wt of 3.3. For that, we calculate the posterior predictive distribution. The uncertainty in this predictive distribution will be larger than the point prediction with uncertainty because we are using the posterior model error added to our prediction.

First, similar to the model coefficients, we have to get the random draws of our error term, which we will call sigma.

Next, we run the model as we did in step 2 above; however, we also add to each of the 1000 posterior predictions the sigma value by taking a random draw from a normal distribution with a mean of 0 and standard deviation equal to the sigma sample values.

pred_dist = intercept_sample + beta_sample * new_wt_rep + np.random.normal(loc = 0, scale = sigma_sample, size = 1000)

Finally, we plot the distribution of our predicted values along with the mean, standard deviation, and 90% credible interval. Notice that these values are larger than what we obtained in step 2 because we are now taking into account additional uncertainty about the new wt observation.

Wrapping Up

That’s a brief intro to Bayesian regression with PyMC3. There are a lot more things that we can do with PyMC3 and it’s available functions. My goal is to put together more blog articles on Bayesian modeling with both R and Python so show their flexibility. If you spot any errors, please let me know.

The data and full code (along with a companion code in {rstanarm}) is available on my GITHUB page.

Patrick Ward, PhD

Patrick Ward, PhD | Sports Science & Analytics

Bayesian Priors for Categorical Variables using rstanarm

TidyX Episode 134: Conditional Formatting DT tables

Removing columns with NA for fluid table building in shiny

TidyX Episode 133: Intro to Flexdashboard

Bayesian Linear Regression: Getting started with PyMC3