Category Archives: Bayesian Model Building

Bayesian Modeling with STAN Part 3 – Handling Categorical Predictors

We’ve been discussing how to code STAN models and Part 1 and Part 2 served as a basic intro for simple linear regression, multiple linear regression, and model checking. Today, we cover how to code categorical variables into your models. STAN doesn’t handle categorical variables so, it’s on the user to properly index these variables as integers in the model. This can take a bit of practice and can get confusing. In this blog article I offer a few different approaches to accomplish the same task.

Click here to read the RMarkdown Blog Article >> STAN-Part-3—Handling-Categorical-Predictors

Click here to access the code >> STAN Part 3 – Handling Categorical Predictors

Bayesian Modeling with STAN Part 2 – Multiple Linear Regression & Model Checking

In Part 1 of this series I went over some basics of STAN coding and we built our first linear regression. In Part 2, we will extend the simple linear model to a multiple linear regression and discuss some model checking methods for convergence and diagnosis of the Markov Chain.

Click here to read the RMarkdown Blog Article >> STAN Part-2 – Multiple Linear Regression & Model Checking

Click here to access the code >> STAN Part 2 – Multiple Linear Regression & Model Checking

What’s the Plan Stan? Bayesian Modeling with STAN Part 1

For those interested in Bayesian models and probabilistic coding, I’m starting a new series on coding in STAN. We’ve done a lot of Bayesian coding in this blog, often with {rstanarm}, {bmrs}, and coding empirical Bayesian models by hand. We’ve also done a little bit with {PyMC3}, from the Python coding language.

STAN offers us a lot more flexibility to specify whatever model we want and full autonomy of the priors we can place on the various parameters. It is a little challenging given it is coming out of the C++ coding language. So, we will start slow and build up. This first post will simply be about how to code a STAN model and extract information from it. We will progress to building more complex models, modeling different outcome distributions, building hierarchical models, evaluating model diagnostics, and making predictions from the model.

I’m going to work with the {rstan} and {cmdstanr} packages to allow us to use R as our coding language and then call the C++ compiler when we need to fit the model. Along the way I’ll also provide some Jupyter Notebooks and show how to code stand models in Python using {cmdstanpy}, a Python analog for {cmdstanr} that allows us to run our same STAN models in Python.

In order to not repeat everything here in the blog, I’ll simply update the blog with links to my github repo for each new section so that all of the code is accessible and housed in one place.

Click here to read the RMarkdown Blog Article >> STAN Part 1 – Intro to STAN Code

Click here to access the code >> STAN Part 1 – Intro to STAN Code

 

Learning Bayesian Statistics Podcast

I recently had the please of speaking with Alex Andorra on his amazing Learning Bayesian Statistics Podcast. This was an amazing experience given the quality of people he has interviewed in the past and the fact that this is one of my favorite podcasts. We talked a lot about Bayesian statistics in sport. Hopefully others find it interesting.

Listen here << Learning Bayesian Statistics Podcast >>

Estimating a Standard Deviation from a Small Sample

I really enjoy Allen Downey’s work and if you haven’t checked out his books Think Bayes, Probably Overthinking It, and Modeling and Simulation in Python, I highly suggest you do so. I can’t recommend his work enough.

Recently, he’s been doing this cool thing where he grabs a question from the Reddit Statistics forum and constructs a short blog article on how he might go about solving the problem at hand. You can find all the articles he’s done to date HERE. His most recent article was about estimating the standard deviation form a small sample (n = 6). To solve the individual’s problem, Allen uses grid approximation, which is a simple and convenient approach to tackling problems using a Bayesian framework. I worked through his Python code but, admittedly, I’m functional in Python but I’m not the best Pythonista. So, that got me thinking about whether we could solve this in a different way.

I’ve talked about different approaches, depending on what prior information you have available, to solving a normal-normal conjugate problem in the past. Those approaches were all directed at trying to create an estimate for the mean, while we kept the standard deviation value fixed (known). However, you can solve these equations the other way, holding the mean fixed (known) and attempt to estimate the standard deviation parameter. Like the grid approximation approach Allen used, these approaches are easy to apply (if you are willing to make some assumptions), without having to code your own Markov Chain (like we did when we wrote our own GIBBS Sampler to estimate the mean and standard deviation or to perform linear regression).

So, let’s try and tackle this problem in a different way than Allen used. The code will be available in my GitHub.

The Problem

The question that was posed indicated that the individual had 6 blood samples of potassium level from one person. What they wanted to do was estimate the range of probable values if they were to obtain say 100 or 1000 samples from the same person (without having to draw that much blood).

The Data

The observations looked like this:

## Six observations
d <- c(4.0, 3.9, 4.0, 4.0, 4.7, 4.2)

# place observation info into their own objects
obs_n <- length(d)
obs_df <- obs_n - 1
obs_mu <- mean(d)
obs_sd <- sd(d)

obs_mu
obs_sd

Prior Mean

First, we will get a prior for the mean of potassium levels in the population. Allen suggested that this is usually reported with 5th to 95th percentile values ranging from 3.5 to 5.4. So, let’s use those values to determine the prior mean and we will put a prior standard error around that value of 0.6.

low95 <- 3.5
high95 <- 5.4

prior_mu <- (low95 + high95) / 2
prior_se <- 0.6

# plot prior mu
plot(x = seq(from = 2, to = 7, length.out = 50),
     y = dnorm(x = seq(from = 2, to = 7, length.out = 50), mean = prior_mu, sd = prior_se),
     type = "l",
     main = "Prior Mu Density",
     ylab = "PDF",
     xlab = "Blood Potassium")

Prior Standard Deviation

There are a number of priors that one can use for a standard deviation — a value that cannot go negative (so bounded at 0) — such as gamma, exponential, inverse chi-square, etc. To have a good estimate of what the prior standard deviation is we’d probably need to know something about the blood testing measurement and how noisy it is as well as how much biological variability we’d expect from test to test, which Allen alluded to in his post. Allen chose to use a Gamma distribution with a shape parameter of 2 and a scale parameter of 0.5. I’ll use a different approach, but let’s look at his prior for the standard deviation with an inverse chi-square distribution.

# Downey's example uses a gamma distribution so let's plot that
alpha <- 2
beta <- 0.5

plot(x = seq(from = 0.01, to = 5, length.out = 100),
     y = dgamma(x = seq(from = 0.01, to = 5, length.out = 100), shape = alpha, scale = beta),
     type = "l",
     main = "Prior Sigma Density (Gamma Prior)",
     ylab = "PDF",
     xlab = "Sigma for Blood Potassium")

Calculate the Posterior SD

Now we get to the problem the individual was asking about — estimating the standard deviation from a small number of values.As stated above, Allen used grid approximation, so be sure to check out his article on how he did this. Instead, I’ll use an approach discussed in Chapter 15 of William Bolstad’s Introduction to Bayesian Statistics, 2nd Ed. We will use an inverse chi-square distribution and multiple it by the sum of squared error value, which represents the sum of squared difference of each observed value to our prior_mu, and take the square root of this product to go from a variance to a standard deviation.

# calculate the sum of squared error for each observation relative to the prior_mu
sse <- sum((d - prior_mu)^2)
sse

set.seed(333)
posterior_sigma <- mean(sqrt(sse *  1/rchisq(n = 1000, df = obs_n)))
posterior_sigma

The posterior estimate of the standard deviation is 0.485, which is slightly larger than what we observed in the data (obs_sd = 0.29) and a little higher than Allen got with the gamma distribution and grid approximation approach (0.404).

Posterior Mean

Because Allen used grid approximation, he had a probability mass for each value of mu and sigma, which allowed him to then create simulations to answer the question about what the standard deviation would be estimated to be with 100 or 1000 blood draws. So, I’ll calculate the posterior mean using the observed data and the prior_mu and prior_se, which we set above.

### Calculate a posterior mean
posterior_mu <- ((1 / prior_se^2 * prior_mu) + (1 / obs_sd^2 * obs_mu)) / ((1/obs_sd^2) + (1/prior_se^2))
posterior_se <- sqrt(1 / (1/obs_sd^2 + 1/prior_se^2))

posterior_mu
posterior_se

Simulating 100 Samples, Ten Times

Allen made a nice plot  from his grid approximation which showed about 10 different posterior simulations from the 1000 samples. I’ll try and recreate something similar here.

100 Samples

First, I’ll generate 100 samples using the posterior_mu and posterior_se, for the mean, and the sse and inverse chi-square distribution, for the standard deviation.

## What would we expect the standard deviation to be if we had 100 samples?
set.seed(456)
posterior_mu100 <- rnorm(n = 100, mean = posterior_mu, sd = posterior_se)
posterior_sd100 <- sqrt(sse *  1/rchisq(n = 100, df = obs_n))

df100 <- data.frame(mu = posterior_mu100, sigma = posterior_sd100)

head(df100)

Using this data frame of 100 possible posterior mu and sigma values, we will sample 100 pairs and using each pair run 100 simulations. We do this 10 different times to get ten, 100 simulated posterior mean values.

sim_storage100 <- matrix(data = NA, nrow = 100, ncol = 10)

for(i in 1:10){
  
  row_id <- sample(1:nrow(df100), size = 1, replace = TRUE)
  sim_storage100[, i] <- rnorm(n = 100, mean = df100[row_id, "mu"], sd = df100[row_id, "sigma"])
}

head(sim_storage100)

Now let’s plot the 10 lines!

Finally, we summarize the simulations in different ways

 

  1.  Get the mean and the standard deviation of each column of 100 simulations
  2. Calculate the Median ± 90% CI for each of the 10 simulations
  3. Calculate the Mean of each of the simulations and then calculate Median 90% Credible Interval across those values
  4. Finally, we can calculate the mean of the standard deviation for each of the simulations

This value is still larger than what we observed with our 6 observations (0.29) but smaller than what we observed from posterior sigma (0.489).

Wrapping Up

I thought this was a fun problem to try and solve because dealing with small numbers of observations in sport physiology is something that we see often. Knowing about the error measurement of your test may help you estimate, or get closer to, a true value for an individual based on only a few data points. Additionally, under the Bayesian framework, every new observation you make for that individual allows you to update. your prior and move closer to something that is more relevant to their biological nuances. I tried to take a different approach than what Allen did. Estimating the posterior mu and posterior sigma are useful steps that I’ve worked through before in this blog (and I highly recommend Bolstad’s book) but simulating the 100 samples, ten times, to attempt to match what Allen got with his grid approximation was new to me. I’ve never done it that way, so if anyone recognizes any errors or issues please email me. Finally, be sure to check out all of the great resources Allen Downey has been putting out.

If you’d like the code, it is available on my GitHub.