Estimating a Standard Deviation from a Small Sample

I really enjoy Allen Downey’s work and if you haven’t checked out his books Think Bayes, Probably Overthinking It, and Modeling and Simulation in Python, I highly suggest you do so. I can’t recommend his work enough.

Recently, he’s been doing this cool thing where he grabs a question from the Reddit Statistics forum and constructs a short blog article on how he might go about solving the problem at hand. You can find all the articles he’s done to date HERE. His most recent article was about estimating the standard deviation form a small sample (n = 6). To solve the individual’s problem, Allen uses grid approximation, which is a simple and convenient approach to tackling problems using a Bayesian framework. I worked through his Python code but, admittedly, I’m functional in Python but I’m not the best Pythonista. So, that got me thinking about whether we could solve this in a different way.

I’ve talked about different approaches, depending on what prior information you have available, to solving a normal-normal conjugate problem in the past. Those approaches were all directed at trying to create an estimate for the mean, while we kept the standard deviation value fixed (known). However, you can solve these equations the other way, holding the mean fixed (known) and attempt to estimate the standard deviation parameter. Like the grid approximation approach Allen used, these approaches are easy to apply (if you are willing to make some assumptions), without having to code your own Markov Chain (like we did when we wrote our own GIBBS Sampler to estimate the mean and standard deviation or to perform linear regression).

So, let’s try and tackle this problem in a different way than Allen used. The code will be available in my GitHub.

The Problem

The question that was posed indicated that the individual had 6 blood samples of potassium level from one person. What they wanted to do was estimate the range of probable values if they were to obtain say 100 or 1000 samples from the same person (without having to draw that much blood).

The Data

The observations looked like this:

## Six observations
d <- c(4.0, 3.9, 4.0, 4.0, 4.7, 4.2)

# place observation info into their own objects
obs_n <- length(d)
obs_df <- obs_n - 1
obs_mu <- mean(d)
obs_sd <- sd(d)

obs_mu
obs_sd

Prior Mean

First, we will get a prior for the mean of potassium levels in the population. Allen suggested that this is usually reported with 5th to 95th percentile values ranging from 3.5 to 5.4. So, let’s use those values to determine the prior mean and we will put a prior standard error around that value of 0.6.

low95 <- 3.5
high95 <- 5.4

prior_mu <- (low95 + high95) / 2
prior_se <- 0.6

# plot prior mu
plot(x = seq(from = 2, to = 7, length.out = 50),
     y = dnorm(x = seq(from = 2, to = 7, length.out = 50), mean = prior_mu, sd = prior_se),
     type = "l",
     main = "Prior Mu Density",
     ylab = "PDF",
     xlab = "Blood Potassium")

Prior Standard Deviation

There are a number of priors that one can use for a standard deviation — a value that cannot go negative (so bounded at 0) — such as gamma, exponential, inverse chi-square, etc. To have a good estimate of what the prior standard deviation is we’d probably need to know something about the blood testing measurement and how noisy it is as well as how much biological variability we’d expect from test to test, which Allen alluded to in his post. Allen chose to use a Gamma distribution with a shape parameter of 2 and a scale parameter of 0.5. I’ll use a different approach, but let’s look at his prior for the standard deviation with an inverse chi-square distribution.

# Downey's example uses a gamma distribution so let's plot that
alpha <- 2
beta <- 0.5

plot(x = seq(from = 0.01, to = 5, length.out = 100),
     y = dgamma(x = seq(from = 0.01, to = 5, length.out = 100), shape = alpha, scale = beta),
     type = "l",
     main = "Prior Sigma Density (Gamma Prior)",
     ylab = "PDF",
     xlab = "Sigma for Blood Potassium")

Calculate the Posterior SD

Now we get to the problem the individual was asking about — estimating the standard deviation from a small number of values.As stated above, Allen used grid approximation, so be sure to check out his article on how he did this. Instead, I’ll use an approach discussed in Chapter 15 of William Bolstad’s Introduction to Bayesian Statistics, 2nd Ed. We will use an inverse chi-square distribution and multiple it by the sum of squared error value, which represents the sum of squared difference of each observed value to our prior_mu, and take the square root of this product to go from a variance to a standard deviation.

# calculate the sum of squared error for each observation relative to the prior_mu
sse <- sum((d - prior_mu)^2)
sse

set.seed(333)
posterior_sigma <- mean(sqrt(sse *  1/rchisq(n = 1000, df = obs_n)))
posterior_sigma

The posterior estimate of the standard deviation is 0.485, which is slightly larger than what we observed in the data (obs_sd = 0.29) and a little higher than Allen got with the gamma distribution and grid approximation approach (0.404).

Posterior Mean

Because Allen used grid approximation, he had a probability mass for each value of mu and sigma, which allowed him to then create simulations to answer the question about what the standard deviation would be estimated to be with 100 or 1000 blood draws. So, I’ll calculate the posterior mean using the observed data and the prior_mu and prior_se, which we set above.

### Calculate a posterior mean
posterior_mu <- ((1 / prior_se^2 * prior_mu) + (1 / obs_sd^2 * obs_mu)) / ((1/obs_sd^2) + (1/prior_se^2))
posterior_se <- sqrt(1 / (1/obs_sd^2 + 1/prior_se^2))

posterior_mu
posterior_se

Simulating 100 Samples, Ten Times

Allen made a nice plot from his grid approximation which showed about 10 different posterior simulations from the 1000 samples. I’ll try and recreate something similar here.

100 Samples

First, I’ll generate 100 samples using the posterior_mu and posterior_se, for the mean, and the sse and inverse chi-square distribution, for the standard deviation.

## What would we expect the standard deviation to be if we had 100 samples?
set.seed(456)
posterior_mu100 <- rnorm(n = 100, mean = posterior_mu, sd = posterior_se)
posterior_sd100 <- sqrt(sse *  1/rchisq(n = 100, df = obs_n))

df100 <- data.frame(mu = posterior_mu100, sigma = posterior_sd100)

head(df100)

Using this data frame of 100 possible posterior mu and sigma values, we will sample 100 pairs and using each pair run 100 simulations. We do this 10 different times to get ten, 100 simulated posterior mean values.

sim_storage100 <- matrix(data = NA, nrow = 100, ncol = 10)

for(i in 1:10){
  
  row_id <- sample(1:nrow(df100), size = 1, replace = TRUE)
  sim_storage100[, i] <- rnorm(n = 100, mean = df100[row_id, "mu"], sd = df100[row_id, "sigma"])
}

head(sim_storage100)

Now let’s plot the 10 lines!

Finally, we summarize the simulations in different ways

Get the mean and the standard deviation of each column of 100 simulations
Calculate the Median ± 90% CI for each of the 10 simulations
Calculate the Mean of each of the simulations and then calculate Median 90% Credible Interval across those values
Finally, we can calculate the mean of the standard deviation for each of the simulations

This value is still larger than what we observed with our 6 observations (0.29) but smaller than what we observed from posterior sigma (0.489).

Wrapping Up

I thought this was a fun problem to try and solve because dealing with small numbers of observations in sport physiology is something that we see often. Knowing about the error measurement of your test may help you estimate, or get closer to, a true value for an individual based on only a few data points. Additionally, under the Bayesian framework, every new observation you make for that individual allows you to update. your prior and move closer to something that is more relevant to their biological nuances. I tried to take a different approach than what Allen did. Estimating the posterior mu and posterior sigma are useful steps that I’ve worked through before in this blog (and I highly recommend Bolstad’s book) but simulating the 100 samples, ten times, to attempt to match what Allen got with his grid approximation was new to me. I’ve never done it that way, so if anyone recognizes any errors or issues please email me. Finally, be sure to check out all of the great resources Allen Downey has been putting out.

If you’d like the code, it is available on my GitHub.