Estimating a Standard Deviation from a Small Sample

I really enjoy Allen Downey’s work and if you haven’t checked out his books Think Bayes, Probably Overthinking It, and Modeling and Simulation in Python, I highly suggest you do so. I can’t recommend his work enough.

Recently, he’s been doing this cool thing where he grabs a question from the Reddit Statistics forum and constructs a short blog article on how he might go about solving the problem at hand. You can find all the articles he’s done to date HERE. His most recent article was about estimating the standard deviation form a small sample (n = 6). To solve the individual’s problem, Allen uses grid approximation, which is a simple and convenient approach to tackling problems using a Bayesian framework. I worked through his Python code but, admittedly, I’m functional in Python but I’m not the best Pythonista. So, that got me thinking about whether we could solve this in a different way.

I’ve talked about different approaches, depending on what prior information you have available, to solving a normal-normal conjugate problem in the past. Those approaches were all directed at trying to create an estimate for the mean, while we kept the standard deviation value fixed (known). However, you can solve these equations the other way, holding the mean fixed (known) and attempt to estimate the standard deviation parameter. Like the grid approximation approach Allen used, these approaches are easy to apply (if you are willing to make some assumptions), without having to code your own Markov Chain (like we did when we wrote our own GIBBS Sampler to estimate the mean and standard deviation or to perform linear regression).

So, let’s try and tackle this problem in a different way than Allen used. The code will be available in my GitHub.

The Problem

The question that was posed indicated that the individual had 6 blood samples of potassium level from one person. What they wanted to do was estimate the range of probable values if they were to obtain say 100 or 1000 samples from the same person (without having to draw that much blood).

The Data

The observations looked like this:

## Six observations
d <- c(4.0, 3.9, 4.0, 4.0, 4.7, 4.2)

# place observation info into their own objects
obs_n <- length(d)
obs_df <- obs_n - 1
obs_mu <- mean(d)
obs_sd <- sd(d)

obs_mu
obs_sd

Prior Mean

First, we will get a prior for the mean of potassium levels in the population. Allen suggested that this is usually reported with 5th to 95th percentile values ranging from 3.5 to 5.4. So, let’s use those values to determine the prior mean and we will put a prior standard error around that value of 0.6.

low95 <- 3.5
high95 <- 5.4

prior_mu <- (low95 + high95) / 2
prior_se <- 0.6

# plot prior mu
plot(x = seq(from = 2, to = 7, length.out = 50),
     y = dnorm(x = seq(from = 2, to = 7, length.out = 50), mean = prior_mu, sd = prior_se),
     type = "l",
     main = "Prior Mu Density",
     ylab = "PDF",
     xlab = "Blood Potassium")

Prior Standard Deviation

There are a number of priors that one can use for a standard deviation — a value that cannot go negative (so bounded at 0) — such as gamma, exponential, inverse chi-square, etc. To have a good estimate of what the prior standard deviation is we’d probably need to know something about the blood testing measurement and how noisy it is as well as how much biological variability we’d expect from test to test, which Allen alluded to in his post. Allen chose to use a Gamma distribution with a shape parameter of 2 and a scale parameter of 0.5. I’ll use a different approach, but let’s look at his prior for the standard deviation with an inverse chi-square distribution.

# Downey's example uses a gamma distribution so let's plot that
alpha <- 2
beta <- 0.5

plot(x = seq(from = 0.01, to = 5, length.out = 100),
     y = dgamma(x = seq(from = 0.01, to = 5, length.out = 100), shape = alpha, scale = beta),
     type = "l",
     main = "Prior Sigma Density (Gamma Prior)",
     ylab = "PDF",
     xlab = "Sigma for Blood Potassium")

Calculate the Posterior SD

Now we get to the problem the individual was asking about — estimating the standard deviation from a small number of values.As stated above, Allen used grid approximation, so be sure to check out his article on how he did this. Instead, I’ll use an approach discussed in Chapter 15 of William Bolstad’s Introduction to Bayesian Statistics, 2nd Ed. We will use an inverse chi-square distribution and multiple it by the sum of squared error value, which represents the sum of squared difference of each observed value to our prior_mu, and take the square root of this product to go from a variance to a standard deviation.

# calculate the sum of squared error for each observation relative to the prior_mu
sse <- sum((d - prior_mu)^2)
sse

set.seed(333)
posterior_sigma <- mean(sqrt(sse *  1/rchisq(n = 1000, df = obs_n)))
posterior_sigma

The posterior estimate of the standard deviation is 0.485, which is slightly larger than what we observed in the data (obs_sd = 0.29) and a little higher than Allen got with the gamma distribution and grid approximation approach (0.404).

Posterior Mean

Because Allen used grid approximation, he had a probability mass for each value of mu and sigma, which allowed him to then create simulations to answer the question about what the standard deviation would be estimated to be with 100 or 1000 blood draws. So, I’ll calculate the posterior mean using the observed data and the prior_mu and prior_se, which we set above.

### Calculate a posterior mean
posterior_mu <- ((1 / prior_se^2 * prior_mu) + (1 / obs_sd^2 * obs_mu)) / ((1/obs_sd^2) + (1/prior_se^2))
posterior_se <- sqrt(1 / (1/obs_sd^2 + 1/prior_se^2))

posterior_mu
posterior_se

Simulating 100 Samples, Ten Times

Allen made a nice plot  from his grid approximation which showed about 10 different posterior simulations from the 1000 samples. I’ll try and recreate something similar here.

100 Samples

First, I’ll generate 100 samples using the posterior_mu and posterior_se, for the mean, and the sse and inverse chi-square distribution, for the standard deviation.

## What would we expect the standard deviation to be if we had 100 samples?
set.seed(456)
posterior_mu100 <- rnorm(n = 100, mean = posterior_mu, sd = posterior_se)
posterior_sd100 <- sqrt(sse *  1/rchisq(n = 100, df = obs_n))

df100 <- data.frame(mu = posterior_mu100, sigma = posterior_sd100)

head(df100)

Using this data frame of 100 possible posterior mu and sigma values, we will sample 100 pairs and using each pair run 100 simulations. We do this 10 different times to get ten, 100 simulated posterior mean values.

sim_storage100 <- matrix(data = NA, nrow = 100, ncol = 10)

for(i in 1:10){
  
  row_id <- sample(1:nrow(df100), size = 1, replace = TRUE)
  sim_storage100[, i] <- rnorm(n = 100, mean = df100[row_id, "mu"], sd = df100[row_id, "sigma"])
}

head(sim_storage100)

Now let’s plot the 10 lines!

Finally, we summarize the simulations in different ways

 

  1.  Get the mean and the standard deviation of each column of 100 simulations
  2. Calculate the Median ± 90% CI for each of the 10 simulations
  3. Calculate the Mean of each of the simulations and then calculate Median 90% Credible Interval across those values
  4. Finally, we can calculate the mean of the standard deviation for each of the simulations

This value is still larger than what we observed with our 6 observations (0.29) but smaller than what we observed from posterior sigma (0.489).

Wrapping Up

I thought this was a fun problem to try and solve because dealing with small numbers of observations in sport physiology is something that we see often. Knowing about the error measurement of your test may help you estimate, or get closer to, a true value for an individual based on only a few data points. Additionally, under the Bayesian framework, every new observation you make for that individual allows you to update. your prior and move closer to something that is more relevant to their biological nuances. I tried to take a different approach than what Allen did. Estimating the posterior mu and posterior sigma are useful steps that I’ve worked through before in this blog (and I highly recommend Bolstad’s book) but simulating the 100 samples, ten times, to attempt to match what Allen got with his grid approximation was new to me. I’ve never done it that way, so if anyone recognizes any errors or issues please email me. Finally, be sure to check out all of the great resources Allen Downey has been putting out.

If you’d like the code, it is available on my GitHub.

 

TidyX Episode 179 – How many SpaghettiOs does it take to write LOTR?

This week, Ellis Hughes and I have a super fun TidyX Screen Cast that incorporates a number of different important tasks in your programming arsenal: writing custom functions and returning lists.

The functions we write don’t only explain how many Spaghettio’s it takes to write out Lord of the Rings but can actually be applied to various different books and novels (as we show in the episode).

To watch the screen cast, CLICK HERE.

To access our code, CLICK HERE.

Frequentist & Bayesian Approaches to Regression Models by Hand – Compare and Contrast

One of the ways I try and learn things is to code them from first principles. It helps me see what is going on under the hood and also allows me wrap my head around how things work. Building regression models in R is incredibly easy using the lm() function and Bayesian regression models can be conveniently built with the same syntax in packages like {rstanarm} and {brms}. However, today I’m going to do everything by hand to get a grasp of how the Bayesian regression model works, at a very basic level. I put together the code using the mathematical presentation of these concepts from William Bolstad’s book, Introduction to Bayesian Statistics.

The entire script is available on my GITHUB page.

Loading Packages & Getting Data

I’m going to use the data from the {palmerpenguins} package and concentrate on a simple linear regression which will estimate flipper length from bill length.

### packages ------------------------------------------------------
library(tidyverse)
library(palmerpenguins)
library(patchwork)

theme_set(theme_classic())

### data ----------------------------------------------------------
data("penguins")
dat <- penguins %>%
  select(bill_length = bill_length_mm, flipper_length = flipper_length_mm) %>%
  na.omit()

head(dat)

 

EDA

These two variables share a relatively large correlation with each other.

Ordinary Least Squares Regression

First, I’ll start by building a regression model under the Frequentist paradigm, specifying no prior values on the parameters (even though in reality the OLS model is using a flat prior, where all values are equally plausible — a discussion for a different time), using the lm() function.

### Ordinary Least Squares Regression ------------------------------
fit_ols <- lm(flipper_length ~ I(bill_length - mean(dat$bill_length)), data = dat)
summary(fit_ols)

Technical Note: Since we don’t observe a bill length of 0 mm in penguins, I chose to transform the bill length data by grand mean centering it — subtracting each bill length from the population mean. This wont change the predictions from the model but does change our interpretation of the coefficients themselves (and the intercept is now directly interpretable, which it wouldn’t be if we left the bill length data in its untrasnformed state).

OLS Regression by Hand

Okay, that was easy. In a single line we constructed a model and we are up and running. But, let’s calculate this by hand so we can see where the coefficients and their corresponding standard errors came from.

# Calculate the least squares regression line by hand now
dat_stats <- dat %>%
  summarize(x_bar = mean(bill_length),
            x_sd = sd(bill_length),
            y_bar = mean(flipper_length),
            y_sd = sd(flipper_length),
            r = cor(bill_length, flipper_length),
            x2 = x_bar^2,
            y2 = y_bar^2,
            xy_bar = x_bar * y_bar,
            .groups = "drop")

dat_stats

In the above code, I’m calculating summary statistics (mean and SD) for both the independent and dependent variables, extracting their correlation coefficient, and getting their squared means and the product of their squared means to be used in downstream calculations of the regression coefficients. Here is what the output looks like:

The model intercept is simply the mean of our outcome variable (flipper length).

intercept <- dat_stats$y_bar

The coefficient for our independent variable is calculated as the correlation coefficient multiplied by the the SD of the y variable divided by the SD of the x variable.

beta <- with(dat_stats,
             r * (y_sd / x_sd))

beta

Finally, I’ll store the grand mean of bill length so that we can use it when we need to center bill length and make predictions.

x_bar <- dat_stats$x_bar
x_bar

Now that we have the intercept and beta coefficient for bill length calculated by hand we can construct the regression equation:

Notice that these are the exact values we obtained from our model using the lm() function.

We can use the two models to make predictions and show that they are the exact same.

## Make predictions with the two models
dat %>%
  mutate(pred_model = predict(fit_ols),
         pred_hand = intercept + beta * (bill_length - x_bar))

In the model estimated from the lm() function we also get a sigma parameter (residual squared error, RSE), which tells us, on average, how off the model predictions are. We can build this by hand by first calculating the squared error of the observed values and our predictions and then calculating the RSE as the square root of the sum of squared residuals divided by the model degrees of freedom.

# Calculate the estimated variance around the line
N_obs <- nrow(dat) dat %>%
  mutate(pred = intercept + beta * (bill_length - x_bar),
         resid = (flipper_length - pred),
         resid2 = resid^2) %>%
  summarize(n_model_params = 2,
            deg_freedom = N_obs - n_model_params,
            model_var = sum(resid2) / deg_freedom,
            model_sd = sqrt(model_var))

Again, our by hand calculations equal exactly as what we obtained from the lm() function, 10.6.

Bayesian Linear Regression by Hand

Now let’s turn our attention to building a Bayesian regression model.

First we need to calculate the sum of squared error for X, which we do with this equation:

ss_x = N * (mean(mu_x^2) – mu_x^2)

N <- nrow(dat)
mean_x2 <- mean(dat$bill_length^2)
mu_x2 <- mean(dat$bill_length)^2

N
mean_x2
mu_x2

ss_x <- N * (mean_x2 - mu_x2)
ss_x

Next, we need to specify some priors on the model parameters. The priors can come from a number of courses. We could set them subjectively (usually people hate this idea). Or, we could look in the scientific literature and use prior research as a guide for plausible values of the relationship between flipper length and bill length. In this case, I’m just going to specify some priors that are within the range of the data but have enough variance around them to “let the data speak”. (NOTE: you could rerun the entire analysis with different priors and smaller or larger variances to see how the model changes. I hope to do a longer blog post about priors in the future).

  • For the slope coefficient we decide to have a normal prior, N(1, 2^2)
  • For the intercept coefficient we choose a normal prior, N(180, 10^2)
  • We don’t know the true variance so we use the estimated variance from the least squares regression line
prior_model_var <- sigma(fit_ols)^2

prior_slope_mu <- 1
prior_slope_var <- 1^2

prior_intercept_mu <- 180
prior_intercept_var <- 10^2

Next, we calculate the posterior precision (1/variance) for the slope coefficient. We do it with this equation:

1/prior_slope_var + (ss_x / prior_model_var)

And then convert it to a standard deviation (which is more useful to us and easier to interpret than precision, since it is on the scale of the data).

# 1/prior_slope_var + (ss_x / prior_model_var)
posterior_slope_precision <- 1 / prior_slope_var + ss_x / prior_model_var

# Convert to SD
posterior_slope_sd <- posterior_slope_precision^-(1/2)

Once we have the precision for the posterior slope calculated we can calculate the posterior regression coefficient for the slope using this equation:

(1/prior_slope_var) / posterior_slope_var * prior_slope_mu + (ss_x / prior_model_var) / posterior_slope_var * beta

# posterior slope
posterior_slope_mu <- (1/prior_slope_var) / posterior_slope_precision * prior_slope_mu + (ss_x / prior_model_var) / posterior_slope_precision * beta
posterior_slope_mu

We can plot the prior and posterior slope values, by first simulating the posterior slope using its mu and SD values.

## Plot prior and posterior for the slope
set.seed(4)
prior_sim <- rnorm(n = 1e4, mean = prior_slope_mu, sd = sqrt(prior_slope_var))
posterior_sim <- rnorm(n = 1e4, mean = posterior_slope_mu, sd = posterior_slope_sd)

plot(density(posterior_sim),
     col = "blue",
     lwd = 4,
     xlim = c(-2, 3),
     main = "Prior & Posterior\nfor\nBayesian Regression Slope Coefficient")
lines(density(prior_sim),
      col = "red",
      lty = 2,
      lwd = 4)
legend("topleft",
       legend = c("Prior", "Posterior"),
       col = c("red", "blue"),
       lty = c(2, 1),
       lwd = c(2,2))

We notice that the prior for the slope, N(1, 1^2), had a rather broad range of plausible values. However, after observing some data and combining that observed data with our prior, we find the posterior to have settled into a reasonable range given the value, Again, if you had selected other priors or perhaps a prior with a much narrower variance, these results would be different and potentially more influenced by your prior.

Now that we have a slope parameter we need to calculate the intercept. The equations are commented out in the code below.

## Posterior precision for the intercept
# 1/prior_intercept_var + N/prior_model_var

posterior_intercept_precision <- 1/prior_intercept_var + N/prior_model_var
posterior_intercept_precision

# Convert to SD
posterior_intercept_sd <- posterior_intercept_precision^-(1/2)
posterior_intercept_sd

## Posterior intercept mean
# (1/prior_intercept_var) / posterior_intercept_precision * prior_intercept_mu + (N/prior_model_var) / posterior_intercept_precision * intercept
posterior_intercept_mu <- (1/prior_intercept_var) / posterior_intercept_precision * prior_intercept_mu + (N/prior_model_var) / posterior_intercept_precision * intercept
posterior_intercept_mu

Comparing our Bayesian regression coefficients and standard errors to the ones we obtained from the lm() function, we find that they are nearly identical.

Using the coefficients and their corresponding standard errors we can also compare the 95% Credible Interval from the Bayesian model to the 95% Confidence Interval from the OLS model (again, nearly identical).

And writing out the Bayesian model we see that it is the same as the Frequentist model.

Using the posterior mean and SDs for the slope and intercept we can simulate a distribution and plot and summarize them using quantile intervals and credible intervals, to get a picture of our uncertainty.

## Use simulation for the slope and intercept
set.seed(413)
posterior_intercept_sim <- rnorm(n = 1e4, mean = posterior_intercept_mu, sd = posterior_slope_sd)
posterior_slope_sim <- rnorm(n = 1e4, mean = posterior_slope_mu, sd = posterior_slope_sd)

par(mfrow = c(1, 2))
hist(posterior_intercept_sim)
hist(posterior_slope_sim)

Making Predictions

Now that we have our models specified it’s time to make some predictions!

## Add all predictions to the original data set
dat_final <- dat %>%
  mutate(pred_ols = predict(fit_ols),
         pred_ols_by_hand = intercept + beta * (bill_length - x_bar),
         pred_bayes_by_hand = posterior_intercept_mu + posterior_slope_mu * (bill_length - x_bar))

head(dat_final, 10)

Because the equations were pretty much identical we can see that all three models (OLS with lm(), OLS by hand, and Bayes by hand) produce the same predicted values for flipper length.

We can also visualize these predictions along with the prediction intervals.

## Predictions with uncertainty from the OLS model
ols_preds <- dat_final %>%
  bind_cols(predict(fit_ols, newdata = ., interval = "prediction")) %>%
  ggplot(aes(x = fit, y = flipper_length)) +
  geom_point(size = 2) +
  geom_ribbon(aes(ymin = lwr, ymax = upr),
              alpha = 0.4) +
  geom_smooth(method = "lm",
              se = FALSE) +
  ggtitle("OLS Predictions with 95% Prediction Intervals")

## Now predict with the Bayesian Model
# The error is calculated as:
# residual_variance = prior_model_var +  posterior_intercept_sd^2 + posterior_slope^2*(x - x_bar)
bayes_preds <- dat %>%
  mutate(fit = posterior_intercept_mu + posterior_slope_mu * (bill_length-x_bar),
         error_var = prior_model_var + posterior_intercept_sd^2 + posterior_slope_sd^2 * (bill_length - x_bar),
         rse = sqrt(error_var),
         lwr = fit - qt(p = 0.975, df = N - 2) * rse,
         upr = fit + qt(p = 0.975, df = N - 2) * rse) %>%
  ggplot(aes(x = fit, y = flipper_length)) +
  geom_point(size = 2) +
  geom_ribbon(aes(ymin = lwr, ymax = upr),
              alpha = 0.4) +
  geom_smooth(method = "lm",
              se = FALSE) +
  ggtitle("Bayesian Predictions with 95% Prediction Intervals")


ols_preds | bayes_preds

Okay, I know what you are thinking. The equations were the same, the predictions are the same, and the prediction intervals are the same….What’s the big deal?!

The big deal is that the Bayesian model gives us more flexibility. Not only can we specify priors, allowing us to have a higher weighting on more plausible values (based on prior data, prior research, domain expertise, etc.); but, we can also produce an entire posterior distribution for the predictions.

Let’s take a single row of observation from the data and make a prediction on it, complete with 95% credible intervals, using our Bayesian model.

single_obs <- dat %>% slice(36)
single_obs

pred_bayes <- single_obs %>%
  mutate(fit = posterior_intercept_mu + posterior_slope_mu * (bill_length-x_bar),
         error_var = prior_model_var + posterior_intercept_sd^2 + posterior_slope_sd^2 * (bill_length - x_bar),
         rse = sqrt(error_var),
         lwr = fit - qt(p = 0.975, df = N - 2) * rse,
         upr = fit + qt(p = 0.975, df = N - 2) * rse) 

Next, we simulate 10,000 observations using the mean prediction (fit) and the rse from the table above and summarize the results

set.seed(582)
pred_bayes_sim <- rnorm(n = 1e4, mean = pred_bayes$fit, sd = pred_bayes$rse)

mean(pred_bayes_sim)
quantile(pred_bayes_sim, probs = c(0.5, 0.025, 0.975))
mean(pred_bayes_sim) + qt(p = c(0.025, 0.975), df = N - 2) * sd(pred_bayes_sim)

Our mean value, the 95% quantile intervals, and then 95% credible intervals are the similar to the table above (they should be since we simulated with those parameters). We will also get similar values if we use our OLS model and make a prediction on this observation with prediction intervals.

But, because we simulated a full distribution, from the Bayesian prediction we also get 10,000 plausible values for the estimate of flipper length based on the observed bill length.

This offers us flexibility to make more direct statements about the predictive distribution. For example, we might ask, “What is the probability that the predicted flipper length is greater than 210 mm?” and we could directly compute this from the simulated data! This provides us with a number of options if we were using this model to make decisions on and perhaps have threshold values or ranges of critical importance where we need to know how much of the density of our predicted distribution is below or above the threshold or falls outside of the region of importance.

Wrapping Up

That’s a little bit on working with regression models by hand, from first principles. The Bayesian approach affords us a lot of control and flexibility over the model, via priors (more on that in a future blog post) and the predictions (via simulation). The Bayesian regression model calculated here was a simple approach and did not require Markov Chain Monte Carlo (MCMC), which would be run when using {rstanarm} and {brms}. If you are curious how this works, I wrote a previous blog on using a GIBBS Sampler to estimate the posterior distributions for a regression model, CLICK HERE. But, even with this simple approach we are afforded the ability to specify priors on the model parameters of the deterministic/fixed component of the model. We did not specify a prior on the stochastic/error part of the model (we used the sigma value calculated from the OLS model — which we also calculated by hand). This was necessary in this simple conjugate approach because the normal distribution is a two parameter distribution so we needed to treat one of the parameters as fixed (in this case the SD). Had we used an MCMC approach, we could have further specified a prior distribution on the model error.

If you notice any errors, feel free to drop me an email.

The code is available on my GITHUB page.

TidyX Episode 176: Extracting all predictions from a Random Forest, Building Uncertainty Intervals, and Creating a Player Compare Tool

In this installment of predicting whether MLB pitchers will make the Hall of Fame, Ellis Hughes and I take the Random Forest model we built in Episode 175 and discuss how you can extract the predictions from each of the individual trees and then build uncertainty intervals so that you can construct an entire distribution of predictions. We then take these predictions and build a custom function for comparing the predicted distribution of two players and whether they will make it to the Hall of Fame.

To watch the screen cast, CLICK HERE.

To access our code, CLICK HERE.