Category Archives: Bayesian Model Building

Learning Bayesian Statistics Podcast

I recently had the please of speaking with Alex Andorra on his amazing Learning Bayesian Statistics Podcast. This was an amazing experience given the quality of people he has interviewed in the past and the fact that this is one of my favorite podcasts. We talked a lot about Bayesian statistics in sport. Hopefully others find it interesting.

Listen here << Learning Bayesian Statistics Podcast >>

Estimating a Standard Deviation from a Small Sample

I really enjoy Allen Downey’s work and if you haven’t checked out his books Think Bayes, Probably Overthinking It, and Modeling and Simulation in Python, I highly suggest you do so. I can’t recommend his work enough.

Recently, he’s been doing this cool thing where he grabs a question from the Reddit Statistics forum and constructs a short blog article on how he might go about solving the problem at hand. You can find all the articles he’s done to date HERE. His most recent article was about estimating the standard deviation form a small sample (n = 6). To solve the individual’s problem, Allen uses grid approximation, which is a simple and convenient approach to tackling problems using a Bayesian framework. I worked through his Python code but, admittedly, I’m functional in Python but I’m not the best Pythonista. So, that got me thinking about whether we could solve this in a different way.

I’ve talked about different approaches, depending on what prior information you have available, to solving a normal-normal conjugate problem in the past. Those approaches were all directed at trying to create an estimate for the mean, while we kept the standard deviation value fixed (known). However, you can solve these equations the other way, holding the mean fixed (known) and attempt to estimate the standard deviation parameter. Like the grid approximation approach Allen used, these approaches are easy to apply (if you are willing to make some assumptions), without having to code your own Markov Chain (like we did when we wrote our own GIBBS Sampler to estimate the mean and standard deviation or to perform linear regression).

So, let’s try and tackle this problem in a different way than Allen used. The code will be available in my GitHub.

The Problem

The question that was posed indicated that the individual had 6 blood samples of potassium level from one person. What they wanted to do was estimate the range of probable values if they were to obtain say 100 or 1000 samples from the same person (without having to draw that much blood).

The Data

The observations looked like this:

## Six observations
d <- c(4.0, 3.9, 4.0, 4.0, 4.7, 4.2)

# place observation info into their own objects
obs_n <- length(d)
obs_df <- obs_n - 1
obs_mu <- mean(d)
obs_sd <- sd(d)

obs_mu
obs_sd

Prior Mean

First, we will get a prior for the mean of potassium levels in the population. Allen suggested that this is usually reported with 5th to 95th percentile values ranging from 3.5 to 5.4. So, let’s use those values to determine the prior mean and we will put a prior standard error around that value of 0.6.

low95 <- 3.5
high95 <- 5.4

prior_mu <- (low95 + high95) / 2
prior_se <- 0.6

# plot prior mu
plot(x = seq(from = 2, to = 7, length.out = 50),
     y = dnorm(x = seq(from = 2, to = 7, length.out = 50), mean = prior_mu, sd = prior_se),
     type = "l",
     main = "Prior Mu Density",
     ylab = "PDF",
     xlab = "Blood Potassium")

Prior Standard Deviation

There are a number of priors that one can use for a standard deviation — a value that cannot go negative (so bounded at 0) — such as gamma, exponential, inverse chi-square, etc. To have a good estimate of what the prior standard deviation is we’d probably need to know something about the blood testing measurement and how noisy it is as well as how much biological variability we’d expect from test to test, which Allen alluded to in his post. Allen chose to use a Gamma distribution with a shape parameter of 2 and a scale parameter of 0.5. I’ll use a different approach, but let’s look at his prior for the standard deviation with an inverse chi-square distribution.

# Downey's example uses a gamma distribution so let's plot that
alpha <- 2
beta <- 0.5

plot(x = seq(from = 0.01, to = 5, length.out = 100),
     y = dgamma(x = seq(from = 0.01, to = 5, length.out = 100), shape = alpha, scale = beta),
     type = "l",
     main = "Prior Sigma Density (Gamma Prior)",
     ylab = "PDF",
     xlab = "Sigma for Blood Potassium")

Calculate the Posterior SD

Now we get to the problem the individual was asking about — estimating the standard deviation from a small number of values.As stated above, Allen used grid approximation, so be sure to check out his article on how he did this. Instead, I’ll use an approach discussed in Chapter 15 of William Bolstad’s Introduction to Bayesian Statistics, 2nd Ed. We will use an inverse chi-square distribution and multiple it by the sum of squared error value, which represents the sum of squared difference of each observed value to our prior_mu, and take the square root of this product to go from a variance to a standard deviation.

# calculate the sum of squared error for each observation relative to the prior_mu
sse <- sum((d - prior_mu)^2)
sse

set.seed(333)
posterior_sigma <- mean(sqrt(sse *  1/rchisq(n = 1000, df = obs_n)))
posterior_sigma

The posterior estimate of the standard deviation is 0.485, which is slightly larger than what we observed in the data (obs_sd = 0.29) and a little higher than Allen got with the gamma distribution and grid approximation approach (0.404).

Posterior Mean

Because Allen used grid approximation, he had a probability mass for each value of mu and sigma, which allowed him to then create simulations to answer the question about what the standard deviation would be estimated to be with 100 or 1000 blood draws. So, I’ll calculate the posterior mean using the observed data and the prior_mu and prior_se, which we set above.

### Calculate a posterior mean
posterior_mu <- ((1 / prior_se^2 * prior_mu) + (1 / obs_sd^2 * obs_mu)) / ((1/obs_sd^2) + (1/prior_se^2))
posterior_se <- sqrt(1 / (1/obs_sd^2 + 1/prior_se^2))

posterior_mu
posterior_se

Simulating 100 Samples, Ten Times

Allen made a nice plot from his grid approximation which showed about 10 different posterior simulations from the 1000 samples. I’ll try and recreate something similar here.

100 Samples

First, I’ll generate 100 samples using the posterior_mu and posterior_se, for the mean, and the sse and inverse chi-square distribution, for the standard deviation.

## What would we expect the standard deviation to be if we had 100 samples?
set.seed(456)
posterior_mu100 <- rnorm(n = 100, mean = posterior_mu, sd = posterior_se)
posterior_sd100 <- sqrt(sse *  1/rchisq(n = 100, df = obs_n))

df100 <- data.frame(mu = posterior_mu100, sigma = posterior_sd100)

head(df100)

Using this data frame of 100 possible posterior mu and sigma values, we will sample 100 pairs and using each pair run 100 simulations. We do this 10 different times to get ten, 100 simulated posterior mean values.

sim_storage100 <- matrix(data = NA, nrow = 100, ncol = 10)

for(i in 1:10){
  
  row_id <- sample(1:nrow(df100), size = 1, replace = TRUE)
  sim_storage100[, i] <- rnorm(n = 100, mean = df100[row_id, "mu"], sd = df100[row_id, "sigma"])
}

head(sim_storage100)

Now let’s plot the 10 lines!

Finally, we summarize the simulations in different ways

Get the mean and the standard deviation of each column of 100 simulations
Calculate the Median ± 90% CI for each of the 10 simulations
Calculate the Mean of each of the simulations and then calculate Median 90% Credible Interval across those values
Finally, we can calculate the mean of the standard deviation for each of the simulations

This value is still larger than what we observed with our 6 observations (0.29) but smaller than what we observed from posterior sigma (0.489).

Wrapping Up

I thought this was a fun problem to try and solve because dealing with small numbers of observations in sport physiology is something that we see often. Knowing about the error measurement of your test may help you estimate, or get closer to, a true value for an individual based on only a few data points. Additionally, under the Bayesian framework, every new observation you make for that individual allows you to update. your prior and move closer to something that is more relevant to their biological nuances. I tried to take a different approach than what Allen did. Estimating the posterior mu and posterior sigma are useful steps that I’ve worked through before in this blog (and I highly recommend Bolstad’s book) but simulating the 100 samples, ten times, to attempt to match what Allen got with his grid approximation was new to me. I’ve never done it that way, so if anyone recognizes any errors or issues please email me. Finally, be sure to check out all of the great resources Allen Downey has been putting out.

If you’d like the code, it is available on my GitHub.

Frequentist & Bayesian Approaches to Regression Models by Hand – Compare and Contrast

One of the ways I try and learn things is to code them from first principles. It helps me see what is going on under the hood and also allows me wrap my head around how things work. Building regression models in R is incredibly easy using the lm() function and Bayesian regression models can be conveniently built with the same syntax in packages like {rstanarm} and {brms}. However, today I’m going to do everything by hand to get a grasp of how the Bayesian regression model works, at a very basic level. I put together the code using the mathematical presentation of these concepts from William Bolstad’s book, Introduction to Bayesian Statistics.

The entire script is available on my GITHUB page.

Loading Packages & Getting Data

I’m going to use the data from the {palmerpenguins} package and concentrate on a simple linear regression which will estimate flipper length from bill length.

### packages ------------------------------------------------------
library(tidyverse)
library(palmerpenguins)
library(patchwork)

theme_set(theme_classic())

### data ----------------------------------------------------------
data("penguins")
dat <- penguins %>%
  select(bill_length = bill_length_mm, flipper_length = flipper_length_mm) %>%
  na.omit()

head(dat)

EDA

These two variables share a relatively large correlation with each other.

Ordinary Least Squares Regression

First, I’ll start by building a regression model under the Frequentist paradigm, specifying no prior values on the parameters (even though in reality the OLS model is using a flat prior, where all values are equally plausible — a discussion for a different time), using the lm() function.

### Ordinary Least Squares Regression ------------------------------
fit_ols <- lm(flipper_length ~ I(bill_length - mean(dat$bill_length)), data = dat)
summary(fit_ols)

Technical Note: Since we don’t observe a bill length of 0 mm in penguins, I chose to transform the bill length data by grand mean centering it — subtracting each bill length from the population mean. This wont change the predictions from the model but does change our interpretation of the coefficients themselves (and the intercept is now directly interpretable, which it wouldn’t be if we left the bill length data in its untrasnformed state).

OLS Regression by Hand

Okay, that was easy. In a single line we constructed a model and we are up and running. But, let’s calculate this by hand so we can see where the coefficients and their corresponding standard errors came from.

# Calculate the least squares regression line by hand now
dat_stats <- dat %>%
  summarize(x_bar = mean(bill_length),
            x_sd = sd(bill_length),
            y_bar = mean(flipper_length),
            y_sd = sd(flipper_length),
            r = cor(bill_length, flipper_length),
            x2 = x_bar^2,
            y2 = y_bar^2,
            xy_bar = x_bar * y_bar,
            .groups = "drop")

dat_stats

In the above code, I’m calculating summary statistics (mean and SD) for both the independent and dependent variables, extracting their correlation coefficient, and getting their squared means and the product of their squared means to be used in downstream calculations of the regression coefficients. Here is what the output looks like:

The model intercept is simply the mean of our outcome variable (flipper length).

intercept <- dat_stats$y_bar

The coefficient for our independent variable is calculated as the correlation coefficient multiplied by the the SD of the y variable divided by the SD of the x variable.

beta <- with(dat_stats,
             r * (y_sd / x_sd))

beta

Finally, I’ll store the grand mean of bill length so that we can use it when we need to center bill length and make predictions.

x_bar <- dat_stats$x_bar
x_bar

Now that we have the intercept and beta coefficient for bill length calculated by hand we can construct the regression equation:

Notice that these are the exact values we obtained from our model using the lm() function.

We can use the two models to make predictions and show that they are the exact same.

## Make predictions with the two models
dat %>%
  mutate(pred_model = predict(fit_ols),
         pred_hand = intercept + beta * (bill_length - x_bar))

In the model estimated from the lm() function we also get a sigma parameter (residual squared error, RSE), which tells us, on average, how off the model predictions are. We can build this by hand by first calculating the squared error of the observed values and our predictions and then calculating the RSE as the square root of the sum of squared residuals divided by the model degrees of freedom.

# Calculate the estimated variance around the line
N_obs <- nrow(dat) dat %>%
  mutate(pred = intercept + beta * (bill_length - x_bar),
         resid = (flipper_length - pred),
         resid2 = resid^2) %>%
  summarize(n_model_params = 2,
            deg_freedom = N_obs - n_model_params,
            model_var = sum(resid2) / deg_freedom,
            model_sd = sqrt(model_var))

Again, our by hand calculations equal exactly as what we obtained from the lm() function, 10.6.

Bayesian Linear Regression by Hand

Now let’s turn our attention to building a Bayesian regression model.

First we need to calculate the sum of squared error for X, which we do with this equation:

ss_x = N * (mean(mu_x^2) – mu_x^2)

N <- nrow(dat)
mean_x2 <- mean(dat$bill_length^2)
mu_x2 <- mean(dat$bill_length)^2

N
mean_x2
mu_x2

ss_x <- N * (mean_x2 - mu_x2)
ss_x

Next, we need to specify some priors on the model parameters. The priors can come from a number of courses. We could set them subjectively (usually people hate this idea). Or, we could look in the scientific literature and use prior research as a guide for plausible values of the relationship between flipper length and bill length. In this case, I’m just going to specify some priors that are within the range of the data but have enough variance around them to “let the data speak”. (NOTE: you could rerun the entire analysis with different priors and smaller or larger variances to see how the model changes. I hope to do a longer blog post about priors in the future).

For the slope coefficient we decide to have a normal prior, N(1, 2^2)
For the intercept coefficient we choose a normal prior, N(180, 10^2)
We don’t know the true variance so we use the estimated variance from the least squares regression line

prior_model_var <- sigma(fit_ols)^2

prior_slope_mu <- 1
prior_slope_var <- 1^2

prior_intercept_mu <- 180
prior_intercept_var <- 10^2

Next, we calculate the posterior precision (1/variance) for the slope coefficient. We do it with this equation:

1/prior_slope_var + (ss_x / prior_model_var)

And then convert it to a standard deviation (which is more useful to us and easier to interpret than precision, since it is on the scale of the data).

# 1/prior_slope_var + (ss_x / prior_model_var)
posterior_slope_precision <- 1 / prior_slope_var + ss_x / prior_model_var

# Convert to SD
posterior_slope_sd <- posterior_slope_precision^-(1/2)

Once we have the precision for the posterior slope calculated we can calculate the posterior regression coefficient for the slope using this equation:

(1/prior_slope_var) / posterior_slope_var * prior_slope_mu + (ss_x / prior_model_var) / posterior_slope_var * beta

# posterior slope
posterior_slope_mu <- (1/prior_slope_var) / posterior_slope_precision * prior_slope_mu + (ss_x / prior_model_var) / posterior_slope_precision * beta
posterior_slope_mu

We can plot the prior and posterior slope values, by first simulating the posterior slope using its mu and SD values.

## Plot prior and posterior for the slope
set.seed(4)
prior_sim <- rnorm(n = 1e4, mean = prior_slope_mu, sd = sqrt(prior_slope_var))
posterior_sim <- rnorm(n = 1e4, mean = posterior_slope_mu, sd = posterior_slope_sd)

plot(density(posterior_sim),
     col = "blue",
     lwd = 4,
     xlim = c(-2, 3),
     main = "Prior & Posterior\nfor\nBayesian Regression Slope Coefficient")
lines(density(prior_sim),
      col = "red",
      lty = 2,
      lwd = 4)
legend("topleft",
       legend = c("Prior", "Posterior"),
       col = c("red", "blue"),
       lty = c(2, 1),
       lwd = c(2,2))

We notice that the prior for the slope, N(1, 1^2), had a rather broad range of plausible values. However, after observing some data and combining that observed data with our prior, we find the posterior to have settled into a reasonable range given the value, Again, if you had selected other priors or perhaps a prior with a much narrower variance, these results would be different and potentially more influenced by your prior.

Now that we have a slope parameter we need to calculate the intercept. The equations are commented out in the code below.

## Posterior precision for the intercept
# 1/prior_intercept_var + N/prior_model_var

posterior_intercept_precision <- 1/prior_intercept_var + N/prior_model_var
posterior_intercept_precision

# Convert to SD
posterior_intercept_sd <- posterior_intercept_precision^-(1/2)
posterior_intercept_sd

## Posterior intercept mean
# (1/prior_intercept_var) / posterior_intercept_precision * prior_intercept_mu + (N/prior_model_var) / posterior_intercept_precision * intercept
posterior_intercept_mu <- (1/prior_intercept_var) / posterior_intercept_precision * prior_intercept_mu + (N/prior_model_var) / posterior_intercept_precision * intercept
posterior_intercept_mu

Comparing our Bayesian regression coefficients and standard errors to the ones we obtained from the lm() function, we find that they are nearly identical.

Using the coefficients and their corresponding standard errors we can also compare the 95% Credible Interval from the Bayesian model to the 95% Confidence Interval from the OLS model (again, nearly identical).

And writing out the Bayesian model we see that it is the same as the Frequentist model.

Using the posterior mean and SDs for the slope and intercept we can simulate a distribution and plot and summarize them using quantile intervals and credible intervals, to get a picture of our uncertainty.

## Use simulation for the slope and intercept
set.seed(413)
posterior_intercept_sim <- rnorm(n = 1e4, mean = posterior_intercept_mu, sd = posterior_slope_sd)
posterior_slope_sim <- rnorm(n = 1e4, mean = posterior_slope_mu, sd = posterior_slope_sd)

par(mfrow = c(1, 2))
hist(posterior_intercept_sim)
hist(posterior_slope_sim)

Making Predictions

Now that we have our models specified it’s time to make some predictions!

## Add all predictions to the original data set
dat_final <- dat %>%
  mutate(pred_ols = predict(fit_ols),
         pred_ols_by_hand = intercept + beta * (bill_length - x_bar),
         pred_bayes_by_hand = posterior_intercept_mu + posterior_slope_mu * (bill_length - x_bar))

head(dat_final, 10)

Because the equations were pretty much identical we can see that all three models (OLS with lm(), OLS by hand, and Bayes by hand) produce the same predicted values for flipper length.

We can also visualize these predictions along with the prediction intervals.

## Predictions with uncertainty from the OLS model
ols_preds <- dat_final %>%
  bind_cols(predict(fit_ols, newdata = ., interval = "prediction")) %>%
  ggplot(aes(x = fit, y = flipper_length)) +
  geom_point(size = 2) +
  geom_ribbon(aes(ymin = lwr, ymax = upr),
              alpha = 0.4) +
  geom_smooth(method = "lm",
              se = FALSE) +
  ggtitle("OLS Predictions with 95% Prediction Intervals")

## Now predict with the Bayesian Model
# The error is calculated as:
# residual_variance = prior_model_var +  posterior_intercept_sd^2 + posterior_slope^2*(x - x_bar)
bayes_preds <- dat %>%
  mutate(fit = posterior_intercept_mu + posterior_slope_mu * (bill_length-x_bar),
         error_var = prior_model_var + posterior_intercept_sd^2 + posterior_slope_sd^2 * (bill_length - x_bar),
         rse = sqrt(error_var),
         lwr = fit - qt(p = 0.975, df = N - 2) * rse,
         upr = fit + qt(p = 0.975, df = N - 2) * rse) %>%
  ggplot(aes(x = fit, y = flipper_length)) +
  geom_point(size = 2) +
  geom_ribbon(aes(ymin = lwr, ymax = upr),
              alpha = 0.4) +
  geom_smooth(method = "lm",
              se = FALSE) +
  ggtitle("Bayesian Predictions with 95% Prediction Intervals")


ols_preds | bayes_preds

Okay, I know what you are thinking. The equations were the same, the predictions are the same, and the prediction intervals are the same….What’s the big deal?!

The big deal is that the Bayesian model gives us more flexibility. Not only can we specify priors, allowing us to have a higher weighting on more plausible values (based on prior data, prior research, domain expertise, etc.); but, we can also produce an entire posterior distribution for the predictions.

Let’s take a single row of observation from the data and make a prediction on it, complete with 95% credible intervals, using our Bayesian model.

single_obs <- dat %>% slice(36)
single_obs

pred_bayes <- single_obs %>%
  mutate(fit = posterior_intercept_mu + posterior_slope_mu * (bill_length-x_bar),
         error_var = prior_model_var + posterior_intercept_sd^2 + posterior_slope_sd^2 * (bill_length - x_bar),
         rse = sqrt(error_var),
         lwr = fit - qt(p = 0.975, df = N - 2) * rse,
         upr = fit + qt(p = 0.975, df = N - 2) * rse)

Next, we simulate 10,000 observations using the mean prediction (fit) and the rse from the table above and summarize the results

set.seed(582)
pred_bayes_sim <- rnorm(n = 1e4, mean = pred_bayes$fit, sd = pred_bayes$rse)

mean(pred_bayes_sim)
quantile(pred_bayes_sim, probs = c(0.5, 0.025, 0.975))
mean(pred_bayes_sim) + qt(p = c(0.025, 0.975), df = N - 2) * sd(pred_bayes_sim)

Our mean value, the 95% quantile intervals, and then 95% credible intervals are the similar to the table above (they should be since we simulated with those parameters). We will also get similar values if we use our OLS model and make a prediction on this observation with prediction intervals.

But, because we simulated a full distribution, from the Bayesian prediction we also get 10,000 plausible values for the estimate of flipper length based on the observed bill length.

This offers us flexibility to make more direct statements about the predictive distribution. For example, we might ask, “What is the probability that the predicted flipper length is greater than 210 mm?” and we could directly compute this from the simulated data! This provides us with a number of options if we were using this model to make decisions on and perhaps have threshold values or ranges of critical importance where we need to know how much of the density of our predicted distribution is below or above the threshold or falls outside of the region of importance.

Wrapping Up

That’s a little bit on working with regression models by hand, from first principles. The Bayesian approach affords us a lot of control and flexibility over the model, via priors (more on that in a future blog post) and the predictions (via simulation). The Bayesian regression model calculated here was a simple approach and did not require Markov Chain Monte Carlo (MCMC), which would be run when using {rstanarm} and {brms}. If you are curious how this works, I wrote a previous blog on using a GIBBS Sampler to estimate the posterior distributions for a regression model, CLICK HERE. But, even with this simple approach we are afforded the ability to specify priors on the model parameters of the deterministic/fixed component of the model. We did not specify a prior on the stochastic/error part of the model (we used the sigma value calculated from the OLS model — which we also calculated by hand). This was necessary in this simple conjugate approach because the normal distribution is a two parameter distribution so we needed to treat one of the parameters as fixed (in this case the SD). Had we used an MCMC approach, we could have further specified a prior distribution on the model error.

If you notice any errors, feel free to drop me an email.

The code is available on my GITHUB page.

Bayes Theorem: p(A|B) = p(A) * p(B|A) / p(B) — But why?

A student recently asked me if I could show them why Bayes Theorem looks the way it does. If we are trying to determine the probability A given B, how did we arrive at the theorem in the title?

Let’s create some data and see if we can sort it out. We will simulate a 2×2 table, similar to one we might find in sports medicine journals when looking at some type of test (positive or negative) and some type of outcome (disease or no disease).

dat_tbl <- data.frame(
  test = c("positive", "negative", "total"),
  disease = c(25, 5, 30),
  no_disease = c(3, 40, 43),
  total = c(28, 45, 73)
)

dat_tbl

A logical question we’d like to answer here is, “What is the probability of disease given a positive test.” Written in probability notation we are asking, p(Disease | Positive).

Using the data in the table, we can quickly compute this as 25 people who were positive and had the disease divided by 28 total positive tests. 25/28 = 89.3%

Of course, we could also compute this using Bayes Theorem:

p(Disease | Positive) = p(Disease) * p(Positive | Disease) / p(Positive)

We store the necessary values in R objects and then compute the result

### What we want to know: p(Disease | Positive)
# p(Disease | Positive) = p(Disease) * p(Positive | Disease) / p(Positive)
# p(A | B) = p(A) * p(B|A) / p(B)

p_disease <- 30/73
p_positive_given_disease <- 25/30
p_positive <- 28/73

p_no_disease <- 43/73
p_positive_given_no_disease <- 3/43

p_disease_given_positive <- (p_disease * p_positive_given_disease) / (p_disease * p_positive_given_disease + p_no_disease * p_positive_given_no_disease) 
p_disease_given_positive

This checks out. We get the exact same result as when we did 25/28.

Okay, how did we get here? Why does it work out like that?

The math works out because we start with two different joint probabilities, p(A n B) and p(B n A). Or, in our case, p(Disease n Positive) and p(Positive n Disease). Formally, we can write these as follows:

We’ve already stored the necessary probabilities in specific elements, above. But here it what they both look like using our 2×2. First I’ll calculate it with the counts directly from the table and then calculate it with the R elements that we stored. You’ll see the answers are the same.

## Joint Probability 1: p(Positive n Disease) = p(Positive | Disease) * p(Disease)

25/30 * 30/73

p_positive_given_disease * p_disease

## Joint Probability 2: p(Disease n Positive) = p(Disease | Positive) * p(Positive)

25/28 * 28/73

p_disease_given_positive * p_positive

Well, would you look at that! The two joint probabilities are equal to each other. We can formally test that they are equal to each other by setting up a logical equation in R.

## These two joint probabilities are equal!
#  p(Positive n Disease) = p(Disease n Positive)

(p_positive_given_disease * p_disease) == (p_disease_given_positive * p_positive)

So, if they are equal, what we are saying is this:

p(Disease | Positive) * p(Positive) = p(Positive | Disease) * p(Disease)

Now, with some algebra, we can divide the right side of the equation by p(Positive) and we are left with Bayes Theorem for our problem:

p(Disease | Positive) = p(Disease) * p(Positive | Disease) / p(Positive)

Putting it altogether, it looks like this:

So, all we’ve done is taken two joint probabilities and used some algebra to arrange the terms in order to get us to the conditional probability we were interested in, p(Disease | Positive) and we end up with Bayes Theorem.

How Bayesian Search Theory Can Help You Find More Deer

Introduction

I was talking with a friend the other day who was telling me about his brother, who leads guided deer hunts in Wyoming. Typically, clients will come out for a hunt over several days and rely on him to guide them to areas of the forest where there is a high probability of seeing deer. Of course, nothing is guaranteed! It’s entirely possible to go the length of the trip and not see any deer at all. So, my friend was explaining that his brother is very good at constantly keeping an eye on the environment and his surroundings and creating a mental map in his head about the areas that will maximize his clients chances of finding deer. (There is probably an actual name for this skill, but I’m not sure what it is).

This is an interesting problem and reminds me of Bayesian Search Theory, which is used to help locate objects based on prior knowledge/information and the probability of seeing the object within a specific area given it would actually be there. This approach was most recently popularized for its use in the search for the wreckage of Malaysian Airlines Flight 370.

Let’s walk through an example of how Bayesian Search Theory works.

Setting up the search grid

Let’s say our deer guide has a map that he has broken up into a 4×4 grid of squares. He places prior probabilities of seeing a deer in each region given what he knows about the terrain (e.g., areas with water and deer food will have a higher probability of deer activity).

His priors grid looks like this:

library(tidyverse)

theme_set(theme_light())

# priors for each square region looks like this
matrix(data = c(0.01, 0.02, 0.01, 0.1, 0.1, 0.03, 0.03, 0.03, 
                0.2, 0.2, 0.17, 0.1, 0.01, 0.01, 0.02, 0.01), 
       byrow = TRUE, 
       nrow = 4, ncol = 4) %>%
  as.data.frame() %>%
  setNames(c("1", "2", "3", "4")) %>%
  pivot_longer(cols = everything(),
               names_to = 'x_coord') %>%
  mutate(y_coord = rep(1:4, each = 4)) %>%
  relocate(y_coord, .before = x_coord) %>%
  ggplot(aes(x = x_coord, y = y_coord)) +
  geom_text(aes(label = value, color = value),
            size = 10) +
  scale_color_gradient(low = "red", high = "green") +
  labs(x = "X Coorindates",
         y = "Y Coordinates",
         title = "Prior Probabilities of 4x4 Square Region",
         color = "Probability") +
  theme(axis.text = element_text(size = 11, face = "bold"))

We see that the y-coordiate range of 3 has a high probability of deer activity. In particular, xy = (1, 3) and xy = (2, 3) seem to maximize the chances of seeing a deer.

The likelihood grid

The likelihood in this case describe the probability of seeing a deer in a specific square region given that a deer is actually there. p(Square Region | Deer)

To determine these likelihoods, our deer guide is constantly scouting the areas and making mental notes about what he sees. He documents certain things within each square region that would indicate deer are there. For example, deer droppings, foot prints, actually seeing some deer, and previous successful hunts in a certain region. Using this information he creates a grid of the following likelihoods:

matrix(data = c(0.88, 0.82, 0.88, 0.85, 0.77, 0.65, 0.83, 0.95, 
                0.98, 0.97, 0.93, 0.94, 0.93, 0.79, 0.68, 0.80), 
       byrow = TRUE, 
       nrow = 4, ncol = 4) %>%
  as.data.frame() %>%
  setNames(c("1", "2", "3", "4")) %>%
  pivot_longer(cols = everything(),
               names_to = 'x_coord') %>%
  mutate(y_coord = rep(1:4, each = 4)) %>%
  relocate(y_coord, .before = x_coord) %>%
  ggplot(aes(x = x_coord, y = y_coord)) +
  geom_text(aes(label = value, color = value),
            size = 10) +
  scale_color_gradient(low = "red", high = "green") +
  labs(x = "X Coorindates",
         y = "Y Coordinates",
         title = "Likelihoods for each 4x4 Square Region",
         subtitle = "p(Region | Seeing Deer)",
         color = "Probability") +
  theme(axis.text = element_text(size = 11, face = "bold"))

Combining our prior knowledge and likelihoods

To make things easier, I’ll put both the priors and likelihoods into a single data frame.

dat <- data.frame(
  coords = c("1,1", "2,1", "3,1", "4,1", "1,2", "2,2", "3,2", "4,2", "1,3", "2,3", "3,3", "4,3", "1,4", "2,4", "3,4", "4,4"),
  priors = c(0.01, 0.02, 0.01, 0.1, 0.1, 0.03, 0.03, 0.03, 
                0.2, 0.2, 0.17, 0.1, 0.01, 0.01, 0.02, 0.01),
  likelihoods = c(0.88, 0.82, 0.88, 0.85, 0.77, 0.65, 0.83, 0.95, 
                0.98, 0.97, 0.93, 0.94, 0.93, 0.79, 0.68, 0.80))

dat

Now he multiplies the prior and likelihood together to summarize his belief of deer in each square region.

dat <- dat %>%
  mutate(posterior = round(priors * likelihoods, 2))

dat %>%
  ggplot(aes(x = coords, y = posterior)) +
  geom_col(color = "black",
           fill = "light grey") +
  scale_y_continuous(labels = scales::percent_format()) +
  theme(axis.text = element_text(size = 11, face = "bold"))

As expected, squares (1,3), (2,3), and (3,3) have the highest probability of observing a deer. Those would be the first areas that our deer hunt guide would want to explore when taking new clients out.

Updating Beliefs After Searching a Region

Since square (1,3) has the highest probability our deer hunt guide decides to take his new clients out there to search for deer. After one day of searching they don’t find any deer. To ensure success tomorrow, he needs to update his knowledge not only about square (1, 3) but about all of the other squares in his 4×4 map.

To update square (1, 3) we use the following equation:

update.posterior = prior * (1 – likelihood) / (1 – prior*likelihood)

coord1_3 <- dat %>%
  filter(coords == "1,3")

coord1_3$priors * (1 - coord1_3$likelihoods) / (1 - coord1_3$priors * coord1_3$likelihoods)

Thew new probability for region (1, 3) is 0.5%. Once we update that square region we can update the other regions using the following equation:

update.prior = prior * (1 / (1 – prior * likelihood))

We can do this for the entire data set all at once:

dat <- dat %>%
  mutate(posterior2 = round(priors * (1 / (1 - priors*likelihoods)), 2),
         posterior2 = ifelse(coords == "1,3", 0.005, posterior2)) %>%
  mutate(updated_posterior = round(posterior2 * likelihoods, 3))

dat %>%
  select(coords, posterior, updated_posterior) %>%
  pivot_longer(cols = -coords) %>%
  ggplot(aes(x = coords, y = value, fill = name)) +
  geom_col(position = "dodge") +
  scale_y_continuous(labels = scales::percent_format()) +
  theme(axis.text = element_text(size = 11, face = "bold"))


matrix(dat$updated_posterior, ncol = 4, nrow = 4, byrow = TRUE) %>%
  as.data.frame() %>%
  setNames(c("1", "2", "3", "4")) %>%
  pivot_longer(cols = everything(),
               names_to = 'x_coord') %>%
  mutate(y_coord = rep(1:4, each = 4)) %>%
  relocate(y_coord, .before = x_coord) %>%
  ggplot(aes(x = x_coord, y = y_coord)) +
  geom_text(aes(label = value, color = value),
            size = 10) +
  scale_color_gradient(low = "red", high = "green") +
  labs(x = "X Coorindates",
         y = "Y Coordinates",
         title = "Updated Posterior for each 4x4 Square Region after searching (1,3) and\nnot seeing deer",
         color = "Probability") +
  theme(axis.text = element_text(size = 11, face = "bold"))

We can see that after updating the posteriors following day 1, hist best approach is to search grid (2, 3) and (3,3) tomorrow, as the updated beliefs indicate that they have a higher probability of having a deer in them.

Conclusion

We can of course continue updating after searching each region until we finally find the deer but we will stop here and allow you to play with the code and continue on if you’d like. This tutorial is just to provide a brief look into how to use Bayesian Search Theory to locate objects in various spaces, so hopefully you can use this method and apply it to other aspects of your life.

The complete code is available on my GitHub page.