How Bayesian Search Theory Can Help You Find More Deer

Introduction

I was talking with a friend the other day who was telling me about his brother, who leads guided deer hunts in Wyoming. Typically, clients come out for a hunt over several days and rely on him to guide them to areas of the forest where there is a high probability of seeing deer. Of course, nothing is guaranteed! It’s entirely possible to go the length of the trip and not see any deer at all. So, my friend was explaining that his brother is very good at constantly reading the environment and his surroundings, building a mental map of the areas that will maximize his clients’ chances of finding deer. (There is probably an actual name for this skill, but I’m not sure what it is.)

This is an interesting problem and reminds me of Bayesian Search Theory, which is used to help locate objects by combining prior knowledge about where the object might be with the probability of detecting the object in a specific area given that it is actually there. This approach was most recently popularized by its use in the search for the wreckage of Malaysia Airlines Flight 370.

Let’s walk through an example of how Bayesian Search Theory works.

Setting up the search grid

Let’s say our deer guide has a map that he has broken up into a 4×4 grid of squares. He assigns a prior probability that a deer is in each region, given what he knows about the terrain (e.g., areas with water and deer food will have a higher prior probability of deer activity).

His priors grid looks like this:

library(tidyverse)

theme_set(theme_light())

# priors for each square region looks like this
matrix(data = c(0.01, 0.02, 0.01, 0.1, 0.1, 0.03, 0.03, 0.03, 
                0.2, 0.2, 0.17, 0.1, 0.01, 0.01, 0.02, 0.01), 
       byrow = TRUE, 
       nrow = 4, ncol = 4) %>%
  as.data.frame() %>%
  setNames(c("1", "2", "3", "4")) %>%
  pivot_longer(cols = everything(),
               names_to = 'x_coord') %>%
  mutate(y_coord = rep(1:4, each = 4)) %>%
  relocate(y_coord, .before = x_coord) %>%
  ggplot(aes(x = x_coord, y = y_coord)) +
  geom_text(aes(label = value, color = value),
            size = 10) +
  scale_color_gradient(low = "red", high = "green") +
  labs(x = "X Coorindates",
         y = "Y Coordinates",
         title = "Prior Probabilities of 4x4 Square Region",
         color = "Probability") +
  theme(axis.text = element_text(size = 11, face = "bold"))

We see that the row at y-coordinate 3 has a high probability of deer activity. In particular, squares (1, 3) and (2, 3) seem to maximize the chances of seeing a deer.

The likelihood grid

The likelihood in this case describes the probability of seeing a deer when searching a specific square region, given that a deer is actually there: p(Seeing Deer | Deer in Region).

To determine these likelihoods, our deer guide is constantly scouting the areas and making mental notes about what he sees. He documents signs within each square region that would indicate deer are there: deer droppings, footprints, actual deer sightings, and previous successful hunts in a certain region. Using this information, he creates the following grid of likelihoods:

matrix(data = c(0.88, 0.82, 0.88, 0.85, 0.77, 0.65, 0.83, 0.95, 
                0.98, 0.97, 0.93, 0.94, 0.93, 0.79, 0.68, 0.80), 
       byrow = TRUE, 
       nrow = 4, ncol = 4) %>%
  as.data.frame() %>%
  setNames(c("1", "2", "3", "4")) %>%
  pivot_longer(cols = everything(),
               names_to = 'x_coord') %>%
  mutate(y_coord = rep(1:4, each = 4)) %>%
  relocate(y_coord, .before = x_coord) %>%
  ggplot(aes(x = x_coord, y = y_coord)) +
  geom_text(aes(label = value, color = value),
            size = 10) +
  scale_color_gradient(low = "red", high = "green") +
  labs(x = "X Coorindates",
         y = "Y Coordinates",
         title = "Likelihoods for each 4x4 Square Region",
         subtitle = "p(Region | Seeing Deer)",
         color = "Probability") +
  theme(axis.text = element_text(size = 11, face = "bold"))

Combining our prior knowledge and likelihoods

To make things easier, I’ll put both the priors and likelihoods into a single data frame.

dat <- data.frame(
  coords = c("1,1", "2,1", "3,1", "4,1", "1,2", "2,2", "3,2", "4,2", "1,3", "2,3", "3,3", "4,3", "1,4", "2,4", "3,4", "4,4"),
  priors = c(0.01, 0.02, 0.01, 0.1, 0.1, 0.03, 0.03, 0.03, 
                0.2, 0.2, 0.17, 0.1, 0.01, 0.01, 0.02, 0.01),
  likelihoods = c(0.88, 0.82, 0.88, 0.85, 0.77, 0.65, 0.83, 0.95, 
                0.98, 0.97, 0.93, 0.94, 0.93, 0.79, 0.68, 0.80))

dat

Now he multiplies the prior and likelihood together for each square to get the probability of actually finding a deer by searching that square, summarizing his belief about each region.

dat <- dat %>%
  mutate(posterior = round(priors * likelihoods, 2))

dat %>%
  ggplot(aes(x = coords, y = posterior)) +
  geom_col(color = "black",
           fill = "light grey") +
  scale_y_continuous(labels = scales::percent_format()) +
  theme(axis.text = element_text(size = 11, face = "bold"))

As expected, squares (1,3), (2,3), and (3,3) have the highest probability of observing a deer. Those would be the first areas that our deer hunt guide would want to explore when taking new clients out.

Updating Beliefs After Searching a Region

Since square (1, 3) has the highest probability, our deer hunt guide decides to take his new clients out there to search for deer. After one day of searching, they don’t find any deer. To improve his chances tomorrow, he needs to update his knowledge not only about square (1, 3) but about all of the other squares in his 4×4 map.

To update square (1, 3) we apply Bayes’ rule, conditioning on the fact that a full day of searching turned up nothing:

updated posterior = prior * (1 - likelihood) / (1 - prior * likelihood)

The numerator is the probability that a deer is in the square but the search missed it, and the denominator is the overall probability that the search came up empty.

coord1_3 <- dat %>%
  filter(coords == "1,3")

coord1_3$priors * (1 - coord1_3$likelihoods) / (1 - coord1_3$priors * coord1_3$likelihoods)

The new probability for region (1, 3) is about 0.5%. Once we update that square, we update every other region by renormalizing: each remaining prior is divided by the probability that the day-1 search of square (1, 3) came up empty.

updated prior = prior / (1 - prior(1,3) * likelihood(1,3))

We can do this for the entire data set all at once:

# probability that the day-1 search of square (1,3) came up empty
p_empty <- 1 - coord1_3$priors * coord1_3$likelihoods

dat <- dat %>%
  mutate(posterior2 = round(priors / p_empty, 3),
         posterior2 = ifelse(coords == "1,3", round(coord1_3$priors * (1 - coord1_3$likelihoods) / p_empty, 3), posterior2)) %>%
  mutate(updated_posterior = round(posterior2 * likelihoods, 3))

dat %>%
  select(coords, posterior, updated_posterior) %>%
  pivot_longer(cols = -coords) %>%
  ggplot(aes(x = coords, y = value, fill = name)) +
  geom_col(position = "dodge") +
  scale_y_continuous(labels = scales::percent_format()) +
  theme(axis.text = element_text(size = 11, face = "bold"))


matrix(dat$updated_posterior, ncol = 4, nrow = 4, byrow = TRUE) %>%
  as.data.frame() %>%
  setNames(c("1", "2", "3", "4")) %>%
  pivot_longer(cols = everything(),
               names_to = 'x_coord') %>%
  mutate(y_coord = rep(1:4, each = 4)) %>%
  relocate(y_coord, .before = x_coord) %>%
  ggplot(aes(x = x_coord, y = y_coord)) +
  geom_text(aes(label = value, color = value),
            size = 10) +
  scale_color_gradient(low = "red", high = "green") +
  labs(x = "X Coorindates",
         y = "Y Coordinates",
         title = "Updated Posterior for each 4x4 Square Region after searching (1,3) and\nnot seeing deer",
         color = "Probability") +
  theme(axis.text = element_text(size = 11, face = "bold"))

We can see that after updating the posteriors following day 1, his best approach is to search squares (2, 3) and (3, 3) tomorrow, as the updated beliefs indicate they now have the highest probability of holding a deer.

Conclusion

We could, of course, continue updating after searching each region until we finally find the deer, but we will stop here and let you play with the code and continue on if you’d like. This tutorial is just a brief look at how Bayesian Search Theory can be used to locate objects in various spaces, so hopefully you can apply this method to other aspects of your life.
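
If you do want to keep iterating, here is a rough sketch of how that might look. This is my own illustration rather than part of the walkthrough above: the helper function search_day() is a hypothetical name, and it assumes each day's search of the most promising square comes up empty before renormalizing the map.

# sketch: search the square with the highest chance of finding a deer,
# assume the search comes up empty, and apply the same update to every square
search_day <- function(dat) {
  searched <- dat$coords[which.max(dat$priors * dat$likelihoods)]
  p_s <- dat$priors[dat$coords == searched]
  q_s <- dat$likelihoods[dat$coords == searched]
  p_empty <- 1 - p_s * q_s

  dat %>%
    mutate(priors = ifelse(coords == searched,
                           priors * (1 - likelihoods) / p_empty,
                           priors / p_empty)) %>%
    arrange(desc(priors * likelihoods))
}

# e.g., two more unsuccessful days after the day-1 search of square (1,3)
dat %>%
  mutate(priors = posterior2) %>%   # start from the day-1 updated map
  search_day() %>%
  search_day() %>%
  head()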

The complete code is available on my GitHub page.

TidyX Episode 161: ShinyLive – A serverless shiny option

A shiny app has always required a server to run. If you wanted to connect data to your users and allow them to interact with it, you needed server access. Well, not anymore! This week, Ellis Hughes and I explore ShinyLive as an option for creating a serverless shiny app. It’s early days, but the future is bright. Watch us walk through learning how to set up a serverless shiny app and figure out the ins and outs of the code.
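
If you want to experiment before watching, the basic workflow looks roughly like this. This is a sketch on my part, assuming recent versions of the shinylive and httpuv R packages and an app folder named "myapp/"; the episode covers the details.

# install.packages(c("shinylive", "httpuv"))

# export an existing shiny app (assumed to live in a "myapp/" folder) to a static site
shinylive::export(appdir = "myapp", destdir = "site")

# preview the exported app locally from the static files -- no shiny server required
httpuv::runStaticServer("site")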

To watch our screen cast, CLICK HERE.

To access our code, CLICK HERE.

TidyX Episode 160: Prepopulate a shiny app url with user defined parameters

This week, Ellis Hughes and I discuss methods for prepopulating a shiny app’s user-defined parameters from a URL. This type of trick comes in handy if you have a separate application that people are using and you want to provide them a link out to your shiny app. However, when opening that link you don’t want it to open the shiny app from the typical starting point. Rather, you want them to be taken directly to the data they require based on the information they were reviewing in the other application.
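
The general idea, in a simplified sketch of my own (not the exact app from the episode), is to parse the query string when the session starts and use it to pre-set the inputs:

library(shiny)

ui <- fluidPage(
  selectInput("player", "Player", choices = c("A", "B", "C")),
  tableOutput("tbl")
)

server <- function(input, output, session) {
  # read query parameters, e.g. myapp.com/?player=B, and pre-set the input
  observe({
    query <- parseQueryString(session$clientData$url_search)
    if (!is.null(query$player)) {
      updateSelectInput(session, "player", selected = query$player)
    }
  })

  output$tbl <- renderTable(data.frame(player = input$player))
}

shinyApp(ui, server)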

To watch our screen cast, CLICK HERE.

To access our code, CLICK HERE.

Simulations in R Part 7: Measurement Error in Regression

We’ve been working with building simulations in the past 6 articles of this series. In the last two installments we talked specifically about using simulation to explore different linear regression model assumptions. Today, we continue with linear regression and we use simulation to understand how measurement error (something we all face) influences our linear regression parameter estimates.

Here are the past 6 installments in this series:

  • Part 1 discussed the basic functions for simulating and sampling data in R
  • Part 2 walked through how to perform bootstrap resampling and then simulate bivariate and multivariate distributions
  • Part 3 worked through making group comparisons by simulating thousands of t-tests
  • Part 4 built simulations for linear regression
  • Part 5 used simulation to investigate the homoskedasticity assumption in regression
  • Part 6 used simulation to investigate the multicollinearity assumption in regression

As always, complete code for this article is available on my GitHub page.

Measurement Error

Measurement error occurs when the values we observe during data collection differ from the true values. This can sometimes be the result of using imperfect proxy measures, where certain measurements (perhaps tests that are easier to obtain in our population or setting) stand in for the thing we actually care about. Or, it can happen because tests are imperfect and all data collection has limitations and error (which we try as hard as possible to minimize).

Measurement error can be systematic or random. It can be challenging to detect in a single sample. We will build a simulation to show how measurement error can bias our regression coefficients and perhaps hide the true relationship.
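
As a quick, made-up illustration of that distinction (the numbers below are invented purely for demonstration): random error scatters observations around the true value, while systematic error shifts them all in the same direction.

set.seed(42)

# hypothetical true values (e.g., jump height in cm)
true_value <- rnorm(n = 10, mean = 40, sd = 5)

random_error     <- true_value + rnorm(n = 10, mean = 0, sd = 2)  # scatter centered on the truth
systematic_error <- true_value + 3                                # device always reads 3 cm high

data.frame(true_value, random_error, systematic_error)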

Constructing the simulation

For this simulation, we are going to add random draws from a normal distribution to the independent variable to represent measurement error (noise) in the model. We will use a nested for() loop, as we did in Part 6 of this series, where the outer loop stores the outcomes of the regression model under each level of measurement error, which is produced in the inner loop.

We begin by creating 11 levels of measurement error, ranging from 0 (no measurement error) to 1 (extreme measurement error). These values will serve as the standard deviation when we randomly draw from a normal distribution with a mean of 0. In this way, we create noise in the model.

## levels of measurement error to test
# NOTE: these values will be used as the standard deviation in our random draws
meas_error <- seq(from = 0, to = 1, by = 0.1)

# create the number of simulations and reps you want to run
n <- 1000

# true variables
intercept <- 2
beta1 <- 5
independent_var <- runif(n = n, min = -1, max = 1)

## create an array to store the simulated model coefficients for each level of measurement error
final_df <- array(data = NA, dim = c(n, 2, length(meas_error)))

## create a data frame to store the absolute bias at each level of measurement error
error_bias_df <- matrix(nrow = n, ncol = length(meas_error))

Next, we build our nested for() loop and simulate models under the different measurement error conditions.

## loop
for(j in 1:length(meas_error)){
  
  # a store vector for the absolute bias from each inner loop
  abs_bias_vect <- rep(0, times = n)
  
  ## storage data frame for the beta coefficient results in each inner loop simulated regression
  df_betas <- matrix(NA, nrow = n, ncol = 2)
  
  # add measurement error (sd = meas_error[j]) to the observed independent variable
  ind_var_with_error <- independent_var + rnorm(n = n, 0, meas_error[j])
  
  for(i in 1:n){
    
    y_hat <- intercept + beta1*independent_var + rnorm(n = n, 0, 1)
    fit <- lm(y_hat ~ ind_var_with_error)
    
    df_betas[i, 1] <- fit$coef[1]
    df_betas[i, 2] <- fit$coef[2]
    
    abs_bias_vect[i] <- abs(fit$coef[2] - beta1)
  }
  
  ## store final results of each inner loop
  # store the model betas in a list for each level of measurement error
  final_df[, , j] <- df_betas
  
  # store the absolute bias
  error_bias_df[, j] <- abs_bias_vect

}

Have a look at the first few rows of each results object.

 

## check the first few values of each new element
head(error_bias_df)

## final_df is a 3-D array with one slice per level of measurement error
# Here is the 11th slice (measurement error == 1.0)
head(final_df[, ,11])

Plotting the results

Now that we have stored the data for each level of measurement error, let’s do some plotting of the data to explore how measurement error influences our regression model.

Plot the distribution of the simulated beta coefficients for the model with no measurement error and for the model with the most extreme measurement error.

no_error <- final_df[, ,1] %>% as.data.frame() %>% mutate(measurement_error = "No Measurement Error")

extreme_error <- final_df[, ,11] %>% as.data.frame() %>% mutate(measurement_error = "Extreme Measurement Error")

no_error %>%
  bind_rows(
    extreme_error
  ) %>%
  ggplot(aes(x = V2, fill = measurement_error)) +
  geom_density(alpha = 0.8) +
  facet_wrap(~measurement_error, scales = "free") +
  theme(strip.text = element_text(size = 14, face = "bold"),
        legend.position = "top") +
  labs(x = "Simulated Model Beta Coefficient",
       fill = "Measurement Error",
       title = "Beta Coefficients Differences Due to Measurement Error")

Notice that with no measurement error the beta coefficient is centered around 5, just as we specified for the true value in our simulation. However, the model with a measurement error of 1 (shown in green) has its beta coefficient centered around 1.2, which is substantially lower than the true value of 5.
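
That attenuation is roughly what the classical errors-in-variables result predicts. This quick check is my own addition (not part of the original code): the expected slope is the true slope multiplied by var(x) / (var(x) + var(measurement error)).

# back-of-the-envelope attenuation check for classical measurement error
# expected slope = beta * var(x) / (var(x) + var(error))
var_x <- (1 - (-1))^2 / 12        # variance of Uniform(-1, 1) = 1/3
5 * var_x / (var_x + 1^2)         # = 1.25, close to the simulated ~1.2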

Plot the change in absolute bias across each level of measurement error

First let’s look at the average bias across each level of measurement error.

absolute_error <- error_bias_df %>%
  as.data.frame() %>%
  setNames(paste0('x', meas_error))

# On average, how much absolute bias would we expect
colMeans(absolute_error) %>%
  data.frame() %>%
  rownames_to_column() %>%
  mutate('rowname' = parse_number(rowname)) %>%
  setNames(c("Level of measurement error", 'Absolute bias')) %>%
  gt::gt()

Next, make a plot of the relationship.

absolute_error %>%
  pivot_longer(cols = everything()) %>%
  ggplot(aes(x = name, y = value, group = 1)) +
  geom_point(shape = 21,
             size = 4,
             color = "black",
             fill = "light grey") +
  stat_summary(fun = mean,
               geom = "line",
               color = "red",
               size = 1.2,
               linetype = "dashed") +
  labs(x = "Amount of Measurement Error",
       y = "Absolute Bias",
       title = "Absolute Bias of the True Parameter Value")

Notice that as the amount of measurement error increases, so too does the absolute bias of the model coefficient.

 

Wrapping Up

Measurement error is something all of us deal with in practice, whether you are conducting science in a lab or working in an applied setting. It is important to keep in mind how measurement error influences regression coefficients and the tricks it can play on our beliefs about the true parameter values. Expressing our uncertainty around model outputs is critical to communicating what we think we know about our observations and how (un)certain we may be. In my opinion, this is one of the values of Bayesian model building: sampling from the posterior distribution gives us a direct way to explore that uncertainty and make probabilistic statements.

The complete code for this article is available on my GitHub page.

Validity, Reliability, & Responsiveness — A few papers on measurement in sport science

I had the pleasure of speaking at the National Strength and Conditioning Association‘s (NSCA) National Conference this summer and while there I made it a point to attend the Sport Science & Performance Technology Special Interest Group meeting as well.

One thing that immediately stood out to me was the number of questions raised specific to what types of technologies to purchase (e.g. “Which brand of force plates should we buy?”, “Does anyone have a list comparing and contrasting different technologies so that we can determine what would be best for us?”, etc.).

While these are fine questions, I do feel they are a bit like putting the cart before the horse. Before thinking about what technology to purchase, we should spend a good bit of time gaining clarity on the question(s) we are attempting to answer. Once we have a firm understanding of the question we can then begin the process of evaluating whether a technology exists that can help us collect the necessary data to explore that question. In fact, this was the main crux of my lecture at the conference, as I spoke about using the PPDAC Framework in practice (I wrote an article about this framework a couple of years ago).

A force plate, a GPS unit, or an accelerometer won’t solve all of our problems. In fact, depending on our question, they might not solve any of our problems! Moreover, as sport scientists we need to concern ourselves not only with the research question but also with whether the desired technology is useful within our ecological setting. Just because something worked in a controlled lab environment or was valid in a different sport does not mean it will be useful for our sport, or in our setting, or with our athletes, or given our unique constraints.

So, I decided to share a few resources pertaining to measurement theory concepts such as validity, reliability, and responsiveness/sensitivity for those working in the sport science space who are interested in more critical approaches to evaluating the technology we use in practice.

Additionally, for those interested, several years ago I wrote a full R code blog post for the last paper above (Swinton et al), which can be accessed HERE.

Happy reading!