Author Archives: Patrick

Simulations in R Part 8 – Simulating Mixed Models

We’ve built up a number of simulations over the 7 articles in this series. The last few articles have been looking at linear regression and simulating different intricacies in the data that allow us to explore model assumptions. To end this series, we will now extend the linear model to a mixed model. We will start by building a linear regression model and go through the steps of simulation to build up the hierarchical structure of the data.

For those interesting here are the previous 7 articles in this series:

  • Part 1 discussed the basic functions for simulating and sampling data in R
  • Part 2 walked us through how to perform bootstrap resampling and then simulate bivariate and multivariate distributions
  • Part 3 we worked making group comparisons by simulating thousands of t-tests
  • Part 4 building simulations for linear regression
  • Part 5 using simulation to investigate the homoskedasticity assumption in regression
  • Part 6 using simulation to investigate the multicollinearity assumption in regression
  • Part 7 simulating measurement error in regression

As always, complete code for this article is available on my GitHub page.

Side Note: This is a very code heavy article as the goal is to show not only simulations of data and models but how to extract the model parameters from the list of simulations. As such, this a rather long article with minimal interpretation of the models.

First a linear model

Our data will look at training load for 10 players in two different positions, Forwards (F) and Mids (M). The Forwards will be simulated to have less training load than the Mids and we will add some random error around the difference in these group means. We will simulate this as a linaer model where b0 represents the model intercept and is the mean value for Forwards and b1 represents the coefficient for when the group is Mids.

library(tidyverse)
library(broom)
library(lme4)

theme_set(theme_light())
## Model Parameters
n_positions <- 2
n_players <- 10
b0 <- 80
b1 <- 20
sigma <- 10

## Simulate
set.seed(300)
pos <- rep(c("M", "F"), each = n_players)
player_id <- rep(1:10, times = 2)
model_error <- rnorm(n = n_positions*n_players, mean = 0, sd = sigma)
training_load <- b0 + b1 * (pos == "M") + model_error

d <- data.frame(pos, player_id, training_load)
d

Calculate some summary statistics

d %>%
  ggplot(aes(x = pos, y = training_load)) +
  geom_boxplot()

d %>%
  group_by(pos) %>%
  summarize(N = n(),
            avg = mean(training_load),
            sd = sd(training_load)) %>%
  mutate(se = sd / sqrt(N)) %>%
  knitr::kable()

Build a linear model

fit <- lm(training_load ~ pos, data = d)
summary(fit)

Now, we wrap all the steps in a function so that we can use replicate() to create thousands of simulations or to change parameters of our simulation. The end result of the function is going to be the linear regression model.

sim_func <- function(n_positions = 2, n_players = 10, b0 = 80, b1 = 20, sigma = 10){

  ## simulate data
  pos <- rep(c("M", "F"), each = n_players)
  player_id <- rep(1:10, times = 2)
  model_error <- rnorm(n = n_positions*n_players, mean = 0, sd = sigma)
  training_load <- b0 + b1 * (pos == "M") + model_error

  ## store in data frame
  d <- data.frame(pos, player_id, training_load)
  
  ## construct linear model
  fit <- lm(training_load ~ pos, data = d)
  summary(fit)
}

Try the function out with the default parameters

sim_func()

Use replicate() to create many simulations.

Doing this simulation once doesn’t help us. We want to be able to do this thousands of times. All of the articles in this series have used for() loops up until this point. But, if you recall the first article in the series where I laid out several helpful functions for coding simulations, I showed an example of the replicate() function, which will take a function and run it’s result for as many times as you specify. I found this function while I was working through the simulation chapter of Gelman et al’s book, Regression and Other Stories. I think in cases like a mixed model simulation, where you can have many layers and complexities to the data, writing a simple function and then replicating it thousands of times is much easier to debug and much cleaner for others to read than having a bunch of nested for() loops.

Technical Note: We specify the argument simplify = FALSE so that the results are returned in a list format. This makes more sense since the results are the regression summary results and not a data frame.

team_training <- replicate(n = 1000,
                  sim_func(),
                  simplify = FALSE)

Coefficient Results

team_training %>%
  map_df(tidy) %>%
  select(term, estimate) %>%
  ggplot(aes(x = estimate, fill = term)) +
  geom_density() +
  facet_wrap(~term, scales = "free_x")

team_training %>%
  map_df(tidy) %>%
  select(term, estimate) %>%
  group_by(term) %>%
  summarize(avg = mean(estimate),
            SD = sd(estimate))

Compare the simulated results to the results of the original model fit.

tidy(fit)

Model fit parameters

team_training %>%
  map_df(glance) %>%
  select(adj.r.squared, sigma) %>%
  summarize(across(.cols = everything(),
                   ~mean(.x)))


Compare these results to the original fit

fit %>% 
  glance()

Mixed Model 1

Now that we have the general frame work for building a simulation function and using replicate() we will to build a mixed model simulation.

Above we had a team with two positions groups and individual players nested within those position groups. In this mixed model, we will add a second team so that we can explore hierarchical data.

We will simulate data from 3 teams, each with 2 positions (Forward & Mid). This is a pretty simple mixed model. We will build a more complex one after we get a handle on the code below.

## Model Parameters
n_teams <- 3
n_positions <- 2
n_players <- 10

team1_fwd_avg <- 130
team1_fwd_sd <- 15
team1_mid_avg <- 100
team1_mid_sd <- 5

team2_fwd_avg <- 150
team2_fwd_sd <- 20
team2_mid_avg <- 90
team2_mid_sd <- 10

team3_fwd_avg <- 180
team3_fwd_sd <- 15
team3_mid_avg <- 150
team3_mid_sd <- 15


## Simulated data frame
team <- rep(c("Team1","Team2", "Team3"), each = n_players * n_positions)
pos <- c(rep(c("M", "F"), each = n_players), rep(c("M", "F"), each = n_players), rep(c("M", "F"), each = n_players))
player_id <- as.factor(round(seq(from = 100, to = 300, length = length(team)), 0))

d <- data.frame(team, pos, player_id) d %>%
  head()

# simulate training loads
set.seed(555)
training_load <- c(rnorm(n = n_players, mean = team1_mid_avg, sd = team1_mid_sd), rnorm(n = n_players, mean = team1_fwd_avg, sd = team1_fwd_sd), rnorm(n = n_players, mean = team2_mid_avg, sd = team2_mid_sd), rnorm(n = n_players, mean = team2_fwd_avg, sd = team2_fwd_sd), rnorm(n = n_players, mean = team3_mid_avg, sd = team3_mid_sd), rnorm(n = n_players, mean = team3_fwd_avg, sd = team3_fwd_sd)) d ,- d %>%
  bind_cols(training_load = training_load) 

d %>%
  head()


Calculate summary statistics

## Average training load by team
d %>%
  group_by(team) %>%
  summarize(avg = mean(training_load),
            SD = sd(training_load))

## Average training load by pos
d %>%
  group_by(pos) %>%
  summarize(avg = mean(training_load),
            SD = sd(training_load))

## Average training load by team & position
d %>%
  group_by(team, pos) %>%
  summarize(avg = mean(training_load),
            SD = sd(training_load)) %>%
  arrange(pos)

## Plot
d %>%
  ggplot(aes(x = training_load, fill = team)) +
  geom_density(alpha = 0.5) +
  facet_wrap(~pos)

Construct the mixed model and evaluate the outputs

## Mixed Model
fit_lmer <- lmer(training_load ~ pos + (1 |team), data = d)
summary(fit_lmer)
coef(fit_lmer)
fixef(fit_lmer)
ranef(fit_lmer)
sigma(fit_lmer)
hist(residuals(fit_lmer))

Write a mixed model function

sim_func_lmer <- function(n_teams = 3, 
                          n_positions = 2, 
                          n_players = 10, 
                          team1_fwd_avg = 130,
                          team1_fwd_sd = 15,
                          team1_mid_avg = 100,
                          team1_mid_sd = 5,
                          team2_fwd_avg = 150,
                          team2_fwd_sd = 20,
                          team2_mid_avg = 90,
                          team2_mid_sd = 10,
                          team3_fwd_avg = 180,
                          team3_fwd_sd = 15,
                          team3_mid_avg = 150,
                          team3_mid_sd = 15){

        ## Simulated data frame
        team <- rep(c("Team1","Team2", "Team3"), each = n_players * n_positions)
        pos <- c(rep(c("M", "F"), each = n_players), rep(c("M", "F"), each = n_players), rep(c("M", "F"), each = n_players))
        player_id <- as.factor(round(seq(from = 100, to = 300, length = length(team)), 0))
        
        d <- data.frame(team, pos, player_id)

        # simulate training loads
        training_load <- c(rnorm(n = n_players, mean = team1_mid_avg, sd = team1_mid_sd),
                   rnorm(n = n_players, mean = team1_fwd_avg, sd = team1_fwd_sd),
                   rnorm(n = n_players, mean = team2_mid_avg, sd = team2_mid_sd),
                   rnorm(n = n_players, mean = team2_fwd_avg, sd = team2_fwd_sd),
                   rnorm(n = n_players, mean = team3_mid_avg, sd = team3_mid_sd),
                   rnorm(n = n_players, mean = team3_fwd_avg, sd = team3_fwd_sd))
  
        ## construct the mixed model
  fit_lmer <- lmer(training_load ~ pos + (1 |team), data = d)
  summary(fit_lmer)
}

Try out the function

sim_func_lmer()

Now use replicate() and create 1000 simulations of the model and look at the first model in the list

team_training_lmer <- replicate(n = 1000,
                  sim_func_lmer(),
                  simplify = FALSE)

## look at the first model in the list
team_training_lmer[[1]]$coefficient

Store the coefficient results of the 1000 simulations in a data frame, create plots of the model coefficients, and compare the results of the simulation to the original mixed model.


lmer_coef <- matrix(NA, ncol = 5, nrow = length(team_training_lmer))
colnames(lmer_coef) <- c("intercept", "intercept_se", "posM", "posM_se", 'model_sigma')

for(i in 1:length(team_training_lmer)){
  
  lmer_coef[i, 1] <- team_training_lmer[[i]]$coefficients[1]
  lmer_coef[i, 2] <- team_training_lmer[[i]]$coefficients[3]
  lmer_coef[i, 3] <- team_training_lmer[[i]]$coefficients[2]
  lmer_coef[i, 4] <- team_training_lmer[[i]]$coefficients[4]
  lmer_coef[i, 5] <- team_training_lmer[[i]]$sigma
  
}

lmer_coef <- as.data.frame(lmer_coef) head(lmer_coef) ## Plot the coefficient for position lmer_coef %>%
  ggplot(aes(x = posM)) +
  geom_density(fill = "palegreen")

## Summarize the coefficients and their standard errors for the simulations
lmer_coef %>% 
  summarize(across(.cols = everything(),
                   ~mean(.x)))

## compare to the original model
broom.mixed::tidy(fit_lmer)
sigma(fit_lmer)

Extract the random effects for the intercept and the residual for each of the simulated models

ranef_sim <- matrix(NA, ncol = 2, nrow = length(team_training_lmer))
colnames(ranef_sim) <- c("intercept_sd", "residual_sd")

for(i in 1:length(team_training_lmer)){
  
  ranef_sim[i, 1] <- team_training_lmer[[i]]$varcor %>% as.data.frame() %>% select(sdcor) %>% slice(1) %>% pull(sdcor)
  ranef_sim[i, 2] <- team_training_lmer[[i]]$varcor %>% as.data.frame() %>% select(sdcor) %>% slice(2) %>% pull(sdcor)
  
}

ranef_sim <- as.data.frame(ranef_sim) head(ranef_sim) ## Summarize the results ranef_sim %>%
  summarize(across(.cols = everything(),
                   ~mean(.x)))

## Compare with the original model
VarCorr(fit_lmer)

Mixed Model 2

Above was a pretty simple model, just to get our feet wet. Let’s create a more complicated model. Usually in sport and exercise science we have repeated measures of individuals. Often, researchers will set the individual players as random effects with the fixed effects being the component that the researcher is attempting to make an inference about.

In this example, we will set up a team of 12 players with three positions (4 players per position): Forward, Mid, Defender. The aim is to explore the training load differences between position groups while accounting for repeated observations of individuals (in this case, each player will have 20 training sessions). Similar to our first regression model, we will build a data frame of everything we need and then calculate the outcome variable (training load) with a regression model using parameters that we specify. To make this work, we will need to specific an intercept and slope for the position group and an intercept and slope for the individual players as well as a model sigma value. Once we’ve done that, we will fit a mixed model, write a function, and then create 1000 simulations.

## Set up the data frame
n_pos <- 3
n_players <- 12
n_obs <- 20
players <- as.factor(round(seq(from = 100, to = 300, length = n_players), 0))


dat <- data.frame( player_id = rep(players, each = n_obs), pos = rep(c("Fwd", "Mid", "Def"), each = n_players/n_pos * n_obs), training_day = rep(1:n_obs, times = n_players) ) dat %>%
  head()

## Create model parameters
# NOTE: Defender will be the intercept
set.seed(6687)
pos_intercept <- 150
posF_coef <- 170
posM_coef <- -70
individual_intercept <- 50
individual_slope <- 10
sigma <- 10
model_error <- rnorm(n = nrow(dat), mean = 0, sd = sigma)


## we will also create some individual player variance
individual_player_variance <- c()

for(i in players){

  individual_player_variance[i] <- rnorm(n = 1, 
                  mean = runif(min = 2, max = 10, n = 1), 
                  sd = runif(min = 2, max = 5, n = 1))
  
}

individual_player_variance <- rep(individual_player_variance, each = n_obs)

dat$training_load <- pos_intercept + posF_coef * (dat$pos == "Fwd") + posM_coef * (dat$pos == "Mid") + individual_intercept + individual_slope * individual_player_variance + model_error dat %>%
  head()


Calculate summary stats

## Average training load by pos
dat %>%
  group_by(pos) %>%
  summarize(avg = mean(training_load),
            SD = sd(training_load))

## Plot
dat %>%
  ggplot(aes(x = training_load, fill = pos)) +
  geom_density(alpha = 0.5)

Construct the mixed model and evaluate the output

## Mixed Model
fit_lmer_pos <- lmer(training_load ~ pos + (1 | player_id), data = dat)
summary(fit_lmer_pos)
coef(fit_lmer_pos)
fixef(fit_lmer_pos)
ranef(fit_lmer_pos)
sigma(fit_lmer_pos)
hist(residuals(fit_lmer_pos))


Create a function for the simulation

sim_func_lmer2 <- function(n_pos = 3,
                          n_players = 12,
                          n_obs = 20,
                          pos_intercept = 150,
                          posF_coef = 170,
                          posM_coef = -70,
                          individual_intercept = 50,
                          individual_slope = 10,
                          sigma = 10){
  
  players <- as.factor(round(seq(from = 100, to = 300, length = n_players), 0))

  dat <- data.frame(
  player_id = rep(players, each = n_obs),
  pos = rep(c("Fwd", "Mid", "Def"), each = n_players/n_pos * n_obs),
  training_day = rep(1:n_obs, times = n_players)
  )
  
  model_error <- rnorm(n = nrow(dat), mean = 0, sd = sigma)
  
  individual_player_variance <- c()

  for(i in players){

    individual_player_variance[i] <- rnorm(n = 1, 
                  mean = runif(min = 2, max = 10, n = 1), 
                  sd = runif(min = 2, max = 5, n = 1))
    }

  individual_player_variance <- rep(individual_player_variance, each = n_obs)

  dat$training_load <- pos_intercept + posF_coef * (dat$pos == "Fwd") + posM_coef * (dat$pos == "Mid") + individual_intercept + individual_slope * individual_player_variance + model_error

  fit_lmer_pos <- lmer(training_load ~ pos + (1 | player_id), data = dat)
  summary(fit_lmer_pos)
}

Now use `replicate()` and create 1000 simulations of the model and look at the first model in the list.

player_training_lmer <- replicate(n = 1000,
                  sim_func_lmer2(),
                  simplify = FALSE)

## look at the first model in the list
player_training_lmer[[1]]$coefficient

Store the coefficient results from the simulations, summarize them, and compare them to the original mixed model.

lmer_player_coef <- matrix(NA, ncol = 7, nrow = length(player_training_lmer))
colnames(lmer_player_coef) <- c("intercept", "intercept_se","posFwd", "posFwd_se", "posMid", "posMid_se", 'model_sigma')

for(i in 1:length(player_training_lmer)){
  
  lmer_player_coef[i, 1] <- player_training_lmer[[i]]$coefficients[1]
  lmer_player_coef[i, 2] <- player_training_lmer[[i]]$coefficients[4]
  lmer_player_coef[i, 3] <- player_training_lmer[[i]]$coefficients[2]
  lmer_player_coef[i, 4] <- player_training_lmer[[i]]$coefficients[5]
  lmer_player_coef[i, 5] <- player_training_lmer[[i]]$coefficients[3]
  lmer_player_coef[i, 6] <- player_training_lmer[[i]]$coefficients[6]
  lmer_player_coef[i, 7] <- player_training_lmer[[i]]$sigma
  
}

lmer_player_coef <- as.data.frame(lmer_player_coef) head(lmer_player_coef) ## Plot the coefficient for position lmer_player_coef %>%
  ggplot(aes(x = posFwd)) +
  geom_density(fill = "palegreen")

lmer_player_coef %>%
  ggplot(aes(x = posMid)) +
  geom_density(fill = "palegreen")


## Summarize the coefficients and their standard errors for the simulations
lmer_player_coef %>% 
  summarize(across(.cols = everything(),
                   ~mean(.x)))

## compare to the original model
broom.mixed::tidy(fit_lmer_pos)
sigma(fit_lmer_pos)


Extract the random effects for the intercept and the residual for each of the simulated models.

ranef_sim_player <- matrix(NA, ncol = 2, nrow = length(player_training_lmer))
colnames(ranef_sim_player) <- c("player_sd", "residual_sd")

for(i in 1:length(player_training_lmer)){
  
  ranef_sim_player[i, 1] <- player_training_lmer[[i]]$varcor %>% as.data.frame() %>% select(sdcor) %>% slice(1) %>% pull(sdcor)
  ranef_sim_player[i, 2] <- player_training_lmer[[i]]$varcor %>% as.data.frame() %>% select(sdcor) %>% slice(2) %>% pull(sdcor)
  
}

ranef_sim_player <- as.data.frame(ranef_sim_player) head(ranef_sim_player) ## Summarize the results ranef_sim_player %>%
  summarize(across(.cols = everything(),
                   ~mean(.x)))

## Compare with the original model
VarCorr(fit_lmer_pos)

Wrapping Up

Mixed models can get really complicated and have a lot of layers to them. For example, we could make this a multivariable model with independent variables that have some level of correlation with each other. We could also add some level of autocorrelation for each player’s observations. There are also a number of different ways that you can construct these types of simulations. The two approaches used here are just dipping their toes in. Perhaps in future articles I’ll put together code for more complex mixed models.

All of the code for this article and the other 7 articles in this series are available on my GitHub page.

How Bayesian Search Theory Can Help You Find More Deer

Introduction

I was talking with a friend the other day who was telling me about his brother, who leads guided deer hunts in Wyoming. Typically, clients will come out for a hunt over several days and rely on him to guide them to areas of the forest where there is a high probability of seeing deer. Of course, nothing is guaranteed! It’s entirely possible to go the length of the trip and not see any deer at all. So, my friend was explaining that his brother is very good at constantly keeping an eye on the environment and his surroundings and creating a mental map in his head about the areas that will maximize his clients chances of finding deer. (There is probably an actual name for this skill, but I’m not sure what it is).

This is an interesting problem and reminds me of Bayesian Search Theory, which is used to help locate objects based on prior knowledge/information and the probability of seeing the object within a specific area given it would actually be there. This approach was most recently popularized for its use in the search for the wreckage of Malaysian Airlines Flight 370.

Let’s walk through an example of how Bayesian Search Theory works.

Setting up the search grid

Let’s say our deer guide has a map that he has broken up into a 4×4 grid of squares. He places prior probabilities of seeing a deer in each region given what he knows about the terrain (e.g., areas with water and deer food will have a higher probability of deer activity).

His priors grid looks like this:

library(tidyverse)

theme_set(theme_light())

# priors for each square region looks like this
matrix(data = c(0.01, 0.02, 0.01, 0.1, 0.1, 0.03, 0.03, 0.03, 
                0.2, 0.2, 0.17, 0.1, 0.01, 0.01, 0.02, 0.01), 
       byrow = TRUE, 
       nrow = 4, ncol = 4) %>%
  as.data.frame() %>%
  setNames(c("1", "2", "3", "4")) %>%
  pivot_longer(cols = everything(),
               names_to = 'x_coord') %>%
  mutate(y_coord = rep(1:4, each = 4)) %>%
  relocate(y_coord, .before = x_coord) %>%
  ggplot(aes(x = x_coord, y = y_coord)) +
  geom_text(aes(label = value, color = value),
            size = 10) +
  scale_color_gradient(low = "red", high = "green") +
  labs(x = "X Coorindates",
         y = "Y Coordinates",
         title = "Prior Probabilities of 4x4 Square Region",
         color = "Probability") +
  theme(axis.text = element_text(size = 11, face = "bold"))

We see that the y-coordiate range of 3 has a high probability of deer activity. In particular, xy = (1, 3) and xy = (2, 3) seem to maximize the chances of seeing a deer.

The likelihood grid

The likelihood in this case describe the probability of seeing a deer in a specific square region given that a deer is actually there. p(Square Region | Deer)

To determine these likelihoods, our deer guide is constantly scouting the areas and making mental notes about what he sees. He documents certain things within each square region that would indicate deer are there. For example, deer droppings, foot prints, actually seeing some deer, and previous successful hunts in a certain region. Using this information he creates a grid of the following likelihoods:

matrix(data = c(0.88, 0.82, 0.88, 0.85, 0.77, 0.65, 0.83, 0.95, 
                0.98, 0.97, 0.93, 0.94, 0.93, 0.79, 0.68, 0.80), 
       byrow = TRUE, 
       nrow = 4, ncol = 4) %>%
  as.data.frame() %>%
  setNames(c("1", "2", "3", "4")) %>%
  pivot_longer(cols = everything(),
               names_to = 'x_coord') %>%
  mutate(y_coord = rep(1:4, each = 4)) %>%
  relocate(y_coord, .before = x_coord) %>%
  ggplot(aes(x = x_coord, y = y_coord)) +
  geom_text(aes(label = value, color = value),
            size = 10) +
  scale_color_gradient(low = "red", high = "green") +
  labs(x = "X Coorindates",
         y = "Y Coordinates",
         title = "Likelihoods for each 4x4 Square Region",
         subtitle = "p(Region | Seeing Deer)",
         color = "Probability") +
  theme(axis.text = element_text(size = 11, face = "bold"))

Combining our prior knowledge and likelihoods

To make things easier, I’ll put both the priors and likelihoods into a single data frame.

dat <- data.frame(
  coords = c("1,1", "2,1", "3,1", "4,1", "1,2", "2,2", "3,2", "4,2", "1,3", "2,3", "3,3", "4,3", "1,4", "2,4", "3,4", "4,4"),
  priors = c(0.01, 0.02, 0.01, 0.1, 0.1, 0.03, 0.03, 0.03, 
                0.2, 0.2, 0.17, 0.1, 0.01, 0.01, 0.02, 0.01),
  likelihoods = c(0.88, 0.82, 0.88, 0.85, 0.77, 0.65, 0.83, 0.95, 
                0.98, 0.97, 0.93, 0.94, 0.93, 0.79, 0.68, 0.80))

dat

Now he multiplies the prior and likelihood together to summarize his belief of deer in each square region.

dat <- dat %>%
  mutate(posterior = round(priors * likelihoods, 2))

dat %>%
  ggplot(aes(x = coords, y = posterior)) +
  geom_col(color = "black",
           fill = "light grey") +
  scale_y_continuous(labels = scales::percent_format()) +
  theme(axis.text = element_text(size = 11, face = "bold"))

As expected, squares (1,3), (2,3), and (3,3) have the highest probability of observing a deer. Those would be the first areas that our deer hunt guide would want to explore when taking new clients out.

Updating Beliefs After Searching a Region

Since square (1,3) has the highest probability our deer hunt guide decides to take his new clients out there to search for deer. After one day of searching they don’t find any deer. To ensure success tomorrow, he needs to update his knowledge not only about square (1, 3) but about all of the other squares in his 4×4 map.

To update square (1, 3) we use the following equation:

update.posterior = prior * (1 – likelihood) / (1 – prior*likelihood)

coord1_3 <- dat %>%
  filter(coords == "1,3")

coord1_3$priors * (1 - coord1_3$likelihoods) / (1 - coord1_3$priors * coord1_3$likelihoods)

Thew new probability for region (1, 3) is 0.5%. Once we update that square region we can update the other regions using the following equation:

update.prior = prior * (1 / (1 – prior * likelihood))

We can do this for the entire data set all at once:

dat <- dat %>%
  mutate(posterior2 = round(priors * (1 / (1 - priors*likelihoods)), 2),
         posterior2 = ifelse(coords == "1,3", 0.005, posterior2)) %>%
  mutate(updated_posterior = round(posterior2 * likelihoods, 3))

dat %>%
  select(coords, posterior, updated_posterior) %>%
  pivot_longer(cols = -coords) %>%
  ggplot(aes(x = coords, y = value, fill = name)) +
  geom_col(position = "dodge") +
  scale_y_continuous(labels = scales::percent_format()) +
  theme(axis.text = element_text(size = 11, face = "bold"))


matrix(dat$updated_posterior, ncol = 4, nrow = 4, byrow = TRUE) %>%
  as.data.frame() %>%
  setNames(c("1", "2", "3", "4")) %>%
  pivot_longer(cols = everything(),
               names_to = 'x_coord') %>%
  mutate(y_coord = rep(1:4, each = 4)) %>%
  relocate(y_coord, .before = x_coord) %>%
  ggplot(aes(x = x_coord, y = y_coord)) +
  geom_text(aes(label = value, color = value),
            size = 10) +
  scale_color_gradient(low = "red", high = "green") +
  labs(x = "X Coorindates",
         y = "Y Coordinates",
         title = "Updated Posterior for each 4x4 Square Region after searching (1,3) and\nnot seeing deer",
         color = "Probability") +
  theme(axis.text = element_text(size = 11, face = "bold"))

We can see that after updating the posteriors following day 1, hist best approach is to search grid (2, 3) and (3,3) tomorrow, as the updated beliefs indicate that they have a higher probability of having a deer in them.

Conclusion

We can of course continue updating after searching each region until we finally find the deer but we will stop here and allow you to play with the code and continue on if you’d like. This tutorial is just to provide a brief look into how to use Bayesian Search Theory to locate objects in various spaces, so hopefully you can use this method and apply it to other aspects of your life.

The complete code is available on my GitHub page.

TidyX Episode 161: ShinyLive – A serverless shiny option

A shiny app has always required a server to run. If you wanted to connect data to your users and allow them to interact with it you needed server accessibility. Well, not any more! This week, Ellis Hughes and I explore ShinyLive as an option for creating serverless shiny app. It’s early days, but the future is bright. Watch us walk through trying to learn how to set up a serverless shiny app  and figure out the ins and outs of the code.

To watch our screen cast, CLICK HERE.

To access our code, CLICK HERE.

TidyX Episode 160: Prepopulate a shiny app url with user defined parameters

This week, Ellis Hughes and I discuss methods for prepopulating a shiny app’s user defined parameters from a URL. This type of trick comes in handy if you have a separate application that people are using and you want to provide them a link to click out to your shiny app. However, when opening that link you don’t want it to open the shiny app from the typical starting point. Rather, you want them to be take directly to the data they require based on the information they were reviewing in the other application.

To watch our screen cast, CLICK HERE.

To access our code, CLICK HERE.

Simulations in R Part 7: Measurement Error in Regression

We’ve been working with building simulations in the past 6 articles of this series. In the last two installments we talked specifically about using simulation to explore different linear regression model assumptions. Today, we continue with linear regression and we use simulation to understand how measurement error (something we all face) influences our linear regression parameter estimates.

Here were the past 6 sections in this series:

  • Part 1 discussed the basic functions for simulating and sampling data in R
  • Part 2 walked us through how to perform bootstrap resampling and then simulate bivariate and multivariate distributions
  • Part 3 we worked making group comparisons by simulating thousands of t-tests
  • Part 4 building simulations for linear regression
  • Part 5 using simulation to investigate the homoskedasticity assumption in regression
  • Part 6 using simulation to investigate the multicollinearity assumption in regression

As always, complete code for this article is available on my GitHub page.

Measurement Error

Measurement error occurs when the values we have observed during data collection differ from the true values. This can sometimes be the cause of imperfect proxy measures where we are using certain measurements (perhaps tests that are easier to obtain in our population or setting) in place of the thing we actually care about. Or, it can happen because tests are imperfect and all data collection has limitations and error (which we try as hard as possible to minimize).

Measurement error can be systematic or random. It can be challenging to detect in a single sample. We will build a simulation to show how measurement error can bias our regression coefficients and perhaps hide the true relationship.

Constructing the simulation

For this simulation, we are going to use a random draw from a normal distribution for the independent variable to represent noise in the model, which will behave like measurement error. We will use a nested for() loop, as we did in Part 6 of this series, where the outer loop stores the outcomes of the regression model under each level of measurement error, which is built in the inner loop.

We begin by creating 11 levels of measurement error, ranging from 0 (no measurement error) to 1 (extreme measurement error). This values are going to serve as the standard deviation when we randomly draw from a normal distribution with a mean of 0. In this way, we are creating noise in the model.

## levels of error measurement to test
# NOTE: these values will be used as the standard deviation in our random draws
meas_error <- seq(from = 0, to = 1, by = 0.1)

# create the number of simulations and reps you want to run
n <- 1000

# true variables
intercept <- 2
beta1 <- 5
independent_var <- runif(n = n, min = -1, max = 1)

## create a final array store each of the model error levels in their own list
final_df <- array(data = NA, dim = c(n, 2, length(meas_error)))

## create a data frame to store the absolute bias at each level of measurement error
error_bias_df <- matrix(nrow = n, ncol = length(meas_error))

Next, we build our nested for() loop and simulate models under the different measurement error conditions.

## loop
for(j in 1:length(meas_error)){
  
  # a store vector for the absolute bias from each inner loop
  abs_bias_vect <- rep(0, times = n)
  
  ## storage data frame for the beta coefficient results in each inner loop simulated regression
  df_betas <- matrix(NA, nrow = n, ncol = 2)
  
  # simulate independent variable 1000x with measurement error
  ind_var_with_error <- independent_var + rnorm(n = n, 0, meas_error[j])
  
  for(i in 1:n){
    
    y_hat <- intercept + beta1*independent_var + rnorm(n = n, 0, 1)
    fit <- lm(y_hat ~ ind_var_with_error)
    
    df_betas[i, 1] <- fit$coef[1]
    df_betas[i, 2] <- fit$coef[2]
    
    abs_bias_vect[i] <- abs(fit$coef[2] - beta1)
  }
  
  ## store final results of each inner loop
  # store the model betas in a list for each level of measurement error
  final_df[, , j] <- df_betas
  
  # store the absolute bias
  error_bias_df[, j] <- abs_bias_vect

}

Have a look at the first few rows of each results data frame.

 

## check the first few values of each new element
head(error_bias_df)

## the final_df's are stored as a list of arrays for each level of measurement error
# Here is the 11th array (measurement error == 1.0)
head(final_df[, ,11])

Plotting the results

Now that we have stored the data for each level of measurement error, let’s do some plotting of the data to explore how measurement error influences our regression model.

Plot the standard deviation of the beta coefficient for the model with no measurement error and the model with the most extreme measurement error.

no_error <- final_df[, ,1] %>% as.data.frame() %>% mutate(measurement_error = "No Measurement Error")

extreme_error <- final_df[, ,11] %>% as.data.frame() %>% mutate(measurement_error = "Extreme Measurement Error")

no_error %>%
  bind_rows(
    extreme_error
  ) %>%
  ggplot(aes(x = V2, fill = measurement_error)) +
  geom_density(alpha = 0.8) +
  facet_wrap(~measurement_error, scales = "free") +
  theme(strip.text = element_text(size = 14, face = "bold"),
        legend.position = "top") +
  labs(x = "Simulated Model Beta Coefficient",
       fill = "Measurement Error",
       title = "Beta Coefficients Differences Due to Measurement Error")

Notice that the value with no measurement error the beta coefficient is around 5, just as we specified the true value in our simulation. However, the model with a measurement error of 1 (shown in green) has the beta coefficient centered around 1.2, which is substantially lower than the true beta coefficient value of 5.

Plot the change in absolute bias across each level of measurement error

First let’s look at the average bias across each level of measurement error.

absolute_error <- error_bias_df %>%
  as.data.frame() %>%
  setNames(paste0('x', meas_error))

# On average, how much absolute bias would we expect
colMeans(absolute_error) %>%
  data.frame() %>%
  rownames_to_column() %>%
  mutate('rowname' = parse_number(rowname)) %>%
  setNames(c("Level of measurement error", 'Absolute bias')) %>%
  gt::gt()

Next, make a plot of the relationship.

absolute_error %>%
  pivot_longer(cols = everything()) %>%
  ggplot(aes(x = name, y = value, group = 1)) +
  geom_point(shape = 21,
             size = 4,
             color = "black",
             fill = "light grey") +
  stat_summary(fun = mean,
               geom = "line",
               color = "red",
               size = 1.2,
               linetype = "dashed") +
  labs(x = "Amount of Measurement Error",
       y = "Absolute Bias",
       title = "Absolute Bias of the True Parameter Value")

Notice that as the amount of measurement increases so too does the absolute bias of the model coefficient.

 

Wrapping Up

Measurement error is something that all of us deal with in practice, whether you are conducting science in a lab or working in an applied setting. Knowing how measurement error influences regression coefficients and the tricks it can play in our beliefs to unveil true parameter values is important to keep in mind. Expressing our uncertainty around model outputs is critical to communicating what we think we know about our observations and how (un)certain we may be. This is one of the values of, in my opinion, Bayesian model building, as we can work directly with sampling from posterior distributions that provide us a way of building up distributions, which allow us to explore uncertainty and make probabilistic statements.

The complete code for this article is available on my GitHub page.