Category Archives: Model Building in R

Frequentist & Bayesian Approaches to Regression Models by Hand – Compare and Contrast

One of the ways I try to learn things is to code them from first principles. It helps me see what is going on under the hood and also allows me to wrap my head around how things work. Building regression models in R is incredibly easy using the lm() function, and Bayesian regression models can be conveniently built with the same syntax in packages like {rstanarm} and {brms}. However, today I'm going to do everything by hand to get a grasp of how the Bayesian regression model works at a very basic level. I put together the code using the mathematical presentation of these concepts from William Bolstad's book, Introduction to Bayesian Statistics.

The entire script is available on my GITHUB page.

Loading Packages & Getting Data

I’m going to use the data from the {palmerpenguins} package and concentrate on a simple linear regression which will estimate flipper length from bill length.

### packages ------------------------------------------------------
library(tidyverse)
library(palmerpenguins)
library(patchwork)

theme_set(theme_classic())

### data ----------------------------------------------------------
data("penguins")
dat <- penguins %>%
  select(bill_length = bill_length_mm, flipper_length = flipper_length_mm) %>%
  na.omit()

head(dat)

 

EDA

These two variables share a relatively large correlation with each other.
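A quick scatterplot and correlation check confirms this (a minimal sketch; the plotting choices here are mine and not necessarily the original post's code):

dat %>%
  ggplot(aes(x = bill_length, y = flipper_length)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

cor(dat$bill_length, dat$flipper_length)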

Ordinary Least Squares Regression

I'll start by building a regression model under the Frequentist paradigm with the lm() function, specifying no prior values on the parameters (even though, in reality, the OLS model implies a flat prior where all values are equally plausible; a discussion for a different time).

### Ordinary Least Squares Regression ------------------------------
fit_ols <- lm(flipper_length ~ I(bill_length - mean(dat$bill_length)), data = dat)
summary(fit_ols)

Technical Note: Since we don’t observe a bill length of 0 mm in penguins, I chose to transform the bill length data by grand mean centering it, subtracting the grand mean from each bill length. This won't change the predictions from the model but it does change our interpretation of the coefficients (the intercept is now directly interpretable, which it wouldn't be if we left the bill length data in its untransformed state).

OLS Regression by Hand

Okay, that was easy. In a single line we constructed a model and we are up and running. But, let’s calculate this by hand so we can see where the coefficients and their corresponding standard errors came from.

# Calculate the least squares regression line by hand now
dat_stats <- dat %>%
  summarize(x_bar = mean(bill_length),
            x_sd = sd(bill_length),
            y_bar = mean(flipper_length),
            y_sd = sd(flipper_length),
            r = cor(bill_length, flipper_length),
            x2 = x_bar^2,
            y2 = y_bar^2,
            xy_bar = x_bar * y_bar,
            .groups = "drop")

dat_stats

In the above code, I’m calculating summary statistics (mean and SD) for both the independent and dependent variables, extracting their correlation coefficient, and getting their squared means and the product of their means to be used in downstream calculations of the regression coefficients. Here is what the output looks like:

Because we grand mean centered bill length, the model intercept is simply the mean of our outcome variable (flipper length).

intercept <- dat_stats$y_bar

The coefficient for our independent variable is calculated as the correlation coefficient multiplied by the SD of the y variable divided by the SD of the x variable.

beta <- with(dat_stats,
             r * (y_sd / x_sd))

beta

Finally, I’ll store the grand mean of bill length so that we can use it when we need to center bill length and make predictions.

x_bar <- dat_stats$x_bar
x_bar

Now that we have the intercept and beta coefficient for bill length calculated by hand we can construct the regression equation:
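In general form, using the values we just calculated:

flipper_length = intercept + beta * (bill_length - x_bar)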

Notice that these are the exact values we obtained from our model using the lm() function.

We can use the two models to make predictions and show that they are the exact same.

## Make predictions with the two models
dat %>%
  mutate(pred_model = predict(fit_ols),
         pred_hand = intercept + beta * (bill_length - x_bar))

In the model estimated from the lm() function we also get a sigma parameter (the residual standard error, RSE), which tells us, on average, how far off the model predictions are. We can build this by hand by first calculating the squared error between the observed values and our predictions and then calculating the RSE as the square root of the sum of squared residuals divided by the model degrees of freedom.

# Calculate the estimated variance around the line
N_obs <- nrow(dat)

dat %>%
  mutate(pred = intercept + beta * (bill_length - x_bar),
         resid = (flipper_length - pred),
         resid2 = resid^2) %>%
  summarize(n_model_params = 2,
            deg_freedom = N_obs - n_model_params,
            model_var = sum(resid2) / deg_freedom,
            model_sd = sqrt(model_var))

Again, our by-hand calculations match exactly what we obtained from the lm() function, 10.6.

Bayesian Linear Regression by Hand

Now let’s turn our attention to building a Bayesian regression model.

First we need to calculate the sum of squares for X, which we do with this equation:

ss_x = N * (mean(x^2) - mean(x)^2)

N <- nrow(dat)
mean_x2 <- mean(dat$bill_length^2)
mu_x2 <- mean(dat$bill_length)^2

N
mean_x2
mu_x2

ss_x <- N * (mean_x2 - mu_x2)
ss_x

Next, we need to specify some priors on the model parameters. The priors can come from a number of sources. We could set them subjectively (usually people hate this idea). Or, we could look in the scientific literature and use prior research as a guide for plausible values of the relationship between flipper length and bill length. In this case, I’m just going to specify some priors that are within the range of the data but have enough variance around them to “let the data speak”. (NOTE: you could rerun the entire analysis with different priors and smaller or larger variances to see how the model changes. I hope to do a longer blog post about priors in the future).

  • For the slope coefficient we choose a normal prior, N(1, 1^2)
  • For the intercept coefficient we choose a normal prior, N(180, 10^2)
  • We don’t know the true variance so we use the estimated variance from the least squares regression line

prior_model_var <- sigma(fit_ols)^2

prior_slope_mu <- 1
prior_slope_var <- 1^2

prior_intercept_mu <- 180
prior_intercept_var <- 10^2

Next, we calculate the posterior precision (1/variance) for the slope coefficient. We do it with this equation:

1/prior_slope_var + (ss_x / prior_model_var)

And then convert it to a standard deviation (which is more useful to us and easier to interpret than precision, since it is on the scale of the data).

# 1/prior_slope_var + (ss_x / prior_model_var)
posterior_slope_precision <- 1 / prior_slope_var + ss_x / prior_model_var

# Convert to SD
posterior_slope_sd <- posterior_slope_precision^-(1/2)

Once we have the precision for the posterior slope calculated we can calculate the posterior regression coefficient for the slope using this equation:

(1/prior_slope_var) / posterior_slope_precision * prior_slope_mu + (ss_x / prior_model_var) / posterior_slope_precision * beta

# posterior slope
posterior_slope_mu <- (1/prior_slope_var) / posterior_slope_precision * prior_slope_mu + (ss_x / prior_model_var) / posterior_slope_precision * beta
posterior_slope_mu

We can plot the prior and posterior slope values by simulating from each using their mu and SD values.

## Plot prior and posterior for the slope
set.seed(4)
prior_sim <- rnorm(n = 1e4, mean = prior_slope_mu, sd = sqrt(prior_slope_var))
posterior_sim <- rnorm(n = 1e4, mean = posterior_slope_mu, sd = posterior_slope_sd)

plot(density(posterior_sim),
     col = "blue",
     lwd = 4,
     xlim = c(-2, 3),
     main = "Prior & Posterior\nfor\nBayesian Regression Slope Coefficient")
lines(density(prior_sim),
      col = "red",
      lty = 2,
      lwd = 4)
legend("topleft",
       legend = c("Prior", "Posterior"),
       col = c("red", "blue"),
       lty = c(2, 1),
       lwd = c(2,2))

We notice that the prior for the slope, N(1, 1^2), had a rather broad range of plausible values. However, after combining the observed data with our prior, we find that the posterior has settled into a much narrower, more reasonable range. Again, if you had selected other priors, or perhaps a prior with a much narrower variance, these results would be different and potentially more influenced by your prior.

Now that we have a slope parameter we need to calculate the intercept. The equations are commented out in the code below.

## Posterior precision for the intercept
# 1/prior_intercept_var + N/prior_model_var

posterior_intercept_precision <- 1/prior_intercept_var + N/prior_model_var
posterior_intercept_precision

# Convert to SD
posterior_intercept_sd <- posterior_intercept_precision^-(1/2)
posterior_intercept_sd

## Posterior intercept mean
# (1/prior_intercept_var) / posterior_intercept_precision * prior_intercept_mu + (N/prior_model_var) / posterior_intercept_precision * intercept
posterior_intercept_mu <- (1/prior_intercept_var) / posterior_intercept_precision * prior_intercept_mu + (N/prior_model_var) / posterior_intercept_precision * intercept
posterior_intercept_mu

Comparing our Bayesian regression coefficients and standard errors to the ones we obtained from the lm() function, we find that they are nearly identical.

Using the coefficients and their corresponding standard errors we can also compare the 95% Credible Interval from the Bayesian model to the 95% Confidence Interval from the OLS model (again, nearly identical).
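A minimal sketch of that comparison (my code, not necessarily what was used to produce the original table) might look like:

## 95% confidence intervals from the OLS model
confint(fit_ols)

## 95% credible intervals from the Bayesian posteriors (normal approximation)
posterior_intercept_mu + qnorm(c(0.025, 0.975)) * posterior_intercept_sd
posterior_slope_mu + qnorm(c(0.025, 0.975)) * posterior_slope_sd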

And writing out the Bayesian model we see that it is the same as the Frequentist model.

Using the posterior mean and SD for the slope and intercept we can simulate distributions, plot them, and summarize them using quantile intervals and credible intervals to get a picture of our uncertainty.

## Use simulation for the slope and intercept
set.seed(413)
posterior_intercept_sim <- rnorm(n = 1e4, mean = posterior_intercept_mu, sd = posterior_intercept_sd)
posterior_slope_sim <- rnorm(n = 1e4, mean = posterior_slope_mu, sd = posterior_slope_sd)

par(mfrow = c(1, 2))
hist(posterior_intercept_sim)
hist(posterior_slope_sim)
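We can also summarize the simulated posteriors with quantile intervals (a quick sketch added here for completeness):

## 95% quantile intervals for the simulated posteriors
quantile(posterior_intercept_sim, probs = c(0.025, 0.5, 0.975))
quantile(posterior_slope_sim, probs = c(0.025, 0.5, 0.975))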

Making Predictions

Now that we have our models specified it’s time to make some predictions!

## Add all predictions to the original data set
dat_final <- dat %>%
  mutate(pred_ols = predict(fit_ols),
         pred_ols_by_hand = intercept + beta * (bill_length - x_bar),
         pred_bayes_by_hand = posterior_intercept_mu + posterior_slope_mu * (bill_length - x_bar))

head(dat_final, 10)

Because the equations were pretty much identical we can see that all three models (OLS with lm(), OLS by hand, and Bayes by hand) produce the same predicted values for flipper length.

We can also visualize these predictions along with the prediction intervals.

## Predictions with uncertainty from the OLS model
ols_preds <- dat_final %>%
  bind_cols(predict(fit_ols, newdata = ., interval = "prediction")) %>%
  ggplot(aes(x = fit, y = flipper_length)) +
  geom_point(size = 2) +
  geom_ribbon(aes(ymin = lwr, ymax = upr),
              alpha = 0.4) +
  geom_smooth(method = "lm",
              se = FALSE) +
  ggtitle("OLS Predictions with 95% Prediction Intervals")

## Now predict with the Bayesian Model
# The prediction error variance is calculated as:
# error_var = prior_model_var + posterior_intercept_sd^2 + posterior_slope_sd^2 * (x - x_bar)
bayes_preds <- dat %>%
  mutate(fit = posterior_intercept_mu + posterior_slope_mu * (bill_length-x_bar),
         error_var = prior_model_var + posterior_intercept_sd^2 + posterior_slope_sd^2 * (bill_length - x_bar),
         rse = sqrt(error_var),
         lwr = fit - qt(p = 0.975, df = N - 2) * rse,
         upr = fit + qt(p = 0.975, df = N - 2) * rse) %>%
  ggplot(aes(x = fit, y = flipper_length)) +
  geom_point(size = 2) +
  geom_ribbon(aes(ymin = lwr, ymax = upr),
              alpha = 0.4) +
  geom_smooth(method = "lm",
              se = FALSE) +
  ggtitle("Bayesian Predictions with 95% Prediction Intervals")


ols_preds | bayes_preds

Okay, I know what you are thinking. The equations were the same, the predictions are the same, and the prediction intervals are the same….What’s the big deal?!

The big deal is that the Bayesian model gives us more flexibility. Not only can we specify priors, allowing us to have a higher weighting on more plausible values (based on prior data, prior research, domain expertise, etc.); but, we can also produce an entire posterior distribution for the predictions.

Let’s take a single row of observation from the data and make a prediction on it, complete with 95% credible intervals, using our Bayesian model.

single_obs <- dat %>% slice(36)
single_obs

pred_bayes <- single_obs %>%
  mutate(fit = posterior_intercept_mu + posterior_slope_mu * (bill_length-x_bar),
         error_var = prior_model_var + posterior_intercept_sd^2 + posterior_slope_sd^2 * (bill_length - x_bar),
         rse = sqrt(error_var),
         lwr = fit - qt(p = 0.975, df = N - 2) * rse,
         upr = fit + qt(p = 0.975, df = N - 2) * rse) 

Next, we simulate 10,000 observations using the mean prediction (fit) and the rse from the table above and summarize the results.

set.seed(582)
pred_bayes_sim <- rnorm(n = 1e4, mean = pred_bayes$fit, sd = pred_bayes$rse)

mean(pred_bayes_sim)
quantile(pred_bayes_sim, probs = c(0.5, 0.025, 0.975))
mean(pred_bayes_sim) + qt(p = c(0.025, 0.975), df = N - 2) * sd(pred_bayes_sim)

Our mean value, the 95% quantile intervals, and the 95% credible intervals are similar to the table above (they should be, since we simulated with those parameters). We will also get similar values if we use our OLS model and make a prediction on this observation with prediction intervals.

But, because we simulated a full distribution, from the Bayesian prediction we also get 10,000 plausible values for the estimate of flipper length based on the observed bill length.

This offers us flexibility to make more direct statements about the predictive distribution. For example, we might ask, “What is the probability that the predicted flipper length is greater than 210 mm?” and we could directly compute this from the simulated data! This provides us with a number of options if we were using this model to make decisions on and perhaps have threshold values or ranges of critical importance where we need to know how much of the density of our predicted distribution is below or above the threshold or falls outside of the region of importance.
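For instance, that probability is a one-liner on the simulated draws (a small sketch using pred_bayes_sim from above; the 210 mm threshold is just an illustrative value):

## Probability the predicted flipper length exceeds 210 mm
mean(pred_bayes_sim > 210)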

Wrapping Up

That’s a little bit on working with regression models by hand, from first principles. The Bayesian approach affords us a lot of control and flexibility over the model, via priors (more on that in a future blog post) and the predictions (via simulation). The Bayesian regression model calculated here was a simple approach and did not require Markov Chain Monte Carlo (MCMC), which would be run when using {rstanarm} and {brms}. If you are curious how that works, I wrote a previous blog on using a Gibbs Sampler to estimate the posterior distributions for a regression model, CLICK HERE. But, even with this simple approach we are afforded the ability to specify priors on the model parameters of the deterministic/fixed component of the model. We did not specify a prior on the stochastic/error part of the model (we used the sigma value calculated from the OLS model, which we also calculated by hand). This was necessary in this simple conjugate approach because the normal distribution is a two-parameter distribution, so we needed to treat one of the parameters as fixed (in this case the SD). Had we used an MCMC approach, we could have further specified a prior distribution on the model error.

If you notice any errors, feel free to drop me an email.

The code is available on my GITHUB page.

Calculating Confidence Intervals in R

A favorite paper of mine is the 1986 paper by Gardner and Altman regarding confidence intervals and estimation as a more useful way of reporting data than a dichotomous p-value:

Gardner, MJ. Altman, DG. (1986). Confidence intervals rather than P values: Estimation rather than hypothesis testing. Brit Med J; 292:746-750.

In this paper, Gardner and Altman discuss three main points for either moving away from or supplementing statistical reporting with p-values:

  1. Research often focuses on null hypothesis significance testing with the goal being to identify statistically significant results.
  2. However, we are often interested in the magnitude of the factor of interest.
  3. Given that research deals with samples of a broader population, the readers are not only interested in the observed magnitude of the estimand but also the degree of variability and plausible range of values for the population. This variability can be quantified using confidence intervals.

Aside from the paper providing a clear explanation of the issue at hand, their appendix offers the equations to calculate confidence intervals for means, differences in means, proportions, and differences in proportions. Thus, I decided to compile the appendix in an R script for those looking to code confidence intervals (and not have to rely on pre-built functions).

All of this code is available on my GITHUB page.

Confidence Intervals for Means and Differences

Single Sample

  1. Obtain the mean
  2. Calculate the standard error (SE) of the mean as SE = SD/sqrt(N)
  3. Multiply by a t-critical value specific to the level of confidence of interest and the degrees of freedom (DF) for the single sample, DF = N – 1

The confidence intervals are calculated as:

Low = mean - t_{crit} * SE
High = mean + t_{crit} * SE

Example

We collect data on 30 participants on a special test of strength and observe a mean of 40 and standard deviation of 10. We want to calculate the 90% confidence interval.

The steps above can easily be computed in R. First, we write down the known info (the data that is provided to us). We then calculate the standard error and degrees of freedom (N - 1). To obtain the critical value for our desired level of confidence, we use the t-distribution, which is specific to the degrees of freedom of our data.

## Known info
N <- 30
avg <- 40
SD <- 10

## Calculate DF and SE
SE <- SD / sqrt(N)
DF <- N - 1

## Calculate the confidence level of interest (90%), the amount of data in each of the two tails ((1 - level of interest) / 2), and the t-critical value
level_of_interest <- 0.90
tail_value <- (1 - level_of_interest)/2

t_crit <- abs(qt(tail_value, df = DF))
t_crit

## Calculate the 90% CI
low90 <- round(avg - t_crit * SE, 1)
high90 <- round(avg + t_crit * SE, 1)

cat("The 90% Confidence Interval is:", low90, " to ", high90)

Two Samples

  1. Obtain the sample mean and standard deviations for the two samples
  2. Pool the estimate of the standard deviation: s = sqrt(((n_1 - 1)*s_1^2 + (n_2 - 1)*s_2^2) / (n_1 + n_2 - 2))
  3. Calculate the SE for the difference: SE_{diff} = s * sqrt(1/n_1 + 1/n_2)
  4. Calculate the confidence interval as:

Low = (x_1 - x_2) - t_{crit} * SE_{diff}
High = (x_1 - x_2) + t_{crit} * SE_{diff}

Example

The example in the paper provides the following info:

  • Blood pressure levels were measured in 100 diabetic and 100 non-diabetic men aged 40-49 years old.
  • Mean systolic blood pressure was 146.4 mmHg (SD = 18.5) in diabetics and 140.4 mmHg (SD = 16.8) in non-diabetics.

Calculate the 95% CI.

## store the known data
N_diabetic <- 100
N_non_diabetic <- 100
diabetic_avg <- 146.4
diabetic_sd <- 18.5
non_diabetic_avg <- 140.4
non_diabetic_sd <- 16.8

## calculate the difference in means, the pooled SD, and the SE of diff
group_diff <- diabetic_avg - non_diabetic_avg

pooled_sd <- sqrt(((N_diabetic - 1)*diabetic_sd^2 + (N_non_diabetic - 1)*non_diabetic_sd^2) / (N_diabetic + N_non_diabetic - 2))

se_diff <- pooled_sd * sqrt(1/N_diabetic + 1/N_non_diabetic)

## Calculate the confidence level of interest (95%), the amount of data in each of the two tails ((1 - level of interest) / 2), and the t-critical value
level_of_interest <- 0.95
tail_value <- (1 - level_of_interest)/2

t_crit <- abs(qt(tail_value, df = N_diabetic + N_non_diabetic - 2))
t_crit

## Calculate the 95% CI
low95 <- round(group_diff - t_crit * se_diff, 1)
high95 <- round(group_diff + t_crit * se_diff, 1)

cat("The 95% Confidence Interval is:", low95, " to ", high95)

Confidence Intervals for Proportions

Single Sample

  1. Obtain the observed proportion
  2. Calculate the SE of the proportion, SE = sqrt((p * (1-p)) / N)
  3. Obtain the z-critical value from a standard normal distribution for the level of confidence of interest (the critical value for a proportion does not depend on the degrees of freedom as it does for means).
  4. Calculate the confidence interval:

low = p - z_{crit} * SE
high = p + z_{crit} * SE

Example

We observe a basketball player with 80 field goal attempts and a FG% of 39%. Calculate the 95% CI.

## Store the known info
N <- 80
fg_pct <- 0.39

## Calculate SE
se <- sqrt((fg_pct * (1 - fg_pct)) / N)

## Calculate z-critical value for 95% confidence
level_of_interest <- 0.95
tail_value <- (1 - level_of_interest) / 2
z_crit <- qnorm(p = tail_value, lower.tail = FALSE)

## Calculate the 95% CI
low95 <- round(fg_pct - z_crit * se, 3)
high95 <- round(fg_pct + z_crit * se, 3)

cat("The 95% Confidence Interval is:", low95, " to ", high95)

Two Samples

  1. Calculate the difference in proportions between the two groups
  2. Calculate the SE of the difference in proportions: SE_{diff} = sqrt(((p_1 * (1-p_1)) / n_1) + ((p_2 * (1 - p_2)) / n_2))
  3. Calculate the z-critical value for the level of interest
  4. Calculate the confidence interval as:

low = (p_1 - p_2) - (z_{crit} * se_{diff})
high = (p_1 - p_2) + (z_{crit} * se_{diff})

Example of two unpaired samples

The study provides the following table of example data:

data.frame(
  response = c("improvement", "no improvement", "total"),
  treatment_A = c(61, 19, 80),
  treatment_B = c(45, 35, 80)
)

The difference we are interested in is between the proportion who improved in treatment A and the proportion of those who improved in treatment B.

## Obtain the two proportions from the table and total sample sizes
pr_A <- 61/80
n_A <- 80
pr_B <- 45/80
n_B <- 80

## calculate the difference in proportions
diff_pr <- pr_A - pr_B

## Calculate the SE
se_diff <- sqrt((pr_A * (1 - pr_A))/n_A + (pr_B * (1 - pr_B))/n_B)

## Get z-critical value for 95% confidence
level_of_interest <- 0.95
tail_value <- (1 - level_of_interest) / 2
z_crit <- qnorm(p = tail_value, lower.tail = FALSE)

## Calculate the 95% CI
low95 <- round(diff_pr - z_crit * se_diff, 3)
high95 <- round(diff_pr + z_crit * se_diff, 3)

cat("The 95% Confidence Interval is:", low95, " to ", high95)

 

Example for two paired samples

We can organize the data in a table like this:

data.frame(
  test_1 = c("Present", "Present", "Absent", "Absent"),
  test_2 = c("Present", "Absent", "Present", "Absent"),
  number_of_subjects = c("a", "b", "c", "d")
)

Let’s say we measured a group of subjects for a specific disease twice in a study. A subject either has the disease (present) or does not (absent) in the two time points. We observe the following data:

dat <- data.frame(
  test_1 = c("Present", "Present", "Absent", "Absent"),
  test_2 = c("Present", "Absent", "Present", "Absent"),
  number_of_subjects = c(10, 25, 45, 5)
)

dat

## total sample size
N <- sum(dat$number_of_subjects)

If we want to compare the proportion with the disease present at Test 1 to the proportion with it present at Test 2, we calculate the proportions as:

p_1 = (a + b) / N
p_2 = (a + c) / N
Diff = p_1 - p_2

The SE of the difference is:

SE_{diff} = 1/N * sqrt(b + c - (b-c)^2/N)

## Obtain the info we need from the table
p1 <- (10 + 25) / N
p2 <- (10 + 45) / N
diff_prop <- p1 - p2
diff_prop

## Calculate the SE of the difference
se_diff <- 1 / N * sqrt(25 + 45 - (25-45)^2/N)

## Get z-critical value for 95% confidence
level_of_interest <- 0.95
tail_value <- (1 - level_of_interest) / 2
z_crit <- qnorm(p = tail_value, lower.tail = FALSE)

## Calculate the 95% CI
low95 <- round(diff_prop - z_crit * se_diff, 3)
high95 <- round(diff_prop + z_crit * se_diff, 3)

cat("The 95% Confidence Interval is:", low95, " to ", high95)

As always, all of this code is available on my GITHUB page.

And, if you’d like to read the full paper, you can find it here:

Gardner, MJ. Altman, DG. (1986). Confidence intervals rather than P values: Estimation rather than hypothesis testing. Brit Med J; 292:746-750.

Simulations in R Part 8 – Simulating Mixed Models

We’ve built up a number of simulations over the 7 articles in this series. The last few articles have been looking at linear regression and simulating different intricacies in the data that allow us to explore model assumptions. To end this series, we will now extend the linear model to a mixed model. We will start by building a linear regression model and go through the steps of simulation to build up the hierarchical structure of the data.

For those interested, here are the previous 7 articles in this series:

  • Part 1 discussed the basic functions for simulating and sampling data in R
  • Part 2 walked us through how to perform bootstrap resampling and then simulate bivariate and multivariate distributions
  • Part 3 worked through making group comparisons by simulating thousands of t-tests
  • Part 4 building simulations for linear regression
  • Part 5 using simulation to investigate the homoskedasticity assumption in regression
  • Part 6 using simulation to investigate the multicollinearity assumption in regression
  • Part 7 simulating measurement error in regression

As always, complete code for this article is available on my GitHub page.

Side Note: This is a very code heavy article as the goal is to show not only simulations of data and models but how to extract the model parameters from the list of simulations. As such, this a rather long article with minimal interpretation of the models.

First a linear model

Our data will look at training load for 10 players in two different positions, Forwards (F) and Mids (M). The Forwards will be simulated to have less training load than the Mids and we will add some random error around the difference in these group means. We will simulate this as a linear model where b0 represents the model intercept (the mean value for Forwards) and b1 represents the coefficient for when the group is Mids.

library(tidyverse)
library(broom)
library(lme4)

theme_set(theme_light())
## Model Parameters
n_positions <- 2
n_players <- 10
b0 <- 80
b1 <- 20
sigma <- 10

## Simulate
set.seed(300)
pos <- rep(c("M", "F"), each = n_players)
player_id <- rep(1:10, times = 2)
model_error <- rnorm(n = n_positions*n_players, mean = 0, sd = sigma)
training_load <- b0 + b1 * (pos == "M") + model_error

d <- data.frame(pos, player_id, training_load)
d

Calculate some summary statistics

d %>%
  ggplot(aes(x = pos, y = training_load)) +
  geom_boxplot()

d %>%
  group_by(pos) %>%
  summarize(N = n(),
            avg = mean(training_load),
            sd = sd(training_load)) %>%
  mutate(se = sd / sqrt(N)) %>%
  knitr::kable()

Build a linear model

fit <- lm(training_load ~ pos, data = d)
summary(fit)

Now, we wrap all the steps in a function so that we can use replicate() to create thousands of simulations or change the parameters of our simulation. The end result of the function is the summary of the linear regression model.

sim_func <- function(n_positions = 2, n_players = 10, b0 = 80, b1 = 20, sigma = 10){

  ## simulate data
  pos <- rep(c("M", "F"), each = n_players)
  player_id <- rep(1:10, times = 2)
  model_error <- rnorm(n = n_positions*n_players, mean = 0, sd = sigma)
  training_load <- b0 + b1 * (pos == "M") + model_error

  ## store in data frame
  d <- data.frame(pos, player_id, training_load)
  
  ## construct linear model
  fit <- lm(training_load ~ pos, data = d)
  summary(fit)
}

Try the function out with the default parameters

sim_func()

Use replicate() to create many simulations.

Doing this simulation once doesn’t help us. We want to be able to do this thousands of times. All of the articles in this series have used for() loops up until this point. But, if you recall the first article in the series, where I laid out several helpful functions for coding simulations, I showed an example of the replicate() function, which will take a function and run it as many times as you specify. I found this function while I was working through the simulation chapter of Gelman et al’s book, Regression and Other Stories. I think in cases like a mixed model simulation, where you can have many layers and complexities to the data, writing a simple function and then replicating it thousands of times is much easier to debug and much cleaner for others to read than a bunch of nested for() loops.

Technical Note: We specify the argument simplify = FALSE so that the results are returned in a list format. This makes more sense since the results are the regression summary results and not a data frame.

team_training <- replicate(n = 1000,
                  sim_func(),
                  simplify = FALSE)

Coefficient Results

team_training %>%
  map_df(tidy) %>%
  select(term, estimate) %>%
  ggplot(aes(x = estimate, fill = term)) +
  geom_density() +
  facet_wrap(~term, scales = "free_x")

team_training %>%
  map_df(tidy) %>%
  select(term, estimate) %>%
  group_by(term) %>%
  summarize(avg = mean(estimate),
            SD = sd(estimate))

Compare the simulated results to the results of the original model fit.

tidy(fit)

Model fit parameters

team_training %>%
  map_df(glance) %>%
  select(adj.r.squared, sigma) %>%
  summarize(across(.cols = everything(),
                   ~mean(.x)))


Compare these results to the original fit

fit %>% 
  glance()

Mixed Model 1

Now that we have the general framework for building a simulation function and using replicate(), we will build a mixed model simulation.

Above we had a team with two position groups and individual players nested within those position groups. In this mixed model, we will add additional teams so that we can explore hierarchical data.

We will simulate data from 3 teams, each with 2 positions (Forward & Mid). This is a pretty simple mixed model. We will build a more complex one after we get a handle on the code below.

## Model Parameters
n_teams <- 3
n_positions <- 2
n_players <- 10

team1_fwd_avg <- 130
team1_fwd_sd <- 15
team1_mid_avg <- 100
team1_mid_sd <- 5

team2_fwd_avg <- 150
team2_fwd_sd <- 20
team2_mid_avg <- 90
team2_mid_sd <- 10

team3_fwd_avg <- 180
team3_fwd_sd <- 15
team3_mid_avg <- 150
team3_mid_sd <- 15


## Simulated data frame
team <- rep(c("Team1","Team2", "Team3"), each = n_players * n_positions)
pos <- c(rep(c("M", "F"), each = n_players), rep(c("M", "F"), each = n_players), rep(c("M", "F"), each = n_players))
player_id <- as.factor(round(seq(from = 100, to = 300, length = length(team)), 0))

d <- data.frame(team, pos, player_id)

d %>%
  head()

# simulate training loads
set.seed(555)
training_load <- c(rnorm(n = n_players, mean = team1_mid_avg, sd = team1_mid_sd),
                   rnorm(n = n_players, mean = team1_fwd_avg, sd = team1_fwd_sd),
                   rnorm(n = n_players, mean = team2_mid_avg, sd = team2_mid_sd),
                   rnorm(n = n_players, mean = team2_fwd_avg, sd = team2_fwd_sd),
                   rnorm(n = n_players, mean = team3_mid_avg, sd = team3_mid_sd),
                   rnorm(n = n_players, mean = team3_fwd_avg, sd = team3_fwd_sd))

d <- d %>%
  bind_cols(training_load = training_load)

d %>%
  head()


Calculate summary statistics

## Average training load by team
d %>%
  group_by(team) %>%
  summarize(avg = mean(training_load),
            SD = sd(training_load))

## Average training load by pos
d %>%
  group_by(pos) %>%
  summarize(avg = mean(training_load),
            SD = sd(training_load))

## Average training load by team & position
d %>%
  group_by(team, pos) %>%
  summarize(avg = mean(training_load),
            SD = sd(training_load)) %>%
  arrange(pos)

## Plot
d %>%
  ggplot(aes(x = training_load, fill = team)) +
  geom_density(alpha = 0.5) +
  facet_wrap(~pos)

Construct the mixed model and evaluate the outputs

## Mixed Model
fit_lmer <- lmer(training_load ~ pos + (1 |team), data = d)
summary(fit_lmer)
coef(fit_lmer)
fixef(fit_lmer)
ranef(fit_lmer)
sigma(fit_lmer)
hist(residuals(fit_lmer))

Write a mixed model function

sim_func_lmer <- function(n_teams = 3, 
                          n_positions = 2, 
                          n_players = 10, 
                          team1_fwd_avg = 130,
                          team1_fwd_sd = 15,
                          team1_mid_avg = 100,
                          team1_mid_sd = 5,
                          team2_fwd_avg = 150,
                          team2_fwd_sd = 20,
                          team2_mid_avg = 90,
                          team2_mid_sd = 10,
                          team3_fwd_avg = 180,
                          team3_fwd_sd = 15,
                          team3_mid_avg = 150,
                          team3_mid_sd = 15){

        ## Simulated data frame
        team <- rep(c("Team1","Team2", "Team3"), each = n_players * n_positions)
        pos <- c(rep(c("M", "F"), each = n_players), rep(c("M", "F"), each = n_players), rep(c("M", "F"), each = n_players))
        player_id <- as.factor(round(seq(from = 100, to = 300, length = length(team)), 0))
        
        d <- data.frame(team, pos, player_id)

        # simulate training loads
        training_load <- c(rnorm(n = n_players, mean = team1_mid_avg, sd = team1_mid_sd),
                   rnorm(n = n_players, mean = team1_fwd_avg, sd = team1_fwd_sd),
                   rnorm(n = n_players, mean = team2_mid_avg, sd = team2_mid_sd),
                   rnorm(n = n_players, mean = team2_fwd_avg, sd = team2_fwd_sd),
                   rnorm(n = n_players, mean = team3_mid_avg, sd = team3_mid_sd),
                   rnorm(n = n_players, mean = team3_fwd_avg, sd = team3_fwd_sd))
  
        ## construct the mixed model
  fit_lmer <- lmer(training_load ~ pos + (1 |team), data = d)
  summary(fit_lmer)
}

Try out the function

sim_func_lmer()

Now use replicate() and create 1000 simulations of the model and look at the first model in the list

team_training_lmer <- replicate(n = 1000,
                  sim_func_lmer(),
                  simplify = FALSE)

## look at the first model in the list
team_training_lmer[[1]]$coefficients

Store the coefficient results of the 1000 simulations in a data frame, create plots of the model coefficients, and compare the results of the simulation to the original mixed model.


lmer_coef <- matrix(NA, ncol = 5, nrow = length(team_training_lmer))
colnames(lmer_coef) <- c("intercept", "intercept_se", "posM", "posM_se", 'model_sigma')

for(i in 1:length(team_training_lmer)){
  
  lmer_coef[i, 1] <- team_training_lmer[[i]]$coefficients[1]
  lmer_coef[i, 2] <- team_training_lmer[[i]]$coefficients[3]
  lmer_coef[i, 3] <- team_training_lmer[[i]]$coefficients[2]
  lmer_coef[i, 4] <- team_training_lmer[[i]]$coefficients[4]
  lmer_coef[i, 5] <- team_training_lmer[[i]]$sigma
  
}

lmer_coef <- as.data.frame(lmer_coef)
head(lmer_coef)

## Plot the coefficient for position
lmer_coef %>%
  ggplot(aes(x = posM)) +
  geom_density(fill = "palegreen")

## Summarize the coefficients and their standard errors for the simulations
lmer_coef %>% 
  summarize(across(.cols = everything(),
                   ~mean(.x)))

## compare to the original model
broom.mixed::tidy(fit_lmer)
sigma(fit_lmer)

Extract the random effects for the intercept and the residual for each of the simulated models

ranef_sim <- matrix(NA, ncol = 2, nrow = length(team_training_lmer))
colnames(ranef_sim) <- c("intercept_sd", "residual_sd")

for(i in 1:length(team_training_lmer)){
  
  ranef_sim[i, 1] <- team_training_lmer[[i]]$varcor %>% as.data.frame() %>% select(sdcor) %>% slice(1) %>% pull(sdcor)
  ranef_sim[i, 2] <- team_training_lmer[[i]]$varcor %>% as.data.frame() %>% select(sdcor) %>% slice(2) %>% pull(sdcor)
  
}

ranef_sim <- as.data.frame(ranef_sim)
head(ranef_sim)

## Summarize the results
ranef_sim %>%
  summarize(across(.cols = everything(),
                   ~mean(.x)))

## Compare with the original model
VarCorr(fit_lmer)

Mixed Model 2

Above was a pretty simple model, just to get our feet wet. Let’s create a more complicated model. Usually in sport and exercise science we have repeated measures of individuals. Often, researchers will set the individual players as random effects with the fixed effects being the component that the researcher is attempting to make an inference about.

In this example, we will set up a team of 12 players with three positions (4 players per position): Forward, Mid, Defender. The aim is to explore the training load differences between position groups while accounting for repeated observations of individuals (in this case, each player will have 20 training sessions). Similar to our first regression model, we will build a data frame of everything we need and then calculate the outcome variable (training load) with a regression model using parameters that we specify. To make this work, we will need to specify an intercept and slope for the position group and an intercept and slope for the individual players, as well as a model sigma value. Once we’ve done that, we will fit a mixed model, write a function, and then create 1000 simulations.

## Set up the data frame
n_pos <- 3
n_players <- 12
n_obs <- 20
players <- as.factor(round(seq(from = 100, to = 300, length = n_players), 0))


dat <- data.frame(
  player_id = rep(players, each = n_obs),
  pos = rep(c("Fwd", "Mid", "Def"), each = n_players/n_pos * n_obs),
  training_day = rep(1:n_obs, times = n_players)
)

dat %>%
  head()

## Create model parameters
# NOTE: Defender will be the intercept
set.seed(6687)
pos_intercept <- 150
posF_coef <- 170
posM_coef <- -70
individual_intercept <- 50
individual_slope <- 10
sigma <- 10
model_error <- rnorm(n = nrow(dat), mean = 0, sd = sigma)


## we will also create some individual player variance
individual_player_variance <- c()

for(i in players){

  individual_player_variance[i] <- rnorm(n = 1, 
                  mean = runif(min = 2, max = 10, n = 1), 
                  sd = runif(min = 2, max = 5, n = 1))
  
}

individual_player_variance <- rep(individual_player_variance, each = n_obs)

dat$training_load <- pos_intercept + posF_coef * (dat$pos == "Fwd") + posM_coef * (dat$pos == "Mid") + individual_intercept + individual_slope * individual_player_variance + model_error

dat %>%
  head()


Calculate summary stats

## Average training load by pos
dat %>%
  group_by(pos) %>%
  summarize(avg = mean(training_load),
            SD = sd(training_load))

## Plot
dat %>%
  ggplot(aes(x = training_load, fill = pos)) +
  geom_density(alpha = 0.5)

Construct the mixed model and evaluate the output

## Mixed Model
fit_lmer_pos <- lmer(training_load ~ pos + (1 | player_id), data = dat)
summary(fit_lmer_pos)
coef(fit_lmer_pos)
fixef(fit_lmer_pos)
ranef(fit_lmer_pos)
sigma(fit_lmer_pos)
hist(residuals(fit_lmer_pos))


Create a function for the simulation

sim_func_lmer2 <- function(n_pos = 3,
                          n_players = 12,
                          n_obs = 20,
                          pos_intercept = 150,
                          posF_coef = 170,
                          posM_coef = -70,
                          individual_intercept = 50,
                          individual_slope = 10,
                          sigma = 10){
  
  players <- as.factor(round(seq(from = 100, to = 300, length = n_players), 0))

  dat <- data.frame(
  player_id = rep(players, each = n_obs),
  pos = rep(c("Fwd", "Mid", "Def"), each = n_players/n_pos * n_obs),
  training_day = rep(1:n_obs, times = n_players)
  )
  
  model_error <- rnorm(n = nrow(dat), mean = 0, sd = sigma)
  
  individual_player_variance <- c()

  for(i in players){

    individual_player_variance[i] <- rnorm(n = 1, 
                  mean = runif(min = 2, max = 10, n = 1), 
                  sd = runif(min = 2, max = 5, n = 1))
    }

  individual_player_variance <- rep(individual_player_variance, each = n_obs)

  dat$training_load <- pos_intercept + posF_coef * (dat$pos == "Fwd") + posM_coef * (dat$pos == "Mid") + individual_intercept + individual_slope * individual_player_variance + model_error

  fit_lmer_pos <- lmer(training_load ~ pos + (1 | player_id), data = dat)
  summary(fit_lmer_pos)
}

Now use `replicate()` and create 1000 simulations of the model and look at the first model in the list.

player_training_lmer <- replicate(n = 1000,
                  sim_func_lmer2(),
                  simplify = FALSE)

## look at the first model in the list
player_training_lmer[[1]]$coefficients

Store the coefficient results from the simulations, summarize them, and compare them to the original mixed model.

lmer_player_coef <- matrix(NA, ncol = 7, nrow = length(player_training_lmer))
colnames(lmer_player_coef) <- c("intercept", "intercept_se","posFwd", "posFwd_se", "posMid", "posMid_se", 'model_sigma')

for(i in 1:length(player_training_lmer)){
  
  lmer_player_coef[i, 1] <- player_training_lmer[[i]]$coefficients[1]
  lmer_player_coef[i, 2] <- player_training_lmer[[i]]$coefficients[4]
  lmer_player_coef[i, 3] <- player_training_lmer[[i]]$coefficients[2]
  lmer_player_coef[i, 4] <- player_training_lmer[[i]]$coefficients[5]
  lmer_player_coef[i, 5] <- player_training_lmer[[i]]$coefficients[3]
  lmer_player_coef[i, 6] <- player_training_lmer[[i]]$coefficients[6]
  lmer_player_coef[i, 7] <- player_training_lmer[[i]]$sigma
  
}

lmer_player_coef <- as.data.frame(lmer_player_coef)
head(lmer_player_coef)

## Plot the coefficients for position
lmer_player_coef %>%
  ggplot(aes(x = posFwd)) +
  geom_density(fill = "palegreen")

lmer_player_coef %>%
  ggplot(aes(x = posMid)) +
  geom_density(fill = "palegreen")


## Summarize the coefficients and their standard errors for the simulations
lmer_player_coef %>% 
  summarize(across(.cols = everything(),
                   ~mean(.x)))

## compare to the original model
broom.mixed::tidy(fit_lmer_pos)
sigma(fit_lmer_pos)


Extract the random effects for the intercept and the residual for each of the simulated models.

ranef_sim_player <- matrix(NA, ncol = 2, nrow = length(player_training_lmer))
colnames(ranef_sim_player) <- c("player_sd", "residual_sd")

for(i in 1:length(player_training_lmer)){
  
  ranef_sim_player[i, 1] <- player_training_lmer[[i]]$varcor %>% as.data.frame() %>% select(sdcor) %>% slice(1) %>% pull(sdcor)
  ranef_sim_player[i, 2] <- player_training_lmer[[i]]$varcor %>% as.data.frame() %>% select(sdcor) %>% slice(2) %>% pull(sdcor)
  
}

ranef_sim_player <- as.data.frame(ranef_sim_player)
head(ranef_sim_player)

## Summarize the results
ranef_sim_player %>%
  summarize(across(.cols = everything(),
                   ~mean(.x)))

## Compare with the original model
VarCorr(fit_lmer_pos)

Wrapping Up

Mixed models can get really complicated and have a lot of layers to them. For example, we could make this a multivariable model with independent variables that have some level of correlation with each other. We could also add some level of autocorrelation to each player’s observations. There are also a number of different ways to construct these types of simulations; the two approaches used here only scratch the surface. Perhaps in future articles I’ll put together code for more complex mixed models.

All of the code for this article and the other 7 articles in this series are available on my GitHub page.

Simulations in R Part 7: Measurement Error in Regression

We’ve been working with building simulations in the past 6 articles of this series. In the last two installments we talked specifically about using simulation to explore different linear regression model assumptions. Today, we continue with linear regression and we use simulation to understand how measurement error (something we all face) influences our linear regression parameter estimates.

Here were the past 6 sections in this series:

  • Part 1 discussed the basic functions for simulating and sampling data in R
  • Part 2 walked us through how to perform bootstrap resampling and then simulate bivariate and multivariate distributions
  • Part 3 worked through making group comparisons by simulating thousands of t-tests
  • Part 4 building simulations for linear regression
  • Part 5 using simulation to investigate the homoskedasticity assumption in regression
  • Part 6 using simulation to investigate the multicollinearity assumption in regression

As always, complete code for this article is available on my GitHub page.

Measurement Error

Measurement error occurs when the values we observe during data collection differ from the true values. This can sometimes be the result of using imperfect proxy measures, where certain measurements (perhaps tests that are easier to obtain in our population or setting) stand in for the thing we actually care about. Or, it can happen because tests are imperfect and all data collection has limitations and error (which we try as hard as possible to minimize).

Measurement error can be systematic or random. It can be challenging to detect in a single sample. We will build a simulation to show how measurement error can bias our regression coefficients and perhaps hide the true relationship.

Constructing the simulation

For this simulation, we are going to add a random draw from a normal distribution to the independent variable to represent noise in the model, which will behave like measurement error. We will use a nested for() loop, as we did in Part 6 of this series, where the outer loop stores the outcomes of the regression model under each level of measurement error, which is built in the inner loop.

We begin by creating 11 levels of measurement error, ranging from 0 (no measurement error) to 1 (extreme measurement error). These values are going to serve as the standard deviation when we randomly draw from a normal distribution with a mean of 0. In this way, we are creating noise in the model.

## levels of error measurement to test
# NOTE: these values will be used as the standard deviation in our random draws
meas_error <- seq(from = 0, to = 1, by = 0.1)

# create the number of simulations and reps you want to run
n <- 1000

# true variables
intercept <- 2
beta1 <- 5
independent_var <- runif(n = n, min = -1, max = 1)

## create a final array store each of the model error levels in their own list
final_df <- array(data = NA, dim = c(n, 2, length(meas_error)))

## create a data frame to store the absolute bias at each level of measurement error
error_bias_df <- matrix(nrow = n, ncol = length(meas_error))

Next, we build our nested for() loop and simulate models under the different measurement error conditions.

## loop
for(j in 1:length(meas_error)){
  
  # a store vector for the absolute bias from each inner loop
  abs_bias_vect <- rep(0, times = n)
  
  ## storage data frame for the beta coefficient results in each inner loop simulated regression
  df_betas <- matrix(NA, nrow = n, ncol = 2)
  
  # simulate independent variable 1000x with measurement error
  ind_var_with_error <- independent_var + rnorm(n = n, 0, meas_error[j])
  
  for(i in 1:n){
    
    y_hat <- intercept + beta1*independent_var + rnorm(n = n, 0, 1)
    fit <- lm(y_hat ~ ind_var_with_error)
    
    df_betas[i, 1] <- fit$coef[1]
    df_betas[i, 2] <- fit$coef[2]
    
    abs_bias_vect[i] <- abs(fit$coef[2] - beta1)
  }
  
  ## store final results of each inner loop
  # store the model betas in a list for each level of measurement error
  final_df[, , j] <- df_betas
  
  # store the absolute bias
  error_bias_df[, j] <- abs_bias_vect

}

Have a look at the first few rows of each results data frame.

 

## check the first few values of each new element
head(error_bias_df)

## the final_df's are stored as a list of arrays for each level of measurement error
# Here is the 11th array (measurement error == 1.0)
head(final_df[, ,11])

Plotting the results

Now that we have stored the data for each level of measurement error, let’s do some plotting of the data to explore how measurement error influences our regression model.

Plot the standard deviation of the beta coefficient for the model with no measurement error and the model with the most extreme measurement error.

no_error <- final_df[, ,1] %>% as.data.frame() %>% mutate(measurement_error = "No Measurement Error")

extreme_error <- final_df[, ,11] %>% as.data.frame() %>% mutate(measurement_error = "Extreme Measurement Error")

no_error %>%
  bind_rows(
    extreme_error
  ) %>%
  ggplot(aes(x = V2, fill = measurement_error)) +
  geom_density(alpha = 0.8) +
  facet_wrap(~measurement_error, scales = "free") +
  theme(strip.text = element_text(size = 14, face = "bold"),
        legend.position = "top") +
  labs(x = "Simulated Model Beta Coefficient",
       fill = "Measurement Error",
       title = "Beta Coefficients Differences Due to Measurement Error")

Notice that with no measurement error the beta coefficient is centered around 5, just as we specified the true value in our simulation. However, the model with a measurement error of 1 (shown in green) has the beta coefficient centered around 1.2, which is substantially lower than the true beta coefficient value of 5.
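This shrinkage lines up with the classical attenuation (regression dilution) factor, Var(X) / (Var(X) + Var(error)). A quick back-of-the-envelope check in R (my addition, using the true values from the simulation set up above):

## Attenuation factor for a Uniform(-1, 1) predictor with error SD = 1
var_x <- (1 - (-1))^2 / 12              # variance of Uniform(-1, 1) = 1/3
attenuation <- var_x / (var_x + 1^2)    # = 0.25
beta1 * attenuation                     # ~1.25, close to the simulated ~1.2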

Plot the change in absolute bias across each level of measurement error

First let’s look at the average bias across each level of measurement error.

absolute_error <- error_bias_df %>%
  as.data.frame() %>%
  setNames(paste0('x', meas_error))

# On average, how much absolute bias would we expect
colMeans(absolute_error) %>%
  data.frame() %>%
  rownames_to_column() %>%
  mutate('rowname' = parse_number(rowname)) %>%
  setNames(c("Level of measurement error", 'Absolute bias')) %>%
  gt::gt()

Next, make a plot of the relationship.

absolute_error %>%
  pivot_longer(cols = everything()) %>%
  ggplot(aes(x = name, y = value, group = 1)) +
  geom_point(shape = 21,
             size = 4,
             color = "black",
             fill = "light grey") +
  stat_summary(fun = mean,
               geom = "line",
               color = "red",
               size = 1.2,
               linetype = "dashed") +
  labs(x = "Amount of Measurement Error",
       y = "Absolute Bias",
       title = "Absolute Bias of the True Parameter Value")

Notice that as the amount of measurement error increases, so too does the absolute bias of the model coefficient.

 

Wrapping Up

Measurement error is something that all of us deal with in practice, whether you are conducting science in a lab or working in an applied setting. It is important to keep in mind how measurement error influences regression coefficients and the tricks it can play on our ability to uncover true parameter values. Expressing our uncertainty around model outputs is critical to communicating what we think we know about our observations and how (un)certain we may be. This, in my opinion, is one of the values of Bayesian model building, as we can sample directly from posterior distributions, building up distributions that allow us to explore uncertainty and make probabilistic statements.

The complete code for this article is available on my GitHub page.

Simulations in R Part 6: Multicollinearity Assumption in Regression

For the next installment of our simulation blog series we will use simulation to look at the Multicollinearity assumption in linear regression.

For a refresher, here are the previous 5 articles in this series:

  • Part 1 discussed the basic functions for simulating and sampling data in R.
  • Part 2 walked us through how to perform bootstrap resampling and then simulate bivariate and multivariate distributions.
  • Part 3 worked through making group comparisons by simulating thousands of t-tests
  • Part 4 building simulations for linear regression
  • Part 5 using simulation to investigate the homoskedasticity assumption in regression.

The entire code for this series is accessible on my GITHUB page.

Multicollinearity

Multicollinearity occurs when two independent variables in a regression model are highly correlated with each other. Such a situation can produce problems with interpretation of the beta coefficients of the model, may increase standard errors in the model, and can lead to overfitting of the data. We can simulate this issue in order to get a better understanding of how multicollinearity can influence a regression model.

Constructing the simulation

We will use the {mvtnorm} package to help us construct a simulation where the two independent variables share a certain level of correlation with each other.

First, we will create the true model parameters: an intercept of 2, a beta1 of 5, and a beta2 of 10. We also create a vector of correlation coefficients from 0 to 0.99 and a few data frames to store the results of our model. We will also specify that at each correlation coefficient we want 200 random draws from the multivariate normal distribution.

## load packages
library(tidymodels)
library(patchwork)
library(mvtnorm)

set.seed(999)

# create the true model parameters
intercept <- 2
beta1 <- 5
beta2 <- 10

## Number of draws from a multivariate normal distribution
n <- 200

## Create a data frame to store model results
sim_params <- data.frame(intercept = NA,
                      intercept_se = NA,
                      beta1 = NA,
                      beta1_se = NA,
                      beta2 = NA,
                      beta2_se = NA,
                      model_rse = NA)

## create levels of multicollinearity between the two independent variables
cor_coefs <- c(seq(from = 0, to = 0.9, by = 0.1), 0.99)

# data frame to store the average beta coefficient and their standard deviations form the simulation
mean_betas <- data.frame(beta1 = NA,
                       sd_beta1 = NA,
                       beta2 = NA,
                       sd_beta2 = NA)

Next, we will create a nested for() loop to construct our simulations.

  •  The outer part of the loop begins by creating the multivariate normal distribution. We use rmvnorm(), so we first specify a correlation matrix using the vector of correlation coefficients we created above. Once we have the two correlated variables we can pass them to the inner loop.
  • The inner loop is where we create a regression equation for the given correlation between the two variables. We create 100 regression simulations for each correlation coefficient.
  • Once the inner loop is finished running we store the results at the bottom of the outer loop. Instead of storing all of the results, we take the average of the 100 beta coefficients and their respective standard errors for each correlation coefficient.
## loop
for(j in 1:length(cor_coefs)){
  
  ## Create a correlation matrix between beta1 and beta2
  beta_corr <- matrix(c(1, cor_coefs[j], cor_coefs[j], 1), nrow = 2, ncol = 2)
  
  ## create a multivariate normal distribution 
  cor_df <- rmvnorm(n = n, mean = c(0, 0), sigma = beta_corr)
  X1 <- cor_df[, 1]
  X2 <- cor_df[, 2]
  
  ## simulate 100 regression simulations
  for(i in 1:100){
    
    # set up the model
    y_hat <- intercept + beta1*X1 + beta2*X2 + rnorm(n = n, mean = 0, sd = 1)
    
    # construct a regression equation
    model <- lm(y_hat ~ X1 + X2)
    
    # store the variance-covariance matrix
    vcv <- vcov(model)
    
    # estimates for the intercept
    sim_params[i, 1] <- model$coef[1]
  
    # estimates for the beta1
    sim_params[i, 3] <- model$coef[2]
    
    # estimates for beta2
    sim_params[i, 5] <- model$coef[3]
  
    # SE for the intercept
    sim_params[i, 2] <- sqrt(diag(vcv)[1])
    
    # SE for beta1
    sim_params[i, 4] <- sqrt(diag(vcv)[2])
    
    # SE for beta2
    sim_params[i, 6] <- sqrt(diag(vcv)[3])
  
    # model RSE
    sim_params[i, 7] <- summary(model)$sigma
  }

  mean_betas[j, ] <- c(mean(sim_params[, 3]), mean(sim_params[, 4]), mean(sim_params[, 5]), mean(sim_params[, 6]))

}

The final result is a data frame with 11 rows, one for each correlation coefficient, containing the average of 100 regression simulations for the beta coefficients and their standard errors.

# Add the correlation coefficients to the final data frame
mean_betas$cor_coef <- cor_coefs

mean_betas %>%
  knitr::kable()

Finally, we plot each of the results with the correlation coefficients on the x-axis.

# Plot the results
par(mfrow = c(2,2))
plot(x = mean_betas$cor_coef,
     y = mean_betas$beta1,
     main = "Beta1",
     lwd = 3,
     type = "b",
     ylim = c(2, 7),
     xlab = "Correlation betwen X1 and X2",
     ylab = "Beta1")
plot(x = mean_betas$cor_coef,
     y = mean_betas$sd_beta1,
     main = "Beta1 Standard Error",
     lwd = 3,
     type = "b",
     xlab = "Correlation betwen X1 and X2",
     ylab = "SE of B1")
plot(x = mean_betas$cor_coef,
     y = mean_betas$beta2,
     main = "Beta2",
     lwd = 3,
     type = "b",
     ylim = c(7, 13),
     xlab = "Correlation betwen X1 and X2",
     ylab = "Beta 2")
plot(x = mean_betas$cor_coef,
     y = mean_betas$sd_beta2,
     main = "Beta2 Standard Error",
     lwd = 3,
     type = "b",
     xlab = "Correlation betwen X1 and X2",
     ylab = "SE of B2")

The beta coefficients themselves remain relatively unchanged in our simulation across the various correlation levels. However, once the correlation between the two independent variables reaches about 0.7 the standard errors around the beta coefficients begin to increase rapidly, increasing our uncertainty about the true parameter values.
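That growth in the standard errors matches the classical variance inflation factor for a two-predictor model, 1 / (1 - r^2), where r is the correlation between the predictors. A quick sketch of that relationship (my addition, not part of the original simulation code):

## Variance inflation factor (VIF) and the implied multiplier on the SE
data.frame(r = cor_coefs,
           vif = 1 / (1 - cor_coefs^2),
           se_multiplier = sqrt(1 / (1 - cor_coefs^2)))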

Wrapping Up

We’ve now covered using simulation to investigate two assumptions of linear regression. Our next installment will investigate another linear regression assumption before we proceed on to simulating other types of models.

For the full code, check out my GITHUB page.