Simulations in R Part 2: Bootstrapping & Simulating Bivariate and Multivariate Distributions

In Part 1 of this series we covered some of the basic functions that we will need to perform resampling and simulations in R.

In Part 2 we will now move to building bootstrap resamples and simulating distributions for both bivariate and multivariate relationships. Some stuff we will cover:

  • Coding bootstrap resampling by hand for both a single variable (mean) and regression coefficients.
  • Using the boot() function from the boot package to perform bootstrapping for both a single variable and regression coefficients.
  • Creating a simulation where two variables are in some way dependent on each other (correlate).
  • Creating a simulation where multiple variables are correlated with each other.
  • Finally, to understand what is going on with these simulated distributions we will also work through code that shows us the relationship between variables using both covariance and correlation matrices.

As always, all code is freely available in the Github repository.

Resampling

The resampling approach we will use here is bootstrapping. The general concept of bootstrapping is as follows:

  • Draw multiple random samples from observed data with replacement.
  • Draws must be independent and each observation must have an equal chance of being selected.
  • The bootstrap sample should be the same size as the observed data in order to use as much information from the sample as possible.
  • Calculate the mean resampled data and store it.
  • Repeat this process thousands of times and summarize the mean of resampled means and the standard deviation of resampled means to obtain summary statistics of your bootstrapped resamples.

Write the bootstrap resampling by hand

The code below goes through the process of creating some fake data and then writing a for() loop that produces 1000 bootstrap resamples. The for() loop was introduced in Part 1 of this series. In a nutshell, we are taking a random sample from the fake data (with replacement), calculating the mean of that random sample, and storing it in the element boot_dat. From there, we calculate the summary statistics or the original sample and the bootstrap resample, which we store in a data frame for comparison purposes, and then produce a histogram of the original sample and the bootstrap resample, which are visualized below the code.

library(tidyverse)

## create fake data
dat <- c(5, 10, 44, 3, 29, 10, 16.7, 22.3, 28, 1.4, 25)

### Bootstrap Resamples ###
# we want 1000 bootstrap resamples
n_boots <- 1000

## create an empty vector to store our bootstraps
boot_dat <- rep(NA, n_boots)

# set seed for reproducibility
set.seed(786)

# write for() loop for the resampling
for(i in 1:n_boots){
  # random sample of 1:n number of observations in our data, with replacement
  ind <- sample(1:length(dat), replace = TRUE)
  
  # Use the row indexes to select the given values from the vector and calculate the mean
  boot_dat[i] <- mean(dat[ind])
}

# Look at the first 6 bootstrapped means
head(boot_dat)

### Compare Bootstrap data to original data ###
## mean and standard deviation of the fake data
dat_mean <- mean(dat)
dat_sd <- sd(dat)

# standard error of the mean
dat_se <- sd(dat) / sqrt(length(dat))

# 95% confidence interval
dat_ci95 <- paste0(round(dat_mean - 1.96*dat_se, 1), ", ", round(dat_mean + 1.96*dat_se, 1))

# mean an SD of bootstrapped data
boot_mean <- mean(boot_dat)

# the vector is the mean of each bootstrap sample, so the standard deviation of these means represents the standard error
boot_se <- sd(boot_dat)

# to get the standard deviation we can convert the standard error back
boot_sd <- boot_se * sqrt(length(dat))

# 95% quantile interval
boot_ci95 <- paste0(round(boot_mean - 1.96*boot_se, 1), ", ", round(boot_mean + 1.96*boot_se, 1)) ## Put everything together data.frame(data = c("fake sample", "bootstrapped resamples"), N = c(length(dat), length(boot_dat)), mean = c(dat_mean, boot_mean), sd = c(dat_sd, boot_sd), se = c(dat_se, boot_se), ci95 = c(dat_ci95, boot_ci95)) %&gt;%
  knitr::kable()

# plot the distributions
par(mfrow = c(1, 2))
hist(dat,
     xlab = "Obsevations",
     main = "Fake Data")
abline(v = dat_mean,
       col = "red",
       lwd = 3,
       lty = 2)
hist(boot_dat,
     xlab = "bootstrapped means",
     main = "1000 bootstrap resamples")
abline(v = boot_mean,
       col = "red",
       lwd = 3,
       lty = 2)

R offers a bootstrap function from the boot package that allows you to do the same thing without writing out your own for() loop. Below is an example of coding the same procedure in the boot package and the outputs the function provides, which are similar to the output we get above, save for slight differences due to random sampling.

# write a function to calculate the mean of our sample data
sample_mean <- function(x, d){
     return(mean(x[d]))
}

# run the boot() function
library(boot)

# run the boot function
boot_func_output <- boot(dat, statistic = sample_mean, R = 1000)

# produce a plot of the output
plot(boot_func_output)

# get the mean and standard error
boot_func_output

# get 95% CI around the mean
boot.ci(boot_func_output, type = "basic", conf = 0.95)

We can bootstrap pretty much anything we want. We don’t have to limit ourselves to producing the distribution around the mean of a population. For example, let’s bootstrap regression coefficients to understand the uncertainty in them.

First, let’s use the boot() function to conduct our analysis. We will fit a simple linear regression predicting miles per gallon from engine weight using the mtcars package. We will then write a function for that uses a random sample of rows to create the same linear regression and store those coefficients from 1000 linear regressions so that we can plot a histogram representing the slope coefficient from the resampled models and also summarize the distribution with confidence intervals.

# load the mtcars data
d <- mtcars d %>%
  head()

# fit a regression model
fit_mpg <- lm(mpg ~ wt, data = d)
summary(fit_mpg)
coef(fit_mpg)
confint(fit_mpg)

# Write a function that can perform a bootstrap over the intercept and slope of the model
# bootstrap function
reg_coef_boot <- function(data, row_id){
  # we want to resample the rows
  fit <- lm(mpg ~ wt, data = d[row_id, ])
  coef(fit)
}

# run this once on a small subset of the row ids to see how it works
reg_coef_boot(data = d, row_id = 1:20)

# run the boot() function 1000 times
coef_boot <- boot(data = d, reg_coef_boot, 1000) 

# check the output (coefficient and SE) 
coef_boot 

# get the confidence intervals 
boot.ci(coef_boot, index= 2) 

# all 1000 of the bootstrap resamples can be called coef_boot$t %&gt;%
  head()

# plot the first 20 bootstrapped intercepts and slopes over the original data
plot(x = d$wt,
     y = d$mpg,
     pch = 19)
for(i in 1:20){
  abline(a = coef_boot$t[i, 1],
       b = coef_boot$t[i, 2],
       lty = 2,
       lwd = 3,
       col = "light grey")
}

## histogram of the slope coefficient
hist(coef_boot$t[, 2])

We can do this by hand if we don’t want to use the built in boot() function. (SIDE NOTE: I usually prefer to code my own resampling and simulations as it gives me more flexibility with respect to the things I’d like to add or the values I’d like to store from each iteration.

Below, instead of producing a histogram of just the slope coefficient, I use both the resampled intercept and slope and add 20 (of the 1000) lines to a scatter plot to show the way in which each of these lines represents a plausible regression line for the data. As you can see, the regression line confidence interval is starting to take shape, even with just 20 out of 1000 resamples, and this gives us a good understanding of not only the variability in our possible regression line fit for the underlying data but also, perhaps should make us less overconfident in our research findings knowing that there are many possible outcomes from the sample data we have obtained.

## 1000 resamples
n_samples <- 1000

## N observations
n_obs <- nrow(mtcars)

## empty storage data frame for the coefficients
coef_storage <- data.frame(
  intercept = rep(NA, n_samples),
  slope = rep(NA, n_samples)
)

for(i in 1:n_samples){
  
  ## sample dependent and independent variables
  row_ids <- sample(1:n_obs, size = n_obs, replace = TRUE)
  new_df <- d[row_ids, ]
  
  ## construct model
  model <- lm(mpg ~ wt, data = new_df)
  
  ## store coefficients
  # intercept
  coef_storage[i, 1] <- coef(model)[1]
  
  # slope
  coef_storage[i, 2] <- coef(model)[2]
  
}

## see results
head(coef_storage)
tail(coef_storage)

## Compare the results to those of the boot function
apply(X = coef_boot$t, MARGIN = 2, FUN = mean)
apply(X = coef_storage, MARGIN = 2, FUN = mean)

apply(X = coef_boot$t, MARGIN = 2, FUN = sd)
apply(X = coef_storage, MARGIN = 2, FUN = sd)

## plot first 20 lines
plot(x = d$wt,
     y = d$mpg,
     pch = 19)
for(i in 1:20){
  abline(a = coef_storage[i, 1],
       b = coef_storage[i, 2],
       lty = 2,
       lwd = 3,
       col = "light grey")
}

Simulating a relationship between two variables

As discussed in Part 1, simulation differs from resampling in that we use the parameters of the observed data to compute a new distribution, versus sampling from the data we have on hand.

For example, using the mean and standard deviation of mpg from the mtcars data set, we can simulate 1000 random draws from a normal distribution.

## load the mtcars data set
d <- mtcars

## make a random draw from the normal distribution for mph
set.seed(5)
mpg_sim <- rnorm(n = 1000, mean = mean(d$mpg), sd = sd(d$mpg))

## plot and summarize
hist(mpg_sim)

mean(mpg_sim)
sd(mpg_sim)

Frequently, we are interested in the relationship between two variables (e.g., correlation, regression, etc.). Let’s simulate two variables, x and y, which are linearly related in some way. To do this, we first simulate the variable x and then simulate y to be x plus some level of random noise.

# simulate x and y
set.seed(1098)
x <- rnorm(n = 10, mean = 50, sd = 10)
y <- x + rnorm(n = length(x), mean = 0, sd = 10)

# put the results in a data frame
dat <- data.frame(x, y)
dat

# how correlated are the two variables
cor.test(x, y)

# fit a regression for the two variables
fit <- lm(y ~ x)
summary(fit)

# plot the two variables with the regression line
plot(x, y, pch = 19)
abline(fit, col = "red", lwd = 2, lty = 2)

Simulating a data set with multiple variables

Frequently, we might have a hypothesis regarding how correlated multiple variables are with each other. The example above produced a relationship of two variables with a direct relationship between them along with some noise. We might want to specify this relationship given a correlation coefficient or covariance between them. Additionally, we might have more than two variables that we want to simulate relationships between.

To do this in R we can take advantage of two packages:

  • MASS via the mvrnorm()
  • mvtnorm via the mvrnorm()

Both packages have a function for simulating multivariate normal distributions. The primary difference is that the Sigma argument in the MASS package function, mvrnorm(), accepts a covariance matrix while the sigma argument in the mvtnorm package, rmvnorm() accepts a correlation matrix. I’ll show both examples but I tend to stick with the mtvnorm package because (at least for my brain) it is easier for me to think in terms of correlation coefficients instead of covariances.

First we simulate some data:

## create fake data
set.seed(1234)
fake_dat <- data.frame(
  group = rep(c("a", "b", "c"), each = 5),
  x = rnorm(n = 15, mean = 10, sd = 2),
  y = rnorm(n = 15, mean = 30, sd = 10),
  z = rnorm(n = 15, mean = 75, sd = 20)
)

fake_dat

Look at the correlation and variance between the three numeric variables.

# correlation
round(cor(fake_dat[, -1]), 3)

# variance
round(var(fake_dat[, -1]), 3)

We can use this information to simulate new x, y, or z variables.

Simulating x and y with the MASS package

Remember, for the MASS package, the Sigma argument is a matrix of covariances for the variables you are simulating from a multivariate normal distribution.

## get a vector of the mean for each variable
variable_means <- apply(X = fake_dat[, c("x", "y")], MARGIN = 2, FUN = mean)

## Get a matrix of the covariance between x and y
variable_sigmas <- var(fake_dat[, c("x", "y")])

## simulate 1000 new x and y variables using the MASS package
set.seed(98)
new_sim <- MASS::mvrnorm(n = 1000, mu = variable_means, Sigma = variable_sigmas)
head(new_sim)

### look at the results relative to the original x and y
## column means
variable_means
apply(X = new_sim, MARGIN = 2, FUN = mean)

## covariance
var(fake_dat[, c("x","y")])
var(new_sim)

 

Notice that the variance between x and y (the off diagonal of the matrix) is very similar between the fake data (the observed sample) and the simulated data (new_sim).

Simulating x and y with the mtvnorm package

Different than the MASS package, The rmvnorm() function from the mtvnorm package requires the sigma argument to be a correlations matrix.

Let’s repeat the above process with our fake_dat and simulate a relationship between x and y.

## get a vector of the mean for each variable
variable_means <- apply(X = fake_dat[, c("x", "y")], MARGIN = 2, FUN = mean)

## Get a matrix of the correlation between x and y
variable_cor <- cor(fake_dat[, c("x", "y")])

## simulate 1000 new x and y variables using the mvtnorm package
set.seed(98)
new_sim <- mvtnorm::rmvnorm(n = 1000, mean = variable_means, sigma = variable_cor)
head(new_sim)

### look at the results relative to the original x and y
## column means
variable_means
apply(X = new_sim, MARGIN = 2, FUN = mean)

## correlation
cor(fake_dat[, c("x","y")])
cor(new_sim)

Similar to the previous example, we notice that the off diagonal correlation coefficient between x and y is very similar when comparing the simulated data to the fake data.

So, what is happening here? Both packages produce the same result, one uses a covariance matrix and the other uses a correlation matrix. The kicker here is understanding the relationship between covariance and correlation. Covariance is explaining how two variables vary together, however, its units aren’t on a scale that is directly interpretable to us. But, we can convert the covariance between two variables to a correlation by dividing their covariance by the product of their individual standard deviations.

For example, here is the covariance matrix between x and y in the fake data set.

cov(fake_dat[, c("x", "y")])

The covariance between the two variables is on the off diagonal, 2.389. We can store this in its own element.

cov_xy <- cov(fake_dat[, c("x", "y")])[2,1]
cov_xy

Let’s store the standard deviation of both `x` and `y` in their own elements to make the equation easier to read.

sd_x <- sd(fake_dat$x)
sd_y <- sd(fake_dat$y)

Finally, we calculate the correlation by dividing the covariance by the product of the two standard deviations and check our results by calling the cor() function on the two variables.

## covariance to correlation
cov_to_cor <- cov_xy / (sd_x * sd_y)
cov_to_cor

## check results with the corr() function
cor(fake_dat[, c("x", "y")])

By dividing the covariance by the product of the standard deviation of the two variables we can see the relationship between covariance and correlation and now understand why the results from the MASS and mtvnorm produce similar results. Understanding this relationship becomes valuable when we move onto simulating more complex relationships, for example when simulating mixed models.

What about three variables?

What if we want to simulate all three variables — x, y, and z?

All we need is a larger covariance or correlation matrix, depending on which of the above packages you’d like to use. Since we usually won’t be creating these matrices from a data set, as I did above, I’ll show how to create your own matrix and run the simulation.

First, let’s store a vector of plausible mean values for x, y, and z.

## Look at the mean values we had in the fake data
apply(X = fake_dat[, c("x", "y", "z")], MARGIN = 2, FUN = mean)

## create a vector of possible mean values for the simulation
mus <- c(9, 26, 63)

## look at the correlation matrix for the three variables in the fake data
cor(fake_dat[, c("x", "y", "z")])

## Create a matrix that stores plausible correlations between the variables you want to simulate
r_matrix <- matrix(c(1, 0.14, -0.24,
                    0.14, 1, -0.35,
                    -0.24, -0.35, 1), 
                   nrow = 3, ncol = 3,
       dimnames = list(c("x", "y", "z"),
                       c("x", "y", "z")))

r_matrix

Next, we create 1000 simulations of a multivariate normal distribution between x, y, and z. We then compare our correlation coefficients between the fake data, which is our observed sample, and the simulated data

## simulate 1000 new x, y, and z variables using the mvtnorm package
set.seed(43)
new_sim <- mvtnorm::rmvnorm(n = 1000, mean = mus, sigma = r_matrix)
head(new_sim)

### look at the results relative to the original x, y, and z
## column means
apply(X = fake_dat[, c("x", "y", "z")], MARGIN = 2, FUN = mean)
apply(X = new_sim, MARGIN = 2, FUN = mean)

## correlation
cor(fake_dat[, c("x", "y", "z")])
cor(new_sim)

The results from the simulation are pretty similar to the fake_dat dataset. If you recall from the correlation matrix and the vector of means, we didn’t use exact values from the observed data, so that, along with the random draws from the multivariate normal distribution, leads to the small amount of differences.

Wrapping Up

In this second installment of the Simulations in R series we’ve walked through how to code our own bootstrap resampling for both means and regression coefficients. We then progressed to building simulations of both bivariate and multivariate normal distributions. This, along with the basic info in Part 1, will serve us well as we progress forward in our work and begin to explore using simulations for comparisons between group means (simulated t-tests) and building regression models.

As always, all of the code is available in the Github repository.

Simulations in R Part 1: Functions for Simulation & Resampling

Simulating data is something I find myself doing all the time. Not only to explore uncertainty in data but also to explore model assumptions, understand how models behave under different circumstances, or to try and understand how a future analysis might work given some underlying data generating process. Thus, I decided to put together a series on simulations and resampling using R (I’ll also add a few analog scripts using Python to the GitHub repository).

In Part 1, I’ll provide some thoughts around why you might want to simulate or resample data and then show how you can simply do this in R. Additionally, I’ll walk through several helper functions for conducting and summarizing simulations/resamples as well as some basics around for() and while() loops, as we will use these extensively in our simulation and resampling processes.

My Github repository will contain all of the scripts in this series.

Why do we simulate or resample data?

  • The data generating process is what defines the properties of our data and dictates the type of distribution we are dealing with. For example, the mean and standard deviation reflect the two parameters of the data generating process for a normal distribution. We rarely know what the data generating process of our data is in the real world, thus we must infer it from our sample data. Both resampling and simulation offer methods of understanding the data generating process of data.
  • Sample data represents a small sliver of what might be occurring in the broader population. Using resampling and simulation, we are able to build larger data sets based on information contained in the sample data. Such approaches allow us to explore our uncertainty around what we have observed in our sample and the inferences we might be able to make about that larger population.
  • Creating samples of data allows us to assess patterns in the data and evaluate those patterns under different circumstances, which we can directly program.
  • By coding a simulation, we are able to reflect a desired data generating process, allowing us to evaluate assumptions or limitations of data that we have collected or are going to collect.
  • The world is full of randomness, meaning that every observation we make comes with some level of uncertainty. The uncertainty that we have about the true value of our observation can be expressed via various probability distributions. Resamping and simulation are ways that we can mimic this randomness in the world and help calibrate our expectation about the probability of certain events or observations occurring.

Difference between resampling and simulation

Resampling and simulation are both useful at generating data sets and reflecting uncertainty. However, they accomplish this task in different ways.

  • Resampling deals with techniques that take the observed sample data and randomly draw observations from that data to construct a new data set. This is often done thousands of times, building thousands of new data sets, and then summary statistics are produced on those data sets as a means of understanding the data generating properties.
  • Simulation works by assuming a data generating process (e.g., making a best guess or estimating a plausible mean and standard deviation for the population from previous literature) and then generating multiple samples of data, randomly, from the data generating process features.

Sampling from common distributions

To create a distribution in R we can use any one of the four primary prefixes, which define the type of information we want returned about the distribution, followed by the suffix that defines the distribution we are interested in.

Here is a helpful cheat sheet I put together for some of the common distributions one might use:

Some examples:

# The probability that a random variable is less than or equal to 1.645 has a cumulative density of 95% (CDF)
pnorm(q = 1.645, mean = 0, sd = 1)

# What is the exact probability (PDF) that we flip 10 coins, with 50% chance of heads or tails, and get 1 heads?
dbinom(x = 1, size = 10, prob = 0.5)

# What is the z-score for the 95 percentile when the data is Normal(0, 1)?
qnorm(p = 0.95, mean = 0, sd = 1)

# randomly draw 10 values from a uniform distribution with a min of 5 and max of 10
runif(n = 10, min = 5, max = 10)

We can completely simulate different distributions and properties of those distributions using these function. For several examples of different distributions see the GitHub code. Below is an example of 1,000 random observations from a normal distribution with a mean of 30 and standard deviation of 15 and plot the results..

## set the seed for reproducibility
set.seed(10)
norm_dat <- rnorm(n = 1000, mean = 30, sd = 15)

hist(norm_dat,
     main = "Random Simulation from a Normal Distribution",
     xlab = "N(30, 15^2)")

We can produce a number of summary statistics on this vector of random values:

# sample size
length(norm_dat)

# mean, standard deviation, and variance
mean(norm_dat)
sd(norm_dat)
var(norm_dat)

# median, median absolute deviation
median(norm_dat)
mad(norm_dat)

for & while loops

Typically, we are going to want to resample data more than once or to run multiple simulations. Often, we will want to do this thousands of times. We can use R to help us in the endeavor by programming for() and while() loops to do the heavy lifting for us and store the results in a convenient format (e.g., vector, data frame, matrix, or list) so that we can summarize it later.

for loops

for() loops are easy ways to tell `R` that we want it to do some sort of task for a specified number of iterations.

For example, let’s create a for() loop that adds 5 for every value from 1 to 10, for(i in 1:10).

# program the loop to add 5 to every value from 1:10
for(i in 1:10){
  
  print(i + 5)
  
}

We notice that the result is printed directly to the console. If we are doing thousands of iterations or if we want to store the results to plot and summarize them later, this wont be a good option. Instead, we can allocate an empty vector or data frame to store these values.

## storing values as vector
n <- 10
vector_storage <- rep(NA, times = n)

for(i in 1:n){
  vector_storage[i] <- i + 5
}

vector_storage

## store results back to a data frame
n <- 10
df_storage <- data.frame(n = 1:10)

for(i in 1:n){
  df_storage$n2[i] <- i + 5
}

df_storage

while loops

while() loops differ from for() loops in that they continue to perform a process while some condition is met.

For example, if we start with a count of 0 observations and continually add 1 observation we want to perform this process as long as the observations are below 10.

observations <- 0

while(observations < 10){
	observations <- observations + 1
	print(observations)
} 

We can also use while() loops to test logical arguments.

For example, let’s say we have five coins in our pocket and want to play a game with a fried where we flip a fair coin and every time it ends on heads (coin_flip == 1) we get a coin and every time it ends on tails we lose a coin. We are only willing to continue playing the game as long as retain between 3 and 10 coins.

## starting number of coins
coins <- 5 

## while loop 
while(coins >= 3 && coins <= 10){
	
  # flip a fair coin (50/50 chance of heads or tails)
	coin_flip <- rbinom(1,1,0.5)
	
	# If the coin leads on heads (1) you win a coin and if it lands on tails (0) you lose a coin
	if(coin_flip == 1){
	  
		coins <- coins + 1
		
		}else{
			coins <- coins - 1
		}
	
	## NOTE: we only play while our winnings are between 3 and 10 coins
	
	# print the result
	print(coins)
}

You can run the code many times and find out, on average, how many flips you will get!

Finally, we can also use while() loops if we are building models to minimize error. For example, lets say we have an error = 30 and we want to continue running the code until we have minimized the error below 1. So, the code will run while(error > 1).

error <- 30 while(error > 1){
  
 error <- error / 2
 print(error)
}

Helper functions for summarizing distributions

There are a number of helper functions in base R that can assist us in summarizing data.

  • apply() will return your results in a vector
  • lapply() will return your results as a list
  • sapply() can return the results as a vector or a list (if you set the argument `simplify = FALSE`)
  • tapply() will return your results in a named vector based on whichever grouping variable you specify
## create fake data
set.seed(1234)
fake_dat <- data.frame(
  group = rep(c("a", "b", "c"), each = 5),
  x = rnorm(n = 15, mean = 10, sd = 2),
  y = rnorm(n = 15, mean = 30, sd = 10),
  z = rnorm(n = 15, mean = 75, sd = 20)
)

fake_dat

#### apply ####
# get the column averages
apply(X = fake_dat[,-1], MARGIN = 2, FUN = mean)

# get the row averages
apply(X = fake_dat[,-1], MARGIN = 1, FUN = mean)

#### lapply ####
# Get the 95% quantile interval for each column
lapply(X = fake_dat[,-1], FUN = quantile, probs = c(0.025, 0.975))

#### sapply ####
# Get the standard deviation of each column in a vector
sapply(X = fake_dat[,-1], FUN = sd)

# Get the standard deviation of each column in a list
sapply(X = fake_dat[,-1], FUN = sd, simplify = FALSE)

#### tapply ####
# Get the average of x for each group
tapply(X = fake_dat$x, INDEX = fake_dat$group, FUN = mean)

We can alternatively do a lot of this type of data summarizing using the convenient R package {tidyverse}.

library(tidyverse)

## get the mean of each numeric column
fake_dat %>%
  summarize(across(.cols = x:z,
                   .fns = ~mean(.x)))

## get the mean across each row for the numeric columns
fake_dat %>%
  rowwise() %>%
  mutate(AVG = mean(c_across(cols = x:z)))

## Get the mean of x for each grou
fake_dat %>%
  group_by(group) %>%
  summarize(avg_x = mean(x),
            .groups = "drop")

Finally, another handy base R function is replicate(), which allows us to replicate a task n number of times.

For example, let’s say we want to draw from a random normal distribution, rnorm() with a mean = 0 and sd = 1 but, we want to run this random simulation 10 times and get 10 different data sets. replicate()` allows us to do this and stores the results in a matrix with 10 columns, each with 10 rows of the random sample.

replicate(n = 10, expr = rnorm(n = 10, mean = 0, sd = 1))

Wrapping Up

In this first part of my simulation and resampling series we went through some of the key functions in R that will help us build the scaffolding for our future work. In Part 2, we we dive into bootstrap resampling and simulating bivariate and multivariate normal distributions.

All code is available in both rmarkdown and html format on my Github page.

The High Performance Hockey Podcast Interview

This week, I had the great pleasure of being interviewed by my good friend and colleague Anthony Donskov for his High Performance Hockey Podcast.

Anthony has done a tremendous job for the sports science and strength and conditioning community in his teaching, writing, and podcasting. He brings a wealth of knowledge from both the applied strength coach realm all the way through to his PhD work.

In this podcast interview, Anthony and I discuss:

  • Data analysis
  • The PPDAC Framework for conducting research
  • My criticisms of applied sport science
  • The challenge of measuring hard things and things that matter in applied sport.

Check out the podcast HERE.

TidyX Episode 151: Shiny Model Building – Multiple Model Options

Ellis Hughes and I have  been building up an interface for the user to input a data set and then run a regression model and get out of sample predictions. This week, add more to our {shiny} model user interface and allow the user to specify the type of model they want to run. Instead of just allowing a linear regression model the user can now choose to either run a random forest or a decision tree.

To watch our screen cast, CLICK HERE.

To access our code, CLICK HERE.

TidyX Episode 150: Shiny Model Building – Predictor Selection & Toast

In this next iteration of our Shiny Modeling building series, Ellis Hughes and I advance our user defined shiny model to ensure that the user is not able to select a variable as both a predictor and outcome variable. We discuss two different ways to do this, based on what sort of controls you’d like to provide the user. Additionally, we cover how to add a Toast drop down box to the user interface. When the user changes something, for example builds a model and then decides they want to change the outcome variable, a little drop down will pop out at the top of the screen to inform the user that something has changed with their data. This ensures that the user is always aware of the output they are looking at, helping to prevent any mistakes in interpretation.

To watch our screen cast, CLICK HERE.

To access our code, CLICK HERE.