Category Archives: Model Building in R

Two Group Comparison – Frequentist vs Bayes – Part 1

One of the more common types of analysis conducted in science is the comparison of two groups (e.g., a treatment group and a control group) to identify whether an intervention had a desired effect. The problem, in sport science and other fields, is often a lack of participants (small sample sizes) and an inability to combine prior knowledge (e.g., outcomes from previous research or domain expertise) with the observed data in order to make a broader statement of our findings (i.e., a Bayesian approach). Consequently, this leaves us analyzing the data on hand and reporting whatever potential findings shake out.

This tutorial will step through a two group comparison of some simulated data from both a frequentist (t-test) and Bayesian approach.

All code is accessible on my GITHUB page.

Two Groups – Simulated Data

We are conducting a study on the effect of a new strength training program. Participants are randomly allocated into a control group (n = 17), receiving a normal training program, or an experimental group (n = 22), receiving the new training program.

The data consists of the change in strength score observed for each participant in their respective groups. (NOTE: For the purposes of this example, the strength score is a made-up number describing a participant's strength).

Simulate the data

library(tidyverse)
library(patchwork)

theme_set(theme_classic())

set.seed(6677)
df <- data.frame( subject = 1:39, group = c(rep("control", times = 17), rep("experimental", times = 22)), strength_score_change = c(round(rnorm(n = 17, mean = 0, sd = 0.5), 1), round(rnorm(n = 22, mean = 0.5, sd = 0.5), 1)) ) %>%
  mutate(group = factor(group, levels = c("experimental", "control")))

 

Summary Statistics

df %>%
  group_by(group) %>%
  summarize(N = n(),
            Avg = mean(strength_score_change),
            SD = sd(strength_score_change),
            SE = SD / sqrt(N))


It looks like the new strength training program led to a greater improvement in strength, on average, than the normal strength training program.

Plot the data

Density plots of the two distributions

df %>%
  ggplot(aes(x = strength_score_change, fill = group)) +
  geom_density(alpha = 0.2) +
  xlim(-2, 2.5)


Plot the means and 95% CI to compare visually

df %>%
  group_by(group) %>%
  summarize(N = n(),
            Avg = mean(strength_score_change),
            SD = sd(strength_score_change),
            SE = SD / sqrt(N)) %>%
  ggplot(aes(x = group, y = Avg)) +
  geom_hline(yintercept = 0,
             size = 1,
             linetype = "dashed") +
  geom_point(size = 5) +
  geom_errorbar(aes(ymin = Avg - 1.96 * SE, ymax = Avg + 1.96 * SE),
                width = 0,
                size = 1.4) +
  theme(axis.text = element_text(face = "bold", size = 13),
        axis.title = element_text(face = "bold", size = 17))

Plot the 95% CI for the difference in means

df %>%
  mutate(group = factor(group, levels = c("control", "experimental"))) %>%
  group_by(group) %>%
  summarize(N = n(),
            Avg = mean(strength_score_change),
            SD = sd(strength_score_change),
            SE = SD / sqrt(N),
            SE2 = SE^2,
            .groups = "drop") %>%
  summarize(diff = diff(Avg),
            se_diff = sqrt(sum(SE2))) %>%
  mutate(group = 'Group\nDifference') %>%
  ggplot(aes(x = diff, y = 'Group\nDifference')) +
  geom_vline(aes(xintercept = 0),
             linetype = "dashed",
             size = 1.2) +
  geom_point(size = 5) +
  geom_errorbar(aes(xmin = diff - 1.96 * se_diff,
                    xmax = diff + 1.96 * se_diff),
                width = 0,
                size = 1.3) +
  xlim(-0.8, 0.8) +
  labs(y = NULL,
       x = "Average Difference in Strength Score",
       title = "Difference in Strength Score",
       subtitle = "Experimental - Control (Mean Difference ± 95% CI)") +
  theme(axis.text = element_text(face = "bold", size = 13),
        axis.title = element_text(face = "bold", size = 17),
        plot.title = element_text(size = 20),
        plot.subtitle = element_text(size = 18))


Initial Observations

  • We can see that the two distributions have a decent amount of overlap
  • The mean change in strength for the experimental group appears to be larger than the mean change in strength for the control group
  • When visually comparing the group means and their associated confidence intervals, there is an observable difference. Additionally, looking at the mean difference plot, that difference appears to be significant based on the fact that the confidence interval does not cross zero.

So, it seems like something is going on here with the new strength training program. Let’s do some testing to see just how large the difference is between our two groups.

Group Comparison: T-test

First, let’s use a frequentist approach to evaluating the difference in strength score change between the two groups. We will use a t-test to compare the mean differences.

diff_t_test <- t.test(strength_score_change ~ group,
       data = df,
       alternative = "two.sided")

## store some of the main values in their own elements to use later
t_test_diff  <- as.vector(round(diff_t_test$estimate[1] - diff_t_test$estimate[2], 2))

se_diff <- 0.155    # calculated above when creating the plot

diff_t_test

  • The test is statistically significant, as the p-value is less than the magic 0.05 (p = 0.01) with a mean difference in strength score of 0.41 and a 95% Confidence Interval of 0.1 to 0.73.

Here, the null hypothesis was that the difference in strength improvement between the two groups is 0 (no difference) while the alternative hypothesis is that the difference is not 0. In this case, we did a two sided test so we are stating that we want to know if the change in strength from the new training program is either better or worse than the normal training program. If we were interested in testing the difference in a specific direction, we could have done a one sided test in the direction of our hypothesis, for example, testing whether the new training program is better than the normal program.
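
If we wanted that one-sided version with the data above, the call might look something like the sketch below (because the group factor was re-leveled with "experimental" as the first level, alternative = "greater" tests whether the experimental group improved more than the control group):

## Sketch: one-sided version of the test (experimental > control), given the factor ordering above
t.test(strength_score_change ~ group,
       data = df,
       alternative = "greater")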

But, what if we just got lucky with our sample in the experimental group, and they happened to respond really well to the new program, and got unlucky in our control group, and they happened to not respond as well to the normal training program? This could certainly happen. Also, had we collected data on a few more participants, it is possible that the group means would have changed and no effect would have been observed.

The t-test only allows us to compare the group means. It doesn’t tell us anything about the differences in variances between groups (we could do an F-test for that). The other issue is that rarely is the difference between two groups exactly 0. It makes more sense for us to compare a full distribution of the data and make an inference about how much difference there is between the two groups. Finally, we are only dealing with the data we have on hand (the data collected for our study). What if we want to incorporate prior information/knowledge into the analysis? What if we collect data on a few more participants and want to add that to what we have already observed to see how it changes the distribution of our data? For that, we can use a Bayesian approach.
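
As an aside, if we did want to compare the two group variances before moving on, base R offers a simple F-test via var.test(); a minimal sketch with the data above:

## Sketch: F-test comparing the variances of the two groups
var.test(strength_score_change ~ group, data = df)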

Group Comparison: Bayes

The two approaches we will use are:

  1. Group comparison with a known variance, using conjugate priors
  2. Group comparisons without a known variance, estimating the mean and SD distributions jointly

Part 1: Group Comparison with known variance

First, we need to set some priors. Let's say that we are skeptical about the effect of the new strength training program. It is a new training stimulus that the subjects have never been exposed to, so perhaps we believe that it will improve strength but not much more than the traditional program. We will set our prior belief about the mean improvement of any strength training program to be 0.1, with a standard deviation of 0.3, for our population. Thus, we will look to combine this average/expected improvement (prior knowledge) with the improvements we observe in our respective groups. Finally, we convert the standard deviation of the mean to precision, calculated as 1/sd^2, for our equations going forward.

prior_mu <- 0.1
prior_sd <- 0.3
prior_var <- prior_sd^2
prior_precision <- 1 / prior_var

Plot the prior distribution for the mean difference

hist(rnorm(n = 1e6, mean = prior_mu, sd = prior_sd),
     main = "Prior Distribution for difference in strength change\nbetween control and experimental groups",
     xlab = NULL)
abline(v = 0,
       col = "red",
       lwd = 5,
       lty = 2)

Next, to make this conjugate approach work we need to have a known standard deviation. In this example, we will not be estimating the joint posterior distributions (mean and SD). Rather, we are saying that we are interested in knowing the mean and the variability around it but we are going to have a fixed/known SD.

Let’s say we looked at some previously published scientific literature and also tried out the new program in a small pilot study of athletes, and we found that the improvements from a strength training program are normally distributed with a known SD of 0.6.

known_sd <- 0.6
known_var <- known_sd^2

Finally, let’s store the summary statistics for each group in their own elements. We can type these directly in from our summary table.

df %>%
  group_by(group) %>%
  summarize(N = n(),
            Avg = mean(strength_score_change),
            SD = sd(strength_score_change),
            SE = SD / sqrt(N))

experimental_N <- 22
experimental_mu <- 0.423
experimental_sd <- 0.494
experimental_var <- experimental_sd^2
experimental_precision <- 1 / experimental_var

control_N <- 17
control_mu <- 0.0118
control_sd <- 0.469
control_var <- control_sd^2
control_precision <- 1 / control_var

Now we are ready to update the observed study data for each group with our prior information and obtain posterior estimates. We will use the updating rules provided by William Bolstad in Chapter 13 of his book, Introduction to Bayesian Statistics, 2nd Ed.
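
For reference, with a normal likelihood and a known variance, those updating rules reduce to the following, where n is the group's sample size and observed_mu is the group's observed mean (this is exactly what the code below computes):

posterior_precision = prior_precision + n / known_var
posterior_mu = (prior_precision / posterior_precision) * prior_mu + ((n / known_var) / posterior_precision) * observed_mu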

##### Update the control group observations ######
## SD
posterior_precision_control <- prior_precision + control_N / known_var
posterior_var_control <- 1 / posterior_precision_control
posterior_sd_control <- sqrt(posterior_var_control)

## mean
posterior_mu_control <- prior_precision / posterior_precision_control * prior_mu + (control_N / known_var) / posterior_precision_control * control_mu

posterior_mu_control
posterior_sd_control

##### Update the experimental group observations ######
## SD
posterior_precision_experimental <- prior_precision + experimental_N / known_var
posterior_var_experimental <- 1 / posterior_precision_experimental
posterior_sd_experimental <- sqrt(posterior_var_experimental)

## mean
posterior_mu_experimental <- prior_precision / posterior_precision_experimental * prior_mu + (experimental_N / known_var) / posterior_precision_experimental * experimental_mu

posterior_mu_experimental
posterior_sd_experimental

Compare the posterior difference in strength change

# mean
mu_diff <- posterior_mu_experimental - posterior_mu_control
mu_diff

# sd = sqrt(var1 + var2)
sd_diff <- sqrt(posterior_var_experimental + posterior_var_control)
sd_diff

# 95% Credible Interval
mu_diff + qnorm(p = c(0.025, 0.975)) * sd_diff

  • Combining the observations for each group with our prior, it appears that the new program was more effective than the normal program on average (0.34 difference in change score between experimental and control) but the credible interval suggests the data is consistent with the new program having no extra benefit [95% Credible Interval: -0.0003 to 0.69].

Next, we can take the posterior parameters and perform a Monte Carlo Simulation to compare the posterior distributions.

Monte Carlo Simulation is an approach that uses random sampling from a defined distribution to solve a problem. Here, we will use Monte Carlo Simulation to sample from the normal distributions of the control and experimental groups as well as a simulation for the difference between the two. To do this, we will create a random draw of 10,000 values with the mean and standard deviation being defined as the posterior mean and standard deviation from the respective groups.

## Number of simulations
N <- 10000

## Monte Carlo Simulation
set.seed(9191)
control_posterior <- rnorm(n = N, mean = posterior_mu_control, sd = posterior_sd_control)
experimental_posterior <- rnorm(n = N, mean = posterior_mu_experimental, sd = posterior_sd_experimental)
diff_posterior <- rnorm(n = N, mean = mu_diff, sd = sd_diff)

## Put the control and experimental groups into a data frame
posterior_df <- data.frame(
  group = c(rep("control", times = N), rep("experimental", times = N)),
  posterior_sim = c(control_posterior, experimental_posterior)
)

Plot the mean and 95% CI for both simulated groups

posterior_df %>%
  group_by(group) %>%
  summarize(Avg = mean(posterior_sim),
            SE = sd(posterior_sim)) %>%
  ggplot(aes(x = group, y = Avg)) +
  geom_hline(yintercept = 0,
             size = 1,
             linetype = "dashed") +
  geom_point(size = 5) +
  geom_errorbar(aes(ymin = Avg - 1.96 * SE, ymax = Avg + 1.96 * SE),
                width = 0,
                size = 1.4) +
  labs(x = NULL,
       y = "Mean Diff",
       title = "Difference in Strength Score",
       subtitle = "Monte Carlo Simulation of Posterior Mean and SD") +
  theme(axis.text = element_text(face = "bold", size = 13),
        axis.title = element_text(face = "bold", size = 17),
        plot.title = element_text(size = 20),
        plot.subtitle = element_text(size = 17))

  • We see more overlap here than we did when we were just looking at the observed data. Part of this is due to the priors that we set.

Plot the posterior simulation for the difference

hist(diff_posterior,
     main = "Posterior Simulation of the Difference between groups",
     xlab = "Strength Score Difference")
abline(v = 0,
       col = "red",
       lwd = 5,
       lty = 2)

Compare these results to the observed data and the t-test results

data.frame(
  group = c("control", "experimental", "difference"),
  observed_avg = c(control_mu, experimental_mu, t_test_diff),
  posterior_sim_avg = c(posterior_mu_control, posterior_mu_experimental, mu_diff),
  observed_standard_error = c(control_sd / sqrt(control_N), experimental_sd / sqrt(experimental_N), se_diff),
  posterior_sim_standard_error = c(posterior_sd_control, posterior_sd_experimental, sd_diff)
  )

  • Recall that we set our prior mean to 0.1 and our prior SD to 0.3
  • Also notice that, because this conjugate approach is directed at the means, I convert the observed standard deviations into standard errors (the standard deviation of the mean given the sample size). Remember, we set the SD to be known for this approach to work, so we are not able to say anything about it following our Bayesian procedure. To do so, we’d need to estimate both parameters (mean and SD) jointly. We will do this in part 2.
  • Above, we see that, following the Bayesian approach, the mean value of the control group gets pulled up closer to our prior while the mean value of the experimental group gets pulled down closer to that prior.
  • The posterior for the difference between the two groups, which is what we are interested in, has come down towards the prior, considerably. Had we had a larger sample of data in our two groups, the movement towards the prior would have been less extreme (see the short sketch after this list). But, since this was a relatively small study we get more movement towards the prior, placing less confidence in the observed sample of participants we had. We also see that the standard error around the mean difference increased, since our prior SD was 0.3. So, the mean difference decreased a bit and the standard error around that mean increased a bit. Both occurrences lead to less certainty in our observed outcomes.
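
To make the sample-size point concrete, here is a small sketch that re-runs the posterior mean update for the experimental group with a hypothetical sample of n = 200 (keeping the same observed mean of 0.423); with more data, the posterior barely moves toward the prior:

## Hypothetical larger sample: n = 200 with the same observed experimental mean
n_hyp <- 200
post_prec_hyp <- prior_precision + n_hyp / known_var
prior_precision / post_prec_hyp * prior_mu +
  (n_hyp / known_var) / post_prec_hyp * experimental_mu   # ~0.42, versus ~0.37 with n = 22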

Compare the Observed Difference to the Bayesian Posterior Difference

  • Start by creating a Monte Carlo Simulation of the observed difference from the original data.
  • Notice that the Bayesian Posterior Difference (blue) is pulled down towards our prior, slightly.
  • We also see that the Bayesian Posterior has more overlap of 0 than the Observed Difference (red) from the original data, which had a small sample size and hadn’t combined any prior knowledge that we might have had.
N <- 10000

set.seed(8945)
observed_diff <- rnorm(n = N, mean = t_test_diff, sd = se_diff)

plot(density(observed_diff),
     lwd = 5,
     col = "red",
     main = "Comparing the Observed Difference (red)\nto the\nBayes Posterior Difference (blue)",
     xlab = "Difference in Strength Score between Groups")
lines(density(diff_posterior),
      lwd = 5,
      col = "blue")
abline(v = 0,
       lty = 2,
       lwd = 3,
       col = "black")

Wrapping Up

That’s it for part 1. We’ve computed a frequentist t-test of the observed data in our study and compared those results to Bayesian estimation, where we combined the observed data with some prior knowledge that we accessed from previous research and domain expertise. In this example, we used a conjugate approach and therefore were required to specify a known standard deviation for the data. Unfortunately, this may not always make sense to do, and we may instead need to estimate both the mean and standard deviation jointly. In part 2, we will discuss how to do this.

The code for this article is accessible on my GITHUB page.

If you see any code or mathematical errors, please email me.

Neural Networks….It’s regression all the way down!

Yesterday, I talked about how the t-test and ANOVA are fundamentally just linear regression. But what about something more complex? What about something like a neural network?

Whenever people bring up neural networks I always say, “The most basic neural network is a sigmoid function. It’s just logistic regression!!” Of course, neural networks can get very complex and there is a lot more that can be added to them to maximize their ability to do the job. But fundamentally, they look like regression models, and when you add several hidden layers (deep learning) you end up just stacking a bunch of regression models on top of each other (I know I’m oversimplifying this a little bit).
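
As a tiny illustration of that point (this snippet is mine, not part of the walkthrough below), the sigmoid inside a single neuron is exactly the inverse-logit link that logistic regression uses:

## The sigmoid in a single neuron is the inverse-logit used by logistic regression
sigmoid <- function(x) 1 / (1 + exp(-x))
x <- c(-2, 0, 2)
sigmoid(x)   # hand-rolled sigmoid
plogis(x)    # base R inverse-logit -- same values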

Let’s see if we can build a simple neural network to prove it. As always, you can access the full code on my GITHUB page.

Loading packages, functions, and data

We will load {tidyverse} for data cleaning, {neuralnet} for building our neural network, and {mlbench} to access the Boston housing data.

I create a z-score function that will be used to standardize the features for our model. We will keep this simple and attempt to predict Boston housing prices (medv) using three features (rm, dis, indus). To read more about what these features are, in your R console type ?BostonHousing. Once we’ve selected those features out of the data set, we apply our z-score function to them.

## Load packages
library(tidyverse)
library(neuralnet)
library(mlbench)

## z-score function
z_score <- function(x){
  z <- (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
  return(z)
}

## get data
data("BostonHousing")

## z-score features
d <- BostonHousing %>%
  select(medv, rm, dis, indus) %>%
  mutate(across(.cols = rm:indus,
                ~z_score(.x),
                .names = "{.col}_z"))
  
d %>% 
  head()

Train/Test Split

There isn’t much of a train/test split here because I’m not building a full model to be tested. I’m really just trying to show how a neural network works. Thus, I’ll select the first row of data as my “test” set and retain the rest of the data for training the model.

## remove first observation for making a prediction on after training
train <- d[-1, ]
test <- d[1, ]

Neural Network Model

We build a simple model with 1 hidden layer and then plot the output. In the plot we see various numbers. The numbers in black refer to our weights and the numbers in blue refer to the biases.

## Simple neural network with one hidden layer
set.seed(9164)
fit_nn <- neuralnet(medv ~ rm_z + dis_z + indus_z,
                    data = train,
                    hidden = 1,
                    err.fct = "sse",
                    linear.output = TRUE)

## plot the neural network
plot(fit_nn)

Making Predictions — It’s linear regression all the way down!

As stated above, we have weights (black numbers) and biases (blue numbers). If we are trying to frame up the neural network as being a bunch of stacked together linear regressions then we can think about the weights as functioning like regression coefficients and the biases functioning like the linear model intercept.

Let’s take each variable from the plot and store them in their own elements so that we can apply them directly to our test observation and write out the equation by hand.

## Predictions are formed using the weights (black) and biases
# Store the weights and biases from the plot and put them into their own elements
rm_weight <- 1.09872
dis_weight <- -0.05993
indus_weight <- -0.49887

# There is also a bias in the hidden layer
hidden_weight <- 35.95032

bias_1 <- -1.68717
bias_2 <- 14.85824

With everything stored, we are ready to make a prediction on the test observation.

We begin at the input layer by multiplying each z-scored value by the corresponding weight from the plot above. We sum those together and add in the first bias — just like we would with a linear regression.

# Start by applying the weights to their z-scored values, sum them together and add
# in the first bias
input <- bias_1 + test$rm_z * rm_weight + test$dis_z * dis_weight + test$indus_z * indus_weight
input

One regression down, one more to go! But before we can move to the next regression, we need to transform this input value. The neural network is using a sigmoid function to make this transformation as the input value moves through the hidden layer. So, we should apply the sigmoid function before moving on.

# transform input -- the neural network is using a sigmoid function
input_sig <- 1/(1+exp(-input))
input_sig


We take this transformed input and multiply it by the hidden weight and then add it to the second bias. This final regression equation produces the predicted value of the home.

prediction <- bias_2 + input_sig * hidden_weight
prediction


The prediction here is in thousands of dollars (the medv variable comes from 1970s census data).

Let’s compare the prediction we got by hand to what we get when we run the predict() function.

## Compare the output to what we would get if we used the predict() function
predict(fit_nn, newdata = test)

Same result!

Again, if you’d like the full code, you can access it on my GITHUB page.

Wrapping Up

Yesterday we talked about always thinking regression whenever you see a t-test or ANOVA. Today, we learn that we can think regression whenever we see a neural network, as well! By stacking two regression-like equations together we produced a neural network prediction. Imagine if we stacked 20 hidden layers together!
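
For what it's worth, stacking more layers in {neuralnet} only requires passing a vector to the hidden argument. A sketch with the same formula from above (the layer sizes here are arbitrary and purely illustrative; a deeper network may also need a larger stepmax to converge):

## Sketch: the same model with two hidden layers (3 and 2 neurons) -- illustrative only
fit_nn_deep <- neuralnet(medv ~ rm_z + dis_z + indus_z,
                         data = train,
                         hidden = c(3, 2),
                         err.fct = "sse",
                         linear.output = TRUE)
plot(fit_nn_deep)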

The big take home, though, is that regression is super powerful. Fundamentally, it is the workhorse that helps to drive a number of other machine learning approaches. Imagine if you spent a few years really studying regression models? Imagine what you could learn about data analysis? If you’re up for it, one of my all time favorite books on the topic is Gelman, Hill, and Vehtari’s Regression and Other Stories. I can’t recommend it enough!

Deriving a Confidence Interval from an Estimate and a p-value

Although most journals encourage authors to include confidence intervals in their papers, it isn't mandatory for all journals (merely a recommendation). Additionally, there may be occasions when you are reading an older paper from a time when this recommendation was not enforced. Finally, sometimes abstracts (due to word count limits) might only present p-values and estimates, in which case you might want to quickly obtain the confidence intervals to help organize your thoughts prior to diving into the paper. In these instances, you might be curious as to how you can get a confidence interval around the observed effect when all you have is a p-value.

Altman and Bland wrote a short piece on deriving a confidence interval from only an estimate and a p-value in a 2011 paper in the British Medical Journal:

Altman, DG. Bland, JM. (2011). How to obtain the confidence interval from a p-value. BMJ, 343: 1-2.

Before going through the approach, it is important to note that they indicate a limitation of this approach: it won't be as accurate in smaller samples, but the method can work well in larger studies (~60 subjects or more).

The Steps

The authors list three easy steps to derive the confidence interval from an estimate and p-value (plus an additional note for ratio statistics):

  1. Calculate the test statistic for a normal distribution from the p-value.
  2. Calculate the standard error (ignore the minus sign).
  3. Calculate the 95% CI using the standard error and a z-critical value for the desired level of confidence.
  4. When doing this approach for a ratio (e.g., Risk Ratio, Odds Ratio, Hazard Ratio), the formulas should be used with the estimate on the log scale (if it already isn’t) and then exponentiate (antilog) the confidence intervals to put the results back to the normal scale.

Calculating the test statistic

To calculate the test statistic use the following formula:

z = -0.862 + sqrt(0.743 - 2.404 * log(p.value))

Calculating the standard error

To calculate the standard error use the following formula (remember that we are ignoring the sign of the estimate):

se = abs(estimate) / z

If we are dealing with a ratio, make sure that you are working on the log scale:

se = abs(log(estimate)) / z

Calculating the 95% Confidence Limits

Once you have the standard error, the 95% Confidence Limits can be calculated by multiplying the standard error by the z-critical value of 1.96:

CL.95 = 1.96 * se

From there, the 95% Confidence Interval can be calculated:

low95 = Estimate - CL.95
high95 = Estimate + CL.95

Remember, if you are working with rate statistics and you want to get the confidence interval on the natural scale, you will need to take the antilog:

low95 = exp(Estimate - CL.95)
high95 = exp(Estimate + CL.95)
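
As a quick sanity check of these formulas, here they are applied by hand to the first example from the paper (a difference of 17 percentage points with p = 0.032), which is also worked through with the function further below:

## Worked check: difference of 17 percentage points, p = 0.032
z  <- -0.862 + sqrt(0.743 - 2.404 * log(0.032))   # ~2.14
se <- 17 / z                                      # ~7.9
17 + c(-1.96, 1.96) * se                          # ~1.4 to 32.6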

 

Writing a function

To make this simple, I’ll write everything into a function. The function will take two arguments, which you will need to obtain from the paper:

  1. p-value
  2. The estimate (e.g., difference in means, risk ratio, odds ratio, hazard ratio, etc)

The function will default to log = FALSE but if you are working with a rate statistic you can change the argument to log = TRUE to get the results on both the log and natural scales. The function also takes a sig_digits argument, which defaults to 3 but can be changed depending on how many significant digits you need.

estimate_ci_95 <- function(p_value, estimate, log = FALSE, sig_digits = 3){
  
  if(log == FALSE){
    
    z <- -0.862 + sqrt(0.743 - 2.404 * log(p_value))
    z

    se <- abs(estimate) / z
    se
    
    cl <- 1.96 * se
    
    low95 <- estimate - cl
    high95 <- estimate + cl
    
    list('Standard Error' = round(se, sig_digits),
         '95% CL' = round(cl, sig_digits),
         '95% CI' = paste(round(low95, sig_digits), round(high95, sig_digits), sep = ", "))
    
  } else {
    
    if(log == TRUE){
      
      z <- -0.862 + sqrt(0.743 - 2.404 * log(p_value))
      z
      
      se <- abs(estimate) / z
      se
      
      cl <- 1.96 * se
      
      low95_log_scale <- estimate - cl
      high95_log_scale <- estimate + cl
      
      low95_natural_scale <- exp(estimate - cl)
      high95_natural_scale <- exp(estimate + cl)
      
      list('Standard Error (log scale)' = round(se, sig_digits),
           '95% CL (log scale)' = round(cl, sig_digits),
           '95% CL (natural scale)' = round(exp(cl), sig_digits),
           '95% CI (log scale)' = paste(round(low95_log_scale, sig_digits), round(high95_log_scale, sig_digits), sep = ", "),
           '95% CI (natural scale)' = paste(round(low95_natural_scale, sig_digits), round(high95_natural_scale, sig_digits), sep = ", "))
      
    }
    
  }
  
}

Test the function out

The paper provides two examples, one for a difference in means and the other for risk ratios.

Example 1

Example 1 states:

“the abstract of a report of a randomised trial included the statement that “more patients in the zinc group than in the control group recovered by two days (49% v 32%,P=0.032).” The difference in proportions was Est = 17 percentage points, but what is the 95% confidence interval (CI)?

estimate_ci_95(p_value = 0.032, estimate = 17, log = FALSE, sig_digits = 1)

 

 

Example 2

Example 2 states:

“the abstract of a report of a cohort study includes the statement that “In those with a [diastolic blood pressure] reading of 95-99 mm Hg the relative risk was 0.30 (P=0.034).” What is the confidence interval around 0.30?”

Here we change the argument to log = TRUE since this is a ratio statistic and needs to be on the log scale.

estimate_ci_95(p_value = 0.034, estimate = log(0.3), log = TRUE, sig_digits = 2)

Try the approach out on a different data set to confirm the confidence intervals are calculated properly

Below, we build a simple logistic regression model for the PimaIndiansDiabetes data set from the {mlbench} package.

  • The odds ratios are already on the log scale so we set the argument log = TRUE to ensure we get results reported back to us on the natural scale, as well.
  • We use the summary() function to obtain the model estimates and p-values.
  • We use the confint() function to get the 95% Confidence Intervals from the model.
  • To get the confidence intervals on the natural scale we also take the exponent, exp(confint()).
  • We use our custom function, estimate_ci_95(), to see how well the results compare.
## get data
library(mlbench)
data("PimaIndiansDiabetes")
df <- PimaIndiansDiabetes

## turn outcome variable into a numeric (0 = negative for diabetes, 1 = positive for diabetes)
df$diabetes_num <- ifelse(df$diabetes == "pos", 1, 0)
head(df)

## simple model
diabetes_glm <- glm(diabetes_num ~ pregnant + glucose + insulin, data = df, family = "binomial")

## model summary
summary(diabetes_glm)

 

 

Calculate 95% CI from the p-values and odds ratio estimates.

Pregnant Coefficient

## 95% CI for the pregnant coefficient
estimate_ci_95(p_value = 2.11e-06, estimate = 0.122, log = TRUE, sig_digits = 3)

 

Glucose Coefficient

## 95% CI for the glucose coefficient
estimate_ci_95(p_value = 2e-16, estimate = 0.0375, log = TRUE, sig_digits = 3)

 

Insulin Coefficient

## 95% CI for the insulin coefficient
estimate_ci_95(p_value = 0.677, estimate = -0.0003, log = TRUE, sig_digits = 5)

 

Evaluate the results from the custom function to those calculated with the confint() function.

## Confidence Intervals on the Log Scale
confint(diabetes_glm)

## Confidence Intervals on the Natural Scale
exp(confint(diabetes_glm))

We get nearly the same results with a bit of rounding error!

Hopefully this function will be of use to some people as they read papers or abstracts.

You can find the complete code in a cleaned up markdown file on my GITHUB page.

Confidence Intervals for Random Forest Regression using tidymodels (sort of)

The random forest algorithm is an ensemble method that fits a large number of decision trees (weak learners) and uses their combined predictions, in a wisdom of the crowds type of fashion, to make the final prediction. Although random forest can be used for classification tasks, today I want to talk about using random forest for regression problems (problems where the variable we are predicting is a continuous one). Specifically, I’m not only interested in a single prediction but I also want to get a confidence interval for the prediction.

In R, the two main packages for fitting random forests are {ranger} and {randomForest}. These packages are also the two engines available when fitting random forests in {tidymodels}. When building models in the native packages, prediction on new data can be done with the predict() function (similar to all models in R). To get an estimate of the variation in predictions, we pass the predict function the argument predict.all = TRUE, which produces a prediction from each individual tree in the random forest. The problem is that this argument is not available for predict() in {tidymodels}. Consequently, all we are left with in {tidymodels} is making a point estimate prediction (the average value of all of the trees in the forest)!!
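
To see what that argument returns in the native package, here is a tiny, self-contained illustration (a throwaway model fit only to show the structure of the returned object; none of this is used later):

## Tiny illustration of predict.all = TRUE in {randomForest}
library(randomForest)
tmp_fit <- randomForest(mpg ~ wt + hp, data = mtcars, ntree = 25)
p <- predict(tmp_fit, mtcars[1:2, ], predict.all = TRUE)
str(p, max.level = 1)   # a list: $aggregate (mean prediction) and $individual (one column per tree)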

The way we can circumvent this issue is by fitting our model in {tidymodels} using cross-validation so that we can tune the mtry and trees values. Once we have the optimum values for these hyper-parameters we will use the {randomForest} package and build a new model using those values. We will then make our predictions with this model on new data.

NOTE: I’m not 100% certain this is the best way to approach this problem inside or outside of {tidymodels}. If someone has a better solution, please drop it into the comments section or shoot me an email!

Load Packages & Data

We will use the mtcars data set and try and predict the car’s mpg from the disp, hp, wt, qsec, drat, gear, and carb columns.

## Load packages
library(tidymodels)
library(tidyverse)
library(randomForest)

## load data
df <- mtcars %>%
  select(mpg, disp, hp, wt, qsec, drat, gear, carb)

head(df)

Split data into cross-validation sets

This is a small data set, so rather than spending data by splitting it into training and testing sets, I'm going to use cross-validation on all of the available data to fit the model and tune the hyper-parameters.

## Split data into cross-validation sets
set.seed(5)
df_cv <- vfold_cv(df, v = 5)

Specify the model type & build a tuning grid

The model type will be a random forest regression using the randomForest engine.

The tuning grid will be a vector of values for both mtry and trees to provide the model with options to try as it tunes the hyper-parameters.

## specify the random forest regression model
rf_spec <- rand_forest(mtry = tune(), trees = tune()) %>%
  set_engine("randomForest") %>%
  set_mode("regression")

## build a tuning grid
rf_tune_grid <- grid_regular(
  mtry(range = c(1, 7)),
  trees(range = c(500, 800)),
  levels = 5
)

Create a model recipe & workflow

## Model recipe
rf_rec <- recipe(mpg ~ ., data = df)

## workflow
rf_workflow <- workflow() %>%
  add_recipe(rf_rec) %>%
  add_model(rf_spec)

Fit and tune the model

## set a control function to save the predictions from the model fit to the CV-folds
ctrl <- control_resamples(save_pred = TRUE)

## fit model
rf_tune <- tune_grid(
  rf_workflow,
  resamples = df_cv,
  grid = rf_tune_grid,
  control = ctrl
)

rf_tune

View the model performance and identify the best model

## view model metrics
collect_metrics(rf_tune)

## Which is the best model?
select_best(rf_tune, "rmse")

## Look at that models performance
collect_metrics(rf_tune) %>%
  filter(mtry == 4, trees == 725)


Here we see that the model with the lowest root mean squared error (rmse) has an mtry = 4 and trees = 725.

Extract the optimal mtry and trees values for minimizing rmse

## Extract the best mtry and trees values to optimize rmse
m <- select_best(rf_tune, "rmse") %>% pull(mtry)
t <- select_best(rf_tune, "rmse") %>% pull(trees)

m
t

Re-fit the model using the optimal mtry and trees values

Now that we’ve identified the hyper-parameters that minimize rmse, we will re-fit the model using the {randomForest} package, so that we can get predictions for all of the trees, and specify the mtry and ntree values that were extracted from the {tidymodels} model within the function.

## Re-fit the model outside of tidymodels with the optimized values
rf_refit <- randomForest(mpg ~ ., data = df, mtry = m, ntree = t)
rf_refit

Create new data and make predictions

When making the predictions we have to make sure to pass the argument predict.all = TRUE.

## New data
set.seed(859)
row_id <- sample(1:nrow(df), size = 5, replace = TRUE)
newdat <- df[row_id, ]
newdat

## Make Predictions
pred.rf <- predict(rf_refit, newdat, predict.all = TRUE)
pred.rf

What do predictions look like?

Because we requested predict all, we have the ability to see a prediction for each of the 725 trees that were fit. Below we will look at the first and last 6 predictions of the 725 individual trees for the first observation in our new data set.

## Look at all 725 predictions for the first row of the data
head(pred.rf$individual[1, ])
tail(pred.rf$individual[1, ])

Compute a prediction and 95% Confidence Interval for the first observation

Taking the mean of the 725 predictions will produce the predicted value for the new observation, using the wisdom of the crowds. Similarly, the standard deviation of these 725 predictions will give us a sense for the variability of the weak learners. We can use this information to produce our confidence intervals. We calculate our confidence intervals as the standard deviation of predictions multiplied by the t-critical value, which we calculate from a t-distribution with the degrees of freedom equal to 725 - 1.

# Average prediction -- what the prediction function returns
mean(pred.rf$individual[1, ])

# SD of predictions
sd(pred.rf$individual[1, ])

# get t-critical value for df = 725 - 1
t_crit <- qt(p = 0.975, df = t - 1)

# 95% CI
mean(pred.rf$individual[1, ]) - t_crit * sd(pred.rf$individual[1, ])
mean(pred.rf$individual[1, ]) + t_crit * sd(pred.rf$individual[1, ])

Make a prediction with confidence intervals for all of the observations in our new data

First we will make a single point prediction (the average, wisdom of the crowds, prediction) and then we will write a for() loop to create the lower and upper 95% Confidence Intervals using the same approach as above.

## Now for all of the predictions
newdat$pred_mpg <- predict(rf_refit, newdat)

## add confidence intervals
lower <- rep(NA, nrow(newdat))
upper <- rep(NA, nrow(newdat))

for(i in 1:nrow(newdat)){
  lower[i] <- mean(pred.rf$individual[i, ]) - t_crit * sd(pred.rf$individual[i, ])
  upper[i] <- mean(pred.rf$individual[i, ]) + t_crit * sd(pred.rf$individual[i, ])
}

newdat$lwr <- lower
newdat$upr <- upper

View the new observations with their predictions and create a plot of the predictions versus the actual data

The three columns on the right show us the predicted miles per gallon and the 95% confidence interval for each of the five new observations.

The plot shows us the point prediction and confidence interval along with the actual mpg (in red), which we can see falls within each of the ranges.

## Look at the new observations, predctions and confidence intervals and plot the data
## new data
newdat

## plot
newdat %>%
  mutate(car_type = rownames(.)) %>%
  ggplot(aes(x = pred_mpg, y = reorder(car_type, pred_mpg))) +
  geom_point(size = 5) +
  geom_errorbar(aes(xmin = lwr, xmax = upr),
                width = 0.1,
                size = 1.3) +
  geom_point(aes(x = mpg),
             size = 5,
             color = "red") +
  theme_minimal() +
  labs(x = "Predicted vs Actual MPG",
       y = NULL,
       title = "Predicted vs Actual (red) MPG from Random Forest",
       subtitle = "mpg ~ disp + hp + wt + qsec + drat + gear + carb")

 

Wrapping Up

Random forests can be used for regression or classification problems. Here, we used the algorithm for regression with the goal of obtaining 95% Confidence Intervals based on the variability of predictions exhibited by all of the trees in the forest. Again, I’m not certain that this is the best way to achieve this output either inside of or outside of {tidymodels}. If anyone has other thoughts, feel free to drop them in the comments or shoot me an email.

To access all of the code for this article, please see my GITHUB page.

Making Predictions from Cross-Validated Workflow Using tidymodels

In yesterday’s post, I offered an approach to using {tidymodels} when you don’t want to split your data into training and testing sets, but rather, you want to fit all of your data using cross-validated folds and then save the model and deploy it later on with new data sets. Recall that the reason you’d want to do this is because you might not have enough data where you feel good about removing some of it for a testing set, ultimately decreasing the number of observations your model can learn from.

After that post, I got an email from someone asking how they could save the entire workflow for later deployment, as yesterday’s article only saved the model fit following cross-validation. Storing the workflow can be critical when you have data and/or a model that requires various preprocessing steps prior to making forecasts. One of the advantages of the {tidymodels} framework is the ability to combine the preprocessing tasks in one step and then fit the model all at once. This keeps your script nice and tidy and makes it easy to see what is taking place at each step.

The issue when working with just the cross-validated folds is that there is no training/testing split of the data to finalize the model on once you are done fitting it. In most {tidymodels} examples, model fitting is done with the last_fit() function, which requires a split data set generated via the initial_split() function. From there you can extract the workflow and save it for later deployment. Consequently, there are a few extra steps to make this work smoothly when saving a model that was fit using only cross-validated folds.
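
For contrast, the usual split-based pattern looks roughly like the sketch below (a self-contained toy example with a simple linear model on mtcars; the object names are mine and it is not used anywhere else in this post):

## Sketch of the standard split-based pattern for saving a deployable workflow
library(tidymodels)

set.seed(1)
split_obj <- initial_split(mtcars, prop = 0.75)

wf <- workflow() %>%
  add_formula(mpg ~ wt + hp) %>%
  add_model(linear_reg())

final_res <- last_fit(wf, split = split_obj)   # fits on training data, evaluates on testing data
final_wf <- extract_workflow(final_res)        # finalized workflow that could be saved and deployed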

So, to follow up on yesterday, I’ll walk through a random forest classification example where I’ll fit a model to cross-validation folds of the mtcars data set, I’ll save the entire recipe (where preprocessing takes place), I’ll save the model, and then I’ll show how you can use both the saved recipe and model to make predictions on a new data set. Additionally, to make things more interesting, I will tune the random forest model and show how to extract the tuned parameters and re-fit the model before saving it.

Load packages & data

### load packages
library(tidymodels)

### get data
df <- mtcars
head(df)
table(df$cyl)

df$cyl <- as.factor(df$cyl)

Create cross-validated folds & specify a random forest classifier

### create cross-validated folds (needed for tune_grid() below; 5 folds and the seed are assumed here)
set.seed(3344)
df_cv <- vfold_cv(df, v = 5)

### specify random forest model
rf_spec_with_tuning <- rand_forest(mtry = tune(), trees = tune()) %>%
  set_engine("randomForest", importance = TRUE) %>%
  set_mode("classification")

Build a tuning grid

We will allow {tidymodels} to optimize both mtry and the trees.

### build a tuning grid
rf_tune_grid <- grid_regular(
  mtry(range = c(1, 10)),
  trees(range = c(500, 800)),
  levels = 5
)

Create the model recipe

The mtcars data set is complete and has no missing values. But, that doesn’t mean that future data that we will be deploying the model on will be free from missing values. So, to be sure that we can handle this in the future, if we need to, I’m going to create an imputation step in the recipe that will use k-nearest neighbors (with the 3 nearest neighbors) to impute any NA values of the independent variables.

NOTE: I’m not saying this is the best imputation approach here. I’m simply creating an additional step in the model recipe that can be deployed later to show how it works.

### recipe -- set up imputation for new data that might have missing values
cyl_rec <- recipe(cyl ~ mpg + disp + wt, data = df) %>%
  step_impute_knn(mpg, disp, wt,
                  neighbors = 3)

Set up the workflow

Combine the preprocessing recipe and the random forest model, which still needs to be tuned, into a single workflow.

### workflow
cyl_wf <- workflow() %>%
  add_recipe(cyl_rec) %>%
  add_model(rf_spec_with_tuning)

Set up a control function for storing the model predictions on the cross-validated folds

### set a control function to save the predictions from the model fit to the CV-folds
ctrl <- control_resamples(save_pred = TRUE)

Tune the model parameters during the model fitting process

### fit model
cyl_tune <- tune_grid(
  cyl_wf,
  resamples = df_cv,
  grid = rf_tune_grid,
  control = ctrl
)

Get the model predictions from the cross-validated tuning

### get predictions
pred_group <- cyl_tune %>%
  unnest(cols = .predictions) %>%
  select(.pred_4, .pred_6, .pred_8, .pred_class, cyl) 

pred_group

table('predicted' = pred_group$.pred_class, 'observed' = pred_group$cyl)

 

Get the optimized values for mtry and trees

After tuning the model, we want to get the mtry and trees parameters that produced the best ROC/AUC, so we will pull those values out of our tuning grid, cyl_tune. In this instance, an mtry of 1 and 500 trees appear to be the optimal values.

NOTE: We could have extracted the mtry and trees parameters that optimized model accuracy instead.

# get the optimized numbers for mtry and trees
m <- select_best(cyl_tune, "roc_auc") %>% pull(mtry)
t <- select_best(cyl_tune, "roc_auc") %>% pull(trees)

m
t


Re-specify the model and re-fit the workflow

Since we are working with cross-validated samples and not a training/testing split, we can’t just use last_fit() or select the best fitted model directly, because there is no split data. To get the optimal model we need to actually re-specify the random forest model with the mtry and trees values we extracted above. So, we re-fit a new workflow with the optimized mtry and trees parameters to ensure that the tuned model is used for our final model fit.

# re-specify the model with the optimized values
rf_spec_tuned <-rand_forest(mtry = m, trees = t) %>%
  set_engine("randomForest", importance = TRUE) %>%
  set_mode("classification")

# re-set workflow
cyl_wf_tuned <- workflow() %>%
  add_recipe(cyl_rec) %>%
  add_model(rf_spec_tuned)

Extract & save the final recipe

Now that the tuned model has been fit we will need to extract the final recipe and then save it as an .RDA file so that we can load it for deployment later on. To do this, we use the extract_recipe() function after fitting the tuned model to our original data set.

# extract the final recipe for pre-processing of new data
cyl_final_rec <- cyl_wf_tuned %>%
  fit(df) %>%
  extract_recipe()

save(cyl_final_rec, file = "cyl_final_rec.rda")
load("cyl_final_rec.rda")
cyl_final_rec


Extract & save the final model fit

Once we have the recipe, which holds all of our preprocessing steps, we then need to extract the actual model fit so that we can make future predictions on new data.

# extract final model
cyl_final_tuned <- cyl_wf_tuned %>% 
  fit(df) %>%
  extract_fit_parsnip()

save(cyl_final_tuned, file = "cyl_final_tuned.rda")
load("cyl_final_tuned.rda")

cyl_final_tuned

Create a new data set and add missing values to some of the independent variables

Now that our recipe and model fit are saved we will create some new data and add some missing values to show how the impute function, which we created in the recipe, works.

We will also save the cyl value for this new data (the truth) so we can check our work once the model predictions are done. Prior to making predictions on this new data we will drop the cyl column from this new data set to make it look more realistic to what we would see in the real world (i.e., we’d never have the actual output we are trying to forecast).

### Create New Data with NAs
set.seed(947)
row_ids <- sample(x = 1:nrow(mtcars), size = 5, replace = TRUE)

df_new <-mtcars[row_ids, ]
df_new[2, 3] <- NA
df_new[c(1,5), 1] <- NA
df_new[3, 6] <- NA

# get the actual cyl values for this new data
truth <- df_new$cyl
truth

# drop the cyl column to pretend like this is new data
df_new$cyl <- NULL
df_new


Apply the recipe to the new data

Notice that we have some NAs in a few of the predictor columns (mpg, disp, and wt). We apply the recipe to the new data set by using the bake() function to impute those values, using information about the data that was gained during the model workflow we built back when we fit the model.

### Apply the pre-processing recipe to the new data
df_new_pre_processed <- cyl_final_rec %>%
  bake(new_data = df_new)

df_new_pre_processed


Make the final predictions using the saved model

Now that we have imputed NAs using the recipe we are ready to make cyl predictions on the new data. After making predictions we will combine the predicted class and the probability of each of the three classes with the new data set.

### Make a prediction for cyl
pred_cyl <- predict(cyl_final_tuned, new_data = df_new_pre_processed, type = "class")
df_new_pre_processed <- cbind(df_new_pre_processed, pred_cyl)

### get probability of each class
pred_probs <- predict(cyl_final_tuned, new_data = df_new_pre_processed, type = "prob")
df_new_pre_processed <- cbind(df_new_pre_processed, pred_probs)
df_new_pre_processed

See how well the model predicted on this new data, even after having to impute some values for each observation.

table('predicted' = df_new_pre_processed$.pred_class, 'actual' = truth)


The model predicted all 5 new observations correctly.

And that’s it! Those are the steps to follow for using all of your data to fit and tune a model with cross-validated folds, save the preprocessing steps and tuned model, and then apply it to a new set of observations.

All of the code for this blog article is available on my GITHUB page.