Category Archives: Sports Science

Deriving a Confidence Interval from an Estimate and a p-value

Although most journals require authors to include confidence intervals in their papers, it isn't mandatory for every journal (for some it is merely a recommendation). Additionally, there may be occasions when you are reading an older paper from a time when this mandate/recommendation was not enforced. Finally, sometimes abstracts (due to word count limits) only present p-values and estimates, in which case you might want to quickly obtain the confidence intervals to help organize your thoughts before diving into the paper. In these instances, you might wonder how to get a confidence interval around the observed effect when all you have is an estimate and a p-value.

Bland and Altman wrote a short piece on deriving a confidence interval from only an estimate and a p-value in a 2011 paper in the British Medical Journal:

Altman, DG. Bland, JM. (2011). How to obtain the confidence interval from a p-value. BMJ, 343: 1-2.

Before going through the approach, it is important to note the limitation the authors point out: the method won't be as accurate in smaller samples, but it works well in larger studies (roughly 60 subjects or more).

The Steps

The authors list three steps to derive the confidence interval from an estimate and a p-value, plus an additional consideration for ratio statistics (item 4 below):

  1. Calculate the test statistic for a normal distribution from the p-value.
  2. Calculate the standard error (ignore the minus sign).
  3. Calculate the 95% CI using the standard error and the z-critical value for the desired level of confidence.
  4. When applying this approach to a ratio (e.g., risk ratio, odds ratio, hazard ratio), use the formulas with the estimate on the log scale (if it isn't already) and then exponentiate (antilog) the confidence limits to return the results to the natural scale.

Calculating the test statistic

To calculate the test statistic use the following formula:

z = -0.862 + sqrt(0.743 - 2.404 * log(p.value))

where log is the natural logarithm.
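
As a quick sanity check (my own aside, not from the paper), this approximation should land very close to the exact two-sided normal quantile, qnorm(1 - p/2), in R:

p_value <- 0.032

## Altman & Bland approximation (log() is the natural log)
z_approx <- -0.862 + sqrt(0.743 - 2.404 * log(p_value))

## exact two-sided z-value for the same p-value
z_exact <- qnorm(1 - p_value / 2)

z_approx   # ~2.14
z_exact    # ~2.14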

Calculating the standard error

To calculate the standard error use the following formula (remember that we are ignoring the sign of the estimate):

se = abs(estimate) / z

If we are dealing with a ratio, make sure that you are working on the log scale:

se = abs(log(estimate)) / z

Calculating the 95% Confidence Limits

Once you have the standard error, the 95% Confidence Limits can be calculated by multiplying the standard error by the z-critical value of 1.96:

CL.95 = 1.96 * se

From there, the 95% Confidence Interval can be calculated:

low95 = Estimate - CL.95
high95 = Estimate + CL.95

Remember, if you are working with a ratio statistic and you want the confidence interval on the natural scale, you will need to take the antilog:

low95 = exp(Estimate - CL.95)
high95 = exp(Estimate + CL.95)


Writing a function

To make this simple, I'll write everything into a function. The function requires two pieces of information, which you will need to obtain from the paper:

  1. p-value
  2. The estimate (e.g., difference in means, risk ratio, odds ratio, hazard ratio, etc)

The function will default to log = FALSE, but if you are working with a ratio statistic you can change the argument to log = TRUE to get the results on both the log and natural scales. The function also takes a sig_digits argument, which defaults to 3 but can be changed depending on how many significant digits you need.

estimate_ci_95 <- function(p_value, estimate, log = FALSE, sig_digits = 3){
  
  ## test statistic for a normal distribution, approximated from the p-value
  z <- -0.862 + sqrt(0.743 - 2.404 * log(p_value))
  
  ## standard error (ignoring the sign of the estimate)
  se <- abs(estimate) / z
  
  ## 95% confidence limit
  cl <- 1.96 * se
  
  if(log == FALSE){
    
    low95 <- estimate - cl
    high95 <- estimate + cl
    
    list('Standard Error' = round(se, sig_digits),
         '95% CL' = round(cl, sig_digits),
         '95% CI' = paste(round(low95, sig_digits), round(high95, sig_digits), sep = ", "))
    
  } else {
    
    low95_log_scale <- estimate - cl
    high95_log_scale <- estimate + cl
    
    low95_natural_scale <- exp(low95_log_scale)
    high95_natural_scale <- exp(high95_log_scale)
    
    list('Standard Error (log scale)' = round(se, sig_digits),
         '95% CL (log scale)' = round(cl, sig_digits),
         '95% CL (natural scale)' = round(exp(cl), sig_digits),
         '95% CI (log scale)' = paste(round(low95_log_scale, sig_digits), round(high95_log_scale, sig_digits), sep = ", "),
         '95% CI (natural scale)' = paste(round(low95_natural_scale, sig_digits), round(high95_natural_scale, sig_digits), sep = ", "))
    
  }
  
}

Test the function out

The paper provides two examples, one for a difference in means and the other for risk ratios.

Example 1

Example 1 states:

“the abstract of a report of a randomised trial included the statement that ‘more patients in the zinc group than in the control group recovered by two days (49% v 32%, P=0.032).’ The difference in proportions was Est = 17 percentage points, but what is the 95% confidence interval (CI)?”

estimate_ci_95(p_value = 0.032, estimate = 17, log = FALSE, sig_digits = 1)


Example 2

Example 2 states:

“the abstract of a report of a cohort study includes the statement that “In those with a [diastolic blood pressure] reading of 95-99 mm Hg the relative risk was 0.30 (P=0.034).” What is the confidence interval around 0.30?”

Here we change the argument to log = TRUE since this is a ratio statistic and needs to be on the log scale. Note that the estimate is also passed on the log scale, log(0.3).

estimate_ci_95(p_value = 0.034, estimate = log(0.3), log = TRUE, sig_digits = 2)

Try the approach out on a different data set to confirm the confidence intervals are calculated properly

Below, we build a simple logistic regression model for the PimaIndiansDiabetes data set from the {mlbench} package.

  • The model coefficients are log odds ratios (already on the log scale), so we set the argument log = TRUE to ensure we get results reported back to us on the natural scale as well.
  • We use the summary() function to obtain the model estimates and p-values.
  • We use the confint() function to get the 95% Confidence Intervals from the model.
  • To get the confidence intervals on the natural scale we also take the exponent, exp(confint()).
  • We use our custom function, estimate_ci_95(), to see how well the results compare.
## get data
library(mlbench)
data("PimaIndiansDiabetes")
df <- PimaIndiansDiabetes

## turn outcome variable into a numeric (0 = negative for diabetes, 1 = positive for diabetes)
df$diabetes_num <- ifelse(df$diabetes == "pos", 1, 0)
head(df)

## simple model
diabetes_glm <- glm(diabetes_num ~ pregnant + glucose + insulin, data = df, family = "binomial")

## model summary
summary(diabetes_glm)


Calculate the 95% CI from the p-values and the model coefficients (log odds ratio estimates).

Pregnant Coefficient

## 95% CI for the pregnant coefficient
estimate_ci_95(p_value = 2.11e-06, estimate = 0.122, log = TRUE, sig_digits = 3)


Glucose Coefficient

## 95% CI for the glucose coefficient
estimate_ci_95(p_value = 2e-16, estimate = 0.0375, log = TRUE, sig_digits = 3)


Insulin Coefficient

## 95% CI for the insulin coefficient
estimate_ci_95(p_value = 0.677, estimate = -0.0003, log = TRUE, sig_digits = 5)


Compare the results from the custom function to those calculated with the confint() function.

## Confidence Intervals on the Log Scale
confint(diabetes_glm)

## Confidence Intervals on the Natural Scale
exp(confint(diabetes_glm))

We get nearly the same results with a bit of rounding error!

Hopefully this function will be of use to some people as they read papers or abstracts.

You can find the complete code in a cleaned up markdown file on my GITHUB page.

If the examples we see in textbooks don’t represent the real world, what about the sports science papers we read?

I enjoyed this short article by Andrew Gelman on the blog he keeps with several of his colleagues. The main point, which I agree 100% with, is that the examples in our stats and data analysis textbooks never seem to match what we see in the real world. The examples always seem to work! The data is clean and looks perfectly manicured for the analysis. I get it! The idea is to convey the concept of how different analytical approaches work. The rub is that once you get to the real world and look at your data you end up being like, “Uh. Wait….what is this? What do I do now?!”

The blog got me thinking about something else, though. Something that really frustrates me. If the examples we see in textbooks don’t reflect the data problems we face in the real world, what about the examples we read about in applied sport science research? How much do those examples reflect what we see in the real world?

At the risk of upsetting some colleagues, I’ll go ahead and say it:

I’m not convinced that the research we read in publications completely represents the real world either!

How can that be? This is applied science! Isn’t the real world THE research?

Well, yes and no. Yes, the data was collected in the real world, with real athletes, and in a sport setting. But often, reading a paper, looking at the aim and the conclusions, and then parsing through the methods section to see how they handled the data leaves me scratching my head.

My entire day revolves around looking at data and I can tell you: the real world is very messy.

  • How things get collected
  • How things get saved
  • How things get loaded
  • How things get logged

There are potential hiccups all along the way, no matter how stringent you are in trying to keep sound data collection practices!

In research, the problem is that the data is often touched up in some manner to create an analysis sufficient for publication. Missing values need to be handled a certain way (sometimes those rows get dropped, sometimes values get imputed), class imbalance can be an issue, errant values and outliers from technology flub-ups are a very real thing, data entry issues arise, etc. These things are all problematic and, if not identified prior to analysis, can be a major issue with the findings. I do believe that most people recognize these problems and would agree with me that they are very real issues. However, it is less about knowing that there are problems but rather, figuring out what to do about them. Often the methods sections gloss over these details (I get it, word counts for journals can be a pain in the butt) and simply produce a result that, on paper at least, seems overly optimistic. As I read through the results section, without details about data processing, I frequently say to myself, “No way. There is no way this effect is as real as they report. I can’t reproduce this without knowing how they cleaned up their data to observe this effect.”

Maybe someone should post a paper about how crappy their data is in the applied setting? Maybe we should just be more transparent about the data cleaning processes we go through so that we aren’t overly bullish on our findings and more realistic about the things we can say with confidence in the applied setting?

Does anyone else feel this way? Maybe I’m wrong and I’m being pessimistic, and this isn’t as big of an issue as I believe it to be? Is the data we see in publication truly representative of the real world?

Weakley et al. (2022). Velocity-Based Training: From Theory to Application – R Workbook

Velocity-based training (VBT) is a method employed by strength coaches to prescribe training intensity and volume based on an individual athlete’s load-velocity profile. I discussed VBT last year when I used {shiny} to build an interactive web application for visualizing and comparing athlete outputs.

Specific to this topic, I recently read the following publication: Weakley, Mann, Banyard, McLaren, Scott, and Garcia-Ramos. (2022). Velocity-based training: From theory to application. Strength Cond J; 43(2): 31-49.

The paper aimed to provide some solutions for analyzing, visualizing, and presenting feedback around training prescription and performance improvement when using VBT. I enjoyed the paper and decided to write an R Markdown file to provide code that can accompany it and (hopefully) assist strength coaches in applying some of the concepts in practice. I’ll summarize some notes and thoughts below, but if you’d like to read the full R Markdown file that explains and codes all of the approaches in the paper, CLICK HERE>> Weakley–2021—-Velocity-Based-Training—From-Theory-to-Application—Strength-Cond-J.

If you’d like the CODE and DATA to run the analysis yourself, they are available on my GitHub page.

Paper/R Markdown Overview

Technical Note: I don’t have the actual data from the paper. Therefore, I took a screen shot of Figure 3 in the text and used an open source web application for extracting data from figures in research papers. This requires me to go through and manually click on the points of the plot itself. Consequently, I’m not 100% perfect, so there may be subtle differences in my data set compared to what was used for the paper.

The data used in the paper reflect 17 weeks of mean concentric velocity (MCV) in the 100-kg back squat for a competitive powerlifter, tested once a week. The two main figures that I will recreate, along with the analysis, are Figure 3 and Figure 5.

Figure 3 is a time series visual of the athlete while Figure 5 provides an analysis and visual for the athlete’s change across the weeks in the training phase.

Figure 3

The first 10 weeks represent the maintenance phase for the athlete, which was followed by a 7-week training phase. The maintenance phase sessions were used to build a linear regression model, which was then used to visualize the athlete’s change over time along with a corresponding confidence interval around each MCV observation. The model output looks like this:

The standard (typical) error was used to calculate confidence intervals around the observations. To calculate the standard error, the authors recommend one of two approaches:

1) If you have group-based test-retest data, they recommend taking the difference between the test-retest outcomes and calculating the standard error as follows:

  • SE.group = sd(differences) / sqrt(2)

2) If you have individual observations, they recommend calculating the standard error like this:

  • SE.individual = sqrt(sum.squared.residuals / (n - 2))

Since we have individual athlete data, we will use the second option, along with the t-critical value for an 80% CI, to produce Figure 3 from the paper:
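
Here is a minimal sketch of that calculation in R. This is my own code rather than anything provided in the paper, and the vbt data frame below uses arbitrary placeholder values; swap in your own weekly MCV data (or values digitized from Figure 3).

## hypothetical weekly MCV data: weeks 1-10 = maintenance phase, weeks 11-17 = training phase
set.seed(2022)
vbt <- data.frame(
  week = 1:17,
  mcv = c(rnorm(10, mean = 0.80, sd = 0.03), rnorm(7, mean = 0.84, sd = 0.03))
)

## linear regression built on the maintenance phase only
baseline <- subset(vbt, week <= 10)
fit_baseline <- lm(mcv ~ week, data = baseline)

## individual standard (typical) error: sqrt(sum of squared residuals / (n - 2))
## (equivalent to summary(fit_baseline)$sigma)
n <- nrow(baseline)
se_individual <- sqrt(sum(resid(fit_baseline)^2) / (n - 2))

## t-critical value for an 80% CI and the interval around each observed MCV
t_crit80 <- qt(0.90, df = n - 2)
vbt$low80  <- vbt$mcv - t_crit80 * se_individual
vbt$high80 <- vbt$mcv + t_crit80 * se_individual

vbt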

The plot provides a nice visual of the athlete over time. We see that, because the linear model is calculated for the maintenance phase, as time goes on, the shaded standard error region gets wider. The confidence intervals around each point estimate are there to encourage us to think past just a point estimate and recognize that there is some uncertainty in every test outcome that cannot be captured in a single value.

Figure 5

This figure visualizes the change in squat velocity for the powerlifter in weeks 11-17 (the training phase) relative to the mean squat velocity from the maintenance phase, which represents the athlete’s baseline performance.

Producing this plot requires five pieces of information:

  1. The baseline average for the maintenance phase
  2. The difference between the observed MCV in each training week and the maintenance average
  3. The t-critical value for the 90% CI
  4. The lower 90% confidence limit
  5. The upper 90% confidence limit

Obtaining this information allows us to produce the following table of results and figure:
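
Continuing the sketch from Figure 3 (my own code, using the same hypothetical vbt data frame and the se_individual value calculated above), those five pieces could be computed like this:

## 1) baseline average for the maintenance phase
baseline_avg <- mean(vbt$mcv[vbt$week <= 10])

## 2) difference between each training-phase MCV and the baseline average
training <- subset(vbt, week >= 11)
training$diff <- training$mcv - baseline_avg

## 3) t-critical value for a 90% CI (using the baseline model's degrees of freedom)
t_crit90 <- qt(0.95, df = sum(vbt$week <= 10) - 2)

## 4) and 5) lower and upper 90% confidence limits around each weekly difference
training$low90  <- training$diff - t_crit90 * se_individual
training$high90 <- training$diff + t_crit90 * se_individual

training[, c("week", "mcv", "diff", "low90", "high90")]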

Are the changes meaningful?

One thing the authors mention in the paper is a couple of approaches to evaluating whether the observed changes are meaningful. They recommend using either equivalence tests or second-generation p-values. However, they don’t go into calculating such things on their data. I honestly am not familiar with the latter option, so I’ll instead create an example of using an equivalence test for the data and show how we can color the points within the plot to represent their meaningfulness.

Equivalence testing has been discussed by Daniel Lakens and colleagues in their tutorial paper: Lakens, D., Scheel, AM., & Isager, PM. (2018). Equivalence testing for psychological research: A tutorial. Advances in Methods and Practices in Psychological Science, 1(2): 259-269.

Briefly, equivalence testing uses one-sided t-tests to evaluate whether the observed effect is larger or smaller than a pre-specified range of values surrounding the effect of interest, termed the smallest effect size of interest (SESOI).

In our above plot, we can consider the shaded range of values around 0 (-0.03 to 0.03, NOTE: The value 0.03 was provided in the text as the meaningful change for this athlete to see an ~1% increase in his 1-RM max) as the region where an observed effect would not be deemed interesting. Outside of those ranges is a change in performance that we would be most interested in. In addition to being outside of the SESOI region, the observed effect should be substantially large enough relative to the standard error around each point, which we calculated from our regression model earlier.
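
To make that concrete, here is one way the two one-sided tests could be set up. This is a sketch of my own (the paper does not walk through this calculation), reusing the hypothetical training data frame and se_individual from the sketches above and a SESOI of ±0.03:

sesoi <- 0.03
df_baseline <- sum(vbt$week <= 10) - 2

## two one-sided t-tests (TOST) against the SESOI boundaries for each weekly change
t_lower <- (training$diff - (-sesoi)) / se_individual   # H0: change <= -SESOI
t_upper <- (training$diff - sesoi) / se_individual      # H0: change >= +SESOI

p_lower <- pt(t_lower, df = df_baseline, lower.tail = FALSE)
p_upper <- pt(t_upper, df = df_baseline, lower.tail = TRUE)

## equivalence p-value = the larger of the two one-sided p-values;
## a small value suggests the change sits inside the SESOI band (i.e., not a meaningful change)
training$p_equivalence <- pmax(p_lower, p_upper)
training[, c("week", "diff", "p_equivalence")]

These p-values can then be mapped onto the point colors in the figure.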

Putting all of this together, we obtain the same figure as above, but now with the points colored according to the p-value from our equivalence test:

Wrapping Up

Again, if you’d like the full markdown file with code (click the ‘code’ button to display each code chunk) CLICK HERE >> Weakley–2021—-Velocity-Based-Training—From-Theory-to-Application—Strength-Cond-J

There are always a number of ways that analysis can unfold and provide valuable insights and this paper reflects just one approach. As with most things, I’m left with more questions than answers.

For example, in Figure 3, I’m not sure linear regression is the best approach. As we can see, the grey shaded region increases in width over time because time is on the x-axis (independent variable) and the model was built on a small portion (the first 10 weeks) of the data. As such, with every subsequent week, uncertainty gets larger. How long would one continue to use the baseline model? At some point, the grey shaded region would be so wide that it would probably be useless. Are we to believe that the baseline model is truly representative of the athlete’s baseline? What if the baseline phase contained some amount of trend? How would the model then be used to quantify whatever takes place in the training phase? Maybe training isn’t linear? Maybe there is other conditional information that could be used?

In Figure 5, I wonder about the equivalence testing used in this single observation approach. I’ve generally thought of equivalence testing as a method comparing groups to determine if the effect from an intervention in one group is larger or smaller than the SESOI. Can it really work in an example like this, for an individual? I’m not sure. I need to think about it a bit. Maybe there is a different way such an analysis could be conceptualized? A lot of these issues come back to the problem of defining the baseline or some group of comparative observations that we are checking our most recent observation against.

My ponderings aside, I enjoyed the paper and the attempt to provide practitioners with some methods for delivering feedback when using VBT.

Issues with ‘Black Box’ Machine Learning Models in Injury Prediction

Injury prediction models developed using machine learning approaches have become more common due to the substantial rise of proprietary software in the sports science and sports medicine space. However, such ‘black box’ approaches are not without limitation. Aside from a lack of transparency, which prevents independent evaluation of model performance, these types of models are challenging to interpret, making it difficult for practitioners who are required to make decisions about athlete health and plan interventions.

I recently had the pleasure of working on a paper headed up by Garrett Bullock and a list of wonderful co-authors where we discuss some of these issues:

Black Box Prediction Methods in Sports Medicine Deserve a Red Card for Reckless Practice: A Change of Tactics is Needed to Advance Athlete Care. Sports Med.

Bayesian Updating of Reference Ranges for Serial Measurements

Introduction

The collection of serial measurements on athletes across a season (or multiple seasons) is one of the more common types of data being generated in the applied sport science environment. The question that coaches and practitioners often have is, “Is this player outside of their ‘normal range’?”

The best approach for establishing a reference range of ‘normal’ values is a frequently discussed topic in sport science. One common strategy is to use z-scores and represent the reference range as 1 standard deviation above or below the mean (Figure A) or, for practitioners who might have a difficult time interpreting standardized scores, to plot the raw values and set the reference range 1 standard deviation above or below the raw mean (Figure B). Of course, the mean and standard deviation are then calculated from all prior values, so if the athletes go through a training phase with substantially higher values than other phases (e.g., training camp) it could skew your reference ranges. To alleviate this issue, some choose to use a rolling mean and standard deviation to represent the normal range of values relative to more recent training sessions (Figure C).
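
As a quick illustration of those three options (my own sketch with made-up training load data, not data from any of the papers discussed here):

## made-up daily training loads for one athlete
set.seed(55)
monitoring <- data.frame(
  session = 1:30,
  load = round(rnorm(30, mean = 450, sd = 50), 0)
)

## Figures A & B style: z-scores and a reference range of the mean +/- 1 SD of all values
monitoring$z <- (monitoring$load - mean(monitoring$load)) / sd(monitoring$load)
monitoring$low_all  <- mean(monitoring$load) - sd(monitoring$load)
monitoring$high_all <- mean(monitoring$load) + sd(monitoring$load)

## Figure C style: rolling 10-session mean +/- 1 SD
roll_n <- 10
roll_stat <- function(x, i, FUN) if(i < roll_n) NA else FUN(x[(i - roll_n + 1):i])
monitoring$roll_mean <- sapply(seq_along(monitoring$load), function(i) roll_stat(monitoring$load, i, mean))
monitoring$roll_sd   <- sapply(seq_along(monitoring$load), function(i) roll_stat(monitoring$load, i, sd))
monitoring$low_roll  <- monitoring$roll_mean - monitoring$roll_sd
monitoring$high_roll <- monitoring$roll_mean + monitoring$roll_sd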

A problem with the approaches above is that they require a number of training sessions before a mean and standard deviation can be determined for the individual athlete. One solution to this issue is to base our initial normal reference ranges on prior knowledge that we have from collecting data on players in previous seasons (or prior knowledge from research papers, if we don’t have data of our own yet). This type of Bayesian updating approach has been applied in WADA’s drug testing practices (1). More recently, Hecksteden et al. used this approach to evaluate the CK levels of team-sport athletes in both fatigued and non-fatigued states (2).

The mathematics of the approach was presented in the paper but might look intimidating to those not used to looking at mathematical equations in this manner.

The authors provided a nice excel sheet where you can input your own data and get the updated reference ranges. However, it is a protected sheet, which doesn’t afford the opportunity to see how the underlying equations work, and you can’t alter the sheet to make it appropriate for your data (for example, the sheet log transforms the raw data automatically). Thus, I’ve decided to code the analysis out, both in excel and R, to help practitioners looking to adopt this approach.

Setting Priors

To apply this type of analysis, we need to first establish prior values for three parameters: the prior mean (mu), the prior standard deviation (tau), and a prior repeated-measures standard deviation (sigmaRM). These values represent our current knowledge of the variable we are measuring before seeing any new data. As new data is collected, we can update these priors to get an individual (posterior) estimate for the athlete. I’ll use the priors set by Hecksteden and colleagues for the CK levels of male athletes (on the log scale):

  • Mu = 5.527
  • Tau = 0.661
  • sigmaRM = 0.504

Once we have established our prior parameters, we are ready to update them, using the math equations above, as new data comes in.

Bayesian Updating in Excel

The excel sheet is available at my GitHub page. It looks like this:

All of the heavy lifting occurs in the two columns under the header Bayesian Updating (Log Scale). The equation in the first row (Test 1) is different from the other equations below it because it requires the prior information to get going. After that first test, the updated values become the prior for the next test, and this continues for all tests going forward. You can download the excel sheet and see how the equations work, so I won’t go through them here. Instead, I’ll show them more clearly in the R script below.

Bayesian Updating in R

We first need to convert the math equations provided in the paper (posted above) into R code. Rather than leaving things in mathematical notation, I’ll plug in the variables in plain English:

To be clear, here are the definitions for the variables above:

Now that we know the variables we need for each equation, we can begin the process of updating our reference ranges.

First, create a data set of the test observations and their log values. This will be the same data we observed in our excel sheet:

Then we set our priors (in log format):

## priors
prior_mu <- 5.527
prior_sd <- 0.661
prior_repeated_measure_sd <- 0.504


We will start by seeing how the updating works for the mean and standard deviation parameters after the first test. To do this, we will create a function for each parameter (mean and standard deviation) that updates the priors with the observed values based on the above equations:


posterior_mu <- function(prior_mu, prior_sd, prior_repeated_measure_sd, obs_value){
  
  numerator <- prior_repeated_measure_sd^2 * prior_mu + prior_sd^2 * obs_value
  denominator <- prior_repeated_measure_sd^2 + prior_sd^2
  
  post_mu <- numerator / denominator
  return(post_mu)
  
  }

posterior_sd <- function(prior_repeated_measure_sd, prior_sd, test_num){
  
  post_var <- 1 / ((test_num - 1 + 1) * 1/prior_repeated_measure_sd^2 + 1/prior_sd^2) 
  post_sd <- sqrt(post_var)
  return(post_sd)
  
}

After running the functions on the observations of our first test, our updated mean and standard deviation are:
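
In code, that call might look like the following (my own illustration of how to use the two functions above; df is the data set of test observations created earlier, and test_1 corresponds to test_num = 1):

## posterior mean and SD (log scale) after observing the first test
mu_test1 <- posterior_mu(prior_mu = prior_mu,
                         prior_sd = prior_sd,
                         prior_repeated_measure_sd = prior_repeated_measure_sd,
                         obs_value = df$ln_value[1])

sd_test1 <- posterior_sd(prior_repeated_measure_sd = prior_repeated_measure_sd,
                         prior_sd = prior_sd,
                         test_num = 1)

mu_test1
sd_test1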


Notice that we obtain the same values that we see following test_1 in our excel workbook. We can also calculate 95% confidence intervals and take the exponent (since the data is on a log scale) to get the individual athlete’s updated reference range on the raw scale:
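
Continuing the sketch above, that back-transform is simply:

## 95% reference range after test 1, exponentiated back to the raw scale
low95_test1  <- exp(mu_test1 - 1.96 * sd_test1)
high95_test1 <- exp(mu_test1 + 1.96 * sd_test1)

c(low95_test1, high95_test1)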

Again, these results confirm the values we see in our excel workbook.

That’s cool and all, but we need to be able to iteratively update the data set as new data comes in. Let’s write a for() loop!

First, we create a new column in the data that provides us with the updated standard deviation after observing the results in each test. This is a necessary first step as we will use this value to then update the mean value.


library(dplyr)

## Calculate the updated SD based on sample size
df2 <- df %>%
  mutate(bayes_sd = sqrt(1 / ((test - 1 + 1) * 1 / prior_repeated_measure_sd^2 + 1 / prior_sd^2)))

df2

Next, we need a for() loop. This is a bit tricky because test_1 is updating based solely on the priors while all other tests (test_2 to test_N) will be updating based on the mean and standard deviation in the row above them. Thus, we need to have our for() loop look back at the previous row above it once those values are calculated. This sort of thing is easy to set up in excel but in R (or Python) we need to think about how to code our row indexes. I covered this sort of iterative row computing in two previous articles HERE and HERE.

Within the loop we first set up vectors for the prior variance (sd^2), the denominator in our equation, and the log transformed observations from our data set. Then, we calculate the updated posterior for the mean (mu) on each pass through the loop, each time using the value preceding it in the vector, [i – 1], to allow us to iteratively update the data.

Once we run the loop, we add the results to our data set (removing the first observation in the vector since that was the original prior before seeing any data):


# Create a vector to store results
N <- length(df2$ln_value) + 1
bayes_mu <- c(prior_mu, rep(NA, N - 1))


## For loop
for(i in 2:N){
  
  ## Set up vectors for the variance, denominator, and newly observed values
  prior_var <- c(prior_sd^2, df2$bayes_sd^2)
  denominator <- prior_repeated_measure_sd^2 + prior_var
  vals <- df2$ln_value
  
  ## calculate bayesian updated mu
  bayes_mu[i] <- (prior_repeated_measure_sd^2 * bayes_mu[i-1] + prior_var[i-1] * vals[i-1]) / denominator[i-1]
    
}

df2$bayes_mean <- bayes_mu[-1]
df2

The two columns, bayes_sd and bayes_mean, contain our updated (posterior) values, and they are the exact same results we obtained in our excel workbook.

To use these updated parameters for creating individual athlete reference ranges, we calculate the 95% Confidence Intervals:

NOTE: I added a row at the start of the data frame to establish the priors, before seeing any data, so that they could also be plotted as part of the reference ranges.

### Confidence Intervals
first_prior <- data.frame(test = 0, value = NA, ln_value = NA, bayes_sd = prior_sd, bayes_mean = prior_mu)

df2 <- df2 %>%
  bind_rows(first_prior) %>%
  arrange(test)

## Exponentiate back to get the reference range
df2$low95 <- exp(df2$bayes_mean - 1.96*df2$bayes_sd)
df2$high95 <- exp(df2$bayes_mean + 1.96*df2$bayes_sd)
df2

Finally, we plot the observations along with the continually updated reference ranges. You can clearly see how large the normal range is before seeing any data (test 0) and then how quickly this range begins to shrink once we start observing data from the individual.
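
A bare-bones version of that plot could be built with ggplot2 (my own code; the figure in the original post is more polished):

library(ggplot2)

ggplot(df2, aes(x = test)) +
  geom_ribbon(aes(ymin = low95, ymax = high95), fill = "grey80") +
  geom_line(aes(y = value), na.rm = TRUE) +
  geom_point(aes(y = value), size = 3, na.rm = TRUE) +
  labs(x = "Test", y = "CK", title = "Bayesian updated individual reference range")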


To access the R code and the excel workbook please visit my GitHub page.

References

  1. Sottas PE, et al. (2007). Bayesian detection of abnormal values in longitudinal biomarkers with application to T/E ratio. Biostatistics; 8(2): 285-296.
  2. Hecksteden et al. (2017). A new method to individualize monitoring of muscle recovery in athletes. Int J Sport Phys Perf; 12: 1137-1142.