# If the examples we see in textbooks don’t represent the real world, what about the sports science papers we read?

I enjoyed this short article by Andrew Gelman on the blog he keeps with several of his colleagues. The main point, which I agree 100% with, is that the examples in our stats and data analysis textbooks never seem to match what we see in the real world. The examples always seem to work! The data is clean and looks perfectly manicured for the analysis. I get it! The idea is to convey the concept of how different analytical approaches work. The rub is that once you get to the real world and look at your data you end up being like, “Uh. Wait….what is this? What do I do now?!”

The blog got me thinking about something else, though. Something that really frustrates me. If the examples we see in textbooks don’t reflect the data problems we face in the real world, what about the examples we read about in applied sport science research? How much do those examples reflect what we see in the real world?

At the risk of upsetting some colleagues, I’ll go ahead and say it:

I’m not convinced that the research we read in publications completely represents the real world either!

How can that be? This is applied science! Isn’t the real word THE research?

Well, yes and no. Yes, the data was collected in the real world, with real athletes, and in a sport setting. But often, reading a paper, looking at the aim and the conclusions, and then parsing through the methods section to see how they handled the data leaves me scratching my head.

My entire day revolves around looking at data and I can tell you; the real world is very messy.

• How things get collected
• How things get saved
• How things get logged

There are potential hiccups all along the way, no matter how stringent you are in trying to keep sound data collection practices!

In research, the problem is that the data is often touched up in some manner to create an analysis sufficient for publication. Missing values need to be handled a certain way (sometimes those rows get dropped, sometimes values get imputed), class imbalance can be an issue, errant values and outliers from technology flub-ups are a very real thing, data entry issues arise, etc. These things are all problematic and, if not identified prior to analysis, can be a major issue with the findings. I do believe that most people recognize these problems and would agree with me that they are very real issues. However, it is less about knowing that there are problems but rather, figuring out what to do about them. Often the methods sections gloss over these details (I get it, word counts for journals can be a pain in the butt) and simply produce a result that, on paper at least, seems overly optimistic. As I read through the results section, without details about data processing, I frequently say to myself, “No way. There is no way this effect is as real as they report. I can’t reproduce this without knowing how they cleaned up their data to observe this effect.”

Maybe someone should post a paper about how crappy their data is in the applied setting? Maybe we should just be more transparent about the data cleaning processes we go through so that we aren’t overly bullish on our findings and more realistic about the things we can say with confidence in the applied setting?

Does anyone else feel this way? Maybe I’m wrong and I’m being pessimistic, and this isn’t as big of an issue as I believe it to be? Is the data we see in publication truly representative of the real world?

# Weakley et al. (2022). Velocity-Based Training: From Theory to Application – R Workbook

Velocity-based training (VBT) is a method employed by strength coaches to prescribe training intensity and volume based off of an individual athlete’s load-velocity profiles. I discussed VBT last year when I used {shiny} to build an interactive web application for visualizing and comparing athlete outputs.

Specific to this topic, I recently read the following publication: Weakley, Mann, Banyard, McLaren, Scott, and Garcia-Ramos. (2022). Velocity-based training: From theory to application. Strength Cond J; 43(2): 31-49.

The paper aimed to provide some solutions for analyzing, visualizing, and presenting feedback around training prescription and performance improvement when using VBT. I enjoyed the paper and decided to write an R Markdown file to provide code that can accompany it and (hopefully) assist strength coaches in applying some of the concepts in practice. I’ll summarize some notes and thoughts below, but if you’d like to read the full R Markdown file that explains and codes all of the approaches in the paper, CLICK HERE>> Weakley–2021—-Velocity-Based-Training—From-Theory-to-Application—Strength-Cond-J.

If you’d like the CODE and DATA to run the analysis yourself, they are available on my GitHub page.

Paper/R Markdown Overview

Technical Note: I don’t have the actual data from the paper. Therefore, I took a screen shot of Figure 3 in the text and used an open source web application for extracting data from figures in research papers. This requires me to go through and manually click on the points of the plot itself. Consequently, I’m not 100% perfect, so there may be subtle differences in my data set compared to what was used for the paper.

The data used in the paper reflect 17-weeks of mean concentric velocity (MCV) in the 100-kg back squat for a competitive powerlifter, tested once a week. The two main figures, which, along with the analysis, I will recreate are Figure 3 and Figure 5.

Figure 3 is a time series visual of the athlete while Figure 5 provides an analysis and visual for the athlete’s change across the weeks in the training phase.

Figure 3

The first 10-weeks represent the maintenance phase for the athlete, which was followed by a 7-week training phase. The maintenance phase sessions were used to build a linear regression model which was then used to visualize the athlete’s change over time along with corresponding confidence interval around each MCV observation. The model output looks like this:

The standard (typical) error was used to calculate confidence intervals around the observations. To calculate the standard error, the authors’ recommend one of two approaches:

1) If you have group-based test-retest data, they recommend taking the difference between the test-retest outcomes and calculating the standard error as follows:

• SE.group = sd(differences) / sqrt(2)

2) If you have individual observations, they recommend calculating the standard error like this:

• SE.individual = sqrt(sum.squared.residuals) / (n2))

Since we have individual athlete data, we will use the second option, along with the t-critical value for 80% CI, to produce Figure 3 from the paper :

The plot provides a nice visual of the athlete over time. We see that, because the linear model is calculated for the maintenance phase, as time goes on, the shaded standard error region gets wider. The confidence intervals around each point estimate are there to encourage us to think past just a point estimate and recognize that there is some uncertainty in every test outcome that cannot be captured in a single value.

Figure 5

This figure visualizes the change in squat velocity for the powerlifter in weeks 11-17 (the training phase) relative to the mean squat velocity form the maintenance phase, representing the athlete’s baseline performance.

Producing this plot requires five pieces of information:

1. Baseline average for the maintenance phase
2. The difference between the observed MVC in each training week and the maintenance average
3. Calculate the t-critical value for the 90% CI
4. Calculate the Lower 90% CI
5. Calculate the Upper 90% CI

Obtaining this information allows us to produce the following table of results and figure:

##### Are the changes meaningful?

One thing the authors’ mention in the paper are some approaches to evaluating whether the observed changes are meaningful. They recommend using either equivalence tests or second generation p-values. However, they don’t go into calculating such things on their data. I honestly am not familiar with the latter option, so I’ll instead create an example of using an equivalence test for the data and show how we can color the points within the plot to represent their meaningfulness.

Equivalence testing has been discussed by Daniel Lakens and colleagues in their tutorial paper, Lakens, D., Scheel, AM., Isager, PM. (2018). Equivalence testing for psychological reserach: A tutorial. Advances in Methods and Practices in Psychological Science. 2018; 1(2): 259-269.

Briefly, equivalence testing uses one-sided t-tests to evaluate whether the observed effect is larger or smaller than a pre-specified range of values surrounding the effect of interest, termed the smallest effect size of interest (SESOI).

In our above plot, we can consider the shaded range of values around 0 (-0.03 to 0.03, NOTE: The value 0.03 was provided in the text as the meaningful change for this athlete to see an ~1% increase in his 1-RM max) as the region where an observed effect would not be deemed interesting. Outside of those ranges is a change in performance that we would be most interested in. In addition to being outside of the SESOI region, the observed effect should be substantially large enough relative to the standard error around each point, which we calculated from our regression model earlier.

Putting all of this together, we obtain a the same figure above but now with the points colored specific to the p-value provided from our equivalence test:

Warpping Up

Again, if you’d like the full markdown file with code (click the ‘code’ button to display each code chunk) CLICK HERE >> Weakley–2021—-Velocity-Based-Training—From-Theory-to-Application—Strength-Cond-J

There are always a number of ways that analysis can unfold and provide valuable insights and this paper reflects just one approach. As with most things, I’m left with more questions than answers.

For example, Figure 3, I’m not sure if linear regression is the best approach. As we can see, the grey shaded region increases in width overtime because time is on the x-axis (independent variable) and the model was built on a small portion (the first 10-weeks) of the data. As such, with every subsequent week, uncertainty gets larger. How long would one continue to use the baseline model? At some point, the grey shaded region would be so wide that it would probably be useless. Are we too believe that the baseline model is truly representative of the athlete’s baseline? What if the baseline phase contained some amount of trend — how would the model then be used to quantify whatever takes place in the training phase? Maybe training isn’t linear? Maybe there is other conditional information that could be used?

In Figure 5, I wonder about the equivalence testing used in this single observation approach. I’ve generally thought of equivalence testing as a method comparing groups to determine if the effect from an intervention in one group is larger or smaller than the SESOI. Can it really work in an example like this, for an individual? I’m not sure. I need to think about it a bit. Maybe there is a different way such an analysis could be conceptualized? A lot of these issues come back to the problem of defining the baseline or some group of comparative observations that we are checking our most recent observation against.

My ponderings aside, I enjoyed the paper and the attempt to provide practitioners with some methods for delivering feedback when using VBT.

# Issues with ‘Black Box’ Machine Learning Models in Injury Prediction

Injury prediction models developed using machine learning approaches have become more common due to the substantial rise of proprietary software in the sports science and sports medicine space. However, such ‘black box’ approaches are not without limitation. Aside from a lack of transparency, preventing independent evaluation of model performance, these types of models present challenges in interpretation, making it difficult for practitioners who are required to make decisions about athlete health and plan interventions.

I recently had the pleasure of working on a paper headed up by Garrett Bullock and a list of wonderful co-authors where we discuss some of these issues:

# Bayesian Updating of Reference Ranges for Serial Measurements

Introduction

The collection of serial measurements on athletes across a season (or multiple seasons) is one of the more common types of data being generated in the applied sport science environment. The question that coaches and practitioners often have is, “Is this player outside of their ‘normal range’?”

The best approach for establishing a reference range of ‘normal’ values is a frequently discussed topic in sport science. One common strategy is to use z-scores and represent the reference range as 1 standard deviation above or below the mean (Figure A) or plot the raw values and set the reference range 1 standard deviation above or below the raw mean (Figure B), for practitioners who might have a difficult time understanding standardized scores. Of course, the mean and standard deviation will now be related to all prior values. As such, if the athletes go through a training phase with substantially higher values than other phases (e.g., training camp) it could skew your reference ranges. To alleviate this issue, some choose to use a rolling mean and standard deviation, to represent the normal range of values relative to more recent training sessions (Figure C).

A problem with the approaches above is that they require a number of training sessions to allow a mean and standard deviation to be determined for the individual athlete. One solution to this issue is to base our initial normal reference ranges off of prior knowledge that we have from collecting data on players in previous seasons (or prior knowledge from research papers, if we don’t have data of our own yet). This type of Bayesian updating approach has been applied in WADA’s drug testing practices1. More recently, Hecksteden et al., used this approach to evaluate the CK levels of team-sport athletes in both fatigued and non-fatigued states2.

The mathematics of the approach was presented in the paper but might look intimidating to those not used to looking at mathematical equations in this manner.

The author’s provided a nice excel sheet where you can input your own data and get the updated reference ranges. However, the sheet is a protected sheet, which doesn’t afford the opportunity of seeing how the underlying equations work and you can’t alter the sheet to make appropriate for your data (for example, the data in the sheet log transforms the raw data automatically). Thus, I’ve decided to code the analysis out, both in excel and R, to help practitioners looking to adopt this approach.

Setting Priors

To apply this type of analysis, we need to first establish some prior values for three parameters: Prior Mean (mu), Prior Standard Deviation (tau), and a Prior Repeated-Measures Standard Deviation (sigmaRM). These values represent our current knowledge of the variable we are measuring before seeing any new data. As new data is collected, we can update these priors to get an individual (posterior) estimate for the athlete. I’ll use the priors set by Hecksteden and colleagues for CK levels of Male athletes:

• Mu = 5.527
• Tau = 0.661
• sigmaRM = 0.504

Once we have established our prior parameters, we are ready to update them, using the math equations above, as new data comes in.

Bayesian Updating in Excel

The excel sheet is available at my GitHub page. It looks like this:

All of the heavy lifting occurs in the two columns under the header Bayesian Updating (Log Scale). The equation in the first row (Test 1) is different than the other equations below it because requires the prior information to get going. After that first test, the updated data become the prior for the next test and this continues for all tests forward. You can download the excel sheet and see how the equations work, so I won’t go through them here. Instead, I’ll show them more clearly in the R script, below.

Bayesian Updating in R

We first need to convert the math equations provided in the paper (posted above) into R code. Rather that leaving things to mathematical notation, I’ll plug in the variables in plain English:

To be clear, here are the definitions for the variables above:

Now that we know the variables we need for each equation we can begin the process of updating or reference ranges.

First create a data set of the test observations and their log values. This will be the same data we observed in our excel sheet:

Then we set our priors (in log format):

## priors
prior_mu <- 5.527
prior_sd <- 0.661
prior_repeated_measure_sd <- 0.504


We will start by seeing how the updating works for the mean and standard deviation parameters after the first test. To do this, we will create a function for each parameter (mean and standard deviation) that updates the priors with the observed values based on the above equations:

posterior_mu <- function(prior_mu, prior_sd, prior_repeated_measure_sd, obs_value){

numerator <- prior_repeated_measure_sd^2 * prior_mu + prior_sd^2 * obs_value
denominator <- prior_repeated_measure_sd^2 + prior_sd^2

post_mu <- numerator / denominator
return(post_mu)

}

posterior_sd <- function(prior_repeated_measure_sd, prior_sd, test_num){

post_var <- 1 / ((test_num - 1 + 1) * 1/prior_repeated_measure_sd^2 + 1/prior_sd^2)
post_sd <- sqrt(post_var)
return(post_sd)

}


After running the functions on the observations of our first test, our updated mean and standard deviation are:

Notice that we obtain the same values that we see following test_1 in our excel workbook. We can also calculate 95% confidence intervals and take the exponent (since the data is on a log scale) to get the individual athlete’s updated reference range on the raw scale:

Again, these results confirm the values we see in our excel workbook.

That’s cool and all, but we need to be able to iteratively update the data set as new data comes in. Let’s write a for() loop!

First, we create a new column in the data that provides us with the updated standard deviation after observing the results in each test. This is a necessary first step as we will use this value to then update the mean value.

## Calculate the updated SD based on sample size
df2 <- df %>%
mutate(bayes_sd = sqrt(1 / ((test - 1 + 1) * 1 / prior_repeated_measure_sd^2 + 1 / prior_sd^2)))

df2


Next, we need a for() loop. This is a bit tricky because test_1 is updating based solely on the priors while all other tests (test_2 to test_N) will be updating based on the mean and standard deviation in the row above them. Thus, we need to have our for() loop look back at the previous row above it once those values are calculated. This sort of thing is easy to set up in excel but in R (or Python) we need to think about how to code our row indexes. I covered this sort of iterative row computing in two previous articles HERE and HERE.

Within the loop we first set up vectors for the prior variance (sd^2), the denominator in our equation, and the log transformed observations from our data set. Then, we calculate the updated posterior for the mean (mu) on each pass through the loop, each time using the value preceding it in the vector, [i – 1], to allow us to iteratively update the data.

Once we run the loop, we add the results to our data set (removing the first observation in the vector since that was the original prior before seeing any data):

# Create a vector to store results
N <- length(df2$ln_value) + 1 bayes_mu <- c(prior_mu, rep(NA, N - 1)) ## For loop for(i in 2:N){ ## Set up vectors for the variance, denominator, and newly observed values prior_var <- c(prior_sd^2, df2$bayes_sd^2)
denominator <- prior_repeated_measure_sd^2 + prior_var
vals <- df2$ln_value ## calculate bayesian updated mu bayes_mu[i] <- (prior_repeated_measure_sd^2 * bayes_mu[i-1] + prior_var[i-1] * vals[i-1]) / denominator[i-1] } df2$bayes_mean <- bayes_mu[-1]
df2


The two columns, bayes_sd and bayes_mean, contain our updated prior values and they are the exact same results we obtained in our excel workbook.

To use these updated parameters for creating individual athlete reference ranges, we calculate the 95% Confidence Intervals:

NOTE: I added a row at the start of data frame to establish the priors, before seeing the data, so that they could also be plotted as part of the reference ranges.

### Confidence Intervals
first_prior <- data.frame(test = 0, value = NA, ln_value = NA, bayes_sd = prior_sd, bayes_mean = prior_mu)

df2 <- df2 %>%
bind_rows(first_prior) %>%
arrange(test)

## Exponentiate back to get the reference range
df2$low95 <- exp(df2$bayes_mean - 1.96*df2$bayes_sd) df2$high95 <- exp(df2$bayes_mean + 1.96*df2$bayes_sd)
df2


Finally, we plot the observations along with the continually updated references ranges. You can clearly see how large the normal range is before seeing any data (test_0) and then how quickly this range begins to shrink down once we start observing data from the individual.

To access the R code and the excel workbook please visit my GitHub page.

References

• Sottas PE et al. (2007). Bayesian detection of abnormal values in longitudinal biomarkers with application to T/E ratio. Biostatistics; 8(2): 285-296.
• Hecksteden et al. (2017). A new method to individualize monitoring of muscle recovery in athletes. Int J Sport Phys Perf; 12: 1137-1142.

# The Nordic Hamstring Exercise, Hamstring Strains, & the Scientific Process

Doing science in lab is hard.

Doing science in the applied environment is hard — maybe even harder than the lab, at times, due to all of the variables you are unable to control for.

Reading and understanding science is also hard.

Let’s face it, science is tough! So tough, in fact, that scientists themselves have a difficult time with all of the above, and they do this stuff for a living! As such, to keep things in check, science applies a peer-review process to ensure that a certain level of standard is upheld.

Science has a very adversarial quality to it. One group of researchers formulate a hypothesis, conduct some research, and make a claim. Another group of researchers look at that claim and say, “Yeah, but…”, and then go to work trying to poke holes in it, looking to answer the question from a different angle, or trying to refute it altogether. The process continues until some type of consensus is agreed upon within the scientific community based on all of the available evidence.

This back-and-forth tennis match of science has a lot to teach those looking to improve their ability to read and understand research. Reading methodological papers and letters to the editor offer a glimpse into how other, more seasoned, scientists think and approach a problem. You get to see how they construct an argument, deal with a rebuttal, and discuss the limitations of both the work they are questioning and the work they are conducting themselves.

All of this brings me to a recent publication from Franco Impellizzeri, Alan McCall, and Maarten van Smeden,

The paper is directed at a prior meta-analysis which aggregated the findings of several studies with the aim of understanding the role of the Nordic hamstring exercise (NHE) in reducing the risk hamstring strain injuries in athletes. In a nutshell, Impellizzeri and colleagues felt like the conclusions and claims made from the original meta-analysis were too optimistic, as the title of the paper suggested that performing the NHE can halve the rate of hamstring injuries (Risk Ratio: 0.49, 95% CI: 0.32 to 0.74). Moreover, Impellizzeri et al, identified some methodological flaws with regard to how the meta-analysis was performed.

The reason I like this paper is because it literally steps you through the thought process of Impellizzeri and colleagues. First, it discusses the limitations that they feel are present in the previous meta-analysis. They then conduct their own meta-analysis by re-analyzing the data, however, they apply an inclusion criteria that was more strict and, therefore, only included 5 papers from the original study (they also identified and included a newer study that met their criteria). In analyzing the original five papers, they found prediction intervals ranging from 0.06 to 5.14. (Side Note: Another cool piece of this paper is the discussion and reporting of both confidence intervals and prediction intervals, the latter of which are rarely discussed in the sport science literature).

The paper is a nice read if you want to see the thought process around how scientists read research and go about the scientific process of challenging a claim.

Some general thoughts

• Wide prediction intervals leave us with a lot of uncertainty around the effectiveness of NHE in reducing the risk of hamstring strain injuries. At the lower end of the interval NHE could be beneficial and protective for some while at the upper end, potentially harmful to others.
• A possible reason for the large uncertainty in the relationship between NHE and hamstring strain injury is that we might be missing key information about the individual that could indicate whether the exercise would be helpful or harmful. For example, context around their previous injury history (hamstring strains in particular), their training age, or other variables within the training program, all might be useful information to help paint a clearer picture about who might benefit the most or the least.
• Injury is highly complex and multi-faceted. Trying to pin a decrease in injury risk to a single exercise or a single type of intervention seems like a bit of a stretch.
• No exercise is a panacea and we shouldn’t treat them as such. If you like doing NHE for yourself (I actually enjoy it!) then do it. If you don’t, then do something else. Let’s not believe that certain exercises have magical properties and let’s not fight athletes about what they are potentially missing from not including an exercise in their program when we don’t even have good certainty on whether it is beneficial or not.