Category Archives: Sports Analytics

Total Score of Athleticism — R Markdown & R Shiny

A battery of performance tests (e.g., strength, power, fitness, agility) is often used by strength and conditioning and sports science staff to evaluate a player’s current physical status. Such information can help guide future training programs aimed at improving deficiencies and enhancing performance. Having a single value that represents the athlete’s overall athleticism may be useful for identifying the most well-rounded athletes in the club and may also help in communicating each player’s test results to the coaching staff in a digestible manner.

Recently, Anthony Turner and colleagues published the paper, Total Score of Athleticism: Holistic Athlete Profiling to Enhance Decision-Making, in the NSCA’s Strength and Conditioning Journal. While there are a number of ways one could index an athlete and represent their athleticism in a single value, this approach is simple for the practitioner to calculate, so I wanted to use it as an example of how you can create an R Markdown report and Shiny app for displaying the results. (NOTE: For those interested in doing the analysis in Excel, Anthony has a YouTube channel where he details the process.)

Calculating Total Score of Athleticism (TSA)

The TSA is derived by calculating the z-score for each test in your battery and then averaging those z-scores for each individual to produce a single value.

The z-score is calculated simply as:

z = (athlete's score – group mean) / group SD

Z-scores have a mean of 0 and a standard deviation of 1, with values typically ranging from -3 to 3, where ±1 SD represents ~68% of the scores, ±2 SD represents ~95% of the scores, and ±3 SD represents ~99% of the scores. In this way, negative values suggest the athlete is below average while positive values suggest the athlete is above average, relative to the group.
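
As a quick check of those coverage figures in R:

## proportion of a normal distribution within 1, 2, and 3 SD of the mean
round(sapply(1:3, function(k) pnorm(k) - pnorm(-k)), 3)

# [1] 0.683 0.954 0.997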

Some coaches or practitioners, however, may not like looking at a z-score because it is difficult for them to wrap their heads around the values (though I think it is easier to see the results as a z-score when performance is represented as positive and negative) and may instead prefer to look at the scores on a 0-100 scale. To convert the z-scores to t-scores on a 0-100 scale we use this formula:

t = (z * 10) + 50

Now, instead of positive and negative values representing above- or below-average athletes, a score of 50 represents average, and each 10 points in either direction of 50 represents one standard deviation from average. For example, 60 = 1 SD, 70 = 2 SD, and 80 = 3 SD.
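
A quick example of the conversion in R:

## convert a few z-scores to t-scores
z <- c(-2, -1, 0, 1, 2)
(z * 10) + 50

# [1] 30 40 50 60 70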

R Markdown Report

R Markdown is a simple way to take your analysis and turn it into a report or document for sharing. The beauty of R Markdown is that you can choose to show your R Code (if you are sharing with other professionals or colleagues) or hide your R Code and just show the results of the analysis (if you are sharing with coaches or other staff members).

I’ll put my R code in here to walk through the steps but if you are interested in the R Markdown file, just go over to my GitHub page.

First, we need to load our required packages and make up some fake data (since I don’t have any data I can use publicly).

# Packages
library(tidyverse)
library(reshape)
library(stringr)

# Simulate data
set.seed(3344)
athlete <- as.factor(1:30)
cmj <- c(round(rnorm(n = 10, mean = 30, sd = 4), 1), round(rnorm(n = 10, mean = 24, sd = 4), 1), round(rnorm(n = 10, mean = 33, sd = 2), 1))
sprint_40 <- c(round(rnorm(n = 10, mean = 4.5, sd = .1), 2), round(rnorm(n = 10, mean = 4.9, sd = .2), 2), round(rnorm(n = 10, mean = 5, sd = .2), 2))
bench <- c(round(rnorm(n = 10, mean = 20, sd = 4), 1), round(rnorm(n = 10, mean = 12, sd = 4), 1), round(rnorm(n = 10, mean = 30, sd = 4), 1))
five_ten_five <- c(round(rnorm(n = 10, mean = 6.4, sd = .2), 2), round(rnorm(n = 10, mean = 6.7, sd = .2), 2), round(rnorm(n = 10, mean = 7.5, sd = .4), 2))
df <- data.frame(athlete, cmj, sprint_40, bench, five_ten_five)

Next we need to write 2 functions, one for calculating our z-score and one for calculating our t-score.

## z-score function
z_score <- function(x){
  z = (x - mean(x, na.rm = T)) / sd(x, na.rm = T)
  return(z)
}

## t-score function
t_score <- function(x){
  t = (x * 10) + 50
  return(t)
}


Now we are all set to calculate the z-score and t-score results for our individual athletes. Note that for any test where a lower score reflects better performance (e.g., speed tests, where a faster time is more favorable), you can multiply the z-score by -1 before converting to the t-score. This intuitively makes it easier for those reading the report to always associate positive values with “above average” and negative values with “below average”.


## calculate the z-score
df <- df %>%
mutate_if(is.numeric, list(z = z_score))

df$sprint_40_z <- df$sprint_40_z * -1
df$five_ten_five_z <- df$five_ten_five_z * -1

## calculate the t-score
df <- df %>%
  mutate(cmj_t = t_score(cmj_z),
         sprint_40_t = t_score(sprint_40_z),
         bench_t = t_score(bench_z),
         five_ten_five_t = t_score(five_ten_five_z))

Finally, we calculate the TSA z-score (TSA_z) and TSA t-score (TSA_t).


## calculate TSA_z
df$TSA_z <- apply(df[, 6:9], MARGIN = 1, FUN = mean)

## calculate TSA_t
df$TSA_t <- with(df, (TSA_z * 10) + 50)


Now that the data is prepared, we construct our report. Before plotting the TSA z-scores we need to move the data from its current wide format to a long format, which will make it easier to code the plot. We will remove the “_z” at the end of each variable to make the labels cleaner. We are also going to add a shaded range between -1 and 1 (you can pick whatever range makes sense in your situation). Finally, we will include an indicator value that flags the athlete as “green” when they are above average and “red” when they are below average.


# Change data from a wide to long format
df_long <- df %>%
melt(., id = "athlete", measure.vars = c("cmj_z", "sprint_40_z", "bench_z", "five_ten_five_z"))

# remove the _z
df_long$Test <- str_sub(df_long$variable, end = -3)

# Add indicator value
df_long <- df_long %>% mutate("indicator" = ifelse(value > 0, "above avg", "below avg"))

# plot
df_long %>%
  filter(athlete %in% c(3, 15, 22, 27)) %>%
  ggplot(aes(x = Test, y = value)) +
  geom_rect(aes(ymin = -1, ymax = 1), xmin = 0, xmax = Inf, fill = "light grey", alpha = 0.3) +
  geom_col(aes(fill = indicator), alpha = 0.8) +
  facet_wrap(~athlete) +
  scale_fill_manual(values = c("green", "red")) +
  theme_light() +
  theme(axis.text.x = element_text(face = "bold", size = 12, angle = 45, vjust = 1, hjust = 1),
        axis.text.y = element_text(face = "bold", size = 12),
        strip.background = element_rect(fill = "black"),
        strip.text = element_text(color = "white", face = "bold"),
        plot.title = element_text(size = 18),
        plot.subtitle = element_text(size = 15)) +
  labs(x = "", y = "z-score of performance") +
  ggtitle("Test Performance", subtitle = "Player Performance Standardized to the Team") +
  ylim(-3, 3)


We can also plot the entire team’s TSA z-score in a similar manner. I’ve included the TSA z-score values on the plot as well, to help with interpretation.
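
The exact code is in the markdown file linked below, but a minimal sketch of that team plot, using the df created above and the same color convention, might look something like this:

df %>%
  mutate(indicator = ifelse(TSA_z > 0, "above avg", "below avg")) %>%
  ggplot(aes(x = reorder(athlete, TSA_z), y = TSA_z)) +
  geom_col(aes(fill = indicator), alpha = 0.8) +
  geom_text(aes(label = round(TSA_z, 2)), size = 3) +
  scale_fill_manual(values = c("green", "red")) +
  coord_flip() +
  theme_light() +
  labs(x = "athlete", y = "TSA z-score") +
  ggtitle("Total Score of Athleticism", subtitle = "Team TSA z-scores")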


If you would like to see what the finished markdown file looks like, click here:
Total_Score_of_Athleticism_-_Report

You can certainly manipulate the code to produce different plots or add more text for the coaches to read. I’ve stuck with the z-score in this report because I personally don’t think the t-score conveys the information in the same manner. My personal preference is to see below average as negative and above average as positive. I’ve added the code for the t-score plot in the markdown file so you can manipulate it if you’d like. Quickly, here is what the plot would look like (horizontal dashed line at 50, indicating average) so you can judge which one you prefer.
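
A minimal sketch of that t-score version (a dashed line at 50 in place of the shaded z-score range) could be:

df %>%
  melt(., id = "athlete", measure.vars = c("cmj_t", "sprint_40_t", "bench_t", "five_ten_five_t")) %>%
  mutate(Test = str_sub(as.character(variable), end = -3)) %>%
  filter(athlete %in% c(3, 15, 22, 27)) %>%
  ggplot(aes(x = Test, y = value)) +
  geom_col(fill = "steelblue", alpha = 0.8) +
  geom_hline(yintercept = 50, linetype = "dashed") +
  facet_wrap(~athlete) +
  theme_light() +
  labs(x = "", y = "t-score of performance")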


Shiny App

You may have noticed from the markdown report that when plotting the individual test results I only showed 4 athletes. In this way the report is rather static! It doesn’t allow us to scroll through players in an efficient manner. For that, we need something that is interactive. Enter Shiny!

Shiny is a way for us to quickly build interactive webpages in R that can be hosted directly on our computer. There is a lot of versatility in these apps. The example below is just a simple app that allows the coach to select an athlete and it will automatically change to that individual’s plot, showing their performance in all of the individual tests as well as their TSA.

We could extend this app to have a second plot to allow for player comparisons or a table of results to allow for viewing the raw values for each of the tests (a short sketch of the table idea follows the app code). The code is provided below for you to run on your computer. I don’t go into what all of the elements of the code are doing (for time’s sake), but perhaps I can revisit other ways of using Shiny in future blog posts and explain more clearly how the code works.


### Total Score of Athleticism - Shiny App

# Load packages
library(tidyverse)
library(reshape)
library(stringr)
library(shiny)

# simulate data
set.seed(3344)
athlete <- as.factor(1:30)
cmj <- c(round(rnorm(n = 10, mean = 30, sd = 4), 1), round(rnorm(n = 10, mean = 24, sd = 4), 1), round(rnorm(n = 10, mean = 33, sd = 2), 1))
sprint_40 <- c(round(rnorm(n = 10, mean = 4.5, sd = .1), 2), round(rnorm(n = 10, mean = 4.9, sd = .2), 2), round(rnorm(n = 10, mean = 5, sd = .2), 2))
bench <- c(round(rnorm(n = 10, mean = 20, sd = 4), 1), round(rnorm(n = 10, mean = 12, sd = 4), 1), round(rnorm(n = 10, mean = 30, sd = 4), 1))
five_ten_five <- c(round(rnorm(n = 10, mean = 6.4, sd = .2), 2), round(rnorm(n = 10, mean = 6.7, sd = .2), 2), round(rnorm(n = 10, mean = 7.5, sd = .4), 2))
df <- data.frame(athlete, cmj, sprint_40, bench, five_ten_five)

# z-score function
z_score <- function(x){
  z = (x - mean(x, na.rm = T)) / sd(x, na.rm = T)
  return(z)
}

##### Data Pre-Processing #####
###############################

# calculate the z-score
df <- df %>%
mutate_if(is.numeric, list(z = z_score))

df$sprint_40_z <- df$sprint_40_z * -1
df$five_ten_five_z <- df$five_ten_five_z * -1

# calculate TSA_z
df$TSA_z <- apply(df[, 6:9], MARGIN = 1, FUN = mean)

# Change data from a wide to long format
df_long <- df %>% 
melt(., id = "athlete", measure.vars = c("cmj_z", "sprint_40_z", "bench_z", "five_ten_five_z", "TSA_z"))

# remove the _z
df_long$Test <- str_sub(df_long$variable, end = -3)

# Add indicator value
df_long <- df_long %>% mutate("indicator" = ifelse(value > 0, "above avg", "below avg"))

##### Shiny App #####
#####################

## User Interface

athlete <- as.vector(unique(df_long$athlete))

ui <- fluidPage(

  titlePanel("Performance Testing Results"),

  selectInput(
    inputId = "athlete",
    label = "athlete",
    choices = athlete,
    selected = "1"
  ),

  plotOutput(outputId = "tsa.plot",
             width = "60%")
)

## server

server <- function(input, output){

  dat <- reactive({
    dataset <- subset(df_long, athlete == input$athlete)
    dataset
  })

  output$tsa.plot <- renderPlot({
    d <- dat()

    athlete.plot <- ggplot(data = d, aes(x = Test, y = value)) +
      geom_rect(aes(ymin = -1, ymax = 1), xmin = 0, xmax = Inf, fill = "light grey", alpha = 0.3) +
      geom_col(aes(fill = indicator), alpha = 0.8) +
      scale_fill_manual(values = c("green", "red")) +
      theme_light() +
      theme(axis.text.x = element_text(face = "bold", size = 12, angle = 45, vjust = 1, hjust = 1),
            axis.text.y = element_text(face = "bold", size = 12),
            strip.background = element_rect(fill = "black"),
            strip.text = element_text(color = "white", face = "bold"),
            plot.title = element_text(size = 18),
            plot.subtitle = element_text(size = 15)) +
      labs(x = "", y = "z-score of performance") +
      ggtitle("Test Performance", subtitle = "Player Performance Standardized to the Team") +
      ylim(-3, 3)

    print(athlete.plot)
  })
}

## Run the app
shinyApp(ui = ui, server = server)

The website ends up looking like this:
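
Earlier I mentioned extending the app with a table of raw results. A minimal sketch of those additions (the output name test.table is made up for this example) might be:

## In the UI, after the plotOutput() call:
tableOutput(outputId = "test.table")

## In the server function:
output$test.table <- renderTable({
  subset(df, athlete == input$athlete,
         select = c("athlete", "cmj", "sprint_40", "bench", "five_ten_five"))
})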

Conclusion

There are a number of ways one can index several performance tests and create a single number for coaches to digest. This is one simple example of an approach that is easy for practitioners to understand and act on (e.g., low values in red may require specialized approaches to training and performance development). Its simplicity makes it easy to build reports and web apps that allow the strength coach and sports science staff to share information with the coaching staff in an easy and interactive manner.

Obviously, care should go into selecting the test battery, as it should reflect elements that are specific to success in your given sport. Finally, the current format of the Total Score of Athleticism treats each test with equal weight. However, there may be situations where you want to place more weight on certain tests. For example, some tests may be more important for certain position groups than others. Alternatively, some tests may have a stronger association with sports performance and thus warrant greater weight than other tests. All of these things can be addressed in your setting by altering the code in this blog to meet your needs, as in the quick sketch below.
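
For example, a weighted TSA is only a small change to the code above; the weights here are made up purely for illustration:

## hypothetical weights emphasizing the jump and sprint tests
wts <- c(cmj_z = 0.4, sprint_40_z = 0.3, bench_z = 0.1, five_ten_five_z = 0.2)

df$TSA_z_weighted <- apply(df[, names(wts)], MARGIN = 1,
                           FUN = function(x) weighted.mean(x, w = wts))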


References

1) Turner AN, et al. (2019). Total Score of Athleticism: Holistic Athlete Profiling to Enhance Decision-Making. Strength Cond J. Epub ahead of print.

Estimating Performance

When looking at sports statistics it’s important to keep in mind that performance is a blend of both skill and luck. As such, recognizing that all athletes exhibit some level of regression to the mean helps us put into perspective that observed performance is not necessarily where that individual’s true performance lies. For example, we wouldn’t believe that a baseball player who starts the MLB season going 6 for 10 has a true .600 batting average that they will carry throughout the season. Eventually, they will regress back down to something more normal (more average). Conversely, if a player starts the season 1 for 10 we would expect them to eventually move back up to something more normal.

A goal in sports analytics is to try and estimate the true performance of a player given some observed data. One of my favorite papers on this topic is from Efron and Morris (1977), Stein’s Paradox in Statistics. The paper discusses ways of using observed outcomes to make future forecasts using the James-Stein Estimator and then a Bayes Estimation approach.

Using 2019 MLB data, I’ve put together some R code to walk through the methods proposed in the paper.

Getting Data

First, we need to load Bill Petti’s baseballr package, which is a handy package for scraping MLB data. We will also load the ggplot2 package for data visualization.

library(baseballr)
library(ggplot2)

The aim of this analysis will be to look at hitting performance (batting average) over the first 30 days of the 2019 MLB season, make a forecast of the player’s true batting average, based on the observations of those first 30 days, and then test that forecast on the next 30 days.

We will obtain two data sets, a first 30 days data set and a second 30 days data set.

### Get the first 30 days and second 30 days of the 2019 MLB season

dat_first30 <- daily_batter_bref(t1 = "2019-03-28", t2 = "2019-04-28")
dat_second30 <- daily_batter_bref(t1 = "2019-04-29", t2 = "2019-05-29")

## Explore the data frames

head(dat_first30)
head(dat_second30)

dim(dat_first30) # 595 x 29
dim(dat_second30) # 611 x 29

Evaluating The Data

Let’s now reduce the two data frames down to the columns we need (Name, AB, and BA).

dat_first30 <- dat_first30[, c("Name", "AB", "BA")]
dat_second30 <- dat_second30[, c("Name", "AB", "BA")]


Let’s look at the number of observations (AB) we see for the players in the two data sets.


quantile(dat_first30$AB)
quantile(dat_second30$AB)

par(mfrow = c(1,2))
hist(dat_first30$AB, col = "grey", main = "First 30 AB")
rug(dat_first30$AB, col = "red", lwd = 2)
hist(dat_second30$AB, col = "grey", main = "Second 30 AB")
rug(dat_second30$AB, col = "red", lwd = 2)


We see a considerable right skew in the data, with a large number of players having few at bats over the first 30 days and a few players having a large number of observations (the everyday players).

Let’s see what Batting Average looks like.

quantile(dat_first30$BA, na.rm = T)
quantile(dat_second30$BA, na.rm = T)

par(mfrow = c(1,2))
hist(dat_first30$BA, col = "grey", main = "First 30 BA")
rug(dat_first30$BA, col = "red", lwd = 2)
hist(dat_second30$BA, col = "grey", main = "Second 30 BA")
rug(dat_second30$BA, col = "red", lwd = 2)


We see that the median batting average for MLB players was .224 in the first 30 days and .234 in the second 30 days. We also see some players with a batting average of 1.0 (probably players who batted only one time and got a hit) and a bunch of players with a batting average of 0.

This broad range of at bats and batting average values is actually going to be interesting when we attempt to forecast future performance, as small sample sizes make it difficult to have faith in a player’s true ability. This is one area in which the James-Stein Estimator and Bayes Estimation approaches may be useful, as they allow for shrinkage of the observed batting averages towards the mean. In this way, the forecast isn’t overly confident about the player who went 6 for 10 and isn’t overly pessimistic about the player who went 1 for 10. Additionally, for the player who never got an at bat, the forecast will suggest that the player is average until future observations can be gathered to prove otherwise.

Estimating Performance

We will use 3 approaches to forecast performance in the second 30 days:

  1. Use the batting average of the player in the first 30 days and assume that the next 30 days will be similar. This will serve as our benchmark, which the other two approaches need to improve upon if we are to use them. Note, however, that this approach doesn’t help us at all for players who had 0 at bats.
  2. Use a James-Stein Estimator to forecast the second 30 days by taking the observed batting averages and applying a level of shrinkage to account for some regression to the mean.
  3. Account for regression to the mean by using some sort of prior assumption of the batting average mean and standard deviation of MLB players. This is a type of Bayes approach to handling the problem and is discussed in the last section of Efron and Morris’s article.

Before we start making our forecasts, I’m going to take a random sample of 50 players from the first 30 days data set. This will allow us to work with a subset of the data for the sake of the example. Additionally, rather than cleaning up the data or setting an inclusion criterion based on the number of at bats, taking a random sample will give me a good mix of players that had no at bats, very few at bats, an average number of at bats, and a large number of at bats. This will allow us to see how well the three forecast approaches handle large variability in observations. (Technical Note: If you are going to follow along with the r-script below, make sure you use the same set.seed() as I do to ensure reproducibility.)

## Get a sample of 50 players from the first 30

set.seed(1657)
N <- nrow(dat_first30)
samp_size <- 50
samp <- sample(x = N, size = samp_size, replace = F)

first30_samp <- dat_first30[samp, ]
head(first30_samp)

We need to locate these same players in the second 30 days data set so that we can test our forecasts.

## Find the same players in the second 30 days data set

second30_samp <- subset(dat_second30, Name %in% unique(first30_samp$Name))

nrow(second30_samp) # 37

Only 37 players from the first set of players (the initial 50) are available in the second 30 days. This could be due to a number of reasons, such as injury, getting sent down to Triple-A, or getting benched for a player who was performing better. In any event, we will work with these 37 players from here on out. So, we need to go into the sample from the first 30 days and find those players. Then we merge the two sample sets together.

## Subset out the 37 players in the first 30 days sample set

first30_samp <- subset(first30_samp, Name %in% second30_samp$Name)

nrow(first30_samp) # 37

## Merge the two samples together

df <- merge(first30_samp, second30_samp, by = "Name")
head(df)

The first 30 days are denoted as AB.x and BA.x while the second are AB.y and BA.y.

First 30 Day Batting Average to Forecast Second 30 Day Batting Average

Using the two columns, BA.x and BA.y, we can subtract one from the other to obtain the difference in our forecast had we simply assumed the first 30 days’ performance (BA.x) would be similar to the second (BA.y). From there, we can calculate the mean absolute error (MAE) and the root mean square error (RMSE), which we will use to compare the other two methods.

## Difference between first 30 day BA and second 30 day BA

df$Proj_Diff_Avg <- with(df, BA.x - BA.y)
mae_avg <- mean(abs(df$Proj_Diff_Avg), na.rm = T)
rmse_avg <- sqrt(mean(df$Proj_Diff_Avg^2, na.rm = T))


Using the James-Stein Estimator

The formula for the James-Stein Estimator is as follows:

JS = group_mean + C * (obs_BA – group_mean)

Where:

  • group_mean = the average of all players in the first 30 days
  • obs_BA = the observed batting average for an individual player in the first 30 days
  • C = a constant that represents the shrinkage factor. C is calculated as:
    • C = 1 – ((k – 3) * σ²) / Σ(y – ȳ)²
      • k = the number of unknown means we are trying to forecast (in this case, the sample size of our first 30 days)
      • σ² = the group variance observed during the first 30 days
      • Σ(y – ȳ)² = the sum of squared differences between each player’s observed batting average and the group average during the first 30 days

First we will calculate our group mean, group SD, and squared differences from the first 30 day sample.


# Calculate grand mean and sd

group_mean <- round(mean(df$BA.x, na.rm = T), 3)
group_mean # .214

group_sd <- round(sd(df$BA.x, na.rm = T), 3)
group_sd # .114

## Calculate sum of squared differences from the group average

sq.diff <- sum((df$BA.x - group_mean)^2)
sq.diff # .467

Next we calculate our shrinkage factor, C.

## Calculate shrinkage factor

k <- nrow(df)
c <- 1 - ((k - 3) * group_sd^2) / sq.diff
c # .053

Now we are ready to make a forecast of batting average for the second 30 days using the James-Stein Estimator and calculate the MAE and RMSE.

df$JS <- group_mean + c*(df$BA.x - group_mean)

df$Proj_Diff_JS <- with(df, JS - BA.y)

mae_JS <- mean(abs(df$Proj_Diff_JS), na.rm = T)
rmse_JS <- sqrt(mean(df$Proj_Diff_JS^2, na.rm = T))


Using a Prior Assumption for MLB Batting Average

In this example, the prior batting average I’m going to use will be the mean and standard deviation from the entire first 30 day data set (the original data set, not the sample). I could, of course, use historic data to build my prior assumption but I figured I’ll just start with this approach since I have the data readily accessible.

# Get a prior for BA

prior_BA_avg <- mean(dat_first30$BA, na.rm = T)
prior_BA_sd <- sd(dat_first30$BA, na.rm = T)

prior_BA_avg # .203
prior_BA_sd # .136

For our Bayes Estimator, we will use the following approach:

BE = prior_BA_avg + prior_BA_sd(obs_BA – prior_BA_avg)

df$BE <- prior_BA_avg + prior_BA_sd*(df$BA.x - prior_BA_avg)
df$Proj_Diff_BE <- with(df, BE - BA.y)

mae_BE <- mean(abs(df$Proj_Diff_BE), na.rm = T)
rmse_BE <- sqrt(mean(df$Proj_Diff_BE^2, na.rm = T))


Looking at Our Forecasts

Let’s put the MAE and RMSE of our 3 approaches into a data frame so we can see how they performed.

Comparisons <- c("First30_Avg", "James_Stein", "Bayes")
mae_grp <- c(mae_avg, mae_JS, mae_BE)
rmse_grp <- c(rmse_avg, rmse_JS, rmse_BE)

model_comps <- data.frame(Comparisons, MAE = mae_grp, RMSE = rmse_grp)
model_comps[order(model_comps$RMSE), ]

It looks like the Bayes Estimator has the lowest RMSE, with the James-Stein Estimator not far behind. Both approaches outperformed simply using the player’s first 30 day average as a naive forecast of future performance.

We can visualize the differences in our projections for the three approaches as well.
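
That plot isn’t reproduced here, but a minimal sketch of it, built from the projection difference columns created above, could look like this:

proj_long <- data.frame(
  method = rep(c("First30_Avg", "James_Stein", "Bayes"), each = nrow(df)),
  diff = c(df$Proj_Diff_Avg, df$Proj_Diff_JS, df$Proj_Diff_BE)
)

## colors follow the description below: Bayes = grey, First30_Avg = green, James_Stein = blue
ggplot(proj_long, aes(x = diff, fill = method)) +
  geom_density(alpha = 0.4) +
  geom_vline(xintercept = 0, color = "red", linetype = "dashed") +
  scale_fill_manual(values = c("grey", "green", "blue")) +
  theme_classic() +
  labs(x = "Projected BA - Observed BA (second 30 days)")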

We can see that making projections based off of the first 30 day average (green) is wildly spread out, and the mean value appears to over-project a player’s true ability (the peak is greater than 0, the dashed red line). Conversely, the James-Stein Estimator (blue) has a pretty strong peak that is just below 0, meaning it may be under-projecting players and pulling some of them down too far towards the group average. Finally, the Bayes Estimator (grey) resides in the middle of the two projections, with its peak just above 0.

The last thing I’ll do is put some confidence intervals around the Bayes Estimator for each player and take a look at how the sample size influences the forecast and where the observed batting average during the second 30 days falls in relation to that forecast.

df$Bayes_SE <- with(df, sqrt((BE * (1 - BE))/AB.x))
df$Low_CI <- df$BE - 1.96*df$Bayes_SE
df$High_CI <- df$BE + 1.96*df$Bayes_SE

We can then plot the results. (Technical Note: To keep the plot from being too busy, I only plotted the first 15 rows of the data set.)

ggplot(df[1:15, ], aes(x = reorder(Name, AB.x), y = BE)) +
	geom_point(color = "blue") +
	geom_point(aes(y = BA.y), color = "red") +
	geom_errorbar(aes(ymin = Low_CI, ymax = High_CI), width = 0) +
	geom_text(aes(label = paste(AB.x, "ABs", sep = " ")), vjust = -1, size = 3) +
	geom_hline(aes(yintercept = 0), linetype = "dashed") +	
	coord_flip() +
	theme_classic() +
	ggtitle("Bayes Estimator", 
		subtitle = "Blue = Bayes Estimation \nRed = Observed BA during second 30 day \nLabeled ABs = Individual Sample Size the Estimation was built on (First 30 days ABs)")


The at bats labelled in the plot are specific to the first 30 days of data, as they represent the number of observations each individual’s forecast was built on. We can see that when we have more observations, the forecast does a better job of identifying the player’s true performance based on their first 30 days. Rizzo and Iglesias were the two that really beat their forecast in the second 30 days of the 2019 season (Rizzo in particular). The guys at the bottom (Butera, Wynns, Duplantier, and Fedde) are much harder to forecast given they had so few observations in the first 30 days. In the second 30 days, Duplantier only had 1 AB, while Butera and Fedde only had 2.

Wrapping Up

The aim of this post was to work through the approaches to forecasting performance used in the 1977 paper from Efron and Morris. Dealing with small samples is a problem in sport, and what we saw is that if we naively use a player’s average performance over a small number of observations, we may miss the boat on their underlying true ability due to regression to the mean. As such, approaches like the James-Stein Estimator or Bayes Estimator can help us obtain better estimates by using a prior assumption about the average player in the population.

There are other ways to handle this problem, of course. For example, an Empirical Bayes Approach could be used by assuming a beta distribution for our data and making our forecasts from there. Finally, alternative approaches to modeling could account for different variables that might influence a player during the first 30 days (injury, park factors, strength of opponent, etc). However, the simple approaches presented by Efron and Morris are a nice start.
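
To give a flavor of that empirical Bayes idea, here is a rough sketch of my own (not code from the paper): fit a beta prior to the first-30-day batting averages by method of moments, then shrink each player toward it via the posterior mean.

## fit a beta prior to the league-wide batting averages from the first 30 days
ba <- dat_first30$BA[!is.na(dat_first30$BA) & dat_first30$AB > 0]
m <- mean(ba)
v <- var(ba)
alpha0 <- m * (m * (1 - m) / v - 1)
beta0 <- (1 - m) * (m * (1 - m) / v - 1)

## posterior mean per player: (hits + alpha0) / (AB + alpha0 + beta0)
hits <- round(df$BA.x * df$AB.x)   # approximate hits recovered from BA and AB
df$EB <- (hits + alpha0) / (df$AB.x + alpha0 + beta0)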

References

  1. Efron, B., Morris, CN. (1977). Stein’s Paradox in Statistics. Scientific American, 236(5): 119-127.


A Simple Approach to Analyzing Athlete Data in Applied Sports Science

Intro

Evaluating whether an athlete has or has not improved in some key performance indicator is critical to understanding the success of a prescribed training or rehabilitation program. In the applied setting, practitioners are faced with N = 1 decisions as they are training or rehabilitating individual athletes, each of whom is unique in their own way. As such, tests that allow practitioners to understand these individual improvements are imperative to quantifying the training process.

The analysis of athlete testing data first requires an understanding of what the test is measuring (whether it is valid or not) and the amount of noise/error within the test (whether the test is reliable or not). Tests that are overly noisy make it challenging for practitioners to reliably know whether changes exhibited by the athlete are due to real performance improvement, measurement error (e.g., issues with the test itself), or biological variation. Approaches to analyzing test-retest data to evaluate typical error of measurement (TEM) and smallest worthwhile change (SWC) have been previously discussed by authors such as Hopkins [1], Swinton [2], and Turner [3]. Recently, my friend and colleague, Shaun McLaren, wrote a blog post on understanding statistics when interpreting individualized testing data. Such approaches are important in the applied setting, as the last thing a practitioner or clinician wants to do is report inaccurate information regarding an athlete’s current physical state to the coach or management. From a medical/return-to-play standpoint, such information is important for ensuring that the athlete is making progress and meeting certain benchmarks to ensure a safe return from injury.

The analytical approaches Shaun discussed are relatively easy to perform, and interested readers can download Excel sheets that will automatically calculate these measures and only require the practitioner to provide test-retest data. My aim in this blog post is to walk through similar statistical approaches using the coding language R and to build a function that will automatically calculate these metrics once the practitioner provides their data (the analysis for this blog post was built off of the methods proposed by Swinton and colleagues [2], who provide similar methods in Excel on the article’s webpage).

Simulating Data

First, we need to create some data to play with. I’ll simulate two different data sets:

      • Data Set 1: Test-Retest Data
        • This data set will serve as our test-retest trial data. We will use it to get a sense for how noisy the test is and to calculate measures such as TEM and SWC. For example, let’s say that this test-retest trial is something like a simple vertical jump test. We want to have the athletes perform the test, take a rest period, and then perform the test again. We will then calculate how much error there is in the test.
      • Data Set 2: Training Intervention Data
        • Once we’ve established TEM and SWC, we will simulate a second data set that represents a group of athletes performing an experimental training intervention (strength training only) and another group of athletes performing a control condition (endurance training only). We will write a function to evaluate the responses from this data to understand how successful the intervention truly was.

Test-Retest Data Simulation

Our test-retest data will be a simulation of a vertical jump test for 20 athletes.


### Load packages

library(dplyr)
library(ggplot2)
library(reshape)

### Simulate data

set.seed(2018)
subject <- LETTERS[1:20]
group <- rep(c("experimental", "control"), each = 10)
test <- c(round(rnorm(n = 10, mean = 25, sd = 4),1), round(rnorm(n = 10, mean = 25, sd = 3), 1))
retest <- c(round(rnorm(n = 10, mean = 25, sd = 5),1), round(rnorm(n = 10, mean = 24, sd = 4), 1))

reliability.data <- data.frame(subject, test, retest)
head(reliability.data)

  subject test retest
1       A 23.3   31.3
2       B 18.8   26.3
3       C 24.7   26.3
4       D 26.1   33.9
5       E 31.9   18.9
6       F 23.9   23.8

Two metrics we are interested in obtaining from the data are TEM and SWC.

  • Typical error of measurement (TEM) is calculated as the standard deviation of the difference between test-retest scores divided by the square root of 2.
    • TEM = sd(Difference) / sqrt(2)
  • Smallest worthwhile change (SWC) is calculated as the standard deviation of Test 1 multiplied by an effect size of interest. Hopkins and Batterham [4] recommend this effect size to be 0.2, as 0.2 represents the “smallest worthwhile effect” according to Jacob Cohen.
    • SWC = sd(Test1 Scores) * magnitude threshold

Note on the magnitude threshold: With a very homogeneous group of athletes the standard deviation, and ensuing SWC, can be very small, perhaps so small that it is almost meaningless (Buchheit [5]). However, I encourage practitioners to determine the effect size of interest based on the magnitude of change that they feel would be meaningful to worry about or meaningful to report to a coach. This might come down to the type of test being performed or the age/experience of the athlete. I don’t think it is as easy as simply saying “0.2 is always our benchmark.” Sometimes we may want a larger magnitude of interest (perhaps 0.8, 1.0, or 1.2). To be consistent with the scientific literature, I’ll use 0.2 for this example; however, in the test-retest function below, I allow the practitioner to choose the magnitude threshold that is most important to them.
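
Before wrapping these calculations in a function, here is what they look like directly on the simulated data, with the SWC evaluated at a few different magnitude thresholds:

## TEM: SD of the test-retest differences divided by sqrt(2)
sd(reliability.data$retest - reliability.data$test) / sqrt(2)

## SWC at magnitude thresholds of 0.2, 0.8, and 1.2
sd(reliability.data$test) * c(0.2, 0.8, 1.2)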


######### Test-Retest Function ######################
#####################################################

Test_Retest <- function(test1, test2, magnitude.threshold){
	
	require(dplyr)
	
	# combine the vectors into a dataset
	dataset <- data.frame(test1, test2)
	
	# calculate difference
	dataset$Diff <- with(dataset, test2-test1)
	
	# Calculate Mean & SD
	stats <- as.data.frame(dataset %>%
	summarize(PreTest.Mean = round(mean(test1, na.rm =T),2),
		PreTest.SD = round(sd(test1, na.rm = T),2),
		PostTest.Mean = round(mean(test2, na.rm = T), 2),
		PostTest.SD = round(sd(test2, na.rm = T), 2),
		Mean.Difference = round(mean(Diff, na.rm = T), 2),
		SD.Difference = round(sd(Diff, na.rm = T), 2)))
		
	# Calculate TEM
	TEM <- sd(dataset$Diff, na.rm = TRUE)/sqrt(2)
	
	# Calculate SWC
	swc <- magnitude.threshold*sd(test1)
	
	# Function output
	list(SummaryStats = stats, TEM = round(TEM,2), SWC = round(swc, 3))
		
}

With the function loaded, we can now supply it with the data from our simulated test-retest trial. All that is required are three inputs:

  1. A vector representing the scores for test 1.
  2. A vector representing the scores for test 2 (the re-test).
  3. The magnitude threshold of interest. (Again, in this example I’ll use 0.2, to represent the smallest worthwhile change. Feel free to change this to a different magnitude threshold, such as 0.8 or 1.2, and see how it affects the results.)

test.retest.results <- Test_Retest(test1 = reliability.data$test, test2 = reliability.data$retest, magnitude.threshold = 0.2)

test.retest.results

Looking at the output, we see that the results are returned as a list with three elements:

  1. Summary statistics of both tests and the difference between tests
  2. The TEM
  3. The SWC

This type of list format is useful if you want to call specific parts of the analysis. For example, if I need the TEM to be included downstream, in a later analysis, I can simply call it by typing:


test.retest.results$TEM
[1] 4.83

One thing we may notice from the output of our function is that the error for this test is rather large relative to the SWC. This could potentially be an issue when attempting to interpret future results for this test. In this case, we may want to go back to the drawing board and try to figure out a way to minimize the test error (or potentially consider using a different test). Alternatively, using this test would mean that we need to see a rather large change in the athlete’s performance to be certain that the improvement was “real.”
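
To put a number on “rather large”: the confidence interval that the outcome function below places around a change score is roughly ±1.96 * sqrt(2) * TEM (it actually uses the t-distribution, so slightly wider), which dwarfs the SWC here:

## approximate change needed to exceed the noise of the test at ~95% confidence
1.96 * sqrt(2) * 4.83    # ~13.4, versus an SWC of ~0.76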

Training Intervention Simulation Data

The training data that we’ll simulate will have baseline vertical jump scores and follow-up vertical jump scores at 8 weeks. Group 1 will only perform strength training while Group 2 will only perform endurance training.


### Simulate data

set.seed(2018)
subject <- LETTERS[1:20]
group <- rep(c("experimental", "control"), each = 10)
baseline <- c(round(rnorm(n = 10, mean = 24, sd = 3),1), round(rnorm(n = 10, mean = 24, sd = 3), 1))
post.intervention <- c(round(baseline[1:10] + rnorm(n = 10, mean = 8, sd = 5), 1), round(rnorm(n = 10, mean = 27, sd = 5), 1))

study.data <- data.frame(subject, group, baseline, post.intervention)
head(study.data)

  subject        group baseline post.intervention
1       A experimental     22.9              35.1
2       B experimental     20.1              31.0
3       C experimental     21.4              30.3
4       D experimental     27.2              33.0
5       E experimental     23.6              31.2
6       F experimental     27.1              30.5

tail(study.data)

   subject   group baseline post.intervention
15       O control     23.4              30.4
16       P control     25.4              24.5
17       Q control     21.2              17.7
18       R control     32.2              30.7
19       S control     25.0              26.6
20       T control     22.1              32.4

Next, we create a function called outcome, which takes the following eight inputs:

  1. A vector of baseline scores
  2. A vector of post-test scores (follow-up scores)
  3. A vector denoting which subjects belong to each of the groups
  4. A vector of subject IDs
  5. The TEM established from our test-retest trial above
  6. The SWC established from our test-retest trial above
  7. The number of samples in our test-retest trial (Reliability.N = 20)
  8. The confidence interval we are interested in. For this example I’ll use 95%; however, feel free to change this value and see how it influences the results.

outcome <- function(baseline.test, post.test, groups, subject.IDs, TEM, SWC, Reliability.N, Conf.Level){
	
	# Combine the vectors into a data set
	df <- data.frame(subject.IDs, groups, baseline.test, post.test)
	
	# True Baseline Score Calculation
	
	df$LowCI.baseline.true <- round(baseline.test - qt(p = (1-Conf.Level)/2, df = Reliability.N - 1, lower.tail = F)*TEM, 2)
	
	df$HighCI.baseline.true <- round(baseline.test + qt(p = (1-Conf.Level)/2, df = Reliability.N - 1, lower.tail = F)*TEM, 2)
	
	# create a difference score
	df$Pre.Post.Diff <- with(df, post.test - baseline.test)
	
	# create confidence intervals around the difference score
	df$LowCI.Diff <-  round(df$Pre.Post.Diff - qt(p = (1-Conf.Level)/2, df = Reliability.N - 1, lower.tail = F) * sqrt(2) * TEM, 2)
	df$HighCI.Diff <-  round(df$Pre.Post.Diff + qt(p = (1-Conf.Level)/2, df = Reliability.N - 1, lower.tail = F) * sqrt(2) * TEM, 2)
	
	# Summary Stats of Change
	
	Pre.Post.Summary.Stats <- df %>% 
					group_by(groups) %>%
					summarize(
						Mean = mean(Pre.Post.Diff),
						SD = sd(Pre.Post.Diff))
					
	# SD of the response
	diff <- as.data.frame(Pre.Post.Summary.Stats)
	sd.response <- sqrt(abs(diff[1,3]^2 - diff[2,3]^2))
	
	# Proportion of Response
	
	mean1 <- diff[1,2]
	mean2 <- diff[2,2]
	
	prop.response.group.1 <- ifelse(SWC > 0, 100-pnorm(q = SWC,
		mean = mean1,
		sd = sd.response)*100, pnorm(q = SWC,
		mean = mean1,
		sd = sd.response)*100)
	
	prop.response.group.2 <- ifelse(SWC > 0, 100-pnorm(q = SWC,
		mean = mean2,
		sd = sd.response)*100, pnorm(q = SWC,
		mean = mean2,
		sd = sd.response)*100)
	
	list(Data = df, 
		Outcome.Stats = Pre.Post.Summary.Stats, 
		Stdev.Response = sd.response, 
		Perct.Responders.Group1 = paste(round(prop.response.group.1, 2), "%", sep = ""),
		Perct.Responders.Group2 = paste(round(prop.response.group.2, 2), "%", sep = "")
	)
	
}

Now we are ready to use the outcome function on our simulated intervention data set.


outcome(baseline.test = study.data$baseline, 
		post.test = study.data$post.intervention, 
		groups = study.data$group,
		subject.IDs = study.data$subject, 
		TEM = 4.83, 
		SWC = 0.756,
		Reliability.N = 20, 
		Conf.Level = 0.95)

Similar to our test-retest function, the results are returned as a list. Let’s look at the results in more detail:

  • The first element of the list provides a table of our original data, except we have a few new columns. First we see that we have Low and High Confidence Interval columns (in this case, these columns represent 95% CI, since that is what I specified when I ran the function). These confidence intervals are specific to the baseline test score. They are important for us to consider because when measuring an athlete we can never be truly confident that the performance they produced is their true performance (due to a variety of factors and, in particular, biological variability). Thus, the function uses the TEM from the test-retest trial to calculate the confidence interval around the athletes’ observed baseline scores. Finally, the last three columns provide us with the post-pre score differences and 95% CI around those difference scores for each individual athlete.
  • The second element of the list gives us the summary statistics for each group based on how they performed in the trial. In this element, we can see that the experimental group (Group 1, the strength training-only group) observed a larger improvement in vertical jump height, on average, following 8 weeks of training, compared to the control group (Group 2, the endurance training-only group). TECHNICAL NOTE: R automatically sorted the two groups alphabetically. As such, even though Group 1 (the experimental group) was first in the original data set, it comes out as being “Group 2” in the output.
  • The third element is the standard deviation of individual responses. Hopkins [6] suggests that this standard deviation represents the amount that the mean effect of the intervention varies between individuals. This standard deviation will be used in the fourth and fifth elements to help understand the individual responses observed within groups.
  • The fourth and fifth elements of the list display the percentage of responders to the treatment. This proportion of response is calculated by evaluating the variability in change scores from the intervention (the standard deviation of individual responses) and the specified SWC from our test-retest trial [2]. In the case of our simulated data set, we see that Group 1 (remember, this is the endurance group, since R organized the data by group alphabetically) had a lower response than Group 2 (the strength training group).

Wrapping Up

When analyzing data in the applied sport science setting, it is important to establish measures such as TEM and SWC so that you can have greater certainty that athletes are progressing and making true performance improvements. In this blog post, I walked through a very simple way to analyze such data and showed that some basic R coding can produce functions that make our job easier and provide quick results (important in the applied setting, where decisions between games need to be made in a timely fashion).

Two future considerations:

  1. The training intervention example I provided may not be terribly realistic in many applied sports settings. For example, rarely will a coach allow the staff to separate players into two groups that train in different ways. In a future blog post, I hope to provide some code for analyzing individuals when serial measurements are taken across a season.
  2. I didn’t provide any visualization of the data. Data visualization is not only critical to understanding the data you are analyzing but also important for presenting your data to coaches, managers, and other practitioners. I hope to address data visualization approaches in a future blog post.

References

  1. Hopkins WG, Marshall SW, Batterham AM, Hanin J. (2009). Progressive statistics for studies in sports medicine and exercise science. Med Sci Sports Exer, 41(1): 3-12.
  2. Swinton PA, Hemingway BS, Saunders B, Gualano B, Dolan E. (2018). A statistical framework to interpret individual response to intervention: Paving the way for personalized nutrition and exercise prescription. Front Nutr, 5(41): 1-14.
  3. Turner A, Brazier J, Bishop C, Chavda S, Cree J, Read P. (2015). Data analysis for strength and conditioning coaches: Using excel to analyze reliability, differences, and relationships. Strength Cond J, 31(1): 76-83.
  4. Hopkins WG, Batterham AM. (2016). Error rates, decisive outcomes and publication bias with several inferential methods. Sports Med, 46(10): 1563-1573.
  5. Buchheit M. (2014). Monitoring training status with HR measures: Do all roads lead to Rome? Front Phys, 5(73): 1-19.
  6. Hopkins WG. (2015). Individual responses made easy. J Appl Physiol, 118: 1444-1446.