TidyTuesday — Powerlifting Performance & Age

TidyTuesday is a really neat project where every week a new data set is provided (for free) and anyone can download the data and share their findings. The basic idea was to get people to trade ideas on how to arrange, summarize, and visualize data within R (primarily using the suite of data science packages that make up the tidyverse).

I’ve enjoyed seeing what people share on Twitter and my friend Ellis Hughes suggested that I join in the fun. As such, I found a data set from an earlier week that was sports related (to keep the analysis relevant with the theme of my blog).

The data set comes from the TidyTuesday on 10/8/2019 (free to download HERE). Briefly, the data set contains outcomes from International Powerlifting Federation (IPF) Competitions from 1973 up through 2019. Each row represents an individual athlete’s best lift in the squat, bench press, and deadlift, for a given competition. In total, the data set contains 38,244 rows and 15996 unique lifters. (NOTE: There is a much larger data set that is linked to on the GitHub page, but I did not use that one).

I’ll use the Data Analysis Template I discussed in a previous blog article. The only difference between the template from the prior article and the approach I’ll take here is that I have no prior knowledge of the data set. The template works well when we have a specific question to answer as it helps to guide the process from data collection to analysis. However, in this case, as is sometimes common in the real world, people may provide you with a data set without a specific question. As such, some level of data exploration is required to understand the data set and what type of questions may be interesting. Therefore, I’ll begin with just familiarizing myself with the data before developing a question I may want to answer.

Loading Data & Cleaning Data

  • Read in the data from the TidyTuesday GitHub page.
  • Notice that I added a cleaning step when reading in the data. I filter out any age class of 5-12 and I also remove any NA values in the age column (which happened because sometimes exact age wasn’t recorded). I added this step when importing the data after I worked through my analysis because I felt like it was better to do this right away and  space in the code.
  • In the second step, I ordered the data set by athlete name and date of competition.
  • Finally, I created a long format of the data frame (since it is originally in a Wide format) to assist with building data visualizations and I remove any NA’s that were present in the data set (e.g., if a lifter bombs out on their squat in a competition then they have no value for the squat).


df <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-10-08/ipf_lifts.csv") %>%
filter(age_class != "5-12", !is.na(age))

# order the data by lifter and date

df <- df %>%
arrange(name, date)

# create a long format of the data 

df_long <- df %>%
reshape2::melt(., id = c("name", "date", "age", "age_class", "weight_class_kg", "sex"), measure.vars = c("best3squat_kg", "best3bench_kg", "best3deadlift_kg")) %>%
na.omit(df_long)

Data Exploration

Since I don’t really know anything about the data set provided, it is hard to have a question to answer. Thus, I create some basic plots to help orient myself to to the data we are working with.

First, I wanted to see the athletes who have competed in the most competitions in this data set:

I know that lifters in the IPF have a choice of wearing different types of lifting equipment so I wanted to see what sort of competition gear the athletes in this data set wore:

I was curious about the age class and actual age of when athletes, on average, achieve their best lift:

We can also look at this by male and female:

Finally, I want to explore the distribution of power lifting totals between men and women:

Research Question

After exploring the data a little bit, some of the things that stand out:

1) The data set contains primarily lifters wearing single-ply lifting gear.

2) The boxplots use the ‘age_class’ variable, so everyone within an ‘age_class’ is treated the same and the age bins appear to be rather large (e.g., 24 – 34). I prefer not to think of age data this way since such large groupings can have a lot of variability within them.

3) Looking at the dot plots, which reflect age as a continuous variable, athletes tend to peak in all three of the lifts around their early 30’s.

4) The trend for peaking in performance seems to be consistent among men and women (which is interesting given that I would have suspected women to peak later given that they might be less inclined to take up serious weight training until later in life, whereas male’s tend to start lifting around their high school years).

5) The distribution of powerlifting totals appears to be relatively normally distributed for both men and women, with more variability in the distribution for men than women.

The beauty of graphing your data is that it often reveals underlying patterns that help you get a sense for what is going on. It is instances like this where a statistical model can serve as a gut check to confirm what you can already clearly see.

In looking at the data, the two questions I’ll explore are:

1) At what age do powerlifters peak for the 3 competition lifts?

2) How many competitions do lifters perform until they finally total elite?

I’ll keep these rather simple and brief, as a means of sharing some ideas. These models can (and should) be more thorough and account for things like sex (in the aging curve model, for example) and other variables that may be relevant to how powerlifters progress across  career. What is presented below is just a simple jumping off point of where I might begin when working with data like this to answer a question before extending the model (for example, creating a mixed model to account for individual lifters).

Models

Powerlifter Aging Curve

To develop a simple aging curve model I built a polynomial regression for each of the 3 lifts (again, to keep things simple, I did not include sex in these models). Before building the models, we noticed from our data exploration was that most of the lifters in this data set are single-ply lifters. So I’m going to limit the analysis to them since changing competition gear can influence performance (I’m not going to get into the philosophical debate about which one is “better” than the other — I’ll leave that to the lifters). Additionally, since I’m interested in how lifters perform across their career and when they tend to “peak”, I’m going to limit my analysis to only those lifters who have competed in at least 10 competitions. After cleaning up the data specific to the above inclusion criteria we are left with 6169 rows of data and 426 unique athletes.


# Data clean up for aging curve model
sply <- df %>%
filter(equipment == "Single-ply") %>%
group_by(name) %>%
filter(n() >= 10)

nrow(sply)
nrow(distinct(sply, name))

# 6169
# 426 athletes

 

Now that the data is in the format we’d like, we can build some simple models for each of the three lifts:


squat_age_fit <- lm(best3squat_kg ~ age + I(age^2), data = sply)
bench_age_fit <- lm(best3bench_kg ~ age + I(age^2), data = sply)
deadlift_age_fit <- lm(best3deadlift_kg ~ age + I(age^2), data = sply)

The summary of the three models can be found on my GitHub page. Here is an example of the squat model output:

We see that the coefficient for age is positive while the polynomial of age is negative. This shouldn’t come as a surprise given that we observed an upside down “U” in our plots during the data exploration phase of our analysis. We can use these two coefficients to calculate the peak age from our regression equation. I’ve written a function to do that:


peak_age <- function(coef1, coef2){
x = -(coef1) / (2 * (coef2))
}

 

By supplying the custom function with the two coefficient (age and age^2) we can obtain the peak age from each of our models:


Just as suggested in our data visualizations, the peak age is around the early to mid 30’s with the squat peaking earlier and the bench press peaking later. As an example, we can plot the actual data along with a prediction line and 95% Confidence Interval for the bench press, where the peak age is around 36 years old:

Number of Competitions Until Totaling Elite

To try and answer this question I built a simple time-to-event (survival) model. In this case, the event of interest is the individual achieving an elite total, coded as a 1, and any competition where they do not achieve an elite total coded as a 0. I’m only calculating time to first elite total for each lifter, so there are some lifters that achieve elite and others that do not.

I wasn’t sure of where to obtain the elite total criteria so I found a criteria to use on THIS WEBSITE. However, I’m not certain if these criteria will carry over to single-ply lifters (IE, perhaps these criteria are only specific to raw lifters?). I also wasn’t able to locate an elite total criteria for female lifters, so the below analysis is only specific to male lifters. Finally, not all of the weight classes observed in the data were available on the referenced website. So, this analysis is far from perfect given the data but it will suffice for a simple example.

After adding in the elite total criteria and removing the athletes who were not in a weight class that was specific to the elite total criteria presented in the website, I was left with 6074 male lifters of which, 22% of them (1335) achieved an elite total during their career:

In looking at the number of competitions until a lifter totals elite (plot below), it appears that many of them are achieving that status in their first competition. This makes me skeptical of the data as I feel like most lifters would require a number of competitions to achieve an elite total. This may be a function of either (a) the subset of data that has been provided by TidyTuesday or (b) I’m using the wrong elite total criteria for single-ply lifters.

The data was fit with a Kaplan-Meier curve in order to create a simple model and nice visual of the data. Below is the summary table produced from the model followed by the time-to-event curve (event being elite total).

 

Conclusions

The TidyTuesday project is a great way to get access to data sets and share ideas. This was a fun one to do given it is specific to sport and I had the opportunity to try a few different models while also showing different ways of graphing the data. Finally, there is a bunch of different coding approaches I used to clean up the data, which you can check out on my GitHub page.

Python Pivot Tables

Feedback from the last two blog entries suggests that people seem to enjoy the “how to” nature of them. Walking through how to do some basic coding steps while exploring, summarizing, visualizing, and analyzing data is a great way to help others feel comfortable when trying to move from excel or SPSS to a code language (like R or Python). Additionally, this also helping me to get back to first principles and put together a step-by-step approach to teaching this stuff to others.

As such, I’ve decided to start incorporating some articles where I walk through each line of code. Since I’ve started to do a bit more work in Python (though I still strongly favor R), I feel like this is a great place for me to start.

For those of us who started out in sport and exercise science, using Excel Pivot Tables was one of the first way we began to explore our data and summarize what was there. Today, I’ll walk through how we can create some of those exact same pivot tables in Python.

The Jupyter Notebook and data for this article are available on my GitHub page.

The data I used for this article was taken from www.hockey-reference.com and consists of game outcomes from the 2014-2015 Season through the 2018-2019 Season. I stored the data as a .csv file for this article but I’ll do a future blog article on how to scrape data from webpages like this. This is additionally fun because I know very little about hockey, so I get to learn along the way!

Step 1: Load Packages

  • First I start out just loading the packages (Python people call them “libraries” but I’m still stuck in the old R jargon, so I say “packages”) that contain some of the functions that I’ll need to do my analysis.
  • Pandas is the main data frame package in Python and numpy is useful for dealing with arrays and various mathematical functions.
  • Seaborn and Matplotlib are two packages that I mainly use for data visualizing, however I do not do anything visualization in this article.
  • Finally, the OS package is what I use to check my working directory and change the working directory, if needed (more on that later).
  • Once importing each package I give them an alias (e.g., pandas as pd) so that I can use a short hand version of the package name when I call one of it’s functions, which you will see throughout this article.

# Load packages

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import os as os

Step 2: Change the working directory if needed and load data

  • The working directory is the place where the data for your analysis is stored.
  • os.getcwd() is a function from the “os” package that gets my current working directory. If the directory needs to be changed to a different folder (for example, if the data were stored somewhere other than my desktop), I use the os.chdir() function to change it.
  • Once I have the directory set, I load the data using the pd.read_csv() function from the pandas package.
  • After the data is loaded, we take a look at the first few rows of the data frame.
  • An example of all of these steps is provided in the screen shot of my Jupyter notebook.

# Get current working directory
os.getcwd()

# Change the working directory
os.chdir('/Users/Patrick/Desktop')

# Load the 2014 - 2018 NHL Scores data
nhl = pd.read_csv('2014-2018 NHL Scores.csv')

# Look at the first few rows
nhl.head()

Step 3: Clean up the data (All little housekeeping)

  • We notice that the Goals (‘G’) column is repeated, once for the Visitor and once for the Home team. Thus, the second time it is referred to as ‘G.1’. So, we can rename these columns to ‘Visitor_G’ and ‘Home_G’, so that they are more explicit and less confusing.
  • Next, we want to create four new columns. First, we create a column that calculates the point differential of the game for the home team. Then we create a column that identifies whether the winner was the home or away team. Finally, we create a column that tells us which team was the winner and a column telling us which team was the loser. You can see that I do all of these steps by first creating the conditions I’m testing (comparing goals for the home team to goals for the away team) and then assigning the outcome a specific label (e.g. home, away, or tie).

nhl['Point_Diff'] = nhl['Home_G'] - nhl['Visitor_G']

conditions = [
(nhl['Home_G'] > nhl['Visitor_G']),
(nhl['Home_G'] < nhl['Visitor_G']),
(nhl['Home_G'] == nhl['Visitor_G'])]

choices = ['home', 'visitor', 'tie']

nhl['winner'] = np.select(conditions, choices, default = 'null')

cond_for_winner = [
(nhl['winner'] == 'home'),
(nhl['winner'] == 'visitor'),
(nhl['winner'] == 'tie')]

choices_for_winner = [nhl['Home'], nhl['Visitor'], 'tie']
choices_for_loser = [nhl['Visitor'], nhl['Home'], 'tie']

nhl['winning_team'] = np.select(cond_for_winner, choices_for_winner, default = 'null')
nhl['losing_team'] = np.select(cond_for_winner, choices_for_loser, default = 'null')

  • Our data frame now looks like this:

Step 4: Create Pivot Tables

Now that we have the data arranged, we can create some pivot tables.

First, let’s create a pivot table looking at the average home point differential by season.


nhl.groupby('Season').mean()['Point_Diff']

 

I produced this, as you can see from the code, by looking for the mean of ‘Point_Diff’, grouped by the variable ‘Season’.

This is useful, but I don’t love how it looks. Using the pivot_table() function, we can get a nicer looking output.


nhl.pivot_table(values = 'Point_Diff', index = 'Season', aggfunc = np.mean)

 

Much nicer!!

Next, lets look at how many times the home team won.


nhl.groupby(['winner']).size()

If we divide by the total number of observations, we get the percentage of times the home team won.


nhl.groupby(['winner']).size() / len(nhl.index)

 

We can also look at home and visitor win percentage by each season within our data, again by using the groupby() function. We then create two columns, one for home and one for visitor win percentage with just some basic math.


season_totals = nhl.groupby(['Season', 'winner']).size().unstack()
season_totals['Home_Win_Pct'] = season_totals['home'] / (season_totals['home'] + season_totals['visitor'])
season_totals['Visitor_Win_Pct'] = season_totals['visitor'] / (season_totals['home'] + season_totals['visitor'])
season_totals

Finally, we can look at the Win% for each team over this 5 year stretch. Three steps were required to do this:

  1. The first chunk of code gets the total number of times each team played a game.
  2. The second chunk of code tally’s up the teams total wins and losses and stores them in a data frame (team_perf).
  3. The final chunk of code calculates a win percentage an then sorts the win percentage from best to worse.

team_win_totals = nhl.groupby(['winning_team']).size()
team_loss_totals = nhl.groupby(['losing_team']).size()

team_win_totals = pd.DataFrame(team_win_totals, columns = ['wins'])
team_loss_totals = pd.DataFrame(team_loss_totals, columns = ['losses'])
team_perf = team_win_totals.join(team_loss_totals)

team_perf['Win_Pct'] = team_perf['wins'] / (team_perf['wins'] + team_perf['losses'])
team_perf.sort_values(by = ['Win_Pct'], ascending = False)

Conclusion

There are a number of other things we could have built Pivot Tables on to explore this simple data set. Feel free to take the code and the data from my GitHub page and play around with it yourself. I don’t code too often in Python (still trying to figure out a lot of the syntax) so if you notice any errors in the code or ways to make it cleaner (nothing beats R Tidyverse, in my opinion), feel free to comment below.

 

 

Data Analysis Template in R Markdown & Jupyter Notebook

The nice thing about working on a team with other analysts, working as part of a research group, or working on your PhD is the ability to share analysis with other colleagues, get feedback, and learn new ways of thinking about things.

Interestingly, when I’ve inquired to colleagues at some teams about how they share their analysis with their group they often say that, “people do their analysis and just present the results”. I think this is a big miss in terms of being able to have transparency in the research process, sharing so that others can learn or help to provide constructive feedback, and walking through the steps you went through (data retrieval,  data cleaning, analysis, model testing, etc) to ensure that things make sense to the group before being shared with the end user.

For the PhD student, a more streamlined approach to the entire analysis can help them talk through what they did with their advisors, ensure that all the correct steps were taken during the analysis, and have greater confidence about what their data is and is not saying (which can really come in handy when it is time to defend the thesis!). When I was doing my PhD I would often try and put all my steps into a power point presentation to walk through with my supervisors. I never liked that, however, because it always felt clumsy and I was never really getting to the guts of the analysis as much as I was just sharing the outcomes of what I did and why I did it and talking through how I did it. A template that allows for a clear presentation would have made things much easier for both myself and my supervisors

In my last post, I used R Markdown to create a report that allows the sport scientist to share some basic data analysis with the strength and conditioning staff and other coaches. As I said in that post, R Markdown is a wonderful resource for creating reports where you can hide your code and simply show visualizations of the data and model outputs. But, what if we don’t want to hide our code?! In this sense, R Markdown is extremely useful for setting up a data analysis template to allow you to walk through all the steps in your project and share the results with your colleagues or PhD supervisors. Additionally, you could also keep R Studio open when presenting your findings and address any changes/suggestions that people may have, in real time before, “knitting” the markdown file into the final html or pdf document. This last part allows the analysis to come to life and allows you to make direct changes and immediately show how they impact the outcome of the analysis!

Data Analysis Templates

There are a number of different data analysis frameworks one could follow. Two that immediately come to mind are the Cross Industry Standard Process for Data Mining (CRISP-DM) and the Problem, Plan, Data, Analysis, and Conclusion (PPDAC) Cycle.

Although they come from different industries — CRSIP-DM from the business world  and PPDAC from more of the statistics education world — there is considerable overlap and both have the aim of providing the analyst with a clear path to answering their research question.

The objectives of each phase within these two frameworks is shown below.

 

As you can see, the end goal of the analysis is different between the two frameworks: CRISP-DM being targeted at deploying a model specific to business use cases and PPDAC providing more of a runway for scientific publication. However, both can provide us with an appreciation for creating a systematic process around data analysis, allowing for a clear explanation of our approach when discussing with colleagues or PhD supervisors.

In an attempt to create something more generic and less specific to a certain industry or field, I came up with my own framework:

The framework is freely available on my GitHub page in both an R Markdown and Jupyter Notebook (if you prefer Python) formats. If you’d like to see with the R Markdown HTML looks like, click here >> PWard_-_Data_Analysis_Framework.

All you have to do is take the template (either R Markdown or Jupyter Notebook), delete the comments that I have under each section and fill in your own comments and your R or Python script, where applicable, to conduct your analysis. Once complete, you will have a clean looking file that details your entire approach.

I’ve made a simple example of what using the template could look like. If you are interested in seeing the result in R Markdown, CLICK HERE >> Data_Analysis_Framework_Example_–_MLB_Hitting. If you are interested in seeing the result in a Python Jupyter Notebook, CLICK HERE >> Data Analysis Framework Example — MLB Hitting (Jupyter).

All of the code and the templates for use are available on my GitHub page.

 

Total Score of Athleticism — R Markdown & R Shiny

A battery of performance tests (e.g., strength, power, fitness, agility) are often used by strength and conditioning and sports science staffs to evaluate a player’s current physical status. Such information can help to guide future training programs aiming to improve deficiencies and enhance performance. Having a single value that represents the athlete’s overall athleticism may be useful to identify the most well-rounded athletes in the club and may also help in communicating the test results of each player to the coaching staff in a digestible manner.

Recently, Anthony Turner and colleagues published the paper, Total Score of Athleticism: Holistic Athlete Profiling to Enhance Decision-Making, in the NSCA’s Strength and Conditioning Journal. While there are number of ways that one could index an athlete and represent their athleticism in a single value, this approach is simple to calculate for the practitioner and therefore I wanted to use it as an example of how you can create an R markdown report and Shiny app for displaying the results. (NOTE: For those interested in doing the analysis in excel, Anthony has a YouTube channel where he details the process. CLICK HERE).

Calculating Total Score of Athleticism (TSA)

The TSA is derived by calculating the z-score for each test in your battery and then averaging over the z-score for each individual to produce a single value.

The z-score is calculated simply as:

The values for a z-score will be reported with a mean of 0 and standard deviations ranging from -3 to 3 where ±1 SD represents ~68% of the scores, ±2 SD represents ~95% of the scores, and ±3 SD represents ~99% of the scores. In this way, values that are negative suggest the athlete is below average while values that are positive suggest the athlete is above average, relative to the group.

Some coaches or practitioners, however, may not like looking at a z-score because it is difficult for them to wrap their heads around the values (though I think it is easier to see the results as a z-score when performance is represented as positive and negative) and may instead prefer to look at the scores on a 0-100 scale. To convert the z-scores to t-scores on a 0-100 scale we use this formula:

Now, instead of having positive and negative values representing above or below average athletes, a score of a 50 represents average. As such, 10 points in either direction of 50 represent the number of standard deviations from average the athlete is. For example 60 = 1 SD, 70 =  2 SD, and 80 = 3 SD.

R Markdown Report

R Markdown is a simple way to take your analysis and turn it into a report or document for sharing. The beauty of R Markdown is that you can choose to show your R Code (if you are sharing with other professionals or colleagues) or hide your R Code and just show the results of the analysis (if you are sharing with coaches or other staff members).

I’ll put my R code in here to walk through the steps but if you are interested in the R Markdown file, just go over to my GitHub page.

First, we need to load our required packages and make up some fake data (since I don’t have any data I can use publicly).

# Packages
library(tidyverse)
library(reshape)
library(stringr)

# Simulate data
set.seed(3344)
athlete <- as.factor(1:30)
cmj <- c(round(rnorm(n = 10, mean = 30, sd = 4), 1), round(rnorm(n = 10, mean = 24, sd = 4), 1), round(rnorm(n = 10, mean = 33, sd = 2), 1))
sprint_40 <- c(round(rnorm(n = 10, mean = 4.5, sd = .1), 2), round(rnorm(n = 10, mean = 4.9, sd = .2), 2), round(rnorm(n = 10, mean = 5, sd = .2), 2))
bench <- c(round(rnorm(n = 10, mean = 20, sd = 4), 1), round(rnorm(n = 10, mean = 12, sd = 4), 1), round(rnorm(n = 10, mean = 30, sd = 4), 1))
five_ten_five <- c(round(rnorm(n = 10, mean = 6.4, sd = .2), 2), round(rnorm(n = 10, mean = 6.7, sd = .2), 2), round(rnorm(n = 10, mean = 7.5, sd = .4), 2))
df <- data.frame(athlete, cmj, sprint_40, bench, five_ten_five)

Next we need to write 2 functions, one for calculating our z-score and one for calculating our t-score.

## z-score function
z_score <- function(x){
z = (x - mean(x, na.rm = T)) / sd(x, na.rm = T)
return(z)
}

## t-score function
t_score <- function(x){
t = (x * 10) + 50
return(t)
}

 

Now we are all set to calculate the z-score and t-score results for our individual athletes. Also, note that before converting to the t-score, any test where negative reflects better performance (e.g., speed tests where a faster time is more favorable) you can multiple the z-score by -1. This intuitively makes it easier for those reading the report to always associate values that are positive as “above average” and values that are negative as “below average”.


## calculate the z-score
df <- df %>%
mutate_if(is.numeric, list(z = z_score))

df$sprint_40_z <- df$sprint_40_z * -1
df$five_ten_five_z <- df$five_ten_five_z * -1

## calculate the t-score
df <- df %>%
  mutate(cmj_t = t_score(cmj_z),
         sprint_40_t = t_score(sprint_40_z),
         bench_t = t_score(bench_z),
         five_ten_five_t = t_score(five_ten_five_z))

Finally calculate the TSA z-score (TSA_z) and TSA t-score (TSA_t).


## calculate TSA_z
df$TSA_z <- apply(df[, 6:9], MARGIN = 1, FUN = mean)

## calculate TSA_z
df$TSA_t <- with(df, (TSA_z * 10) + 50)

 

Now that the data is prepared we construct our report. Before plotting the TSA z-scores we need to move the data from the wide format, that it is currently in, to a long format. This will make it easier to code the plot. We will remove the “_z” at the end of each variable to make the labels cleaner looking. We are also going to add a shaded range between -1 and 1 (you can pick whatever range makes sense for you in your situation). Finally, we will include an indicator value that flags the athlete as “green” when they are above average and “red” when they are below average.


# Change data from a wide to long format
df_long <- df %>%
melt(., id = "athlete", measure.vars = c("cmj_z", "sprint_40_z", "bench_z", "five_ten_five_z"))

# remove the _z
df_long$Test <- str_sub(df_long$variable, end = -3)

# Add indicator value
df_long <- df_long %>% mutate("indicator" = ifelse(value > 0, "above avg", "below avg"))

# plot
df_long %>%
filter(athlete %in% c(3, 15, 22, 27)) %>%
ggplot(aes(x = Test, y = value)) +
geom_rect(aes(ymin = -1, ymax = 1), xmin = 0, xmax = Inf, fill = "light grey", alpha = 0.3) +
geom_col(aes(fill = indicator), alpha = 0.8) +
facet_wrap(~athlete) +
scale_fill_manual(values = c("green", "red")) +
theme_light() +
theme(axis.text.x = element_text(face = "bold", size = 12, angle = 45, vjust = 1, hjust = 1),
axis.text.y = element_text(face = "bold", size = 12),
strip.background = element_rect(fill = "black"),
strip.text = element_text(color = "white", face = "bold"),
plot.title = element_text(size = 18),
plot.subtitle = element_text(size = 15)) +
labs(x = "", y = "z-score of performance") +
ggtitle("Test Performance", subtitle = "Player Performance Standardized to the Team") +
ylim(-3, 3)

 

 

We can also plot the entire team’s TSA z-score in a similar manner. I’ve included the TSA z-score values on the plot as well, to help with interpretation.

 

If you would like to see what the finished markdown file looks like click here:
Total_Score_of_Athleticism_-_Report

You can certainly manipulate the code to produce different plots or add more text for the coaches to read. I’ve stuck with the z-score in this report because I personally don’t think the t-score conveys the information in the same manner. My personal preference is that I like to see below average as negative and above average as positive. I’ve added the code for the t-score plot in the markdown file so you can manipulate it if you’d like. Quickly, here is what the plot would look like (horizontal dashed line at 50, indicating average) so you can judge which one you prefer better:

 

Shiny App

You may have noticed from the markdown report that when plotting the individual test results I only showed 4 athletes. In this way the report is rather static! It doesn’t allow us to scroll through players in an efficient manner. For that, we need something that is interactive. Enter Shiny!

Shiny is a way for us to quickly build interactive webpages in R that can be hosted directly on our computer. There is a lot of versatility in these apps. The example below is just a simple app that allows the coach to select an athlete and it will automatically change to that individual’s plot, showing their performance in all of the individual tests as well as their TSA.

We could extend this app to have a second plot to allow for player comparisons or a table of results to allow for viewing the raw values for each of the tests. The code is provided below for you to run on your computer. I don’t go into what all of the elements of the code are doing (for time sake) but perhaps I could revisit other ways of using Shiny in future blog posts and explain more clearly how the code works.


### TotalScore of Athleticism - Shiny App

# Load packages
library(tidyverse)
library(reshape)
library(stringr)
library(shiny)

# simulate data
set.seed(3344)
athlete <- as.factor(1:30)
cmj <- c(round(rnorm(n = 10, mean = 30, sd = 4), 1), round(rnorm(n = 10, mean = 24, sd = 4), 1), round(rnorm(n = 10, mean = 33, sd = 2), 1))
sprint_40 <- c(round(rnorm(n = 10, mean = 4.5, sd = .1), 2), round(rnorm(n = 10, mean = 4.9, sd = .2), 2), round(rnorm(n = 10, mean = 5, sd = .2), 2))
bench <- c(round(rnorm(n = 10, mean = 20, sd = 4), 1), round(rnorm(n = 10, mean = 12, sd = 4), 1), round(rnorm(n = 10, mean = 30, sd = 4), 1))
five_ten_five <- c(round(rnorm(n = 10, mean = 6.4, sd = .2), 2), round(rnorm(n = 10, mean = 6.7, sd = .2), 2), round(rnorm(n = 10, mean = 7.5, sd = .4), 2))
df <- data.frame(athlete, cmj, sprint_40, bench, five_ten_five)

# z-score function

z_score <- function(x){
z = (x - mean(x, na.rm = T)) / sd(x, na.rm = T)
}

##### Data Pre-Processing #####
###############################

# calculate the z-score
df <- df %>%
mutate_if(is.numeric, list(z = z_score))

df$sprint_40_z <- df$sprint_40_z * -1
df$five_ten_five_z <- df$five_ten_five_z * -1

# calculate TSA_z
df$TSA_z <- apply(df[, 6:9], MARGIN = 1, FUN = mean)

# Change data from a wide to long format
df_long <- df %>% 
melt(., id = "athlete", measure.vars = c("cmj_z", "sprint_40_z", "bench_z", "five_ten_five_z", "TSA_z"))

# remove the _z
df_long$Test <- str_sub(df_long$variable, end = -3)

# Add indicator value
df_long < df_long %<% mutate("indicator" = ifelse(value > 0, "above avg", "below avg"))

##### Shiny App #####
#####################

## User Interface

athlete <- as.vector(unique(df_long$athlete))

ui <- fluidPage(

titlePanel("Performance Testing Results"),

selectInput(
input = "athlete",
label = "athlete",
choices = athlete,
selected = "1"
),

plotOutput(outputId = "tsa.plot",
width = "60%")
)

## server

server <- function(input, output){

dat <- reactive({
dataset <- subset(df_long, athlete == input$athlete)
dataset
})

output$tsa.plot <- renderPlot({
d <- dat()

athlete.plot <- ggplot(data = d, aes(x = Test, y = value)) +
geom_rect(aes(ymin = -1, ymax = 1), xmin = 0, xmax = Inf, fill = "light grey", alpha = 0.3) +
geom_col(aes(fill = indicator), alpha = 0.8) +
scale_fill_manual(values = c("green", "red")) +
theme_light() +
theme(axis.text.x = element_text(face = "bold", size = 12, angle = 45, vjust = 1, hjust = 1), 
axis.text.y = element_text(face = "bold", size = 12),
strip.background = element_rect(fill = "black"),
strip.text = element_text(color = "white", face = "bold"),
plot.title = element_text(size = 18),
plot.subtitle = element_text(size = 15)) +
labs(x = "", y = "z-score of performance") +
ggtitle("Test Performance", subtitle = "Player Performance Standardized to the Team") +
ylim(-3, 3)

print(athlete.plot)
})
}

## Run the app
shinyApp(ui = ui, server = server)

The website ends up looking like this:

Conclusion

There are a number of ways that one can index several performance tests and create a single number for coaches to digest. This is one simple example of an approach that is easy for practitioners to understand and action against (e.g., low values in red may require specialized approaches to training and performance development). The simplicity of this approach makes it easy to create simple reports and web apps that allow the strength coach and sports science staff to quickly share information with the coaching staff in an easy and interactive manner.

Obviously care should go into selecting the test battery as it should reflect elements that are specific to success in your given sport. Finally, the current format of the Total Score of Athleticism treats each test with equal weight. However, there may be situations where you want to place more weight on certain tests. For example, some tests may be more important for certain position groups than others. Alternatively, some tests may have a stronger association with sports performance and thus require greater weight than other tests. All of these things can be addressed in your setting by simply using the code in this blog and altering it to meet your needs.

 

 

References

1) Turner, AN. et al. (2019). Total Score of Athleticism: Holistic Athlete Profiling to Enhance Decision-Making, Strength Cond J. 2019. Epub-ahead-of-print.

Estimating Performance

When looking at sports statistics it’s important to keep in mind that performance is a blend of both skill and luck. As such, recognizing that all athletes exhibit some level of regression to the mean helps us put into perspective that observed performance is not necessarily where that individual’s true performance lies. For example, we wouldn’t believe that a baseball player who starts the MLB season going 6 for 10 has a true .600 batting average that they will carry throughout the season. Eventually, they will regress back down to something more normal (more average). Conversely, if a player starts the season 1 for 10 we would expect them to eventually move back up to something more normal.

A goal in sports analytics is to try and estimate the true performance of a player given some observed data. One of my favorite papers on this topic is from Efron and Morris (1977), Stein’s Paradox in Statistics. The paper discusses ways of using observed outcomes to make future forecasts using the James-Stein Estimator and then a Bayes Estimation approach.

Using 2019 MLB data, I’ve put together some R code to walk through the methods proposed in the paper.

Getting Data

First, we need to load Bill Petti’s baseballr package, which is a handy package for scraping MLB data. We will also load the ggplot2 package for data visualizations

library(baseballr)
library(ggplot2)

The aim of this analysis will be to look at hitting performance (batting average) over the first 30 days of the 2019 MLB season, make a forecast of the player’s true batting average, based on the observations of those first 30 days, and then test that forecast on the next 30 days.

We will obtain two data sets, a first 30 days data set and a second 30 days data set.

### Get first 30 days of 2019 MLB and Second 30 days

dat_first30 <- daily_batter_bref(t1 = "2019-03-28", t2 = "2019-04-28")
dat_second30 <- daily_batter_bref(t1 = "2019-04-29", t2 = "2019-05-29")

## Explore the data frames

head(dat_first30)
head(dat_second30)

dim(dat_first30) # 595 x 29
dim(dat_second30) # 611 x 29

Evaluating The Data

Let’s now reduce the two data frame’s down to the columns we need (Name, AB, and BA).

dat_first30 <- dat_first30[, c("Name", "AB", "BA")]
dat_second30 <- dat_second30[, c("Name", "AB", "BA")]

 

Let’s look at the number of observations (AB) we see for the players in the two data sets.

 

quantile(dat_first30$AB)
quantile(dat_second30$AB)

par(mfrow = c(1,2))
hist(dat_first30$AB, col = "grey", main = "First 30 AB")
rug(dat_first30$AB, col = "red", lwd = 2)
hist(dat_second30$AB, col = "grey", main = "Second 30 AB")
rug(dat_second30$AB, col = "red", lwd = 2)

 

We see a considerable right skew in the data with a large number of players observing a small amount of at bats over the first 30 days and a few players observing a large number observations (the everyday players).

Let’s see what Batting Average looks like.

quantile(dat_first30$BA, na.rm = T)
quantile(dat_second30$BA, na.rm = T)

par(mfrow = c(1,2))
hist(dat_first30$BA, col = "grey", main = "First 30 BA")
rug(dat_first30$BA, col = "red", lwd = 2)
hist(dat_second30$BA, col = "grey", main = "Second 30 BA")
rug(dat_second30$BA, col = "red", lwd = 2)

 

We see the median batting average for MLB players over the 60 days ranges from .224, in the first 30 days, to .234, in the second 30 days. We also see some players with a batting average of 1.0, which is probably because they batted only one time and got a hit. We also see that there are a bunch of players with a batting average of 0.

This broad range of at bats and batting average values is actually going to be interesting when we attempt to forecast future performance as small sample sizes make it difficult to have faith in a player’s true ability. This is one area in which the James-Stein Estimator and Bayes Estimation approaches may be useful, as they allow for shrinkage of the observed batting averages towards the mean. In this way, the forecast isn’t too overly confident about the player who went 6 for 10 and it isn’t too under confident about the player who went 1 for 10. Additionally, for the player who never got an at bat, the forecast will suggest that the player is an average player until future data/observations can be gathered to prove otherwise.

Estimating Performance

We will use 3 approaches to forecast performance in the second 30 days:

  1. Use the batting average of the player in the first 30 days and assume that the next 30 days would be similar. This will server as our benchmark for which the other two approaches need to improve upon if we are to use them. Note, however, that this approach doesn’t help us at all for players who had 0 at bats.
  2. Use a James-Stein Estimator to forecast the second 30 days by taking the observed batting averages and applying a level of shrinkage to account for some regression to the mean.
  3. Account for regression to the mean by using some sort of prior assumption of the batting average mean and standard deviation of MLB players. This is a type of Bayes approach to handling the problem and is discussed in the last section of Effron and Moriss’s article.

Before we start making our forecasts, I’m going to take a random sample of 50 players from the first 30 days data set. This will allow us to work with a subset of the data for the sake of the example. Additionally, rather than cleaning up the data or setting an inclusion criteria based on number of at bats, by taking a random sample, I’ll get a good mix of players that had no at bats, very little at bats, an average number of at bats, and a large number of at bats. This will allow us to see how well the three forecast approaches handle a large variability in observations. (Technical Note: If you are going to follow along with the r-script below, make sure you use the same set.seed() as I do to ensure reproducibility).

## Get a sample of 50 players from the first 30

set.seed(1657)
N <- nrow(dat_first30)
samp_size <- 50
samp <- sample(x = N, size = samp_size, replace = F)

first30_samp <- dat_first30[samp, ]
head(first30_samp)

We need to locate these same players in the second 30 days data set so that we can test our forecasts.

## Find the same players in the second 30 days data set

second30_samp <- subset(dat_second30, Name %in% unique(first30_samp$Name))

nrow(second30_samp) # 37

Only 37 players from the first set of players (the initial 50) are available in the second 30 days. This could be due to a number of reasons such as injury, getting sent down to triple A, getting benched for a player that was performing better, etc. In any event, we will work with these 37 players from here on out. So, we need to go into the sample from the first 30 days and find those players. Then we merge the two sample sets together

## Subset out the 37 players in the first 30 days sample set

first30_samp <- subset(first30_samp, Name %in% second30_samp$Name)

nrow(first30_samp) # 37

## Merge the two samples together

df <- merge(first30_samp, second30_samp, by = "Name")
head(df)

The first 30 days are denoted as AB.x and BA.x while the second are AB.y and BA.y.

First 30 Day Batting Average to Forecast Second 30 Day Batting Average

Using the two columns, BA.x and BA.y, we can subtract one from the other to obtain the difference in our forecast had we simply assumed the first 30 day’s performance (BA.x) would be similar to the second (BA.y). From there, we can calculate the mean absolute error (MAE) and the root mean square error (RMSE), which we will use to compare the other two methods.

## Difference between first 30 day BA and second 30 day BA

df$Proj_Diff_Avg <- with(df, BA.x - BA.y)
mae <- mean(abs(df$Proj_Diff_Avg), na.rm = T)
rmse <- sqrt(mean(df$Proj_Diff_Avg^2, na.rm = T))


Using the James-Stein Estimator

The forumla for the James-Stein Estimator is as follows:

JS = group_mean + C(obs_BA – group_mean)

Where:

  • group_mean = the average of all players in the first 30 days
  • obs_BA = the observed batting average for an individual player in the first 30 days
  • C = a constant that represents the shrinkage factor. C is calculated as:
    • C = 1 – ((k – 3)*σ2)/(Σ(y – ŷ)2)
      • k = the number of unknown means we are trying to forecast (in this case the sample size of our first 30 days)
      • σ2 = the group variance observed during the first 30 days
      • (Σ(y – ŷ)2) = the sum of the squared differences between each player’s observed batting average and the group average during the first 30 days

First we will calculate our group mean, group SD, and squared differences from the first 30 day sample.


# Calculate grand mean and sd

group_mean <- round(mean(df$BA.x, na.rm = T), 3)
group_mean # .214

group_sd <- round(sd(df$BA.x, na.rm = T), 3)
group_sd # .114

## Calculate sum of squared differences from the group average

sq.diff <- sum((df$BA.x - group_mean)^2)
sq.diff # .467

Next we calculate our shrinkage factor, C.

## Calculate shirinkage factor

k <- nrow(df)
c <- 1 - ((k - 3) * group_sd^2) / sq.diff
c # .053

Now we are ready to make a forecast of batting average for the second 30 days using the James-Stein Estimator and calculate the MAE and RMSE.

df$JS <- group_mean + c*(df$BA.x - group_mean)

df$Proj_Diff_JS <- with(df, JS - BA.y)

mae_JS <- mean(abs(df$Proj_Diff_JS), na.rm = T)
rmse_JS <- sqrt(mean(df$Proj_Diff_JS^2, na.rm = T))


Using a Prior Assumption for MLB Batting Average

In this example, the prior batting average I’m going to use will be the mean and standard deviation from the entire first 30 day data set (the original data set, not the sample). I could, of course, use historic data to build my prior assumption but I figured I’ll just start with this approach since I have the data readily accessible.

# Get a prior for BA

prior_BA_avg <- mean(dat_first30$BA, na.rm = T)
prior_BA_sd <- sd(dat_first30$BA, na.rm = T)

prior_BA_avg # .203
prior_BA_sd # .136

For our Bayes Estimator, we will use the following approach:

BE = prior_BA_avg + prior_BA_sd(obs_BA – prior_BA_avg)

df$BE <- prior_BA_avg + prior_BA_sd*(df$BA.x - prior_BA_avg)
mae_BE <- mean(abs(df$Proj_Diff_BE), na.rm = T)
rmse_BE <- sqrt(mean(df$Proj_Diff_BE^2, na.rm = T))


Looking at Our Forecasts

Let’s put the MAE and RMSE of our 3 approaches into a data frame so we can see how they performed.

Comparisons <- c("First30_Avg", "James_Stein", "Bayes")
mae_grp <- c(mae_avg, mae_JS, mae_BE)
rmse_grp <- c(rmse_avg, rmse_JS, rmse_BE)

model_comps <- data.frame(Comparisons, MAE = mae_grp, RMSE = rmse_grp)
model_comps[order(model_comps$RMSE), ]

It looks like the lowest RMSE is the Bayes Estimator. The James-Stein Estimator is not far behind. Both approaches out performed just using the player’s first 30 day average as a naive forecast for future performance.

We can visualizes the differences in our projections for the three approaches as well.

We can see that making projections based off of first 30 day average (green) is wildly spread out and the mean value appears to over project a player’s true ability (the peak is greater than 0, the dashed red line). Conversly, the James-Stein Estimator (blue) has a pretty strong peak that is just below 0, meaning it may be under projecting players and pulling some players down too far towards the group average. Finally, the Bayes Estimator (grey) resides in the middle of the two projections with its peak just above 0.

The last thing I’ll do is put some confidence intervals around the Bayes Estimator for each player and take a look at how the sample size influences the forecast and where the observed batting average during the second 30 days was in relationship to that forecast.

df$Bayes_SE <- with(df, sqrt((bayes * (1-bayes))/AB.x))
df$Low_CI <- df$bayes - 1.96*df$Bayes_SE
df$High_CI <- df$bayes + 1.96*df$Bayes_SE

We can then plot the results. (Technical Note: To keep the plot from being too busy, I used only plot the first 15 rows of the data set).

ggplot(df[1:15, ], aes(x = reorder(Name, AB.x), y = bayes)) +
	geom_point(color = "blue") +
	geom_point(aes(y = BA.y), color = "red") +
	geom_errorbar(aes(ymin = Low_CI, ymax = High_CI), width = 0) +
	geom_text(aes(label = paste(AB.x, "ABs", sep = " ")), vjust = -1, size = 3) +
	geom_hline(aes(yintercept = 0), linetype = "dashed") +	
	coord_flip() +
	theme_classic() +
	ggtitle("Bayes Estimator", 
		subtitle = "Blue = Bayes Estimation \nRed = Observed BA during second 30 day \nLabeled ABs = Individual Sample Size the Estimation was built on (First 30 days ABs)")

 

The at bats labelled in the plot are specific to the first 30 days of data, as they represent the number of observations for each individual that the forecast was built on. We can see that when we have more observations, the forecast does better at identifying the player’s true performance based on their first 30 days. Rizzo and Igelsias were the two that really beat their forecast in the second 30 days of the 2019 season (Rizzo in particular). The guys at the bottom (Butera, Wynns, Duplantier, and Fedde) are much harder to forecast given they had such few observations in the first 30 days. In the second 30 days, Duplantier only had 1 AB while Butera and Fedde only had 2.

Wrapping Up

The aim of this post was to work through the approaches to forecasting performance used in a 1977 paper from Efron and Morris. Dealing with small samples is a problem in sport and what we saw was that if we naively just use the average performance of a player during those small number of observations we may be missing the boat on their underlying true potential due to regression to the mean. As such, things like the James-Stein Estimator or Bayes Estimator can help us obtain better estimates by using a prior assumption about the average player in the population.

There are other ways to handle this problem, of course. For example, an Empirical Bayes Approach could be used by assuming a beta distribution for our data and making our forecasts from there. Finally, alternative approaches to modeling could account for different variables that might influence a player during the first 30 days (injury, park factors, strength of opponent, etc). However, the simple approaches presented by Efron and Morris are a nice start.

References

  1. Efron, B., Morris, CN. (1977). Stein’s Paradox in Statistics. Scientific American, 236(5): 119-127.