Category Archives: Sports Analytics

Force Decks – Force Plate Shiny Dashboard

Last week, two of the data scientists at Vald Performance, Josh Ruddy and Nick Murray, put out a free online tutorial on how to create force plate reports using R with data from their Force Decks software.

It was a nice tutorial giving an overview of some of the power behind ggplot2 and the suite of packages that make up the tidyverse. Since they made the data available (in the link above), I decided to pull it down and put together a quick shiny app for those who might be interested in extending the report to an interactive web app.

This isn't the first time I've built a shiny app for the blog using force plate data. Interested readers might want to check out my post from a year ago where I built a shiny interactive report for force-velocity profiling.

You can watch a short preview of the end product in the video link below, and the screenshots underneath the link show a static view of what the final shiny app will look like.

A few key features:

  1. The app always defaults to the most recent testing day on the testDay tab.
  2. The user can select the position group at the top and that position group will be maintained across all tabs. For example, if you select Forwards, when you switch between tabs one and two, forwards will always be there.
  3. The time series plots on the Player Time Series tab are built with plotly, so they are interactive, allowing the user to hover over each test session and see the week-to-week change in the tooltip. When the change exceeds the meaningful change, the point turns red. Finally, because it is plotly, the user can slice out specific dates to look at (as you can see me do in the video example), which comes in handy when there are a large number of tests over time.

All code and data are accessible through my GitHub page.

vald_shiny_app

Loading and preparing the data

  • I load the data in using read.csv() and file.choose(), so navigate to wherever you have the data on your computer and select it.
  • There is some light cleaning to change the date into a date variable. Additionally, there were no player positions in the original data set, so I just made some up and joined those in.

### packages ------------------------------------------------------------------
library(tidyverse)
library(lubridate)
library(psych)
library(shiny)
library(plotly)

theme_set(theme_light())

### load & clean data ---------------------------------------------------------
cmj <- read.csv(file.choose(), header = TRUE) %>%
  janitor::clean_names() %>%
  mutate(date = dmy(date))

player_positions <- data.frame(name = unique(cmj$name),
                               position = c(rep("Forwards", times = 15),
                                            rep("Mids", times = 15),
                                            rep("Backs", times = 15)))

# join position data with jump data
cmj <- cmj %>%
  inner_join(player_positions)


Determining Typical Error and Meaningful Change

  • In this example, I’ll just pretend as if the first 2 sessions represented our test-retest data and I’ll work from there.
  • The typical error of measurement (TEM) was calculated as the standard deviation of the differences between test 1 and test 2, divided by the square root of 2.
  • For the meaningful change, instead of using 0.2 (the commonly used smallest worthwhile change multiplier) I decided to use a moderate change (0.6), since 0.2 is such a small fraction of the between subject SD.
  • For more info on these two values, see my blog post from last week, where I covered them using Python, drawing on a paper Anthony Turner and colleagues wrote.

change_standards <- cmj %>%
  group_by(name) %>%
  mutate(test_id = row_number()) %>%
  filter(test_id < 3) %>%
  select(name, test_id, rel_con_peak_power) %>%
  pivot_wider(names_from = test_id,
              names_prefix = "test_",
              values_from = rel_con_peak_power) %>%
  mutate(diff = test_2 - test_1) %>%
  ungroup() %>%
  summarize(TEM = sd(diff) / sqrt(2),
            moderate_change = 0.6 * sd(c(test_1, test_2)))

Building the Shiny App

  • In the user interface, I first create my sidebar panel, allowing the user to select the position group of interest. You’ll notice that this sidebar panel is not within the tab panels, which is why it stands alone and allows us to select a position group that will be retained across all tabs.
  • Next, I set up 2 tabs. Notice that in the first tab (testDay) I include a select input, to allow the user to select the date of interest. In the selected argument I tell shiny to always select the max(cmj$date) so that the most recent session is always shown to the user.
  • The server is pretty straight forward. I commented out where each tab data is built. Basically, it is just taking the user specified information and performing simple data filtering and then ggplot2 charts to provide us with the relevant information.
  • On the testDay plot, we use the meaningful change to shade the region around 0 in grey and we use the TEM around the athlete’s observed performance on a given day to specify the amount of error that we might expect for the test.
  • On the Player Time Series plot we have the athlete's average line and ±1 SD lines to accompany their data, with points changing color when the week-to-week change exceeds our meaningful change.

### Shiny App -----------------------------------------------------------------------------

## Set up user interface

ui <- fluidPage(
  
  ## set title of the app
  titlePanel("Team CMJ Analysis"),
  
  ## create a selection bar for position group that works across all tabs
  sidebarPanel(
    selectInput(inputId = "position",
                label = "Select Position Group:",
                choices = unique(cmj$position),
                selected = "Backs",
                multiple = FALSE),
    width = 2
  ),
  
  ## set up 2 tabs: One for team daily analysis and one for player time series
  tabsetPanel(
    
    tabPanel(title = "testDay",
             
             selectInput(inputId = "date",
                         label = "Select Date:",
                         choices = unique(cmj$date)[-1],
                         selected = max(cmj$date),
                         multiple = FALSE),
             
             mainPanel(plotOutput(outputId = "day_plt", width = "100%", height = "650px"),
                       width = 12)),
    
    tabPanel(title = "Player Time Series",
             
             mainPanel(plotlyOutput(outputId = "player_plt", width = "100%", height = "700px"),
                       width = 12))
  )
  
)


server <- function(input, output){
  
  ##### Day plot tab ####
  ## day plot data
  day_dat <- reactive({
    
    d <- cmj %>%
      group_by(name) %>%
      mutate(change_power = rel_con_peak_power - lag(rel_con_peak_power)) %>%
      filter(date == input$date,
             position == input$position)
    
    d
    
  })
  
  ## day plot
  output$day_plt <- renderPlot({ day_dat() %>%
      ggplot(aes(x = reorder(name, change_power), y = change_power)) +
      geom_rect(aes(ymin = -change_standards$moderate_change, ymax = change_standards$moderate_change),
                xmin = 0,
                xmax = Inf,
                fill = "light grey",
                alpha = 0.6) +
      geom_hline(yintercept = 0) +
      geom_point(size = 4) +
      geom_errorbar(aes(ymin = change_power - change_standards$TEM, ymax = change_power + change_standards$TEM),
                    width = 0.2,
                    size = 1.2) +
      theme(axis.text.x = element_text(angle = 60, vjust = 1, hjust = 1),
            axis.text = element_text(size = 16, face = "bold"),
            axis.title = element_text(size = 18, face = "bold"),
            plot.title = element_text(size = 22)) +
      labs(x = NULL,
           y = "Weekly Change",
           title = "Week-to-Week Change in Relative Concentric Peak Power")
    
  })
  
  ##### Player plot tab ####
  ## player plot data
  
  player_dat <- reactive({
    
    d <- cmj %>%
      group_by(name) %>%
      mutate(avg = mean(rel_con_peak_power),
             sd = sd(rel_con_peak_power),
             change = rel_con_peak_power - lag(rel_con_peak_power),
             change_flag = ifelse(change >= change_standards$moderate_change | change <= -change_standards$moderate_change, "Flag", "No Flag")) %>%
      filter(position == input$position)
    
    d
  })
  
  ## player plot
  output$player_plt <- renderPlotly({
    
    plt <- player_dat() %>%
      ggplot(aes(x = date, y = rel_con_peak_power, label = change)) +
      geom_rect(aes(ymin = avg - sd, ymax = avg + sd),
                xmin = 0,
                xmax = Inf,
                fill = "light grey",
                alpha = 0.6) +
      geom_hline(aes(yintercept = avg - sd),
                 color = "black",
                 linetype = "dashed",
                 size = 1.2) +
      geom_hline(aes(yintercept = avg + sd),
                 color = "black",
                 linetype = "dashed",
                 size = 1.2) +
      geom_hline(aes(yintercept = avg), size = 1) +
      geom_line(size = 1) +
      geom_point(shape = 21,
                 size = 3,
                 aes(fill = change_flag)) +
      facet_wrap(~name) +
      scale_fill_manual(values = c("red", "black", "black")) +
      theme(axis.text = element_text(size = 13, face = "bold"),
            axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1),
            plot.title = element_text(size = 18),
            strip.background = element_rect(fill = "black"),
            strip.text = element_text(size = 13, face = "bold"),
            legend.position = "none") +
      labs(x = NULL,
           y = NULL,
           title = "Relative Concentric Peak Power")
    
    ggplotly(plt)
    
  })
  
  
}



shinyApp(ui, server)

TidyTuesday — Powerlifting Performance & Age

TidyTuesday is a really neat project where every week a new data set is provided (for free) and anyone can download the data and share their findings. The basic idea is to get people to trade ideas on how to arrange, summarize, and visualize data within R (primarily using the suite of data science packages that make up the tidyverse).

I’ve enjoyed seeing what people share on Twitter and my friend Ellis Hughes suggested that I join in the fun. As such, I found a data set from an earlier week that was sports related (to keep the analysis relevant with the theme of my blog).

The data set comes from the TidyTuesday on 10/8/2019 (free to download HERE). Briefly, the data set contains outcomes from International Powerlifting Federation (IPF) competitions from 1973 up through 2019. Each row represents an individual athlete's best lift in the squat, bench press, and deadlift for a given competition. In total, the data set contains 38,244 rows and 15,996 unique lifters. (NOTE: There is a much larger data set linked on the GitHub page, but I did not use that one.)

I’ll use the Data Analysis Template I discussed in a previous blog article. The only difference between the template from the prior article and the approach I’ll take here is that I have no prior knowledge of the data set. The template works well when we have a specific question to answer as it helps to guide the process from data collection to analysis. However, in this case, as is sometimes common in the real world, people may provide you with a data set without a specific question. As such, some level of data exploration is required to understand the data set and what type of questions may be interesting. Therefore, I’ll begin with just familiarizing myself with the data before developing a question I may want to answer.

Loading Data & Cleaning Data

  • Read in the data from the TidyTuesday GitHub page.
  • Notice that I added a cleaning step when reading in the data. I filter out the 5-12 age class and I also remove any NA values in the age column (which happened because sometimes the exact age wasn't recorded). I added this step at import after I had worked through my analysis, because I felt it was better to do this right away and save space in the code.
  • In the second step, I ordered the data set by athlete name and date of competition.
  • Finally, I created a long format of the data frame (since it is originally in a wide format) to assist with building data visualizations, and I removed any NAs that were present in the data set (e.g., if a lifter bombs out on their squat in a competition then they have no value for the squat).


df <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-10-08/ipf_lifts.csv") %>%
  filter(age_class != "5-12", !is.na(age))

# order the data by lifter and date

df <- df %>%
  arrange(name, date)

# create a long format of the data 

df_long <- df %>%
  reshape2::melt(., id = c("name", "date", "age", "age_class", "weight_class_kg", "sex"),
                 measure.vars = c("best3squat_kg", "best3bench_kg", "best3deadlift_kg")) %>%
  na.omit()

Data Exploration

Since I don't really know anything about the data set provided, it is hard to have a question to answer. Thus, I create some basic plots to help orient myself to the data we are working with.

First, I wanted to see the athletes who have competed in the most competitions in this data set:

I know that lifters in the IPF have a choice of wearing different types of lifting equipment so I wanted to see what sort of competition gear the athletes in this data set wore:

I was curious about the age class and the actual age at which athletes, on average, achieve their best lift:

We can also look at this by male and female:

Finally, I want to explore the distribution of power lifting totals between men and women:

Research Question

After exploring the data a little bit, some of the things that stand out:

1) The data set contains primarily lifters wearing single-ply lifting gear.

2) The boxplots use the ‘age_class’ variable, so everyone within an ‘age_class’ is treated the same and the age bins appear to be rather large (e.g., 24 – 34). I prefer not to think of age data this way since such large groupings can have a lot of variability within them.

3) Looking at the dot plots, which reflect age as a continuous variable, athletes tend to peak in all three of the lifts around their early 30’s.

4) The trend for peaking in performance seems to be consistent between men and women (which is interesting, given that I would have suspected women to peak later, since they might be less inclined to take up serious weight training until later in life, whereas males tend to start lifting around their high school years).

5) The distribution of powerlifting totals appears to be relatively normally distributed for both men and women, with more variability in the distribution for men than women.

The beauty of graphing your data is that it often reveals underlying patterns that help you get a sense for what is going on. It is instances like this where a statistical model can serve as a gut check to confirm what you can already clearly see.

In looking at the data, the two questions I’ll explore are:

1) At what age do powerlifters peak for the 3 competition lifts?

2) How many competitions do lifters perform until they finally total elite?

I'll keep these rather simple and brief, as a means of sharing some ideas. These models can (and should) be more thorough and account for things like sex (in the aging curve model, for example) and other variables that may be relevant to how powerlifters progress across their careers. What is presented below is just a simple jumping-off point for where I might begin when working with data like this to answer a question before extending the model (for example, creating a mixed model to account for individual lifters).

Models

Powerlifter Aging Curve

To develop a simple aging curve model I built a polynomial regression for each of the 3 lifts (again, to keep things simple, I did not include sex in these models). Before building the models, one thing we noticed from our data exploration was that most of the lifters in this data set are single-ply lifters, so I'm going to limit the analysis to them since changing competition gear can influence performance (I'm not going to get into the philosophical debate about which one is “better” than the other — I'll leave that to the lifters). Additionally, since I'm interested in how lifters perform across their careers and when they tend to “peak”, I'm going to limit my analysis to only those lifters who have competed in at least 10 competitions. After cleaning up the data specific to the above inclusion criteria we are left with 6,169 rows of data and 426 unique athletes.


# Data clean up for aging curve model
sply <- df %>%
  filter(equipment == "Single-ply") %>%
  group_by(name) %>%
  filter(n() >= 10)

nrow(sply)
nrow(distinct(sply, name))

# 6169
# 426 athletes


Now that the data is in the format we’d like, we can build some simple models for each of the three lifts:


squat_age_fit <- lm(best3squat_kg ~ age + I(age^2), data = sply)
bench_age_fit <- lm(best3bench_kg ~ age + I(age^2), data = sply)
deadlift_age_fit <- lm(best3deadlift_kg ~ age + I(age^2), data = sply)

The summary of the three models can be found on my GitHub page. Here is an example of the squat model output:

We see that the coefficient for age is positive while the coefficient for the polynomial term (age^2) is negative. This shouldn't come as a surprise given that we observed an upside down "U" in our plots during the data exploration phase of our analysis. We can use these two coefficients to calculate the peak age from our regression equation: for a fitted curve of the form b0 + b1*age + b2*age^2, setting the derivative to zero gives a peak at age = -b1 / (2*b2). I've written a function to do that:


peak_age <- function(coef1, coef2){
  x = -(coef1) / (2 * (coef2))
  return(x)
}


By supplying the custom function with the two coefficients (age and age^2) we can obtain the peak age from each of our models:
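
A quick sketch of that step (the full model summaries and output are on my GitHub page) is to pull the coefficients out of each fitted model with coef() and pass them to the function:

# estimated peak age for each lift
peak_age(coef(squat_age_fit)["age"], coef(squat_age_fit)["I(age^2)"])
peak_age(coef(bench_age_fit)["age"], coef(bench_age_fit)["I(age^2)"])
peak_age(coef(deadlift_age_fit)["age"], coef(deadlift_age_fit)["I(age^2)"])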


Just as suggested in our data visualizations, the peak age is around the early to mid 30’s with the squat peaking earlier and the bench press peaking later. As an example, we can plot the actual data along with a prediction line and 95% Confidence Interval for the bench press, where the peak age is around 36 years old:
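
The plot itself comes from the code on my GitHub page; a minimal sketch of the idea, using predict() with a 95% confidence interval over an assumed age grid, might look like this:

# sketch: bench press predictions with a 95% CI across an assumed age grid
age_grid <- data.frame(age = 18:50)
preds <- cbind(age_grid, predict(bench_age_fit, newdata = age_grid, interval = "confidence"))

ggplot(preds, aes(x = age, y = fit)) +
  geom_ribbon(aes(ymin = lwr, ymax = upr), fill = "light grey", alpha = 0.6) +
  geom_point(data = sply, aes(x = age, y = best3bench_kg), alpha = 0.2) +
  geom_line(size = 1.2) +
  labs(x = "Age", y = "Best Bench Press (kg)", title = "Bench Press Aging Curve")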

Number of Competitions Until Totaling Elite

To try and answer this question I built a simple time-to-event (survival) model. In this case, the event of interest is the individual achieving an elite total, coded as 1, while any competition where they do not achieve an elite total is coded as 0. I'm only calculating time to first elite total for each lifter, so some lifters achieve an elite total and others do not.

I wasn't sure where to obtain the elite total criteria, so I found a set of criteria to use on THIS WEBSITE. However, I'm not certain these criteria carry over to single-ply lifters (i.e., perhaps they are only specific to raw lifters?). I also wasn't able to locate elite total criteria for female lifters, so the analysis below is specific to male lifters. Finally, not all of the weight classes observed in the data were available on the referenced website. So, this analysis is far from perfect given the data, but it will suffice for a simple example.

After adding in the elite total criteria and removing the athletes who were not in a weight class specific to the elite total criteria presented on the website, I was left with 6,074 male lifters, of which 22% (1,335) achieved an elite total during their career:

In looking at the number of competitions until a lifter totals elite (plot below), it appears that many of them achieve that status in their first competition. This makes me skeptical of the data, as I feel like most lifters would require a number of competitions to achieve an elite total. This may be a function of either (a) the subset of data provided by TidyTuesday or (b) my using the wrong elite total criteria for single-ply lifters.

The data was fit with a Kaplan-Meier curve in order to create a simple model and nice visual of the data. Below is the summary table produced from the model followed by the time-to-event curve (event being elite total).
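
The exact code is on my GitHub page; as a minimal sketch under assumed names (a data frame elite_df with a comps column for the competition number at first elite total or censoring, and an elite_total column holding the 1/0 event flag), the Kaplan-Meier fit with the survival package would look something like this:

library(survival)

# Kaplan-Meier fit: time = competitions until first elite total (or censoring),
# event = 1 if the lifter totaled elite, 0 if censored
km_fit <- survfit(Surv(comps, elite_total) ~ 1, data = elite_df)
summary(km_fit)
plot(km_fit, xlab = "Competition Number", ylab = "Proportion Yet to Total Elite")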


Conclusions

The TidyTuesday project is a great way to get access to data sets and share ideas. This was a fun one to do given it is specific to sport, and I had the opportunity to try a few different models while also showing different ways of graphing the data. Finally, there are a bunch of different coding approaches I used to clean up the data, which you can check out on my GitHub page.

Data Analysis Template in R Markdown & Jupyter Notebook

The nice thing about working on a team with other analysts, working as part of a research group, or working on your PhD is the ability to share analysis with other colleagues, get feedback, and learn new ways of thinking about things.

Interestingly, when I've inquired to colleagues at some teams about how they share their analysis with their group, they often say that "people do their analysis and just present the results." I think this is a big miss in terms of having transparency in the research process, sharing so that others can learn or help provide constructive feedback, and walking through the steps you went through (data retrieval, data cleaning, analysis, model testing, etc.) to ensure that things make sense to the group before being shared with the end user.

For the PhD student, a more streamlined approach to the entire analysis can help them talk through what they did with their advisors, ensure that all the correct steps were taken during the analysis, and have greater confidence about what their data is and is not saying (which can really come in handy when it is time to defend the thesis!). When I was doing my PhD I would often try to put all my steps into a PowerPoint presentation to walk through with my supervisors. I never liked that, however, because it always felt clumsy, and I was never really getting into the guts of the analysis as much as I was just sharing the outcomes, why I did what I did, and how I did it. A template that allows for a clear presentation would have made things much easier for both myself and my supervisors.

In my last post, I used R Markdown to create a report that allows the sport scientist to share some basic data analysis with the strength and conditioning staff and other coaches. As I said in that post, R Markdown is a wonderful resource for creating reports where you can hide your code and simply show visualizations of the data and model outputs. But what if we don't want to hide our code?! In this sense, R Markdown is extremely useful for setting up a data analysis template that allows you to walk through all the steps in your project and share the results with your colleagues or PhD supervisors. Additionally, you could keep RStudio open when presenting your findings and address any changes/suggestions that people may have in real time, before "knitting" the markdown file into the final HTML or PDF document. This last part allows the analysis to come to life, letting you make direct changes and immediately show how they impact the outcome of the analysis!

Data Analysis Templates

There are a number of different data analysis frameworks one could follow. Two that immediately come to mind are the Cross Industry Standard Process for Data Mining (CRISP-DM) and the Problem, Plan, Data, Analysis, and Conclusion (PPDAC) Cycle.

Although they come from different industries — CRISP-DM from the business world and PPDAC from more of the statistics education world — there is considerable overlap, and both have the aim of providing the analyst with a clear path to answering their research question.

The objectives of each phase within these two frameworks are shown below. In brief, CRISP-DM works through Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, and Deployment, while PPDAC works through Problem, Plan, Data, Analysis, and Conclusion.

As you can see, the end goal of the analysis is different between the two frameworks: CRISP-DM being targeted at deploying a model specific to business use cases and PPDAC providing more of a runway for scientific publication. However, both can provide us with an appreciation for creating a systematic process around data analysis, allowing for a clear explanation of our approach when discussing with colleagues or PhD supervisors.

In an attempt to create something more generic and less specific to a certain industry or field, I came up with my own framework:

The framework is freely available on my GitHub page in both R Markdown and Jupyter Notebook (if you prefer Python) formats. If you'd like to see what the R Markdown HTML looks like, click here >> PWard_-_Data_Analysis_Framework.

All you have to do is take the template (either R Markdown or Jupyter Notebook), delete the comments that I have under each section and fill in your own comments and your R or Python script, where applicable, to conduct your analysis. Once complete, you will have a clean looking file that details your entire approach.

I’ve made a simple example of what using the template could look like. If you are interested in seeing the result in R Markdown, CLICK HERE >> Data_Analysis_Framework_Example_–_MLB_Hitting. If you are interested in seeing the result in a Python Jupyter Notebook, CLICK HERE >> Data Analysis Framework Example — MLB Hitting (Jupyter).

All of the code and the templates for use are available on my GitHub page.


Total Score of Athleticism — R Markdown & R Shiny

A battery of performance tests (e.g., strength, power, fitness, agility) is often used by strength and conditioning and sports science staffs to evaluate a player's current physical status. Such information can help guide future training programs aimed at improving deficiencies and enhancing performance. Having a single value that represents the athlete's overall athleticism may be useful for identifying the most well-rounded athletes in the club and may also help in communicating each player's test results to the coaching staff in a digestible manner.

Recently, Anthony Turner and colleagues published the paper, Total Score of Athleticism: Holistic Athlete Profiling to Enhance Decision-Making, in the NSCA's Strength and Conditioning Journal. While there are a number of ways one could index an athlete and represent their athleticism in a single value, this approach is simple for the practitioner to calculate, so I wanted to use it as an example of how you can create an R Markdown report and Shiny app for displaying the results. (NOTE: For those interested in doing the analysis in Excel, Anthony has a YouTube channel where he details the process. CLICK HERE.)

Calculating Total Score of Athleticism (TSA)

The TSA is derived by calculating the z-score for each test in your battery and then averaging the z-scores for each individual to produce a single value.

The z-score is calculated simply as:

z = (athlete's score – group mean) / group SD

The values for a z-score are reported with a mean of 0 and a standard deviation of 1, with scores typically ranging from -3 to 3, where ±1 SD represents ~68% of the scores, ±2 SD represents ~95% of the scores, and ±3 SD represents ~99% of the scores. In this way, negative values suggest the athlete is below average while positive values suggest the athlete is above average, relative to the group.

Some coaches or practitioners, however, may not like looking at a z-score because it can be difficult to wrap their heads around the values (though I think it is easier to see the results as a z-score when performance is represented as positive and negative) and may instead prefer to look at the scores on a 0-100 scale. To convert the z-scores to t-scores on a 0-100 scale we use this formula:

t = (z * 10) + 50

Now, instead of having positive and negative values representing above or below average athletes, a score of 50 represents average. As such, every 10 points in either direction of 50 represents one standard deviation from average. For example, 60 = 1 SD, 70 = 2 SD, and 80 = 3 SD.

R Markdown Report

R Markdown is a simple way to take your analysis and turn it into a report or document for sharing. The beauty of R Markdown is that you can choose to show your R Code (if you are sharing with other professionals or colleagues) or hide your R Code and just show the results of the analysis (if you are sharing with coaches or other staff members).

I’ll put my R code in here to walk through the steps but if you are interested in the R Markdown file, just go over to my GitHub page.

First, we need to load our required packages and make up some fake data (since I don’t have any data I can use publicly).

# Packages
library(tidyverse)
library(reshape)
library(stringr)

# Simulate data
set.seed(3344)
athlete <- as.factor(1:30)
cmj <- c(round(rnorm(n = 10, mean = 30, sd = 4), 1), round(rnorm(n = 10, mean = 24, sd = 4), 1), round(rnorm(n = 10, mean = 33, sd = 2), 1))
sprint_40 <- c(round(rnorm(n = 10, mean = 4.5, sd = .1), 2), round(rnorm(n = 10, mean = 4.9, sd = .2), 2), round(rnorm(n = 10, mean = 5, sd = .2), 2))
bench <- c(round(rnorm(n = 10, mean = 20, sd = 4), 1), round(rnorm(n = 10, mean = 12, sd = 4), 1), round(rnorm(n = 10, mean = 30, sd = 4), 1))
five_ten_five <- c(round(rnorm(n = 10, mean = 6.4, sd = .2), 2), round(rnorm(n = 10, mean = 6.7, sd = .2), 2), round(rnorm(n = 10, mean = 7.5, sd = .4), 2))
df <- data.frame(athlete, cmj, sprint_40, bench, five_ten_five)

Next we need to write 2 functions, one for calculating our z-score and one for calculating our t-score.

## z-score function
z_score <- function(x){
z = (x - mean(x, na.rm = T)) / sd(x, na.rm = T)
return(z)
}

## t-score function
t_score <- function(x){
t = (x * 10) + 50
return(t)
}


Now we are all set to calculate the z-score and t-score results for our individual athletes. Also, note that before converting to the t-score, for any test where a negative value reflects better performance (e.g., speed tests where a faster time is more favorable) you can multiply the z-score by -1. This makes it intuitively easier for those reading the report to always associate positive values with "above average" and negative values with "below average".


## calculate the z-score
df <- df %>%
  mutate_if(is.numeric, list(z = z_score))

df$sprint_40_z <- df$sprint_40_z * -1
df$five_ten_five_z <- df$five_ten_five_z * -1

## calculate the t-score
df <- df %>%
  mutate(cmj_t = t_score(cmj_z),
         sprint_40_t = t_score(sprint_40_z),
         bench_t = t_score(bench_z),
         five_ten_five_t = t_score(five_ten_five_z))

Finally, calculate the TSA z-score (TSA_z) and TSA t-score (TSA_t).


## calculate TSA_z
df$TSA_z <- apply(df[, 6:9], MARGIN = 1, FUN = mean)

## calculate TSA_t
df$TSA_t <- with(df, (TSA_z * 10) + 50)


Now that the data is prepared, we construct our report. Before plotting the TSA z-scores we need to move the data from the wide format it is currently in to a long format, which will make it easier to code the plot. We will remove the "_z" at the end of each variable to make the labels cleaner looking. We are also going to add a shaded range between -1 and 1 (you can pick whatever range makes sense in your situation). Finally, we will include an indicator value that flags the athlete as "green" when they are above average and "red" when they are below average.


# Change data from a wide to long format
df_long <- df %>%
  melt(., id = "athlete", measure.vars = c("cmj_z", "sprint_40_z", "bench_z", "five_ten_five_z"))

# remove the _z
df_long$Test <- str_sub(df_long$variable, end = -3)

# Add indicator value
df_long <- df_long %>%
  mutate(indicator = ifelse(value > 0, "above avg", "below avg"))

# plot
df_long %>%
  filter(athlete %in% c(3, 15, 22, 27)) %>%
  ggplot(aes(x = Test, y = value)) +
  geom_rect(aes(ymin = -1, ymax = 1), xmin = 0, xmax = Inf, fill = "light grey", alpha = 0.3) +
  geom_col(aes(fill = indicator), alpha = 0.8) +
  facet_wrap(~athlete) +
  scale_fill_manual(values = c("green", "red")) +
  theme_light() +
  theme(axis.text.x = element_text(face = "bold", size = 12, angle = 45, vjust = 1, hjust = 1),
        axis.text.y = element_text(face = "bold", size = 12),
        strip.background = element_rect(fill = "black"),
        strip.text = element_text(color = "white", face = "bold"),
        plot.title = element_text(size = 18),
        plot.subtitle = element_text(size = 15)) +
  labs(x = "", y = "z-score of performance") +
  ggtitle("Test Performance", subtitle = "Player Performance Standardized to the Team") +
  ylim(-3, 3)


We can also plot the entire team’s TSA z-score in a similar manner. I’ve included the TSA z-score values on the plot as well, to help with interpretation.
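
The code for that plot is in the markdown file; a minimal sketch, assuming the df with TSA_z built above, could look like this:

# team-wide TSA z-scores with the value labeled on each bar
df %>%
  mutate(indicator = ifelse(TSA_z > 0, "above avg", "below avg")) %>%
  ggplot(aes(x = reorder(athlete, TSA_z), y = TSA_z)) +
  geom_col(aes(fill = indicator), alpha = 0.8) +
  geom_text(aes(label = round(TSA_z, 2)), vjust = -0.5, size = 3) +
  scale_fill_manual(values = c("green", "red")) +
  theme_light() +
  labs(x = "athlete", y = "TSA (z-score)")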


If you would like to see what the finished markdown file looks like click here:
Total_Score_of_Athleticism_-_Report

You can certainly manipulate the code to produce different plots or add more text for the coaches to read. I've stuck with the z-score in this report because I personally don't think the t-score conveys the information in the same manner. My personal preference is to see below average as negative and above average as positive. I've added the code for the t-score plot in the markdown file so you can manipulate it if you'd like. Quickly, here is what the plot would look like (horizontal dashed line at 50, indicating average) so you can judge which one you prefer:


Shiny App

You may have noticed from the markdown report that when plotting the individual test results I only showed 4 athletes. In this way the report is rather static! It doesn’t allow us to scroll through players in an efficient manner. For that, we need something that is interactive. Enter Shiny!

Shiny is a way for us to quickly build interactive webpages in R that can be hosted directly on our computer. There is a lot of versatility in these apps. The example below is just a simple app that allows the coach to select an athlete and it will automatically change to that individual’s plot, showing their performance in all of the individual tests as well as their TSA.

We could extend this app to have a second plot to allow for player comparisons, or a table of results to allow for viewing the raw values for each of the tests. The code is provided below for you to run on your computer. I don't go into what all of the elements of the code are doing (for time's sake) but perhaps I can revisit other ways of using Shiny in future blog posts and explain more clearly how the code works.


### Total Score of Athleticism - Shiny App

# Load packages
library(tidyverse)
library(reshape)
library(stringr)
library(shiny)

# simulate data
set.seed(3344)
athlete <- as.factor(1:30)
cmj <- c(round(rnorm(n = 10, mean = 30, sd = 4), 1), round(rnorm(n = 10, mean = 24, sd = 4), 1), round(rnorm(n = 10, mean = 33, sd = 2), 1))
sprint_40 <- c(round(rnorm(n = 10, mean = 4.5, sd = .1), 2), round(rnorm(n = 10, mean = 4.9, sd = .2), 2), round(rnorm(n = 10, mean = 5, sd = .2), 2))
bench <- c(round(rnorm(n = 10, mean = 20, sd = 4), 1), round(rnorm(n = 10, mean = 12, sd = 4), 1), round(rnorm(n = 10, mean = 30, sd = 4), 1))
five_ten_five <- c(round(rnorm(n = 10, mean = 6.4, sd = .2), 2), round(rnorm(n = 10, mean = 6.7, sd = .2), 2), round(rnorm(n = 10, mean = 7.5, sd = .4), 2))
df <- data.frame(athlete, cmj, sprint_40, bench, five_ten_five)

# z-score function

z_score <- function(x){
  z = (x - mean(x, na.rm = T)) / sd(x, na.rm = T)
  return(z)
}

##### Data Pre-Processing #####
###############################

# calculate the z-score
df <- df %>%
  mutate_if(is.numeric, list(z = z_score))

df$sprint_40_z <- df$sprint_40_z * -1
df$five_ten_five_z <- df$five_ten_five_z * -1

# calculate TSA_z
df$TSA_z <- apply(df[, 6:9], MARGIN = 1, FUN = mean)

# Change data from a wide to long format
df_long <- df %>%
  melt(., id = "athlete", measure.vars = c("cmj_z", "sprint_40_z", "bench_z", "five_ten_five_z", "TSA_z"))

# remove the _z
df_long$Test <- str_sub(df_long$variable, end = -3)

# Add indicator value
df_long <- df_long %>%
  mutate(indicator = ifelse(value > 0, "above avg", "below avg"))

##### Shiny App #####
#####################

## User Interface

athlete <- as.vector(unique(df_long$athlete))

ui <- fluidPage(
  
  titlePanel("Performance Testing Results"),
  
  selectInput(
    inputId = "athlete",
    label = "athlete",
    choices = athlete,
    selected = "1"
  ),
  
  plotOutput(outputId = "tsa.plot",
             width = "60%")
)

## server

server <- function(input, output){
  
  dat <- reactive({
    dataset <- subset(df_long, athlete == input$athlete)
    dataset
  })
  
  output$tsa.plot <- renderPlot({
    d <- dat()
    
    athlete.plot <- ggplot(data = d, aes(x = Test, y = value)) +
      geom_rect(aes(ymin = -1, ymax = 1), xmin = 0, xmax = Inf, fill = "light grey", alpha = 0.3) +
      geom_col(aes(fill = indicator), alpha = 0.8) +
      scale_fill_manual(values = c("green", "red")) +
      theme_light() +
      theme(axis.text.x = element_text(face = "bold", size = 12, angle = 45, vjust = 1, hjust = 1),
            axis.text.y = element_text(face = "bold", size = 12),
            strip.background = element_rect(fill = "black"),
            strip.text = element_text(color = "white", face = "bold"),
            plot.title = element_text(size = 18),
            plot.subtitle = element_text(size = 15)) +
      labs(x = "", y = "z-score of performance") +
      ggtitle("Test Performance", subtitle = "Player Performance Standardized to the Team") +
      ylim(-3, 3)
    
    print(athlete.plot)
  })
}

## Run the app
shinyApp(ui = ui, server = server)

The website ends up looking like this:

Conclusion

There are a number of ways that one can index several performance tests and create a single number for coaches to digest. This is one simple example of an approach that is easy for practitioners to understand and act on (e.g., low values in red may require specialized approaches to training and performance development). The simplicity of this approach makes it easy to create simple reports and web apps that allow the strength coach and sports science staff to quickly share information with the coaching staff in an easy and interactive manner.

Obviously care should go into selecting the test battery as it should reflect elements that are specific to success in your given sport. Finally, the current format of the Total Score of Athleticism treats each test with equal weight. However, there may be situations where you want to place more weight on certain tests. For example, some tests may be more important for certain position groups than others. Alternatively, some tests may have a stronger association with sports performance and thus require greater weight than other tests. All of these things can be addressed in your setting by simply using the code in this blog and altering it to meet your needs.
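
As a purely hypothetical sketch of that weighting idea (the weights below are made up for illustration and should be set for your own sport and context), a weighted TSA could be computed from the z-score columns built earlier:

# hypothetical weights for each test; these must be chosen for your own context
wts <- c(cmj_z = 0.4, sprint_40_z = 0.3, bench_z = 0.1, five_ten_five_z = 0.2)

# weighted TSA: a weighted average of the z-scores instead of the simple mean
df$TSA_z_weighted <- as.numeric(as.matrix(df[, names(wts)]) %*% wts)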


References

1) Turner, A.N., et al. (2019). Total Score of Athleticism: Holistic Athlete Profiling to Enhance Decision-Making. Strength and Conditioning Journal. Epub ahead of print.

Estimating Performance

When looking at sports statistics it’s important to keep in mind that performance is a blend of both skill and luck. As such, recognizing that all athletes exhibit some level of regression to the mean helps us put into perspective that observed performance is not necessarily where that individual’s true performance lies. For example, we wouldn’t believe that a baseball player who starts the MLB season going 6 for 10 has a true .600 batting average that they will carry throughout the season. Eventually, they will regress back down to something more normal (more average). Conversely, if a player starts the season 1 for 10 we would expect them to eventually move back up to something more normal.

A goal in sports analytics is to try and estimate the true performance of a player given some observed data. One of my favorite papers on this topic is from Efron and Morris (1977), Stein’s Paradox in Statistics. The paper discusses ways of using observed outcomes to make future forecasts using the James-Stein Estimator and then a Bayes Estimation approach.

Using 2019 MLB data, I’ve put together some R code to walk through the methods proposed in the paper.

Getting Data

First, we need to load Bill Petti's baseballr package, which is a handy package for scraping MLB data. We will also load the ggplot2 package for data visualizations.

library(baseballr)
library(ggplot2)

The aim of this analysis will be to look at hitting performance (batting average) over the first 30 days of the 2019 MLB season, make a forecast of the player’s true batting average, based on the observations of those first 30 days, and then test that forecast on the next 30 days.

We will obtain two data sets, a first 30 days data set and a second 30 days data set.

### Get first 30 days of 2019 MLB and Second 30 days

dat_first30 <- daily_batter_bref(t1 = "2019-03-28", t2 = "2019-04-28")
dat_second30 <- daily_batter_bref(t1 = "2019-04-29", t2 = "2019-05-29")

## Explore the data frames

head(dat_first30)
head(dat_second30)

dim(dat_first30) # 595 x 29
dim(dat_second30) # 611 x 29

Evaluating The Data

Let's now reduce the two data frames down to the columns we need (Name, AB, and BA).

dat_first30 <- dat_first30[, c("Name", "AB", "BA")]
dat_second30 <- dat_second30[, c("Name", "AB", "BA")]


Let’s look at the number of observations (AB) we see for the players in the two data sets.


quantile(dat_first30$AB)
quantile(dat_second30$AB)

par(mfrow = c(1,2))
hist(dat_first30$AB, col = "grey", main = "First 30 AB")
rug(dat_first30$AB, col = "red", lwd = 2)
hist(dat_second30$AB, col = "grey", main = "Second 30 AB")
rug(dat_second30$AB, col = "red", lwd = 2)


We see a considerable right skew in the data, with a large number of players observing a small number of at bats over the first 30 days and a few players observing a large number of observations (the everyday players).

Let’s see what Batting Average looks like.

quantile(dat_first30$BA, na.rm = T)
quantile(dat_second30$BA, na.rm = T)

par(mfrow = c(1,2))
hist(dat_first30$BA, col = "grey", main = "First 30 BA")
rug(dat_first30$BA, col = "red", lwd = 2)
hist(dat_second30$BA, col = "grey", main = "Second 30 BA")
rug(dat_second30$BA, col = "red", lwd = 2)


We see that the median batting average for MLB players moves from .224 in the first 30 days to .234 in the second 30 days. We also see some players with a batting average of 1.0, which is probably because they batted only one time and got a hit. We also see that there are a bunch of players with a batting average of 0.

This broad range of at bats and batting average values is actually going to be interesting when we attempt to forecast future performance, as small sample sizes make it difficult to have faith in a player's true ability. This is one area in which the James-Stein Estimator and Bayes Estimation approaches may be useful, as they allow for shrinkage of the observed batting averages towards the mean. In this way, the forecast isn't overly confident about the player who went 6 for 10 and isn't overly pessimistic about the player who went 1 for 10. Additionally, for the player who never got an at bat, the forecast will suggest that the player is an average player until future data/observations can be gathered to prove otherwise.

Estimating Performance

We will use 3 approaches to forecast performance in the second 30 days:

  1. Use the batting average of the player in the first 30 days and assume that the next 30 days will be similar. This will serve as our benchmark that the other two approaches need to improve upon if we are to use them. Note, however, that this approach doesn't help us at all for players who had 0 at bats.
  2. Use a James-Stein Estimator to forecast the second 30 days by taking the observed batting averages and applying a level of shrinkage to account for some regression to the mean.
  3. Account for regression to the mean by using some sort of prior assumption of the batting average mean and standard deviation of MLB players. This is a type of Bayes approach to handling the problem and is discussed in the last section of Efron and Morris's article.

Before we start making our forecasts, I'm going to take a random sample of 50 players from the first 30 days data set. This will allow us to work with a subset of the data for the sake of the example. Additionally, rather than cleaning up the data or setting an inclusion criterion based on number of at bats, by taking a random sample I'll get a good mix of players that had no at bats, very few at bats, an average number of at bats, and a large number of at bats. This will allow us to see how well the three forecast approaches handle a large variability in observations. (Technical Note: If you are going to follow along with the r-script below, make sure you use the same set.seed() as I do to ensure reproducibility.)

## Get a sample of 50 players from the first 30

set.seed(1657)
N <- nrow(dat_first30)
samp_size <- 50
samp <- sample(x = N, size = samp_size, replace = F)

first30_samp <- dat_first30[samp, ]
head(first30_samp)

We need to locate these same players in the second 30 days data set so that we can test our forecasts.

## Find the same players in the second 30 days data set

second30_samp <- subset(dat_second30, Name %in% unique(first30_samp$Name))

nrow(second30_samp) # 37

Only 37 players from the first set of players (the initial 50) are available in the second 30 days. This could be due to a number of reasons, such as injury, getting sent down to Triple-A, getting benched for a player who was performing better, etc. In any event, we will work with these 37 players from here on out. So, we need to go into the sample from the first 30 days and find those players. Then we merge the two sample sets together.

## Subset out the 37 players in the first 30 days sample set

first30_samp <- subset(first30_samp, Name %in% second30_samp$Name)

nrow(first30_samp) # 37

## Merge the two samples together

df <- merge(first30_samp, second30_samp, by = "Name")
head(df)

The first 30 days are denoted as AB.x and BA.x while the second are AB.y and BA.y.

First 30 Day Batting Average to Forecast Second 30 Day Batting Average

Using the two columns, BA.x and BA.y, we can subtract one from the other to obtain the difference in our forecast had we simply assumed the first 30 day’s performance (BA.x) would be similar to the second (BA.y). From there, we can calculate the mean absolute error (MAE) and the root mean square error (RMSE), which we will use to compare the other two methods.

## Difference between first 30 day BA and second 30 day BA

df$Proj_Diff_Avg <- with(df, BA.x - BA.y)
mae_avg <- mean(abs(df$Proj_Diff_Avg), na.rm = T)
rmse_avg <- sqrt(mean(df$Proj_Diff_Avg^2, na.rm = T))


Using the James-Stein Estimator

The formula for the James-Stein Estimator is as follows:

JS = group_mean + C(obs_BA – group_mean)

Where:

  • group_mean = the average of all players in the first 30 days
  • obs_BA = the observed batting average for an individual player in the first 30 days
  • C = a constant that represents the shrinkage factor. C is calculated as:
    • C = 1 – ((k – 3) * σ²) / Σ(y – ȳ)²
      • k = the number of unknown means we are trying to forecast (in this case, the sample size of our first 30 days)
      • σ² = the group variance observed during the first 30 days
      • Σ(y – ȳ)² = the sum of the squared differences between each player’s observed batting average and the group average (ȳ) during the first 30 days

First we will calculate our group mean, group SD, and squared differences from the first 30 day sample.


# Calculate grand mean and sd

group_mean <- round(mean(df$BA.x, na.rm = T), 3)
group_mean # .214

group_sd <- round(sd(df$BA.x, na.rm = T), 3)
group_sd # .114

## Calculate sum of squared differences from the group average

sq.diff <- sum((df$BA.x - group_mean)^2)
sq.diff # .467

Next we calculate our shrinkage factor, C.

## Calculate shrinkage factor

k <- nrow(df)
c <- 1 - ((k - 3) * group_sd^2) / sq.diff
c # .053

Now we are ready to make a forecast of batting average for the second 30 days using the James-Stein Estimator and calculate the MAE and RMSE.

df$JS <- group_mean + c*(df$BA.x - group_mean)

df$Proj_Diff_JS <- with(df, JS - BA.y)

mae_JS <- mean(abs(df$Proj_Diff_JS), na.rm = T)
rmse_JS <- sqrt(mean(df$Proj_Diff_JS^2, na.rm = T))


Using a Prior Assumption for MLB Batting Average

In this example, the prior batting average I’m going to use will be the mean and standard deviation from the entire first 30 day data set (the original data set, not the sample). I could, of course, use historic data to build my prior assumption but I figured I’ll just start with this approach since I have the data readily accessible.

# Get a prior for BA

prior_BA_avg <- mean(dat_first30$BA, na.rm = T)
prior_BA_sd <- sd(dat_first30$BA, na.rm = T)

prior_BA_avg # .203
prior_BA_sd # .136

For our Bayes Estimator, we will use the following approach:

BE = prior_BA_avg + prior_BA_sd(obs_BA – prior_BA_avg)

df$BE <- prior_BA_avg + prior_BA_sd*(df$BA.x - prior_BA_avg)

df$Proj_Diff_BE <- with(df, BE - BA.y)
mae_BE <- mean(abs(df$Proj_Diff_BE), na.rm = T)
rmse_BE <- sqrt(mean(df$Proj_Diff_BE^2, na.rm = T))


Looking at Our Forecasts

Let’s put the MAE and RMSE of our 3 approaches into a data frame so we can see how they performed.

Comparisons <- c("First30_Avg", "James_Stein", "Bayes")
mae_grp <- c(mae_avg, mae_JS, mae_BE)
rmse_grp <- c(rmse_avg, rmse_JS, rmse_BE)

model_comps <- data.frame(Comparisons, MAE = mae_grp, RMSE = rmse_grp)
model_comps[order(model_comps$RMSE), ]

It looks like the lowest RMSE belongs to the Bayes Estimator, with the James-Stein Estimator not far behind. Both approaches outperformed using the player's first 30 day average as a naive forecast of future performance.

We can visualize the differences in our projections for the three approaches as well.
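
The plot code is on my GitHub page; a minimal sketch of a density plot like the one described below might look like this:

# density of the projection errors for the three approaches,
# with a dashed red line at 0 (a perfect projection)
library(tidyverse)

df %>%
  select(Proj_Diff_Avg, Proj_Diff_JS, Proj_Diff_BE) %>%
  pivot_longer(everything(), names_to = "method", values_to = "proj_diff") %>%
  ggplot(aes(x = proj_diff, fill = method)) +
  geom_density(alpha = 0.4) +
  geom_vline(xintercept = 0, color = "red", linetype = "dashed") +
  scale_fill_manual(values = c(Proj_Diff_Avg = "green", Proj_Diff_JS = "blue", Proj_Diff_BE = "grey"))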

We can see that making projections based off of the first 30 day average (green) is wildly spread out, and the mean value appears to over-project a player's true ability (the peak is greater than 0, the dashed red line). Conversely, the James-Stein Estimator (blue) has a pretty strong peak just below 0, meaning it may be under-projecting players and pulling some of them down too far towards the group average. Finally, the Bayes Estimator (grey) resides in the middle of the two projections, with its peak just above 0.

The last thing I’ll do is put some confidence intervals around the Bayes Estimator for each player and take a look at how the sample size influences the forecast and where the observed batting average during the second 30 days was in relationship to that forecast.

df$Bayes_SE <- with(df, sqrt((BE * (1 - BE)) / AB.x))
df$Low_CI <- df$BE - 1.96 * df$Bayes_SE
df$High_CI <- df$BE + 1.96 * df$Bayes_SE

We can then plot the results. (Technical Note: To keep the plot from being too busy, I only plotted the first 15 rows of the data set.)

ggplot(df[1:15, ], aes(x = reorder(Name, AB.x), y = BE)) +
	geom_point(color = "blue") +
	geom_point(aes(y = BA.y), color = "red") +
	geom_errorbar(aes(ymin = Low_CI, ymax = High_CI), width = 0) +
	geom_text(aes(label = paste(AB.x, "ABs", sep = " ")), vjust = -1, size = 3) +
	geom_hline(aes(yintercept = 0), linetype = "dashed") +	
	coord_flip() +
	theme_classic() +
	ggtitle("Bayes Estimator", 
		subtitle = "Blue = Bayes Estimation \nRed = Observed BA during second 30 day \nLabeled ABs = Individual Sample Size the Estimation was built on (First 30 days ABs)")


The at bats labelled in the plot are specific to the first 30 days of data, as they represent the number of observations each individual's forecast was built on. We can see that when we have more observations, the forecast does a better job of identifying the player's true performance based on their first 30 days. Rizzo and Iglesias were the two that really beat their forecast in the second 30 days of the 2019 season (Rizzo in particular). The guys at the bottom (Butera, Wynns, Duplantier, and Fedde) are much harder to forecast given they had so few observations in the first 30 days. In the second 30 days, Duplantier only had 1 AB while Butera and Fedde only had 2.

Wrapping Up

The aim of this post was to work through the approaches to forecasting performance used in a 1977 paper from Efron and Morris. Dealing with small samples is a problem in sport, and what we saw was that if we naively use a player's average performance over a small number of observations, we may miss the boat on their underlying true potential due to regression to the mean. As such, things like the James-Stein Estimator or Bayes Estimator can help us obtain better estimates by using a prior assumption about the average player in the population.

There are other ways to handle this problem, of course. For example, an Empirical Bayes Approach could be used by assuming a beta distribution for our data and making our forecasts from there. Finally, alternative approaches to modeling could account for different variables that might influence a player during the first 30 days (injury, park factors, strength of opponent, etc). However, the simple approaches presented by Efron and Morris are a nice start.
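
As a rough sketch of that beta-distribution idea (this is not code from the original analysis), one could fit a beta prior to the league-wide batting averages by the method of moments and then shrink each player's observed average toward it:

# method-of-moments beta prior from the league-wide first 30 day batting averages
m <- mean(dat_first30$BA, na.rm = TRUE)
v <- var(dat_first30$BA, na.rm = TRUE)
alpha0 <- m * (m * (1 - m) / v - 1)
beta0 <- (1 - m) * (m * (1 - m) / v - 1)

# empirical Bayes estimate: shrink each player's hits/at-bats toward the prior
# (hits are recovered as BA.x * AB.x, since the data frame only carries the average)
df$EB <- (df$BA.x * df$AB.x + alpha0) / (df$AB.x + alpha0 + beta0)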

References

  1. Efron, B., Morris, CN. (1977). Stein’s Paradox in Statistics. Scientific American, 236(5): 119-127.