Category Archives: Sports Analytics

If the examples we see in textbooks don’t represent the real world, what about the sports science papers we read?

I enjoyed this short article by Andrew Gelman on the blog he keeps with several of his colleagues. The main point, which I agree 100% with, is that the examples in our stats and data analysis textbooks never seem to match what we see in the real world. The examples always seem to work! The data is clean and looks perfectly manicured for the analysis. I get it! The idea is to convey the concept of how different analytical approaches work. The rub is that once you get to the real world and look at your data you end up being like, “Uh. Wait….what is this? What do I do now?!”

The blog got me thinking about something else, though. Something that really frustrates me. If the examples we see in textbooks don’t reflect the data problems we face in the real world, what about the examples we read about in applied sport science research? How much do those examples reflect what we see in the real world?

At the risk of upsetting some colleagues, I’ll go ahead and say it:

I’m not convinced that the research we read in publications completely represents the real world either!

How can that be? This is applied science! Isn’t the real world THE research?

Well, yes and no. Yes, the data was collected in the real world, with real athletes, and in a sport setting. But often, reading a paper, looking at the aim and the conclusions, and then parsing through the methods section to see how they handled the data leaves me scratching my head.

My entire day revolves around looking at data, and I can tell you: the real world is very messy.

  • How things get collected
  • How things get saved
  • How things get loaded
  • How things get logged

There are potential hiccups all along the way, no matter how stringent you are in trying to keep sound data collection practices!

In research, the problem is that the data is often touched up in some manner to create an analysis sufficient for publication. Missing values need to be handled a certain way (sometimes those rows get dropped, sometimes values get imputed), class imbalance can be an issue, errant values and outliers from technology flub-ups are a very real thing, data entry issues arise, etc. These things are all problematic and, if not identified prior to analysis, can be a major issue with the findings. I do believe that most people recognize these problems and would agree that they are very real issues. However, it is less about knowing that the problems exist and more about figuring out what to do about them. Often, the methods section glosses over these details (I get it, word counts for journals can be a pain in the butt) and simply produces a result that, on paper at least, seems overly optimistic. As I read through the results section, without details about data processing, I frequently say to myself, “No way. There is no way this effect is as real as they report. I can’t reproduce this without knowing how they cleaned up their data to observe this effect.”

Maybe someone should post a paper about how crappy their data is in the applied setting? Maybe we should just be more transparent about the data cleaning processes we go through so that we aren’t overly bullish on our findings and more realistic about the things we can say with confidence in the applied setting?

Does anyone else feel this way? Maybe I’m wrong and I’m being pessimistic, and this isn’t as big of an issue as I believe it to be? Is the data we see in publication truly representative of the real world?

Bayesian Updating of Reference Ranges for Serial Measurements

Introduction

The collection of serial measurements on athletes across a season (or multiple seasons) is one of the more common types of data being generated in the applied sport science environment. The question that coaches and practitioners often have is, “Is this player outside of their ‘normal range’?”

The best approach for establishing a reference range of ‘normal’ values is a frequently discussed topic in sport science. One common strategy is to use z-scores and represent the reference range as 1 standard deviation above or below the mean (Figure A) or, for practitioners who might have a difficult time understanding standardized scores, plot the raw values and set the reference range 1 standard deviation above or below the raw mean (Figure B). Of course, the mean and standard deviation are then computed from all prior values. As such, if the athletes go through a training phase with substantially higher values than other phases (e.g., training camp), it could skew your reference ranges. To alleviate this issue, some choose to use a rolling mean and standard deviation, representing the normal range of values relative to more recent training sessions (Figure C).
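As a rough sketch of the rolling approach (Figure C), assuming a hypothetical data frame named athlete_data with one value per session, ordered by date (the 10-session window is arbitrary):

## rolling 10-session reference range (sketch; athlete_data is hypothetical)
library(tidyverse)
library(zoo)

athlete_data %>%
  mutate(roll_mean = zoo::rollapplyr(value, width = 10, FUN = mean, fill = NA),
         roll_sd = zoo::rollapplyr(value, width = 10, FUN = sd, fill = NA),
         lower = roll_mean - roll_sd,
         upper = roll_mean + roll_sd)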

A problem with the approaches above is that they require a number of training sessions before a mean and standard deviation can be determined for the individual athlete. One solution to this issue is to base our initial normal reference ranges on prior knowledge from collecting data on players in previous seasons (or prior knowledge from research papers, if we don’t have data of our own yet). This type of Bayesian updating approach has been applied in WADA’s drug testing practices [1]. More recently, Hecksteden et al. used this approach to evaluate the CK levels of team-sport athletes in both fatigued and non-fatigued states [2].

The mathematics of the approach is presented in the paper but might look intimidating to those not used to reading equations in this form.
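For reference, the updating boils down to the standard conjugate-normal rules (written here in my own notation to match the R functions later in this post, with x as the newly observed log value and n as the number of tests observed so far):

$$\mu_{post} = \frac{\sigma_{RM}^2 \, \mu_{prior} + \tau^2 \, x}{\sigma_{RM}^2 + \tau^2} \qquad\qquad \sigma_{post}^2 = \frac{1}{n / \sigma_{RM}^2 + 1 / \tau^2}$$

After each test, the posterior mean and standard deviation become the prior for the next test.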

The authors provided a nice Excel sheet where you can input your own data and get the updated reference ranges. However, the sheet is protected, which doesn’t afford the opportunity of seeing how the underlying equations work, and you can’t alter the sheet to make it appropriate for your data (for example, the sheet automatically log transforms the raw data). Thus, I’ve decided to code the analysis out, both in Excel and R, to help practitioners looking to adopt this approach.

Setting Priors

To apply this type of analysis, we first need to establish prior values for three parameters: the prior mean (mu), the prior standard deviation (tau), and the prior repeated-measures standard deviation (sigmaRM). These values represent our current knowledge of the variable we are measuring before seeing any new data. As new data is collected, we update these priors to get an individual (posterior) estimate for the athlete. I’ll use the priors set by Hecksteden and colleagues for the CK levels of male athletes:

  • mu = 5.527
  • tau = 0.661
  • sigmaRM = 0.504

Once we have established our prior parameters, we are ready to update them, using the math equations above, as new data comes in.

Bayesian Updating in Excel

The Excel sheet is available on my GitHub page. It looks like this:

All of the heavy lifting occurs in the two columns under the header Bayesian Updating (Log Scale). The equation in the first row (Test 1) is different from the others below it because it requires the prior information to get going. After that first test, the updated data become the prior for the next test, and this continues for all tests going forward. You can download the Excel sheet and see how the equations work, so I won’t go through them here. Instead, I’ll show them more clearly in the R script, below.

Bayesian Updating in R

We first need to convert the math equations provided in the paper (posted above) into R code. Rather than leaving things in mathematical notation, I’ll plug in the variables in plain English. To be clear, here are the definitions for the variables used in the code below:

  • prior_mu – the prior mean (mu), on the log scale
  • prior_sd – the prior between-athlete standard deviation (tau)
  • prior_repeated_measure_sd – the prior repeated-measures, within-athlete standard deviation (sigmaRM)
  • obs_value – the newly observed (log-transformed) test value
  • test_num – the number of the current test

Now that we know the variables we need for each equation, we can begin the process of updating our reference ranges.

First create a data set of the test observations and their log values. This will be the same data we observed in our Excel sheet:
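The exact values from the sheet aren’t reproduced here, so here is a minimal sketch with made-up CK values (swap in your own observations; the rest of the code runs the same):

## hypothetical raw CK values (U/L); replace with the observations from your sheet
library(tidyverse)

df <- tibble(test = 1:5,
             value = c(345, 229, 288, 512, 268)) %>%
  mutate(ln_value = log(value))

df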

Then we set our priors (in log format):

## priors
prior_mu <- 5.527
prior_sd <- 0.661
prior_repeated_measure_sd <- 0.504

 

We will start by seeing how the updating works for the mean and standard deviation parameters after the first test. To do this, we will create a function for each parameter (mean and standard deviation) that updates the priors with the observed values based on the above equations:

 

## Posterior mean: variance-weighted combination of the prior mean and the new observation
posterior_mu <- function(prior_mu, prior_sd, prior_repeated_measure_sd, obs_value){
  
  numerator <- prior_repeated_measure_sd^2 * prior_mu + prior_sd^2 * obs_value
  denominator <- prior_repeated_measure_sd^2 + prior_sd^2
  
  post_mu <- numerator / denominator
  return(post_mu)
  
}

## Posterior SD: combine the precision from the repeated measures with the prior precision
## (test_num - 1 + 1 is kept as written to mirror the paper; it simplifies to test_num)
posterior_sd <- function(prior_repeated_measure_sd, prior_sd, test_num){
  
  post_var <- 1 / ((test_num - 1 + 1) * 1/prior_repeated_measure_sd^2 + 1/prior_sd^2) 
  post_sd <- sqrt(post_var)
  return(post_sd)
  
}

After running the functions on the observations of our first test, our updated mean and standard deviation are:
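With the example data from above, the calls look like this:

## updated parameters after the first test
posterior_mu(prior_mu, prior_sd, prior_repeated_measure_sd, obs_value = df$ln_value[1])
posterior_sd(prior_repeated_measure_sd, prior_sd, test_num = 1)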


Notice that we obtain the same values that we see following test_1 in our Excel workbook. We can also calculate 95% confidence intervals and take the exponent (since the data is on a log scale) to get the individual athlete’s updated reference range on the raw scale:
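For instance (again a sketch built on the example data):

## 95% reference range on the raw scale after the first test
test1_mu <- posterior_mu(prior_mu, prior_sd, prior_repeated_measure_sd, obs_value = df$ln_value[1])
test1_sd <- posterior_sd(prior_repeated_measure_sd, prior_sd, test_num = 1)

exp(c(low95 = test1_mu - 1.96 * test1_sd, high95 = test1_mu + 1.96 * test1_sd))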

Again, these results confirm the values we see in our Excel workbook.

That’s cool and all, but we need to be able to iteratively update the data set as new data comes in. Let’s write a for() loop!

First, we create a new column in the data that provides us with the updated standard deviation after observing the results in each test. This is a necessary first step as we will use this value to then update the mean value.

 

## Calculate the updated SD based on sample size
df2 <- df %>%
  mutate(bayes_sd = sqrt(1 / ((test - 1 + 1) * 1 / prior_repeated_measure_sd^2 + 1 / prior_sd^2))) 

df2

Next, we need a for() loop. This is a bit tricky because test_1 is updating based solely on the priors, while all other tests (test_2 to test_N) update based on the mean and standard deviation in the row above them. Thus, we need our for() loop to look back at the previous row once those values are calculated. This sort of thing is easy to set up in Excel, but in R (or Python) we need to think about how to code our row indexes. I covered this sort of iterative row computing in two previous articles HERE and HERE.

Within the loop we first set up vectors for the prior variance (sd^2), the denominator in our equation, and the log transformed observations from our data set. Then, we calculate the updated posterior for the mean (mu) on each pass through the loop, each time using the value preceding it in the vector, [i – 1], to allow us to iteratively update the data.

Once we run the loop, we add the results to our data set (removing the first observation in the vector since that was the original prior before seeing any data):

 

# Create a vector to store results
N <- length(df2$ln_value) + 1
bayes_mu <- c(prior_mu, rep(NA, N - 1))


## For loop
for(i in 2:N){
  
  ## Set up vectors for the variance, denominator, and newly observed values
  prior_var <- c(prior_sd^2, df2$bayes_sd^2)
  denominator <- prior_repeated_measure_sd^2 + prior_var
  vals <- df2$ln_value
  
  ## calculate bayesian updated mu
  bayes_mu[i] <- (prior_repeated_measure_sd^2 * bayes_mu[i-1] + prior_var[i-1] * vals[i-1]) / denominator[i-1]
    
}

df2$bayes_mean <- bayes_mu[-1]
df2

The two columns, bayes_sd and bayes_mean, contain our updated prior values, and they are the exact same results we obtained in our Excel workbook.

To use these updated parameters for creating individual athlete reference ranges, we calculate the 95% Confidence Intervals:

NOTE: I added a row at the start of data frame to establish the priors, before seeing the data, so that they could also be plotted as part of the reference ranges.

### Confidence Intervals
first_prior <- data.frame(test = 0, value = NA, ln_value = NA, bayes_sd = prior_sd, bayes_mean = prior_mu)

df2 <- df2 %>%
  bind_rows(first_prior) %>%
  arrange(test)

## Exponentiate back to get the reference range
df2$low95 <- exp(df2$bayes_mean - 1.96*df2$bayes_sd)
df2$high95 <- exp(df2$bayes_mean + 1.96*df2$bayes_sd)
df2

Finally, we plot the observations along with the continually updated reference ranges. You can clearly see how large the normal range is before seeing any data (test_0) and then how quickly this range begins to shrink once we start observing data from the individual.

 

To access the R code and the excel workbook please visit my GitHub page.

References

  1. Sottas PE, et al. (2007). Bayesian detection of abnormal values in longitudinal biomarkers with application to T/E ratio. Biostatistics; 8(2): 285-296.
  2. Hecksteden A, et al. (2017). A new method to individualize monitoring of muscle recovery in athletes. Int J Sport Phys Perf; 12: 1137-1142.

Force Decks – Force Plate Shiny Dashboard

Last week, two of the data scientists at Vald Performance, Josh Ruddy and Nick Murray, put out a free online tutorial on how to create force plate reports using R with data from their Force Decks software.

It was a nice tutorial giving an overview of some of the power behind ggplot2 and the suite of packages that come with the tidyverse. Since they made the data available (in the link above), I decided to pull it down and put together a quick shiny app for those who might be interested in extending the report to an interactive web app.

This isn’t the first time I’ve built a shiny app for the blog using force plate data. Interested readers might want to check out my post from a year ago where I built a shiny interactive report for force-velocity profiling.

You can watch a short preview of the end product in the video link below, and the screenshots beneath it show a static view of what the final shiny app will look like.

A few key features:

  1. The app always defaults to the most recent testing day on the testDay tab.
  2. The user can select the position group at the top and that position group will be maintained across all tabs. For example, if you select Forwards, when you switch between tabs one and two, forwards will always be there.
  3. The time series plots on the Player Time Series tab are done using plotly, so they are interactive, allowing the user to hover over each test session and see the change from week to week in the tooltip. When the change exceeds the meaningful change, the point turns red. Finally, because it is plotly, the user can slice out specific dates to look at (as you can see me do in the video example), which comes in handy when there are a large number of tests over time.

All code and data is accessible through my GitHub page.


Loading and preparing the data

  • I load the data in using read.csv() and file.choose(), so navigate to wherever you have the data on your computer and select it.
  • There is some light cleaning to change the date into a date variable. Additionally, there were no player positions in the original data set, so I just made some up and joined them in.

### packages ------------------------------------------------------------------
library(tidyverse)
library(lubridate)
library(psych)
library(shiny)
library(plotly)

theme_set(theme_light())

### load & clean data ---------------------------------------------------------
cmj <- read.csv(file.choose(), header = TRUE) %>%
  janitor::clean_names() %>%
  mutate(date = dmy(date))

player_positions <- data.frame(name = unique(cmj$name),
                               position = c(rep("Forwards", times = 15),
                                            rep("Mids", times = 15),
                                            rep("Backs", times = 15)))

# join position data with jump data
cmj <- cmj %>%
  inner_join(player_positions)

 

Determining Typical Error and Meaningful Change

  • In this example, I’ll just pretend as if the first 2 sessions represented our test-retest data and I’ll work from there.
  • The typical error of measurement (TEM) was calculated as the standard deviation of the differences between tests 1 and 2, divided by the square root of 2.
  • For the meaningful change, instead of using 0.2 (the commonly used smallest worthwhile change multiplier), I decided to use a moderate change (0.6), since 0.2 is such a small fraction of the between-subject SD.
  • For info on these two values, I covered them in a blog post last week using Python and a paper Anthony Turner and colleagues wrote.

change_standards <- cmj %>%
  group_by(name) %>%
  mutate(test_id = row_number()) %>%
  filter(test_id < 3) %>%
  select(name, test_id, rel_con_peak_power) %>%
  pivot_wider(names_from = test_id,
              names_prefix = "test_",
              values_from = rel_con_peak_power) %>%
  mutate(diff = test_2 - test_1) %>%
  ungroup() %>%
  summarize(TEM = sd(diff) / sqrt(2),
            moderate_change = 0.6 * sd(c(test_1, test_2)))

Building the Shiny App

  • In the user interface, I first create my sidebar panel, allowing the user to select the position group of interest. You’ll notice that this sidebar panel is not within the tab panels, which is why it stands alone and allows us to select a position group that will be retained across all tabs.
  • Next, I set up 2 tabs. Notice that in the first tab (testDay) I include a select input, to allow the user to select the date of interest. In the selected argument I tell shiny to always select the max(cmj$date) so that the most recent session is always shown to the user.
  • The server is pretty straightforward. I commented where each tab’s data is built. Basically, it just takes the user-specified information, performs simple data filtering, and then builds ggplot2 charts to provide the relevant information.
  • On the testDay plot, we use the meaningful change to shade the region around 0 in grey and we use the TEM around the athlete’s observed performance on a given day to specify the amount of error that we might expect for the test.
  • On the Player Time Series plot we have the athlete’s average line and ±1 SD lines to accompany their data, with points changing color when the week-to-week change exceeds our meaningful change.
### Shiny App -----------------------------------------------------------------------------

## Set up user interface

ui <- fluidPage(
  
  ## set title of the app
  titlePanel("Team CMJ Analysis"),
  
  ## create a selection bar for position group that works across all tabs
  sidebarPanel(
    selectInput(inputId = "position",
                label = "Select Position Group:",
                choices = unique(cmj$position),
                selected = "Backs",
                multiple = FALSE),
    width = 2
  ),
  
  ## set up 2 tabs: One for team daily analysis and one for player time series
  tabsetPanel(
    
    tabPanel(title = "testDay",
             
             selectInput(inputId = "date",
                         label = "Select Date:",
                         choices = unique(cmj$date)[-1],
                         selected = max(cmj$date),
                         multiple = FALSE),
             
             mainPanel(plotOutput(outputId = "day_plt", width = "100%", height = "650px"),
                       width = 12)),
    
    tabPanel(title = "Player Time Series",
             
             mainPanel(plotlyOutput(outputId = "player_plt", width = "100%", height = "700px"),
                       width = 12))
  )
  
)


server <- function(input, output){
  
  ##### Day plot tab ####
  ## day plot data
  day_dat <- reactive({
    
    d <- cmj %>%
      group_by(name) %>%
      mutate(change_power = rel_con_peak_power - lag(rel_con_peak_power)) %>%
      filter(date == input$date,
             position == input$position)
    
    d
    
  })
  
  ## day plot
  output$day_plt <- renderPlot({ day_dat() %>%
      ggplot(aes(x = reorder(name, change_power), y = change_power)) +
      geom_rect(aes(ymin = -change_standards$moderate_change, ymax = change_standards$moderate_change),
                xmin = 0,
                xmax = Inf,
                fill = "light grey",
                alpha = 0.6) +
      geom_hline(yintercept = 0) +
      geom_point(size = 4) +
      geom_errorbar(aes(ymin = change_power - change_standards$TEM, ymax = change_power + change_standards$TEM),
                    width = 0.2,
                    size = 1.2) +
      theme(axis.text.x = element_text(angle = 60, vjust = 1, hjust = 1),
            axis.text = element_text(size = 16, face = "bold"),
            axis.title = element_text(size = 18, face = "bold"),
            plot.title = element_text(size = 22)) +
      labs(x = NULL,
           y = "Weekly Change",
           title = "Week-to-Week Change in Realtive Concentric Peak Power")
    
  })
  
  ##### Player plot tab ####
  ## player plot data
  
  player_dat <- reactive({
    
    d <- cmj %>%
      group_by(name) %>%
      mutate(avg = mean(rel_con_peak_power),
             sd = sd(rel_con_peak_power),
             change = rel_con_peak_power - lag(rel_con_peak_power),
             change_flag = ifelse(change >= change_standards$moderate_change | change <= -change_standards$moderate_change, "Flag", "No Flag")) %>%
      filter(position == input$position)
    
    d
  })
  
  ## player plot
  output$player_plt <- renderPlotly({
    
    plt <- player_dat() %>%
      ggplot(aes(x = date, y = rel_con_peak_power, label = change)) +
      geom_rect(aes(ymin = avg - sd, ymax = avg + sd),
                xmin = 0,
                xmax = Inf,
                fill = "light grey",
                alpha = 0.6) +
      geom_hline(aes(yintercept = avg - sd),
                 color = "black",
                 linetype = "dashed",
                 size = 1.2) +
      geom_hline(aes(yintercept = avg + sd),
                 color = "black",
                 linetype = "dashed",
                 size = 1.2) +
      geom_hline(aes(yintercept = avg), size = 1) +
      geom_line(size = 1) +
      geom_point(shape = 21,
                 size = 3,
                 aes(fill = change_flag)) +
      facet_wrap(~name) +
      scale_fill_manual(values = c("red", "black", "black")) +
      theme(axis.text = element_text(size = 13, face = "bold"),
            axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1),
            plot.title = element_text(size = 18),
            strip.background = element_rect(fill = "black"),
            strip.text = element_text(size = 13, face = "bold"),
            legend.position = "none") +
      labs(x = NULL,
           y = NULL,
           title = "Relative Concentric Peak Power")
    
    ggplotly(plt)
    
  })
  
  
}



shinyApp(ui, server)

TidyTuesday — Powerlifting Performance & Age

TidyTuesday is a really neat project where every week a new data set is provided (for free) and anyone can download the data and share their findings. The basic idea is to get people to trade ideas on how to arrange, summarize, and visualize data within R (primarily using the suite of data science packages that make up the tidyverse).

I’ve enjoyed seeing what people share on Twitter and my friend Ellis Hughes suggested that I join in the fun. As such, I found a data set from an earlier week that was sports related (to keep the analysis relevant with the theme of my blog).

The data set comes from the TidyTuesday on 10/8/2019 (free to download HERE). Briefly, the data set contains outcomes from International Powerlifting Federation (IPF) competitions from 1973 up through 2019. Each row represents an individual athlete’s best lift in the squat, bench press, and deadlift for a given competition. In total, the data set contains 38,244 rows and 15,996 unique lifters. (NOTE: There is a much larger data set linked on the GitHub page, but I did not use that one.)

I’ll use the Data Analysis Template I discussed in a previous blog article. The only difference between the template from the prior article and the approach I’ll take here is that I have no prior knowledge of the data set. The template works well when we have a specific question to answer as it helps to guide the process from data collection to analysis. However, in this case, as is sometimes common in the real world, people may provide you with a data set without a specific question. As such, some level of data exploration is required to understand the data set and what type of questions may be interesting. Therefore, I’ll begin with just familiarizing myself with the data before developing a question I may want to answer.

Loading Data & Cleaning Data

  • Read in the data from the TidyTuesday GitHub page.
  • Notice that I added a cleaning step when reading in the data: I filter out any age class of 5-12 and I also remove any NA values in the age column (which happened because sometimes exact age wasn’t recorded). I added this step when importing the data, after I had worked through my analysis, because I felt it was better to do this right away and save space in the code.
  • In the second step, I ordered the data set by athlete name and date of competition.
  • Finally, I created a long format of the data frame (since it is originally in a wide format) to assist with building data visualizations, and I removed any NAs present in the data set (e.g., if a lifter bombs out on their squat in a competition then they have no value for the squat).


df <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-10-08/ipf_lifts.csv") %>%
  filter(age_class != "5-12", !is.na(age))

# order the data by lifter and date
df <- df %>%
  arrange(name, date)

# create a long format of the data
df_long <- df %>%
  reshape2::melt(., id = c("name", "date", "age", "age_class", "weight_class_kg", "sex"), measure.vars = c("best3squat_kg", "best3bench_kg", "best3deadlift_kg")) %>%
  na.omit()

Data Exploration

Since I don’t really know anything about the data set provided, it is hard to have a question to answer. Thus, I created some basic plots to help orient myself to the data we are working with.

First, I wanted to see the athletes who have competed in the most competitions in this data set:
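A quick way to get at this (each row in the data is one lifter-competition, so a simple count works; this is a sketch of the summary behind the plot):

df %>%
  count(name, sort = TRUE) %>%
  slice_head(n = 10)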

I know that lifters in the IPF have a choice of wearing different types of lifting equipment so I wanted to see what sort of competition gear the athletes in this data set wore:

I was curious about the age class and the actual age at which athletes, on average, achieve their best lift:

We can also look at this by male and female:

Finally, I wanted to explore the distribution of powerlifting totals between men and women:
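A sketch of that comparison, assuming the tidyverse is loaded and computing each total as the sum of the three best lifts (my assumption, since I work from the three individual lift columns):

df %>%
  mutate(total = best3squat_kg + best3bench_kg + best3deadlift_kg) %>%
  filter(!is.na(total)) %>%
  ggplot(aes(x = total, fill = sex)) +
  geom_density(alpha = 0.5) +
  labs(x = "Powerlifting Total (kg)", y = "Density")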

Research Question

After exploring the data a little bit, some things stand out:

1) The data set contains primarily lifters wearing single-ply lifting gear.

2) The boxplots use the ‘age_class’ variable, so everyone within an ‘age_class’ is treated the same and the age bins appear to be rather large (e.g., 24 – 34). I prefer not to think of age data this way since such large groupings can have a lot of variability within them.

3) Looking at the dot plots, which reflect age as a continuous variable, athletes tend to peak in all three of the lifts around their early 30’s.

4) The trend for peaking in performance seems to be consistent between men and women (which is interesting given that I would have suspected women to peak later, since they might be less inclined to take up serious weight training until later in life, whereas males tend to start lifting around their high school years).

5) The distribution of powerlifting totals appears to be relatively normally distributed for both men and women, with more variability in the distribution for men than women.

The beauty of graphing your data is that it often reveals underlying patterns that help you get a sense for what is going on. It is instances like this where a statistical model can serve as a gut check to confirm what you can already clearly see.

In looking at the data, the two questions I’ll explore are:

1) At what age do powerlifters peak for the 3 competition lifts?

2) How many competitions do lifters compete in until they finally total elite?

I’ll keep these rather simple and brief, as a means of sharing some ideas. These models can (and should) be more thorough and account for things like sex (in the aging curve model, for example) and other variables that may be relevant to how powerlifters progress across their career. What is presented below is just a simple jumping-off point for where I might begin when working with data like this to answer a question before extending the model (for example, creating a mixed model to account for individual lifters).

Models

Powerlifter Aging Curve

To develop a simple aging curve model I built a polynomial regression for each of the 3 lifts (again, to keep things simple, I did not include sex in these models). Before building the models, we noticed from our data exploration that most of the lifters in this data set are single-ply lifters, so I’m going to limit the analysis to them, since changing competition gear can influence performance (I’m not going to get into the philosophical debate about which one is “better” than the other; I’ll leave that to the lifters). Additionally, since I’m interested in how lifters perform across their career and when they tend to “peak”, I’m going to limit my analysis to only those lifters who have competed in at least 10 competitions. After cleaning up the data specific to the above inclusion criteria, we are left with 6,169 rows of data and 426 unique athletes.


# Data clean up for aging curve model
sply <- df %>%
  filter(equipment == "Single-ply") %>%
  group_by(name) %>%
  filter(n() >= 10)

nrow(sply)
nrow(distinct(sply, name))

# 6169
# 426 athletes

 

Now that the data is in the format we’d like, we can build some simple models for each of the three lifts:


squat_age_fit <- lm(best3squat_kg ~ age + I(age^2), data = sply)
bench_age_fit <- lm(best3bench_kg ~ age + I(age^2), data = sply)
deadlift_age_fit <- lm(best3deadlift_kg ~ age + I(age^2), data = sply)

The summary of the three models can be found on my GitHub page. Here is an example of the squat model output:

We see that the coefficient for age is positive while the coefficient for the polynomial term (age^2) is negative. This shouldn’t come as a surprise given that we observed an upside-down “U” in our plots during the data exploration phase of our analysis. We can use these two coefficients to calculate the peak age from our regression equation. I’ve written a function to do that:


## peak of a quadratic y = b0 + b1*x + b2*x^2 occurs at x = -b1 / (2 * b2)
peak_age <- function(coef1, coef2){
  -(coef1) / (2 * (coef2))
}

 

By supplying the custom function with the two coefficients (age and age^2), we can obtain the peak age from each of our models:
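For example:

peak_age(coef1 = coef(squat_age_fit)["age"], coef2 = coef(squat_age_fit)["I(age^2)"])
peak_age(coef1 = coef(bench_age_fit)["age"], coef2 = coef(bench_age_fit)["I(age^2)"])
peak_age(coef1 = coef(deadlift_age_fit)["age"], coef2 = coef(deadlift_age_fit)["I(age^2)"])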


Just as suggested in our data visualizations, the peak age is around the early to mid 30’s with the squat peaking earlier and the bench press peaking later. As an example, we can plot the actual data along with a prediction line and 95% Confidence Interval for the bench press, where the peak age is around 36 years old:
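A sketch of how such a plot can be built, letting geom_smooth() refit the quadratic with its default 95% confidence band:

sply %>%
  ggplot(aes(x = age, y = best3bench_kg)) +
  geom_point(alpha = 0.2) +
  geom_smooth(method = "lm", formula = y ~ x + I(x^2)) +
  labs(x = "Age", y = "Best Bench Press (kg)")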

Number of Competitions Until Totaling Elite

To try and answer this question I built a simple time-to-event (survival) model. In this case, the event of interest is the individual achieving an elite total, coded as 1, and any competition where they do not achieve an elite total is coded as 0. I’m only calculating time to the first elite total for each lifter, so some lifters achieve elite status while others never do (the latter are censored).

I wasn’t sure where to obtain elite total criteria, so I used the criteria I found on THIS WEBSITE. However, I’m not certain that these criteria carry over to single-ply lifters (i.e., perhaps they are only specific to raw lifters?). I also wasn’t able to locate elite total criteria for female lifters, so the analysis below is specific to male lifters. Finally, not all of the weight classes observed in the data were available on the referenced website. So, this analysis is far from perfect given the data, but it will suffice for a simple example.

After adding in the elite total criteria and removing the athletes who were not in a weight class covered by the criteria presented on the website, I was left with 6,074 male lifters, of which 22% (1,335) achieved an elite total during their career:

In looking at the number of competitions until a lifter totals elite (plot below), it appears that many of them achieve that status in their first competition. This makes me skeptical of the data, as I feel like most lifters would require a number of competitions to achieve an elite total. This may be a function of either (a) the subset of data provided by TidyTuesday or (b) my using the wrong elite total criteria for single-ply lifters.

The data was fit with a Kaplan-Meier curve in order to create a simple model and a nice visual of the data. Below is the summary table produced from the model, followed by the time-to-event curve (the event being an elite total).
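For those following along in code, the fit itself is only a couple of lines with the survival package (a sketch; elite_df, comps_to_elite, and elite are hypothetical names for the prepared data, the competition count, and the 0/1 event flag):

library(survival)

## Kaplan-Meier fit: time = competitions until first elite total (or censoring),
## event = 1 if an elite total was achieved, 0 if censored
km_fit <- survfit(Surv(comps_to_elite, elite) ~ 1, data = elite_df)
summary(km_fit)

plot(km_fit, xlab = "Number of Competitions", ylab = "Proportion Yet to Total Elite")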

 

Conclusions

The TidyTuesday project is a great way to get access to data sets and share ideas. This was a fun one to do given that it is specific to sport, and I had the opportunity to try a few different models while also showing different ways of graphing the data. Finally, there are a bunch of different coding approaches I used to clean up the data, which you can check out on my GitHub page.

Data Analysis Template in R Markdown & Jupyter Notebook

The nice thing about working on a team with other analysts, working as part of a research group, or working on your PhD is the ability to share analysis with other colleagues, get feedback, and learn new ways of thinking about things.

Interestingly, when I’ve asked colleagues at some teams how they share their analysis with their group, they often say that “people do their analysis and just present the results”. I think this is a big miss in terms of transparency in the research process, sharing so that others can learn or provide constructive feedback, and walking through the steps you went through (data retrieval, data cleaning, analysis, model testing, etc.) to ensure that things make sense to the group before being shared with the end user.

For the PhD student, a more streamlined approach to the entire analysis can help them talk through what they did with their advisors, ensure that all the correct steps were taken during the analysis, and have greater confidence about what their data is and is not saying (which can really come in handy when it is time to defend the thesis!). When I was doing my PhD I would often try to put all my steps into a PowerPoint presentation to walk through with my supervisors. I never liked that, however, because it always felt clumsy; I was never really getting to the guts of the analysis as much as I was sharing the outcomes and talking through what I did and why. A template that allows for a clear presentation would have made things much easier for both myself and my supervisors.

In my last post, I used R Markdown to create a report that allows the sport scientist to share some basic data analysis with the strength and conditioning staff and other coaches. As I said in that post, R Markdown is a wonderful resource for creating reports where you can hide your code and simply show visualizations of the data and model outputs. But what if we don’t want to hide our code?! In this sense, R Markdown is extremely useful for setting up a data analysis template that allows you to walk through all the steps in your project and share the results with your colleagues or PhD supervisors. Additionally, you can keep RStudio open when presenting your findings and address any changes or suggestions people may have in real time, before “knitting” the markdown file into the final HTML or PDF document. This last part allows the analysis to come to life, letting you make direct changes and immediately show how they impact the outcome of the analysis!

Data Analysis Templates

There are a number of different data analysis frameworks one could follow. Two that immediately come to mind are the Cross Industry Standard Process for Data Mining (CRISP-DM) and the Problem, Plan, Data, Analysis, and Conclusion (PPDAC) Cycle.

Although they come from different industries (CRISP-DM from the business world and PPDAC from the statistics education world), there is considerable overlap, and both aim to provide the analyst with a clear path to answering their research question.

The objectives of each phase within these two frameworks are shown below. Briefly, CRISP-DM moves through Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, and Deployment, while PPDAC moves through Problem, Plan, Data, Analysis, and Conclusion.

 

As you can see, the end goal of the analysis is different between the two frameworks: CRISP-DM being targeted at deploying a model specific to business use cases and PPDAC providing more of a runway for scientific publication. However, both can provide us with an appreciation for creating a systematic process around data analysis, allowing for a clear explanation of our approach when discussing with colleagues or PhD supervisors.

In an attempt to create something more generic and less specific to a certain industry or field, I came up with my own framework:

The framework is freely available on my GitHub page in both R Markdown and Jupyter Notebook (if you prefer Python) formats. If you’d like to see what the R Markdown HTML looks like, click here >> PWard_-_Data_Analysis_Framework.

All you have to do is take the template (either R Markdown or Jupyter Notebook), delete the comments that I have under each section and fill in your own comments and your R or Python script, where applicable, to conduct your analysis. Once complete, you will have a clean looking file that details your entire approach.

I’ve made a simple example of what using the template could look like. If you are interested in seeing the result in R Markdown, CLICK HERE >> Data_Analysis_Framework_Example_–_MLB_Hitting. If you are interested in seeing the result in a Python Jupyter Notebook, CLICK HERE >> Data Analysis Framework Example — MLB Hitting (Jupyter).

All of the code and the templates for use are available on my GitHub page.