Creating A Data Dictionary Function in R

In my previous post, I did a bit of impromptu analysis on some Powerlifting data provided from the TidyTusday project.

When sitting down to work with a new data set it is important to familiarize yourself with the variables in each column, get a grasp for what sort of values you may be dealing with, and quickly identify any potential issues with the data that may require your attention.

For looking at the type of variables you are dealing with the functions str() in base R or glimpse() in tidyverse can be useful. If it’s summary statistics you’re after, the psych package’s describe() function will do the trick. The summary() function in base R can also be useful for getting min, max, mean, median, IQR, and the number of missing values (NA) in each column.

The issue with this is that you have to go through a few steps to get the info you want — variable types, number of missing values, and summary statistics. Thus, I decided to create my own data dictionary function. After passing your data frame to the function, you will get the name of each variable, the variable type, the number of missing values for each variable, the total amount of data (rows) for each value, and a host of summary statistics such as mean, standard deviation, median, standard error, min, max, and range. While the function defaults to printing the results in your R console you can choose to set the argument print_table = “Yes” and the results will be returned in a nice table that you can use for reports or presentations to colleagues.

Let’s take a look at function in action.

First, we will create some fake data:


Names <- c("Sal", "John", "Jeff", "Karl", "Ben")
HomeTown <- c("CLE", "NYC", "CHI", "DEN", "SEA")
var1 <- rnorm(n = length(Names), mean = 10, sd = 2)
var2 <- rnorm(n = length(Names), mean = 300, sd = 150)
var3 <- rnorm(n = length(Names), mean = 1000, sd = 350)
var4 <- c(6, 7, NA, 3, NA)

df <- data.frame(Names, HomeTown, var1, var2, var3, var4)
df

 

 

We can see from the output that the code includes a few NA values in the var4 column. Additionally, the first two columns are not numeric values. We can run the data_dict() function I’ve created to get a read out of the data we are looking at.

First, let’s look at the output in the R console:


# without table
data_dict(df, print_table = "No")

We are immediately returned an output that consolidates some key information for helping us quickly evaluate our data set.

By setting the argument print_table = “Yes” we will get our result in a nice table format.


# with table
data_dict(df, print_table = "Yes")

Let’s look at the results in table format for a much larger data set — the Lahman Baseball Batting data set.

As you can see, it is a pretty handy function. Very quickly we can identify:

1) The types of variables in our data
2) The amount of data in each column
3) The number of missing values in each column
4) A variety of summary statistics

If you’re interested in using the function, you can obtain it on my GitHub page.

 

TidyTuesday — Powerlifting Performance & Age

TidyTuesday is a really neat project where every week a new data set is provided (for free) and anyone can download the data and share their findings. The basic idea was to get people to trade ideas on how to arrange, summarize, and visualize data within R (primarily using the suite of data science packages that make up the tidyverse).

I’ve enjoyed seeing what people share on Twitter and my friend Ellis Hughes suggested that I join in the fun. As such, I found a data set from an earlier week that was sports related (to keep the analysis relevant with the theme of my blog).

The data set comes from the TidyTuesday on 10/8/2019 (free to download HERE). Briefly, the data set contains outcomes from International Powerlifting Federation (IPF) Competitions from 1973 up through 2019. Each row represents an individual athlete’s best lift in the squat, bench press, and deadlift, for a given competition. In total, the data set contains 38,244 rows and 15996 unique lifters. (NOTE: There is a much larger data set that is linked to on the GitHub page, but I did not use that one).

I’ll use the Data Analysis Template I discussed in a previous blog article. The only difference between the template from the prior article and the approach I’ll take here is that I have no prior knowledge of the data set. The template works well when we have a specific question to answer as it helps to guide the process from data collection to analysis. However, in this case, as is sometimes common in the real world, people may provide you with a data set without a specific question. As such, some level of data exploration is required to understand the data set and what type of questions may be interesting. Therefore, I’ll begin with just familiarizing myself with the data before developing a question I may want to answer.

Loading Data & Cleaning Data

  • Read in the data from the TidyTuesday GitHub page.
  • Notice that I added a cleaning step when reading in the data. I filter out any age class of 5-12 and I also remove any NA values in the age column (which happened because sometimes exact age wasn’t recorded). I added this step when importing the data after I worked through my analysis because I felt like it was better to do this right away and  space in the code.
  • In the second step, I ordered the data set by athlete name and date of competition.
  • Finally, I created a long format of the data frame (since it is originally in a Wide format) to assist with building data visualizations and I remove any NA’s that were present in the data set (e.g., if a lifter bombs out on their squat in a competition then they have no value for the squat).


df <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-10-08/ipf_lifts.csv") %>%
filter(age_class != "5-12", !is.na(age))

# order the data by lifter and date

df <- df %>%
arrange(name, date)

# create a long format of the data 

df_long <- df %>%
reshape2::melt(., id = c("name", "date", "age", "age_class", "weight_class_kg", "sex"), measure.vars = c("best3squat_kg", "best3bench_kg", "best3deadlift_kg")) %>%
na.omit(df_long)

Data Exploration

Since I don’t really know anything about the data set provided, it is hard to have a question to answer. Thus, I create some basic plots to help orient myself to to the data we are working with.

First, I wanted to see the athletes who have competed in the most competitions in this data set:

I know that lifters in the IPF have a choice of wearing different types of lifting equipment so I wanted to see what sort of competition gear the athletes in this data set wore:

I was curious about the age class and actual age of when athletes, on average, achieve their best lift:

We can also look at this by male and female:

Finally, I want to explore the distribution of power lifting totals between men and women:

Research Question

After exploring the data a little bit, some of the things that stand out:

1) The data set contains primarily lifters wearing single-ply lifting gear.

2) The boxplots use the ‘age_class’ variable, so everyone within an ‘age_class’ is treated the same and the age bins appear to be rather large (e.g., 24 – 34). I prefer not to think of age data this way since such large groupings can have a lot of variability within them.

3) Looking at the dot plots, which reflect age as a continuous variable, athletes tend to peak in all three of the lifts around their early 30’s.

4) The trend for peaking in performance seems to be consistent among men and women (which is interesting given that I would have suspected women to peak later given that they might be less inclined to take up serious weight training until later in life, whereas male’s tend to start lifting around their high school years).

5) The distribution of powerlifting totals appears to be relatively normally distributed for both men and women, with more variability in the distribution for men than women.

The beauty of graphing your data is that it often reveals underlying patterns that help you get a sense for what is going on. It is instances like this where a statistical model can serve as a gut check to confirm what you can already clearly see.

In looking at the data, the two questions I’ll explore are:

1) At what age do powerlifters peak for the 3 competition lifts?

2) How many competitions do lifters perform until they finally total elite?

I’ll keep these rather simple and brief, as a means of sharing some ideas. These models can (and should) be more thorough and account for things like sex (in the aging curve model, for example) and other variables that may be relevant to how powerlifters progress across  career. What is presented below is just a simple jumping off point of where I might begin when working with data like this to answer a question before extending the model (for example, creating a mixed model to account for individual lifters).

Models

Powerlifter Aging Curve

To develop a simple aging curve model I built a polynomial regression for each of the 3 lifts (again, to keep things simple, I did not include sex in these models). Before building the models, we noticed from our data exploration was that most of the lifters in this data set are single-ply lifters. So I’m going to limit the analysis to them since changing competition gear can influence performance (I’m not going to get into the philosophical debate about which one is “better” than the other — I’ll leave that to the lifters). Additionally, since I’m interested in how lifters perform across their career and when they tend to “peak”, I’m going to limit my analysis to only those lifters who have competed in at least 10 competitions. After cleaning up the data specific to the above inclusion criteria we are left with 6169 rows of data and 426 unique athletes.


# Data clean up for aging curve model
sply <- df %>%
filter(equipment == "Single-ply") %>%
group_by(name) %>%
filter(n() >= 10)

nrow(sply)
nrow(distinct(sply, name))

# 6169
# 426 athletes

 

Now that the data is in the format we’d like, we can build some simple models for each of the three lifts:


squat_age_fit <- lm(best3squat_kg ~ age + I(age^2), data = sply)
bench_age_fit <- lm(best3bench_kg ~ age + I(age^2), data = sply)
deadlift_age_fit <- lm(best3deadlift_kg ~ age + I(age^2), data = sply)

The summary of the three models can be found on my GitHub page. Here is an example of the squat model output:

We see that the coefficient for age is positive while the polynomial of age is negative. This shouldn’t come as a surprise given that we observed an upside down “U” in our plots during the data exploration phase of our analysis. We can use these two coefficients to calculate the peak age from our regression equation. I’ve written a function to do that:


peak_age <- function(coef1, coef2){
x = -(coef1) / (2 * (coef2))
}

 

By supplying the custom function with the two coefficient (age and age^2) we can obtain the peak age from each of our models:


Just as suggested in our data visualizations, the peak age is around the early to mid 30’s with the squat peaking earlier and the bench press peaking later. As an example, we can plot the actual data along with a prediction line and 95% Confidence Interval for the bench press, where the peak age is around 36 years old:

Number of Competitions Until Totaling Elite

To try and answer this question I built a simple time-to-event (survival) model. In this case, the event of interest is the individual achieving an elite total, coded as a 1, and any competition where they do not achieve an elite total coded as a 0. I’m only calculating time to first elite total for each lifter, so there are some lifters that achieve elite and others that do not.

I wasn’t sure of where to obtain the elite total criteria so I found a criteria to use on THIS WEBSITE. However, I’m not certain if these criteria will carry over to single-ply lifters (IE, perhaps these criteria are only specific to raw lifters?). I also wasn’t able to locate an elite total criteria for female lifters, so the below analysis is only specific to male lifters. Finally, not all of the weight classes observed in the data were available on the referenced website. So, this analysis is far from perfect given the data but it will suffice for a simple example.

After adding in the elite total criteria and removing the athletes who were not in a weight class that was specific to the elite total criteria presented in the website, I was left with 6074 male lifters of which, 22% of them (1335) achieved an elite total during their career:

In looking at the number of competitions until a lifter totals elite (plot below), it appears that many of them are achieving that status in their first competition. This makes me skeptical of the data as I feel like most lifters would require a number of competitions to achieve an elite total. This may be a function of either (a) the subset of data that has been provided by TidyTuesday or (b) I’m using the wrong elite total criteria for single-ply lifters.

The data was fit with a Kaplan-Meier curve in order to create a simple model and nice visual of the data. Below is the summary table produced from the model followed by the time-to-event curve (event being elite total).

 

Conclusions

The TidyTuesday project is a great way to get access to data sets and share ideas. This was a fun one to do given it is specific to sport and I had the opportunity to try a few different models while also showing different ways of graphing the data. Finally, there is a bunch of different coding approaches I used to clean up the data, which you can check out on my GitHub page.

Python Pivot Tables

Feedback from the last two blog entries suggests that people seem to enjoy the “how to” nature of them. Walking through how to do some basic coding steps while exploring, summarizing, visualizing, and analyzing data is a great way to help others feel comfortable when trying to move from excel or SPSS to a code language (like R or Python). Additionally, this also helping me to get back to first principles and put together a step-by-step approach to teaching this stuff to others.

As such, I’ve decided to start incorporating some articles where I walk through each line of code. Since I’ve started to do a bit more work in Python (though I still strongly favor R), I feel like this is a great place for me to start.

For those of us who started out in sport and exercise science, using Excel Pivot Tables was one of the first way we began to explore our data and summarize what was there. Today, I’ll walk through how we can create some of those exact same pivot tables in Python.

The Jupyter Notebook and data for this article are available on my GitHub page.

The data I used for this article was taken from www.hockey-reference.com and consists of game outcomes from the 2014-2015 Season through the 2018-2019 Season. I stored the data as a .csv file for this article but I’ll do a future blog article on how to scrape data from webpages like this. This is additionally fun because I know very little about hockey, so I get to learn along the way!

Step 1: Load Packages

  • First I start out just loading the packages (Python people call them “libraries” but I’m still stuck in the old R jargon, so I say “packages”) that contain some of the functions that I’ll need to do my analysis.
  • Pandas is the main data frame package in Python and numpy is useful for dealing with arrays and various mathematical functions.
  • Seaborn and Matplotlib are two packages that I mainly use for data visualizing, however I do not do anything visualization in this article.
  • Finally, the OS package is what I use to check my working directory and change the working directory, if needed (more on that later).
  • Once importing each package I give them an alias (e.g., pandas as pd) so that I can use a short hand version of the package name when I call one of it’s functions, which you will see throughout this article.

# Load packages

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import os as os

Step 2: Change the working directory if needed and load data

  • The working directory is the place where the data for your analysis is stored.
  • os.getcwd() is a function from the “os” package that gets my current working directory. If the directory needs to be changed to a different folder (for example, if the data were stored somewhere other than my desktop), I use the os.chdir() function to change it.
  • Once I have the directory set, I load the data using the pd.read_csv() function from the pandas package.
  • After the data is loaded, we take a look at the first few rows of the data frame.
  • An example of all of these steps is provided in the screen shot of my Jupyter notebook.

# Get current working directory
os.getcwd()

# Change the working directory
os.chdir('/Users/Patrick/Desktop')

# Load the 2014 - 2018 NHL Scores data
nhl = pd.read_csv('2014-2018 NHL Scores.csv')

# Look at the first few rows
nhl.head()

Step 3: Clean up the data (All little housekeeping)

  • We notice that the Goals (‘G’) column is repeated, once for the Visitor and once for the Home team. Thus, the second time it is referred to as ‘G.1’. So, we can rename these columns to ‘Visitor_G’ and ‘Home_G’, so that they are more explicit and less confusing.
  • Next, we want to create four new columns. First, we create a column that calculates the point differential of the game for the home team. Then we create a column that identifies whether the winner was the home or away team. Finally, we create a column that tells us which team was the winner and a column telling us which team was the loser. You can see that I do all of these steps by first creating the conditions I’m testing (comparing goals for the home team to goals for the away team) and then assigning the outcome a specific label (e.g. home, away, or tie).

nhl['Point_Diff'] = nhl['Home_G'] - nhl['Visitor_G']

conditions = [
(nhl['Home_G'] > nhl['Visitor_G']),
(nhl['Home_G'] < nhl['Visitor_G']),
(nhl['Home_G'] == nhl['Visitor_G'])]

choices = ['home', 'visitor', 'tie']

nhl['winner'] = np.select(conditions, choices, default = 'null')

cond_for_winner = [
(nhl['winner'] == 'home'),
(nhl['winner'] == 'visitor'),
(nhl['winner'] == 'tie')]

choices_for_winner = [nhl['Home'], nhl['Visitor'], 'tie']
choices_for_loser = [nhl['Visitor'], nhl['Home'], 'tie']

nhl['winning_team'] = np.select(cond_for_winner, choices_for_winner, default = 'null')
nhl['losing_team'] = np.select(cond_for_winner, choices_for_loser, default = 'null')

  • Our data frame now looks like this:

Step 4: Create Pivot Tables

Now that we have the data arranged, we can create some pivot tables.

First, let’s create a pivot table looking at the average home point differential by season.


nhl.groupby('Season').mean()['Point_Diff']

 

I produced this, as you can see from the code, by looking for the mean of ‘Point_Diff’, grouped by the variable ‘Season’.

This is useful, but I don’t love how it looks. Using the pivot_table() function, we can get a nicer looking output.


nhl.pivot_table(values = 'Point_Diff', index = 'Season', aggfunc = np.mean)

 

Much nicer!!

Next, lets look at how many times the home team won.


nhl.groupby(['winner']).size()

If we divide by the total number of observations, we get the percentage of times the home team won.


nhl.groupby(['winner']).size() / len(nhl.index)

 

We can also look at home and visitor win percentage by each season within our data, again by using the groupby() function. We then create two columns, one for home and one for visitor win percentage with just some basic math.


season_totals = nhl.groupby(['Season', 'winner']).size().unstack()
season_totals['Home_Win_Pct'] = season_totals['home'] / (season_totals['home'] + season_totals['visitor'])
season_totals['Visitor_Win_Pct'] = season_totals['visitor'] / (season_totals['home'] + season_totals['visitor'])
season_totals

Finally, we can look at the Win% for each team over this 5 year stretch. Three steps were required to do this:

  1. The first chunk of code gets the total number of times each team played a game.
  2. The second chunk of code tally’s up the teams total wins and losses and stores them in a data frame (team_perf).
  3. The final chunk of code calculates a win percentage an then sorts the win percentage from best to worse.

team_win_totals = nhl.groupby(['winning_team']).size()
team_loss_totals = nhl.groupby(['losing_team']).size()

team_win_totals = pd.DataFrame(team_win_totals, columns = ['wins'])
team_loss_totals = pd.DataFrame(team_loss_totals, columns = ['losses'])
team_perf = team_win_totals.join(team_loss_totals)

team_perf['Win_Pct'] = team_perf['wins'] / (team_perf['wins'] + team_perf['losses'])
team_perf.sort_values(by = ['Win_Pct'], ascending = False)

Conclusion

There are a number of other things we could have built Pivot Tables on to explore this simple data set. Feel free to take the code and the data from my GitHub page and play around with it yourself. I don’t code too often in Python (still trying to figure out a lot of the syntax) so if you notice any errors in the code or ways to make it cleaner (nothing beats R Tidyverse, in my opinion), feel free to comment below.

 

 

Data Analysis Template in R Markdown & Jupyter Notebook

The nice thing about working on a team with other analysts, working as part of a research group, or working on your PhD is the ability to share analysis with other colleagues, get feedback, and learn new ways of thinking about things.

Interestingly, when I’ve inquired to colleagues at some teams about how they share their analysis with their group they often say that, “people do their analysis and just present the results”. I think this is a big miss in terms of being able to have transparency in the research process, sharing so that others can learn or help to provide constructive feedback, and walking through the steps you went through (data retrieval,  data cleaning, analysis, model testing, etc) to ensure that things make sense to the group before being shared with the end user.

For the PhD student, a more streamlined approach to the entire analysis can help them talk through what they did with their advisors, ensure that all the correct steps were taken during the analysis, and have greater confidence about what their data is and is not saying (which can really come in handy when it is time to defend the thesis!). When I was doing my PhD I would often try and put all my steps into a power point presentation to walk through with my supervisors. I never liked that, however, because it always felt clumsy and I was never really getting to the guts of the analysis as much as I was just sharing the outcomes of what I did and why I did it and talking through how I did it. A template that allows for a clear presentation would have made things much easier for both myself and my supervisors

In my last post, I used R Markdown to create a report that allows the sport scientist to share some basic data analysis with the strength and conditioning staff and other coaches. As I said in that post, R Markdown is a wonderful resource for creating reports where you can hide your code and simply show visualizations of the data and model outputs. But, what if we don’t want to hide our code?! In this sense, R Markdown is extremely useful for setting up a data analysis template to allow you to walk through all the steps in your project and share the results with your colleagues or PhD supervisors. Additionally, you could also keep R Studio open when presenting your findings and address any changes/suggestions that people may have, in real time before, “knitting” the markdown file into the final html or pdf document. This last part allows the analysis to come to life and allows you to make direct changes and immediately show how they impact the outcome of the analysis!

Data Analysis Templates

There are a number of different data analysis frameworks one could follow. Two that immediately come to mind are the Cross Industry Standard Process for Data Mining (CRISP-DM) and the Problem, Plan, Data, Analysis, and Conclusion (PPDAC) Cycle.

Although they come from different industries — CRSIP-DM from the business world  and PPDAC from more of the statistics education world — there is considerable overlap and both have the aim of providing the analyst with a clear path to answering their research question.

The objectives of each phase within these two frameworks is shown below.

 

As you can see, the end goal of the analysis is different between the two frameworks: CRISP-DM being targeted at deploying a model specific to business use cases and PPDAC providing more of a runway for scientific publication. However, both can provide us with an appreciation for creating a systematic process around data analysis, allowing for a clear explanation of our approach when discussing with colleagues or PhD supervisors.

In an attempt to create something more generic and less specific to a certain industry or field, I came up with my own framework:

The framework is freely available on my GitHub page in both an R Markdown and Jupyter Notebook (if you prefer Python) formats. If you’d like to see with the R Markdown HTML looks like, click here >> PWard_-_Data_Analysis_Framework.

All you have to do is take the template (either R Markdown or Jupyter Notebook), delete the comments that I have under each section and fill in your own comments and your R or Python script, where applicable, to conduct your analysis. Once complete, you will have a clean looking file that details your entire approach.

I’ve made a simple example of what using the template could look like. If you are interested in seeing the result in R Markdown, CLICK HERE >> Data_Analysis_Framework_Example_–_MLB_Hitting. If you are interested in seeing the result in a Python Jupyter Notebook, CLICK HERE >> Data Analysis Framework Example — MLB Hitting (Jupyter).

All of the code and the templates for use are available on my GitHub page.

 

Total Score of Athleticism — R Markdown & R Shiny

A battery of performance tests (e.g., strength, power, fitness, agility) are often used by strength and conditioning and sports science staffs to evaluate a player’s current physical status. Such information can help to guide future training programs aiming to improve deficiencies and enhance performance. Having a single value that represents the athlete’s overall athleticism may be useful to identify the most well-rounded athletes in the club and may also help in communicating the test results of each player to the coaching staff in a digestible manner.

Recently, Anthony Turner and colleagues published the paper, Total Score of Athleticism: Holistic Athlete Profiling to Enhance Decision-Making, in the NSCA’s Strength and Conditioning Journal. While there are number of ways that one could index an athlete and represent their athleticism in a single value, this approach is simple to calculate for the practitioner and therefore I wanted to use it as an example of how you can create an R markdown report and Shiny app for displaying the results. (NOTE: For those interested in doing the analysis in excel, Anthony has a YouTube channel where he details the process. CLICK HERE).

Calculating Total Score of Athleticism (TSA)

The TSA is derived by calculating the z-score for each test in your battery and then averaging over the z-score for each individual to produce a single value.

The z-score is calculated simply as:

The values for a z-score will be reported with a mean of 0 and standard deviations ranging from -3 to 3 where ±1 SD represents ~68% of the scores, ±2 SD represents ~95% of the scores, and ±3 SD represents ~99% of the scores. In this way, values that are negative suggest the athlete is below average while values that are positive suggest the athlete is above average, relative to the group.

Some coaches or practitioners, however, may not like looking at a z-score because it is difficult for them to wrap their heads around the values (though I think it is easier to see the results as a z-score when performance is represented as positive and negative) and may instead prefer to look at the scores on a 0-100 scale. To convert the z-scores to t-scores on a 0-100 scale we use this formula:

Now, instead of having positive and negative values representing above or below average athletes, a score of a 50 represents average. As such, 10 points in either direction of 50 represent the number of standard deviations from average the athlete is. For example 60 = 1 SD, 70 =  2 SD, and 80 = 3 SD.

R Markdown Report

R Markdown is a simple way to take your analysis and turn it into a report or document for sharing. The beauty of R Markdown is that you can choose to show your R Code (if you are sharing with other professionals or colleagues) or hide your R Code and just show the results of the analysis (if you are sharing with coaches or other staff members).

I’ll put my R code in here to walk through the steps but if you are interested in the R Markdown file, just go over to my GitHub page.

First, we need to load our required packages and make up some fake data (since I don’t have any data I can use publicly).

# Packages
library(tidyverse)
library(reshape)
library(stringr)

# Simulate data
set.seed(3344)
athlete <- as.factor(1:30)
cmj <- c(round(rnorm(n = 10, mean = 30, sd = 4), 1), round(rnorm(n = 10, mean = 24, sd = 4), 1), round(rnorm(n = 10, mean = 33, sd = 2), 1))
sprint_40 <- c(round(rnorm(n = 10, mean = 4.5, sd = .1), 2), round(rnorm(n = 10, mean = 4.9, sd = .2), 2), round(rnorm(n = 10, mean = 5, sd = .2), 2))
bench <- c(round(rnorm(n = 10, mean = 20, sd = 4), 1), round(rnorm(n = 10, mean = 12, sd = 4), 1), round(rnorm(n = 10, mean = 30, sd = 4), 1))
five_ten_five <- c(round(rnorm(n = 10, mean = 6.4, sd = .2), 2), round(rnorm(n = 10, mean = 6.7, sd = .2), 2), round(rnorm(n = 10, mean = 7.5, sd = .4), 2))
df <- data.frame(athlete, cmj, sprint_40, bench, five_ten_five)

Next we need to write 2 functions, one for calculating our z-score and one for calculating our t-score.

## z-score function
z_score <- function(x){
z = (x - mean(x, na.rm = T)) / sd(x, na.rm = T)
return(z)
}

## t-score function
t_score <- function(x){
t = (x * 10) + 50
return(t)
}

 

Now we are all set to calculate the z-score and t-score results for our individual athletes. Also, note that before converting to the t-score, any test where negative reflects better performance (e.g., speed tests where a faster time is more favorable) you can multiple the z-score by -1. This intuitively makes it easier for those reading the report to always associate values that are positive as “above average” and values that are negative as “below average”.


## calculate the z-score
df <- df %>%
mutate_if(is.numeric, list(z = z_score))

df$sprint_40_z <- df$sprint_40_z * -1
df$five_ten_five_z <- df$five_ten_five_z * -1

## calculate the t-score
df <- df %>%
  mutate(cmj_t = t_score(cmj_z),
         sprint_40_t = t_score(sprint_40_z),
         bench_t = t_score(bench_z),
         five_ten_five_t = t_score(five_ten_five_z))

Finally calculate the TSA z-score (TSA_z) and TSA t-score (TSA_t).


## calculate TSA_z
df$TSA_z <- apply(df[, 6:9], MARGIN = 1, FUN = mean)

## calculate TSA_z
df$TSA_t <- with(df, (TSA_z * 10) + 50)

 

Now that the data is prepared we construct our report. Before plotting the TSA z-scores we need to move the data from the wide format, that it is currently in, to a long format. This will make it easier to code the plot. We will remove the “_z” at the end of each variable to make the labels cleaner looking. We are also going to add a shaded range between -1 and 1 (you can pick whatever range makes sense for you in your situation). Finally, we will include an indicator value that flags the athlete as “green” when they are above average and “red” when they are below average.


# Change data from a wide to long format
df_long <- df %>%
melt(., id = "athlete", measure.vars = c("cmj_z", "sprint_40_z", "bench_z", "five_ten_five_z"))

# remove the _z
df_long$Test <- str_sub(df_long$variable, end = -3)

# Add indicator value
df_long <- df_long %>% mutate("indicator" = ifelse(value > 0, "above avg", "below avg"))

# plot
df_long %>%
filter(athlete %in% c(3, 15, 22, 27)) %>%
ggplot(aes(x = Test, y = value)) +
geom_rect(aes(ymin = -1, ymax = 1), xmin = 0, xmax = Inf, fill = "light grey", alpha = 0.3) +
geom_col(aes(fill = indicator), alpha = 0.8) +
facet_wrap(~athlete) +
scale_fill_manual(values = c("green", "red")) +
theme_light() +
theme(axis.text.x = element_text(face = "bold", size = 12, angle = 45, vjust = 1, hjust = 1),
axis.text.y = element_text(face = "bold", size = 12),
strip.background = element_rect(fill = "black"),
strip.text = element_text(color = "white", face = "bold"),
plot.title = element_text(size = 18),
plot.subtitle = element_text(size = 15)) +
labs(x = "", y = "z-score of performance") +
ggtitle("Test Performance", subtitle = "Player Performance Standardized to the Team") +
ylim(-3, 3)

 

 

We can also plot the entire team’s TSA z-score in a similar manner. I’ve included the TSA z-score values on the plot as well, to help with interpretation.

 

If you would like to see what the finished markdown file looks like click here:
Total_Score_of_Athleticism_-_Report

You can certainly manipulate the code to produce different plots or add more text for the coaches to read. I’ve stuck with the z-score in this report because I personally don’t think the t-score conveys the information in the same manner. My personal preference is that I like to see below average as negative and above average as positive. I’ve added the code for the t-score plot in the markdown file so you can manipulate it if you’d like. Quickly, here is what the plot would look like (horizontal dashed line at 50, indicating average) so you can judge which one you prefer better:

 

Shiny App

You may have noticed from the markdown report that when plotting the individual test results I only showed 4 athletes. In this way the report is rather static! It doesn’t allow us to scroll through players in an efficient manner. For that, we need something that is interactive. Enter Shiny!

Shiny is a way for us to quickly build interactive webpages in R that can be hosted directly on our computer. There is a lot of versatility in these apps. The example below is just a simple app that allows the coach to select an athlete and it will automatically change to that individual’s plot, showing their performance in all of the individual tests as well as their TSA.

We could extend this app to have a second plot to allow for player comparisons or a table of results to allow for viewing the raw values for each of the tests. The code is provided below for you to run on your computer. I don’t go into what all of the elements of the code are doing (for time sake) but perhaps I could revisit other ways of using Shiny in future blog posts and explain more clearly how the code works.


### TotalScore of Athleticism - Shiny App

# Load packages
library(tidyverse)
library(reshape)
library(stringr)
library(shiny)

# simulate data
set.seed(3344)
athlete <- as.factor(1:30)
cmj <- c(round(rnorm(n = 10, mean = 30, sd = 4), 1), round(rnorm(n = 10, mean = 24, sd = 4), 1), round(rnorm(n = 10, mean = 33, sd = 2), 1))
sprint_40 <- c(round(rnorm(n = 10, mean = 4.5, sd = .1), 2), round(rnorm(n = 10, mean = 4.9, sd = .2), 2), round(rnorm(n = 10, mean = 5, sd = .2), 2))
bench <- c(round(rnorm(n = 10, mean = 20, sd = 4), 1), round(rnorm(n = 10, mean = 12, sd = 4), 1), round(rnorm(n = 10, mean = 30, sd = 4), 1))
five_ten_five <- c(round(rnorm(n = 10, mean = 6.4, sd = .2), 2), round(rnorm(n = 10, mean = 6.7, sd = .2), 2), round(rnorm(n = 10, mean = 7.5, sd = .4), 2))
df <- data.frame(athlete, cmj, sprint_40, bench, five_ten_five)

# z-score function

z_score <- function(x){
z = (x - mean(x, na.rm = T)) / sd(x, na.rm = T)
}

##### Data Pre-Processing #####
###############################

# calculate the z-score
df <- df %>%
mutate_if(is.numeric, list(z = z_score))

df$sprint_40_z <- df$sprint_40_z * -1
df$five_ten_five_z <- df$five_ten_five_z * -1

# calculate TSA_z
df$TSA_z <- apply(df[, 6:9], MARGIN = 1, FUN = mean)

# Change data from a wide to long format
df_long <- df %>% 
melt(., id = "athlete", measure.vars = c("cmj_z", "sprint_40_z", "bench_z", "five_ten_five_z", "TSA_z"))

# remove the _z
df_long$Test <- str_sub(df_long$variable, end = -3)

# Add indicator value
df_long < df_long %<% mutate("indicator" = ifelse(value > 0, "above avg", "below avg"))

##### Shiny App #####
#####################

## User Interface

athlete <- as.vector(unique(df_long$athlete))

ui <- fluidPage(

titlePanel("Performance Testing Results"),

selectInput(
input = "athlete",
label = "athlete",
choices = athlete,
selected = "1"
),

plotOutput(outputId = "tsa.plot",
width = "60%")
)

## server

server <- function(input, output){

dat <- reactive({
dataset <- subset(df_long, athlete == input$athlete)
dataset
})

output$tsa.plot <- renderPlot({
d <- dat()

athlete.plot <- ggplot(data = d, aes(x = Test, y = value)) +
geom_rect(aes(ymin = -1, ymax = 1), xmin = 0, xmax = Inf, fill = "light grey", alpha = 0.3) +
geom_col(aes(fill = indicator), alpha = 0.8) +
scale_fill_manual(values = c("green", "red")) +
theme_light() +
theme(axis.text.x = element_text(face = "bold", size = 12, angle = 45, vjust = 1, hjust = 1), 
axis.text.y = element_text(face = "bold", size = 12),
strip.background = element_rect(fill = "black"),
strip.text = element_text(color = "white", face = "bold"),
plot.title = element_text(size = 18),
plot.subtitle = element_text(size = 15)) +
labs(x = "", y = "z-score of performance") +
ggtitle("Test Performance", subtitle = "Player Performance Standardized to the Team") +
ylim(-3, 3)

print(athlete.plot)
})
}

## Run the app
shinyApp(ui = ui, server = server)

The website ends up looking like this:

Conclusion

There are a number of ways that one can index several performance tests and create a single number for coaches to digest. This is one simple example of an approach that is easy for practitioners to understand and action against (e.g., low values in red may require specialized approaches to training and performance development). The simplicity of this approach makes it easy to create simple reports and web apps that allow the strength coach and sports science staff to quickly share information with the coaching staff in an easy and interactive manner.

Obviously care should go into selecting the test battery as it should reflect elements that are specific to success in your given sport. Finally, the current format of the Total Score of Athleticism treats each test with equal weight. However, there may be situations where you want to place more weight on certain tests. For example, some tests may be more important for certain position groups than others. Alternatively, some tests may have a stronger association with sports performance and thus require greater weight than other tests. All of these things can be addressed in your setting by simply using the code in this blog and altering it to meet your needs.

 

 

References

1) Turner, AN. et al. (2019). Total Score of Athleticism: Holistic Athlete Profiling to Enhance Decision-Making, Strength Cond J. 2019. Epub-ahead-of-print.