Category Archives: R Tips & Tricks

TidyX 69: Cleaning Messy Data, Part 6 – Special Guest Mara Averick

To round out our Cleaning Messy Data series, Ellis Hughes and I are joined by R superstar, Mara Averick, as she walks us through cleaning up very messy (but timely) Olympic Pentathalon data.

For those that don’t know, Mara is a Developer Advocate for R Studio and is an a fantastic Twitter follow. She is always sharing useful R content and helping others in the R community solve difficult problems with their code.

To watch the screen cast, CLICK HERE.

To access Mara’s code, CLICK HERE.

Force Decks – Force Plate Shiny Dashboard

Last week, two of the data scientists at Vald Performance, Josh Ruddy and Nick Murray, put out a free online tutorial on how to create a force plate reports using R with data from their Force Decks software.

It was a nice tutorial to give an overview of some of the power behind ggplot2 and the suite of packages that come with tidyverse. Since they made the data available (in the link above), I decided to pull it down and put together a quick shiny app for those that might be interested in extending the report to an interactive web app.

This isn’t the first time I’ve build a shiny app for the blog using force plate data. Interested readers might want to check out my post from a year ago where I built a shiny interactive report for force-velocity profiling.

You can watch a short preview of the end product in the below video link and the screen shots below the link show a static view of what the final shiny App will look like.

A few key features:

  1. App always defaults to the most recent testing day on the testDay tab.
  2. The user can select the position group at the top and that position group will be maintained across all tabs. For example, if you select Forwards, when you switch between tabs one and two, forwards will always be there.
  3. The time series plots on the Player Time Series tab are done using plotly, so they are interactive, allowing the user to hover over each test session and see the change from week-to-week in the tool tip. When the change exceeds the meaningful change, the point turns red. Finally, because it is plotly, the user can slice out specific dates that they want to look at (as you can see me do in the video example), which comes in handy when there are a large number of tests over time.

All code and data s accessible through my GitHub page.

vald_shiny_app

Loading and preparing the data

  • I load the data in using read.csv() and file.choose(), so navigate to wherever you have the data on your computer and select it.
  • There is some light cleaning to change the date in to a date variable. Additionally, there were no player positions in the original data set, so I just made some up and joined those in.

### packages ------------------------------------------------------------------
library(tidyverse)
library(lubridate)
library(psych)
library(shiny)
library(plotly)

theme_set(theme_light())

### load & clean data ---------------------------------------------------------
cmj <- read.csv(file.choose(), header = TRUE) %>%
  janitor::clean_names() %>%
  mutate(date = dmy(date))

player_positions <- data.frame(name = unique(cmj$name),
                               position = c(rep("Forwards", times = 15),
                                            rep("Mids", times = 15),
                                            rep("Backs", times = 15)))

# join position data with jump data
cmj <- cmj %>%
  inner_join(player_positions)

 

Determining Typical Error and Meaningful Change

  • In this example, I’ll just pretend as if the first 2 sessions represented our test-retest data and I’ll work from there.
  • Typical Error Measurement (TEM) was calculated as the standard deviation of differences between test 1 and 2 divided by the square root of 2.
  • For the meaningful change, instead of using 0.2 (the commonly used smallest worthwhile change multiplier) I decided to use a moderate change (0.6), since 0.2 is such a small fraction of the between subject SD.
  • For info on these two values, I covered them in a blog post last week using Python and a paper Anthony Turner and colleagues wrote.

change_standards <- cmj %>%
  group_by(name) %>%
  mutate(test_id = row_number()) %>%
  filter(test_id < 3) %>%
  select(name, test_id, rel_con_peak_power) %>%
  pivot_wider(names_from = test_id,
              names_prefix = "test_",
              values_from = rel_con_peak_power) %>%
  mutate(diff = test_2 - test_1) %>%
  ungroup() %>%
  summarize(TEM = sd(diff) / sqrt(2),
            moderate_change = 0.6 * sd(c(test_1, test_2)))

Building the Shiny App

  • In the user interface, I first create my sidebar panel, allowing the user to select the position group of interest. You’ll notice that this sidebar panel is not within the tab panels, which is why it stands alone and allows us to select a position group that will be retained across all tabs.
  • Next, I set up 2 tabs. Notice that in the first tab (testDay) I include a select input, to allow the user to select the date of interest. In the selected argument I tell shiny to always select the max(cmj$date) so that the most recent session is always shown to the user.
  • The server is pretty straight forward. I commented out where each tab data is built. Basically, it is just taking the user specified information and performing simple data filtering and then ggplot2 charts to provide us with the relevant information.
  • On the testDay plot, we use the meaningful change to shade the region around 0 in grey and we use the TEM around the athlete’s observed performance on a given day to specify the amount of error that we might expect for the test.
  • One the Player Time Series plot we have the athlete’s average line and ±1 SD lines to accompany their data, with points changing color when the week-to-week change exceeds out meaningful change.
### Shiny App -----------------------------------------------------------------------------

## Set up user interface

ui <- fluidPage(
  
  ## set title of the app
  titlePanel("Team CMJ Analysis"),
  
  ## create a selection bar for position group that works across all tabs
  sidebarPanel(
    selectInput(inputId = "position",
                label = "Select Position Group:",
                choices = unique(cmj$position),
                selected = "Backs",
                multiple = FALSE),
    width = 2
  ),
  
  ## set up 2 tabs: One for team daily analysis and one for player time series
  tabsetPanel(
    
    tabPanel(title = "testDay",
             
             selectInput(inputId = "date",
                         label = "Select Date:",
                         choices = unique(cmj$date)[-1],
                         selected = max(cmj$date),
                         multiple = FALSE),
             
             mainPanel(plotOutput(outputId = "day_plt", width = "100%", height = "650px"),
                       width = 12)),
    
    tabPanel(title = "Player Time Series",
             
             mainPanel(plotlyOutput(outputId = "player_plt", width = "100%", height = "700px"),
                       width = 12))
  )
  
)


server <- function(input, output){
  
  ##### Day plot tab ####
  ## day plot data
  day_dat <- reactive({
    
    d <- cmj %>%
      group_by(name) %>%
      mutate(change_power = rel_con_peak_power - lag(rel_con_peak_power)) %>%
      filter(date == input$date,
             position == input$position)
    
    d
    
  })
  
  ## day plot
  output$day_plt <- renderPlot({ day_dat() %>%
      ggplot(aes(x = reorder(name, change_power), y = change_power)) +
      geom_rect(aes(ymin = -change_standards$moderate_change, ymax = change_standards$moderate_change),
                xmin = 0,
                xmax = Inf,
                fill = "light grey",
                alpha = 0.6) +
      geom_hline(yintercept = 0) +
      geom_point(size = 4) +
      geom_errorbar(aes(ymin = change_power - change_standards$TEM, ymax = change_power + change_standards$TEM),
                    width = 0.2,
                    size = 1.2) +
      theme(axis.text.x = element_text(angle = 60, vjust = 1, hjust = 1),
            axis.text = element_text(size = 16, face = "bold"),
            axis.title = element_text(size = 18, face = "bold"),
            plot.title = element_text(size = 22)) +
      labs(x = NULL,
           y = "Weekly Change",
           title = "Week-to-Week Change in Realtive Concentric Peak Power")
    
  })
  
  ##### Player plot tab ####
  ## player plot data
  
  player_dat <- reactive({
    
    d <- cmj %>%
      group_by(name) %>%
      mutate(avg = mean(rel_con_peak_power),
             sd = sd(rel_con_peak_power),
             change = rel_con_peak_power - lag(rel_con_peak_power),
             change_flag = ifelse(change >= change_standards$moderate_change | change <= -change_standards$moderate_change, "Flag", "No Flag")) %>%
      filter(position == input$position)
    
    d
  })
  
  ## player plot
  output$player_plt <- renderPlotly({
    
    plt <- player_dat() %>%
      ggplot(aes(x = date, y = rel_con_peak_power, label = change)) +
      geom_rect(aes(ymin = avg - sd, ymax = avg + sd),
                xmin = 0,
                xmax = Inf,
                fill = "light grey",
                alpha = 0.6) +
      geom_hline(aes(yintercept = avg - sd),
                 color = "black",
                 linetype = "dashed",
                 size = 1.2) +
      geom_hline(aes(yintercept = avg + sd),
                 color = "black",
                 linetype = "dashed",
                 size = 1.2) +
      geom_hline(aes(yintercept = avg), size = 1) +
      geom_line(size = 1) +
      geom_point(shape = 21,
                 size = 3,
                 aes(fill = change_flag)) +
      facet_wrap(~name) +
      scale_fill_manual(values = c("red", "black", "black")) +
      theme(axis.text = element_text(size = 13, face = "bold"),
            axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1),
            plot.title = element_text(size = 18),
            strip.background = element_rect(fill = "black"),
            strip.text = element_text(size = 13, face = "bold"),
            legend.position = "none") +
      labs(x = NULL,
           y = NULL,
           title = "Relative Concentric Peak Power")
    
    ggplotly(plt)
    
  })
  
  
}



shinyApp(ui, server)

TidyX 67: Cleaning Messy Data, Part 4 – Viewer Submission

In Part 4 of our Cleaning Messy Data series Ellis Hughes and I are FINALLY shooting an episode in person again!

This week, we received a viewer submission (opened at issue on our GitHub repo) asking for some help in cleaning up UK Covid19 data. The data is downloaded from a government webpage and comes with a number of excel tabs. In this episode we will deal with the weekly tracking data and show how to go from an excel sheet, that reflects an intuitive way to store data if you are a decision-maker trying to look at it each week and gain insight, into a more usable format that a data scientist can work with.

To watch our screen cast, CLICK HERE.

To access the data and our code, CLICK HERE.

R Tips & Tricks: Creating a Multipage PDF with {ggplot2}

I was recently asked by a colleague for a simple solution to produce a multipage PDF for a training load report. The colleague wanted the report to generate a new page for each position group.

There are a few ways to solve this problem with loops, but below is literally the easiest approach you can take.

First, we will load in two packages that we need, {tidyverse}, for data manipulation and visualization, and {patchwork}, for organizing multiple plots on a page. Additionally, I’ll create a z-score function so that we can standardize the training load variables for each individual (In my opinion, it makes these types of charts look nicer when the data is on the same scale since athletes within the same position group can sometimes have very different training responses).

## load packages and custom z-score function
library(tidyverse)
library(patchwork)

z_score <- function(x){
  z = (x - mean(x, na.rm = T)) / sd(x, na.rm = T)
  return(z)
}

 

Next, we will just simulate some fake training data.

 

## simulate data
athlete <- rep(LETTERS[1:10], each = 10)
pos <- rep(c("DB", "LB", "DL", "OL", "WR"), each = 20)
week <- rep(1:10, times = 10)
Total_Dist <- round(rnorm(n = length(athlete), mean = 3200, sd = 400), 0) 
HSR <- round(rnorm(n = length(athlete), mean = 450, sd = 100), 0)

df <- data.frame(athlete, pos, week, Total_Dist, HSR) df %>% head()

 

Let’s go ahead and apply our z-score function to our two training variables, Total Distance (Total_Dist) and High Speed Running (HSR). Notice that I group by “athlete” to ensure that the mean and standard deviation used to normalize each variable is specific to the individual and not the entire population.

 

df <- df %>%
  group_by(athlete) %>%
  mutate(TD_z = z_score(Total_Dist),
         HSR_z = z_score(HSR))

 

Now we need to make a function that will create the plots we want. The code below can look a little intimidating, so here are a few points to help you wrap your head around it:

  • It is literally just two {ggplot2} plots. All I did was store each one in their own object (so that we could pair them together with {patchwork} and wrap them inside of this function).
  • The easiest way to get used to doing this is to write your {ggplot2} plots out as you normally would (as if you were creating them for a single position group). When you have the plot built to your specifications then just wrap it into a function. The argument for the function should take the value that you want to iterate over. In this case, we want to create plots for each position group, so I call the argument “POS”, short for position. When I run that function I provide the “POS” argument with the abbreviation for the position group I am interested in and the function will do the rest. The function works in this manner because you’ll notice that the second line of each of the plots is a filter that is specifically pulling out the position group of interest from the original data set.
  • The final line of the function creates an element called “plots”. You’ll see that the element consists of the two plots that we created above it and they are separated by a “|”. This vertical bar is just telling the {patchwork} package to place one plot right next to the other.
### Build a function for the plots to loop over position group

plt_function <- function(POS){
  
  dist_plt <- df %>%
    filter(pos == POS) %>%
    ggplot(aes(x = as.factor(week), y = TD_z, group = 1)) +
    geom_hline(yintercept = 0) +
    geom_line(size = 1) +
    geom_area(fill = "light green", 
              alpha = 0.7) +
    facet_wrap(~athlete) +
    theme_bw() +
    theme(axis.text.x = element_text(size = 9, face = "bold"),
          axis.text.y = element_text(size = 9, face = "bold"),
          strip.background = element_rect(fill = "black"),
          strip.text = element_text(color = "white", face = "bold", size = 8)) +
    labs(x = "",
         y = "Total Distance",
         title = "Weekly Training Distance",
         subtitle = paste("Position", POS, sep = " = ")) +
    ylim(c(-3.5, 3.5))
  
  hsr_plt <- df %>%
    filter(pos == POS) %>%
    ggplot(aes(x = as.factor(week), y = HSR_z, group = 1)) +
    geom_hline(yintercept = 0) +
    geom_line(size = 1) +
    geom_area(fill = "light green", 
              alpha = 0.7) +
    facet_wrap(~athlete) +
    theme_bw() +
    theme(axis.text.x = element_text(size = 9, face = "bold"),
          axis.text.y = element_text(size = 9, face = "bold"),
          strip.background = element_rect(fill = "black"),
          strip.text = element_text(color = "white", face = "bold", size = 8)) +
    labs(x = "",
         y = "HSR",
         title = "Weekly HSR",
         subtitle = paste("Position", POS, sep = " = ")) +
    ylim(c(-3.5, 3.5))
  
  
  plots <- dist_plt | hsr_plt
  plots
  
}


 

Let’s try out the function on just one group. We will pass the POS argument the abbreviation “DB”, for the defensive backs group.

 

# try out the function

plt_function(POS = "DB")

 

It worked!!

Okay, now let’s create our multipage PDF report. To do this, all we need to do is run the above line of code for each of our position groups. To ensure that we get each position plot into the PDF, we begin the code chunk with the pdf() function. It is here that we will specify the width and height of the plot page within the PDF itself (NOTE: you many need to play around with this depending on what your plots look like). We can also name the PDF report. Here I just called it “Team.pdf”. Finally, after running the line of code for each position group plot, we run the function dev.off(), which just shuts down the specified PDF device so that R knows that we are done making plots.

 

## create a multipage pdf with each page representing a position group

pdf(width = 12, height = 8, "Team.pdf")
plt_function(POS = "DB")
plt_function(POS = "LB")
plt_function(POS = "DL")
plt_function(POS = "OL")
plt_function(POS = "WR")
dev.off()

 

And that’s it! We end up with a 5 page PDF that has a different position group on each page.

 

 

If you want to see the finished product, click here: Team

The full code is on my github page. CLICK HERE

R Tips & Tricks: Building a Shiny Training Load Dashboard

In TidyX Episode 19 we discussed a way of building a dashboard with the {formattable} package. The dashboard included both a data table and a small visual using {sparkline}. Such a table is great when you need to make a report for a presentation but there are times when you might want to have something that is more interactive and flowing. In the sports science setting, this often comes in the form of evaluating athlete training loads. As such, I decided to spin up a quick {shiny} web app to show how easy it is to create whatever you want without sinking a ton of money into an athlete management system.

The code is available on my GITHUB page.

Packages & Custom Functions

Before doing anything, always start by loading the packages you will need for your work. In this tutorial, we will using the {tidyverse} and {shiny} packages, so be sure to install them if you haven’t already. I also like to set my plot theme to classic so that I get rid of grid lines for the {ggplot2} figures that I create.

Finally, I also wrote a custom function for calculating a z-score. This will come in handy when we go to visualize our data.

# custom function for calculating z-score
z_score <- function(x){
  z = (x - mean(x, na.rm = T)) / sd(x, na.rm = T)
  return(z)
}

 

Data

Next we simulate a bunch of fake data for a basketball team so that we have something to build our {shiny} app against. We will simulate total weekly training loads for 10 athletes across three different positions (fwd, guard, center), for a 3 week pre-season and a 15 week in-season.

 

set.seed(55)
athlete <- rep(LETTERS[1:10], each = 15)
position <- rep(c("fwd", "fwd", "fwd", "fwd", "guard", "guard", "guard", "center", "center", "center"), each = 15)
week <- rep(c("pre_1", "pre_2", "pre_3", 1:12), times = 10)
training_load <- round(rnorm(n = length(athlete), mean = 1500, sd = 350), 1)

df <- data.frame(athlete, position, week, training_load)
df$flag <- factor(df$flag, levels = c("high", "normal", "low"))
df$week <- factor(df$week, levels = c("pre_1", "pre_2", "pre_3", "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12")) 

df %>% head()

The first few rows of the data look like this:

Shiny App

There are two components to building a {shiny} app:

1) The user interface, which defines the input that the user specifies on the web app.
2) The server, which reacts to what the user does and then produces the output that the user sees.

The user interface I will define as tl_ui (training load user interface). I start by creating a fluid page (since the user is going to be defining what they want). I set the parameters of what the user can control. Here, I provide them with the ability to select the position group of interest and then the week(s) of training they would like to see in their table and plot.

The server, which I call tl_server (training load server), is a little more involved as we need to have it take an input from the user interface , get the correct data from our stored data frame, and then produce an output. This part can be a bit involved. Below you can see the code and I’ll try and articulate a few of the main things that are taking place.

1) First, I get the data for the visualization the user wants. I’m using our simulated data and I calculate the z-score for each individual athlete, using the custom function I wrote. After that I create my flag (±1 SD from the individual athlete’s mean) and then I finally filter the data specific to the inputs provided in the user interface. This last part is critical. If I filter before applying my z-score and flags, then the mean and SD used to calculate the z-score will only be relative to the data that has been filtered down. This probably isn’t what we want. Rather, we will want to calculate our z-score using all data, season-to-date. So we retain everything and perform our calculation and then we filter the data.

2) After getting the data, I build my plot using {ggplot2}.

3) Then I create another data set specific to the table weeks of training selected by the user. This time, we will retain the raw values (non-standardized) for each athlete as the user may want to see raw values in a table. Notice that I also pivot that data using the pivot_wider() function as the data is currently stored in a long format with each row representing a new week for the athlete. Instead, I’d like the user to see this data across their screen, with each week representing a new column. As the user selects new weeks of data to visualize our server will tell {shiny} to populate the next column for that athlete.

4) Once I have the data we need, I simply render the table.

Those 2 steps complete the building of our {shiny} app!

Finally, the last step is to run the {shiny} app using the shinyApp() function which takes two arguments, the name of your user interface (tl_ui) and the name of your server (tl_server). This will open up a shiny app in a new window within R Studio that you can then open in your browser.

Visually, you can see what the app looks like below. If you want to see it in action, aside from getting my code and running it yourself, I put together this short video >> R Tips & Tricks – Training Load Dashboard

Conclusion

As you can tell, it is pretty easy to build and customize your own {shiny} app. I’ll admit, the syntax for {shiny} can get a bit frustrating but once you get the hang of it, like most things, it isn’t that bad. Also, unless you are doing really complex tasks, most of the syntax for something as simple as a training load dashboard wont be too challenging. Finally, such a quick and easy method not only allows you to customize things exactly as you want them, but it also saves you money from having to purchase an expensive athlete management system.

As always, you can obtain the code for this tutorial on my GITHUB page.