TidyX 69: Cleaning Messy Data, Part 6 – Special Guest Mara Averick

To round out our Cleaning Messy Data series, Ellis Hughes and I are joined by R superstar, Mara Averick, as she walks us through cleaning up very messy (but timely) Olympic Pentathalon data.

For those that don’t know, Mara is a Developer Advocate for R Studio and is an a fantastic Twitter follow. She is always sharing useful R content and helping others in the R community solve difficult problems with their code.

To watch the screen cast, CLICK HERE.

To access Mara’s code, CLICK HERE.

TidyX 68: Cleaning Messy Data, Part 5 – Writing Results to an External File

For the last 4 screen casts, Ellis Hughes and I have been showing methods of cleaning messy excel data. Now that you have the data cleaned and structured in a data frame, you might need to write that data back out to an external file. Perhaps you need to save it as a csv and upload it to a database or maybe you want to write it out to a cleaner excel file that has the data in an accessible manner for your end users.

This week, we will write our cleaned data out to csv and excel in four formats:

  1. Write to csv
  2. Write the flat file out to a single excel sheet
  3. Write the flat file out to individual athlete tabs within an excel sheet
  4. Write the flat file out to individual athlete tabs within an excel sheet along with a bunch of custom excel formatting, available via the {openxlsx} package.

To watch the screen cast, CLICK HERE.

To access our code, CLICK HERE.

 

Force Decks – Force Plate Shiny Dashboard

Last week, two of the data scientists at Vald Performance, Josh Ruddy and Nick Murray, put out a free online tutorial on how to create a force plate reports using R with data from their Force Decks software.

It was a nice tutorial to give an overview of some of the power behind ggplot2 and the suite of packages that come with tidyverse. Since they made the data available (in the link above), I decided to pull it down and put together a quick shiny app for those that might be interested in extending the report to an interactive web app.

This isn’t the first time I’ve build a shiny app for the blog using force plate data. Interested readers might want to check out my post from a year ago where I built a shiny interactive report for force-velocity profiling.

You can watch a short preview of the end product in the below video link and the screen shots below the link show a static view of what the final shiny App will look like.

A few key features:

  1. App always defaults to the most recent testing day on the testDay tab.
  2. The user can select the position group at the top and that position group will be maintained across all tabs. For example, if you select Forwards, when you switch between tabs one and two, forwards will always be there.
  3. The time series plots on the Player Time Series tab are done using plotly, so they are interactive, allowing the user to hover over each test session and see the change from week-to-week in the tool tip. When the change exceeds the meaningful change, the point turns red. Finally, because it is plotly, the user can slice out specific dates that they want to look at (as you can see me do in the video example), which comes in handy when there are a large number of tests over time.

All code and data s accessible through my GitHub page.

vald_shiny_app

Loading and preparing the data

  • I load the data in using read.csv() and file.choose(), so navigate to wherever you have the data on your computer and select it.
  • There is some light cleaning to change the date in to a date variable. Additionally, there were no player positions in the original data set, so I just made some up and joined those in.

### packages ------------------------------------------------------------------
library(tidyverse)
library(lubridate)
library(psych)
library(shiny)
library(plotly)

theme_set(theme_light())

### load & clean data ---------------------------------------------------------
cmj <- read.csv(file.choose(), header = TRUE) %>%
  janitor::clean_names() %>%
  mutate(date = dmy(date))

player_positions <- data.frame(name = unique(cmj$name),
                               position = c(rep("Forwards", times = 15),
                                            rep("Mids", times = 15),
                                            rep("Backs", times = 15)))

# join position data with jump data
cmj <- cmj %>%
  inner_join(player_positions)

 

Determining Typical Error and Meaningful Change

  • In this example, I’ll just pretend as if the first 2 sessions represented our test-retest data and I’ll work from there.
  • Typical Error Measurement (TEM) was calculated as the standard deviation of differences between test 1 and 2 divided by the square root of 2.
  • For the meaningful change, instead of using 0.2 (the commonly used smallest worthwhile change multiplier) I decided to use a moderate change (0.6), since 0.2 is such a small fraction of the between subject SD.
  • For info on these two values, I covered them in a blog post last week using Python and a paper Anthony Turner and colleagues wrote.

change_standards <- cmj %>%
  group_by(name) %>%
  mutate(test_id = row_number()) %>%
  filter(test_id < 3) %>%
  select(name, test_id, rel_con_peak_power) %>%
  pivot_wider(names_from = test_id,
              names_prefix = "test_",
              values_from = rel_con_peak_power) %>%
  mutate(diff = test_2 - test_1) %>%
  ungroup() %>%
  summarize(TEM = sd(diff) / sqrt(2),
            moderate_change = 0.6 * sd(c(test_1, test_2)))

Building the Shiny App

  • In the user interface, I first create my sidebar panel, allowing the user to select the position group of interest. You’ll notice that this sidebar panel is not within the tab panels, which is why it stands alone and allows us to select a position group that will be retained across all tabs.
  • Next, I set up 2 tabs. Notice that in the first tab (testDay) I include a select input, to allow the user to select the date of interest. In the selected argument I tell shiny to always select the max(cmj$date) so that the most recent session is always shown to the user.
  • The server is pretty straight forward. I commented out where each tab data is built. Basically, it is just taking the user specified information and performing simple data filtering and then ggplot2 charts to provide us with the relevant information.
  • On the testDay plot, we use the meaningful change to shade the region around 0 in grey and we use the TEM around the athlete’s observed performance on a given day to specify the amount of error that we might expect for the test.
  • One the Player Time Series plot we have the athlete’s average line and ┬▒1 SD lines to accompany their data, with points changing color when the week-to-week change exceeds out meaningful change.
### Shiny App -----------------------------------------------------------------------------

## Set up user interface

ui <- fluidPage(
  
  ## set title of the app
  titlePanel("Team CMJ Analysis"),
  
  ## create a selection bar for position group that works across all tabs
  sidebarPanel(
    selectInput(inputId = "position",
                label = "Select Position Group:",
                choices = unique(cmj$position),
                selected = "Backs",
                multiple = FALSE),
    width = 2
  ),
  
  ## set up 2 tabs: One for team daily analysis and one for player time series
  tabsetPanel(
    
    tabPanel(title = "testDay",
             
             selectInput(inputId = "date",
                         label = "Select Date:",
                         choices = unique(cmj$date)[-1],
                         selected = max(cmj$date),
                         multiple = FALSE),
             
             mainPanel(plotOutput(outputId = "day_plt", width = "100%", height = "650px"),
                       width = 12)),
    
    tabPanel(title = "Player Time Series",
             
             mainPanel(plotlyOutput(outputId = "player_plt", width = "100%", height = "700px"),
                       width = 12))
  )
  
)


server <- function(input, output){
  
  ##### Day plot tab ####
  ## day plot data
  day_dat <- reactive({
    
    d <- cmj %>%
      group_by(name) %>%
      mutate(change_power = rel_con_peak_power - lag(rel_con_peak_power)) %>%
      filter(date == input$date,
             position == input$position)
    
    d
    
  })
  
  ## day plot
  output$day_plt <- renderPlot({ day_dat() %>%
      ggplot(aes(x = reorder(name, change_power), y = change_power)) +
      geom_rect(aes(ymin = -change_standards$moderate_change, ymax = change_standards$moderate_change),
                xmin = 0,
                xmax = Inf,
                fill = "light grey",
                alpha = 0.6) +
      geom_hline(yintercept = 0) +
      geom_point(size = 4) +
      geom_errorbar(aes(ymin = change_power - change_standards$TEM, ymax = change_power + change_standards$TEM),
                    width = 0.2,
                    size = 1.2) +
      theme(axis.text.x = element_text(angle = 60, vjust = 1, hjust = 1),
            axis.text = element_text(size = 16, face = "bold"),
            axis.title = element_text(size = 18, face = "bold"),
            plot.title = element_text(size = 22)) +
      labs(x = NULL,
           y = "Weekly Change",
           title = "Week-to-Week Change in Realtive Concentric Peak Power")
    
  })
  
  ##### Player plot tab ####
  ## player plot data
  
  player_dat <- reactive({
    
    d <- cmj %>%
      group_by(name) %>%
      mutate(avg = mean(rel_con_peak_power),
             sd = sd(rel_con_peak_power),
             change = rel_con_peak_power - lag(rel_con_peak_power),
             change_flag = ifelse(change >= change_standards$moderate_change | change <= -change_standards$moderate_change, "Flag", "No Flag")) %>%
      filter(position == input$position)
    
    d
  })
  
  ## player plot
  output$player_plt <- renderPlotly({
    
    plt <- player_dat() %>%
      ggplot(aes(x = date, y = rel_con_peak_power, label = change)) +
      geom_rect(aes(ymin = avg - sd, ymax = avg + sd),
                xmin = 0,
                xmax = Inf,
                fill = "light grey",
                alpha = 0.6) +
      geom_hline(aes(yintercept = avg - sd),
                 color = "black",
                 linetype = "dashed",
                 size = 1.2) +
      geom_hline(aes(yintercept = avg + sd),
                 color = "black",
                 linetype = "dashed",
                 size = 1.2) +
      geom_hline(aes(yintercept = avg), size = 1) +
      geom_line(size = 1) +
      geom_point(shape = 21,
                 size = 3,
                 aes(fill = change_flag)) +
      facet_wrap(~name) +
      scale_fill_manual(values = c("red", "black", "black")) +
      theme(axis.text = element_text(size = 13, face = "bold"),
            axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1),
            plot.title = element_text(size = 18),
            strip.background = element_rect(fill = "black"),
            strip.text = element_text(size = 13, face = "bold"),
            legend.position = "none") +
      labs(x = NULL,
           y = NULL,
           title = "Relative Concentric Peak Power")
    
    ggplotly(plt)
    
  })
  
  
}



shinyApp(ui, server)

TidyX 67: Cleaning Messy Data, Part 4 – Viewer Submission

In Part 4 of our Cleaning Messy Data series Ellis Hughes and I are FINALLY shooting an episode in person again!

This week, we received a viewer submission (opened at issue on our GitHub repo) asking for some help in cleaning up UK Covid19 data. The data is downloaded from a government webpage and comes with a number of excel tabs. In this episode we will deal with the weekly tracking data and show how to go from an excel sheet, that reflects an intuitive way to store data if you are a decision-maker trying to look at it each week and gain insight, into a more usable format that a data scientist can work with.

To watch our screen cast, CLICK HERE.

To access the data and our code, CLICK HERE.

Doing things in Python that you’d normally do in Excel – Data Analysis for Strength & Conditioning Coaches (Turner et al., 2015)

My previous blog on, Doing things in Python that you would normally do in Excel, got some nice feedback and seemed to be useful to a number of folks. As such, I’ve decided to continue on with that theme and put together a Python approach for constructing the stats in Anthony Turner‘s paper, Data Analysis for Strength and Conditioning Coaches: Using Excel to Analyze Reliability, Differences and Relationships.

In the paper, Anthony works through a few examples of calculating things like Smallest Worthwhile Change, Typical Error Measurement, and Cohen’s d Effect Size for CMJ, RSI, and Pro Agility. For the sake of brevity, my tutorial will only work through the CMJ data.

In his paper, Anthony walks through how a strength coach can apply this type of analysis very simply in Excel. I use his data to construct the same analysis in Python and end up with a data frame that represents each athlete on the team, their 3 CMJ’s, a goal/training target based on their SWC, and their Error Measurement.

 

Some things I cover in the tutorial that might be of interest to folks just starting out with Python:

    1. Adding new columns to a data frame while calculating summary summary statistics row-wise (e.g., working across each individual athlete’s row instead of calculating summary statistics over an entire column, which is how we commonly do it).
    2. Writing a custom function. In this tutorial I write a custom function to calculate the average and standard deviation across rows so that we only need one line of code to extract the information we are interested in.
    3. Step-by-step calculation of Cohen’s d Effect Size in Python.

Building your own data frames. I manually build Anthony’s data from the paper into a data frame. I also build a data frame at the end of the tutorial for Cohen’s d Effect Size interpretation.

To access my Jupyter Notebook, go to my Github page, HERE.