Category Archives: R Tips & Tricks

Simulating preferential attachment in R

I’m currently re-reading Michael Mauboussin’s The Success Equation. The book is a discussion of the role both skill and luck play in business and sport success. On page 118, Mauboussin discusses the Matthew Effect. The Matthew Effect, a term coined by sociologist Robert Merton, comes from a passage in the Gospel of Matthew:

“For whosoever hath, to him shall be given, and he shall have more abundance: but whosoever hath not, from him shall be taken away even that he hath.”

In a nutshell, the Matthew Effect describes the phenomenon of “the rich get richer and the poor get poorer”.

Mauboussin goes on to provide an example of two graduate students, both with equal ability. Following graduation, the two students are applying for faculty positions. One is hired by an Ivy League university while the other goes to work at a less prestigious university. The Ivy League professor has a wonderful opportunity, with perhaps more qualified students, higher-caliber faculty peers, and more funding for research. Such an opportunity leads to more scientific publications and greater recognition and accolades in comparison to their peer at the less prestigious university.

As Mauboussin says, “initial conditions matter”. Both students had the same level of skill but different levels of luck. Student one’s initial condition of obtaining a faculty position at an Ivy League university set her up for better opportunities in her professional career, despite her not being any more talented than student two.

Such an example applies in many areas of our lives, not just sport and business. For example, in the educational sector, some students may grow up in areas of the country where the public school environment does not provide the same educational experience that more affluent regions might. These students may not be any less intelligent than their peers; however, their initial conditions are not the same, which ultimately influences how the rest of their life opportunities turn out and how things look at the finish line.

Luck ends up playing a big role in our lives and the starting line isn’t the same for everyone. Mauboussin refers to this as preferential attachment, whereby the more connections you start with in life, the more new connections you are able to make. To show this concept, Mauboussin creates a simple game of drawing marbles from a jar (pg. 119):

We have a jar filled with the following marbles:

  • 5 red
  • 4 black
  • 3 yellow
  • 2 green
  • 1 blue

You close your eyes and select a marble at random. You then place that marble back in the jar and add one more marble of the same color. For example, let’s say you reach in and grab a yellow marble. You put the yellow marble back in the jar and add one more yellow marble so that there are now 4 yellow marbles in the jar. You repeat this game 100 times.

We can clearly see that, starting out, some marbles have a higher chance of being selected than others. For example, there is a 33.3% chance (5/15) of selecting a red marble and only a 6.7% chance (1/15) of selecting a blue marble. The kicker is that, because of the difference in starting points, as you select red marbles you also add more red marbles, increasing the probability of selecting red marbles in the future even further! The red and black marbles begin with a higher number of connections than the other marbles, and thus over time their wealth in connections grows larger.

Let’s see what this looks like in an R simulation!

First, we create our initial starting values for the marbles in the jar:
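One simple way to set this up (the variable name jar is my choice here, not necessarily what the GitHub code uses) is to represent the jar as a character vector with one element per marble:

```r
# the jar: 5 red, 4 black, 3 yellow, 2 green, 1 blue marbles
jar <- rep(c("red", "black", "yellow", "green", "blue"),
           times = c(5, 4, 3, 2, 1))

table(jar)
```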

Let’s play the game one time and see how it works. We reach in, grab a marble at random, and whatever color we get, we will add an additional marble of that same color back to the jar.
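A single trial might be sketched as follows (the draw is random, so the color you get will vary from run to run):

```r
# initial jar: 5 red, 4 black, 3 yellow, 2 green, 1 blue
jar <- rep(c("red", "black", "yellow", "green", "blue"),
           times = c(5, 4, 3, 2, 1))

# reach in and grab one marble at random
draw <- sample(jar, size = 1)

# put it back along with one more marble of the same color
jar <- c(jar, draw)

draw
table(jar)
```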


In this trial, we selected a green marble. Therefore, there are now 3 green marbles in the jar instead of 2.

If we were to do this 100 times, it would be pretty tedious. Instead, we will write a for() loop that can play out the game for us, each time selecting a marble at random and then adding an additional marble to the jar of the same color.
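That loop can be sketched like this (the seed is arbitrary and only there so a run is reproducible; it is not part of the original game):

```r
# initial jar: 5 red, 4 black, 3 yellow, 2 green, 1 blue
jar <- rep(c("red", "black", "yellow", "green", "blue"),
           times = c(5, 4, 3, 2, 1))

set.seed(2021)   # arbitrary seed, for reproducibility only
n_trials <- 100

for(i in 1:n_trials){
  draw <- sample(jar, size = 1)   # select a marble at random
  jar <- c(jar, draw)             # return it, plus one more of the same color
}
```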

After running the loop of 100 trials, we end up observing the following number and proportion for each marble color:
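The counts and proportions can be pulled with table() and prop.table() (the simulation is repeated here so the chunk stands alone; your exact numbers will differ because the draws are random):

```r
jar <- rep(c("red", "black", "yellow", "green", "blue"),
           times = c(5, 4, 3, 2, 1))

set.seed(2021)
for(i in 1:100){
  jar <- c(jar, sample(jar, size = 1))
}

# number and proportion of each marble color after 100 trials
table(jar)
round(prop.table(table(jar)), 3)
```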

Notice that when we started, 26.7% of the marbles were black and 6.7% were blue. After 100 trials of our game, black now makes up 32% of the population while blue is only at 7%. Remember, these are random samples, so it is pure chance as to which marble we select in each trial of the game. However, the initial conditions were more favorable for black and less favorable for blue, creating different ending points.

We can take our simulated data and build a plot of the trials, recreating the plot that Mauboussin shows on page 121:
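Here is a sketch of one way to build that plot: track the count of each color after every trial and draw the trajectories with {ggplot2}. (The matrix bookkeeping is my own approach, not necessarily what is on GitHub; conveniently, the marble names are also valid R color names, so scale_color_identity() colors each line to match its marble.)

```r
library(tidyverse)

colors <- c("red", "black", "yellow", "green", "blue")
jar <- rep(colors, times = c(5, 4, 3, 2, 1))

set.seed(2021)
n_trials <- 100

# track the count of each color at every trial (row 1 = trial 0)
counts <- matrix(NA, nrow = n_trials + 1, ncol = length(colors),
                 dimnames = list(NULL, colors))
counts[1, ] <- table(factor(jar, levels = colors))

for(i in 1:n_trials){
  jar <- c(jar, sample(jar, size = 1))
  counts[i + 1, ] <- table(factor(jar, levels = colors))
}

as.data.frame(counts) %>%
  mutate(trial = 0:n_trials) %>%
  pivot_longer(cols = -trial, names_to = "marble", values_to = "n") %>%
  ggplot(aes(x = trial, y = n, color = marble)) +
  geom_line(size = 1) +
  scale_color_identity() +
  labs(x = "Trial", y = "Number of Marbles",
       title = "Simulated Preferential Attachment")
```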

The visual is not exactly what you see on pg. 121 because this is a random simulation. But we can see how each marble grows over time based on its starting point (which you will notice is different on the y-axis at trial number 0 – the initial number of marbles in the jar).

If you run this code yourself, you will get a slightly different outcome as well. Try it a few times and see how random luck changes the final position of each marble. Increase the number of trials from 100 to 1,000 or 10,000 and see what happens! Simulations like this provide an interesting opportunity to understand the world around us.

The code for creating the simulation and visual is available on my GitHub page.

TidyX 69: Cleaning Messy Data, Part 6 – Special Guest Mara Averick

To round out our Cleaning Messy Data series, Ellis Hughes and I are joined by R superstar, Mara Averick, as she walks us through cleaning up very messy (but timely) Olympic Pentathlon data.

For those that don’t know, Mara is a Developer Advocate for RStudio and a fantastic Twitter follow. She is always sharing useful R content and helping others in the R community solve difficult problems with their code.

To watch the screen cast, CLICK HERE.

To access Mara’s code, CLICK HERE.

Force Decks – Force Plate Shiny Dashboard

Last week, two of the data scientists at Vald Performance, Josh Ruddy and Nick Murray, put out a free online tutorial on how to create force plate reports using R with data from their Force Decks software.

It was a nice tutorial to give an overview of some of the power behind ggplot2 and the suite of packages that come with tidyverse. Since they made the data available (in the link above), I decided to pull it down and put together a quick shiny app for those that might be interested in extending the report to an interactive web app.

This isn’t the first time I’ve built a shiny app for the blog using force plate data. Interested readers might want to check out my post from a year ago, where I built a shiny interactive report for force-velocity profiling.

You can watch a short preview of the end product in the below video link and the screen shots below the link show a static view of what the final shiny App will look like.

A few key features:

  1. App always defaults to the most recent testing day on the testDay tab.
  2. The user can select the position group at the top and that position group will be maintained across all tabs. For example, if you select Forwards, when you switch between tabs one and two, forwards will always be there.
  3. The time series plots on the Player Time Series tab are built with plotly, so they are interactive, allowing the user to hover over each test session and see the change from week to week in the tooltip. When the change exceeds the meaningful change, the point turns red. Finally, because it is plotly, the user can slice out specific dates to look at (as you can see me do in the video example), which comes in handy when there are a large number of tests over time.

All code and data is accessible through my GitHub page.


Loading and preparing the data

  • I load the data in using read.csv() and file.choose(), so navigate to wherever you have the data on your computer and select it.
  • There is some light cleaning to convert the date column into a date variable. Additionally, there were no player positions in the original data set, so I just made some up and joined them in.

### packages ------------------------------------------------------------------
library(tidyverse)
library(lubridate)
library(shiny)
library(plotly)

### load & clean data ---------------------------------------------------------
cmj <- read.csv(file.choose(), header = TRUE) %>%
  janitor::clean_names() %>%
  mutate(date = dmy(date))

player_positions <- data.frame(name = unique(cmj$name),
                               position = c(rep("Forwards", times = 15),
                                            rep("Mids", times = 15),
                                            rep("Backs", times = 15)))

# join position data with jump data
cmj <- cmj %>%
  left_join(player_positions, by = "name")

Determining Typical Error and Meaningful Change

  • In this example, I’ll just pretend as if the first 2 sessions represented our test-retest data and I’ll work from there.
  • The typical error of measurement (TEM) was calculated as the standard deviation of the differences between test 1 and test 2, divided by the square root of 2.
  • For the meaningful change, instead of using 0.2 (the commonly used smallest worthwhile change multiplier) I decided to use a moderate change (0.6), since 0.2 is such a small fraction of the between subject SD.
  • For more info on these two values, I covered them in a blog post last week using Python, working from a paper by Anthony Turner and colleagues.

change_standards <- cmj %>%
  group_by(name) %>%
  mutate(test_id = row_number()) %>%
  filter(test_id < 3) %>%
  select(name, test_id, rel_con_peak_power) %>%
  pivot_wider(names_from = test_id,
              names_prefix = "test_",
              values_from = rel_con_peak_power) %>%
  mutate(diff = test_2 - test_1) %>%
  ungroup() %>%
  summarize(TEM = sd(diff) / sqrt(2),
            moderate_change = 0.6 * sd(c(test_1, test_2)))

Building the Shiny App

  • In the user interface, I first create my sidebar panel, allowing the user to select the position group of interest. You’ll notice that this sidebar panel is not within the tab panels, which is why it stands alone and allows us to select a position group that will be retained across all tabs.
  • Next, I set up 2 tabs. Notice that in the first tab (testDay) I include a select input, to allow the user to select the date of interest. In the selected argument I tell shiny to always select the max(cmj$date) so that the most recent session is always shown to the user.
  • The server is pretty straightforward. I’ve commented where the data for each tab is built. Basically, it takes the user-specified information, performs simple data filtering, and then builds ggplot2 charts to provide us with the relevant information.
  • On the testDay plot, we use the meaningful change to shade the region around 0 in grey and we use the TEM around the athlete’s observed performance on a given day to specify the amount of error that we might expect for the test.
  • On the Player Time Series plot we have the athlete’s average line and ±1 SD lines to accompany their data, with points changing color when the week-to-week change exceeds our meaningful change.
### Shiny App -----------------------------------------------------------------------------

## Set up user interface

ui <- fluidPage(
  ## set title of the app
  titlePanel("Team CMJ Analysis"),
  ## create a selection bar for position group that works across all tabs
  sidebarPanel(
    selectInput(inputId = "position",
                label = "Select Position Group:",
                choices = unique(cmj$position),
                selected = "Backs",
                multiple = FALSE),
    width = 2),
  ## set up 2 tabs: one for team daily analysis and one for player time series
  tabsetPanel(
    tabPanel(title = "testDay",
             selectInput(inputId = "date",
                         label = "Select Date:",
                         choices = unique(cmj$date)[-1],
                         selected = max(cmj$date),
                         multiple = FALSE),
             mainPanel(plotOutput(outputId = "day_plt", width = "100%", height = "650px"),
                       width = 12)),
    tabPanel(title = "Player Time Series",
             mainPanel(plotlyOutput(outputId = "player_plt", width = "100%", height = "700px"),
                       width = 12))
  )
)

server <- function(input, output){

  ##### Day plot tab ####
  ## day plot data
  day_dat <- reactive({
    cmj %>%
      group_by(name) %>%
      mutate(change_power = rel_con_peak_power - lag(rel_con_peak_power)) %>%
      filter(date == input$date,
             position == input$position)
  })

  ## day plot
  output$day_plt <- renderPlot({
    day_dat() %>%
      ggplot(aes(x = reorder(name, change_power), y = change_power)) +
      geom_rect(aes(ymin = -change_standards$moderate_change, ymax = change_standards$moderate_change),
                xmin = 0,
                xmax = Inf,
                fill = "light grey",
                alpha = 0.6) +
      geom_hline(yintercept = 0) +
      geom_point(size = 4) +
      geom_errorbar(aes(ymin = change_power - change_standards$TEM, ymax = change_power + change_standards$TEM),
                    width = 0.2,
                    size = 1.2) +
      theme(axis.text.x = element_text(angle = 60, vjust = 1, hjust = 1),
            axis.text = element_text(size = 16, face = "bold"),
            axis.title = element_text(size = 18, face = "bold"),
            plot.title = element_text(size = 22)) +
      labs(x = NULL,
           y = "Weekly Change",
           title = "Week-to-Week Change in Relative Concentric Peak Power")
  })

  ##### Player plot tab ####
  ## player plot data
  player_dat <- reactive({
    cmj %>%
      group_by(name) %>%
      mutate(avg = mean(rel_con_peak_power),
             sd = sd(rel_con_peak_power),
             change = rel_con_peak_power - lag(rel_con_peak_power),
             change_flag = ifelse(change >= change_standards$moderate_change | change <= -change_standards$moderate_change, "Flag", "No Flag")) %>%
      filter(position == input$position)
  })

  ## player plot
  output$player_plt <- renderPlotly({
    plt <- player_dat() %>%
      ggplot(aes(x = date, y = rel_con_peak_power, label = change)) +
      geom_rect(aes(ymin = avg - sd, ymax = avg + sd),
                xmin = 0,
                xmax = Inf,
                fill = "light grey",
                alpha = 0.6) +
      geom_hline(aes(yintercept = avg - sd),
                 color = "black",
                 linetype = "dashed",
                 size = 1.2) +
      geom_hline(aes(yintercept = avg + sd),
                 color = "black",
                 linetype = "dashed",
                 size = 1.2) +
      geom_hline(aes(yintercept = avg), size = 1) +
      geom_line(size = 1) +
      geom_point(shape = 21,
                 size = 3,
                 aes(fill = change_flag)) +
      facet_wrap(~name) +
      scale_fill_manual(values = c("red", "black", "black")) +
      theme(axis.text = element_text(size = 13, face = "bold"),
            axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1),
            plot.title = element_text(size = 18),
            strip.background = element_rect(fill = "black"),
            strip.text = element_text(size = 13, face = "bold"),
            legend.position = "none") +
      labs(x = NULL,
           y = NULL,
           title = "Relative Concentric Peak Power")

    ggplotly(plt)
  })
}

shinyApp(ui, server)

TidyX 67: Cleaning Messy Data, Part 4 – Viewer Submission

In Part 4 of our Cleaning Messy Data series Ellis Hughes and I are FINALLY shooting an episode in person again!

This week, we received a viewer submission (opened as an issue on our GitHub repo) asking for some help in cleaning up UK COVID-19 data. The data is downloaded from a government webpage and comes with a number of Excel tabs. In this episode we deal with the weekly tracking data and show how to go from an Excel sheet that reflects an intuitive way to store data (if you are a decision-maker trying to look at it each week and gain insight) to a more usable format that a data scientist can work with.

To watch our screen cast, CLICK HERE.

To access the data and our code, CLICK HERE.

R Tips & Tricks: Creating a Multipage PDF with {ggplot2}

I was recently asked by a colleague for a simple solution to produce a multipage PDF for a training load report. The colleague wanted the report to generate a new page for each position group.

There are a few ways to solve this problem with loops, but below is literally the easiest approach you can take.

First, we will load in two packages that we need, {tidyverse}, for data manipulation and visualization, and {patchwork}, for organizing multiple plots on a page. Additionally, I’ll create a z-score function so that we can standardize the training load variables for each individual (In my opinion, it makes these types of charts look nicer when the data is on the same scale since athletes within the same position group can sometimes have very different training responses).

## load packages and custom z-score function

library(tidyverse)
library(patchwork)

z_score <- function(x){
  z = (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
  return(z)
}

Next, we will just simulate some fake training data.


## simulate data
athlete <- rep(LETTERS[1:10], each = 10)
pos <- rep(c("DB", "LB", "DL", "OL", "WR"), each = 20)
week <- rep(1:10, times = 10)
Total_Dist <- round(rnorm(n = length(athlete), mean = 3200, sd = 400), 0) 
HSR <- round(rnorm(n = length(athlete), mean = 450, sd = 100), 0)

df <- data.frame(athlete, pos, week, Total_Dist, HSR)
df %>% head()


Let’s go ahead and apply our z-score function to our two training variables, Total Distance (Total_Dist) and High Speed Running (HSR). Notice that I group by “athlete” to ensure that the mean and standard deviation used to normalize each variable is specific to the individual and not the entire population.


df <- df %>%
  group_by(athlete) %>%
  mutate(TD_z = z_score(Total_Dist),
         HSR_z = z_score(HSR))


Now we need to make a function that will create the plots we want. The code below can look a little intimidating, so here are a few points to help you wrap your head around it:

  • It is literally just two {ggplot2} plots. All I did was store each one in their own object (so that we could pair them together with {patchwork} and wrap them inside of this function).
  • The easiest way to get used to doing this is to write your {ggplot2} plots out as you normally would (as if you were creating them for a single position group). When you have the plot built to your specifications then just wrap it into a function. The argument for the function should take the value that you want to iterate over. In this case, we want to create plots for each position group, so I call the argument “POS”, short for position. When I run that function I provide the “POS” argument with the abbreviation for the position group I am interested in and the function will do the rest. The function works in this manner because you’ll notice that the second line of each of the plots is a filter that is specifically pulling out the position group of interest from the original data set.
  • The final line of the function creates an element called “plots”. You’ll see that the element consists of the two plots that we created above it and they are separated by a “|”. This vertical bar is just telling the {patchwork} package to place one plot right next to the other.
### Build a function for the plots to loop over position group

plt_function <- function(POS){
  dist_plt <- df %>%
    filter(pos == POS) %>%
    ggplot(aes(x = as.factor(week), y = TD_z, group = 1)) +
    geom_hline(yintercept = 0) +
    geom_line(size = 1) +
    geom_area(fill = "light green", 
              alpha = 0.7) +
    facet_wrap(~athlete) +
    theme_bw() +
    theme(axis.text.x = element_text(size = 9, face = "bold"),
          axis.text.y = element_text(size = 9, face = "bold"),
          strip.background = element_rect(fill = "black"),
          strip.text = element_text(color = "white", face = "bold", size = 8)) +
    labs(x = "",
         y = "Total Distance",
         title = "Weekly Training Distance",
         subtitle = paste("Position", POS, sep = " = ")) +
    ylim(c(-3.5, 3.5))
  hsr_plt <- df %>%
    filter(pos == POS) %>%
    ggplot(aes(x = as.factor(week), y = HSR_z, group = 1)) +
    geom_hline(yintercept = 0) +
    geom_line(size = 1) +
    geom_area(fill = "light green", 
              alpha = 0.7) +
    facet_wrap(~athlete) +
    theme_bw() +
    theme(axis.text.x = element_text(size = 9, face = "bold"),
          axis.text.y = element_text(size = 9, face = "bold"),
          strip.background = element_rect(fill = "black"),
          strip.text = element_text(color = "white", face = "bold", size = 8)) +
    labs(x = "",
         y = "HSR",
         title = "Weekly HSR",
         subtitle = paste("Position", POS, sep = " = ")) +
    ylim(c(-3.5, 3.5))
  plots <- dist_plt | hsr_plt

  return(plots)
}


Let’s try out the function on just one group. We will pass the POS argument the abbreviation “DB”, for the defensive backs group.


# try out the function

plt_function(POS = "DB")


It worked!!

Okay, now let’s create our multipage PDF report. To do this, all we need to do is run the above line of code for each of our position groups. To ensure that we get each position plot into the PDF, we begin the code chunk with the pdf() function. It is here that we specify the width and height of the plot page within the PDF itself (NOTE: you may need to play around with this depending on what your plots look like). We can also name the PDF report. Here I just called it “Team.pdf”. Finally, after running the line of code for each position group plot, we run the dev.off() function, which shuts down the specified PDF device so that R knows that we are done making plots.


## create a multipage pdf with each page representing a position group

pdf(width = 12, height = 8, "Team.pdf")
plt_function(POS = "DB")
plt_function(POS = "LB")
plt_function(POS = "DL")
plt_function(POS = "OL")
plt_function(POS = "WR")
dev.off()


And that’s it! We end up with a 5 page PDF that has a different position group on each page.



If you want to see the finished product, click here: Team

The full code is on my GitHub page. CLICK HERE