Category Archives: R Tips & Tricks

TidyX Episode 163: Advanced Shiny – Creating dynamic links to webpages using DT

In this episode, Ellis Hughes and I use the {DT} package within a {shiny} app and scrape data from hockey-reference.com. We show how you can build a shiny web app that allows you to insert an HTML link into the data table so the user can link directly out to the original website if they want to obtain extra data from the source itself.

To access our code, CLICK HERE.

To watch our screencast, CLICK HERE.

Simulations in R Part 1: Functions for Simulation & Resampling

Simulating data is something I find myself doing all the time: not only to explore uncertainty in data, but also to examine model assumptions, understand how models behave under different circumstances, and get a sense of how a future analysis might work given some underlying data generating process. Thus, I decided to put together a series on simulations and resampling using R (I’ll also add a few analog scripts using Python to the GitHub repository).

In Part 1, I’ll provide some thoughts around why you might want to simulate or resample data and then show how you can simply do this in R. Additionally, I’ll walk through several helper functions for conducting and summarizing simulations/resamples as well as some basics around for() and while() loops, as we will use these extensively in our simulation and resampling processes.

My GitHub repository will contain all of the scripts in this series.

Why do we simulate or resample data?

  • The data generating process is what defines the properties of our data and dictates the type of distribution we are dealing with. For example, the mean and standard deviation reflect the two parameters of the data generating process for a normal distribution. We rarely know what the data generating process of our data is in the real world, thus we must infer it from our sample data. Both resampling and simulation offer methods of understanding the data generating process of data.
  • Sample data represents a small sliver of what might be occurring in the broader population. Using resampling and simulation, we are able to build larger data sets based on information contained in the sample data. Such approaches allow us to explore our uncertainty around what we have observed in our sample and the inferences we might be able to make about that larger population.
  • Creating samples of data allows us to assess patterns in the data and evaluate those patterns under different circumstances, which we can directly program.
  • By coding a simulation, we are able to reflect a desired data generating process, allowing us to evaluate assumptions or limitations of data that we have collected or are going to collect.
  • The world is full of randomness, meaning that every observation we make comes with some level of uncertainty. The uncertainty that we have about the true value of our observation can be expressed via various probability distributions. Resampling and simulation are ways that we can mimic this randomness in the world and help calibrate our expectations about the probability of certain events or observations occurring.

Difference between resampling and simulation

Resampling and simulation are both useful for generating data sets and reflecting uncertainty. However, they accomplish this task in different ways.

  • Resampling refers to techniques that take the observed sample data and randomly draw observations from that data to construct a new data set. This is often done thousands of times, building thousands of new data sets, and summary statistics are then produced on those data sets as a means of understanding their data generating properties.
  • Simulation works by assuming a data generating process (e.g., making a best guess or estimating a plausible mean and standard deviation for the population from previous literature) and then randomly generating multiple samples of data from that assumed process (see the sketch after this list).
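
To make the distinction concrete, here is a minimal sketch (not code from the original post; the observed values are made up for illustration):

set.seed(1)
observed <- c(12, 15, 11, 18, 14)

# Resampling: randomly draw, with replacement, from the observed data itself
resampled <- sample(observed, size = length(observed), replace = TRUE)

# Simulation: assume a data generating process (here, a normal distribution with
# a mean and SD estimated from the observed sample) and randomly draw from it
simulated <- rnorm(n = length(observed), mean = mean(observed), sd = sd(observed))

resampled
simulated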

Sampling from common distributions

To work with a distribution in R we can use any one of the four primary prefixes (r for random draws, d for the density or probability mass, p for the cumulative probability, and q for quantiles), which define the type of information we want returned about the distribution, followed by the suffix that defines the distribution we are interested in (e.g., norm, binom, unif).

Here is a helpful cheat sheet I put together for some of the common distributions one might use:

Some examples:

# The cumulative probability (CDF) that a standard normal random variable is less than or equal to 1.645 (about 95%)
pnorm(q = 1.645, mean = 0, sd = 1)

# What is the exact probability (PMF) of getting 1 heads when we flip 10 coins, each with a 50% chance of heads or tails?
dbinom(x = 1, size = 10, prob = 0.5)

# What is the z-score for the 95th percentile when the data is Normal(0, 1)?
qnorm(p = 0.95, mean = 0, sd = 1)

# randomly draw 10 values from a uniform distribution with a min of 5 and max of 10
runif(n = 10, min = 5, max = 10)

We can simulate different distributions, and properties of those distributions, using these functions. For several examples of different distributions, see the GitHub code. Below, we draw 1,000 random observations from a normal distribution with a mean of 30 and a standard deviation of 15 and plot the results.

## set the seed for reproducibility
set.seed(10)
norm_dat <- rnorm(n = 1000, mean = 30, sd = 15)

hist(norm_dat,
     main = "Random Simulation from a Normal Distribution",
     xlab = "N(30, 15^2)")

We can produce a number of summary statistics on this vector of random values:

# sample size
length(norm_dat)

# mean, standard deviation, and variance
mean(norm_dat)
sd(norm_dat)
var(norm_dat)

# median, median absolute deviation
median(norm_dat)
mad(norm_dat)
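
We can also grab a quantile interval with quantile() (an addition to the summaries above):

# 95% quantile interval
quantile(norm_dat, probs = c(0.025, 0.975))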

for & while loops

Typically, we are going to want to resample data more than once or run multiple simulations. Often, we will want to do this thousands of times. We can use R to help us in this endeavor by programming for() and while() loops to do the heavy lifting for us and store the results in a convenient format (e.g., vector, data frame, matrix, or list) so that we can summarize them later.

for loops

for() loops are an easy way to tell R that we want it to perform some task for a specified number of iterations.

For example, let’s create a for() loop that adds 5 to every value from 1 to 10, for(i in 1:10).

# program the loop to add 5 to every value from 1:10
for(i in 1:10){
  
  print(i + 5)
  
}

We notice that the result is printed directly to the console. If we are doing thousands of iterations, or if we want to store the results to plot and summarize them later, this won’t be a good option. Instead, we can allocate an empty vector or data frame to store these values.

## storing values as vector
n <- 10
vector_storage <- rep(NA, times = n)

for(i in 1:n){
  vector_storage[i] <- i + 5
}

vector_storage

## store results back to a data frame
n <- 10
df_storage <- data.frame(n = 1:n)

for(i in 1:n){
  df_storage$n2[i] <- i + 5
}

df_storage
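
Since lists were mentioned above as a storage option, here is the same loop storing its results in a pre-allocated list (a sketch mirroring the two examples above):

## storing values in a list
n <- 10
list_storage <- vector("list", length = n)

for(i in 1:n){
  list_storage[[i]] <- i + 5
}

list_storage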

while loops

while() loops differ from for() loops in that they continue to perform a process while some condition is met.

For example, suppose we start with a count of 0 observations and add 1 observation at a time; we want to continue this process as long as the count remains below 10.

observations <- 0

while(observations < 10){
	observations <- observations + 1
	print(observations)
} 

We can also use while() loops to test logical arguments.

For example, let’s say we have five coins in our pocket and want to play a game with a friend where we flip a fair coin: every time it lands on heads (coin_flip == 1) we win a coin, and every time it lands on tails we lose a coin. We are only willing to continue playing the game as long as we retain between 3 and 10 coins.

## starting number of coins
coins <- 5 

## while loop 
while(coins >= 3 && coins <= 10){
	
  # flip a fair coin (50/50 chance of heads or tails)
	coin_flip <- rbinom(1,1,0.5)
	
	# If the coin lands on heads (1) you win a coin and if it lands on tails (0) you lose a coin
	if(coin_flip == 1){
	  
		coins <- coins + 1
		
		}else{
			coins <- coins - 1
		}
	
	## NOTE: we only play while our winnings are between 3 and 10 coins
	
	# print the result
	print(coins)
}

You can run the code many times and find out, on average, how many flips you will get!
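
One way to actually estimate that average (a sketch, not code from the original post; play_game() is a hypothetical helper) is to wrap the game in a function that returns the number of flips and then repeat it many times with replicate(), which we cover below:

## wrap the coin game in a function that returns the number of flips played
play_game <- function(start_coins = 5, lower = 3, upper = 10){
  coins <- start_coins
  flips <- 0
  while(coins >= lower && coins <= upper){
    coin_flip <- rbinom(n = 1, size = 1, prob = 0.5)
    if(coin_flip == 1){
      coins <- coins + 1
    } else {
      coins <- coins - 1
    }
    flips <- flips + 1
  }
  flips
}

## play the game 10,000 times and average the number of flips per game
set.seed(22)
mean(replicate(n = 10000, expr = play_game()))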

Finally, we can also use while() loops when building models to minimize error. For example, let’s say we start with error = 30 and we want to continue running the code until we have minimized the error below 1. So, the code will run while(error > 1).

error <- 30

while(error > 1){
  error <- error / 2
  print(error)
}

Helper functions for summarizing distributions

There are a number of helper functions in base R that can assist us in summarizing data.

  • apply() will return your results in a vector
  • lapply() will return your results as a list
  • sapply() can return the results as a vector or a list (if you set the argument simplify = FALSE)
  • tapply() will return your results in a named vector based on whichever grouping variable you specify

## create fake data
set.seed(1234)
fake_dat <- data.frame(
  group = rep(c("a", "b", "c"), each = 5),
  x = rnorm(n = 15, mean = 10, sd = 2),
  y = rnorm(n = 15, mean = 30, sd = 10),
  z = rnorm(n = 15, mean = 75, sd = 20)
)

fake_dat

#### apply ####
# get the column averages
apply(X = fake_dat[,-1], MARGIN = 2, FUN = mean)

# get the row averages
apply(X = fake_dat[,-1], MARGIN = 1, FUN = mean)

#### lapply ####
# Get the 95% quantile interval for each column
lapply(X = fake_dat[,-1], FUN = quantile, probs = c(0.025, 0.975))

#### sapply ####
# Get the standard deviation of each column in a vector
sapply(X = fake_dat[,-1], FUN = sd)

# Get the standard deviation of each column in a list
sapply(X = fake_dat[,-1], FUN = sd, simplify = FALSE)

#### tapply ####
# Get the average of x for each group
tapply(X = fake_dat$x, INDEX = fake_dat$group, FUN = mean)

We can alternatively do a lot of this type of data summarizing using the convenient R package {tidyverse}.

library(tidyverse)

## get the mean of each numeric column
fake_dat %>%
  summarize(across(.cols = x:z,
                   .fns = ~mean(.x)))

## get the mean across each row for the numeric columns
fake_dat %>%
  rowwise() %>%
  mutate(AVG = mean(c_across(cols = x:z)))

## Get the mean of x for each group
fake_dat %>%
  group_by(group) %>%
  summarize(avg_x = mean(x),
            .groups = "drop")

Finally, another handy base R function is replicate(), which allows us to replicate a task n number of times.

For example, let’s say we want to draw from a random normal distribution, rnorm(), with mean = 0 and sd = 1, but we want to run this random simulation 10 times and get 10 different data sets. replicate() allows us to do this and stores the results in a matrix with 10 columns, each holding one random sample of 10 values.

replicate(n = 10, expr = rnorm(n = 10, mean = 0, sd = 1))
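
Tying this back to the helper functions above, we can then summarize each of the 10 simulated data sets by applying a function over the columns of the resulting matrix (a small sketch combining replicate() and apply()):

set.seed(3)
sim_mat <- replicate(n = 10, expr = rnorm(n = 10, mean = 0, sd = 1))

# mean of each simulated data set (one per column)
apply(X = sim_mat, MARGIN = 2, FUN = mean)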

Wrapping Up

In this first part of my simulation and resampling series we went through some of the key functions in R that will help us build the scaffolding for our future work. In Part 2, we will dive into bootstrap resampling and simulating bivariate and multivariate normal distributions.

All code is available in both R Markdown and HTML format on my GitHub page.

R Tips & Tricks: Normalizing test dates & calculating test differences

A friend of mine was downloading some force plate data from the software provider so that he could evaluate test data in a few of his athletes during return to play. The issue he was running into was that the different athletes all had different numbers of tests and different start and end testing times. The software exports the test outputs by date and he was wondering how he could normalize the dates to numeric values (e.g. Test 1, Test 2, etc.) so that he could model the date (since we can’t really use a Date in a regression model).

I’ll be the first to admit that working with dates and times can be an incredible pain in the butt. For reference, I covered the topic of converting Catapult GPS practice duration strings to actual training minutes, HERE. To help him out, I provided a few different solutions depending on the research question. I’ve also added some code for calculating changes in test performance between tests and from each test to baseline.

The full code is available on my GITHUB page.

Loading Packages & Simulating Data

## load packages ----------------------------------------------
library(tidyverse)
library(lubridate)

## Simulate data ----------------------------------------------
set.seed(78)
dat <- tibble(
  
  athlete = rep(c("Tom", "Bob", "Franklin"), times = c(5, 10, 3)),
  test_dates = c(
    seq(as.Date("2023-01-01"), as.Date("2023-01-5"), by = "days"),
    seq(as.Date("2023-02-15"), as.Date("2023-02-24"), by = "days"),
    as.Date(c("2023-01-19", "2023-01-30", "2023-02-26"))
  ),
  jump_height = round(rnorm(n = 18, mean = 28, sd = 2.5), 1)
  
)

dat


We can see that Tom has 5 tests, Bob has 10, and Franklin has only 3. Additionally, Tom and Bob tested every day, consecutively, while Franklin was less compliant and had larger gaps between his tests.

Create a test number

First, let’s normalize the dates so that they are numeric. Basically, instead of dates we want a value indicating whether a test was test 1, test 5, or test N. We will do this by creating a row_number() id/counter for each individual athlete.

### Create a test number ------------------------------------------
dat <- dat %>%
  group_by(athlete) %>%
  mutate(test_day = row_number())

dat

Calculating the time between tests

Alternatively, we may want to know not just the number of each test but also the number of days between tests.

The code to do this is a bit ugly looking so let’s unpack it.

  1. Since we are dealing with dates we use the difftime() function which takes an argument for the two times you are looking to calculate the difference between. Here, we are trying to calculate the difference in time (days) between one date and the date preceding it for each individual athlete.
  2. The difftime() function produces a difftime object. To move toward a numeric value we first convert it to a character with the as.character() function.
  3. Once the variable is a character we use the as.numeric() function to convert it to a numeric value.
  4. Finally, since the first value for each athlete will be an NA, since there is no date preceding the first test, we use the coalesce() function to fill in a 0 value for each of the NA’s, to indicate that this was the first test and thus there was no time between it and any other test.

### Calculate the time between tests -------------------------------
dat <- dat %>%
  group_by(athlete) %>%
  mutate(time_btw_tests = coalesce(as.numeric(as.character(difftime(test_dates, lag(test_dates)))), 0))

dat

Notice that Tom and Bob have 1 day between all of their tests while Franklin’s second test was 11 days after his first and his third test was 27 days after his second.

Calculate the difference in jump height from one test to the next

### Calculate difference in jump height from one day to the next -------------------
dat <- dat %>%
  group_by(athlete) %>%
  mutate(test_to_test_diff = jump_height - lag(jump_height))

dat

Here, we use the lag() function to calculate the difference between one value and the value before it within the same column. Since we grouped by athlete, which is what we want, each athlete’s first test will always have an NA in this new column, since there was no test preceding it.
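
If a 0 is preferred over the NA for each athlete’s first test, the same coalesce() pattern used for time_btw_tests would apply here as well (a sketch, not part of the original code):

### Fill the first-test NA with 0, mirroring the time_btw_tests approach
dat %>%
  group_by(athlete) %>%
  mutate(test_to_test_diff = coalesce(jump_height - lag(jump_height), 0))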

Calculating the difference in jump height from the baseline test

Finally, we might also be interested in evaluating the performance on each test relative to the athlete’s baseline test. To do this we simply subtract the jump_height indexed in row one (the baseline) from each jump_height for each athlete.

### Calculate difference in jump height from each test to the baseline test -------------

dat <- dat %>%
  group_by(athlete) %>%
  mutate(test_to_baseline_diff = jump_height - jump_height[1])

dat

Wrapping Up

Dates and times are always tricky to deal with. Most sports technology providers provide data as dates (or Unix timestamps), meaning that we have to do some cleaning of the data to codify the dates as numeric values representing the test number or the days between tests (depending on the research question). Additionally, lag functions can be helpful for calculating the difference from one test to the next or from each test to baseline.

The entire code is available on my GITHUB page.

If you have any data cleaning issues that you are dealing with from various sports science technologies, feel free to reach out!

Displaying Tables & Plots Together Part 2: Adding Titles & Captions

Yesterday’s post about creating single page reports with tables and plots in the same display window got a lot of follow-up questions and comments (which is great!). In particular, Lyle Danley brought up a good point that adding titles to these reports, while important, can sometimes be tricky with these types of displays. So, I decided to do a quick follow-up to show how to add titles and captions to your reports (in case you want to point out some key things to your colleagues or the practitioners you are working with).

I’m going to use the exact same code as yesterday, so check out that article to see the general approach to building these reports. As always, all of the code for yesterday’s article and today’s are on my GITHUB page.

Review

Recall that yesterday we constructed the plot below using both ggarrange() and the {patchwork} package.

I’m going to use both approaches and add a title and a bullet point caption box in the bottom right.

Titles & Captions with ggarrange()

I won’t rehash all of the code from yesterday, but the table we created was constructed with the following code.

## Build table into a nice ggtextable() to visualize it
tbl <- ggtexttable(fit, rows = NULL, theme = ttheme("blank")) %>%
  tab_add_hline(at.row = 1:2, row.side = "top", linewidth = 2) %>%
  tab_add_hline(at.row = 4, row.side = "bottom", linewidth = 3, linetype = 1)

To create a bullet point caption box I need to first create a string of text that I want to display, using the paste() function. I then wrap this text in the ggparagraph() function so that it can be appropriately displayed on the plot. Then, similar to yesterday, I use the ggarrange() function to put the two plots, the table, and the caption box into a single canvas.

## Create text for a caption
text <- paste("* It appears that gender and flipper length are important for estimating bill length.",
              " ",
              "* Males have a bill length that is 2.073 mm greater than females on average.",
              " ",
              "* Penguins on different islands should be tested to determine how well this model will perform out of sample.",
              sep = "\n")

text.p <- ggparagraph(text = text, 
                      #face = "italic", 
                      size = 12,
                      color = "black") + 
  theme(plot.margin = unit(c(t = 1, b = -3, r = 1, l = 2),"cm"))


## Plots & Table together with the caption using ggarange()
final_display <- ggarrange(plt1, plt2, tbl, text.p,
          ncol = 2, nrow = 2)

I saved the canvas as final_display, which I can now wrap in the annotate_figure() function to add a common title to the report.

## add a common title
annotate_figure(final_display, top = text_grob("Investigation of Penguin Bill Lengths", 
                                      color = "blue", face = "bold", size = 18))

The finished product looks like this:

Titles & Captions with patchwork

Now, we will do the same thing with {patchwork}. Just like yesterday, to use {patchwork} we need to change the table from a ggtexttable() to a tableGrob(). After that we can wrap it together with our two plots.

# Need to build the table as a tableGrob() instead of ggtexttable()
# to make it work with patchwork
tbl2 <- tableGrob(fit, rows = NULL, theme = ttheme("blank")) %>%
  tab_add_hline(at.row = 1:2, row.side = "top", linewidth = 2) %>%
  tab_add_hline(at.row = 4, row.side = "bottom", linewidth = 3, linetype = 1)

# now visualize together
final_display2 <- wrap_plots(plt1, plt2, tbl2, 
           ncol = 2,
           nrow = 2)

We stored the final canvas in the element final_display2. We can add a title, subtitle, and caption using patchwork’s plot_annotation() function by simply specifying the text that we would like, and then tack on the bullet point box as a textGrob().

final_display2 + plot_annotation(
  title = "Investigation of Penguin Bill Lengths",
  subtitle = "Careful, sometimes the Penguins bite!!",
  caption = "data courtesty of {palmerpenguins} R package") + 
  grid::textGrob(hjust = 0, x = 0,
                   "* It appears that gender and flipper length are important for estimating bill length.\n* Males have a bill length that is 2.073 mm greater than females on average.\n* Penguins on different islands should be tested to determine\nhow well this model will perform out of sample.")

And here is our final report:

Wrapping Up

Those are two simple ways, using two different R packages, to create single page reports with plots, data tables, and even bullet point notes for the reader. Happy report constructing!

For the complete code to the blog article check out my GITHUB page.

Displaying Tables & Plots Together

A common question that I get asked is for a simple way of displaying tables and plots together in the same one-page report. Most people in the sport science space who are new to R will copy and paste their plot and table outputs into a Word document and then share that with their colleagues. But this creates extra work: copying, pasting, making sure you don’t mess up and forget to paste the latest plot, etc. So, today’s blog article will walk through a really easy way to create a single page document for combining tables and plots into a single report, which you can save to PDF or JPEG directly from RStudio. This same approach is also useful for researchers looking to combine tables and plots into a single figure for publication. I’ll show how to do this using both ggarrange() and {patchwork}.

As always, the full code is available on my GITHUB page.

Load Libraries and Set a Plotting Theme

### Load libraries
library(tidyverse)
library(ggpubr)
library(gridExtra)
library(patchwork)
library(broom)
library(palmerpenguins)

## set plot theme
theme_set(theme_classic() +
            theme(axis.text = element_text(size = 11, face = "bold"),
                  axis.title = element_text(size = 13, face = "bold"),
                  plot.title = element_text(size = 15),
                  legend.position = "top"))


Load Data

We will use the {palmerpenguins} data that is freely available in R.

## load data
data("penguins")
d <- penguins %>%
  na.omit()


Build the Plots & Table

First we will build our plots. We are going to create two plots and one table. The table will store the information from a linear regression which regresses bill length on flipper length and penguin sex. The plots will be our visualization of these relationships.

## Create Plots
plt1 <- d %>%
  ggplot(aes(x = flipper_length_mm, y = bill_length_mm)) +
  geom_point(aes(fill = sex),
             size = 4,
             shape = 21,
             color = "black",
             alpha = 0.5) +
  geom_smooth(method = "lm",
              aes(color = sex)) +
  scale_fill_manual(values = c("female" = "green", "male" = "blue")) +
  scale_color_manual(values = c("female" = "green", "male" = "blue")) +
  labs(x = "Flipper Length (mm)",
       y = "Bill Length (mm)",
       title = "Bill Length ~ Flipper Length")


plt2 <- d %>%
  ggplot(aes(x = sex, y = bill_length_mm)) +
  geom_violin(alpha = 0.5,
              aes(fill = sex)) +
  geom_boxplot(width = 0.2) +
  geom_jitter(alpha = 0.5) +
  labs(x = "Sex",
       y = "Bill Length (mm)",
       title = "Bill Length Conditional on Penguin Gender")


## Create table
fit <- d %>%
  lm(bill_length_mm ~ flipper_length_mm + sex, data = .) %>%
  tidy() %>%
  mutate(across(.cols = estimate:statistic,
                ~round(.x, 3)),
         term = case_when(term == "(Intercept)" ~ "Intercept",
                          term == "flipper_length_mm" ~ "Flipper Length (mm)",
                          term == "sexmale" ~ "Male"))

Convert the table into a ggtexttable format

Right now the table is in a tibble/data frame format. To get it into a format that is usable within the display grid we will convert it to a ggtexttable() and use some styling to make it look pretty.

## Build table into a nice ggtextable() to visualize it
tbl <- ggtexttable(fit, rows = NULL, theme = ttheme("blank")) %>%
  tab_add_hline(at.row = 1:2, row.side = "top", linewidth = 2) %>%
  tab_add_hline(at.row = 4, row.side = "bottom", linewidth = 3, linetype = 1)

Display the Table and Plots using ggarrange

We simply add our plot and table elements to the ggarrange() function and get a nice looking report.

## Plots & Table together using ggarange()
ggarrange(plt1, plt2, tbl,
          ncol = 2, nrow = 2)

Display the Table and Plots using patchwork

We can accomplish the same goal using the {patchwork} package. The only trick here is that we can’t pass a ggtexttable element into patchwork. We need to convert the table into a tableGrob() to make this work. A tableGrob() is simply a way for us to capture all of the information that is required for the table structure we’d like. Also note that we can pass the same tableGrob() into ggarrange() above and it will work.

## Plots & Table together using patchwork
# Need to build the table as a tableGrob() instead of ggtexttable()
# to make it work with patchwork
tbl2 <- tableGrob(fit, rows = NULL, theme = ttheme("blank")) %>%
  tab_add_hline(at.row = 1:2, row.side = "top", linewidth = 2) %>%
  tab_add_hline(at.row = 4, row.side = "bottom", linewidth = 3, linetype = 1)

Now we wrap the tableGrob and our plots into the wrap_plots() function and we are all set!

# now visualize together
wrap_plots(plt1, plt2, tbl2, 
           ncol = 2,
           nrow = 2)

Wrapping Up

Instead of copying and pasting tables and plots into Word, try using these two simple approaches to create a single report page that stores all of the necessary information your colleagues need to see!

All of the code is available on my GITHUB page.