Category Archives: R Tips & Tricks

R Tips & Tricks: Creating a Multipage PDF with {ggplot2}

I was recently asked by a colleague for a simple solution to produce a multipage PDF for a training load report. The colleague wanted the report to generate a new page for each position group.

There are a few ways to solve this problem with loops, but below is literally the easiest approach you can take.

First, we will load in two packages that we need, {tidyverse}, for data manipulation and visualization, and {patchwork}, for organizing multiple plots on a page. Additionally, I’ll create a z-score function so that we can standardize the training load variables for each individual (In my opinion, it makes these types of charts look nicer when the data is on the same scale since athletes within the same position group can sometimes have very different training responses).

## load packages and custom z-score function

z_score <- function(x){
  z = (x - mean(x, na.rm = T)) / sd(x, na.rm = T)


Next, we will just simulate some fake training data.


## simulate data
athlete <- rep(LETTERS[1:10], each = 10)
pos <- rep(c("DB", "LB", "DL", "OL", "WR"), each = 20)
week <- rep(1:10, times = 10)
Total_Dist <- round(rnorm(n = length(athlete), mean = 3200, sd = 400), 0) 
HSR <- round(rnorm(n = length(athlete), mean = 450, sd = 100), 0)

df <- data.frame(athlete, pos, week, Total_Dist, HSR) df %>% head()


Let’s go ahead and apply our z-score function to our two training variables, Total Distance (Total_Dist) and High Speed Running (HSR). Notice that I group by “athlete” to ensure that the mean and standard deviation used to normalize each variable is specific to the individual and not the entire population.


df <- df %>%
  group_by(athlete) %>%
  mutate(TD_z = z_score(Total_Dist),
         HSR_z = z_score(HSR))


Now we need to make a function that will create the plots we want. The code below can look a little intimidating, so here are a few points to help you wrap your head around it:

  • It is literally just two {ggplot2} plots. All I did was store each one in their own object (so that we could pair them together with {patchwork} and wrap them inside of this function).
  • The easiest way to get used to doing this is to write your {ggplot2} plots out as you normally would (as if you were creating them for a single position group). When you have the plot built to your specifications then just wrap it into a function. The argument for the function should take the value that you want to iterate over. In this case, we want to create plots for each position group, so I call the argument “POS”, short for position. When I run that function I provide the “POS” argument with the abbreviation for the position group I am interested in and the function will do the rest. The function works in this manner because you’ll notice that the second line of each of the plots is a filter that is specifically pulling out the position group of interest from the original data set.
  • The final line of the function creates an element called “plots”. You’ll see that the element consists of the two plots that we created above it and they are separated by a “|”. This vertical bar is just telling the {patchwork} package to place one plot right next to the other.
### Build a function for the plots to loop over position group

plt_function <- function(POS){
  dist_plt <- df %>%
    filter(pos == POS) %>%
    ggplot(aes(x = as.factor(week), y = TD_z, group = 1)) +
    geom_hline(yintercept = 0) +
    geom_line(size = 1) +
    geom_area(fill = "light green", 
              alpha = 0.7) +
    facet_wrap(~athlete) +
    theme_bw() +
    theme(axis.text.x = element_text(size = 9, face = "bold"),
          axis.text.y = element_text(size = 9, face = "bold"),
          strip.background = element_rect(fill = "black"),
          strip.text = element_text(color = "white", face = "bold", size = 8)) +
    labs(x = "",
         y = "Total Distance",
         title = "Weekly Training Distance",
         subtitle = paste("Position", POS, sep = " = ")) +
    ylim(c(-3.5, 3.5))
  hsr_plt <- df %>%
    filter(pos == POS) %>%
    ggplot(aes(x = as.factor(week), y = HSR_z, group = 1)) +
    geom_hline(yintercept = 0) +
    geom_line(size = 1) +
    geom_area(fill = "light green", 
              alpha = 0.7) +
    facet_wrap(~athlete) +
    theme_bw() +
    theme(axis.text.x = element_text(size = 9, face = "bold"),
          axis.text.y = element_text(size = 9, face = "bold"),
          strip.background = element_rect(fill = "black"),
          strip.text = element_text(color = "white", face = "bold", size = 8)) +
    labs(x = "",
         y = "HSR",
         title = "Weekly HSR",
         subtitle = paste("Position", POS, sep = " = ")) +
    ylim(c(-3.5, 3.5))
  plots <- dist_plt | hsr_plt


Let’s try out the function on just one group. We will pass the POS argument the abbreviation “DB”, for the defensive backs group.


# try out the function

plt_function(POS = "DB")


It worked!!

Okay, now let’s create our multipage PDF report. To do this, all we need to do is run the above line of code for each of our position groups. To ensure that we get each position plot into the PDF, we begin the code chunk with the pdf() function. It is here that we will specify the width and height of the plot page within the PDF itself (NOTE: you many need to play around with this depending on what your plots look like). We can also name the PDF report. Here I just called it “Team.pdf”. Finally, after running the line of code for each position group plot, we run the function, which just shuts down the specified PDF device so that R knows that we are done making plots.


## create a multipage pdf with each page representing a position group

pdf(width = 12, height = 8, "Team.pdf")
plt_function(POS = "DB")
plt_function(POS = "LB")
plt_function(POS = "DL")
plt_function(POS = "OL")
plt_function(POS = "WR")


And that’s it! We end up with a 5 page PDF that has a different position group on each page.



If you want to see the finished product, click here: Team

The full code is on my github page. CLICK HERE

R Tips & Tricks: Building a Shiny Training Load Dashboard

In TidyX Episode 19 we discussed a way of building a dashboard with the {formattable} package. The dashboard included both a data table and a small visual using {sparkline}. Such a table is great when you need to make a report for a presentation but there are times when you might want to have something that is more interactive and flowing. In the sports science setting, this often comes in the form of evaluating athlete training loads. As such, I decided to spin up a quick {shiny} web app to show how easy it is to create whatever you want without sinking a ton of money into an athlete management system.

The code is available on my GITHUB page.

Packages & Custom Functions

Before doing anything, always start by loading the packages you will need for your work. In this tutorial, we will using the {tidyverse} and {shiny} packages, so be sure to install them if you haven’t already. I also like to set my plot theme to classic so that I get rid of grid lines for the {ggplot2} figures that I create.

Finally, I also wrote a custom function for calculating a z-score. This will come in handy when we go to visualize our data.

# custom function for calculating z-score
z_score <- function(x){
  z = (x - mean(x, na.rm = T)) / sd(x, na.rm = T)



Next we simulate a bunch of fake data for a basketball team so that we have something to build our {shiny} app against. We will simulate total weekly training loads for 10 athletes across three different positions (fwd, guard, center), for a 3 week pre-season and a 15 week in-season.


athlete <- rep(LETTERS[1:10], each = 15)
position <- rep(c("fwd", "fwd", "fwd", "fwd", "guard", "guard", "guard", "center", "center", "center"), each = 15)
week <- rep(c("pre_1", "pre_2", "pre_3", 1:12), times = 10)
training_load <- round(rnorm(n = length(athlete), mean = 1500, sd = 350), 1)

df <- data.frame(athlete, position, week, training_load)
df$flag <- factor(df$flag, levels = c("high", "normal", "low"))
df$week <- factor(df$week, levels = c("pre_1", "pre_2", "pre_3", "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12")) 

df %>% head()

The first few rows of the data look like this:

Shiny App

There are two components to building a {shiny} app:

1) The user interface, which defines the input that the user specifies on the web app.
2) The server, which reacts to what the user does and then produces the output that the user sees.

The user interface I will define as tl_ui (training load user interface). I start by creating a fluid page (since the user is going to be defining what they want). I set the parameters of what the user can control. Here, I provide them with the ability to select the position group of interest and then the week(s) of training they would like to see in their table and plot.

The server, which I call tl_server (training load server), is a little more involved as we need to have it take an input from the user interface , get the correct data from our stored data frame, and then produce an output. This part can be a bit involved. Below you can see the code and I’ll try and articulate a few of the main things that are taking place.

1) First, I get the data for the visualization the user wants. I’m using our simulated data and I calculate the z-score for each individual athlete, using the custom function I wrote. After that I create my flag (±1 SD from the individual athlete’s mean) and then I finally filter the data specific to the inputs provided in the user interface. This last part is critical. If I filter before applying my z-score and flags, then the mean and SD used to calculate the z-score will only be relative to the data that has been filtered down. This probably isn’t what we want. Rather, we will want to calculate our z-score using all data, season-to-date. So we retain everything and perform our calculation and then we filter the data.

2) After getting the data, I build my plot using {ggplot2}.

3) Then I create another data set specific to the table weeks of training selected by the user. This time, we will retain the raw values (non-standardized) for each athlete as the user may want to see raw values in a table. Notice that I also pivot that data using the pivot_wider() function as the data is currently stored in a long format with each row representing a new week for the athlete. Instead, I’d like the user to see this data across their screen, with each week representing a new column. As the user selects new weeks of data to visualize our server will tell {shiny} to populate the next column for that athlete.

4) Once I have the data we need, I simply render the table.

Those 2 steps complete the building of our {shiny} app!

Finally, the last step is to run the {shiny} app using the shinyApp() function which takes two arguments, the name of your user interface (tl_ui) and the name of your server (tl_server). This will open up a shiny app in a new window within R Studio that you can then open in your browser.

Visually, you can see what the app looks like below. If you want to see it in action, aside from getting my code and running it yourself, I put together this short video >> R Tips & Tricks – Training Load Dashboard


As you can tell, it is pretty easy to build and customize your own {shiny} app. I’ll admit, the syntax for {shiny} can get a bit frustrating but once you get the hang of it, like most things, it isn’t that bad. Also, unless you are doing really complex tasks, most of the syntax for something as simple as a training load dashboard wont be too challenging. Finally, such a quick and easy method not only allows you to customize things exactly as you want them, but it also saves you money from having to purchase an expensive athlete management system.

As always, you can obtain the code for this tutorial on my GITHUB page.

R Tips & Tricks: Force-Velocity-Power Profile Graphs in R Shiny

A colleague of mine was asking about how he could produce plots of force-velocity-power profiles for his coaches based on their Gymaware testing. Rather than plotting static plots for this stuff it is sometimes easier to build out a nice shiny app so that the coaches can interact with it or the practitioner can quickly change between players or position groups when giving a presentation to the staff.

All code and data is available at my GITHUB page (R Script and Data).

This tutorial will cover:

1) Building polynomial regression to represent the average team trend
2) Iterative approaches to building static plots
3) Iterative approaches to building Shiny web apps

This tutorial has a number of different iterations of plots and web apps so working through the R-code on your own is advised so you can see how every step is performed.


After loading the two required packages, {tidyverse} and {shiny}, we load in the data set and we see that it consists of Force, Power, and Velocity values across 5 different external loads for 3 different athletes:

If you want to use the R-script to produce reports for yourself going forward just ensure that your data has the above columns (named the same way, since R is case-sensitive). If you have data with different column names, your two choices are: (1) change my code to match the column names of your data; or, (2) once you pull the data into R change the columns names in your code to match mine.

Average Trend Line (2nd Order Polynomial)

Eventually, we are going to want to compare our athletes to the team average (or position group average if your sport is more heterogeneous). This type of data is often modeled as a. 2nd order polynomial regression. Thus, we will build this type of regression to predict both Velocity and Power from Force. Once I have these two regressions built I can create a data frame that consists of a sequence of Force values (from the minimum to the maximum observed in the team’s data) and predict the velocity and power at each unit of force.

fit_velo <- lm(Velocity ~ poly(Force, 2), data = fv_profile)
fit_power <- lm(Power ~ poly(Force, 2), data = fv_profile)

avg_tbl <- data.frame(Force = seq(from = min(fv_profile$Force),
                                  to = max(fv_profile$Force),
                                  by = 0.5))

avg_tbl$Velocity_Avg <- predict(fit_velo, newdata = avg_tbl)
avg_tbl$Power_Avg <- predict(fit_power, newdata = avg_tbl)
colnames(avg_tbl)[1] <- "Force_Grp"


Static Plots

I’ll walk through a few iterations of static plots. See the GITHUB code to walk through how these were produced.

1) All athletes with an average team trend line

This plot gives us a sense for the trend in Velocity as it relates to increasing amounts of force. Below you will see one with a solid trend line and one with a dashed trend line. Feel free to use which ever one you prefer.

2) Team trend for Velocity and Power

The common way this type of data is visualized in the sports science literature it is a bit tricky in R because it requires a dual y-axis. To obtain this, within my ggplot call I shrink Power down to a scale more similar to Velocity (the main y-axis) by dividing it by 1000. Then when I call the second y-axis with the sec_axis() function I multiply Power by 1000 to put it back on its normal scale.

3) Accounting for individuals

The above plots are a look at the entire team (in this case only 3 athletes). However, we may want to look at the individuals more explicitly. As such, we build four plots to show individual differences:

1) All athletes all on the same plot with their corresponding individualized trend lines (NOTE: if you have a lot of athletes, this plot can get pretty busy and ultimately become useless).
2) Plot each individual as a facet to look at the athletes separately.
3) Create the same plot as #2 but add in the group average trend line (which we created using our 2nd order polynomial regression) to allow us to compare each athlete to the groupx.
4) Plot each individual as a facet with velocity and power on separate y-axes.

Shiny App Development

The still figures above are nice but making things interactive is much more useful. I have four {shiny} web iterations that we can go through. Again, using the R code while reading this will help you understand what is going on under the hood. Additionally, you’ll want to run these in R so that R can open up the webpage on your computer and allow you to play with the app. Below is just still shots of each app iteration.

1) Version 1: Independent Player Plots

This version allows you to select one player at a time.

2) Version 2: Add Players to Facets

This version lets you select however many players you want and it will add them in as facets. This is a useful app if you are presenting to the staff and want to select or de-select several players to show how they compare to each other.

3) Version 3: Same as Version 2 but add Power to the Plot on a second y-axis

4) Version 4: Combine all plot details

Version 4 combines everything from this tutorial in one single web app. You can select the  force-velocity-power profile for individual athletes (added to the plot as facets) and see the team average trend line (added to the plot as a dashed line) for velocity and power, to allow you to make comparisons between each player and to the group average.


{tidyverse} makes it incredibly easy to manipulate data and quickly iterate plots to our liking while {shiny} offers an easy way for us to turn our plots into an interactive webpage.

Again, for the code and data see my GITHUB page (R Script and Data).


R Tips & Tricks: Pearson Correlation from First Principles

In our most recent TidyX screen cast we talked a bit about correlation, which compelled me to expand on Pearson’s correlation to show how you can do things from first principles in R, by writing your own custom functions. This tutorial covers:

  • A brief explanation of Pearson’s correlation
  • R functions available to obtain Pearson’s correlation coefficient
  • How to write your own custom functions to understand the underlying principles of Pearson’s correlation coefficient and it’s confidence intervals

Pearson’s Correlation

Read almost any scientific journal article and you are sure to encounter Pearson’s correlation coefficient (denoted as r), as it is a commonly used metric for describing relationships within ones data. Take any basic stats course and you are bound to hear the phrase, “Correlation does not equal causation”, due to the fact that just because things are correlated does not mean that one definitely caused the other (such a relationship would need to be teased out of more specifically and the natural world is full of all sorts of random correlations).

Pearson’s correlation is a descriptive statistic used to quantify the linear relationship between two variables. It does so by measuring how much two variables covary with each other. The correlation coefficient is scaled between -1 and 1 where 1 means that the two variables have a perfect positive relationship (as one variable increases the other variable also always increases) while -1 means that the two variables have a perfect negative relationship (as one variable decreases the other variable also always decreases). A correlation coefficient of 0 suggests that the two variables do not share a linear relationship. It is rare that we see completely perfect correlations (either 1 or -1) and often our data will present with some scatter suggesting a specific trend (positive or negative) but with some amount of variability.

Data Preparation

R offers a few convenient functions for obtaining the correlation coefficient between two variables. For this tutorial, we will work with the Lahman baseball data set, which is freely available in R once you have installed the {Lahman} package. We will also use the {tidyverse} package for data manipulation and visualization.

We will use the Batting data set, which contains historic batting statistics for players, and the Master data set, which contains meta data on the players (e.g., birth year, death year, hometown, Major League debut, etc).

I do a little bit of data cleaning to obtain only the batting statistics from the 2016 season for those that had at least 200 at bats and I join that data with the player’s birth year from the Master data set. The two hitting variables I’ll use for this tutorial are Hits and RBI. One final thing I do is create a quantile bin for the player’s age so that I can look at correlation across age groups later in the tutorial.

Here is what the data looks like so far:

Data Visualization

Once we have pre-processed the data we can visualize it to see how things look. We will plot two visuals: (1) the number of players per age group; and, (2), a plot of the linear relationship between Hits and RBI.

Correlation in R

R offers a few convenient functions for obtaining the correlation coefficient between two variables. The two we will use are cor() and cor.test(). The only arguments you need to pass to these two functions is the X and Y variables you are interested in. The first function will produce a single correlation coefficient while the latter function will produce the correlation coefficient along with confidence intervals and information about the hypothesis test. Here is what their respective outputs looks like for the correlation between Hits and RBI:

Pretty easy! We can see that the correlation coefficient is the same between both functions, as it should be (r = 0.82). In the bottom output we can see that the 95% Confidence Intervals range from r = 0.78 to r = 0.85. The correlation is high between these two variables but it is not perfect, which we can better appreciate from the scatter plot above showing variability around the regression line.

Correlation from First Principles

One of the best ways to understand what is going on behind the custom functions in your stats program (no matter if it is R, Python, SPSS, SAS, or even Excel) is to try and build your own functions by hand. Doing so gives you an appreciation for some of the inner workings of these statistics and will help you better understand more complex statistics latter on.

I’ll build two functions:

  1.  A function to calculate the Pearson’s correlation coefficient
  2. A function to calculate the confidence intervals around the correlation coefficient

Pearson’s Correlation Coefficient Function

  • Similar to the built in R functions, this function will take inputs of an X and Y variable.
  • You can see the math for calculating the correlation coefficient in the function below. We start by subtracting the mean for each column from each observation. We then multiply the differences for each row and then produce a column of squared differences for each variable. Those values provide the inputs for the correlation coefficient in the second to last line of the function.
cor_function <- function(x, y){
  dat <- data.frame(x, y)
  dat <- dat %>%
    mutate(diff_x = x - mean(x),
           diff_y = y - mean(y),
           xy = diff_x * diff_y,
           x2 = diff_x^2,
           y2 = diff_y^2)
  r <- sum(dat$xy) / sqrt(sum(dat$x2) * sum(dat$y2))

Confidence Interval for Pearson’s R

  • This function takes three inputs: (1) the correlation coefficient between two variables (calculated above); (2) the sample size (the number of observations between X and Y); and, (3) The Confidence Level of Interest.
  • NOTE: To input a confidence level of interest I only set it up for three options, 0.9, 0.95, or 0.99, for the 90%, 95% and 99% Confidence Interval respectively. I could set it up to take any level of confidence and calculate the appropriate critical value but the function (as you can see) was already getting long and messy so I decided to cut it off and keep it simple for illustration purposes here.
  • This function is a bit more involved than the previous one and that’s because we require some transformations of the data. We have to transform the correlation coefficient using Fisher’s Z-Transformation in order to create a normal distribution. From there we can calculate our confidence intervals and then back transform the values so that they are on the scale of r.
cor.CI <- function(r, N, CI){
  fisher.Z <- .5*log((1+r)/(1-r))
  se.Z <- sqrt(1/(N-3))
  if(CI == .9){
    MOE <- 1.65*se.Z}
  else {
    if(CI == .95){
      MOE <- 1.95*se.Z}
    else {
      if(CI ==.99){
        MOE <- 2.58*se.Z}
  Lower.Z <- fisher.Z - MOE
  Upper.Z <- fisher.Z + MOE
  Lower.cor.CI <- (exp(2*Lower.Z)-1)/(exp(2*Lower.Z)+1)
  Upper.cor.CI <- (exp(2*Upper.Z)-1)/(exp(2*Upper.Z)+1)
  Correlation.Coefficient.CI <- data.frame(r, Lower.cor.CI, Upper.cor.CI)
  Correlation.Coefficient.CI <- round(Correlation.Coefficient.CI, 3)

Seeing our Functions in Action

We’ve obtained similar results to what was produced from the custom R functions!

We can also look at correlation across age bins. To do this, we will use {tidyverse} so that we can group_by() the age bins that we predefined.

First with the built in R functions

  • Using the built in cor.test() function we need to call the confident interval specifically from the output and specifying [1] and [2] tells R that we want the first value (lower confidence interval) and the second value (higher confidence interval), respectively.
yr2016 %>%
  group_by(AgeBin) %>%
  summarize(COR = cor(H, RBI),
            COR_Low_CI = cor.test(H, RBI)$[1],
            COR_High_CI = cor.test(H, RBI)$[2])

Now with our custom R functions

  • Similar to above, we need to extract the confidence interval out of the confidence interval function’s output. In this case, if you recall, there were three outputs (correlation coefficient, Low CI, and High CI). As such, we want to indicate [2] and [3] for the lower and upper confidence interval, respectively, since that is where they are located in the model output.
yr2016 %>%
  group_by(AgeBin) %>%
  summarize(COR = cor_function(H, RBI),
            cor.CI(r = COR,
                   N = n(),
                   CI = 0.95)[2],
            cor.CI(r = COR,
                   N = n(),
                   CI = 0.95)[3])



Hopefully this tutorial was useful in helping you to understand Pearson’s correlation and how easy it is to write functions in R that allow us to explore our data from first principles. All of the code for this tutorial is available on my GitHub page.


R Tips & Tricks: Summarizing & Visualizing Data

Summarizing and visualizing your data is a critical first step of analysis. For new PhD students and those new to R, I’ve put together a few of the common approaches one might take when dealing with data. This tutorial covers:

  • Manipulating/reshaping a data frame (from wide format to long format).
  • Basic functions for obtaining summary statistics in your data .
  • Visualization strategies such as boxplots, histograms, violin plots, joint plots, and plots showing inter-individual differences.

Instead of putting all of the code into this blog post I’ll just highlight some of the key pieces along the way. If you want all of the code, simply head over to my GitHub page.


In this tutorial, we will simulate data where twenty participants are randomized  to either a traditional (IE, linear) or block periodization program for 16-weeks. The participants had their squat tested pre and post training program intervention and the difference between the two tests is our outcome of interest. (NOTE: I drew random values for pre- and post-squat for both groups in this example, so it isn’t completely life like, where the post-test would normally be related in some way to the pre-test. In future simulations that discuss analysis I will use more realistic outcomes).

Here is a snip of what the simulation code and first few rows of the data look like:


This analysis will use three R packages (make sure you have installed them prior to running the code):

  1. {tidyverse} – Used for data manipulation and visualization
  2. {gridExtra} – Used for organizing the plot grid when creating our joint plot
  3. {psych} – Used to produce simple summary statistics

Manipulating Data

Occasionaly, you may need to manipulate your data from a wide format to a long format, or vice versa, as some types of analysis or visualizations are made easier when the data is in a specific format.

In our case, we have the data in a wide format. In this format, we have one row per participant and the 2 columns of interest are the pre- and post-squat columns (see above example of the first 6 rows). This can be accomplished with the pivot_longer() function in {tidyverse}. This function takes four primary arguments:

  1. The data frame where our wide format data is.
  2. The columns of interest from our wide data frame that we want to pivot into a single column. In this case, “pre_squat” and “post_squat” columns are the ones we want to pivot.
  3. the names_to argument is where we specify the name of the column where the pre_squat and post_squat columns will be pivoted into.
  4. The values_to column is where we specify the name of the column that we want the values of our participants pre_squat and post_squat values to reside under in the long data frame.

The code looks like this:

dat_long <- pivot_longer(data = dat, 
             cols = c("pre_squat", "post_squat"),
             names_to = "test",
             values_to = "performance")

The first few rows of the long format data frame look like this:

Notice that now each participant has two rows of data, one for their pre-squat and one for their post-squat.

If we have stored our data in a long format and we need to go back to a wide format, we can simply use pivot_wider() from {tidyverse}. This function takes three primary arguments:

  1. The data frame where our long format data is.
  2. The names_from argument, which specifies the column in the long data frame that has the variables we’d like to pivot out into their own columns in a wide format (in this case, the “test” column contains the information about whether the test performed was pre or post, so we want to pivot that out to two new columns).
  3. The values_from argument specifies which values we want to fill under the corresponding columns we are pivoting. In this case, the ‘performance’ column has the data specific to the pre- and post-squat variables that we are pivoting to their own columns.

The code looks like this:

dat_wide <- pivot_wider(data = dat_long,
                        names_from = test,
                        values_from = performance)

The first few rows of the new wide data frame looks like this:

Notice that the data looks exactly like our initial data frame (as you would expect given that original data was already a wide data frame) with one row per participant.

Producing Summary Statistics

Now that we’ve walked through some simple approaches to manipulating data frames, I’ll detail a few easy ways of summarizing your data.

The {psych} package has two very simple functions, describe() and describeBy(), which come in handy when you need summary statistics, info about the range of the data, the standard error of the data, and some metrics about the distribution of your data.

describe() works with all of the data in a column. All you need to do is specify the data frame and the column of interest (separate those two by a $ sign). In this case, we will use the long data frame and produce summary statistics for the performance of pre-and post-squat performance across all subjects.


functions similar to describe(); however, it allows for a second argument identifying the group that you’d like to produce summary statistics for. In this case, we will produce summary statistics for all pre- and post-squat performance data described by group (traditional or block periodization).

describeBy(dat_long$performance, dat_long$group)

What if we want to look at summary statistics by group but further group by pre- and post-squat performance? In this instance, we can code our own summary statistics using {tidyverse}. {tidyverse} is one of the best packages for data manipulation, data clean up, and data visualization as it contains a host of other packages that contain functions for these tasks.

In the below code I start with the data frame that has the data I want to summarize and I ‘pipe’ together different functions using the %>% command. This allows me to iterate very quickly on a data set and intuitively build analysis as I go. In this example I first call the mutate() function (used to add a new column to my already existing data set) and inside of it I re-level my factors (since R automatically stores them in alphabetical order) as I want my results returned with the pre-squat performance first followed by the post-squat performance. After that step, I indicate that I want to group_by() both my periodization groups (traditional and block) and my test (pre and post). Finally, I use the summarize() function to create a summary data frame where I am specifying N = n(), in order to get the sample size counted in each group of my group_by(), as well as the mean and the standard deviation for each of the group_by() groups.

dat_long %>%
  mutate(test = fct_relevel(test, levels = c("pre_squat", "post_squat"))) %>%
  group_by(group, test) %>%
  summarize(N = n(),
            Mean = mean(performance),
            SD = sd(performance))


NOTE: In my R script on GItHub I also create a “difference” column (post – pre) and perform the same descriptive operations on that column as above. We will use this column below to produce quantiles as well as for our data visualization purposes but feel free to work through the examples in the R code for using the above functions.

Producing quantiles is easy with the quantile() function from base R. Similar to describe() above, pass the function the data and the column of interest, separated by a $ sign, to get your result. Here we will get the quantiles of the differences between post and pre-squat strength for all participants.


If we’d like to see these differences by group, we can use the by() function, which allows us to pass a function (in this case quantile()) to an entire data frame. As such, we will get the quantiles for the difference in performance by group.

by(dat$Diff, dat$group, quantile)

Note: In the R script on my GitHub page I walk through how to create a sequence of numbers (form 0 to 1) and specify a broader range of quantiles to be returned, should you need them.

Data Visualization

Below are a few ways that we can visualize the between group differences (the code for these is located at GitHub). All of the coding was done using {tidyverse}.


Boxplots are for showing the quantiles of your data but can lack context. The box represents the interquartile range (25-75) and the black line within the box represents the median value. All of these values as well as the smallest and largest values were obtained in the quantile() function in the previous section. We can see that the median value of the difference in squat performance for the block periodization group is higher than that of the traditional periodization group.

Overlapping Histograms

Overlapping histograms are useful for showing distributions and the difference between two groups. The dotted red line is set at a difference of 0 to represent no change from pre- to post-squat test performance. The alpha argument is set below 1 inside the geom_histogram() function of the code allow some opaqueness of the histograms. This way, we can see both distributions clearly. Similar to the boxplots, we can see that the median value of the difference in squat performance for the block periodization group is higher than that of the traditional periodization group.

Violin Plots with Points

Violin plots are a nice balance between boxplots and histograms as they are essentially two histograms mirroring each other. As such, you get tan appreciation of the data distribution, as you would with a histogram, while visualizing the groups side-by-side, as you would with a boxplot. I kept all of the data ponts on the plot as well, to allow the reader to see each participants performance and I placed a thick black point in the middle of the violin to represent the group median.

Joint Plot

Joint plots are a nice way to visualize continuous data where you have an ‘x’ and ‘y’ variable and want to additionally reflect the distribution of each variable with a histogram in the margins. In this plot, I placed the pre-squat performance on the x-axis and the post-squat performance on the y-axis and colored the points by which group the participant was in. This plot was constructed in {tidyverse} but requires the {gridExtra} package to arrange the histograms for the x and y variables in the margin. Additionally, you’ll want to create an empty plot to take up space on the grid so that everything lines up. This is all detailed in the code.

Inter-individual Differences

Finally, as we can see from all of the above plots, while the average performance in the block periodization group was greater than that of the traditional periodization group, there are large distributions in both groups indicating that some participants improved, others did not, and some stayed the same. We can visualize these inter-individual responses by creating a visaluzation that exposes each participant’s performance on both tests.


Hopefully this post serverse as a jumping off point for those looking to get started with analyzing data in R. As I stated at the start, all data can be obtained at my GitHub page.