R Tips & Tricks: Joining Data Sets

I get a lot of questions from students and colleagues in Sports Science regarding how to do various tasks in R. Most are coming from a strong background in Excel so delving into a language like R can have a steep learning curve. As such, I decided to put together this series called R Tips & Tricks to share some of the different tasks you might already be doing in excel that you can do in R.

Today, I’ll discuss joining different data sets. Excel users would commonly do this with VLOOKUP or some type of INDEX. In R, we will use the series of join functions that can be found in the dplyr package.

First, we load the tidyverse package, which contains a suite of packages (dplyr, ggplot2, purrr, and others) useful for data manipulation, data cleaning, and data visualization. Those interested in some of the capabilities of tidyverse should check out the TidyX Screencast series myself and Ellis Hughes have been doing.

In addition to loading tidyverse we will also simulate two data sets.


## Load tidyverse

library(tidyverse)

## Make two data frames

trainingData <- data.frame(
  Name = c(rep(c("Bob", "Mike", "Jeff"), each = 4), rep("James", each = 3)),
  Day = c(rep(1:4, times = 3), 1:3),
  trainignLoad = round(rnorm(n = 15, mean = 400, sd = 150), 0))

wellnessData <- data.frame(
  Name = c(rep(c("Bob", "Mike", "Jeff"), each = 2), rep("James", each = 4)),
  Day = c(rep(1:2, times = 3), 1:4),
  wellness = round(rnorm(n = 10, mean = 6, sd = 1), 0))

Here is what the two data sets look like:

These data sets represent a common issue in sports science, where you might have training data (e.g., GPS data) on one computer and wellness questionnaire data on another. The goal is to bring them together in a centralized way so that you can do further analysis, build reports, or build visualizations.

We will detail five of the main join functions you can use for this task, depending on your needs:

1) left_join()
2) right_join()
3) inner_join()
4) anti_join()
4) full_join()

left_join()

left_join() looks for all of the matches between the left and right data frames and retains all of the rows in the left data frame, putting NA in any row where there is not a match in the right data frame.

Let’s join the trainingData (left data frame) to the wellnessData (right data frame) on the columns “Name” and “Day”.


trainingData %>%
  left_join(wellnessData, by = c("Name", "Day")) %>%
  as.data.frame()

After running this code, we see that we retain all 15 rows in the trainingData and in instances where an athlete may have forgot to put in the wellness data (e.g., Day 3 and 4 for Bob) R gives us an NA.

right_join()

right_join(), as you would imagine, behaves in the opposite way as left_join(). Here, right_join() looks for all of the matches between the right and left data frames and retains all of the rows in the right data frame, putting NA in any row where there is not a match in the left data frame.

Let’s join the training load data to the wellness data using right_join().

trainingData %>%
  right_join(wellnessData, by = c("Name", "Day")) %>%
  as.data.frame()

The data frame that gets returned after running that code has gotten smaller because we are only retaining rows from the right data frame (wellness) that were the same in the left data frame (training) based on our join criteria (Name and Day).

inner_join()

inner_join() only retains the complete matches between the left and right data frames and discards all other rows.

trainingData %>%
  inner_join(wellnessData, by = c("Name", "Day")) %>%
  as.data.frame()

Running this code returns only the 9 matching rows based on our join criteria (Name and Day).

anti_join()

As the name would imply, anti_join() only returns the rows where there are NO matches between the left and right data frames based on the join criteria (Name and Day) and discards the rest.

trainingData %>%
  anti_join(wellnessData, by = c("Name", "Day")) %>%
  as.data.frame()

After running the code we are returned the 6 rows that were unmatched between the two data sets. There are a few instances where this type of join is useful. One use case was detailed in our TidyX 4 screen cast, where we did some text analysis.

full_join()

Finally, the full_join() will join all rows in the left data frame with all rows in the right data frame and it will put NA in columns for any rows that don’t have a match.

trainingData %>%
  full_join(wellnessData, by = c("Name", "Day")) %>%
  as.data.frame()

Here we see that we are returned all 15 rows of the largest data frame and there are NA place holders anywhere that data was empty when joining on the specified join criteria (Name and Day).

All code is available at my GITHUB page.