{"id":1650,"date":"2020-04-10T00:02:00","date_gmt":"2020-04-10T00:02:00","guid":{"rendered":"http:\/\/optimumsportsperformance.com\/blog\/?p=1650"},"modified":"2020-07-01T16:44:49","modified_gmt":"2020-07-01T16:44:49","slug":"r-tips-tricks-joining-data-sets","status":"publish","type":"post","link":"https:\/\/optimumsportsperformance.com\/blog\/r-tips-tricks-joining-data-sets\/","title":{"rendered":"R Tips &#038; Tricks: Joining Data Sets"},"content":{"rendered":"<p>I get a lot of questions from students and colleagues in Sports Science regarding how to do various tasks in R. Most are coming from a strong background in Excel so delving into a language like R can have a steep learning curve. As such, I decided to put together this series called <span style=\"text-decoration: underline;\"><strong>R Tips &amp; Tricks<\/strong><\/span> to share some of the different tasks you might already be doing in excel that you can do in R.<\/p>\n<p>Today, I&#8217;ll discuss joining different data sets. Excel users would commonly do this with VLOOKUP or some type of INDEX. In R, we will use the series of <strong>join <\/strong>functions that can be found in the <strong>dplyr<\/strong> package.<\/p>\n<p>First, we load the <strong>tidyverse<\/strong> package, which contains a suite of packages (dplyr, ggplot2, purrr, and others) useful for data manipulation, data cleaning, and data visualization. Those interested in some of the capabilities of <strong>tidyverse<\/strong> should check out the <a href=\"https:\/\/www.youtube.com\/channel\/UCP8l94xtoemCH_GxByvTuFQ\">TidyX Screencast<\/a> series myself and <a href=\"https:\/\/twitter.com\/ellis_hughes\">Ellis Hughes<\/a> have been doing.<\/p>\n<p>In addition to loading <strong>tidyverse<\/strong> we will also simulate two data sets.<\/p>\n<pre class=\"brush: r; title: ; notranslate\" title=\"\">\r\n\r\n## Load tidyverse\r\n\r\nlibrary(tidyverse)\r\n\r\n## Make two data frames\r\n\r\ntrainingData &lt;- data.frame(\r\n  Name = c(rep(c(&quot;Bob&quot;, &quot;Mike&quot;, &quot;Jeff&quot;), each = 4), rep(&quot;James&quot;, each = 3)),\r\n  Day = c(rep(1:4, times = 3), 1:3),\r\n  trainignLoad = round(rnorm(n = 15, mean = 400, sd = 150), 0))\r\n\r\nwellnessData &lt;- data.frame(\r\n  Name = c(rep(c(&quot;Bob&quot;, &quot;Mike&quot;, &quot;Jeff&quot;), each = 2), rep(&quot;James&quot;, each = 4)),\r\n  Day = c(rep(1:2, times = 3), 1:4),\r\n  wellness = round(rnorm(n = 10, mean = 6, sd = 1), 0))\r\n\r\n\r\n<\/pre>\n<p>Here is what the two data sets look like:<\/p>\n<p><a href=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2020\/04\/Screen-Shot-2020-04-09-at-4.32.52-PM.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-1652\" src=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2020\/04\/Screen-Shot-2020-04-09-at-4.32.52-PM.png\" alt=\"\" width=\"258\" height=\"661\" srcset=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2020\/04\/Screen-Shot-2020-04-09-at-4.32.52-PM.png 388w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2020\/04\/Screen-Shot-2020-04-09-at-4.32.52-PM-117x300.png 117w\" sizes=\"auto, (max-width: 258px) 100vw, 258px\" \/><\/a><\/p>\n<p>These data sets represent a common issue in sports science, where you might have training data (e.g., GPS data) on one computer and wellness questionnaire data on another. The goal is to bring them together in a centralized way so that you can do further analysis, build reports, or build visualizations.<\/p>\n<p>We will detail five of the main <strong>join<\/strong> functions you can use for this task, depending on your needs:<\/p>\n<p>1) left_join()<br \/>\n2) right_join()<br \/>\n3) inner_join()<br \/>\n4) anti_join()<br \/>\n4) full_join()<\/p>\n<p><span style=\"text-decoration: underline;\"><strong>left_join()<\/strong><\/span><\/p>\n<p>left_join() looks for all of the matches between the left and right data frames and retains all of the rows in the left data frame, putting NA in any row where there is not a match in the right data frame.<\/p>\n<p>Let&#8217;s join the trainingData (left data frame) to the wellnessData (right data frame) on the columns &#8220;Name&#8221; and &#8220;Day&#8221;.<\/p>\n<pre class=\"brush: r; title: ; notranslate\" title=\"\">\r\n\r\ntrainingData %&gt;%\r\n  left_join(wellnessData, by = c(&quot;Name&quot;, &quot;Day&quot;)) %&gt;%\r\n  as.data.frame()\r\n\r\n<\/pre>\n<p>After running this code, we see that we retain all 15 rows in the trainingData and in instances where an athlete may have forgot to put in the wellness data (e.g., Day 3 and 4 for Bob) R gives us an NA.<a href=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2020\/04\/Screen-Shot-2020-04-09-at-4.37.57-PM.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-1653\" src=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2020\/04\/Screen-Shot-2020-04-09-at-4.37.57-PM.png\" alt=\"\" width=\"345\" height=\"279\" srcset=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2020\/04\/Screen-Shot-2020-04-09-at-4.37.57-PM.png 808w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2020\/04\/Screen-Shot-2020-04-09-at-4.37.57-PM-300x243.png 300w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2020\/04\/Screen-Shot-2020-04-09-at-4.37.57-PM-768x622.png 768w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2020\/04\/Screen-Shot-2020-04-09-at-4.37.57-PM-624x505.png 624w\" sizes=\"auto, (max-width: 345px) 100vw, 345px\" \/><\/a><\/p>\n<p><strong><span style=\"text-decoration: underline;\">right_join()<\/span><\/strong><\/p>\n<p>right_join(), as you would imagine, behaves in the opposite way as left_join(). Here, right_join() looks for all of the matches between the right and left data frames and retains all of the rows in the right data frame, putting NA in any row where there is not a match in the left data frame.<\/p>\n<p>Let&#8217;s join the training load data to the wellness data using right_join().<\/p>\n<pre class=\"brush: r; title: ; notranslate\" title=\"\">\r\ntrainingData %&gt;%\r\n  right_join(wellnessData, by = c(&quot;Name&quot;, &quot;Day&quot;)) %&gt;%\r\n  as.data.frame()\r\n<\/pre>\n<p>The data frame that gets returned after running that code has gotten smaller because we are only retaining rows from the right data frame (wellness) that were the same in the left data frame (training) based on our join criteria (Name and Day).<\/p>\n<p><a href=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2020\/04\/Screen-Shot-2020-04-09-at-4.42.48-PM.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-1654\" src=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2020\/04\/Screen-Shot-2020-04-09-at-4.42.48-PM.png\" alt=\"\" width=\"459\" height=\"257\" srcset=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2020\/04\/Screen-Shot-2020-04-09-at-4.42.48-PM.png 850w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2020\/04\/Screen-Shot-2020-04-09-at-4.42.48-PM-300x168.png 300w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2020\/04\/Screen-Shot-2020-04-09-at-4.42.48-PM-768x430.png 768w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2020\/04\/Screen-Shot-2020-04-09-at-4.42.48-PM-624x349.png 624w\" sizes=\"auto, (max-width: 459px) 100vw, 459px\" \/><\/a><\/p>\n<p><span style=\"text-decoration: underline;\"><strong>inner_join()<\/strong><\/span><\/p>\n<p>inner_join() only retains the complete matches between the left and right data frames and discards all other rows.<\/p>\n<pre class=\"brush: r; title: ; notranslate\" title=\"\">\r\ntrainingData %&gt;%\r\n  inner_join(wellnessData, by = c(&quot;Name&quot;, &quot;Day&quot;)) %&gt;%\r\n  as.data.frame()\r\n<\/pre>\n<p>Running this code returns only the 9 matching rows based on our join criteria (Name and Day).<\/p>\n<p><a href=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2020\/04\/Screen-Shot-2020-04-09-at-4.47.09-PM.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-1655\" src=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2020\/04\/Screen-Shot-2020-04-09-at-4.47.09-PM.png\" alt=\"\" width=\"436\" height=\"219\" srcset=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2020\/04\/Screen-Shot-2020-04-09-at-4.47.09-PM.png 876w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2020\/04\/Screen-Shot-2020-04-09-at-4.47.09-PM-300x151.png 300w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2020\/04\/Screen-Shot-2020-04-09-at-4.47.09-PM-768x386.png 768w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2020\/04\/Screen-Shot-2020-04-09-at-4.47.09-PM-624x313.png 624w\" sizes=\"auto, (max-width: 436px) 100vw, 436px\" \/><\/a><\/p>\n<p><span style=\"text-decoration: underline;\"><strong>anti_join()<\/strong><\/span><\/p>\n<p>As the name would imply, anti_join() only returns the rows where there are NO matches between the left and right data frames based on the join criteria (Name and Day) and discards the rest.<\/p>\n<pre class=\"brush: r; title: ; notranslate\" title=\"\">\r\ntrainingData %&gt;%\r\n  anti_join(wellnessData, by = c(&quot;Name&quot;, &quot;Day&quot;)) %&gt;%\r\n  as.data.frame()\r\n<\/pre>\n<p>After running the code we are returned the 6 rows that were unmatched between the two data sets. There are a few instances where this type of join is useful. One use case was detailed in our <a href=\"https:\/\/optimumsportsperformance.com\/blog\/tidyx-episode-2-sentiment-analysis-word-clouds\/\">TidyX 4 screen cast<\/a>, where we did some text analysis.<\/p>\n<p><a href=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2020\/04\/Screen-Shot-2020-04-09-at-4.48.53-PM.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-1656\" src=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2020\/04\/Screen-Shot-2020-04-09-at-4.48.53-PM.png\" alt=\"\" width=\"503\" height=\"206\" srcset=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2020\/04\/Screen-Shot-2020-04-09-at-4.48.53-PM.png 820w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2020\/04\/Screen-Shot-2020-04-09-at-4.48.53-PM-300x123.png 300w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2020\/04\/Screen-Shot-2020-04-09-at-4.48.53-PM-768x315.png 768w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2020\/04\/Screen-Shot-2020-04-09-at-4.48.53-PM-624x256.png 624w\" sizes=\"auto, (max-width: 503px) 100vw, 503px\" \/><\/a><\/p>\n<p><span style=\"text-decoration: underline;\"><strong>full_join()<\/strong><\/span><\/p>\n<p>Finally, the full_join() will join all rows in the left data frame with all rows in the right data frame and it will put NA in columns for any rows that don&#8217;t have a match.<\/p>\n<pre class=\"brush: r; title: ; notranslate\" title=\"\">\r\ntrainingData %&gt;%\r\n  full_join(wellnessData, by = c(&quot;Name&quot;, &quot;Day&quot;)) %&gt;%\r\n  as.data.frame()\r\n<\/pre>\n<p>Here we see that we are returned all 15 rows of the largest data frame and there are NA place holders anywhere that data was empty when joining on the specified join criteria (Name and Day).<\/p>\n<p><a href=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2020\/04\/Screen-Shot-2020-04-09-at-4.53.15-PM.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-1657\" src=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2020\/04\/Screen-Shot-2020-04-09-at-4.53.15-PM.png\" alt=\"\" width=\"472\" height=\"393\" srcset=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2020\/04\/Screen-Shot-2020-04-09-at-4.53.15-PM.png 818w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2020\/04\/Screen-Shot-2020-04-09-at-4.53.15-PM-300x249.png 300w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2020\/04\/Screen-Shot-2020-04-09-at-4.53.15-PM-768x638.png 768w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2020\/04\/Screen-Shot-2020-04-09-at-4.53.15-PM-624x519.png 624w\" sizes=\"auto, (max-width: 472px) 100vw, 472px\" \/><\/a><\/p>\n<p>All code is available at my <span style=\"color: #0000ff;\"><a style=\"color: #0000ff;\" href=\"https:\/\/github.com\/pw2\/R-Tips-Tricks\/blob\/master\/R%20Tips%20%26%20Tricks%20-%20Joining%20Data%20Sets.R\">GITHUB page<\/a>.<\/span><\/p>\n","protected":false},"excerpt":{"rendered":"<p>I get a lot of questions from students and colleagues in Sports Science regarding how to do various tasks in R. Most are coming from a strong background in Excel so delving into a language like R can have a steep learning curve. As such, I decided to put together this series called R Tips [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[45],"tags":[],"class_list":["post-1650","post","type-post","status-publish","format-standard","hentry","category-r-tips-tricks"],"_links":{"self":[{"href":"https:\/\/optimumsportsperformance.com\/blog\/wp-json\/wp\/v2\/posts\/1650","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/optimumsportsperformance.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/optimumsportsperformance.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/optimumsportsperformance.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/optimumsportsperformance.com\/blog\/wp-json\/wp\/v2\/comments?post=1650"}],"version-history":[{"count":4,"href":"https:\/\/optimumsportsperformance.com\/blog\/wp-json\/wp\/v2\/posts\/1650\/revisions"}],"predecessor-version":[{"id":1660,"href":"https:\/\/optimumsportsperformance.com\/blog\/wp-json\/wp\/v2\/posts\/1650\/revisions\/1660"}],"wp:attachment":[{"href":"https:\/\/optimumsportsperformance.com\/blog\/wp-json\/wp\/v2\/media?parent=1650"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/optimumsportsperformance.com\/blog\/wp-json\/wp\/v2\/categories?post=1650"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/optimumsportsperformance.com\/blog\/wp-json\/wp\/v2\/tags?post=1650"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}