{"id":1616,"date":"2020-01-25T22:07:20","date_gmt":"2020-01-25T22:07:20","guid":{"rendered":"http:\/\/optimumsportsperformance.com\/blog\/?p=1616"},"modified":"2020-07-01T16:45:51","modified_gmt":"2020-07-01T16:45:51","slug":"creating-a-data-dictionary-function-in-r","status":"publish","type":"post","link":"https:\/\/optimumsportsperformance.com\/blog\/creating-a-data-dictionary-function-in-r\/","title":{"rendered":"Creating A Data Dictionary Function in R"},"content":{"rendered":"<p>In my <a href=\"https:\/\/optimumsportsperformance.com\/blog\/tidytuesday-powerlifting-performance-age\/\">previous post<\/a>, I did a bit of impromptu analysis on some Powerlifting data provided by the <a href=\"https:\/\/thomasmock.netlify.com\/post\/tidytuesday-a-weekly-social-data-project-in-r\/\">TidyTuesday<\/a> project.<\/p>\n<p>When sitting down to work with a new data set, it is important to familiarize yourself with the variables in each column, get a sense of what sort of values you may be dealing with, and quickly identify any potential issues with the data that may require your attention.<\/p>\n<p>To inspect the types of variables you are dealing with, the functions <span style=\"color: #0000ff;\">str()<\/span> in base R or <span style=\"color: #0000ff;\">glimpse()<\/span> in the tidyverse can be useful. If it&#8217;s summary statistics you&#8217;re after, the psych package&#8217;s <span style=\"color: #0000ff;\">describe()<\/span> function will do the trick. The <span style=\"color: #0000ff;\">summary()<\/span> function in base R can also be useful for getting the min, max, mean, median, first and third quartiles, and the number of missing values (NA) in each column.<\/p>\n<p>The issue is that you have to call several functions to get the info you want &#8212; variable types, number of missing values, and summary statistics. Thus, I decided to create my own data dictionary function. 
After passing your data frame to the function, you will get the name of each variable, the variable type, the number of missing values for each variable, the total amount of data (rows) for each variable, and a host of summary statistics such as mean, standard deviation, median, standard error, min, max, and range. While the function defaults to printing the results in your R console, you can set the argument <span style=\"color: #0000ff;\">print_table = &#8220;Yes&#8221;<\/span> and the results will be returned in a nice table that you can use for reports or presentations to colleagues.<\/p>\n<p>Let&#8217;s take a look at the function in action.<\/p>\n<p>First, we will create some fake data:<\/p>\n<pre class=\"brush: r; title: ; notranslate\" title=\"\">\r\n\r\nNames &lt;- c(&quot;Sal&quot;, &quot;John&quot;, &quot;Jeff&quot;, &quot;Karl&quot;, &quot;Ben&quot;)\r\nHomeTown &lt;- c(&quot;CLE&quot;, &quot;NYC&quot;, &quot;CHI&quot;, &quot;DEN&quot;, &quot;SEA&quot;)\r\nvar1 &lt;- rnorm(n = length(Names), mean = 10, sd = 2)\r\nvar2 &lt;- rnorm(n = length(Names), mean = 300, sd = 150)\r\nvar3 &lt;- rnorm(n = length(Names), mean = 1000, sd = 350)\r\nvar4 &lt;- c(6, 7, NA, 3, NA)\r\n\r\ndf &lt;- data.frame(Names, HomeTown, var1, var2, var3, var4)\r\ndf\r\n\r\n<\/pre>\n<p>&nbsp;<\/p>\n<p>&nbsp;<\/p>\n<p><a href=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2020\/01\/Screen-Shot-2020-01-25-at-2.16.54-PM.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-1619\" src=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2020\/01\/Screen-Shot-2020-01-25-at-2.16.54-PM.png\" alt=\"\" width=\"744\" height=\"208\" srcset=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2020\/01\/Screen-Shot-2020-01-25-at-2.16.54-PM.png 744w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2020\/01\/Screen-Shot-2020-01-25-at-2.16.54-PM-300x84.png 300w, 
https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2020\/01\/Screen-Shot-2020-01-25-at-2.16.54-PM-624x174.png 624w\" sizes=\"auto, (max-width: 744px) 100vw, 744px\" \/><\/a><\/p>\n<p>We can see from the output that the data contain two NA values in the var4 column. Additionally, the first two columns are not numeric values. We can run the <span style=\"color: #0000ff;\">data_dict()<\/span> function I&#8217;ve created to get a readout of the data we are looking at.<\/p>\n<p>First, let&#8217;s look at the output in the R console:<\/p>\n<pre class=\"brush: r; title: ; notranslate\" title=\"\">\r\n\r\n# without table\r\ndata_dict(df, print_table = &quot;No&quot;)\r\n\r\n<\/pre>\n<p><a href=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2020\/01\/Screen-Shot-2020-01-25-at-2.22.39-PM.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-1620\" src=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2020\/01\/Screen-Shot-2020-01-25-at-2.22.39-PM-1024x186.png\" alt=\"\" width=\"652\" height=\"119\" srcset=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2020\/01\/Screen-Shot-2020-01-25-at-2.22.39-PM-1024x186.png 1024w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2020\/01\/Screen-Shot-2020-01-25-at-2.22.39-PM-300x54.png 300w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2020\/01\/Screen-Shot-2020-01-25-at-2.22.39-PM-768x139.png 768w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2020\/01\/Screen-Shot-2020-01-25-at-2.22.39-PM-624x113.png 624w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2020\/01\/Screen-Shot-2020-01-25-at-2.22.39-PM.png 1346w\" sizes=\"auto, (max-width: 652px) 100vw, 652px\" \/><\/a><\/p>\n<p>We immediately get output that consolidates key information to help us quickly evaluate our data set.<\/p>\n<p>By setting the argument <span style=\"color: 
#0000ff;\">print_table = &#8220;Yes&#8221;<\/span> we will get our result in a nice table format.<\/p>\n<pre class=\"brush: r; title: ; notranslate\" title=\"\">\r\n\r\n# with table\r\ndata_dict(df, print_table = &quot;Yes&quot;)\r\n\r\n<\/pre>\n<p><a href=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2020\/01\/Screen-Shot-2020-01-25-at-2.26.30-PM.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-large wp-image-1621\" src=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2020\/01\/Screen-Shot-2020-01-25-at-2.26.30-PM-1024x276.png\" alt=\"\" width=\"625\" height=\"168\" srcset=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2020\/01\/Screen-Shot-2020-01-25-at-2.26.30-PM-1024x276.png 1024w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2020\/01\/Screen-Shot-2020-01-25-at-2.26.30-PM-300x81.png 300w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2020\/01\/Screen-Shot-2020-01-25-at-2.26.30-PM-768x207.png 768w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2020\/01\/Screen-Shot-2020-01-25-at-2.26.30-PM-624x168.png 624w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2020\/01\/Screen-Shot-2020-01-25-at-2.26.30-PM.png 1620w\" sizes=\"auto, (max-width: 625px) 100vw, 625px\" \/><\/a><\/p>\n<p>Let&#8217;s look at the results in table format for a much larger data set &#8212; the Lahman Baseball Batting data set.<\/p>\n<p><a href=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2020\/01\/Screen-Shot-2020-01-25-at-2.28.51-PM.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-1622\" src=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2020\/01\/Screen-Shot-2020-01-25-at-2.28.51-PM.png\" alt=\"\" width=\"783\" height=\"610\" 
srcset=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2020\/01\/Screen-Shot-2020-01-25-at-2.28.51-PM.png 783w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2020\/01\/Screen-Shot-2020-01-25-at-2.28.51-PM-300x234.png 300w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2020\/01\/Screen-Shot-2020-01-25-at-2.28.51-PM-768x598.png 768w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2020\/01\/Screen-Shot-2020-01-25-at-2.28.51-PM-624x486.png 624w\" sizes=\"auto, (max-width: 783px) 100vw, 783px\" \/><\/a><\/p>\n<p>As you can see, it is a pretty handy function. Very quickly we can identify:<\/p>\n<p>1) The types of variables in our data<br \/>\n2) The amount of data in each column<br \/>\n3) The number of missing values in each column<br \/>\n4) A variety of summary statistics<\/p>\n<p>If you&#8217;re interested in using the function, you can obtain it on my <a href=\"https:\/\/github.com\/pw2\/Data-Dictionary-Function\">GitHub page<\/a>.<\/p>\n<p>&nbsp;<\/p>\n","protected":false},"excerpt":{"rendered":"<p>In my previous post, I did a bit of impromptu analysis on some Powerlifting data provided by the TidyTuesday project. 
When sitting down to work with a new data set it is important to familiarize yourself with the variables in each column, get a grasp for what sort of values you may be dealing with, [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[45],"tags":[],"class_list":["post-1616","post","type-post","status-publish","format-standard","hentry","category-r-tips-tricks"],"_links":{"self":[{"href":"https:\/\/optimumsportsperformance.com\/blog\/wp-json\/wp\/v2\/posts\/1616","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/optimumsportsperformance.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/optimumsportsperformance.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/optimumsportsperformance.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/optimumsportsperformance.com\/blog\/wp-json\/wp\/v2\/comments?post=1616"}],"version-history":[{"count":3,"href":"https:\/\/optimumsportsperformance.com\/blog\/wp-json\/wp\/v2\/posts\/1616\/revisions"}],"predecessor-version":[{"id":1618,"href":"https:\/\/optimumsportsperformance.com\/blog\/wp-json\/wp\/v2\/posts\/1616\/revisions\/1618"}],"wp:attachment":[{"href":"https:\/\/optimumsportsperformance.com\/blog\/wp-json\/wp\/v2\/media?parent=1616"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/optimumsportsperformance.com\/blog\/wp-json\/wp\/v2\/categories?post=1616"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/optimumsportsperformance.com\/blog\/wp-json\/wp\/v2\/tags?post=1616"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}