# R Tips & Tricks: Dealing With Missing Values

A colleague contacted me last week asking about how they could fill in missing values in their data frame with 0’s. Let’s face it, missing values are an unfortunate reality when working with data and they can be a real pain in the butt. There have been a number of papers that discuss ways of handing missing values (e.g., various imputation approaches) and it is not my intention to address those here as the approach you choose to use would depend on the data and what you are trying to do.

My goal with this installment of R Tips & Tricks is to provide you with a way to remove NA’s, NaN’s, and Inf’s from your data set or to replace them in with some specified value (e.g., the column median or 0). The latter approach can also be used if you wanted to impute values, as suggested above (e.g, use a regression equation to forecast what the value would be given other data, used as independent variables in the model, in your data set).

Create Some Data

First we need to load the tidyverse package and create some data.

```## Load tidyverse
library(tidyverse)

## Create fake data
df <- tibble(
var1 = c(4, 2, Inf, 3, NA, 6, NA, NaN, 44, 23, 78),
var2 = c(7.8, NaN, 70, 1, 8, -Inf, NA, 99, 12, 3, 2.2))
```

As we can see, our data has a bunch of NA, NaN, Inf, and -Inf characters in it. Creating the data as a tibble allows us to quickly identify these characters as they are colored in red: How many missing values and Inf’s are we dealing with?

Before removing rows with NA’s and Inf’s or trying to impute values it is best to see how many of these characters are in each columns. Below is some code to quickly evaluate this but if you have a larger data set and want to get other summary measures for each column you may want to check out one of my older posts on how to build a data dictionary.

```## Get a count of the number of NA, NaN, and Inf in each column
data.frame(NA_cols = sapply(df, function(x) sum(length(which(is.na(x))))),
NaN_cols = sapply(df, function(x) sum(length(which(is.nan(x))))),
Inf_cols = sapply(df, function(x) sum(length(which(is.infinite(x)))))) %>%
t()
```

The above code produces a table of counts of NA, NaN, and Inf’s in both of our columns (var1, var2). Removing the rows with NA, NaN, and Inf

Sometimes, you might just want to remove the rows with missing values or Inf.

NOTE: If you are going to do this it is critical that you have a good reason to do so and you know exactly why you are doing it and how it might change any outcomes generated from your analysis!

There are two easy ways to do this in tidyverse. One way is to simply remove these characters one-by-one. In this way, you can simply indicate which column you want and which rows you’d like to remove.

```# One row at a time
df %>%
filter(!is.na(var1),
!is.na(var2),
!is.infinite(var1),
!is.infinite(var2))
```

The other option (useful if you have a data set with a large number of columns) is to pass the above functions to any columns that are integers.

```# All numeric rows at once
df %>%
filter_if(is.numeric, all_vars(!is.na(.))) %>%
filter_if(is.numeric, all_vars(!is.infinite(.)))
```

Both approaches produce the same result: Replace NA, NaN, and Inf with a specific value

Finally, we can replace the NA, NaN, or Inf with a specific value. In the below code, I replace Na and NaN with the median value of the respective column and I replace Inf and -Inf with a 0. If you had an equation for imputing values, this would also be a way that you could do it.

```## Use tidyverse to convert NA and NaN to the median of column 1 and Inf to 0 in both columns

df <- df %>% mutate(var1_new = ifelse(is.na(var1), median(var1, na.rm = T), ifelse(
is.infinite(var1), 0, var1)),
var2_new = ifelse(is.na(var2), median(var2, na.rm = T), ifelse(
is.infinite(var2), 0, var2)))
```

After running that code we can see in the two new columns where the characters were replaced with the specified values. Missing values can bring all sorts of issues in data clean up and analysis. Think hard before you decide to remove them or what approach you might take to impute values. In this case, we used median imputation for NA and NaN but there are a number of other approaches that are more rigorous and will offer more reliable results.

The code for this blog is available on my GITHUB Page.

# R Tips & Tricks: Setting up your data to compare yesterday to today

A common question sports scientists have is, “how did something that happened yesterday effect today?” For example, the sport scientist might be interested to know how yesterday’s training load influences tomorrow’s level of subjective soreness. In this case, the data is usually offset by a day, as we will see in the example.

Rather than wasting time each day copying and pasting values in excel (and potentially making a mistake), we can use the lag() function from the tidyverse package to manipulate our data into the format we need for analysis.

In today’s R Tips & Tricks blog post I’ll walk through three different approaches to doing this, each a little more complex than the last.

First, let’s load the packages we will need to manipulate our data set.

```## Load packages
library(tidyverse)
library(lubridate)
```

Example 1: Simple Example

First simulate some fake data

```day <- 1:10
trainingLoad <- round(rnorm(n = length(day), mean = 460, sd = 60))
soreness <- round(runif(n = length(day), min = 3, max = 6), 0)
df
``` We have 10 days recorded and we want to evaluate the training load from the previous day with the soreness of the next day. For example, the training load on day 1 needs to be compared to the level of soreness the next morning, on day 2. As such, the values of interest are offset by one day.

To solve this issue we can use the lag() function for training load. What this will do is take the value one row up and move it one row down.

```df %>%
``` Notice that now the 416 training load units on day 1 are on the same row as the soreness on day 2 in our new column, trainingLoad_lag.

Example 2: Working Across Weeks

The above example is rather simple and assumes that all training and soreness reporting take place on consecutive days. Unfortunately, in real life we are often dealing with training across multiple weeks where there may be days off between training sessions.

For example, the data might look like this:

```date <- c(seq(as.Date("2020/01/05"), as.Date("2020/01/08"), by = "days"),
seq(as.Date("2020/01/12"), as.Date("2020/01/15"), by = "days"))
trainingLoad <- round(rnorm(n = length(date), mean = 460, sd = 60), 0)
soreness <- round(runif(n = length(date), min = 3, max = 6), 0)
df
``` Let’s look what happens if we blindly apply the lag() function

```df %>%
``` Notice the issue here. We have a group of 4 consecutive training sessions that ends on 1/8/2020 and a second group of 4 consecutive sessions starting on 1/12/2020. As such, the lag function just works across the data set and makes the assumption that these are all consecutive days. If we analyzed this data in this fashion we might come up with strange outcomes given that the soreness experienced on 1/12/2020 might not be due to the session that happened 4 days ago on 1/8/2020.

We can solve this issue in one of two ways.

Fix #1: Always create a mesocycle variable in your data to represent the weeks. This will allow you to group_by() that variable.

```mesocycle <- rep(c(1, 2), each = 4)
df <- data.frame(mesocycle, df) df # Group by Mesocycle df %>%
group_by(mesocycle) %>%
``` After adding the mesocycle variable and then using it the group_by() function we achieve the correct data manipulation where day 1 of each of the training mesocycles starts with NA in the trainingLoad_lag column indicating that no session occurred the day prior.

Fix #2: Use the week() function from the lubridate package and have R automatically find the week of the year corresponding to the date of the training session.

```date <- c(seq(as.Date("2020/01/05"), as.Date("2020/01/08"), by = "days"),
seq(as.Date("2020/01/11"), as.Date("2020/01/14"), by = "days"))
trainingLoad <- round(rnorm(n = length(date), mean = 460, sd = 60), 0)
soreness <- round(runif(n = length(date), min = 3, max = 6), 0)
df

df <- df %>%
mutate(trainingWeek = week(date))

# Group by trainingWeek

df %>%
group_by(trainingWeek) %>%
``` Running all of the above code we find that the week() function identified the week of the year based on the date and then we were able to group_by() the training week variable to come to the same outcome as we did in Fix 1.

Example 3: Working Across Weeks with Multiple Athletes

```athlete <- rep(LETTERS[1:3], each = 8)
mesocycle <- rep(rep(c(1, 2), each = 4), times = 3)
date <- rep(c(seq(as.Date("2020/01/05"), as.Date("2020/01/08"), by = "days"),
seq(as.Date("2020/01/12"), as.Date("2020/01/15"), by = "days")), times = 3)
trainingLoad <- round(rnorm(n = length(date), mean = 460, sd = 60), 0)
soreness <- round(runif(n = length(date), min = 3, max = 6), 0)
df <- data.frame(athlete, mesocycle, date, trainingLoad, soreness)

# Group by athlete and mesocycle
df %>%
group_by(athlete, mesocycle) %>%
as.data.frame()
```

All we need to do is pass the group_by() function athlete and mesocycle and R will apply the lag() function to our training_load variable based on these parameters. Notice that R correctly grouped by the 3 athletes and the 2 mesocycles (4 sessions per mesocycle) for each athlete. In doing so, we have an NA for the first day of each mesocycle for each athlete.

# R Tips & Tricks: Joining Data Sets

I get a lot of questions from students and colleagues in Sports Science regarding how to do various tasks in R. Most are coming from a strong background in Excel so delving into a language like R can have a steep learning curve. As such, I decided to put together this series called R Tips & Tricks to share some of the different tasks you might already be doing in excel that you can do in R.

Today, I’ll discuss joining different data sets. Excel users would commonly do this with VLOOKUP or some type of INDEX. In R, we will use the series of join functions that can be found in the dplyr package.

First, we load the tidyverse package, which contains a suite of packages (dplyr, ggplot2, purrr, and others) useful for data manipulation, data cleaning, and data visualization. Those interested in some of the capabilities of tidyverse should check out the TidyX Screencast series myself and Ellis Hughes have been doing.

```

library(tidyverse)

## Make two data frames

trainingData <- data.frame(
Name = c(rep(c("Bob", "Mike", "Jeff"), each = 4), rep("James", each = 3)),
Day = c(rep(1:4, times = 3), 1:3),
trainignLoad = round(rnorm(n = 15, mean = 400, sd = 150), 0))

wellnessData <- data.frame(
Name = c(rep(c("Bob", "Mike", "Jeff"), each = 2), rep("James", each = 4)),
Day = c(rep(1:2, times = 3), 1:4),
wellness = round(rnorm(n = 10, mean = 6, sd = 1), 0))

```

Here is what the two data sets look like: These data sets represent a common issue in sports science, where you might have training data (e.g., GPS data) on one computer and wellness questionnaire data on another. The goal is to bring them together in a centralized way so that you can do further analysis, build reports, or build visualizations.

We will detail five of the main join functions you can use for this task, depending on your needs:

1) left_join()
2) right_join()
3) inner_join()
4) anti_join()
4) full_join()

left_join()

left_join() looks for all of the matches between the left and right data frames and retains all of the rows in the left data frame, putting NA in any row where there is not a match in the right data frame.

Let’s join the trainingData (left data frame) to the wellnessData (right data frame) on the columns “Name” and “Day”.

```
trainingData %>%
left_join(wellnessData, by = c("Name", "Day")) %>%
as.data.frame()

```

After running this code, we see that we retain all 15 rows in the trainingData and in instances where an athlete may have forgot to put in the wellness data (e.g., Day 3 and 4 for Bob) R gives us an NA. right_join()

right_join(), as you would imagine, behaves in the opposite way as left_join(). Here, right_join() looks for all of the matches between the right and left data frames and retains all of the rows in the right data frame, putting NA in any row where there is not a match in the left data frame.

Let’s join the training load data to the wellness data using right_join().

```trainingData %>%
right_join(wellnessData, by = c("Name", "Day")) %>%
as.data.frame()
```

The data frame that gets returned after running that code has gotten smaller because we are only retaining rows from the right data frame (wellness) that were the same in the left data frame (training) based on our join criteria (Name and Day). inner_join()

inner_join() only retains the complete matches between the left and right data frames and discards all other rows.

```trainingData %>%
inner_join(wellnessData, by = c("Name", "Day")) %>%
as.data.frame()
```

Running this code returns only the 9 matching rows based on our join criteria (Name and Day). anti_join()

As the name would imply, anti_join() only returns the rows where there are NO matches between the left and right data frames based on the join criteria (Name and Day) and discards the rest.

```trainingData %>%
anti_join(wellnessData, by = c("Name", "Day")) %>%
as.data.frame()
```

After running the code we are returned the 6 rows that were unmatched between the two data sets. There are a few instances where this type of join is useful. One use case was detailed in our TidyX 4 screen cast, where we did some text analysis. full_join()

Finally, the full_join() will join all rows in the left data frame with all rows in the right data frame and it will put NA in columns for any rows that don’t have a match.

```trainingData %>%
full_join(wellnessData, by = c("Name", "Day")) %>%
as.data.frame()
```

Here we see that we are returned all 15 rows of the largest data frame and there are NA place holders anywhere that data was empty when joining on the specified join criteria (Name and Day). All code is available at my

# Creating A Data Dictionary Function in R

In my previous post, I did a bit of impromptu analysis on some Powerlifting data provided from the TidyTusday project.

When sitting down to work with a new data set it is important to familiarize yourself with the variables in each column, get a grasp for what sort of values you may be dealing with, and quickly identify any potential issues with the data that may require your attention.

For looking at the type of variables you are dealing with the functions str() in base R or glimpse() in tidyverse can be useful. If it’s summary statistics you’re after, the psych package’s describe() function will do the trick. The summary() function in base R can also be useful for getting min, max, mean, median, IQR, and the number of missing values (NA) in each column.

The issue with this is that you have to go through a few steps to get the info you want — variable types, number of missing values, and summary statistics. Thus, I decided to create my own data dictionary function. After passing your data frame to the function, you will get the name of each variable, the variable type, the number of missing values for each variable, the total amount of data (rows) for each value, and a host of summary statistics such as mean, standard deviation, median, standard error, min, max, and range. While the function defaults to printing the results in your R console you can choose to set the argument print_table = “Yes” and the results will be returned in a nice table that you can use for reports or presentations to colleagues.

Let’s take a look at function in action.

First, we will create some fake data:

```
Names <- c("Sal", "John", "Jeff", "Karl", "Ben")
HomeTown <- c("CLE", "NYC", "CHI", "DEN", "SEA")
var1 <- rnorm(n = length(Names), mean = 10, sd = 2)
var2 <- rnorm(n = length(Names), mean = 300, sd = 150)
var3 <- rnorm(n = length(Names), mean = 1000, sd = 350)
var4 <- c(6, 7, NA, 3, NA)

df <- data.frame(Names, HomeTown, var1, var2, var3, var4)
df

``` We can see from the output that the code includes a few NA values in the var4 column. Additionally, the first two columns are not numeric values. We can run the data_dict() function I’ve created to get a read out of the data we are looking at.

First, let’s look at the output in the R console:

```
# without table
data_dict(df, print_table = "No")

``` We are immediately returned an output that consolidates some key information for helping us quickly evaluate our data set.

By setting the argument print_table = “Yes” we will get our result in a nice table format.

```
# with table
data_dict(df, print_table = "Yes")

``` Let’s look at the results in table format for a much larger data set — the Lahman Baseball Batting data set. As you can see, it is a pretty handy function. Very quickly we can identify:

1) The types of variables in our data
2) The amount of data in each column
3) The number of missing values in each column
4) A variety of summary statistics

If you’re interested in using the function, you can obtain it on my GitHub page.

# Data Analysis Template in R Markdown & Jupyter Notebook

The nice thing about working on a team with other analysts, working as part of a research group, or working on your PhD is the ability to share analysis with other colleagues, get feedback, and learn new ways of thinking about things.

Interestingly, when I’ve inquired to colleagues at some teams about how they share their analysis with their group they often say that, “people do their analysis and just present the results”. I think this is a big miss in terms of being able to have transparency in the research process, sharing so that others can learn or help to provide constructive feedback, and walking through the steps you went through (data retrieval,  data cleaning, analysis, model testing, etc) to ensure that things make sense to the group before being shared with the end user.

For the PhD student, a more streamlined approach to the entire analysis can help them talk through what they did with their advisors, ensure that all the correct steps were taken during the analysis, and have greater confidence about what their data is and is not saying (which can really come in handy when it is time to defend the thesis!). When I was doing my PhD I would often try and put all my steps into a power point presentation to walk through with my supervisors. I never liked that, however, because it always felt clumsy and I was never really getting to the guts of the analysis as much as I was just sharing the outcomes of what I did and why I did it and talking through how I did it. A template that allows for a clear presentation would have made things much easier for both myself and my supervisors

In my last post, I used R Markdown to create a report that allows the sport scientist to share some basic data analysis with the strength and conditioning staff and other coaches. As I said in that post, R Markdown is a wonderful resource for creating reports where you can hide your code and simply show visualizations of the data and model outputs. But, what if we don’t want to hide our code?! In this sense, R Markdown is extremely useful for setting up a data analysis template to allow you to walk through all the steps in your project and share the results with your colleagues or PhD supervisors. Additionally, you could also keep R Studio open when presenting your findings and address any changes/suggestions that people may have, in real time before, “knitting” the markdown file into the final html or pdf document. This last part allows the analysis to come to life and allows you to make direct changes and immediately show how they impact the outcome of the analysis!

Data Analysis Templates

There are a number of different data analysis frameworks one could follow. Two that immediately come to mind are the Cross Industry Standard Process for Data Mining (CRISP-DM) and the Problem, Plan, Data, Analysis, and Conclusion (PPDAC) Cycle.

Although they come from different industries — CRSIP-DM from the business world  and PPDAC from more of the statistics education world — there is considerable overlap and both have the aim of providing the analyst with a clear path to answering their research question.

The objectives of each phase within these two frameworks is shown below.

As you can see, the end goal of the analysis is different between the two frameworks: CRISP-DM being targeted at deploying a model specific to business use cases and PPDAC providing more of a runway for scientific publication. However, both can provide us with an appreciation for creating a systematic process around data analysis, allowing for a clear explanation of our approach when discussing with colleagues or PhD supervisors.

In an attempt to create something more generic and less specific to a certain industry or field, I came up with my own framework: The framework is freely available on my GitHub page in both an R Markdown and Jupyter Notebook (if you prefer Python) formats. If you’d like to see with the R Markdown HTML looks like, click here >> PWard_-_Data_Analysis_Framework.

All you have to do is take the template (either R Markdown or Jupyter Notebook), delete the comments that I have under each section and fill in your own comments and your R or Python script, where applicable, to conduct your analysis. Once complete, you will have a clean looking file that details your entire approach.

I’ve made a simple example of what using the template could look like. If you are interested in seeing the result in R Markdown, CLICK HERE >> Data_Analysis_Framework_Example_–_MLB_Hitting. If you are interested in seeing the result in a Python Jupyter Notebook, CLICK HERE >> Data Analysis Framework Example — MLB Hitting (Jupyter).

All of the code and the templates for use are available on my GitHub page.