TidyX Episode 12: Data Cleaning & Thomas Mock

This week, Ellis Hughes and I welcome our first guest to the episode, TidyTuesday creator, Thomas Mock!

Together, we go over a code submission by Joshua de la Bruere, whose code provides a good lesson in data cleaning for a rather messy data set. We then have a discussion with Thomas regarding all things R, TidyTuesday, and a bit about his PhD in Neuroscience!

To watch the screen cast, CLICK HERE.

To obtain the code, CLICK HERE.

TidyX Episode 11: Time Series Plots

This week, Ellis Hughes and I use pro beach volleyball data, provided by TidyTuesday, to visualize winning percentage for different teams over time. We then show a simple Bayes adjustment for winning percentage when teams have small sample sizes (similar to what I discussed in an older blog article). Finally, we plot distributions of winning percentage to compare different teams. But before that, we go over code from Eric Ekholm, who produced a really cool plot of the performance metrics of Kerri Walsh-Jennings and Misty May-Treanor across their respective careers.
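
For readers unfamiliar with the idea, here is a minimal sketch of one common form of Bayes adjustment: shrinking each team's raw winning percentage toward the overall average using a beta prior. This is not the code from the episode. The teams, the win counts, and the prior strength of 20 pseudo-matches are all made up for illustration.

```r
## Hypothetical teams (not the TidyTuesday volleyball data)
teams <- data.frame(
  team    = c("Team A", "Team B", "Team C"),
  wins    = c(3, 45, 1),
  matches = c(4, 80, 1))

## Prior: centered on the overall win rate, worth 20 pseudo-matches (an assumption)
prior_mean     <- sum(teams$wins) / sum(teams$matches)
prior_strength <- 20
alpha0 <- prior_mean * prior_strength

## Raw vs. adjusted winning percentage (posterior mean of a beta-binomial model)
teams$raw_pct <- teams$wins / teams$matches
teams$adj_pct <- (teams$wins + alpha0) / (teams$matches + prior_strength)

teams
```

Teams with only a handful of matches get pulled strongly toward the overall average, while teams with many matches barely move.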

To watch the screen cast, CLICK HERE.

For our code, CLICK HERE.

TidyX Episode 10: Exploratory Data Analysis & ggplotly

This week, instead of working through someone else’s code, Ellis Hughes and I decided to use the TidyTuesday data set on volcano eruptions and walk through how we might start to approach looking at the data from an exploratory data analysis perspective.

Along the way, we not only show how we approach cleaning the data and creating new features but also touch on the value and ease of using ggplotly() to quickly construct interactive visualizations from the ggplot2 visuals you’ve already built. These types of visuals provide a useful and interesting way to explore your data and can be valuable when sharing information with decision makers to assist discussions around how to proceed with further analysis.
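
As a rough illustration of how little extra code this takes, the sketch below wraps an ordinary ggplot2 scatter plot in ggplotly(). It uses the built-in mtcars data rather than the volcano data from the episode, and it assumes the plotly package is installed.

```r
library(ggplot2)
library(plotly)

## Build a regular static ggplot2 plot (mtcars is a stand-in data set)
p <- ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point() +
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon", color = "Cylinders")

## Convert the static plot into an interactive one
p_interactive <- ggplotly(p)

p_interactive
```

Hovering over a point in the resulting plot shows the underlying values, which is handy when walking someone else through the data.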

To watch the screen cast, CLICK HERE.

To view our code, CLICK HERE.

R Tips & Tricks: Dealing With Missing Values

A colleague contacted me last week asking about how they could fill in missing values in their data frame with 0’s. Let’s face it, missing values are an unfortunate reality when working with data and they can be a real pain in the butt. There have been a number of papers that discuss ways of handling missing values (e.g., various imputation approaches) and it is not my intention to address those here, as the approach you choose would depend on the data and what you are trying to do.

My goal with this installment of R Tips & Tricks is to provide you with a way to remove NA’s, NaN’s, and Inf’s from your data set or to replace them with some specified value (e.g., the column median or 0). The latter approach can also be used if you wanted to impute values, as suggested above (e.g., use a regression equation to forecast what the value would be given other data in your data set, used as independent variables in the model).

Create Some Data

First we need to load the tidyverse package and create some data.

## Load tidyverse
library(tidyverse)


## Create fake data
df <- tibble(
  var1 = c(4, 2, Inf, 3, NA, 6, NA, NaN, 44, 23, 78),
  var2 = c(7.8, NaN, 70, 1, 8, -Inf, NA, 99, 12, 3, 2.2))

As we can see, our data has a bunch of NA, NaN, Inf, and -Inf values in it. Creating the data as a tibble allows us to quickly identify these values when the tibble is printed, as they are colored in red.

How many missing values and Inf’s are we dealing with?

Before removing rows with NA’s and Inf’s or trying to impute values, it is best to see how many of these values are in each column. Below is some code to quickly evaluate this, but if you have a larger data set and want to get other summary measures for each column you may want to check out one of my older posts on how to build a data dictionary.

## Get a count of the number of NA, NaN, and Inf in each column
## Note: is.na() is also TRUE for NaN, so NA_cols includes the NaN count
data.frame(NA_cols = sapply(df, function(x) sum(is.na(x))),
           NaN_cols = sapply(df, function(x) sum(is.nan(x))),
           Inf_cols = sapply(df, function(x) sum(is.infinite(x)))) %>%
  t()

The above code produces a table of counts of NA, NaN, and Inf’s in both of our columns (var1, var2).
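
One thing to keep in mind when reading such a table is that is.na() returns TRUE for NaN as well as NA, so an NA count based on is.na() alone also includes the NaN’s. If you want the two kept separate, a small helper like the one below is one way to do it. The helper name count_special is made up for this sketch.

```r
library(tibble)

## Same fake data as above
df <- tibble(
  var1 = c(4, 2, Inf, 3, NA, 6, NA, NaN, 44, 23, 78),
  var2 = c(7.8, NaN, 70, 1, 8, -Inf, NA, 99, 12, 3, 2.2))

## Hypothetical helper: count each kind of special value separately
count_special <- function(x) {
  c(NA_only  = sum(is.na(x) & !is.nan(x)),  # true NA's, excluding NaN
    NaN_only = sum(is.nan(x)),
    Inf_any  = sum(is.infinite(x)))         # counts both Inf and -Inf
}

## One column of counts per variable
counts <- sapply(df, count_special)
counts
```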

Removing the rows with NA, NaN, and Inf

Sometimes, you might just want to remove the rows with missing values or Inf.

NOTE: If you are going to do this it is critical that you have a good reason to do so and you know exactly why you are doing it and how it might change any outcomes generated from your analysis!

There are two easy ways to do this in the tidyverse. One way is to remove these values one column at a time. In this way, you simply name each column and the condition its rows must meet to be kept.

# One column at a time
df %>%
  filter(!is.na(var1),
         !is.na(var2),
         !is.infinite(var1),
         !is.infinite(var2))

The other option (useful if you have a data set with a large number of columns) is to pass the above functions to all of the numeric columns at once.

# All numeric columns at once
df %>%
  filter_if(is.numeric, all_vars(!is.na(.))) %>%
  filter_if(is.numeric, all_vars(!is.infinite(.)))

Both approaches produce the same result:
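
Here is a quick sketch to check that claim on the fake data. It also shows a third way to write the filter, using if_all() with is.finite(), which is FALSE for NA, NaN, Inf, and -Inf alike. This assumes a version of dplyr recent enough to include if_all() and where().

```r
library(dplyr)

## Same fake data as above
df <- tibble(
  var1 = c(4, 2, Inf, 3, NA, 6, NA, NaN, 44, 23, 78),
  var2 = c(7.8, NaN, 70, 1, 8, -Inf, NA, 99, 12, 3, 2.2))

## Approach 1: one column at a time
by_column <- df %>%
  filter(!is.na(var1),
         !is.na(var2),
         !is.infinite(var1),
         !is.infinite(var2))

## Alternative: keep rows where every numeric column is finite
all_at_once <- df %>%
  filter(if_all(where(is.numeric), is.finite))

## Compare the two results
nrow(by_column)
isTRUE(all.equal(by_column, all_at_once))
```

Only the rows where both var1 and var2 hold ordinary numbers survive the filter.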

Replace NA, NaN, and Inf with a specific value

Finally, we can replace the NA, NaN, or Inf with a specific value. In the code below, I replace NA and NaN with the median value of the respective column and I replace Inf and -Inf with 0. If you had an equation for imputing values, this would also be a way that you could do it.

## Use tidyverse to convert NA and NaN to the column median and Inf/-Inf to 0 in both columns
## Note: na.rm = TRUE drops NA and NaN but keeps Inf/-Inf in the median calculation

df <- df %>%
  mutate(
    var1_new = ifelse(is.na(var1), median(var1, na.rm = TRUE),
                      ifelse(is.infinite(var1), 0, var1)),
    var2_new = ifelse(is.na(var2), median(var2, na.rm = TRUE),
                      ifelse(is.infinite(var2), 0, var2)))

After running that code, we can see in the two new columns where the NA, NaN, and Inf values were replaced with the specified values.
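
If you have many columns, typing out a nested ifelse() for each one gets tedious. Below is a sketch of the same replacement rule applied to every numeric column at once with across(). The helper name replace_special is made up for this example, and the code assumes a version of dplyr that includes across() and where().

```r
library(dplyr)

## Same fake data as above
df <- tibble(
  var1 = c(4, 2, Inf, 3, NA, 6, NA, NaN, 44, 23, 78),
  var2 = c(7.8, NaN, 70, 1, 8, -Inf, NA, 99, 12, 3, 2.2))

## Hypothetical helper: NA/NaN -> column median, Inf/-Inf -> 0
replace_special <- function(x) {
  med <- median(x, na.rm = TRUE)  # same median as the ifelse() version above
  case_when(
    is.na(x)       ~ med,         # is.na() catches both NA and NaN
    is.infinite(x) ~ 0,
    TRUE           ~ x)
}

## Apply to all numeric columns, writing results to *_new columns
df_new <- df %>%
  mutate(across(where(is.numeric), replace_special, .names = "{.col}_new"))

df_new
```

Swapping the body of the helper for a different rule (for example, predictions from a regression model) changes the imputation without touching the rest of the pipeline.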

Missing values can bring all sorts of issues in data clean up and analysis. Think hard before you decide to remove them or what approach you might take to impute values. In this case, we used median imputation for NA and NaN but there are a number of other approaches that are more rigorous and will offer more reliable results.

The code for this blog is available on my GITHUB Page.

TidyX Episode 9: Creating Data Tables

In this episode, Ellis Hughes and I discuss the use of data tables to accompany your visuals in rmarkdown reports.

First, we go over code provided by Ted Laderas, who used the gt package to create some nice data tables on the Animal Crossing data set from the TidyTuesday Project.

Then, we apply a similar approach to NFL Kicker data and show how you can add conditional formatting to your table to aid in data presentation.
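
To give a flavor of what conditional formatting looks like in gt, here is a minimal sketch. It is not the code from the episode. The kickers, their numbers, and the 85% cutoff are all invented for illustration, and the gt package is assumed to be installed.

```r
library(gt)

## Made-up kicker data (not the data used in the episode)
kickers <- data.frame(
  kicker  = c("Kicker A", "Kicker B", "Kicker C"),
  fg_att  = c(30, 25, 28),
  fg_made = c(27, 19, 25))
kickers$fg_pct <- kickers$fg_made / kickers$fg_att

## Build the table and show field goal percentage as a percent
tbl <- gt(kickers)
tbl <- fmt_percent(tbl, columns = fg_pct, decimals = 1)

## Conditional formatting: shade the cell when fg_pct is at least 85%
tbl <- tab_style(
  tbl,
  style     = cell_fill(color = "lightgreen"),
  locations = cells_body(columns = fg_pct, rows = fg_pct >= 0.85))

tbl
```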

For the screen cast, CLICK HERE.

For our code, CLICK HERE.