Category Archives: TidyX Screen Cast

TidyX 68: Cleaning Messy Data, Part 5 – Writing Results to an External File

For the last 4 screen casts, Ellis Hughes and I have been showing methods of cleaning messy excel data. Now that you have the data cleaned and structured in a data frame, you might need to write that data back out to an external file. Perhaps you need to save it as a csv and upload it to a database or maybe you want to write it out to a cleaner excel file that has the data in an accessible manner for your end users.

This week, we will write our cleaned data out to csv and excel in four formats:

  1. Write to csv
  2. Write the flat file out to a single excel sheet
  3. Write the flat file out to individual athlete tabs within an excel sheet
  4. Write the flat file out to individual athlete tabs within an excel sheet along with a bunch of custom excel formatting, available via the {openxlsx} package.

To watch the screen cast, CLICK HERE.

To access our code, CLICK HERE.

 

TidyX 66: Cleaning ugly excel data, Part 3

This week, Ellis Hughes and I finish cleaning our ugly excel data. In parts 1 and 2 we concentrated on:

  • Bringing the data into R
  • Reading multiple excel tabs
  • Cleaning up merged columns
  • Transforming the data into a long format
  • Dealing with the structure of an excel sheet were pieces of information are contained in different areas on the sheet
  • Organizing the data into a logical data frame for analysis
  • Writing a function to automate all of the above processes across all excel tabs

This week, handle the last portion of our ugly excel data — dealing with dates and the training duration column.

  • Dates can be tricky and, if you recall from previous episodes, R read them in as excel numeric values. So we discuss how to turn the numeric values into actual date values by understanding the origin argument within the as.Date() function.
  • The training duration column in our ugly data file had various different values recorded in it (1:05, :30, 45, etc.). So, we have to figure out a way to handle this character string and covert it into a numeric value, representing the minutes of training that the individual performed.

To watch our screen cast, CLICK HERE.

To access our code, CLICK HERE.

TidyX 65: Cleaning ugly excel data, Part 2

In the last episode, I took my crack at cleaning some messy excel data. This week, it is Ellis’ turn!

While we took two different approaches to clean the data, we end up with the same result — a data frame of data that can be analyzed.

The key take-a-ways from this two part series are:

  1. There are more than one ways to scale the mountain in R.
  2. We both take a chunking approach,  where we work with a single excel sheet first and try to wrap our head around the problem and develop an approach to solve it.
  3. Once we’ve setup an approach that we think will be useful, we both built functions that could parse all of the sheets in the excel file, returning our cleaned data frame.

To watch our screen cast, CLICK HERE.

To access our code, CLICK HERE.

Happy cleaning!

TidyX 64: Cleaning ugly excel data, Part 1

Continuing on with the data cleaning theme of our previous 3 episodes, Ellis Hughes and I work through messy excel data that was simulated to look like something we’d see in practice.

We decided to take turns with this, to show how both of us might handle the problem. This week is my approach to handling the issue and next week will be Ellis approach.

Things that we cover:

  1. Reading in excel files to R, selecting the tabs to read data from, and setting the data types
  2. Turning messy data in a data frame that can be analyzed
  3. Writing a for loop to operationalize the approach over a larger excel file, with many tabs

To access our code, CLICK HERE.

To watch the screen cast, CLICK HERE.

TidyX 63: Regex Lookarounds

Continuing with our series on regex expressions (TidyX 61: Regex 101, TidyX 62: Applied Regex), this week Ellis Hughes and I discuss regex lookarounds, as a way of setting new anchors when parsing text strings. We follow this up by taking the NBA play-by-play data and turning the text into minutes played by players in a game and then plot this data in a gantt chart.

To watch our screen cast, CLICK HERE.

To access our code, CLICK HERE.