TidyX 63: Regex Lookarounds

Continuing with our series on regular expressions (TidyX 61: Regex 101, TidyX 62: Applied Regex), this week Ellis Hughes and I discuss regex lookarounds as a way of setting new anchors when parsing text strings. We follow this up by taking the NBA play-by-play data, turning the text into minutes played by each player in a game, and then plotting the results in a Gantt chart.
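
To give a flavor of what lookarounds do, here is a minimal sketch in R using {stringr}; the play-by-play string and player names below are made up for illustration, not taken from the episode's data.

```r
library(stringr)

# A hypothetical substitution entry from play-by-play text
txt <- "SUB: Curry FOR Thompson"

# Lookbehind: grab the word that follows "SUB: " without matching "SUB: " itself
str_extract(txt, "(?<=SUB: )\\w+")
#> [1] "Curry"

# Lookahead: grab the word that precedes " FOR" without matching " FOR" itself
str_extract(txt, "\\w+(?= FOR)")
#> [1] "Curry"
```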

To watch our screen cast, CLICK HERE.

To access our code, CLICK HERE.

TidyX 61: Regular Expressions 101

One of the less glamorous tasks in data science is data cleaning. Because data can come in many different forms, it is often “dirty” and requires some level of treatment prior to analysis. One of the more complex data cleaning tasks is working with strings and regular expressions.

Regular expressions can look intimidating, as parsing strings requires a lot of weird-looking characters. Take, for example, the regular expression Joe Cheng shared on Twitter recently.

As such, this week on TidyX, Ellis Hughes and I begin a series on regular expressions, starting with Regular Expressions 101. We cover the topics below (a small {stringr} sketch follows the list):

  • Searching strings for key words
  • Manipulating the string
  • Extracting components of the string
  • Splitting the string based on a specific character
  • Regular expression anchors
  • Matching within the string
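
As a quick sketch of how these topics map onto {stringr} functions, here is a minimal example; the string is made up for illustration.

```r
library(stringr)

x <- "TidyX Episode 61: Regular Expressions 101"

str_detect(x, "Regular")          # search for a key word -> TRUE
str_replace(x, "Episode", "Ep.")  # manipulate the string
str_extract(x, "\\d+")            # extract the first run of digits -> "61"
str_split(x, ": ")                # split on a specific character
str_detect(x, "^TidyX")           # anchor: does the string start with "TidyX"?
str_match(x, "Episode (\\d+)")    # match within the string, with a capture group
```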

To watch the screen cast, CLICK HERE.

To access our code, CLICK HERE.

TidyX 60: pitchf/x model evaluation

We’ve made it!

Over the past seven weeks, Ellis Hughes and I have been working on various approaches to building a classification model of pitchf/x data using the {mlbgameday} package in R. We are finally ready to compare all of the models we’ve built, display the results in a conditionally formatted {gt} table, and discuss our findings.
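
For a sense of the kind of conditional formatting {gt} supports, here is a minimal sketch; the model names and accuracy values are made up, not the results from the episode.

```r
library(gt)

# Hypothetical model-comparison results for illustration
results <- data.frame(
  model    = c("KNN", "Random Forest", "XGBoost"),
  accuracy = c(0.84, 0.91, 0.93)
)

results |>
  gt() |>
  tab_style(
    style = cell_fill(color = "palegreen"),
    locations = cells_body(
      columns = accuracy,
      rows = accuracy == max(accuracy)  # highlight the best-performing model
    )
  )
```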

To watch our screen cast, CLICK HERE.

To access our code, CLICK HERE.

For previous episodes in this series:

  1. Episode 53: Pitch Classification & EDA
  2. Episode 54: KNN & UMAP
  3. Episode 55: Decision Trees, Random Forest, & Optimization
  4. Episode 56: XGBoost
  5. Episode 57: Naive Bayes Classifier
  6. Episode 58: Tensorflow Neural Network
  7. Episode 59: Dealing with Class Imbalance

TidyX 59: pitchf/x classification with class imbalance

In the past six episodes of this series, Ellis Hughes and I have been working on developing a pitchf/x classification model using data from the {mlbgameday} package. Along the way, we’ve mentioned that the data has a large class imbalance, with the majority class being the four-seam fastball (FF).

This week, we discuss up- and down-sampling your data as a potential way to address this class imbalance issue. We conclude by comparing our up- and down-sampled models, giving an intro to log-loss, Brier score, and AUC along the way. Next week, we will wrap up this series by comparing all of our models using those three measures.
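
As a rough sketch of how those three metrics are computed, here is a made-up binary example (FF vs. not-FF); the episode applies these ideas to the multiclass pitch models, and the values below are invented for illustration.

```r
y <- c(1, 0, 1, 1, 0)            # 1 = four-seam fastball (FF)
p <- c(0.9, 0.2, 0.6, 0.8, 0.4)  # predicted probability of FF

# Log-loss: penalizes confident wrong probabilities heavily
log_loss <- -mean(y * log(p) + (1 - y) * log(1 - p))

# Brier score: mean squared error of the predicted probabilities
brier <- mean((p - y)^2)

# AUC via the {pROC} package
auc <- as.numeric(pROC::auc(pROC::roc(y, p)))

round(c(log_loss = log_loss, brier = brier, auc = auc), 3)
```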

To watch our screen cast, CLICK HERE.

To access our code, CLICK HERE.

For previous episodes in this series:

  1. Episode 53: Pitch Classification & EDA
  2. Episode 54: KNN & UMAP
  3. Episode 55: Decision Trees, Random Forest, & Optimization
  4. Episode 56: XGBoost
  5. Episode 57: Naive Bayes Classifier
  6. Episode 58: Tensorflow Neural Network