TidyX 59: PITCHf/x classification with class imbalance

In the past 6 episodes of this series, Ellis Hughes and I have been developing a PITCHf/x classification model using data from the {mlbgameday} package. Along the way we’ve noted that the data has a large class imbalance, with the four-seam fastball (FF) as the majority class.

This week, we discuss up-sampling and down-sampling the data as potential ways to address this class imbalance. We conclude with an intro to log-loss, Brier score, and AUC, which we use to compare the up-sampled and down-sampled models. Next week, we will wrap up this series by comparing all of our models on those three measures.
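To make the two resampling ideas concrete, here is a minimal base-R sketch (not the code from the episode): up-sampling resamples the minority classes with replacement until they match the majority class, while down-sampling draws the majority classes down to the minority count. The toy class counts and the simple log-loss/Brier helpers are illustrative assumptions, not the series' actual data or functions.

```r
set.seed(59)

# Toy imbalanced pitch labels, mimicking the FF-heavy data in the series
pitches <- data.frame(
  pitch_type = c(rep("FF", 80), rep("SL", 15), rep("CU", 5))
)

# Up-sampling: resample every class with replacement up to the majority count
n_max <- max(table(pitches$pitch_type))
up <- do.call(rbind, lapply(split(pitches, pitches$pitch_type), function(d) {
  d[sample(nrow(d), n_max, replace = TRUE), , drop = FALSE]
}))
table(up$pitch_type)    # all classes now at the majority count

# Down-sampling: sample every class down to the minority count
n_min <- min(table(pitches$pitch_type))
down <- do.call(rbind, lapply(split(pitches, pitches$pitch_type), function(d) {
  d[sample(nrow(d), n_min), , drop = FALSE]
}))
table(down$pitch_type)  # all classes now at the minority count

# Two of the comparison metrics, given the predicted probability
# assigned to each observation's true class (binary one-vs-rest form)
log_loss <- function(p_true) -mean(log(p_true))
brier    <- function(p_true) mean((1 - p_true)^2)

p <- c(0.9, 0.7, 0.4)   # hypothetical predicted probabilities
log_loss(p)
brier(p)
```

Note the trade-off the episode digs into: up-sampling keeps all of the majority-class information but duplicates minority rows, while down-sampling throws away majority-class rows to get balance.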

To watch our screen cast, CLICK HERE.

To access our code, CLICK HERE.

For previous episodes in this series:

  1. Episode 53: Pitch Classification & EDA
  2. Episode 54: KNN & UMAP
  3. Episode 55: Decision Trees, Random Forest, & Optimization
  4. Episode 56: XGBoost
  5. Episode 57: Naive Bayes Classifier
  6. Episode 58: Tensorflow Neural Network