Author Archives: Patrick

TidyX Episode 141: Function Factories

Two episodes ago, Ellis Hughes and I briefly discussed function factories (functions that return other functions), and we got a few questions about how to use them and how to construct them. The good news is that you’ve probably been using function factories in your normal R coding without ever knowing it! This week we go a little deeper into function factories and answer some viewer questions about them.
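
As a quick refresher before the episode, here is a minimal sketch of a function factory (not the exact example from the episode): power_of() builds and returns a new function each time it is called.

## A function factory: power_of() returns a new function
power_of <- function(exp) {
  function(x) x^exp
}

## each call to the factory manufactures a different function
square <- power_of(2)
cube <- power_of(3)

square(4)  # 16
cube(2)    # 8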

To watch our screen cast, CLICK HERE.

To access our code, CLICK HERE.

Tidymodels Workflow Sets Tutorial

Intro

The purpose of workflow sets is to allow you to seamlessly fit multiple different models (and even tune them) simultaneously. This provides an efficient approach to the model building process, as the models can then be compared to each other to determine which one is optimal for deployment. Therefore, the aim of this tutorial is to provide a simple walk through of how to set up a workflow_set() and build multiple models simultaneously using the tidymodels framework.

The full code (which will include code not directly embedded in this tutorial) is available on my GITHUB page.

Load Packages & Data

Data comes from the nwslR package, which provides a lot of really nice National Women’s Soccer League data.

We will be using stats for field players to determine those who received the Best XI award (there will only be 10 players per season since we are dealing with field player stats, no goalies).

## packages
library(tidyverse)
library(tidymodels)
library(nwslR)
library(tictoc)

theme_set(theme_light() +
            theme(strip.background = element_rect(fill = "black"),
                  strip.text = element_text(face = "bold")))


## data sets required
data(player)
data(fieldplayer_overall_season_stats)
data(award)

## join all data sets to make a primary data set
d <- fieldplayer_overall_season_stats %>%
  left_join(player) %>% 
  left_join(award) %>% 
  select(-name_other) %>% 
  mutate(best_11 = case_when(award == "Best XI" ~ 1,
                             TRUE ~ 0)) %>% 
  select(-award)

d %>% 
  head()


Our features will be all of the playing stats (mp, starts, min, gls, ast, pk, p_katt) and the position (pos) that the player played.

Exploratory Data Analysis

Let’s explore some of the variables that we will be modeling.

How many NAs are there in the data set?
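
One quick way to check (a minimal sketch; the exact code used is in the linked GitHub file) is to count the NAs in every column:

## count the NAs in each column of the joined data set
d %>%
  summarize(across(everything(), ~sum(is.na(.x))))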

  • It looks like there are some players with matches played (mp) and starts recorded, yet no minutes recorded. We will need to handle this in our pre-processing. The alternative approach would be to simply remove those 79 players; however, I will add an imputation step in the recipe section of our model building process to show how it works.
  • There are also a number of players who played in games but never attempted a penalty kick. We will impute these columns with 0 (the median value).

How many matches did those who have an NA for minutes play in?
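
A sketch of one way to answer this (the exact code is in the GitHub file) is to filter to the players with missing minutes and tabulate their matches played:

## matches played for players whose minutes were not recorded
d %>%
  filter(is.na(min)) %>%
  count(mp)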


Let’s get a look at the relationship between matches played, `mp`, and `min` to see if maybe we can impute the value for those who have NA.

fit_min <- lm(min ~ mp, data = d)
summary(fit_min)

plot(x = d$mp, 
     y = d$min,
     main = "Minutes Played ~ Matches Played",
     xlab = "Matches Played",
     ylab = "Minutes Played",
     col = "light grey",
     pch = 19)
abline(fit_min,
       col = "red",
       lwd = 5,
       lty = 2)

  • There is a large amount of error in this model (residual standard error = 264) and the variance in the relationship appears to increase as matches played increases. This is all we have in this data set to really go on. It is probably best to figure out why no minutes were recorded for those players or see if there are other features in a different data set that can help us out. For now, we will stick with this simple model and use it in our model `recipe` below.

Plot the density of the continuous predictor variables based on the `best_11` award.

d %>%
  select(mp:p_katt, best_11) %>%
  pivot_longer(cols = -best_11) %>%
  ggplot(aes(x = value, fill = as.factor(best_11))) +
  geom_density(alpha = 0.6) +
  facet_wrap(~name, scales = "free") +
  labs(x = "Value",
       y = "Density",
       title = "Distribution of variables relative to Best XI designation",
       subtitle = "NOTE: axes are specific to the value in question")

How many field positions are there?

Some players appear to play multiple positions. Maybe they are more versatile? Have players with position versatility won more Best XI awards?
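
Here is a rough sketch of how you might check this (the exact code is in the GitHub file; I am assuming multi-position players are stored with a separator, such as a comma, in `pos`, so verify the format in your copy of the data):

## how many distinct position labels are there?
d %>%
  count(pos, sort = TRUE)

## do players listed at multiple positions win Best XI more often?
## NOTE: assumes multiple positions are comma separated (e.g., "DF,MF")
d %>%
  mutate(multi_pos = ifelse(str_detect(pos, ","), "multiple positions", "single position")) %>%
  group_by(multi_pos) %>%
  summarize(best_11_awards = sum(best_11))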

Data Splitting

First, I’ll create a data set of just the predictors and outcome variables (and get rid of the other variables in the data that we won’t be using). I’ll also convert our binary outcome variable from a number to a factor, for model fitting purposes.
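
The exact code for this step is in the GitHub file; a minimal sketch, assuming we keep only the features listed above plus the outcome, might look like this:

## keep the playing stats, position, and outcome; make the outcome a factor
d_model <- d %>%
  select(pos, mp, starts, min, gls, ast, pk, p_katt, best_11) %>%
  mutate(best_11 = as.factor(best_11))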

Split the data into train/test splits.

## Train/Test
set.seed(398)
init_split <- initial_split(d_model, prop = 0.7, strata = "best_11")

train <- training(init_split)
test <- testing(init_split)

Further split the training set into 5 cross validation folds.

## Cross Validation Split of Training Data
set.seed(764)
cv_folds <- vfold_cv(
  data = train, 
  v = 5
  ) 


Prepare the data with a recipe

Recipes help us set up the data for modeling purposes. It is here that we can handle missing values, scale/normalize our features, and create dummy variables. More importantly, creating the recipe ensures that, if we deploy our model for future predictions, the steps in the data preparation process will be consistent and standardized with what we did when we fit the model.

You can find all of the recipe options HERE.

The pre-processing steps we will use are:

  • Impute any NA minutes (`min`) using the `mp` variable.
  • Create one-hot encoded dummy variables for the player’s position.
  • Impute the median (0) when penalty kicks made (`pk`), penalty kicks attempted (`p_katt`), or assists (`ast`) are NA.
  • Normalize the numeric data to have a mean of 0 and standard deviation of 1.
nwsl_rec <- recipe(best_11 ~ ., data = train) %>%
  step_impute_linear(min, impute_with = imp_vars(mp)) %>%
  step_dummy(pos, one_hot = TRUE) %>%
  step_impute_median(pk, p_katt, ast) %>%
  step_normalize(mp:p_katt)

nwsl_rec

Here is what the pre-processed training set looks like when we apply this recipe.
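
If you want to inspect that output yourself, a quick sketch is to prep() the recipe and bake() it on the training data:

## apply the recipe steps to the training data to inspect the result
nwsl_rec %>%
  prep() %>%
  bake(new_data = NULL) %>%
  head()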

Specifying the models

We will fit three models at once:

  1. Random Forest
  2. XGBoost
  3. K-Nearest Neighbor
## Random forest
rf_model <- rand_forest(mtry = tune(), trees = tune()) %>%
  set_mode("classification") %>%
  set_engine("randomForest", importance = TRUE)

## XGBoost
xgb_model <- boost_tree(trees = tune(), mtry = tune(), tree_depth = tune(), learn_rate = 0.01) %>%
  set_mode("classification") %>% 
  set_engine("xgboost", importance = TRUE)

## K-Nearest Neighbor Classifier
knn_model <- nearest_neighbor(neighbors = 4) %>%
  set_mode("classification")

Workflow Set

We are now ready to combine the pre-processing recipes and the three models together in a workflow_set().

nwsl_wf <- workflow_set(
  preproc = list(nwsl_rec),
  models = list(rf_model, xgb_model, knn_model),
  cross = TRUE
  )

nwsl_wf

Tune & fit the 3 workflows

Once the models are set up, we use workflow_map() to fit the workflows to the cross-validated folds we created. We will set up a few tuning parameters for the Random Forest and XGBoost models so that during the fitting process we can determine which parameter pairings optimize model performance.

I also use the `tic()` and `toc()` functions from the tictoc package to determine the length of time it takes the model to fit, in case there are potential opportunities to optimize the fitting process.

doParallel::registerDoParallel(cores = 10)

tic()

fit_wf <- nwsl_wf %>%  
  workflow_map(
    seed = 44, 
    fn = "tune_grid",
    grid = 10,           ## parameters to pass to tune grid
    resamples = cv_folds
  )

toc()

# Took 1.6 minutes to fit

doParallel::stopImplicitCluster()

fit_wf


Evaluate each model’s performance on the train set

We can plot the model predictions across the range of models we fit using autoplot(), get a summary of the model predictions with the collect_metrics() function, and rank the results of the model using rank_results().

 

## plot each of the model's performance and ROC
autoplot(fit_wf)

## Look at the model metrics for each of the models
collect_metrics(fit_wf) 

## Rank the results based on model accuracy
rank_results(fit_wf, rank_metric = "accuracy", select_best = TRUE)


We see that the Random Forest models outperformed the XGBoost and KNN models.

Extract the model with the best performance

Now that we know the Random Forest performed the best, we will grab the model ID for the Random Forest models and their corresponding workflows.

## get the workflow ID for the best model
best_model_id <- fit_wf %>% 
  rank_results(
    rank_metric = "accuracy",
    select_best = TRUE
  ) %>% 
  head(1) %>% 
  pull(wflow_id)

best_model_id

## Extract the workflow for the best model
best_model <- extract_workflow(fit_wf, id = best_model_id)
best_model

Extract the tuned results from workflow of the best model

We know the best model was the Random Forest model so we can use the best_model_id to get all of the Random Forest models out and look at how each one did during the tuning process.

First we extract the Random Forest models.

## extract the Random Forest models
best_workflow <- fit_wf[fit_wf$wflow_id == best_model_id,
                        "result"][[1]][[1]]

best_workflow

With the collect_metrics() function we can see the iterations of mtry, trees, and tree_depth that were evaluated in the tuning process. We can also use select_best() to get the model parameters that performed the best of the Random Forest models.

collect_metrics(best_workflow)
select_best(best_workflow, "accuracy")

Fit the final model

We saw above that the best model had the following tuning parameters:

  • mtry = 1
  • trees = 944

We can extract this optimized workflow using the finalize_workflow() function and then fit that final workflow to the initial training split data.

## get the finalized workflow
final_wf <- finalize_workflow(best_model, select_best(best_workflow, "accuracy"))
final_wf

## fit the final workflow to the initial data split
doParallel::registerDoParallel(cores = 8)

final_fit <- final_wf %>% 
  last_fit(
    split = init_split
  )

doParallel::stopImplicitCluster()

final_fit

Extract Predictions on Test Data and evaluate model

First we can evaluate the variable importance plot for the random forest model.

library(vip)

final_fit %>%
  extract_fit_parsnip() %>%
  vip(geom = "col",
      aesthetics = list(
              color = "black",
              fill = "palegreen",
              alpha = 0.5)) +
  theme_classic()

Next we will look at the accuracy and ROC on the test set by using the collect_metrics() function on the final_fit. Additionally, if we use the collect_predictions() function we will get the predicted class and predicted probabilities for each row of the test set.

## Look at the accuracy and ROC on the test data
final_fit %>% 
  collect_metrics()

## Get the model predictions on the test data
fit_test <- final_fit %>% 
  collect_predictions()

fit_test %>%
  head()

Next, create a confusion matrix of the class of interest, best_11, and our predicted class, .pred_class.

fit_test %>% 
  count(.pred_class, best_11)

table(fit_test$best_11, fit_test$.pred_class)

We see that the model never actually predicted a person to be in class 1, indicating that they would be ranked as one of the Best XI for a given season. We have such substantial class imbalance that the model can basically guess that no one will win Best XI and still end up with a high accuracy.

The predicted class for a binary prediction ends up coming from a default threshold of 0.5, meaning that the predicted probability of being one of the Best XI needs to exceed 50% in order for that class to be predicted. This might be a bit high/extreme for our data! Additionally, in many instances we may not care so much about a specific predicted class but instead we want to just understand the probability of being predicted in one class or another.

Let’s plot the distribution of Best XI predicted probabilities colored by whether the individual was actually one of the Best XI players.

fit_test %>%
  ggplot(aes(x = .pred_1, fill = best_11)) +
  geom_density(alpha = 0.6)

We can see that those who were actually given the Best XI designation had a higher probability of being indicated as Best XI, just not high enough to exceed the 0.5 default threshold. What if we set the threshold for being classified as Best XI at 0.08?

fit_test %>%
  mutate(pred_best_11_v2 = ifelse(.pred_1 > 0.08, 1, 0)) %>%
  count(pred_best_11_v2, best_11)

Wrapping Up

In the final code output above we see that there are 20 total instances where the model predicted the individual would be a Best XI player. In some of those instances the model correctly identified one of the Best XI; in others the prediction was a false positive (the model thought the person had a Best XI season when they did not). There is a lot to unpack here. Binary thresholds like this can be messy, as predictions near the threshold line can easily flip from one class to the other, and changing the threshold will change the classification outcome. Where to set the threshold would need to be considered based on your tolerance for committing a Type I or Type II error, which may depend on the goal of your model, among other things. Finally, we often care more about the probability of being in one class or another than about a specific class outcome. All of these things need to be considered and thought through, and they are out of the scope of this tutorial, which aimed simply to walk through how to set up a workflow_set() and fit multiple models simultaneously. Perhaps a future tutorial can cover such matters more in depth.

The complete code for this tutorial is available on my GITHUB page.

Rolling Mean and SD not including the most recent observation

A colleague recently asked me a good question regarding some feature engineering for some data he was working with. He was collecting training load data and wanted to create a z-score for each observation, BUT, he didn’t want the most recent observation to be included in the calculation of the mean and standard deviation. Basically, he wanted the z-score for the most recent observation to be normalized to the observations that came before it.

This is an interesting issue because it makes me think of sports science research that uses z-scores to quantify the relationship between training load and injury. If the z-score is calculated retrospectively on the full season, then the observed z-scores and their relationship to the outcome of interest (injury) are a bit misleading, because the mean and standard deviation of the full season’s data are not information one would have had on the day the injury occurred. All the practitioner would know, as the season progresses, is the mean and standard deviation of the data up to the most recent observation.

So, let’s calculate some lagged mean and standard deviation values! The full code is available on my GITHUB page.

Loading Packages and Simulating Data

Aside from loading {tidyverse} we will also load the {zoo} package, which is a common package used for constructing rolling window functions (this is useful as it prevents us from having to write our own custom function).

We will start with a simple data set of just 10 training load observations.

## Load Libraries
library(tidyverse)
library(zoo)

## Simulate Data
set.seed(45)
d <- tibble(
  day = 1:10,
  training_load = round(runif(n = 10, min = 250, max = 350), 0)
)

d

 

Calculate the z-score with the mean and standard deviation of all data that came before it

To do this we use the rollapplyr() function from the {zoo} package. If we want to include the most recent observation in the mean and standard deviation we can run the function as is. However, because we want the mean and standard deviation of all data previous to, but not including, the most recent observation we wrap this entire function in the lag() function, which will take the data in the row directly above the recent observation. The width argument indicates the width of the window we want to calculate the function over. In this case, since we have a day variable we can use that number as our window width to ensure we are getting all observed data prior to the most recent observation.

d <- d %>%
  mutate(lag_mean = lag(rollapplyr(data = training_load, width = day, FUN = mean, fill = NA)),
         lag_sd = lag(rollapplyr(data = training_load, width = day, FUN = sd, fill = NA)),
         z_score = (training_load - lag_mean) / lag_sd)

d

 

This looks correct. As a sanity check, let’s calculate the mean and standard deviation of the first 4 rows of training load observations and see if those values correspond to what is in the lag_mean and lag_sd columns at row 5.

first_four <- d %>%
  slice(1:4) %>%
  pull(training_load)

mean(first_four)
sd(first_four)

It worked!

A more complicated example

Okay, that was an easy one. We had one athlete and we had a training day column, which we could use for the window width, already built for us. What if we have a data set with multiple athletes and only training dates, representing the day that training happened?

To make this work we will group_by() the athlete, and use the row_number() function to calculate a training day variable that represents our window width. Then, we simply use the same code above.

Let’s simulate some data.

set.seed(67)
d <- tibble(
  training_date = rep(seq(as.Date("2023-01-01"),as.Date("2023-01-05"), by = 1), times = 3),
  athlete = rep(c("Karl", "Bonnie", "Thomas"), each = 5),
  training_load = round(runif(n = 15, min = 250, max = 350), 0)
)

d

Now we run all of our functions for each athlete.

d <- d %>%
  group_by(athlete) %>%
  mutate(training_day = row_number(), 
         lag_mean = lag(rollapplyr(data = training_load, width = training_day, FUN = mean, fill = NA)),
         lag_sd = lag(rollapplyr(data = training_load, width = training_day, FUN = sd, fill = NA)),
         z_score = (training_load - lag_mean) / lag_sd)


d

Wrapping Up

There we have it, a simple way of calculating rolling z-scores while using the mean and standard deviation of the observations that came before the most recent observation!

If you’d like the entire code, check out my GITHUB page.

TidyX Episode 140: Data restructuring via splits

This week, Ellis Hughes and I answer a viewer question about data restructuring using splits in the data.

The viewer has some data sets that include information about competitions, the name of the athlete, the team the athlete is on, and the athlete gender. The goal was to help the viewer write some code that could output this information into a clean looking list with all of the information arranged appropriately within each competition.

Ellis and I both approached it (and interpreted it) in different ways, so you get to see multiple different ways of solving the problem.

To watch the screen cast, CLICK HERE.

To access our code, CLICK HERE.

From tidyverse to python

Anyone who reads this blog or watches our Tidy Explained Screen Cast knows that I am a massive R user and R fan. I can work fast in it and I find it to be one of the best tools for building statistical models. That said, there are times when Python comes in handy, and there are a number of software and web applications that interact and play more nicely with Python than with R. For example, R can be run within AWS SageMaker, but Python seems to be more efficient there. I’ve recently been doing a few projects in Databricks as well and, while I can use R within their system, I’ve been trying to code the projects using Python.

For those of us trying to learn a bit of Python to be somewhat useful in that language (or for Pythonistas who may need to learn a little bit of R), I’ve put together the following tutorial that shows how to do some of the common things you’d use R’s tidyverse for, in Python.

The code for both the RMarkdown file and the Jupyter Notebook is available on my GITHUB page. The code has many more examples than I will go over here (for space reasons), so be sure to check it out!

Load Libraries and Data

We will be using the famous Palmer Penguins data. Here is a side-by-side look at how we load the libraries and data set in tidyverse and what those same steps look like in Python.
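
The R side of that comparison is just a couple of lines (the pandas version lives in the Jupyter notebook linked above):

## R / tidyverse side: load libraries and the penguins data
library(tidyverse)
library(palmerpenguins)

data(penguins)

penguins %>%
  head()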

Exploratory Data Analysis

One of the most popular features of the suite of tidyverse libraries is the ability to nicely summarize and plot data.

I won’t go through every possible EDA example contained in the notebooks but here are a few side-by-side.

Create a Count of the Number of Samples for Each Species

Create a barplot of the count of Species types

Scatter plot of flipper length and body mass colored by sex with linear regression lines
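
For reference, here is a sketch of the tidyverse side of those three examples (the pandas versions are in the notebook; the exact aesthetics may differ from the screenshots):

## count of samples per species
penguins %>%
  count(species)

## barplot of the species counts
penguins %>%
  ggplot(aes(x = species)) +
  geom_bar()

## flipper length vs. body mass, colored by sex, with linear regression lines
penguins %>%
  filter(!is.na(sex)) %>%
  ggplot(aes(x = flipper_length_mm, y = body_mass_g, color = sex)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)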

Group By Summarize

In tidyverse we will often group our data by a specific condition and then summarize data points relative to that condition. In tidyverse we use pipes, %>%, to chain lines of code together. In the pandas library (the common Python library for working with data frames) we use dots to chain these lines of code.
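
For example, a grouped summary of body mass on the tidyverse side might look like this (the pandas version is in the notebook):

## average body mass by species
penguins %>%
  group_by(species) %>%
  summarize(avg_body_mass = mean(body_mass_g, na.rm = TRUE))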

Group By Mutate

In addition to summarizing by group, sometimes we need to add additional columns to the existing data set but those columns need to have the new data conditional on a specific grouping variable. In tidyverse we use mutate and in pandas we use transform.

We can also build columns using grouping and custom functions. In tidyverse we do this inside of mutate, but in pandas we need to set up a lambda function. For example, we can build the z-score of each sample grouped by Species (meaning we subtract the Species mean from the observation and divide by the Species standard deviation).
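
A sketch of the tidyverse side of both of those ideas (the pandas transform/lambda versions are in the notebook):

## add the group mean as a new column (pandas would use transform)
penguins %>%
  group_by(species) %>%
  mutate(species_avg_mass = mean(body_mass_g, na.rm = TRUE)) %>%
  ungroup()

## z-score of body mass within each species
penguins %>%
  group_by(species) %>%
  mutate(body_mass_z = (body_mass_g - mean(body_mass_g, na.rm = TRUE)) / 
           sd(body_mass_g, na.rm = TRUE)) %>%
  ungroup()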

ifelse / case_when

Another task commonly performed in data cleaning is to assign values to specific cases. For example, we have three Islands in the data set and we want to assign them Island1, Island2, and Island3. In tidyverse we could use either ifelse or case_when() to solve this task. In pandas we need to either set up a custom function and then map that function to the data set or we can use the numpy library, which has a function called where(), which behaves like case_when() in tidyverse.
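On the tidyverse side, the island recoding could look like the sketch below (the Island1/Island2/Island3 assignments here are arbitrary; the pandas/numpy version is in the notebook):

## recode the three islands with case_when()
penguins %>%
  mutate(island_id = case_when(island == "Torgersen" ~ "Island1",
                               island == "Biscoe" ~ "Island2",
                               island == "Dream" ~ "Island3"))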

Linear Regression Model

To finish, I’ll provide some code to write a linear model in both languages.
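
The R version is base lm(); the exact model formula used in the notebook may differ from this sketch:

## simple linear regression in R
fit_lm <- lm(body_mass_g ~ flipper_length_mm + sex, data = penguins)
summary(fit_lm)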

Wrapping Up

Hopefully this tutorial will help folks going from R to Python or vice versa. Generally, I suggest picking one or the other and really digging into being good at it. But there are times when you might need to delve into another language and produce some code. This tutorial provides a mirror image of the code between R and Python.

As stated earlier, there are a number of extra code examples in the RMarkdown and Jupyter Notebook files available on my GITHUB page.