
Making Predictions from a Cross-Validated Workflow Using tidymodels

In yesterday’s post, I offered an approach to using {tidymodels} when you don’t want to split your data into training and testing sets, but rather want to fit all of your data using cross-validated folds, save the model, and deploy it later on new data sets. Recall that the reason you’d want to do this is that you might not have enough data to feel comfortable removing some of it for a testing set, which would decrease the number of observations your model can learn from.

After that post, I got an email from someone asking how they could save the entire workflow for later deployment, as yesterday’s article only saved the model fit following cross-validation. Storing the workflow can be critical when you have data and/or a model that requires various preprocessing steps prior to making forecasts. One of the advantages of the {tidymodels} framework is the ability to combine the preprocessing tasks in one step and then fit the model all at once. This keeps your script nice and tidy and makes it easy to see what is taking place at each step.

The issue when working with just the cross-validated folds is that you never fit the model to a training/testing split of the data. In most {tidymodels} examples, the final model fit is done with the last_fit() function, which requires a split data set generated via the initial_split() function. From there you can extract the workflow and save it for later deployment. Consequently, there are a few extra steps to make this work smoothly when saving a model that was fit using only cross-validated folds.

So, to follow up on yesterday, I’ll walk through a random forest classification example where I’ll fit a model to cross-validation folds of the mtcars data set, I’ll save the entire recipe (where preprocessing takes place), I’ll save the model, and then I’ll show how you can use both the saved recipe and model to make predictions on a new data set. Additionally, to make things more interesting, I will tune the random forest model and show how to extract the tuned parameters and re-fit the model before saving it.

Load packages & data

### load packages
library(tidymodels)
library(tidyverse)

### get data
df <- mtcars
head(df)
table(df$cyl)

df$cyl <- as.factor(df$cyl)

Create cross-validated folds & specify a random forest classifier

### create cross-validated folds (10 folds, matching yesterday's post)
df_cv <- vfold_cv(df, v = 10)

### specify random forest model
rf_spec_with_tuning <- rand_forest(mtry = tune(), trees = tune()) %>%
  set_engine("randomForest", importance = TRUE) %>%
  set_mode("classification")

Build a tuning grid

We will allow {tidymodels} to optimize both mtry and trees. Because the recipe below uses only three predictors, mtry is capped at 3.

### build a tuning grid
### NOTE: the recipe uses only 3 predictors, so mtry is capped at 3
rf_tune_grid <- grid_regular(
  mtry(range = c(1, 3)),
  trees(range = c(500, 800)),
  levels = 5
)

Create the model recipe

The mtcars data set is complete and has no missing values. But that doesn’t mean future data that we deploy the model on will be free from missing values. So, to be sure we can handle this in the future if we need to, I’m going to create an imputation step in the recipe that will use k-nearest neighbors (with the 3 nearest neighbors) to impute any NA values of the independent variables.

NOTE: I’m not saying this is the best imputation approach here. I’m simply creating an additional step in the model recipe that can be deployed later to show how it works.

### recipe -- set up imputation for new data that might have missing values
cyl_rec <- recipe(cyl ~ mpg + disp + wt, data = df) %>%
  step_impute_knn(mpg, disp, wt,
                  neighbors = 3)

Set up the workflow

Combine the preprocessing recipe and the random forest model, which still needs to be tuned, into a single workflow.

### workflow
cyl_wf <- workflow() %>%
  add_recipe(cyl_rec) %>%
  add_model(rf_spec_with_tuning)

Set up a control function for storing the model predictions on the cross-validated folds

### set a control function to save the predictions from the model fit to the CV-folds
ctrl <- control_resamples(save_pred = TRUE)

Tune the model parameters during the model fitting process

### fit model
cyl_tune <- tune_grid(
  cyl_wf,
  resamples = df_cv,
  grid = rf_tune_grid,
  control = ctrl
)

Get the model predictions from the cross-validated tuning

### get predictions
pred_group <- cyl_tune %>%
  unnest(cols = .predictions) %>%
  select(.pred_4, .pred_6, .pred_8, .pred_class, cyl) 

pred_group

table('predicted' = pred_group$.pred_class, 'observed' = pred_group$cyl)


Get the optimized values for mtry and trees

After tuning the model, we want the mtry and trees parameters that produced the best ROC AUC, so we will pull those values out of our tuning results, cyl_tune. In this instance, an mtry of 1 and 500 trees appear to be the optimal values.

NOTE: We could have extracted the mtry and trees parameters that optimized model accuracy instead.
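If you’d like to scan the top candidates before selecting one, show_best() from {tune} is a handy optional check (not part of the original script):

### optional: view the top 5 parameter combinations by ROC AUC
show_best(cyl_tune, metric = "roc_auc", n = 5)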

# get the optimized numbers for mtry and trees
m <- select_best(cyl_tune, metric = "roc_auc") %>% pull(mtry)
t <- select_best(cyl_tune, metric = "roc_auc") %>% pull(trees)

m
t


Re-specify the model and re-fit the workflow

Since we are working with cross-validated samples and not a training/testing split, we can’t simply use last_fit() to fit the best model, because there is no split data. To get the optimal model we need to actually re-specify the random forest model with the mtry and trees values we extracted above, and then re-fit a new workflow with the optimized mtry and trees parameters to ensure that the tuned model is used for our final model fit.

# re-specify the model with the optimized values
rf_spec_tuned <- rand_forest(mtry = m, trees = t) %>%
  set_engine("randomForest", importance = TRUE) %>%
  set_mode("classification")

# re-set workflow
cyl_wf_tuned <- workflow() %>%
  add_recipe(cyl_rec) %>%
  add_model(rf_spec_tuned)
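As an aside, if you’d rather not re-type the model specification, finalize_workflow() from {tune} will plug the best parameters into the existing workflow directly. A minimal equivalent sketch:

### alternative: finalize the original workflow with the best parameters
cyl_wf_tuned <- cyl_wf %>%
  finalize_workflow(select_best(cyl_tune, metric = "roc_auc"))

Either route yields the same tuned workflow; the manual re-specification above just makes the chosen values explicit.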

Extract & save the final recipe

Now that the workflow has been finalized with the tuned parameters, we need to fit it to our original data set, extract the final recipe, and save the recipe as an .RDA file so that we can load it for deployment later on. To do this, we use the extract_recipe() function after fitting the tuned workflow to our data.

# extract the final recipe for pre-processing of new data
cyl_final_rec <- cyl_wf_tuned %>%
  fit(df) %>%
  extract_recipe()

save(cyl_final_rec, file = "cyl_final_rec.rda")
load("cyl_final_rec.rda")
cyl_final_rec


Extract & save the final model fit

Once we have the recipe, which holds all of our preprocessing steps, we then need to extract the actual model fit so that we can make future predictions on new data.

# extract final model
cyl_final_tuned <- cyl_wf_tuned %>% 
  fit(df) %>%
  extract_fit_parsnip()

save(cyl_final_tuned, file = "cyl_final_tuned.rda")
load("cyl_final_tuned.rda")

cyl_final_tuned
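Because the recipe and the model both come from the same fitted workflow, you can also fit once and extract both pieces, avoiding the duplicate fit(df) call (an optional refactor, not how the original script was written):

### optional: fit the workflow once, extract both the recipe and the model
cyl_final_fit <- fit(cyl_wf_tuned, df)
cyl_final_rec <- extract_recipe(cyl_final_fit)
cyl_final_tuned <- extract_fit_parsnip(cyl_final_fit)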

Create a new data set and add missing values to some of the independent variables

Now that our recipe and model fit are saved we will create some new data and add some missing values to show how the impute function, which we created in the recipe, works.

We will also save the cyl values for this new data (the truth) so we can check our work once the model predictions are done. Prior to making predictions on this new data, we will drop the cyl column to make the data look more like what we would see in the real world (i.e., we’d never have the actual outcome we are trying to forecast).

### Create New Data with NAs
set.seed(947)
row_ids <- sample(x = 1:nrow(mtcars), size = 5, replace = TRUE)

df_new <- mtcars[row_ids, ]
df_new[2, 3] <- NA
df_new[c(1,5), 1] <- NA
df_new[3, 6] <- NA

# get the actual cyl values for this new data
truth <- df_new$cyl
truth

# drop the cyl column to pretend like this is new data
df_new$cyl <- NULL
df_new


Apply the recipe to the new data

Notice that we have some NAs in a few of the predictor columns (mpg, disp, and wt). We apply the recipe to the new data set using the bake() function, which imputes those values with the information the recipe learned back when we fit the model workflow.

### Apply the pre-processing recipe to the new data
df_new_pre_processed <- cyl_final_rec %>%
  bake(new_data = df_new)

df_new_pre_processed


Make the final predictions using the saved model

Now that we have imputed NAs using the recipe we are ready to make cyl predictions on the new data. After making predictions we will combine the predicted class and the probability of each of the three classes with the new data set.

### Make a prediction for cyl
pred_cyl <- predict(cyl_final_tuned, new_data = df_new_pre_processed, type = "class")
df_new_pre_processed <- cbind(df_new_pre_processed, pred_cyl)

### get probability of each class
pred_probs <- predict(cyl_final_tuned, new_data = df_new_pre_processed, type = "prob")
df_new_pre_processed <- cbind(df_new_pre_processed, pred_probs)
df_new_pre_processed

See how well the model predicted on this new data, even after having to impute some values for each observation.

table('predicted' = df_new_pre_processed$.pred_class, 'actual' = truth)


The model predicted all 5 new observations correctly.

And that’s it! Those are the steps to follow for using all of your data to fit and tune a model with cross-validated folds, save the preprocessing steps and tuned model, and then apply it to a new set of observations.

All of the code for this blog article is available on my GitHub page.

Fitting, Saving, and Deploying tidymodels with Cross-Validated Data

I’ve talked about {tidymodels} previously when I laid out a {tidymodels} model fitting template, which serves as a framework wrapping up the 10-episode screencast series we did on {tidymodels} for Tidy Explained.

During all 10 of our episodes, within my model fitting template, and in pretty much every tutorial I’ve seen online, people follow the same initial steps: split the data into a training and testing set, and then split the training data into cross-validation folds.

This approach is fine when you have enough data to actually perform a training and testing split. But, there are times where we don’t really have enough data to do this, meaning we are fitting a model to a small training set and then hoping it picks up all of the necessary information in order to generalize well to external data.

In these instances, we may prefer to use all of our available data, split it into cross validation sets, fit and test the model, and then save the model workflow so that it can be deployed later on and used in production.

To cover this issue, I’ve put together a template for taking a data set, creating cross-validation folds, fitting the model, and then saving the model. The code includes both a regression model and a random forest classification model on the mtcars data set. I’ll only show the regression example below, but all code is available on my GitHub page.

Load Packages & Data

### load packages
library(tidymodels)
library(tidyverse)

############ Regression Example ############
### get data
df <- mtcars
head(df)

Create Cross-Validation Folds & Specify Linear Model

### cross validation folds
df_cv <- vfold_cv(df, v = 10)
df_cv

### specify linear model
lm_spec <- linear_reg() %>%
  set_engine("lm") %>%
  set_mode("regression")

Create the Model Recipe and Workflow

To keep things simple, I won’t do any pre-processing of the data. I’ll just set up the recipe with the regression model I am fitting.

### recipe
mpg_rec <- recipe(mpg ~ cyl + disp + wt, data = df)
mpg_rec

### workflow
mpg_wf <- workflow() %>%
  add_recipe(mpg_rec) %>%
  add_model(lm_spec)

Control Function to Save Predictions

To save our model predictions using only cross-validated folds, we need to set a control function that will be passed as an argument when we fit our model. Without this argument, we can fit the model using the cross-validated folds, but we won’t be able to extract the predictions.

### set a control function to save the predictions from the model fit to the CV-folds
ctrl <- control_resamples(save_pred = TRUE)

Fit the Model
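
Fit the model to the cross-validated folds with fit_resamples(), passing in the workflow, the folds, and the control function from above (the fitted object mpg_lm is what the rest of the post references):

### fit the model to the CV folds
mpg_lm <- fit_resamples(
  mpg_wf,
  resamples = df_cv,
  control = ctrl
)

mpg_lm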

Evaluate model performance

### view model metrics
collect_metrics(mpg_lm)

Unnest the .predictions column from the model fit and look at the predicted mpg versus actual mpg

### get predictions
mpg_lm %>%
  unnest(cols = .predictions) %>%
  select(.pred, mpg)

Fit the final model and extract the workflow

If we are happy with our model performance and the workflow that we’ve built (which contains our pre-processing steps), we can fit the final model to the data set.

To do this, we use the fit() function and pass it our data set, and then we use extract_fit_parsnip() to extract the final model fit from the workflow we’ve created. Then save that model fit as an RDA file to be loaded and used at a later time.

## Fit the final model & extract the workflow
mpg_final <- mpg_wf %>% 
  fit(df) %>%
  extract_fit_parsnip()

mpg_final

## Save model to use later
# save(mpg_final, file = "mpg_final.rda")

To access all of the code for this template and see an example with a random forest classifier, go to my GitHub page.

TidyX 97: Simulating Distributions in R

This week, Ellis Hughes and I begin a new series titled Sampling, Simulations, and Intro to Bayes. To get us started, this first episode discusses how to use the various distribution functions in R for things like simulating random draws and evaluating the cumulative distribution function, the probability mass/probability density functions, and quantiles. We then walk through an example of how to use these functions to create random samples of intercepts and slopes from a regression model and plot our uncertainty.
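For a quick flavor of what we cover, base R exposes each distribution family through four prefixed functions (shown here with the normal distribution):

### the four distribution-function prefixes in base R
rnorm(n = 5, mean = 0, sd = 1)   # r*: random draws
dnorm(x = 0)                     # d*: density at a point
pnorm(q = 1.96)                  # p*: cumulative probability, P(X <= q)
qnorm(p = 0.975)                 # q*: quantile, the inverse of p*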

To watch the screen cast, CLICK HERE.

To access our code, CLICK HERE.

Weakley et al. (2022). Velocity-Based Training: From Theory to Application – R Workbook

Velocity-based training (VBT) is a method employed by strength coaches to prescribe training intensity and volume based on an individual athlete’s load-velocity profile. I discussed VBT last year when I used {shiny} to build an interactive web application for visualizing and comparing athlete outputs.

Specific to this topic, I recently read the following publication: Weakley, Mann, Banyard, McLaren, Scott, and Garcia-Ramos. (2022). Velocity-based training: From theory to application. Strength Cond J; 43(2): 31-49.

The paper aimed to provide some solutions for analyzing, visualizing, and presenting feedback around training prescription and performance improvement when using VBT. I enjoyed the paper and decided to write an R Markdown file to provide code that can accompany it and (hopefully) assist strength coaches in applying some of the concepts in practice. I’ll summarize some notes and thoughts below, but if you’d like to read the full R Markdown file that explains and codes all of the approaches in the paper, CLICK HERE>> Weakley–2021—-Velocity-Based-Training—From-Theory-to-Application—Strength-Cond-J.

If you’d like the CODE and DATA to run the analysis yourself, they are available on my GitHub page.

Paper/R Markdown Overview

Technical Note: I don’t have the actual data from the paper. Therefore, I took a screenshot of Figure 3 in the text and used an open-source web application for extracting data from figures in research papers. This requires manually clicking on the points of the plot, so my extraction isn’t 100% perfect, and there may be subtle differences in my data set compared to what was used for the paper.

The data used in the paper reflect 17 weeks of mean concentric velocity (MCV) in the 100-kg back squat for a competitive powerlifter, tested once a week. The two main figures, which I will recreate along with the analysis, are Figure 3 and Figure 5.

Figure 3 is a time series visual of the athlete while Figure 5 provides an analysis and visual for the athlete’s change across the weeks in the training phase.

Figure 3

The first 10 weeks represent the maintenance phase for the athlete, which was followed by a 7-week training phase. The maintenance-phase sessions were used to build a linear regression model, which was then used to visualize the athlete’s change over time along with a corresponding confidence interval around each MCV observation. The model output looks like this:

The standard (typical) error was used to calculate confidence intervals around the observations. To calculate the standard error, the authors recommend one of two approaches:

1) If you have group-based test-retest data, they recommend taking the difference between the test-retest outcomes and calculating the standard error as follows:

  • SE.group = sd(differences) / sqrt(2)

2) If you have individual observations, they recommend calculating the standard error like this:

  • SE.individual = sqrt(sum.squared.residuals / (n - 2))
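
In R, these two calculations might look like the following sketch (the test-retest vectors test1/test2 and the maintenance-phase data frame maintenance_df, with columns mcv and week, are illustrative placeholders rather than objects from the paper):

### 1) group-based test-retest data
se_group <- sd(test2 - test1) / sqrt(2)

### 2) individual observations, using residuals from the baseline regression
fit <- lm(mcv ~ week, data = maintenance_df)
se_individual <- sqrt(sum(resid(fit)^2) / (nrow(maintenance_df) - 2))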

Since we have individual athlete data, we will use the second option, along with the t-critical value for an 80% CI, to produce Figure 3 from the paper:

The plot provides a nice visual of the athlete over time. We see that, because the linear model is calculated for the maintenance phase, as time goes on, the shaded standard error region gets wider. The confidence intervals around each point estimate are there to encourage us to think past just a point estimate and recognize that there is some uncertainty in every test outcome that cannot be captured in a single value.
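
If you want to reproduce that widening band yourself, one option is to predict from the baseline model across all 17 weeks with an 80% confidence interval (continuing the illustrative fit object from the sketch above):

### baseline trend, with 80% CI, projected across all 17 weeks
all_weeks <- data.frame(week = 1:17)
predict(fit, newdata = all_weeks, interval = "confidence", level = 0.80)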

Figure 5

This figure visualizes the change in squat velocity for the powerlifter in weeks 11-17 (the training phase) relative to the mean squat velocity from the maintenance phase, which represents the athlete’s baseline performance.

Producing this plot requires five pieces of information (a code sketch follows the list):

  1. The baseline average for the maintenance phase
  2. The difference between the observed MCV in each training week and the maintenance average
  3. The t-critical value for the 90% CI
  4. The lower 90% CI
  5. The upper 90% CI
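
A hedged sketch of those five steps in R (continuing the illustrative maintenance_df and se_individual from above, and assuming a training_df holding the training-phase weeks):

### 1. baseline average for the maintenance phase
baseline_avg <- mean(maintenance_df$mcv)

### 2. difference between each training-week MCV and the baseline average
training_df$diff <- training_df$mcv - baseline_avg

### 3. t-critical value for a 90% CI (two-sided)
t_crit <- qt(p = 0.95, df = nrow(maintenance_df) - 2)

### 4. & 5. lower and upper 90% CI, using the standard error from earlier
training_df$lwr90 <- training_df$diff - t_crit * se_individual
training_df$upr90 <- training_df$diff + t_crit * se_individual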

Obtaining this information allows us to produce the following table of results and figure:

Are the changes meaningful?

One thing the authors mention in the paper is some approaches for evaluating whether the observed changes are meaningful. They recommend using either equivalence tests or second-generation p-values, but they don’t go into calculating such things on their data. I’m honestly not familiar with the latter option, so I’ll instead create an example of using an equivalence test for the data and show how we can color the points within the plot to represent their meaningfulness.

Equivalence testing has been discussed by Daniel Lakens and colleagues in their tutorial paper: Lakens, D., Scheel, A.M., & Isager, P.M. (2018). Equivalence testing for psychological research: A tutorial. Advances in Methods and Practices in Psychological Science, 1(2), 259-269.

Briefly, equivalence testing uses one-sided t-tests to evaluate whether the observed effect is larger or smaller than a pre-specified range of values surrounding the effect of interest, termed the smallest effect size of interest (SESOI).

In our above plot, we can consider the shaded range of values around 0 (-0.03 to 0.03, NOTE: The value 0.03 was provided in the text as the meaningful change for this athlete to see an ~1% increase in his 1-RM max) as the region where an observed effect would not be deemed interesting. Outside of those ranges is a change in performance that we would be most interested in. In addition to being outside of the SESOI region, the observed effect should be substantially large enough relative to the standard error around each point, which we calculated from our regression model earlier.
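
Here is one way such an equivalence test could be sketched for each weekly change, using two one-sided t-tests against the SESOI bounds (my own illustration, continuing the placeholder objects from the sketches above):

### TOST-style equivalence test for each weekly change
sesoi <- 0.03
deg_f <- nrow(maintenance_df) - 2

# one-sided test 1: the change is greater than -SESOI
p_lower <- pt((training_df$diff + sesoi) / se_individual, df = deg_f, lower.tail = FALSE)

# one-sided test 2: the change is less than +SESOI
p_upper <- pt((training_df$diff - sesoi) / se_individual, df = deg_f, lower.tail = TRUE)

# the equivalence p-value is the larger of the two; small values suggest a trivial change
training_df$p_equiv <- pmax(p_lower, p_upper)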

Putting all of this together, we obtain the same figure as above, but now with the points colored according to the p-value from our equivalence test:

Wrapping Up

Again, if you’d like the full markdown file with code (click the ‘code’ button to display each code chunk) CLICK HERE >> Weakley–2021—-Velocity-Based-Training—From-Theory-to-Application—Strength-Cond-J

There are always a number of ways that analysis can unfold and provide valuable insights and this paper reflects just one approach. As with most things, I’m left with more questions than answers.

For example, in Figure 3, I’m not sure linear regression is the best approach. As we can see, the grey shaded region increases in width over time because time is on the x-axis (the independent variable) and the model was built on a small portion (the first 10 weeks) of the data. As such, with every subsequent week, the uncertainty gets larger. How long would one continue to use the baseline model? At some point, the grey shaded region would be so wide that it would probably be useless. Are we to believe that the baseline model is truly representative of the athlete’s baseline? What if the baseline phase contained some amount of trend? How would the model then be used to quantify whatever takes place in the training phase? Maybe training isn’t linear? Maybe there is other conditional information that could be used?

In Figure 5, I wonder about the equivalence testing used in this single observation approach. I’ve generally thought of equivalence testing as a method comparing groups to determine if the effect from an intervention in one group is larger or smaller than the SESOI. Can it really work in an example like this, for an individual? I’m not sure. I need to think about it a bit. Maybe there is a different way such an analysis could be conceptualized? A lot of these issues come back to the problem of defining the baseline or some group of comparative observations that we are checking our most recent observation against.

My ponderings aside, I enjoyed the paper and the attempt to provide practitioners with some methods for delivering feedback when using VBT.

TidyX Episode 96: Evaluate Results ‘ASIS’ in RMarkdown

Wrapping up our RMarkdown series, Ellis Hughes and I discuss how to report RMarkdown chunk results ‘asis’. The asis option allows you to write text directly inside a chunk, via the cat() function, and have it rendered in your RMarkdown report as normal text, meaning it won’t look like a code chunk (with a grey background and border around it). This can be incredibly useful, as you will see with our NBA example, if you are parsing long strings of text that you then want to automatically turn into readable text in your final report.
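
A minimal illustration of such a chunk (a toy example, not the NBA code from the episode):

```{r, results='asis', echo=FALSE}
cat("This sentence renders as normal report text rather than as code output.")
```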

To watch our screen cast, CLICK HERE.

To access our code, CLICK HERE.