Fitting, Saving, and Deploying tidymodels with Cross Validated Data

I’ve written about {tidymodels} before, when I laid out a {tidymodels} model fitting template to wrap up the 10-part screencast series we did on {tidymodels} for Tidy Explained.

In all 10 of our episodes, in my model fitting template, and in pretty much every tutorial I’ve seen online, people follow the same initial steps: split the data into training and testing sets, then split the training data into cross-validation folds.

This approach is fine when you have enough data to hold out a test set. But there are times when we don’t, which means we end up fitting a model to a small training set and hoping it picks up enough information to generalize well to external data.

In these instances, we may prefer to use all of our available data, split it into cross validation sets, fit and test the model, and then save the model workflow so that it can be deployed later on and used in production.

To cover this issue, I’ve put together a template for taking a data set, creating cross-validation folds, fitting the model, and then saving the model. The code has both a regression model and a random forest classification model fit to the mtcars data set. I’ll only show the regression example below, but all code is available on my GitHub page.

Load Packages & Data

### load packages
library(tidymodels)
library(tidyverse)

############ Regression Example ############
### get data
df <- mtcars
head(df)

Create Cross-Validation Folds & Specify Linear Model

### cross validation folds
df_cv <- vfold_cv(df, v = 10)
df_cv

### specify linear model
lm_spec <- linear_reg() %>%
  set_engine("lm") %>%
  set_mode("regression")
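To see what the folds contain, here is a minimal sketch that pulls one split apart with analysis() and assessment() from {rsample}. The seed and the object names (first_split, train_rows, test_rows, fold_sizes) are my own additions for illustration and are not part of the template.

```r
library(tidymodels)

set.seed(2024)  # assumption: seed added only so the folds are reproducible
df <- mtcars
df_cv <- vfold_cv(df, v = 10)

### pull the first fold apart
first_split <- df_cv$splits[[1]]
train_rows <- analysis(first_split)    # rows used to fit the model in this fold
test_rows  <- assessment(first_split)  # rows held out for prediction in this fold

c(fit_on = nrow(train_rows), held_out = nrow(test_rows))

### every row is held out exactly once across the 10 folds
fold_sizes <- map_int(df_cv$splits, ~ nrow(assessment(.x)))
sum(fold_sizes)
```

With only 32 cars, each fold holds out just three or four rows, so the per-fold metrics will be noisy.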

Create the Model Recipe and Workflow

To keep things simple, I won’t do any pre-processing of the data. I’ll just set the recipe with the regression model I am fitting.

### recipe
mpg_rec <- recipe(mpg ~ cyl + disp + wt, data = df)
mpg_rec

### workflow
mpg_wf <- workflow() %>%
  add_recipe(mpg_rec) %>%
  add_model(lm_spec)

Control Function to Save Predictions

To save the predictions made on the held-out portion of each cross-validation fold, we need to set a control function that will be passed as an argument when we fit our model. Without this argument, we can still fit the model using the cross-validated folds, but we won’t be able to extract the predictions.

### set a control function to save the predictions from the model fit to the CV-folds
ctrl <- control_resamples(save_pred = TRUE)

Fit the Model
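
A minimal sketch of this step, assuming the workflow is fit to the folds with fit_resamples() from {tune} and stored as mpg_lm (the name used by the code that follows). The earlier objects are rebuilt so the block runs on its own, and the seed is my own addition.

```r
library(tidymodels)

set.seed(2024)  # assumption: seed added only so the folds are reproducible
df <- mtcars
df_cv <- vfold_cv(df, v = 10)

lm_spec <- linear_reg() %>%
  set_engine("lm") %>%
  set_mode("regression")

mpg_wf <- workflow() %>%
  add_recipe(recipe(mpg ~ cyl + disp + wt, data = df)) %>%
  add_model(lm_spec)

ctrl <- control_resamples(save_pred = TRUE)

### fit the workflow to each fold and keep the held-out predictions
mpg_lm <- fit_resamples(
  mpg_wf,
  resamples = df_cv,
  control = ctrl
)

mpg_lm
```

The result has one row per fold. Because save_pred = TRUE, it also carries a .predictions list-column alongside .metrics.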

Evaluate Model Performance

### view model metrics
collect_metrics(mpg_lm)
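
collect_metrics() averages over the folds by default. With a data set this small, it can help to look at the fold-by-fold values too. Here is a sketch using the summarize = FALSE argument; the workflow and resampled fit are rebuilt so the block runs on its own, and the seed and the name fold_metrics are my own additions.

```r
library(tidymodels)

set.seed(2024)  # assumption: seed added only so the folds are reproducible
df_cv <- vfold_cv(mtcars, v = 10)

mpg_wf <- workflow() %>%
  add_recipe(recipe(mpg ~ cyl + disp + wt, data = mtcars)) %>%
  add_model(linear_reg() %>% set_engine("lm"))

mpg_lm <- fit_resamples(
  mpg_wf,
  resamples = df_cv,
  control = control_resamples(save_pred = TRUE)
)

### one row per fold per metric instead of the averaged summary
fold_metrics <- collect_metrics(mpg_lm, summarize = FALSE)

### folds ordered from worst to best RMSE
fold_metrics %>%
  filter(.metric == "rmse") %>%
  arrange(desc(.estimate))
```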

Unnest the .predictions column from the model fit and compare the predicted mpg with the actual mpg for each held-out row:

### get predictions
mpg_lm %>%
  unnest(cols = .predictions) %>%
  select(.pred, mpg)
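
As an alternative to unnesting by hand, {tune} provides collect_predictions(), which returns the same held-out predictions along with the fold id and original row number. The sketch below also pools those predictions into a single RMSE with rmse() from {yardstick}. The setup is rebuilt so the block runs on its own, and the seed and the name cv_preds are my own additions.

```r
library(tidymodels)

set.seed(2024)  # assumption: seed added only so the folds are reproducible
df_cv <- vfold_cv(mtcars, v = 10)

mpg_wf <- workflow() %>%
  add_recipe(recipe(mpg ~ cyl + disp + wt, data = mtcars)) %>%
  add_model(linear_reg() %>% set_engine("lm"))

mpg_lm <- fit_resamples(
  mpg_wf,
  resamples = df_cv,
  control = control_resamples(save_pred = TRUE)
)

### held-out predictions, one row per car
cv_preds <- collect_predictions(mpg_lm)

cv_preds %>%
  select(id, .row, .pred, mpg)

### RMSE computed over all held-out predictions pooled together
rmse(cv_preds, truth = mpg, estimate = .pred)
```

The pooled RMSE will usually differ a little from the value collect_metrics() reports, since that one is the mean of the per-fold RMSEs.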

Fit the Final Model and Extract the Model Fit

If we are happy with our model performance and the workflow that we’ve built (which contains our pre-processing steps), we can fit the final model to the full data set.

To do this, we pass the data set to fit() and then use extract_fit_parsnip() to pull the fitted parsnip model out of the workflow. Note that extract_fit_parsnip() returns only the model fit, not the recipe. That is fine here because the recipe has no pre-processing steps, but if yours does, keep the fitted workflow returned by fit() instead. Then save the object as an RDA file to be loaded and used at a later time.

## Fit the final model & extract the model fit
mpg_final <- mpg_wf %>%
  fit(df) %>%
  extract_fit_parsnip()

mpg_final

## Save model to use later
# save(mpg_final, file = "mpg_final.rda")
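
For the deployment side, here is a minimal sketch of the save, load, and predict round trip. It makes a few assumptions of my own: it saves the fitted workflow itself (so that any recipe steps would travel with the model), it writes to a temporary file, and the two cars in new_cars are made-up values for illustration. The names mpg_wf_fit, model_path, new_cars, and new_preds are hypothetical.

```r
library(tidymodels)

mpg_wf <- workflow() %>%
  add_recipe(recipe(mpg ~ cyl + disp + wt, data = mtcars)) %>%
  add_model(linear_reg() %>% set_engine("lm"))

### fit to all of the data and save the fitted workflow
mpg_wf_fit <- fit(mpg_wf, data = mtcars)

model_path <- file.path(tempdir(), "mpg_wf_fit.rda")  # hypothetical location
save(mpg_wf_fit, file = model_path)

### later on (simulated here by removing the object and loading it back)
rm(mpg_wf_fit)
load(model_path)  # restores an object named mpg_wf_fit

### made-up new observations with the three predictors the model expects
new_cars <- tibble(
  cyl  = c(4, 8),
  disp = c(120, 350),
  wt   = c(2.5, 3.8)
)

new_preds <- predict(mpg_wf_fit, new_data = new_cars)
bind_cols(new_cars, new_preds)
```

The session that loads the file still needs {tidymodels} attached, because predict() relies on the package methods for the saved object.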

To access all of the code for this template and see an example with a random forest classifier, go to my GitHub page.