tidymodels train/test set without initial_split

Introduction

In {tidymodels} it is often recommended to split the data using the initial_split() function. This is useful when you want a random sample of the data, and the initial_split() object carries information that is used downstream in the model fitting and model prediction process. Sometimes, however, we want to fit a model to a training set that we define ourselves and then test it on a data set that we also define. For example, we might train a model on years 2010-2015 and then test it on years 2016-2019.

This tutorial walks through creating your own bespoke train/test sets, fitting a model, and then making predictions, while circumventing the issues that may arise from not having the initial_split() object.

Load the AirQuality Data

library(tidyverse)
library(tidymodels)
library(datasets)

data("airquality")

airquality %>% count(Month)


Train/Test Split

We want to use {tidymodels} to build a model on months 5-7 and then test that model on months 8 and 9.

Currently, the initial_split() function only takes a random sample of the data.

set.seed(192)
split_rand <- initial_split(airquality, prop = 3/4)
split_rand

train_rand <- training(split_rand)
test_rand <- testing(split_rand)

train_rand %>%
  count(Month)

test_rand %>%
  count(Month)

The strata argument within initial_split() only ensures that we get an even sample across our strata (in this case, Month).

split_strata <- initial_split(airquality, prop = 3/4, strata = Month)
split_strata

train_strata <- training(split_strata)
test_strata <- testing(split_strata)

train_strata %>%
  count(Month)

test_strata %>%
  count(Month)

  • Create our own train/test split, unique to the conditions we are interested in specifying.
train <- airquality %>%
  filter(Month < 8)

test <- airquality %>%
  filter(Month >= 8)
  • Create 5-fold cross validation for tuning our random forest model
set.seed(567)
cv_folds <- vfold_cv(data = train, v = 5)

Set up the model specification

  • We will use random forest
## model specification
aq_rf <- rand_forest(mtry = tune()) %>%
  set_engine("randomForest") %>%
  set_mode("regression")

Create a model recipe

There are NAs in a few of the columns. We will impute those, and we will also normalize the three numeric predictors in our model.

## recipe
aq_recipe <- recipe(Ozone ~ Solar.R + Wind + Temp + Month, data = train) %>%
  step_impute_median(Ozone, Solar.R) %>%
  step_normalize(Solar.R, Wind, Temp)

aq_recipe

## check that normalization and NA imputation occurred in the training data
aq_recipe %>%
  prep() %>%
  bake(new_data = NULL)

## check that normalization and NA imputation occurred in the testing data
aq_recipe %>%
  prep() %>%
  bake(new_data = test)


Set up workflow

  • Compile all of our components above together into a single workflow.
## Workflow
aq_workflow <- workflow() %>%
  add_model(aq_rf) %>%
  add_recipe(aq_recipe)

aq_workflow

Tune the random forest model

  • We set up one hyperparameter to tune, mtry, in our model specification.
## tuning grid
rf_tune_grid <- grid_regular(
  mtry(range = c(1, 4)),
  levels = 4  ## grid_regular() defaults to 3 levels; 4 tries every integer mtry in the range
)

rf_tune_grid

rf_tune <- tune_grid(
  aq_workflow,
  resamples = cv_folds,
  grid = rf_tune_grid
)

rf_tune

Get the model with the optimum mtry

## view model metrics
collect_metrics(rf_tune)

## Which is the best model?
select_best(rf_tune, metric = "rmse")

  • Looks like mtry = 1 was the best option, as it had the lowest RMSE and the highest R-squared.

Fit the final tuned model

  • model specification with mtry = 1
aq_rf_tuned <- rand_forest(mtry = 1) %>%
  set_engine("randomForest") %>%
  set_mode("regression")

Tuned Workflow

  • the recipe steps are the same
aq_workflow_tuned <- workflow() %>%
  add_model(aq_rf_tuned) %>%
  add_recipe(aq_recipe) 

aq_workflow_tuned

Final Fit

aq_final <- aq_workflow_tuned %>%
  fit(data = train)

Evaluate the final model

aq_final %>%
  extract_fit_parsnip()

Predict on test set

ozone_pred_rf <- predict(
  aq_final,
  test
)

ozone_pred_rf
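Beyond printing the predictions, we may want to score them against the observed values. A minimal sketch, assuming the test object from above and the {yardstick} functions that load with {tidymodels}: bind the predictions back to the test set and compute the standard regression metrics. Note that the raw test set still contains missing Ozone values, so we drop those rows before scoring.

```r
## Bind predictions to the test set and score with {yardstick}
## (a sketch -- exact metric values depend on the seed and the fitted model)
rf_test_results <- test %>%
  select(Ozone) %>%
  bind_cols(ozone_pred_rf)

rf_test_results %>%
  drop_na(Ozone) %>%          # raw test data still has missing Ozone values
  metrics(truth = Ozone, estimate = .pred)
```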

Conclusion

It is pretty easy to fit a model to a bespoke train/test split that doesn’t require the {tidymodels} initial_split() function. Simply construct the model, do any hyperparameter tuning, fit a final model, and make predictions.

Below I’ve added the code for anyone interested in seeing this same process using linear regression, which is easier than the random forest model since there are no hyperparameters to tune.

If you’d like to have the full code in one concise place, check out my GitHub page.

Doing the same tasks with linear regression

  • This is a bit easier since it doesn’t require hyperparameter tuning.


## Model specification
aq_linear <- linear_reg() %>%
  set_engine("lm") %>%
  set_mode("regression")

## Model Recipe (same as above)
aq_recipe <- recipe( Ozone ~ Solar.R + Wind + Temp + Month, data = train ) %>%
  step_impute_median(Ozone, Solar.R) %>% 
  step_normalize(Solar.R, Wind, Temp)

## Workflow
aq_wf_linear <- workflow() %>%
  add_recipe(aq_recipe) %>%
  add_model(aq_linear)

## Fit the model to the training data
lm_fit <- aq_wf_linear %>%
  fit(data = train)

## Get the model output
lm_fit %>%
  extract_fit_parsnip()

## Model output with traditional summary() function
lm_fit %>%
  extract_fit_parsnip() %>% 
  .$fit %>%
  summary()

## Model output in tidy format  
lm_fit %>%
  tidy()

## Make predictions on test set
ozone_pred_lm <- predict(
  lm_fit,
  test
)

ozone_pred_lm

TidyX Episode 124: Combining Multiple Conditions in a Single Column

Some of the best TidyX episodes are born out of questions from viewers or colleagues. This week, Ellis Hughes and I dip into the mailbag and answer a question from a colleague about identifying multiple conditions.

The colleague had a data set with specific observations in one column. Their goal was to create a new column that put those observations into one of several conditional groups. The issue they were running into was that case_when() and ifelse() only output the first condition that is met. The colleague wanted the conditional-groups column to be a comma-separated string, so that a single observation could match multiple conditional groups.
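One way to sketch this idea (not necessarily one of the episode’s three solutions, and using made-up data and condition labels): evaluate every condition for each row and paste the labels of the conditions that match into a single comma-separated string.

```r
## A sketch with hypothetical data: overlapping conditions, multiple labels per row
library(tidyverse)

df <- tibble(value = c(3, 8, 15))

df %>%
  rowwise() %>%
  mutate(
    groups = paste(
      c("low", "mid", "high")[c(value < 10, value > 5, value > 12)],
      collapse = ", "
    )
  ) %>%
  ungroup()
```

A value of 8 satisfies both the “low” and “mid” conditions here, so its groups entry becomes "low, mid" rather than just the first match.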

In this episode, we solve the problem in three different ways.

To watch our screen cast, CLICK HERE.

To access our code, CLICK HERE.

TidyX Episode 123: Using crossing to build data sets for simulation or player tracking data analysis

This week, Ellis and I discuss the {tidyverse} function crossing() and show how it can be used to construct data sets of every possible combination of input variables (Cartesian product).

This function is very powerful when attempting to create data sets, in particular for simulation purposes or for building a data set of all paired permutations of model input variables to test a model’s predictions and evaluate how it behaves under every circumstance.

We end with a simple example of how to use crossing() and left_join() to build a data set for player tracking data that allows you to calculate the Euclidean distance between all players on the field/pitch/court/ice.
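A toy sketch of that last idea, with made-up player coordinates: crossing() builds every player-to-player pair, left_join() attaches each player’s coordinates, and the Euclidean distance falls out of a mutate().

```r
## Pair every player with every other player and compute Euclidean distance
library(tidyverse)

players <- tibble(
  player = c("A", "B", "C"),
  x = c(0, 3, 6),
  y = c(0, 4, 8)
)

crossing(p1 = players$player, p2 = players$player) %>%
  filter(p1 != p2) %>%                                      # drop self-pairs
  left_join(players, by = c("p1" = "player")) %>%
  left_join(players, by = c("p2" = "player"), suffix = c("_1", "_2")) %>%
  mutate(dist = sqrt((x_1 - x_2)^2 + (y_1 - y_2)^2))
```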

To watch our screen cast, CLICK HERE.

To access our code, CLICK HERE.

TidyX Episode 122: Advanced Filtering with Event-Based Data

This week, Ellis and I talk through some advanced filtering techniques for dealing with event-based data. We talk through filtering data when dealing with time-to-event and survival analysis as well as when working with sensor data that requires you to filter between two discrete events of interest. For example, filtering all the accelerometer data occurring between events such as starting to walk and finishing the walk.
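One simple version of the between-two-events filter (a sketch with hypothetical sensor data, not necessarily the episode’s approach): use cumsum() over the event markers so that rows between a "start" and its following "stop" can be flagged and filtered.

```r
## Keep only the rows between the "start" and "stop" events
library(tidyverse)

sensor <- tibble(
  time  = 1:8,
  event = c(NA, "start", NA, NA, NA, "stop", NA, NA),
  accel = rnorm(8)
)

sensor %>%
  mutate(
    in_walk = cumsum(event %in% "start") > cumsum(event %in% "stop")
  ) %>%
  filter(in_walk | event %in% "stop")   # keep the closing "stop" row too
```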

To watch our screen cast, CLICK HERE.

To access our code, CLICK HERE.

TidyX Episode 121: Cleaning Research Data for Analysis – Viewer Question Answered

This week, Ellis Hughes and I work on cleaning some data for a researcher who messaged us asking how to get a data set into a format that he could analyze. He created a file of fake data that resembled the data he received from practitioners working in various treatment centers. The data was in a very wide format, with serial measurements going horizontally and one row per patient. To analyze such data, the researcher needed it in a long format. However, this was trickier than a simple pivot_longer() because multiple columns required pivoting. Ellis and I cover two different approaches to handling this issue, both resulting in the same final data set.
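A minimal sketch of the multiple-column pivot problem, with fake data (two measurement types at two time points, one row per patient): the ".value" sentinel in names_to tells pivot_longer() to spread part of each column name back out into separate value columns.

```r
## Pivot several measurement columns long at once with the ".value" sentinel
library(tidyverse)

wide <- tibble(
  patient = c(1, 2),
  hr_t1 = c(60, 72),  hr_t2 = c(65, 70),
  bp_t1 = c(120, 118), bp_t2 = c(122, 121)
)

wide %>%
  pivot_longer(
    cols = -patient,
    names_to = c(".value", "time"),   # "hr_t1" -> value column "hr", time "t1"
    names_sep = "_"
  )
```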

To watch the screen cast, CLICK HERE.

To access the data and our code, CLICK HERE.