A colleague recently asked me about XGBOOST (Extreme Gradient Boosting) models, so I figured I’d put together a short tutorial on using XGBOOST both with the {xgboost} package and within the {tidymodels} environment.
Like all machine learning models, XGBOOST has a number of different knobs and levers we can twist, push, and pull in order to tune the model and identify the best parameters for learning from the data. One can get a look at these parameters by loading the {xgboost} package and then typing ?xgboost into the console.
XGBOOST falls in the class of tree-based models. It is an effective machine learning algorithm for identifying patterns in data and it can be used for regression or classification problems.
Fitting the model in {tidymodels} is pretty straightforward as far as the {tidymodels} syntax goes. However, if fitting models with the {xgboost} package, there are a number of considerations we need to make with the data. For example, XGBOOST can only use numeric data, so any categorical features will need to be one-hot encoded. Additionally, the data cannot be in data frame form, but rather needs to be turned into a matrix.
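As a quick, hypothetical illustration of that data preparation (the df and cyl_type names below are made up for this example and are not part of the mtcars walkthrough that follows), the encoding and matrix conversion might look something like this:

# hypothetical data frame with one categorical feature
df <- data.frame(
  mpg = c(21, 23, 19),
  cyl_type = factor(c("four", "six", "eight"))
)

# model.matrix() turns the factor into numeric 0/1 dummy columns; drop the intercept column
x <- model.matrix(mpg ~ ., data = df)[, -1]
y <- df$mpg

x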
Let’s jump in!
(NOTE: All code is available on my Github page.)
Load Data & Packages
We will keep the data simple and use the mtcars data from base R. XGBOOST is well suited for complex data with a lot of interactions and non-linearity, but I’d like to use a simple data set so that the models can run fast for you and you get the gist of fitting them. Tuning all of the parameters can take a lot of time if you are dealing with a big data set.
Our goal is to predict mpg using the other features in the data.
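Something like the following covers the packages leaned on throughout this post ({doParallel} only comes into play later, for parallel tuning):

library(tidyverse)   # dplyr & ggplot2 for wrangling and plotting
library(xgboost)     # xgb.DMatrix(), xgb.cv(), xgboost()
library(tidymodels)  # rsample, parsnip, tune, yardstick

head(mtcars)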
{xgboost} for a regression problem
First we split our data into a training and testing data set.
set.seed(2025)
N <- nrow(mtcars)
ids <- sample(x = 1:N, size = floor(N * 0.7), replace = FALSE)

train <- mtcars[ids, ]
test <- mtcars[-ids, ]
Prepare the data
For the {xgboost} package to run, we need the predictor variables to be in a data matrix and the outcome variable to be a vector. We then wrap the prepared data in an xgb.DMatrix, the data structure the {xgboost} functions expect, so we can run our XGBOOST model.
train_x <- data.matrix(train[, -1])
train_y <- train$mpg

test_x <- data.matrix(test[, -1])
test_y <- test$mpg

xgb_train <- xgb.DMatrix(data = train_x, label = train_y)
xgb_test <- xgb.DMatrix(data = test_x, label = test_y)
Tune the model
First, we walk through an example of how the xgb.cv() function works and what it outputs. Then we can write a loop to extract the relevant information for the purposes of tuning the hyperparameters.
In the code below, notice that we specify a variety of hyperparameters (e.g., max.depth, eta, subsample, and colsample_bytree). These aren’t all of the possible hyperparameters that can be tuned, though. Rather than hash them all out here, you can find a list of their definitions by either using the help function in the R console, ?xgb.cv, or checking out THIS webpage.
fit_example <- xgb.cv(
  data = xgb_train,
  params = list(
    booster = "gbtree",
    objective = "reg:squarederror",
    max.depth = 15,
    eta = 0.1,
    subsample = 0.1,
    colsample_bytree = 0.5
  ),
  eval_metric = "rmse",
  nfold = 5,
  watchlist = list(train = xgb_train, test = xgb_test),
  early_stopping_rounds = 10,   # expects an integer number of rounds with no improvement
  nrounds = 1000
)
There are a number of elements within the returned object that we can extract:
- best_iteration indicates the best iteration of the model fit, represented as the nrounds argument in our function.
- evaluation_log provides us with information about the RMSE for all iterations of model fitting.
We can directly select the best iteration from the evaluation log.
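For example, something like this pulls both pieces from the fit above (stored as fit_example):

# best iteration (the optimal number of boosting rounds)
fit_example$best_iteration

# cross-validated RMSE at the best iteration
fit_example$evaluation_log %>%
  as.data.frame() %>%
  filter(iter == fit_example$best_iteration)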
Here we see the RMSE for the training and testing data at the best iteration.
Now that we know what we are looking at with the xgb.cv() output, let’s write a grid of possible values for the hyperparameters and create a data frame, grid_params, that stores this information.
eta <- seq(0.1, 0.9, 0.1)
max_depth <- c(5, 10, 15)
subsample <- c(0.5, 0.75, 1)
colsample_bytree <- c(0.5, 0.75, 1)
nrounds <- seq(100, 500, 100)

grid_params <- expand.grid(
  nrounds = nrounds,
  max_depth = max_depth,
  subsample = subsample,
  colsample_bytree = colsample_bytree,
  eta = eta
)

head(grid_params)
n <- nrow(grid_params)
Next, we need an empty data frame to store the best iteration for each row of our grid parameters data frame.
output_df <- data.frame(
  "iter" = rep(NA, n),
  "train_rmse_mean" = rep(NA, n),
  "train_rmse_std" = rep(NA, n),
  "test_rmse_mean" = rep(NA, n),
  "test_rmse_std" = rep(NA, n)
)
Now run a for() loop to find the best parameters for the data!
for(i in 1:n){

  fit <- xgb.cv(
    data = xgb_train,
    params = list(
      booster = "gbtree",
      objective = "reg:squarederror",
      max.depth = grid_params$max_depth[i],
      eta = grid_params$eta[i],
      subsample = grid_params$subsample[i],
      colsample_bytree = grid_params$colsample_bytree[i]
    ),
    eval_metric = "rmse",
    nfold = 5,
    nrounds = grid_params$nrounds[i],
    watchlist = list(train = xgb_train, test = xgb_test),
    early_stopping_rounds = 10,
    verbose = FALSE
  )

  best_fit <- fit$best_iteration

  # evaluation log at the best iteration for this parameter combination
  best_log <- fit$evaluation_log %>%
    as.data.frame() %>%
    filter(iter == best_fit)

  output_df[i, 1] <- best_log %>% pull(iter)
  output_df[i, 2] <- best_log %>% pull(train_rmse_mean)
  output_df[i, 3] <- best_log %>% pull(train_rmse_std)
  output_df[i, 4] <- best_log %>% pull(test_rmse_mean)
  output_df[i, 5] <- best_log %>% pull(test_rmse_std)

}
Join the grid_params data set with the output_df from our for() loop and select the parameter combination that had the lowest cross-validated test set RMSE.
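A minimal sketch of that step (assuming the rows of output_df line up with the rows of grid_params, which they do given the loop above):

best_params <- grid_params %>%
  bind_cols(output_df) %>%
  arrange(test_rmse_mean) %>%
  slice(1)

best_params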
Fit the model
After tuning, we fit the model with the optimized hyperparameters stored in best_params.
## fit model
fit_mpg <- xgboost(
  data = xgb_train,
  params = list(
    booster = "gbtree",
    objective = "reg:squarederror",
    max.depth = best_params$max_depth,
    eta = best_params$eta,
    subsample = best_params$subsample,
    colsample_bytree = best_params$colsample_bytree
  ),
  eval_metric = "rmse",
  nrounds = best_params$nrounds
)
Variables of Importance
Plot the variables of importance from the tuned model.
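A sketch using the {xgboost} helpers xgb.importance() and xgb.plot.importance() (object names follow the tuned fit above):

# importance matrix from the tuned model
var_imp <- xgb.importance(model = fit_mpg)
var_imp

# plot variable importance (gain)
xgb.plot.importance(var_imp)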
Predictions
We can plot the predictions and obtain the RMSE on our test set.
train_pred_y <- predict(fit_mpg, xgb_train)
test_pred_y <- predict(fit_mpg, xgb_test)

test_x %>%
  as.data.frame() %>%        # convert the matrix to a data frame for dplyr/ggplot2
  bind_cols(mpg = test_y) %>%
  mutate(pred_mpg = test_pred_y) %>%
  ggplot(aes(x = pred_mpg, y = mpg)) +
  geom_point() +
  geom_abline()

# test set RMSE
sqrt(mean((test_y - test_pred_y)^2))
{tidymodels} XGBOOST for regression
We can do the same type of XGBOOST model fitting in {tidymodels}.
Train/Test Split
set.seed(3333)
car_split <- initial_split(mtcars)
car_split

train <- training(car_split)
test <- testing(car_split)
Cross Validation Folds
set.seed(34)
cv_folds <- vfold_cv(
  data = train,
  v = 5
)
Model Specification
In the boost_tree() function we specify which hyperparameters we want to tune.
boost_spec <- boost_tree(
  trees = tune(),
  tree_depth = tune(),
  min_n = tune(),
  loss_reduction = tune(),
  sample_size = tune(),
  mtry = tune(),
  learn_rate = tune()
) %>%
  set_engine("xgboost") %>%
  set_mode("regression")
Workflow
Create the workflow, add the formula we want for the model, and add the model specifications.
xgb_wf <- workflow() %>%
  add_formula(mpg ~ .) %>%
  add_model(boost_spec)
Set up tuning grid
Instead of hand-picking values for each of the tuning parameters, as we did in the {xgboost} package example, we will use the grid_latin_hypercube() function to construct the parameter grid for us. We pass it a size argument to indicate how large (how many rows) we want the grid to be.
xgb_grid <- grid_latin_hypercube(
  tree_depth(),
  trees(),
  min_n(),
  loss_reduction(),
  sample_size = sample_prop(),
  finalize(mtry(), train),
  learn_rate(),
  size = 40
)
Hyperparameter tuning on cross-validated folds
doParallel::registerDoParallel()

set.seed(66)
xgb_res <- tune_grid(
  xgb_wf,
  resamples = cv_folds,
  grid = xgb_grid,
  control = control_grid(save_pred = TRUE)
)
Obtaining cross-validated outputs and identifying the best model
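A sketch of how the best model can be identified with the {tune} helpers; best_xgb below is the object the next chunk pulls its hyperparameters from:

# cross-validated metrics for every candidate in the grid
collect_metrics(xgb_res)

# hyperparameter combination with the lowest cross-validated RMSE
best_xgb <- select_best(xgb_res, metric = "rmse")
best_xgb

With best_xgb in hand, we plug those values into a final model specification and fit it on the training set.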
boost_spec <- boost_tree(
  mtry = best_xgb %>% pull(mtry),
  trees = best_xgb %>% pull(trees),
  min_n = best_xgb %>% pull(min_n),
  tree_depth = best_xgb %>% pull(tree_depth),
  learn_rate = best_xgb %>% pull(learn_rate),
  loss_reduction = best_xgb %>% pull(loss_reduction),
  sample_size = best_xgb %>% pull(sample_size)
) %>%
  set_engine("xgboost") %>%
  set_mode("regression")

boost_fit <- boost_spec %>%
  fit(mpg ~ ., data = train)
Evaluate Test Set Performance
augment(boost_fit, new_data = test) %>%
  rmse(truth = mpg, estimate = .pred)

augment(boost_fit, new_data = test) %>%
  rsq(truth = mpg, estimate = .pred)

augment(boost_fit, new_data = test) %>%
  ggplot(aes(x = .pred, y = mpg)) +
  geom_jitter() +
  geom_abline(intercept = 0, slope = 1)
Wrapping up
XGBOOST is a flexible model that can be used for continuous outcomes, binary outcomes, and multi-class classification tasks. The model does have a number of tuning parameters that can help to optimize performance. However, it is relatively easy to create a grid of potential values and either write a for() loop or use the {tidymodels} framework to do the heavy lifting for you.
To obtain the full code and RMarkdown file check out my Github page.