{"id":3510,"date":"2025-04-07T02:48:10","date_gmt":"2025-04-07T02:48:10","guid":{"rendered":"http:\/\/optimumsportsperformance.com\/blog\/?p=3510"},"modified":"2025-04-07T12:13:21","modified_gmt":"2025-04-07T12:13:21","slug":"xgboost-tuning-tutorial-in-xgboost-and-tidymodels","status":"publish","type":"post","link":"https:\/\/optimumsportsperformance.com\/blog\/xgboost-tuning-tutorial-in-xgboost-and-tidymodels\/","title":{"rendered":"XGBOOST Tuning &#8211; Tutorial in {xgboost} and {tidymodels}"},"content":{"rendered":"<p>A colleague recently asked me about XGBOOST (Extreme Gradient Boosting) models so I figured I&#8217;d put together a short tutorial on using XGBOOST both with the {<strong>xgboost<\/strong>} package and within the {<strong>tidymodels<\/strong>} environment.<\/p>\n<p>Like all machine learning models, XGBOOST has a number of different knobs and levers we can twist, push, and pull in order to tune the model and identify the best parameters for learning from the data. One can get a look at these parameters by loading the {<strong>xgboost<\/strong>} package and then typing <em><strong>?xgboost<\/strong><\/em> into the console.<\/p>\n<p>XGBOOST falls in the class of tree-based models. It is an effective machine learning algorithm for identifying patterns in data and it can be used for regression or classification problems.<\/p>\n<p>Fitting the model in {<strong>tidymodels<\/strong>} is pretty straightforward as far as the {<strong>tidymodels<\/strong>} syntax goes. However, if fitting models with the {<strong>xgboost<\/strong>} package, there are a number of considerations we need to make with the data. For example, XGBOOST can only use numeric data, so any categorical features will need to be one-hot encoded. 
Additionally, the data cannot be in a data frame form, but rather needs to be turned into a matrix.<\/p>\n<p>Let&#8217;s jump in!<\/p>\n<p>(<em><strong>NOTE:<\/strong><\/em><em> All code is available on my <strong><span style=\"color: #0000ff;\"><a style=\"color: #0000ff;\" href=\"https:\/\/github.com\/pw2\/XGBOOST-Regression-Tutorial\">Github page<\/a><\/span><\/strong>.)<\/em><\/p>\n<p><span style=\"text-decoration: underline;\"><strong>Load Data &amp; Packages<\/strong><\/span><\/p>\n<p>We will keep the data simple and use the <strong>mtcars<\/strong> data from base R. XGBOOST is well suited for complex data with a lot of interactions and non-linearity, but I&#8217;d like to use a simple data set so that the models can run fast for you and you get the gist of fitting them. Tuning all of the parameters can take a lot of time if you are dealing with a big data set.<\/p>\n<p><a href=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2025\/04\/Screenshot-2025-04-06-at-7.05.49\u202fPM.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-large wp-image-3511\" src=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2025\/04\/Screenshot-2025-04-06-at-7.05.49\u202fPM-1024x292.png\" alt=\"\" width=\"625\" height=\"178\" srcset=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2025\/04\/Screenshot-2025-04-06-at-7.05.49\u202fPM-1024x292.png 1024w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2025\/04\/Screenshot-2025-04-06-at-7.05.49\u202fPM-300x85.png 300w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2025\/04\/Screenshot-2025-04-06-at-7.05.49\u202fPM-768x219.png 768w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2025\/04\/Screenshot-2025-04-06-at-7.05.49\u202fPM-624x178.png 624w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2025\/04\/Screenshot-2025-04-06-at-7.05.49\u202fPM.png 1096w\" sizes=\"auto, (max-width: 
625px) 100vw, 625px\" \/><\/a><\/p>\n<p>Our goal is to predict <strong><em>mpg<\/em><\/strong> using the other features in the data.<\/p>\n<p><a href=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2025\/04\/Screenshot-2025-04-06-at-7.07.12\u202fPM.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-3513\" src=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2025\/04\/Screenshot-2025-04-06-at-7.07.12\u202fPM-943x1024.png\" alt=\"\" width=\"469\" height=\"510\" srcset=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2025\/04\/Screenshot-2025-04-06-at-7.07.12\u202fPM-943x1024.png 943w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2025\/04\/Screenshot-2025-04-06-at-7.07.12\u202fPM-276x300.png 276w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2025\/04\/Screenshot-2025-04-06-at-7.07.12\u202fPM-768x834.png 768w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2025\/04\/Screenshot-2025-04-06-at-7.07.12\u202fPM-624x678.png 624w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2025\/04\/Screenshot-2025-04-06-at-7.07.12\u202fPM.png 1046w\" sizes=\"auto, (max-width: 469px) 100vw, 469px\" \/><\/a><\/p>\n<p><span style=\"text-decoration: underline;\"><strong>{xgboost} for a regression problem<\/strong><\/span><\/p>\n<p>First we split our data into a training and testing data set.<\/p>\n<pre class=\"brush: r; title: ; notranslate\" title=\"\">\r\nset.seed(2025)\r\nN &lt;- nrow(mtcars)\r\nids &lt;- sample(x = 1:N, size = floor(N * 0.7), replace = FALSE)\r\ntrain &lt;- mtcars&#x5B;ids, ]\r\ntest &lt;- mtcars&#x5B;-ids, ]\r\n<\/pre>\n<p><em>Prepare the data<\/em><\/p>\n<p>For the {<strong>xgboost<\/strong>} package to run we need the predictor variables to be in a data matrix and the outcome variables to be a vector. 
We then place the prepared data into an <em><strong>xgb.DMatrix<\/strong><\/em> object, the data structure that {<strong>xgboost<\/strong>} uses to run the model.<\/p>\n<pre class=\"brush: r; title: ; notranslate\" title=\"\">\r\n## predictors as a numeric matrix (column 1 is mpg), outcome as a vector\r\ntrain_x &lt;- data.matrix(train&#x5B;, -1])\r\ntrain_y &lt;- train$mpg\r\n\r\ntest_x &lt;- data.matrix(test&#x5B;, -1])\r\ntest_y &lt;- test$mpg\r\n\r\nxgb_train &lt;- xgb.DMatrix(data = train_x, label = train_y)\r\nxgb_test &lt;- xgb.DMatrix(data = test_x, label = test_y)\r\n<\/pre>\n<p><em>Tune the model<\/em><\/p>\n<p>First, we walk through an example of how the <em><strong>xgb.cv<\/strong><\/em> function works and what it outputs. Then we can write a loop to extract the relevant information for the purposes of tuning the hyperparameters.<\/p>\n<p>Notice that, in the code below, we specify a variety of hyperparameters (e.g., <em><strong>max.depth<\/strong><\/em>, <em><strong>eta<\/strong><\/em>, <em><strong>subsample<\/strong><\/em>, and <em><strong>colsample_bytree<\/strong><\/em>). These aren&#8217;t all of the possible hyperparameters that can be tuned, though. 
Rather than hash them all out here, you can find a list of their definitions by either using the help function in the R console, <em><strong>?xgb.cv<\/strong><\/em>, or checking out <span style=\"color: #0000ff;\"><strong><a style=\"color: #0000ff;\" href=\"https:\/\/xgboost.readthedocs.io\/en\/stable\/parameter.html\">THIS<\/a><\/strong><\/span> webpage.<\/p>\n<pre class=\"brush: r; title: ; notranslate\" title=\"\">\r\nfit_example &lt;- xgb.cv(data = xgb_train,\r\n                      params = list(booster = &quot;gbtree&quot;,\r\n                        objective = &quot;reg:squarederror&quot;,\r\n                        max.depth = 15,\r\n                        eta = 0.1,\r\n                        subsample = 0.1,\r\n                        colsample_bytree = 0.5),\r\n                      eval_metric = &quot;rmse&quot;,\r\n                      nfold = 5,\r\n                      watchlist = list(train = xgb_train, test = xgb_test),\r\n                      ## expects an integer number of rounds; TRUE is coerced to 1\r\n                      early_stopping_rounds = TRUE,\r\n                      nrounds = 1000)\r\n<\/pre>\n<p>There are a number of features within the model that we can extract:<\/p>\n<ul>\n<li><em><strong>best_iteration<\/strong><\/em> indicates the best iteration of the model fit, i.e., the number of boosting rounds (out of the maximum set by the <em><strong>nrounds <\/strong><\/em>argument in our function) that gave the best cross-validated error.<\/li>\n<\/ul>\n<p><a href=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2025\/04\/Screenshot-2025-04-06-at-7.12.44\u202fPM.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-3514\" src=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2025\/04\/Screenshot-2025-04-06-at-7.12.44\u202fPM.png\" alt=\"\" width=\"338\" height=\"64\" srcset=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2025\/04\/Screenshot-2025-04-06-at-7.12.44\u202fPM.png 476w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2025\/04\/Screenshot-2025-04-06-at-7.12.44\u202fPM-300x57.png 300w\" 
sizes=\"auto, (max-width: 338px) 100vw, 338px\" \/><\/a><\/p>\n<ul>\n<li><em><strong>evaluation_log<\/strong><\/em> provides us with information about the RMSE for all iterations of model fitting.<\/li>\n<\/ul>\n<p><a href=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2025\/04\/Screenshot-2025-04-06-at-7.12.58\u202fPM.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-large wp-image-3515\" src=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2025\/04\/Screenshot-2025-04-06-at-7.12.58\u202fPM-1024x242.png\" alt=\"\" width=\"625\" height=\"148\" srcset=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2025\/04\/Screenshot-2025-04-06-at-7.12.58\u202fPM-1024x242.png 1024w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2025\/04\/Screenshot-2025-04-06-at-7.12.58\u202fPM-300x71.png 300w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2025\/04\/Screenshot-2025-04-06-at-7.12.58\u202fPM-768x182.png 768w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2025\/04\/Screenshot-2025-04-06-at-7.12.58\u202fPM-624x148.png 624w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2025\/04\/Screenshot-2025-04-06-at-7.12.58\u202fPM.png 1116w\" sizes=\"auto, (max-width: 625px) 100vw, 625px\" \/><\/a><\/p>\n<p>We can directly select the best iteration from the evaluation log.<\/p>\n<p><a href=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2025\/04\/Screenshot-2025-04-06-at-7.14.37\u202fPM.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-large wp-image-3516\" src=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2025\/04\/Screenshot-2025-04-06-at-7.14.37\u202fPM-1024x154.png\" alt=\"\" width=\"625\" height=\"94\" srcset=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2025\/04\/Screenshot-2025-04-06-at-7.14.37\u202fPM-1024x154.png 1024w, 
https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2025\/04\/Screenshot-2025-04-06-at-7.14.37\u202fPM-300x45.png 300w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2025\/04\/Screenshot-2025-04-06-at-7.14.37\u202fPM-768x116.png 768w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2025\/04\/Screenshot-2025-04-06-at-7.14.37\u202fPM-624x94.png 624w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2025\/04\/Screenshot-2025-04-06-at-7.14.37\u202fPM.png 1088w\" sizes=\"auto, (max-width: 625px) 100vw, 625px\" \/><\/a><\/p>\n<p>Here we see the RMSE for the training and testing data at the best iteration.<\/p>\n<p>Now that we know what we are looking at with the <em><strong>xgb.cv()<\/strong><\/em> output, let&#8217;s write a grid of possible values for the hyperparameters and create a data frame, <em><strong>grid_params<\/strong><\/em> that stores this information.<\/p>\n<pre class=\"brush: r; title: ; notranslate\" title=\"\">\r\neta &lt;- seq(0.1, 0.9, 0.1)\r\nmax_depth &lt;- c(5, 10, 15)\r\nsubsample &lt;- c(0.5, 0.75, 1)\r\ncolsample_bytree &lt;- c(0.5, 0.75, 1)\r\nnrounds &lt;- seq(100, 500, 100)\r\n\r\ngrid_params &lt;- expand.grid(nrounds = nrounds,\r\n                           max_depth = max_depth, \r\n                           subsample = subsample,\r\n                           colsample_bytree = colsample_bytree,\r\n                           eta = eta)\r\n\r\nhead(grid_params)\r\n\r\nn &lt;- nrow(grid_params)\r\n<\/pre>\n<p>Next, we need an empty data frame to store the best iteration for each row of our grid parameters data frame.<\/p>\n<pre class=\"brush: r; title: ; notranslate\" title=\"\">\r\noutput_df &lt;- data.frame(&quot;iter&quot; = rep(NA, n), \r\n                           &quot;train_rmse_mean&quot; = rep(NA, n), \r\n                           &quot;train_rmse_std&quot; = rep(NA, n), \r\n                           &quot;test_rmse_mean&quot; = rep(NA, n), \r\n             
              &quot;test_rmse_std&quot; = rep(NA, n))\r\n<\/pre>\n<p>Now run a <em><strong>for()<\/strong><\/em> loop to find the best parameters for the data!<\/p>\n<pre class=\"brush: r; title: ; notranslate\" title=\"\">\r\nfor(i in 1:n){\r\n  \r\n  fit &lt;- xgb.cv(data = xgb_train,\r\n                params = list(booster = &quot;gbtree&quot;,\r\n                      objective = &quot;reg:squarederror&quot;,\r\n                      max.depth = grid_params$max_depth&#x5B;i],\r\n                      eta = grid_params$eta&#x5B;i],\r\n                      subsample = grid_params$subsample&#x5B;i],\r\n                      colsample_bytree = grid_params$colsample_bytree&#x5B;i]),\r\n                      eval_metric = &quot;rmse&quot;,\r\n                      nfold = 5,\r\n                      nrounds = grid_params$nrounds&#x5B;i],\r\n                      watchlist = list(train = xgb_train, test = xgb_test),\r\n                      early_stopping_rounds = TRUE,\r\n                      verbose = FALSE)\r\n  \r\n  best_fit &lt;- fit$best_iteration\r\n  \r\n  ## store the cross-validation results at the best iteration\r\n  output_df&#x5B;i, 1] &lt;- fit$evaluation_log %&gt;% as.data.frame() %&gt;% filter(iter == best_fit) %&gt;% pull(iter)\r\n  output_df&#x5B;i, 2] &lt;- fit$evaluation_log %&gt;% as.data.frame() %&gt;% filter(iter == best_fit) %&gt;% pull(train_rmse_mean)\r\n  output_df&#x5B;i, 3] &lt;- fit$evaluation_log %&gt;% as.data.frame() %&gt;% filter(iter == best_fit) %&gt;% pull(train_rmse_std)\r\n  output_df&#x5B;i, 4] &lt;- fit$evaluation_log %&gt;% as.data.frame() %&gt;% filter(iter == best_fit) %&gt;% pull(test_rmse_mean)\r\n  output_df&#x5B;i, 5] &lt;- fit$evaluation_log %&gt;% as.data.frame() %&gt;% filter(iter == best_fit) %&gt;% pull(test_rmse_std)\r\n\r\n}\r\n<\/pre>\n<p>Join the <em><strong>grid_params<\/strong><\/em> data set with the <strong><em>output_df<\/em><\/strong> of our <em><strong>for()<\/strong><\/em> loop and select the 
output that had the lowest train set RMSE.<\/p>\n<p><a href=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2025\/04\/Screenshot-2025-04-06-at-7.18.32\u202fPM.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-large wp-image-3517\" src=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2025\/04\/Screenshot-2025-04-06-at-7.18.32\u202fPM-1024x147.png\" alt=\"\" width=\"625\" height=\"90\" srcset=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2025\/04\/Screenshot-2025-04-06-at-7.18.32\u202fPM-1024x147.png 1024w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2025\/04\/Screenshot-2025-04-06-at-7.18.32\u202fPM-300x43.png 300w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2025\/04\/Screenshot-2025-04-06-at-7.18.32\u202fPM-768x110.png 768w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2025\/04\/Screenshot-2025-04-06-at-7.18.32\u202fPM-624x89.png 624w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2025\/04\/Screenshot-2025-04-06-at-7.18.32\u202fPM.png 1410w\" sizes=\"auto, (max-width: 625px) 100vw, 625px\" \/><\/a><\/p>\n<p><em>Fit the model<\/em><\/p>\n<p>After tuning, we want to fit the model to the optimized hyper-parameters stored in the <em><strong>best_params<\/strong><\/em> element.<\/p>\n<pre class=\"brush: r; title: ; notranslate\" title=\"\">\r\n## fit model\r\nfit_mpg &lt;- xgboost(data = xgb_train,\r\n                   params = list(booster = &quot;gbtree&quot;,\r\n                      objective = &quot;reg:squarederror&quot;,\r\n                      max.depth = best_params$max_depth,\r\n                      eta = best_params$eta,\r\n                      subsample = best_params$subsample,\r\n                      colsample_bytree = best_params$colsample_bytree),\r\n                      eval_metric = &quot;rmse&quot;,\r\n                      nrounds = 
best_params$nrounds)\r\n<\/pre>\n<p><em>Variable Importance<\/em><\/p>\n<p>Plot the variable importance from the tuned model.<\/p>\n<p><a href=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2025\/04\/Screenshot-2025-04-06-at-7.20.36\u202fPM.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-3518\" src=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2025\/04\/Screenshot-2025-04-06-at-7.20.36\u202fPM-1024x1022.png\" alt=\"\" width=\"412\" height=\"411\" srcset=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2025\/04\/Screenshot-2025-04-06-at-7.20.36\u202fPM-1024x1022.png 1024w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2025\/04\/Screenshot-2025-04-06-at-7.20.36\u202fPM-150x150.png 150w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2025\/04\/Screenshot-2025-04-06-at-7.20.36\u202fPM-300x300.png 300w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2025\/04\/Screenshot-2025-04-06-at-7.20.36\u202fPM-768x767.png 768w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2025\/04\/Screenshot-2025-04-06-at-7.20.36\u202fPM-624x623.png 624w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2025\/04\/Screenshot-2025-04-06-at-7.20.36\u202fPM.png 1146w\" sizes=\"auto, (max-width: 412px) 100vw, 412px\" \/><\/a><\/p>\n<p><em>Predictions<\/em><\/p>\n<p>We can plot the predictions and obtain the RMSE on our test set.<\/p>\n<pre class=\"brush: r; title: ; notranslate\" title=\"\">\r\ntrain_pred_y &lt;- predict(fit_mpg, xgb_train)\r\ntest_pred_y &lt;- predict(fit_mpg, xgb_test)\r\n\r\n## plot observed vs predicted mpg on the test set\r\ntest_x %&gt;%\r\n  as.data.frame() %&gt;%\r\n  bind_cols(mpg = test_y) %&gt;%\r\n  mutate(pred_mpg = test_pred_y) %&gt;%\r\n  ggplot(aes(x = pred_mpg, y = mpg)) +\r\n  geom_point() +\r\n  geom_abline()\r\n\r\n## test set RMSE\r\nsqrt(mean((test_y - test_pred_y)^2))\r\n<\/pre>\n<p><a 
href=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2025\/04\/Screenshot-2025-04-06-at-7.21.58\u202fPM.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-3519\" src=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2025\/04\/Screenshot-2025-04-06-at-7.21.58\u202fPM-1024x1024.png\" alt=\"\" width=\"475\" height=\"475\" srcset=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2025\/04\/Screenshot-2025-04-06-at-7.21.58\u202fPM-1024x1024.png 1024w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2025\/04\/Screenshot-2025-04-06-at-7.21.58\u202fPM-150x150.png 150w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2025\/04\/Screenshot-2025-04-06-at-7.21.58\u202fPM-300x300.png 300w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2025\/04\/Screenshot-2025-04-06-at-7.21.58\u202fPM-768x768.png 768w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2025\/04\/Screenshot-2025-04-06-at-7.21.58\u202fPM-624x624.png 624w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2025\/04\/Screenshot-2025-04-06-at-7.21.58\u202fPM.png 1146w\" sizes=\"auto, (max-width: 475px) 100vw, 475px\" \/><\/a><\/p>\n<p><span style=\"text-decoration: underline;\"><strong>{tidymodels} XGBOOST for regression<\/strong><\/span><\/p>\n<p>We can do the same type of XGBOOST model fitting in {<strong>tidymodels<\/strong>}.<\/p>\n<p><em>Train\/Test Split<\/em><\/p>\n<pre class=\"brush: r; title: ; notranslate\" title=\"\">\r\nset.seed(3333)\r\ncar_split &lt;- initial_split(mtcars)\r\ncar_split\r\n\r\ntrain &lt;- training(car_split)\r\ntest &lt;- testing(car_split)<\/pre>\n<p><em>Cross Validation Folds<\/em><\/p>\n<pre class=\"brush: r; title: ; notranslate\" title=\"\">\r\nset.seed(34)\r\ncv_folds &lt;- vfold_cv(\r\n  data = train, \r\n  v = 5\r\n  ) \r\n<\/pre>\n<p><em>Model Specification<\/em><\/p>\n<p>In the 
<em><strong>boost_tree()<\/strong><\/em> function we specify which hyperparameters we want to tune.<\/p>\n<pre class=\"brush: r; title: ; notranslate\" title=\"\">\r\nboost_spec &lt;- boost_tree( trees = tune(), \r\n      tree_depth = tune(), \r\n      min_n = tune(), \r\n      loss_reduction = tune(), \r\n      sample_size = tune(), \r\n      mtry = tune(), \r\n      learn_rate = tune() ) %&gt;%\r\n  set_engine(&quot;xgboost&quot;) %&gt;%\r\n  set_mode(&quot;regression&quot;)\r\n<\/pre>\n<p><em>Workflow<\/em><\/p>\n<p>Create the workflow, add the formula we want for the model, and add the model specifications.<\/p>\n<pre class=\"brush: r; title: ; notranslate\" title=\"\">\r\nxgb_wf &lt;- workflow() %&gt;%\r\n  add_formula(mpg ~ .) %&gt;%\r\n  add_model(boost_spec)\r\n<\/pre>\n<p><em>Set up tuning grid<\/em><\/p>\n<p>Instead of specifying specific values for each of the tuning parameters, as we did in the {<strong>xgboost<\/strong>} package example, we will use the <em><strong>grid_latin_hypercube()<\/strong><\/em> function to construct the parameter grid for us. 
We pass it a <strong>size<\/strong> argument to indicate how large (how many rows) we want the grid to be.<\/p>\n<pre class=\"brush: r; title: ; notranslate\" title=\"\">\r\nxgb_grid &lt;- grid_latin_hypercube(\r\n  tree_depth(),\r\n  trees(),\r\n  min_n(),\r\n  loss_reduction(),\r\n  sample_size = sample_prop(),\r\n  ## finalize mtry on the predictors only (drop the outcome, mpg)\r\n  finalize(mtry(), train %&gt;% select(-mpg)),\r\n  learn_rate(),\r\n  size = 40\r\n)\r\n<\/pre>\n<p><em>Hyperparameter tuning on cross-validated folds<\/em><\/p>\n<pre class=\"brush: r; title: ; notranslate\" title=\"\">\r\ndoParallel::registerDoParallel()\r\n\r\nset.seed(66)\r\nxgb_res &lt;- tune_grid(\r\n  xgb_wf,\r\n  resamples = cv_folds,\r\n  grid = xgb_grid,\r\n  control = control_grid(save_pred = TRUE)\r\n)\r\n<\/pre>\n<p><em>Obtain cross-validated outputs and identify the best model<\/em><\/p>\n<p><a href=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2025\/04\/Screenshot-2025-04-06-at-7.30.41\u202fPM.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-large wp-image-3520\" src=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2025\/04\/Screenshot-2025-04-06-at-7.30.41\u202fPM-1024x166.png\" alt=\"\" width=\"625\" height=\"101\" srcset=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2025\/04\/Screenshot-2025-04-06-at-7.30.41\u202fPM-1024x166.png 1024w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2025\/04\/Screenshot-2025-04-06-at-7.30.41\u202fPM-300x49.png 300w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2025\/04\/Screenshot-2025-04-06-at-7.30.41\u202fPM-768x125.png 768w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2025\/04\/Screenshot-2025-04-06-at-7.30.41\u202fPM-624x101.png 624w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2025\/04\/Screenshot-2025-04-06-at-7.30.41\u202fPM.png 1442w\" sizes=\"auto, (max-width: 625px) 100vw, 625px\" \/><\/a><br \/>\n<em>Fit the final 
model<\/em><\/p>\n<pre class=\"brush: r; title: ; notranslate\" title=\"\">\r\nboost_spec &lt;- boost_tree(mtry = best_xgb %&gt;% pull(mtry),\r\n                         trees = best_xgb %&gt;% pull(trees),\r\n                         min_n = best_xgb %&gt;% pull(min_n),\r\n                         tree_depth = best_xgb %&gt;% pull(tree_depth),\r\n                         learn_rate = best_xgb %&gt;% pull(learn_rate),\r\n                         loss_reduction = best_xgb %&gt;% pull(loss_reduction),\r\n                         sample_size = best_xgb %&gt;% pull(sample_size)) %&gt;%\r\n  set_engine(&quot;xgboost&quot;) %&gt;%\r\n  set_mode(&quot;regression&quot;)\r\n\r\n\r\nboost_fit &lt;- boost_spec %&gt;%\r\n  fit(mpg ~ ., data = train)\r\n<\/pre>\n<p><em>Evaluate Test Set Performance<\/em><\/p>\n<pre class=\"brush: r; title: ; notranslate\" title=\"\">\r\naugment(boost_fit,\r\n        new_data = test) %&gt;%\r\n  rmse(truth = mpg, estimate = .pred)\r\n\r\naugment(boost_fit,\r\n        new_data = test) %&gt;%\r\n  rsq(truth = mpg, estimate = .pred)\r\n\r\n\r\naugment(boost_fit,\r\n        new_data = test) %&gt;%\r\n  ggplot(aes(x = .pred, y = mpg)) +\r\n  geom_jitter() +\r\n  geom_abline(intercept = 0, slope = 1)\r\n<\/pre>\n<p><a href=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2025\/04\/Screenshot-2025-04-06-at-7.32.53\u202fPM.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-3521\" src=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2025\/04\/Screenshot-2025-04-06-at-7.32.53\u202fPM.png\" alt=\"\" width=\"375\" height=\"118\" srcset=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2025\/04\/Screenshot-2025-04-06-at-7.32.53\u202fPM.png 502w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2025\/04\/Screenshot-2025-04-06-at-7.32.53\u202fPM-300x94.png 300w\" sizes=\"auto, (max-width: 375px) 100vw, 375px\" \/><\/a> <a 
href=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2025\/04\/Screenshot-2025-04-06-at-7.33.05\u202fPM.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-3522\" src=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2025\/04\/Screenshot-2025-04-06-at-7.33.05\u202fPM-1024x1015.png\" alt=\"\" width=\"466\" height=\"462\" srcset=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2025\/04\/Screenshot-2025-04-06-at-7.33.05\u202fPM-1024x1015.png 1024w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2025\/04\/Screenshot-2025-04-06-at-7.33.05\u202fPM-150x150.png 150w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2025\/04\/Screenshot-2025-04-06-at-7.33.05\u202fPM-300x297.png 300w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2025\/04\/Screenshot-2025-04-06-at-7.33.05\u202fPM-768x761.png 768w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2025\/04\/Screenshot-2025-04-06-at-7.33.05\u202fPM-624x619.png 624w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2025\/04\/Screenshot-2025-04-06-at-7.33.05\u202fPM.png 1148w\" sizes=\"auto, (max-width: 466px) 100vw, 466px\" \/><\/a><\/p>\n<p><span style=\"text-decoration: underline;\"><strong>Wrapping up<\/strong><\/span><\/p>\n<p>XGBOOST is a flexible model that can be used for continuous outcomes, binary outcomes, and multi-class classification tasks. The model does have a number of tuning parameters that can help to optimize performance. 
However, it is relatively easy to create a grid of potential values and either write a <em><strong>for()<\/strong><\/em> loop or use the {<strong>tidymodels<\/strong>} framework to do the heavy lifting for you.<\/p>\n<p>To obtain the full code and RMarkdown file check out my <span style=\"color: #0000ff;\"><strong><a style=\"color: #0000ff;\" href=\"https:\/\/github.com\/pw2\/XGBOOST-Regression-Tutorial\">Github page<\/a><\/strong><\/span>.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>A colleague recently asked me about XGBOOST (Extreme Gradient Boosting) models so I figured I&#8217;d put together a short tutorial of using XGBOOST both with the {xgboost} package and within the {tidymodels} environment. Like all machine learning models, XGBOOST, has a number of different knobs and levers we can twist, push, and pull in order [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[47,43,42],"tags":[],"class_list":["post-3510","post","type-post","status-publish","format-standard","hentry","category-model-building-in-r","category-sports-analytics","category-sports-science"],"_links":{"self":[{"href":"https:\/\/optimumsportsperformance.com\/blog\/wp-json\/wp\/v2\/posts\/3510","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/optimumsportsperformance.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/optimumsportsperformance.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/optimumsportsperformance.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/optimumsportsperformance.com\/blog\/wp-json\/wp\/v2\/comments?post=3510"}],"version-history":[{"count":2,"href":"https:\/\/optimumsportsperformance.com\/blog\/wp-json\/wp\/v2\/posts\/3510\/revisions"}],"predecessor-version":[{"id":3524,"href":"https:\/\/optimumsportspe
rformance.com\/blog\/wp-json\/wp\/v2\/posts\/3510\/revisions\/3524"}],"wp:attachment":[{"href":"https:\/\/optimumsportsperformance.com\/blog\/wp-json\/wp\/v2\/media?parent=3510"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/optimumsportsperformance.com\/blog\/wp-json\/wp\/v2\/categories?post=3510"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/optimumsportsperformance.com\/blog\/wp-json\/wp\/v2\/tags?post=3510"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}