{"id":2991,"date":"2023-04-10T00:30:23","date_gmt":"2023-04-10T00:30:23","guid":{"rendered":"http:\/\/optimumsportsperformance.com\/blog\/?p=2991"},"modified":"2023-04-10T12:29:21","modified_gmt":"2023-04-10T12:29:21","slug":"tidymodels-workflow-sets-tutorial","status":"publish","type":"post","link":"https:\/\/optimumsportsperformance.com\/blog\/tidymodels-workflow-sets-tutorial\/","title":{"rendered":"Tidymodels Workflow Sets Tutorial"},"content":{"rendered":"<p><span style=\"text-decoration: underline;\"><strong>Intro<\/strong><\/span><\/p>\n<p>The purpose of workflow sets is to allow you to seamlessly fit multiple different models (and even tune them) simultaneously. This provides an efficient approach to the model building process, as the models can then be compared to each other to determine which model is the optimal model for deployment. Therefore, the aim of this tutorial is to provide a simple walk through of how to set up a <strong>workflow_set()<\/strong> and build multiple models simultaneously using the <strong>tidymodels<\/strong> framework.<\/p>\n<p>The full code (which will include code not directly embedded in this tutorial) is available on my <strong><span style=\"color: #0000ff;\"><a style=\"color: #0000ff;\" href=\"https:\/\/github.com\/pw2\/tidymodels_template\/blob\/main\/Tidymodels%20Workflow%20Sets%20Tutorial.Rmd\">GITHUB page<\/a><\/span><\/strong>.<\/p>\n<p><span style=\"text-decoration: underline;\"><strong>Load Packages &amp; Data<\/strong><\/span><\/p>\n<p>Data comes from the <span style=\"color: #0000ff;\"><strong><a style=\"color: #0000ff;\" href=\"https:\/\/github.com\/adror1\/nwslR\">nwslR package<\/a><\/strong><\/span>, which provides a lot of really nice National Women&#8217;s Soccer League data.<\/p>\n<p>We will be using stats for field players to determine those who received the <strong>Best XI<\/strong> award (there will only be 10 players per season since we are dealing with field player stats, no goalies).<\/p>\n<pre 
class=\"brush: r; title: ; notranslate\" title=\"\">\r\n## packages\r\nlibrary(tidyverse)\r\nlibrary(tidymodels)\r\nlibrary(nwslR)\r\nlibrary(tictoc)\r\n\r\ntheme_set(theme_light() +\r\n            theme(strip.background = element_rect(fill = &quot;black&quot;),\r\n                  strip.text = element_text(face = &quot;bold&quot;)))\r\n\r\n\r\n## data sets required\r\ndata(player)\r\ndata(fieldplayer_overall_season_stats)\r\ndata(award)\r\n\r\n## join all data sets to make a primary data set\r\nd &lt;- fieldplayer_overall_season_stats %&gt;%\r\n  left_join(player) %&gt;% \r\n  left_join(award) %&gt;% \r\n  select(-name_other) %&gt;% \r\n  mutate(best_11 = case_when(award == &quot;Best XI&quot; ~ 1,\r\n                             TRUE ~ 0)) %&gt;% \r\n  select(-award)\r\n\r\nd %&gt;% \r\n  head()\r\n<\/pre>\n<p><a href=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/04\/Screenshot-2023-04-09-at-4.23.57-PM.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-large wp-image-2992\" src=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/04\/Screenshot-2023-04-09-at-4.23.57-PM-1024x184.png\" alt=\"\" width=\"625\" height=\"112\" srcset=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/04\/Screenshot-2023-04-09-at-4.23.57-PM-1024x184.png 1024w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/04\/Screenshot-2023-04-09-at-4.23.57-PM-300x54.png 300w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/04\/Screenshot-2023-04-09-at-4.23.57-PM-768x138.png 768w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/04\/Screenshot-2023-04-09-at-4.23.57-PM-624x112.png 624w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/04\/Screenshot-2023-04-09-at-4.23.57-PM.png 1724w\" sizes=\"auto, (max-width: 625px) 100vw, 625px\" \/><\/a><br \/>\nOur features will be all of the play stats: 
<strong>mp<\/strong>, <strong>starts<\/strong>, <strong>min<\/strong>, <strong>gls<\/strong>, <strong>ast<\/strong>, <strong>pk<\/strong>, <strong>p_katt<\/strong> and the position (<strong>pos<\/strong>) that the player played.<\/p>\n<p><span style=\"text-decoration: underline;\"><strong>Exploratory Data Analysis<\/strong><\/span><\/p>\n<p>Let&#8217;s explore some of the variables that we will be modeling.<\/p>\n<p>How many NAs are there in the data set?<\/p>\n<p><a href=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/04\/Screenshot-2023-04-09-at-4.25.47-PM.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-2993\" src=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/04\/Screenshot-2023-04-09-at-4.25.47-PM.png\" alt=\"\" width=\"194\" height=\"457\" srcset=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/04\/Screenshot-2023-04-09-at-4.25.47-PM.png 264w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/04\/Screenshot-2023-04-09-at-4.25.47-PM-127x300.png 127w\" sizes=\"auto, (max-width: 194px) 100vw, 194px\" \/><\/a><\/p>\n<ul>\n<li>It looks like there are some players who have matches played (<strong>mp<\/strong>) and <strong>starts<\/strong> recorded, yet the number of minutes was not recorded. We will need to handle this in our pre-processing. The alternative approach would be to just remove those 79 players; however, I will add an imputation step in the <strong>recipe<\/strong> section of our model building process to show how it works.<\/li>\n<li>There are also a number of players that played in games but never attempted a penalty kick. 
We will set these columns to 0 (the median value).<\/li>\n<\/ul>\n<p>How many matches did those who have an NA for minutes play in?<\/p>\n<p><a href=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/04\/Screenshot-2023-04-09-at-4.27.57-PM.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-2994\" src=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/04\/Screenshot-2023-04-09-at-4.27.57-PM.png\" alt=\"\" width=\"438\" height=\"82\" srcset=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/04\/Screenshot-2023-04-09-at-4.27.57-PM.png 640w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/04\/Screenshot-2023-04-09-at-4.27.57-PM-300x56.png 300w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/04\/Screenshot-2023-04-09-at-4.27.57-PM-624x117.png 624w\" sizes=\"auto, (max-width: 438px) 100vw, 438px\" \/><\/a><br \/>\nLet&#8217;s get a look at the relationship between matches played, `mp`, and minutes played, `min`, to see if maybe we can impute the value for those who have NA.<\/p>\n<pre class=\"brush: r; title: ; notranslate\" title=\"\">\r\nfit_min &lt;- lm(min ~ mp, data = d)\r\nsummary(fit_min)\r\n\r\nplot(x = d$mp, \r\n     y = d$min,\r\n     main = &quot;Minutes Played ~ Matches Played&quot;,\r\n     xlab = &quot;Matches Played&quot;,\r\n     ylab = &quot;Minutes Played&quot;,\r\n     col = &quot;light grey&quot;,\r\n     pch = 19)\r\nabline(fit_min,\r\n       col = &quot;red&quot;,\r\n       lwd = 5,\r\n       lty = 2)\r\n<\/pre>\n<p><a href=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/04\/Screenshot-2023-04-09-at-4.30.28-PM.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-2995\" src=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/04\/Screenshot-2023-04-09-at-4.30.28-PM.png\" alt=\"\" width=\"498\" height=\"344\" 
srcset=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/04\/Screenshot-2023-04-09-at-4.30.28-PM.png 912w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/04\/Screenshot-2023-04-09-at-4.30.28-PM-300x207.png 300w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/04\/Screenshot-2023-04-09-at-4.30.28-PM-768x531.png 768w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/04\/Screenshot-2023-04-09-at-4.30.28-PM-624x431.png 624w\" sizes=\"auto, (max-width: 498px) 100vw, 498px\" \/><\/a> <a href=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/04\/Screenshot-2023-04-09-at-4.30.37-PM.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-2996\" src=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/04\/Screenshot-2023-04-09-at-4.30.37-PM-790x1024.png\" alt=\"\" width=\"474\" height=\"614\" srcset=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/04\/Screenshot-2023-04-09-at-4.30.37-PM-790x1024.png 790w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/04\/Screenshot-2023-04-09-at-4.30.37-PM-231x300.png 231w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/04\/Screenshot-2023-04-09-at-4.30.37-PM-768x996.png 768w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/04\/Screenshot-2023-04-09-at-4.30.37-PM-624x809.png 624w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/04\/Screenshot-2023-04-09-at-4.30.37-PM.png 990w\" sizes=\"auto, (max-width: 474px) 100vw, 474px\" \/><\/a><\/p>\n<ul>\n<li>There is a large amount of error in this model (residual standard error = 264) and the variance in the relationship appears to increase as matches played increases. This is all we have in this data set to really go on. 
It is probably best to figure out why no minutes were recorded for those players or see if there are other features in a different data set that can help us out. For now, we will stick with this simple model and use it in our model `recipe` below.<\/li>\n<\/ul>\n<p>Plot the density of the continuous predictor variables based on the `best_11` award.<\/p>\n<pre class=\"brush: r; title: ; notranslate\" title=\"\">\r\nd %&gt;%\r\n  select(mp:p_katt, best_11) %&gt;%\r\n  pivot_longer(cols = -best_11) %&gt;%\r\n  ggplot(aes(x = value, fill = as.factor(best_11))) +\r\n  geom_density(alpha = 0.6) +\r\n  facet_wrap(~name, scales = &quot;free&quot;) +\r\n  labs(x = &quot;Value&quot;,\r\n       y = &quot;Density&quot;,\r\n       title = &quot;Distribution of variables relative to Best XI designation&quot;,\r\n       subtitle = &quot;NOTE: axes are specific to the value in question&quot;)\r\n<\/pre>\n<p><a href=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/04\/Screenshot-2023-04-09-at-4.32.46-PM.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-2997\" src=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/04\/Screenshot-2023-04-09-at-4.32.46-PM-1024x876.png\" alt=\"\" width=\"475\" height=\"407\" srcset=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/04\/Screenshot-2023-04-09-at-4.32.46-PM-1024x876.png 1024w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/04\/Screenshot-2023-04-09-at-4.32.46-PM-300x256.png 300w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/04\/Screenshot-2023-04-09-at-4.32.46-PM-768x657.png 768w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/04\/Screenshot-2023-04-09-at-4.32.46-PM-624x534.png 624w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/04\/Screenshot-2023-04-09-at-4.32.46-PM.png 1462w\" sizes=\"auto, (max-width: 475px) 100vw, 475px\" 
\/><\/a><\/p>\n<p>How many field positions are there?<\/p>\n<p><a href=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/04\/Screenshot-2023-04-09-at-4.33.19-PM.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-2998\" src=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/04\/Screenshot-2023-04-09-at-4.33.19-PM.png\" alt=\"\" width=\"448\" height=\"208\" srcset=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/04\/Screenshot-2023-04-09-at-4.33.19-PM.png 896w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/04\/Screenshot-2023-04-09-at-4.33.19-PM-300x139.png 300w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/04\/Screenshot-2023-04-09-at-4.33.19-PM-768x357.png 768w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/04\/Screenshot-2023-04-09-at-4.33.19-PM-624x290.png 624w\" sizes=\"auto, (max-width: 448px) 100vw, 448px\" \/><\/a><\/p>\n<p>Some players appear to play multiple positions. Maybe they are more versatile? 
Have players with position versatility won more Best XI awards?<\/p>\n<p><a href=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/04\/Screenshot-2023-04-09-at-4.34.04-PM.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-2999\" src=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/04\/Screenshot-2023-04-09-at-4.34.04-PM.png\" alt=\"\" width=\"474\" height=\"295\" srcset=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/04\/Screenshot-2023-04-09-at-4.34.04-PM.png 896w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/04\/Screenshot-2023-04-09-at-4.34.04-PM-300x187.png 300w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/04\/Screenshot-2023-04-09-at-4.34.04-PM-768x478.png 768w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/04\/Screenshot-2023-04-09-at-4.34.04-PM-624x389.png 624w\" sizes=\"auto, (max-width: 474px) 100vw, 474px\" \/><\/a><\/p>\n<p><span style=\"text-decoration: underline;\"><strong>Data Splitting<\/strong><\/span><\/p>\n<p>First, I&#8217;ll create a data set of just the predictors and outcome variables (and get rid of the other variables in the data that we won&#8217;t be using). 
I&#8217;ll also convert our binary outcome variable from a number to a factor, for model fitting purposes.<\/p>\n<p><a href=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/04\/Screenshot-2023-04-09-at-4.36.31-PM.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-3000\" src=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/04\/Screenshot-2023-04-09-at-4.36.31-PM.png\" alt=\"\" width=\"485\" height=\"212\" srcset=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/04\/Screenshot-2023-04-09-at-4.36.31-PM.png 878w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/04\/Screenshot-2023-04-09-at-4.36.31-PM-300x131.png 300w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/04\/Screenshot-2023-04-09-at-4.36.31-PM-768x336.png 768w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/04\/Screenshot-2023-04-09-at-4.36.31-PM-624x273.png 624w\" sizes=\"auto, (max-width: 485px) 100vw, 485px\" \/><\/a><\/p>\n<p>Split the data into train\/test splits.<\/p>\n<pre class=\"brush: r; title: ; notranslate\" title=\"\">\r\n## Train\/Test\r\nset.seed(398)\r\ninit_split &lt;- initial_split(d_model, prop = 0.7, strata = &quot;best_11&quot;)\r\n\r\ntrain &lt;- training(init_split)\r\ntest &lt;- testing(init_split)\r\n<\/pre>\n<p>Further split the training set into 5 cross validation folds.<\/p>\n<pre class=\"brush: r; title: ; notranslate\" title=\"\">\r\n## Cross Validation Split of Training Data\r\nset.seed(764)\r\ncv_folds &lt;- vfold_cv(\r\n  data = train, \r\n  v = 5\r\n  ) \r\n<\/pre>\n<p><span style=\"text-decoration: underline;\"><strong><br \/>\nPrepare the data with a recipe<\/strong><\/span><\/p>\n<p>Recipes help us set up the data for modeling purposes. It is here that we can handle missing values, scale\/normalize our features, and create dummy variables. 
More importantly, creating the recipe ensures that, if we deploy our model for future predictions, the steps in the data preparation process will be consistent and standardized with what we did when we fit the model.<\/p>\n<p>You can find all of the <strong>recipe<\/strong> options <strong><span style=\"color: #0000ff;\"><a style=\"color: #0000ff;\" href=\"https:\/\/recipes.tidymodels.org\/reference\/index.html#step-functions-imputation\">HERE<\/a><\/span><\/strong>.<\/p>\n<p>The pre-processing steps we will use are:<\/p>\n<ul>\n<li>Impute any NA minutes, `min`, using the `mp` variable.<\/li>\n<li>Create one hot encoded dummy variables for the player&#8217;s position<\/li>\n<li>Impute the median when penalty kicks made (`pk`), penalty kicks attempted (`p_katt`), or assists (`ast`) are NA (the median is 0 for the penalty kick columns)<\/li>\n<li>Normalize the numeric data to have a mean of 0 and standard deviation of 1<\/li>\n<\/ul>\n<pre class=\"brush: r; title: ; notranslate\" title=\"\">\r\nnwsl_rec &lt;- recipe(best_11 ~ ., data = train) %&gt;%\r\n  step_impute_linear(min, impute_with = imp_vars(mp)) %&gt;%\r\n  step_dummy(pos, one_hot = TRUE) %&gt;%\r\n  step_impute_median(pk, p_katt, ast) %&gt;%\r\n  step_normalize(mp:p_katt)\r\n\r\nnwsl_rec\r\n<\/pre>\n<p>Here is what the pre-processed training set looks like when we apply this recipe:<\/p>\n<p><a href=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/04\/Screenshot-2023-04-09-at-4.43.05-PM.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-large wp-image-3001\" src=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/04\/Screenshot-2023-04-09-at-4.43.05-PM-1024x297.png\" alt=\"\" width=\"625\" height=\"181\" srcset=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/04\/Screenshot-2023-04-09-at-4.43.05-PM-1024x297.png 1024w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/04\/Screenshot-2023-04-09-at-4.43.05-PM-300x87.png 300w, 
https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/04\/Screenshot-2023-04-09-at-4.43.05-PM-768x223.png 768w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/04\/Screenshot-2023-04-09-at-4.43.05-PM-624x181.png 624w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/04\/Screenshot-2023-04-09-at-4.43.05-PM.png 1670w\" sizes=\"auto, (max-width: 625px) 100vw, 625px\" \/><\/a><\/p>\n<p><span style=\"text-decoration: underline;\"><strong>Specifying the models<\/strong><\/span><\/p>\n<p>We will fit three models at once:<\/p>\n<ol>\n<li>Random Forest<\/li>\n<li>XGBoost<\/li>\n<li>K-Nearest Neighbor<\/li>\n<\/ol>\n<pre class=\"brush: r; title: ; notranslate\" title=\"\">\r\n## Random forest\r\nrf_model &lt;- rand_forest(mtry = tune(), trees = tune()) %&gt;%\r\n  set_mode(&quot;classification&quot;) %&gt;%\r\n  set_engine(&quot;randomForest&quot;, importance = TRUE)\r\n\r\n## XGBoost\r\nxgb_model &lt;- boost_tree(trees = tune(), mtry = tune(), tree_depth = tune(), learn_rate = .01) %&gt;%\r\n  set_mode(&quot;classification&quot;) %&gt;% \r\n  set_engine(&quot;xgboost&quot;, importance = TRUE)\r\n\r\n## K-Nearest Neighbor\r\nknn_model &lt;- nearest_neighbor(neighbors = 4) %&gt;%\r\n  set_mode(&quot;classification&quot;)\r\n<\/pre>\n<p><span style=\"text-decoration: underline;\"><strong>Workflow Set<\/strong><\/span><\/p>\n<p>We are now ready to combine the pre-processing recipe and the three models together in a <strong>workflow_set()<\/strong>.<\/p>\n<pre class=\"brush: r; title: ; notranslate\" title=\"\">\r\nnwsl_wf &lt;- workflow_set(\r\n  preproc = list(nwsl_rec),\r\n  models = list(rf_model, xgb_model, knn_model),\r\n  cross = TRUE\r\n  )\r\n\r\nnwsl_wf\r\n<\/pre>\n<p><a href=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/04\/Screenshot-2023-04-09-at-4.46.42-PM.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-3002\" 
src=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/04\/Screenshot-2023-04-09-at-4.46.42-PM.png\" alt=\"\" width=\"934\" height=\"254\" srcset=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/04\/Screenshot-2023-04-09-at-4.46.42-PM.png 934w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/04\/Screenshot-2023-04-09-at-4.46.42-PM-300x82.png 300w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/04\/Screenshot-2023-04-09-at-4.46.42-PM-768x209.png 768w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/04\/Screenshot-2023-04-09-at-4.46.42-PM-624x170.png 624w\" sizes=\"auto, (max-width: 934px) 100vw, 934px\" \/><\/a><\/p>\n<p><span style=\"text-decoration: underline;\"><strong>Tune &amp; fit the 3 workflows<\/strong><\/span><\/p>\n<p>Once the models are set up we use <strong>workflow_map()<\/strong> to fit each workflow to the cross-validated folds we created. We set up a few tuning parameters for the Random Forest and XGBOOST models so that during the fitting process we can determine which parameter pairings optimize model performance.<\/p>\n<p>I also use the &#8216;tic()&#8217; and &#8216;toc()&#8217; functions from the <strong>tictoc<\/strong> package to determine the length of time it takes the models to fit, in case there are potential opportunities to optimize the fitting process.<\/p>\n<pre class=\"brush: r; title: ; notranslate\" title=\"\">\r\ndoParallel::registerDoParallel(cores = 10)\r\n\r\ntic()\r\n\r\nfit_wf &lt;- nwsl_wf %&gt;%  \r\n  workflow_map(\r\n    seed = 44, \r\n    fn = &quot;tune_grid&quot;,\r\n    grid = 10,           ## parameters to pass to tune grid\r\n    resamples = cv_folds\r\n  )\r\n\r\ntoc()\r\n\r\n# Took 1.6 minutes to fit\r\n\r\ndoParallel::stopImplicitCluster()\r\n\r\nfit_wf\r\n<\/pre>\n<p><a 
href=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/04\/Screenshot-2023-04-09-at-4.51.35-PM.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-3003\" src=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/04\/Screenshot-2023-04-09-at-4.51.35-PM.png\" alt=\"\" width=\"530\" height=\"151\" srcset=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/04\/Screenshot-2023-04-09-at-4.51.35-PM.png 898w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/04\/Screenshot-2023-04-09-at-4.51.35-PM-300x86.png 300w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/04\/Screenshot-2023-04-09-at-4.51.35-PM-768x219.png 768w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/04\/Screenshot-2023-04-09-at-4.51.35-PM-624x178.png 624w\" sizes=\"auto, (max-width: 530px) 100vw, 530px\" \/><\/a><br \/>\n<span style=\"text-decoration: underline;\"><strong>Evaluate each model&#8217;s performance on the train set<\/strong><\/span><\/p>\n<p>We can plot the model predictions across the range of models we fit using <strong>autoplot()<\/strong>, get a summary of the model predictions with the <strong>collect_metrics()<\/strong> function, and rank the results of the model using <strong>rank_results()<\/strong>.<\/p>\n<p>&nbsp;<\/p>\n<pre class=\"brush: r; title: ; notranslate\" title=\"\">\r\n## plot each of the model's performance and ROC\r\nautoplot(fit_wf)\r\n\r\n## Look at the model metrics for each of the models\r\ncollect_metrics(fit_wf) \r\n\r\n## Rank the results based on model accuracy\r\nrank_results(fit_wf, rank_metric = &quot;accuracy&quot;, select_best = TRUE)\r\n<\/pre>\n<p><a href=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/04\/Screenshot-2023-04-09-at-5.03.28-PM.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-large wp-image-3004\" 
src=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/04\/Screenshot-2023-04-09-at-5.03.28-PM-1024x733.png\" alt=\"\" width=\"625\" height=\"447\" srcset=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/04\/Screenshot-2023-04-09-at-5.03.28-PM-1024x733.png 1024w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/04\/Screenshot-2023-04-09-at-5.03.28-PM-300x215.png 300w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/04\/Screenshot-2023-04-09-at-5.03.28-PM-768x550.png 768w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/04\/Screenshot-2023-04-09-at-5.03.28-PM-624x447.png 624w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/04\/Screenshot-2023-04-09-at-5.03.28-PM.png 1466w\" sizes=\"auto, (max-width: 625px) 100vw, 625px\" \/><\/a><br \/>\n<a href=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/04\/Screenshot-2023-04-09-at-5.03.44-PM.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-large wp-image-3005\" src=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/04\/Screenshot-2023-04-09-at-5.03.44-PM-1024x343.png\" alt=\"\" width=\"625\" height=\"209\" srcset=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/04\/Screenshot-2023-04-09-at-5.03.44-PM-1024x343.png 1024w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/04\/Screenshot-2023-04-09-at-5.03.44-PM-300x100.png 300w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/04\/Screenshot-2023-04-09-at-5.03.44-PM-768x257.png 768w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/04\/Screenshot-2023-04-09-at-5.03.44-PM-624x209.png 624w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/04\/Screenshot-2023-04-09-at-5.03.44-PM.png 1530w\" sizes=\"auto, (max-width: 625px) 100vw, 625px\" 
\/><\/a><\/p>\n<p>We see that the Random Forest models outperformed the XGBOOST and KNN models.<\/p>\n<p><span style=\"text-decoration: underline;\"><strong>Extract the model with the best performance<\/strong><\/span><\/p>\n<p>Now that we know that the Random Forest performed the best, we will grab the model ID for the Random Forest models and their corresponding workflow.<\/p>\n<pre class=\"brush: r; title: ; notranslate\" title=\"\">\r\n## get the workflow ID for the best model\r\nbest_model_id &lt;- fit_wf %&gt;% \r\n  rank_results(\r\n    rank_metric = &quot;accuracy&quot;,\r\n    select_best = TRUE\r\n  ) %&gt;% \r\n  head(1) %&gt;% \r\n  pull(wflow_id)\r\n\r\nbest_model_id\r\n\r\n## Extract the workflow for the best model\r\nbest_model &lt;- extract_workflow(fit_wf, id = best_model_id)\r\nbest_model\r\n<\/pre>\n<p><span style=\"text-decoration: underline;\"><strong>Extract the tuned results from workflow of the best model<\/strong><\/span><\/p>\n<p>We know the best model was the Random Forest model so we can use the <strong>best_model_id<\/strong> to get all of the Random Forest models out and look at how each one did during the tuning process.<\/p>\n<p>First we extract the Random Forest models.<\/p>\n<pre class=\"brush: r; title: ; notranslate\" title=\"\">\r\n## extract the Random Forest models\r\nbest_workflow &lt;- fit_wf&#x5B;fit_wf$wflow_id == best_model_id,\r\n                               &quot;result&quot;]&#x5B;&#x5B;1]]&#x5B;&#x5B;1]]\r\n\r\nbest_workflow\r\n<\/pre>\n<p>With the <strong>collect_metrics()<\/strong> function we can see the combinations of <strong>mtry<\/strong> and <strong>trees<\/strong> (the two parameters we tuned for the Random Forest) that were evaluated in the tuning process. 
We can also use <strong>select_best()<\/strong> to get the tuning parameters of the best-performing Random Forest model.<\/p>\n<pre class=\"brush: r; title: ; notranslate\" title=\"\">\r\ncollect_metrics(best_workflow)\r\nselect_best(best_workflow, metric = &quot;accuracy&quot;)\r\n<\/pre>\n<p><a href=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/04\/Screenshot-2023-04-09-at-5.10.34-PM.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-3006\" src=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/04\/Screenshot-2023-04-09-at-5.10.34-PM.png\" alt=\"\" width=\"484\" height=\"142\" srcset=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/04\/Screenshot-2023-04-09-at-5.10.34-PM.png 600w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/04\/Screenshot-2023-04-09-at-5.10.34-PM-300x88.png 300w\" sizes=\"auto, (max-width: 484px) 100vw, 484px\" \/><\/a><\/p>\n<p><span style=\"text-decoration: underline;\"><strong>Fit the final model<\/strong><\/span><\/p>\n<p>We saw above that the best model had the following tuning parameters:<\/p>\n<ul>\n<li><strong>mtry<\/strong> = 1<\/li>\n<li><strong>trees<\/strong> = 944<\/li>\n<\/ul>\n<p>We can extract this optimized workflow using the <strong>finalize_workflow()<\/strong> function and then fit that final workflow to the initial training split data.<\/p>\n<pre class=\"brush: r; title: ; notranslate\" title=\"\">\r\n## get the finalized workflow\r\nfinal_wf &lt;- finalize_workflow(best_model, select_best(best_workflow, metric = &quot;accuracy&quot;))\r\nfinal_wf\r\n\r\n## fit the final workflow to the initial data split\r\ndoParallel::registerDoParallel(cores = 8)\r\n\r\nfinal_fit &lt;- final_wf %&gt;% \r\n  last_fit(\r\n    split = init_split\r\n  )\r\n\r\ndoParallel::stopImplicitCluster()\r\n\r\nfinal_fit\r\n<\/pre>\n<p><a 
href=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/04\/Screenshot-2023-04-09-at-5.12.44-PM.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-3007\" src=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/04\/Screenshot-2023-04-09-at-5.12.44-PM.png\" alt=\"\" width=\"423\" height=\"416\" srcset=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/04\/Screenshot-2023-04-09-at-5.12.44-PM.png 838w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/04\/Screenshot-2023-04-09-at-5.12.44-PM-300x295.png 300w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/04\/Screenshot-2023-04-09-at-5.12.44-PM-768x755.png 768w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/04\/Screenshot-2023-04-09-at-5.12.44-PM-624x614.png 624w\" sizes=\"auto, (max-width: 423px) 100vw, 423px\" \/><\/a><\/p>\n<p><span style=\"text-decoration: underline;\"><strong>Extract Predictions on Test Data and evaluate model<\/strong><\/span><\/p>\n<p>First we can evaluate the variable importance plot for the random forest model.<\/p>\n<pre class=\"brush: r; title: ; notranslate\" title=\"\">\r\nlibrary(vip)\r\n\r\nfinal_fit %&gt;%\r\n  extract_fit_parsnip() %&gt;%\r\n  vip(geom = &quot;col&quot;,\r\n      aesthetics = list(\r\n              color = &quot;black&quot;,\r\n              fill = &quot;palegreen&quot;,\r\n              alpha = 0.5)) +\r\n  theme_classic()\r\n<\/pre>\n<p><a href=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/04\/Screenshot-2023-04-09-at-5.27.55-PM.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-3010\" src=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/04\/Screenshot-2023-04-09-at-5.27.55-PM-1024x858.png\" alt=\"\" width=\"508\" height=\"426\" 
srcset=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/04\/Screenshot-2023-04-09-at-5.27.55-PM-1024x858.png 1024w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/04\/Screenshot-2023-04-09-at-5.27.55-PM-300x251.png 300w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/04\/Screenshot-2023-04-09-at-5.27.55-PM-768x643.png 768w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/04\/Screenshot-2023-04-09-at-5.27.55-PM-624x523.png 624w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/04\/Screenshot-2023-04-09-at-5.27.55-PM.png 1036w\" sizes=\"auto, (max-width: 508px) 100vw, 508px\" \/><\/a><\/p>\n<p>Next we will look at the accuracy and ROC on the test set by using the <strong>collect_metrics()<\/strong> function on the <strong>final_fit<\/strong>. Additionally, if we use the <strong>collect_predictions()<\/strong> function we will get the predicted class and predicted probabilities for each row of the test set.<\/p>\n<pre class=\"brush: r; title: ; notranslate\" title=\"\">\r\n## Look at the accuracy and ROC on the test data\r\nfinal_fit %&gt;% \r\n  collect_metrics()\r\n\r\n## Get the model predictions on the test data\r\nfit_test &lt;- final_fit %&gt;% \r\n  collect_predictions()\r\n\r\nfit_test %&gt;%\r\n  head()\r\n<\/pre>\n<p><a href=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/04\/Screenshot-2023-04-09-at-5.14.42-PM.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-large wp-image-3008\" src=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/04\/Screenshot-2023-04-09-at-5.14.42-PM-1024x315.png\" alt=\"\" width=\"625\" height=\"192\" srcset=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/04\/Screenshot-2023-04-09-at-5.14.42-PM-1024x315.png 1024w, 
https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/04\/Screenshot-2023-04-09-at-5.14.42-PM-300x92.png 300w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/04\/Screenshot-2023-04-09-at-5.14.42-PM-768x236.png 768w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/04\/Screenshot-2023-04-09-at-5.14.42-PM-624x192.png 624w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/04\/Screenshot-2023-04-09-at-5.14.42-PM.png 1210w\" sizes=\"auto, (max-width: 625px) 100vw, 625px\" \/><\/a><\/p>\n<p>Next, create a confusion matrix of the class of interest, <strong>best_11<\/strong>, and our predicted class, <strong>.pred_class<\/strong>.<\/p>\n<pre class=\"brush: r; title: ; notranslate\" title=\"\">\r\nfit_test %&gt;% \r\n  count(.pred_class, best_11)\r\n\r\ntable(fit_test$best_11, fit_test$.pred_class)\r\n<\/pre>\n<p><a href=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/04\/Screenshot-2023-04-09-at-5.16.08-PM.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-3009\" src=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/04\/Screenshot-2023-04-09-at-5.16.08-PM.png\" alt=\"\" width=\"553\" height=\"147\" srcset=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/04\/Screenshot-2023-04-09-at-5.16.08-PM.png 738w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/04\/Screenshot-2023-04-09-at-5.16.08-PM-300x80.png 300w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/04\/Screenshot-2023-04-09-at-5.16.08-PM-624x166.png 624w\" sizes=\"auto, (max-width: 553px) 100vw, 553px\" \/><\/a><\/p>\n<p>We see that the model never actually predicted a person to be in class 1, the class indicating that they would be ranked as one of the Best XI for a given season. 
We have such substantial class imbalance that the model can basically guess that no one will win Best XI and still end up with high accuracy.<\/p>\n<p>The predicted class for a binary prediction comes from a default threshold of 0.5, meaning that the predicted probability of being one of the Best XI needs to exceed 50% in order for that class to be predicted. This might be a bit high\/extreme for our data! Additionally, in many instances we may not care so much about a specific predicted class and instead just want to understand the probability of belonging to one class or another.<\/p>\n<p>Let&#8217;s plot the distribution of Best XI predicted probabilities colored by whether the individual was actually one of the Best XI players.<\/p>\n<pre class=\"brush: r; title: ; notranslate\" title=\"\">\r\nfit_test %&gt;%\r\n  ggplot(aes(x = .pred_1, fill = best_11)) +\r\n  geom_density(alpha = 0.6)\r\n<\/pre>\n<p>We can see that those who were actually given the Best XI designation had a higher predicted probability of being Best XI, just not high enough to exceed the 0.5 default threshold. What if we set the threshold for being classified as Best XI at 0.08?<\/p>\n<pre class=\"brush: r; title: ; notranslate\" title=\"\">\r\nfit_test %&gt;%\r\n  mutate(pred_best_11_v2 = ifelse(.pred_1 &gt; 0.08, 1, 0)) %&gt;%\r\n  count(pred_best_11_v2, best_11)\r\n<\/pre>\n<p><span style=\"text-decoration: underline;\"><strong>Wrapping Up<\/strong><\/span><\/p>\n<p>In the final code output above we see that there are 20 total instances where the model predicted the individual would be a Best XI player. In some of those instances the model correctly identified one of the Best XI, and in others the prediction was a false positive (the model thought the person had a Best XI season but it was incorrect). There is a lot to unpack here. Binary thresholds like this can often be messy, because the predicted class becomes unreliable as the predicted probability gets close to the threshold line. 
Additionally, changing the threshold line will change the classification outcome. Where to set it depends on your tolerance for the risk of committing a Type I error (a false positive) or a Type II error (a false negative), which may in turn depend on the goal of your model, among other things. Finally, we often care more about the probability of being in one class or another than about a specific class outcome. All of these things need to be thought through, and they are outside the scope of this tutorial, which had the aim of simply walking through how to set up a <strong>workflow_set()<\/strong> and fit multiple models simultaneously. Perhaps a future tutorial can cover such matters in more depth.<\/p>\n<p>The complete code for this tutorial is available on my <strong><span style=\"color: #0000ff;\"><a style=\"color: #0000ff;\" href=\"https:\/\/github.com\/pw2\/tidymodels_template\/blob\/main\/Tidymodels%20Workflow%20Sets%20Tutorial.Rmd\">GITHUB page<\/a><\/span><\/strong>.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Intro The purpose of workflow sets is to allow you to seamlessly fit multiple different models (and even tune them) simultaneously. This provides an efficient approach to the model building process as the models can then be compared to each other to determine which model is the optimal model for deployment. 
Therefore, the aim of [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[47],"tags":[],"class_list":["post-2991","post","type-post","status-publish","format-standard","hentry","category-model-building-in-r"],"_links":{"self":[{"href":"https:\/\/optimumsportsperformance.com\/blog\/wp-json\/wp\/v2\/posts\/2991","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/optimumsportsperformance.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/optimumsportsperformance.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/optimumsportsperformance.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/optimumsportsperformance.com\/blog\/wp-json\/wp\/v2\/comments?post=2991"}],"version-history":[{"count":3,"href":"https:\/\/optimumsportsperformance.com\/blog\/wp-json\/wp\/v2\/posts\/2991\/revisions"}],"predecessor-version":[{"id":3013,"href":"https:\/\/optimumsportsperformance.com\/blog\/wp-json\/wp\/v2\/posts\/2991\/revisions\/3013"}],"wp:attachment":[{"href":"https:\/\/optimumsportsperformance.com\/blog\/wp-json\/wp\/v2\/media?parent=2991"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/optimumsportsperformance.com\/blog\/wp-json\/wp\/v2\/categories?post=2991"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/optimumsportsperformance.com\/blog\/wp-json\/wp\/v2\/tags?post=2991"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}