{"id":3225,"date":"2023-09-03T22:35:34","date_gmt":"2023-09-03T22:35:34","guid":{"rendered":"http:\/\/optimumsportsperformance.com\/blog\/?p=3225"},"modified":"2023-09-03T22:35:34","modified_gmt":"2023-09-03T22:35:34","slug":"comparing-tidymodels-in-r-to-scikit-learn-in-python","status":"publish","type":"post","link":"https:\/\/optimumsportsperformance.com\/blog\/comparing-tidymodels-in-r-to-scikit-learn-in-python\/","title":{"rendered":"Comparing Tidymodels in R to Scikit Learn in Python"},"content":{"rendered":"<p>I previously wrote a blog post providing <strong><span style=\"color: #0000ff;\"><a style=\"color: #0000ff;\" href=\"https:\/\/optimumsportsperformance.com\/blog\/from-tidyverse-to-python\/\">side-by-side comparisons of R&#8217;s {tidyverse} to Python<\/a><\/span><\/strong> for various data manipulation techniques. I&#8217;ve also talked extensively about using {tidymodels} for model fitting (see <strong><span style=\"color: #0000ff;\"><a style=\"color: #0000ff;\" href=\"https:\/\/optimumsportsperformance.com\/blog\/tidymodels-model-fitting-template\/\">HERE<\/a><\/span><\/strong> and <strong><span style=\"color: #0000ff;\"><a style=\"color: #0000ff;\" href=\"https:\/\/optimumsportsperformance.com\/blog\/tidymodels-workflowsets-tutorials\/\">HERE<\/a><\/span><\/strong>). 
Today, we will work through a tutorial on how to fit the same random forest model in {tidymodels} and Scikit Learn.<\/p>\n<p>This will be a side-by-side view of both coding languages. The tutorial will cover:<\/p>\n<ul>\n<li>Loading the data<\/li>\n<li>Basic exploratory data analysis<\/li>\n<li>Creating a train\/test split<\/li>\n<li>Hyperparameter tuning by creating cross-validated folds on the training data<\/li>\n<li>Identifying the optimal hyperparameters and fitting the final model<\/li>\n<li>Applying the final model to the test data and evaluating model performance<\/li>\n<li>Saving the model for downstream use<\/li>\n<li>Loading the saved model and applying it to new data<\/li>\n<\/ul>\n<p>To get the full code for each language and follow along with the tutorial, visit my <span style=\"color: #0000ff;\"><strong><a style=\"color: #0000ff;\" href=\"https:\/\/github.com\/pw2\/compare_tidymodels_to_scikit_learn\">GITHUB page<\/a><\/strong><\/span>.<\/p>\n<p><span style=\"text-decoration: underline;\"><strong>The Data<\/strong><\/span><\/p>\n<p>The data comes from the <strong><span style=\"color: #0000ff;\"><a style=\"color: #0000ff;\" href=\"https:\/\/github.com\/rfordatascience\/tidytuesday\/blob\/master\/data\/2023\/2023-04-04\/readme.md\"><em><span style=\"color: #0000ff;\">tidytuesday<\/span> <\/em>project from 4\/4\/2023<\/a><\/span><\/strong>. 
The data set is Premier League match data (2021 &#8211; 2022) that provides a series of features with the goal of predicting the final result (Full Time Result, FTR) as to whether the home team won, the away team won, or the match resulted in a draw.<\/p>\n<p><span style=\"text-decoration: underline;\"><strong>Load Data &amp; Packages<\/strong><\/span><\/p>\n<p>First, we load the data directly from the <em><strong>tidytuesday<\/strong><\/em> website in both languages.<\/p>\n<p><a href=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/09\/Screenshot-2023-09-03-at-2.08.11-PM.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-3226\" src=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/09\/Screenshot-2023-09-03-at-2.08.11-PM.png\" alt=\"\" width=\"762\" height=\"628\" srcset=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/09\/Screenshot-2023-09-03-at-2.08.11-PM.png 762w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/09\/Screenshot-2023-09-03-at-2.08.11-PM-300x247.png 300w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/09\/Screenshot-2023-09-03-at-2.08.11-PM-624x514.png 624w\" sizes=\"auto, (max-width: 762px) 100vw, 762px\" \/><\/a><\/p>\n<p><span style=\"text-decoration: underline;\"><strong>Exploratory Data Analysis<\/strong><\/span><\/p>\n<p>Next, we perform some exploratory data analysis to understand the potential features for our model.<\/p>\n<ul>\n<li>Check each column for NAs<\/li>\n<li>Plot a count of the outcome variable across the three levels (H = home team wins, A = away team wins, D = draw)<\/li>\n<li>Select a few features for our model and then create box plots for each feature relative to the 3 levels of our outcome variable<\/li>\n<\/ul>\n<p><a href=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/09\/Screenshot-2023-09-03-at-2.15.58-PM.png\"><img loading=\"lazy\" 
decoding=\"async\" class=\"aligncenter size-full wp-image-3227\" src=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/09\/Screenshot-2023-09-03-at-2.15.58-PM.png\" alt=\"\" width=\"806\" height=\"471\" srcset=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/09\/Screenshot-2023-09-03-at-2.15.58-PM.png 806w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/09\/Screenshot-2023-09-03-at-2.15.58-PM-300x175.png 300w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/09\/Screenshot-2023-09-03-at-2.15.58-PM-768x449.png 768w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/09\/Screenshot-2023-09-03-at-2.15.58-PM-624x365.png 624w\" sizes=\"auto, (max-width: 806px) 100vw, 806px\" \/><\/a> <a href=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/09\/Screenshot-2023-09-03-at-2.16.04-PM.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-large wp-image-3228\" src=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/09\/Screenshot-2023-09-03-at-2.16.04-PM-1024x429.png\" alt=\"\" width=\"625\" height=\"262\" srcset=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/09\/Screenshot-2023-09-03-at-2.16.04-PM-1024x429.png 1024w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/09\/Screenshot-2023-09-03-at-2.16.04-PM-300x126.png 300w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/09\/Screenshot-2023-09-03-at-2.16.04-PM-768x322.png 768w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/09\/Screenshot-2023-09-03-at-2.16.04-PM-624x262.png 624w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/09\/Screenshot-2023-09-03-at-2.16.04-PM.png 1069w\" sizes=\"auto, (max-width: 625px) 100vw, 625px\" \/><\/a> <a 
href=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/09\/Screenshot-2023-09-03-at-2.16.17-PM.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-large wp-image-3229\" src=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/09\/Screenshot-2023-09-03-at-2.16.17-PM-1024x497.png\" alt=\"\" width=\"625\" height=\"303\" srcset=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/09\/Screenshot-2023-09-03-at-2.16.17-PM-1024x497.png 1024w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/09\/Screenshot-2023-09-03-at-2.16.17-PM-300x146.png 300w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/09\/Screenshot-2023-09-03-at-2.16.17-PM-768x373.png 768w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/09\/Screenshot-2023-09-03-at-2.16.17-PM-624x303.png 624w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/09\/Screenshot-2023-09-03-at-2.16.17-PM.png 1126w\" sizes=\"auto, (max-width: 625px) 100vw, 625px\" \/><\/a><\/p>\n<p><span style=\"text-decoration: underline;\"><strong>Train\/Test Split<\/strong><\/span><\/p>\n<p>We begin the model building process by creating a train\/test split of the data.<\/p>\n<p><a href=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/09\/Screenshot-2023-09-03-at-2.20.04-PM.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-large wp-image-3230\" src=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/09\/Screenshot-2023-09-03-at-2.20.04-PM-1024x540.png\" alt=\"\" width=\"625\" height=\"330\" srcset=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/09\/Screenshot-2023-09-03-at-2.20.04-PM-1024x540.png 1024w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/09\/Screenshot-2023-09-03-at-2.20.04-PM-300x158.png 300w, 
https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/09\/Screenshot-2023-09-03-at-2.20.04-PM-768x405.png 768w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/09\/Screenshot-2023-09-03-at-2.20.04-PM-624x329.png 624w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/09\/Screenshot-2023-09-03-at-2.20.04-PM.png 1094w\" sizes=\"auto, (max-width: 625px) 100vw, 625px\" \/><\/a><\/p>\n<p><span style=\"text-decoration: underline;\"><strong>Create a Random Forest Classifier Instance<\/strong><\/span><\/p>\n<p>This is basically telling R and Python that we want to build a random forest classifier. In {tidymodels} this is referred to as &#8220;specifying the model engine&#8221;.<\/p>\n<p><a href=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/09\/Screenshot-2023-09-03-at-2.22.08-PM.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-large wp-image-3231\" src=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/09\/Screenshot-2023-09-03-at-2.22.08-PM-1024x154.png\" alt=\"\" width=\"625\" height=\"94\" srcset=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/09\/Screenshot-2023-09-03-at-2.22.08-PM-1024x154.png 1024w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/09\/Screenshot-2023-09-03-at-2.22.08-PM-300x45.png 300w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/09\/Screenshot-2023-09-03-at-2.22.08-PM-768x115.png 768w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/09\/Screenshot-2023-09-03-at-2.22.08-PM-624x94.png 624w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/09\/Screenshot-2023-09-03-at-2.22.08-PM.png 1100w\" sizes=\"auto, (max-width: 625px) 100vw, 625px\" \/><\/a><\/p>\n<p><span style=\"text-decoration: underline;\"><strong>Hyperparameter Tuning on Cross Validated Folds<\/strong><\/span><\/p>\n<p>The two random 
forest hyperparameters we will tune are:<\/p>\n<ol>\n<li>The number of variables randomly sampled as candidates at each split (R calls this <em><strong>mtry<\/strong><\/em> while Python calls it <em><strong>max_features<\/strong><\/em>)<\/li>\n<li>The number of trees to grow (R calls this <em><strong>trees<\/strong><\/em> while Python calls it <em><strong>n_estimators<\/strong><\/em>)<\/li>\n<\/ol>\n<p>In {tidymodels} we will specify 5 cross-validated folds on the training set, set up a recipe, which explains the model we want (predicting FTR from all of the other variables in the data), put all of this into a single workflow and then set up our tuning parameter grid.<\/p>\n<p>In Scikit Learn, we set up a dictionary of parameters <strong>(NOTE:<\/strong> the candidate values for each parameter must be stored in list format) and we will pass them into a cross-validation structure that performs 5-fold cross-validation in parallel (to speed up the process). We then pass this into the <strong>GridSearchCV()<\/strong> function where we specify the model we are fitting (random forest), the parameter grid that we&#8217;ve specified, and how we want to compare the random forest models (scoring). 
Additionally, we&#8217;ll set <em><strong>n_jobs = -1<\/strong><\/em> to allow Python to use all of the cores on our machine.<\/p>\n<p>While the code looks different, we&#8217;ve essentially set up the same process in both languages.<\/p>\n<p><a href=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/09\/Screenshot-2023-09-03-at-2.50.52-PM.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-large wp-image-3234\" src=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/09\/Screenshot-2023-09-03-at-2.50.52-PM-1024x386.png\" alt=\"\" width=\"625\" height=\"236\" srcset=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/09\/Screenshot-2023-09-03-at-2.50.52-PM-1024x386.png 1024w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/09\/Screenshot-2023-09-03-at-2.50.52-PM-300x113.png 300w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/09\/Screenshot-2023-09-03-at-2.50.52-PM-768x289.png 768w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/09\/Screenshot-2023-09-03-at-2.50.52-PM-624x235.png 624w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/09\/Screenshot-2023-09-03-at-2.50.52-PM.png 1104w\" sizes=\"auto, (max-width: 625px) 100vw, 625px\" \/><\/a><\/p>\n<p><span style=\"text-decoration: underline;\"><strong>Tune the model on the training data<\/strong><\/span><\/p>\n<p>We can now tune the hyperparameters by applying the cross-validated folds procedure to the training data.<\/p>\n<p>Above, we indicated to Python that we wanted some parallel processing, to speed up the process. In {tidymodels} we specify parallel processing by setting up the number of cores we&#8217;d like to use on our machine. Additionally, we will want to save the results of each cross-validated iteration, so we use the <em><strong>control_resamples()<\/strong><\/em> function to do this. 
All of these steps were specified in Python, above, so we are ready to now apply cross-validation to our training dataset and tune the hyperparameters.<\/p>\n<p><a href=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/09\/Screenshot-2023-09-03-at-2.45.27-PM.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-large wp-image-3233\" src=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/09\/Screenshot-2023-09-03-at-2.45.27-PM-1024x301.png\" alt=\"\" width=\"625\" height=\"184\" srcset=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/09\/Screenshot-2023-09-03-at-2.45.27-PM-1024x301.png 1024w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/09\/Screenshot-2023-09-03-at-2.45.27-PM-300x88.png 300w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/09\/Screenshot-2023-09-03-at-2.45.27-PM-768x226.png 768w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/09\/Screenshot-2023-09-03-at-2.45.27-PM-624x183.png 624w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/09\/Screenshot-2023-09-03-at-2.45.27-PM.png 1095w\" sizes=\"auto, (max-width: 625px) 100vw, 625px\" \/><\/a><\/p>\n<p><span style=\"text-decoration: underline;\"><strong>Get the best parameters<\/strong><\/span><\/p>\n<p>Both R and Python provide numerous objects to explore the output for each of the cross-validated folds. I&#8217;ve placed some examples in the respective codes in the <strong><span style=\"color: #0000ff;\"><a style=\"color: #0000ff;\" href=\"https:\/\/github.com\/pw2\/compare_tidymodels_to_scikit_learn\">GITHUB page<\/a><\/span><\/strong>. For our purposes, we are most interested in the optimal number of variables and trees. 
Both coding languages found 4 and 400 to be the optimal number of variables and trees, respectively.<\/p>\n<p><a href=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/09\/Screenshot-2023-09-03-at-2.52.07-PM.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-large wp-image-3235\" src=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/09\/Screenshot-2023-09-03-at-2.52.07-PM-1024x343.png\" alt=\"\" width=\"625\" height=\"209\" srcset=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/09\/Screenshot-2023-09-03-at-2.52.07-PM-1024x343.png 1024w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/09\/Screenshot-2023-09-03-at-2.52.07-PM-300x100.png 300w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/09\/Screenshot-2023-09-03-at-2.52.07-PM-768x257.png 768w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/09\/Screenshot-2023-09-03-at-2.52.07-PM-624x209.png 624w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/09\/Screenshot-2023-09-03-at-2.52.07-PM.png 1081w\" sizes=\"auto, (max-width: 625px) 100vw, 625px\" \/><\/a><\/p>\n<p><span style=\"text-decoration: underline;\"><strong>Fitting the Final Model<\/strong><\/span><\/p>\n<p>Now that we have the optimal hyperparameter values, we can refit the model. 
In both {tidymodels} and Scikit Learn, we&#8217;ll just refit a random forest with those optimal values specified.<\/p>\n<p><a href=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/09\/Screenshot-2023-09-03-at-2.57.03-PM.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-large wp-image-3236\" src=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/09\/Screenshot-2023-09-03-at-2.57.03-PM-1024x223.png\" alt=\"\" width=\"625\" height=\"136\" srcset=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/09\/Screenshot-2023-09-03-at-2.57.03-PM-1024x223.png 1024w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/09\/Screenshot-2023-09-03-at-2.57.03-PM-300x65.png 300w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/09\/Screenshot-2023-09-03-at-2.57.03-PM-768x167.png 768w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/09\/Screenshot-2023-09-03-at-2.57.03-PM-624x136.png 624w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/09\/Screenshot-2023-09-03-at-2.57.03-PM.png 1131w\" sizes=\"auto, (max-width: 625px) 100vw, 625px\" \/><\/a><\/p>\n<p><span style=\"text-decoration: underline;\"><strong>Variable Importance Plot<\/strong><\/span><\/p>\n<p>It&#8217;s helpful to see which variables were the most important contributors to the model&#8217;s predictions.<\/p>\n<p><em><strong>Side Note:<\/strong><\/em><em> This takes more code in Python than in R. This is one of the drawbacks I&#8217;ve found with Python compared to R. I can do things more efficiently and with less code in R than in Python. I often find I have to work a lot harder in Scikit Learn to get model outputs and information about the model fit. 
It&#8217;s all in there but it is not clearly accessible (to me at least) and plotting in matplotlib is not as clean as plotting in ggplot2.<\/em><\/p>\n<p><a href=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/09\/Screenshot-2023-09-03-at-3.01.09-PM.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-3237\" src=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/09\/Screenshot-2023-09-03-at-3.01.09-PM.png\" alt=\"\" width=\"1013\" height=\"509\" srcset=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/09\/Screenshot-2023-09-03-at-3.01.09-PM.png 1013w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/09\/Screenshot-2023-09-03-at-3.01.09-PM-300x151.png 300w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/09\/Screenshot-2023-09-03-at-3.01.09-PM-768x386.png 768w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/09\/Screenshot-2023-09-03-at-3.01.09-PM-624x314.png 624w\" sizes=\"auto, (max-width: 1013px) 100vw, 1013px\" \/><\/a><\/p>\n<p><a href=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/09\/Screenshot-2023-09-03-at-3.04.58-PM.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-large wp-image-3238\" src=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/09\/Screenshot-2023-09-03-at-3.04.58-PM-1024x451.png\" alt=\"\" width=\"625\" height=\"275\" srcset=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/09\/Screenshot-2023-09-03-at-3.04.58-PM-1024x451.png 1024w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/09\/Screenshot-2023-09-03-at-3.04.58-PM-300x132.png 300w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/09\/Screenshot-2023-09-03-at-3.04.58-PM-768x338.png 768w, 
https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/09\/Screenshot-2023-09-03-at-3.04.58-PM-624x275.png 624w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/09\/Screenshot-2023-09-03-at-3.04.58-PM.png 1110w\" sizes=\"auto, (max-width: 625px) 100vw, 625px\" \/><\/a><\/p>\n<p><span style=\"text-decoration: underline;\"><strong>Get Model Predictions on the Test Set<\/strong><\/span><\/p>\n<p>Both languages offer some out of the box options for describing the model fit info. If you want more than this (which you should, because this isn&#8217;t much to go off of), then you&#8217;ll have to extract the predicted probabilities and the actual outcomes and code some additional analysis (potentially a future blog article).<\/p>\n<p><a href=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/09\/Screenshot-2023-09-03-at-3.08.13-PM.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-large wp-image-3239\" src=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/09\/Screenshot-2023-09-03-at-3.08.13-PM-1024x424.png\" alt=\"\" width=\"625\" height=\"259\" srcset=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/09\/Screenshot-2023-09-03-at-3.08.13-PM-1024x424.png 1024w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/09\/Screenshot-2023-09-03-at-3.08.13-PM-300x124.png 300w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/09\/Screenshot-2023-09-03-at-3.08.13-PM-768x318.png 768w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/09\/Screenshot-2023-09-03-at-3.08.13-PM-624x259.png 624w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/09\/Screenshot-2023-09-03-at-3.08.13-PM.png 1115w\" sizes=\"auto, (max-width: 625px) 100vw, 625px\" \/><\/a><\/p>\n<p><span style=\"text-decoration: underline;\"><strong>Save The Model<\/strong><\/span><\/p>\n<p>If we want 
to use this model for any downstream analysis we will need to save it.<\/p>\n<p><a href=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/09\/Screenshot-2023-09-03-at-3.10.05-PM.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-3240\" src=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/09\/Screenshot-2023-09-03-at-3.10.05-PM.png\" alt=\"\" width=\"627\" height=\"291\" srcset=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/09\/Screenshot-2023-09-03-at-3.10.05-PM.png 627w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/09\/Screenshot-2023-09-03-at-3.10.05-PM-300x139.png 300w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/09\/Screenshot-2023-09-03-at-3.10.05-PM-624x290.png 624w\" sizes=\"auto, (max-width: 627px) 100vw, 627px\" \/><\/a><\/p>\n<p><span style=\"text-decoration: underline;\"><strong>Load the Model and Make Predictions<\/strong><\/span><\/p>\n<p>Once we have the model saved we can load it and apply it to any new data that comes in. Here, our <em>new<\/em> data will just be a selection of rows from the original data set (we will pretend it is <em>new<\/em>).<\/p>\n<p><strong>NOTE: <\/strong>Python is 0 indexed while R is indexed starting at 1. 
So keep that in mind if selecting rows from the original data to make the same comparison in both languages.<\/p>\n<p><a href=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/09\/Screenshot-2023-09-03-at-3.15.55-PM.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-large wp-image-3241\" src=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/09\/Screenshot-2023-09-03-at-3.15.55-PM-1024x528.png\" alt=\"\" width=\"625\" height=\"322\" srcset=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/09\/Screenshot-2023-09-03-at-3.15.55-PM-1024x528.png 1024w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/09\/Screenshot-2023-09-03-at-3.15.55-PM-300x155.png 300w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/09\/Screenshot-2023-09-03-at-3.15.55-PM-768x396.png 768w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/09\/Screenshot-2023-09-03-at-3.15.55-PM-624x321.png 624w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2023\/09\/Screenshot-2023-09-03-at-3.15.55-PM.png 1118w\" sizes=\"auto, (max-width: 625px) 100vw, 625px\" \/><\/a><\/p>\n<p><span style=\"text-decoration: underline;\"><strong>Wrapping Up<\/strong><\/span><\/p>\n<p>Both {tidymodels} and Scikit Learn provide users with powerful machine learning frameworks for conducting analysis. While the code syntax differs, the general concepts are the same, so bouncing between the two languages shouldn&#8217;t be too cumbersome. 
Hopefully this tutorial provided a nice overview of how to conduct the same analysis in both languages, offering a bridge for those trying to learn Python from R and vice versa.<\/p>\n<p>All code is available on my <strong><span style=\"color: #0000ff;\"><a style=\"color: #0000ff;\" href=\"https:\/\/github.com\/pw2\/compare_tidymodels_to_scikit_learn\">GITHUB page<\/a><\/span><\/strong>.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>I previously wrote a blog post providing side-by-side comparisons of R&#8217;s {tidyverse} to Python for various data manipulation techniques. I&#8217;ve also talked extensively about using {tidymodels} for model fitting (see HERE and HERE). Today, we will work through a tutorial on how to fit the same random forest model in {tidymodels} and Scikit Learn. This [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[48,47,43],"tags":[],"class_list":["post-3225","post","type-post","status-publish","format-standard","hentry","category-model-building-in-python","category-model-building-in-r","category-sports-analytics"],"_links":{"self":[{"href":"https:\/\/optimumsportsperformance.com\/blog\/wp-json\/wp\/v2\/posts\/3225","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/optimumsportsperformance.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/optimumsportsperformance.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/optimumsportsperformance.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/optimumsportsperformance.com\/blog\/wp-json\/wp\/v2\/comments?post=3225"}],"version-history":[{"count":1,"href":"https:\/\/optimumsportsperformance.com\/blog\/wp-json\/wp\/v2\/posts\/3225\/revisions"}],"predecessor-version":[{"id":3242,"href":"https:\/\/optimumsportsperformance.com\/blog\/wp-json\/wp\/v
2\/posts\/3225\/revisions\/3242"}],"wp:attachment":[{"href":"https:\/\/optimumsportsperformance.com\/blog\/wp-json\/wp\/v2\/media?parent=3225"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/optimumsportsperformance.com\/blog\/wp-json\/wp\/v2\/categories?post=3225"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/optimumsportsperformance.com\/blog\/wp-json\/wp\/v2\/tags?post=3225"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}