Category Archives: Sports Analytics

K-Nearest Neighbor: {tidymodels} tutorial

When working on data science or research teams, it often helps to have a workflow that makes it easy for teammates to review your work and add additional components to the data cleaning or model structure. Additionally, it ensures that the steps in the process are clear so that debugging is easy.

In R, the {tidymodels} package offers one such workflow and in python, folks seem to prefer scikit learn. This week, I’m going to walk through a full workflow in tidymodels using the K-nearest neighbor algorithm. Some of the things I’ll cover include:

  • Splitting the data in test/train sets and cross validation folds
  • Setting up the model structure (what tidymodels refers to as a recipe)
  • Creating preprocessing steps
  • Compiling the preprocessing and model structure together into a single workflow
  • Tuning the KNN model
  • Identifying the best tuned model
  • Evaluating the model on the test set
  • Saving the entire workflow and model to be used later on with new data
  • Making predictions with the saved workflow and model on new data

I’ve provided a number of tidymodels tutorials on this blog so feel free to search the blog for those. Additionally, all of my tidymodels tutorials and templates are available in a GitHub Repo. Alternatively, if Python is more your jam, I did a previous blog comparing the workflow in tidymodels to Scikit Learn, which can be found HERE.

This entire tutorial and data are available on my GitHub page if you’d like to code along.

Load & Clean Data

For this tutorial I’m going to use the 2021 FIFA Soccer Ratings, which are available on Kaggle. There are several years of ratings there, but we will concentrate on 2021 with the goal being to estimate a player’s contract value based on the various FIFA ratings provided and the position they play.

First, I load the tidyverse and tidymodels libraries and read in the data. I’m going to drop goalies so that we only focus on players who play the field and, since the data has a large number of columns, I’m going to get only the columns we care about: Information about the player (playerID, name, height, weight, club, league, position, and contract value) and then the various FIFA ratings.

The only column with missing data that impacts our analysis is the team position column, as this is a feature in the data set. Additionally, there are 207 players with a contract value of 0 (all 195 players missing a team position have a 0 contract value). So, perhaps these are players who weren’t on a club at the time the ratings were assigned.

I’m going to remove these players from our analysis data set, but I’m going to place them in their own data set because we will use them later to make estimates of their contract value.
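As a rough sketch of those cleaning steps (the file name and the column names, such as value_eur and team_position, are assumptions based on the Kaggle ratings file, so adjust them to your copy of the data):

library(tidyverse)
library(tidymodels)

## read in the 2021 ratings (file name is a placeholder)
fifa_raw <- read_csv("fifa21_ratings.csv")

## drop goalies and keep only the columns of interest
fifa <- fifa_raw %>%
  filter(is.na(team_position) | team_position != "GK") %>%
  select(sofifa_id, short_name, height_cm, weight_kg, club_name, league_name,
         team_position, value_eur, pace:physic)

## set aside the 0 contract value players (those with a position) to predict on later
fifa_new <- fifa %>%
  filter(value_eur == 0, !is.na(team_position))

## the analysis data set
fifa_model <- fifa %>%
  filter(value_eur > 0)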

Visualizing the Data

Let’s take a look at a visual of the contract value for all of the players in our data set. Since this is severely right skewed, we will plot it on the log scale.

We also notice there are a variety of positions (way more than I would have guessed in soccer!).

Finally, we want to explore how playing position might influence contract value.
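If you’re coding along, a quick sketch of these visuals (again assuming the value_eur and team_position column names) might look like:

## contract value is heavily right skewed, so plot it on the log scale
fifa_model %>%
  ggplot(aes(x = value_eur)) +
  geom_histogram() +
  scale_x_log10()

## contract value by playing position
fifa_model %>%
  ggplot(aes(x = reorder(team_position, value_eur, FUN = median), y = value_eur)) +
  geom_boxplot() +
  scale_y_log10() +
  coord_flip()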

tidymodels set up

Now that we have our data organized we want to set up a tidymodels workflow to estimate the contract value of every player using a K-nearest neighbor model.

First, we split the data into training and test sets and then further split the training data into 5 cross validation folds so that we can tune the model and find the best number of k-neighbors.
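A minimal sketch of the splitting step (the 75/25 proportion is an assumption; use whatever split makes sense for your data):

set.seed(2021)
fifa_split <- initial_split(fifa_model, prop = 0.75)
fifa_train <- training(fifa_split)
fifa_test <- testing(fifa_split)

## 5 cross validation folds on the training data for tuning
fifa_cv <- vfold_cv(fifa_train, v = 5)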

Next, we specify that we want a KNN model and set the mode to regression, since the outcome is continuous (as opposed to classification).
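Something like the following, using the {kknn} engine and leaving the number of neighbors open for tuning:

knn_spec <- nearest_neighbor(neighbors = tune()) %>%
  set_engine("kknn") %>%
  set_mode("regression")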

Now that the model is specified we need to set up our preprocessing steps. This is one thing that tidymodels is super useful for. Normally, we’d need to do all this preprocessing before fitting the model and, as in the case of KNN and most machine learning models, we’d need to standardize the variables in the training set and then store the means and standard deviations of those variables so that we can use them to standardize the test set or future data. In tidymodels, we set up our preprocessing workflow and store it and it contains all that information for us! We can simply load it and use it for any downstream analysis we want to do. There are three preprocessing steps we want to add to our workflow:

  1. log transform the dependent variable (contract value)
  2. Dummy code the positions
  3. Standardize (z-score) all of the numeric variables in the model (the ratings)

We can view the preprocessing steps using the prep() function.
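A sketch of that recipe might look like the following (the id columns are given a non-predictor role so they are carried along but not modeled; note that with this step order the dummy columns get standardized as well):

fifa_rec <- recipe(value_eur ~ ., data = fifa_train) %>%
  update_role(sofifa_id, short_name, club_name, league_name, new_role = "id") %>%
  step_log(value_eur) %>%
  step_dummy(team_position) %>%
  step_normalize(all_numeric_predictors())

## view the preprocessing steps
fifa_rec %>%
  prep()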

 

With the model specified and the preprocessing recipe set up we are ready to compile everything into a single workflow.

We can look at the workflow to make sure everything makes sense before we start fitting the model.
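For example:

knn_wf <- workflow() %>%
  add_recipe(fifa_rec) %>%
  add_model(knn_spec)

knn_wf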

Next, we tune the model using the cross validated folds that we set up on our training data set. We are trying to find the optimal number of k-neighbors that minimizes the RMSE of the outcome variable (the log of contract value).

We see the results are stored in a list for every one of our 5 cross validation splits. We can view the model metrics for each of the folds.

The model with 10 neighbors appears to have the lowest RMSE. So, we want to pull that value directly from this table so that we can store it and use it later when we go to fit the final model.
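A sketch of the tuning step (the grid of neighbors is an assumption; any reasonable range will do):

knn_tune <- tune_grid(
  knn_wf,
  resamples = fifa_cv,
  grid = tibble(neighbors = seq(2, 20, by = 2))
)

## view the metrics across the cross validation folds
collect_metrics(knn_tune)

## pull the number of neighbors with the lowest RMSE
best_k <- select_best(knn_tune, metric = "rmse")
best_neighbors <- best_k$neighbors
best_neighbors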

Fitting the Final Model Version 1: Using finalize_workflow()

The first version I’ll show for fitting the final model uses the finalize_workflow() function. I like this approach if I’m building my model in a local session and then want to see how well it performs on the test set (which is what this function does). This isn’t the approach I take when I need to save everything for downstream use (which we will cover in Version 2).

First, fit the best model using the finalize_workflow() function.

Then, we get the model predictions for this final workflow on the test data set.

We can plot the predicted contract value compared to the actual contract value and calculate the test set RMSE.
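Roughly, that looks like this:

final_knn_wf <- finalize_workflow(knn_wf, best_k)

## fit on the training data and evaluate on the test set in one step
final_res <- last_fit(final_knn_wf, split = fifa_split)

## test set RMSE and R-squared
collect_metrics(final_res)

## predicted vs actual contract value (both on the log scale from the recipe)
collect_predictions(final_res) %>%
  ggplot(aes(x = .pred, y = value_eur)) +
  geom_point(alpha = 0.5) +
  geom_abline(linetype = "dashed")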

Fitting the Final Model Version 2: Re-specify the model on the training set, save it, and use it later

In this second version of fitting the final model, we will take the best neighbors from the tuning phase, re-specify the model to the training set, reset the workflow, and then save these components so that we can use them later.

First, re-specify the model and notice I set the neighbors argument to best_neighbors, which we saved above when we tuned the model.

We then use this finalized/re-specified model to reset the workflow. Notice it is added with the add_model() function; however, the recipe, with our preprocessing steps, does not change.

With the workflow set up we now need to do the final two steps:

  1. Fit the model to the training data and extract the recipe, using extract_recipe(), since we will need to save this to preprocess any new data before making predictions.
  2. Fit the model to the training data and extract the model itself, using extract_fit_parsnip(), so we can use the model to make future predictions.

Now we save the recipe and model as .rda files. We can then load these two structures and use them on new data!
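Putting those steps together, a sketch of Version 2 might look like this:

## re-specify the model with the tuned number of neighbors
knn_final_spec <- nearest_neighbor(neighbors = best_neighbors) %>%
  set_engine("kknn") %>%
  set_mode("regression")

## reset the workflow with the finalized model; the recipe does not change
knn_final_wf <- workflow() %>%
  add_recipe(fifa_rec) %>%
  add_model(knn_final_spec)

## fit to the training data and extract the pieces we need to save
knn_final_fit <- fit(knn_final_wf, data = fifa_train)

fifa_recipe_fit <- extract_recipe(knn_final_fit)
fifa_knn_model <- extract_fit_parsnip(knn_final_fit)

## save both objects as .rda files for later use
save(fifa_recipe_fit, file = "fifa_recipe_fit.rda")
save(fifa_knn_model, file = "fifa_knn_model.rda")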

We have 207 players with 0 contract value, of which 195 had no team position. So, that leaves us with 12 players that we can estimate contract value for with our saved model.

First, we use the bake() function to apply the saved recipe to the new data so that we can preprocess everything appropriately.

With the new data preprocessed we can now predict the log of contract value with our saved model.


We can add these predictions back to the new data set, exponentiate the predicted contract value to get it back to the normal scale and then create a visual of our estimates for the 12 players.
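A sketch of that prediction step, picking up the saved recipe and model from above:

## load the saved recipe and model (for example, in a new session)
load("fifa_recipe_fit.rda")
load("fifa_knn_model.rda")

## apply the stored preprocessing to the new players and predict the log of contract value
fifa_new_baked <- bake(fifa_recipe_fit, new_data = fifa_new)
fifa_new_preds <- predict(fifa_knn_model, new_data = fifa_new_baked)

## attach the predictions and back-transform to the original scale
fifa_new %>%
  bind_cols(fifa_new_preds) %>%
  mutate(pred_value = exp(.pred)) %>%
  select(short_name, team_position, pred_value)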

Wrapping Up

That’s a quick walk through of how to set up a complete tidymodels workflow. This approach would work for any type of model you want to build, not just KNN! As always, the code and data are available on my GitHub page.

Calculating Confidence Intervals in R

A favorite paper of mine is the 1986 paper by Gardner and Altman regarding confidence intervals and estimation as a more useful way of reporting data than a dichotomous p-value:

Gardner, MJ. Altman, DG. (1986). Confidence intervals rather than P values: Estimation rather than hypothesis testing. Brit Med J; 292:746-750.

In this paper, Gardner and Altman discuss three main points for either moving away from or supplementing statistical reporting with p-values:

  1. Research often focuses on null hypothesis significance testing with the goal being to identify statistically significant results.
  2. However, we are often interested in the magnitude of the factor of interest.
  3. Given that research deals with samples of a broader population, the readers are not only interested in the observed magnitude of the estimand but also the degree of variability and plausible range of values for the population. This variability can be quantified using confidence intervals.

Aside from the paper providing a clear explanation of the issue at hand, their appendix offers the equations to calculate confidence intervals for means, differences in means, proportions, and differences in proportions. Thus, I decided to compile the appendix in an R script for those looking to code confidence intervals (and not have to rely on pre-built functions).

All of this code is available on my GITHUB page.

Confidence Intervals for Means and Differences

Single Sample

  1. Obtain the mean
  2. Calculate the standard error (SE) of the mean as SE = SD/sqrt(N)
  3. Multiply by a t-critical value specific to the level of confidence of interest and the degrees of freedom (DF) for the single sample, DF = N – 1

The confidence intervals are calculated as:

Low = mean – t_{crit} * SE
High = mean + t_{crit} * SE

Example

We collect data on 30 participants on a special test of strength and observe a mean of 40 and standard deviation of 10. We want to calculate the 90% confidence interval.

The steps above can easily be computed in R. First, we write down the known info (the data that is provided to us). We then calculate the standard error and degrees of freedom (N – 1). To obtain the critical value for our desired level of confidence, we use the t-distribution with the degrees of freedom of our data.

## Known info
N <- 30
avg <- 40
SD <- 10

## Calculate DF and SE
SE <- SD / sqrt(N)
DF <- N - 1

## Calculate the confidence level of interest (90%), the amount of data in each of the two tails ((1 - level of interest) / 2) and the t-critical value
level_of_interest <- 0.90
tail_value <- (1 - level_of_interest)/2

t_crit <- abs(qt(tail_value, df = DF))
t_crit

## Calculate the 90% CI
low90 <- round(avg - t_crit * SE, 1)
high90 <- round(avg + t_crit * SE, 1)

cat("The 90% Confidence Interval is:", low90, " to ", high90)

Two Samples

  1. Obtain the sample mean and standard deviations for the two samples
  2. Pool the estimate of the standard deviation: s = sqrt(((n_1 - 1)*s^2_1 + (n_2 - 1)*s^2_2) / (n_1 + n_2 - 2))
  3. Calculate the SE for the difference:SE_{diff} = s * sqrt(1/n_1 + 1/n_2)
  4. Calculate the confidence interval as:

Low = (x_1 – x_2) – t_{crit} * SE_{diff}
High = (x_1 – x_2) + t_{crit} * SE_{diff}

Example

The example in the paper provides the following info:

  • Blood pressure levels were measured in 100 diabetic and 100 non-diabetic men aged 40-49 years old.
  • Mean systolic blood pressure was 146.4 mmHg (SD = 18.5) in diabetics and 140.4 mmHg (SD = 16.8) in non-diabetics.

Calculate the 95% CI.

## store the known data
N_diabetic <- 100
N_non_diabetic <- 100
diabetic_avg <- 146.4
diabetic_sd <- 18.5
non_diabetic_avg <- 140.4
non_diabetic_sd <- 16.8

## calculate the difference in means, the pooled SD, and the SE of diff
group_diff <- diabetic_avg - non_diabetic_avg

pooled_sd <- sqrt(((N_diabetic - 1)*diabetic_sd^2 + (N_non_diabetic - 1)*non_diabetic_sd^2) / (N_diabetic + N_non_diabetic - 2))

se_diff <- pooled_sd * sqrt(1/N_diabetic + 1/N_non_diabetic)

## Calculate the confidence level of interest (95%), the amount of data in each of the two tails ((1 - level of interest) / 2) and the t-critical value
level_of_interest <- 0.95
tail_value <- (1 - level_of_interest)/2

t_crit <- abs(qt(tail_value, df = N_diabetic + N_non_diabetic - 2))
t_crit

## Calculate the 95% CI
low95 <- round(group_diff - t_crit * se_diff, 1)
high95 <- round(group_diff + t_crit * se_diff, 1)

cat("The 95% Confidence Interval is:", low95, " to ", high95)

Confidence Intervals for Proportions

Single Sample

  1. Obtain the proportion for the sample
  2. Calculate the SE of the proportion, SE = sqrt((p * (1-p)) / N)
  3. Obtain the z-critical value from a standard normal distribution for the level of confidence of interest (unlike the t-critical value used for means, it does not depend on the degrees of freedom).
  4. Calculate the confidence interval:

low = p – z_{crit} * SE
high = p + z_{crit} * SE

Example

We observe a basketball player with 80 field goal attempts and a FG% of 39%. Calculate the 95% CI.

## Store the known info
N <- 80
fg_pct <- 0.39

## Calculate SE
se <- sqrt((fg_pct * (1 - fg_pct)) / N)

## Calculate z-critical value for 95% confidence
level_of_interest <- 0.95
tail_value <- (1 - level_of_interest) / 2
z_crit <- qnorm(p = tail_value, lower.tail = FALSE)

## Calculate the 95% CI
low95 <- round(fg_pct - z_crit * se, 3)
high95 <- round(fg_pct + z_crit * se, 3)

cat("The 95% Confidence Interval is:", low95, " to ", high95)

Two Samples

  1. Calculate the difference in proportions between the two groups
  2. Calculate the SE of the difference in proportions:SE_{diff} = sqrt(((p_1 * (1-p_1)) / n_1) + ((p_2 * (1 – p_2)) / n_2))
  3. Calculate the z-critical value for the level of interest
  4. Calculate the confidence interval as:

low = (p_1 – p_2) – (z_{crit} * se_{diff})
high = (p_1 – p_2) + (z_{crit} * se_{diff})

Example of two unpaired samples

The study provides the following table of example data:

data.frame(
  response = c("improvement", "no improvement", "total"),
  treatment_A = c(61, 19, 80),
  treatment_B = c(45, 35, 80)
)

The difference we are interested in is between the proportion who improved in treatment A and the proportion of those who improved in treatment B.

## Obtain the two proportions from the table and total sample sizes
pr_A <- 61/80
n_A <- 80
pr_B <- 45/80
n_B <- 80

## calculate the difference in proportions
diff_pr <- pr_A - pr_B

## Calculate the SE
se_diff <- sqrt((pr_A * (1 - pr_A))/n_A + (pr_B * (1 - pr_B))/n_B)

## Get z-critical value for 95% confidence
level_of_interest <- 0.95
tail_value <- (1 - level_of_interest) / 2
z_crit <- qnorm(p = tail_value, lower.tail = FALSE)

## Calculate the 95% CI
low95 <- round(diff_pr - z_crit * se_diff, 3)
high95 <- round(diff_pr + z_crit * se_diff, 3)

cat("The 95% Confidence Interval is:", low95, " to ", high95)

 

Example for two paired samples

We can organize the data in a table like this:

data.frame(
  test_1 = c("Present", "Present", "Absent", "Absent"),
  test_2 = c("Present", "Absent", "Present", "Absent"),
  number_of_subjects = c("a", "b", "c", "d")
)

Let’s say we measured a group of subjects for a specific disease twice in a study. A subject either has the disease (present) or does not (absent) in the two time points. We observe the following data:

dat <- data.frame(
  test_1 = c("Present", "Present", "Absent", "Absent"),
  test_2 = c("Present", "Absent", "Present", "Absent"),
  number_of_subjects = c(10, 25, 45, 5)
)

dat

## total sample size
N <- sum(dat$number_of_subjects)

If we want to compare the proportion with the disease present on Test 1 to the proportion present on Test 2, we calculate the two proportions and their difference as:

p_1 = (a + b) / N
p_2 = (a + c) / N
Diff = p_1 – p_2

The SE of the difference is:

SE_{diff} = 1/N * sqrt(b + c – (b-c)^2/N)

## Obtain the info we need from the table
p1 <- (10 + 25) / N
p2 <- (10 + 45) / N
diff_prop <- p1 - p2
diff_prop

## Calculate the SE of the difference
se_diff <- 1 / N * sqrt(25 + 45 - (25 - 45)^2/N)

## Get z-critical value for 95% confidence
level_of_interest <- 0.95
tail_value <- (1 - level_of_interest) / 2
z_crit <- qnorm(p = tail_value, lower.tail = FALSE)

## Calculate the 95% CI
low95 <- round(diff_prop - z_crit * se_diff, 3)
high95 <- round(diff_prop + z_crit * se_diff, 3)

cat("The 95% Confidence Interval is:", low95, " to ", high95)

As always, all of this code is available on my GITHUB page.

And, if you’d like to read the full paper, you can find it here:

Gardner, MJ. Altman, DG. (1986). Confidence intervals rather than P values: Estimation rather than hypothesis testing. Brit Med J; 292:746-750.

Validity, Reliability, & Responsiveness — A few papers on measurement in sport science

I had the pleasure of speaking at the National Strength and Conditioning Association‘s (NSCA) National Conference this summer and while there I made it a point to attend the Sport Science & Performance Technology Special Interest Group meeting as well.

One thing that immediately stood out to me was the number of questions raised specific to what types of technologies to purchase (e.g. “Which brand of force plates should we buy?”, “Does anyone have a list comparing and contrasting different technologies so that we can determine what would be best for us?”, etc.).

While these are fine questions, I do feel they are a bit like putting the cart before the horse. Before thinking about what technology to purchase, we should spend a good bit of time gaining clarity on the question(s) we are attempting to answer. Once we have a firm understanding of the question we can then begin the process of evaluating whether a technology exists that can help us collect the necessary data to explore that question. In fact, this was the main crux of my lecture at the conference, as I spoke about using the PPDAC Framework in practice (I wrote an article about this framework a couple of years ago).

A force plate, a GPS unit, or an accelerometer won’t solve all of our problems. In fact, depending on our question, they might not solve any of our problems! Moreover, as sport scientists we need to concern ourselves not only with the research question but, also whether the desired technology is useful within our ecological setting. Just because something worked in a controlled lab environment or was valid in a different sport does not mean it will be useful for our sport, or in our setting, or with our athletes, or given our unique constraints.

So, I decided to share a few resources pertaining to measurement theory concepts such as validity, reliability, and responsiveness/sensitivity for those working in the sport science space who are interested in more critical approaches to evaluating the technology we use in practice.

Additionally, for those interested, several years ago I wrote a full R code blog for the last paper above (Swinton et al), which can be accessed HERE.

Happy reading!

Simulations in R Part 6: Multicollinearity Assumption in Regression

For the next installment of our simulation blog series we will use simulation to look at the Multicollinearity assumption in linear regression.

For a refresher, here are the previous 5 articles in this series:

  • Part 1 discussed the basic functions for simulating and sampling data in R.
  • Part 2 walked through how to perform bootstrap resampling and then simulate bivariate and multivariate distributions.
  • Part 3 worked through group comparisons by simulating thousands of t-tests.
  • Part 4 built simulations for linear regression.
  • Part 5 used simulation to investigate the homoskedasticity assumption in regression.

The entire code for this series is accessible on my GITHUB page.

Multicollinearity

Multicollinearity occurs when two independent variables in a regression model are highly correlated with each other. Such a situation can create problems with the interpretation of the model’s beta coefficients, may inflate their standard errors, and can lead to overfitting of the data. We can simulate this issue in order to get a better understanding of how multicollinearity can influence a regression model.

Constructing the simulation

We will use the {mvtnorm} package to help us construct a simulation where the two independent variables share a certain level of correlation with each other.

First, we will create the true model parameters: an intercept of 2, a beta1 of 5, and a beta2 of 10. We also create a vector of correlation coefficients from 0 to 0.99 and a few data frames to store the results of our model. We will also specify that at each correlation coefficient we want 200 random draws from the multivariate normal distribution.

## load packages
library(tidymodels)
library(patchwork)
library(mvtnorm)

set.seed(999)

# create the true model parameters
intercept <- 2
beta1 <- 5
beta2 <- 10

## Number of draws from a multivariate normal distribution
n <- 200

## Create a data frame to store model results
sim_params <- data.frame(intercept = NA,
                      intercept_se = NA,
                      beta1 = NA,
                      beta1_se = NA,
                      beta2 = NA,
                      beta2_se = NA,
                      model_rse = NA)

## create levels of multicollinearity between the two independent variables
cor_coefs <- c(seq(from = 0, to = 0.9, by = 0.1), 0.99)

# data frame to store the average beta coefficients and their standard errors from the simulation
mean_betas <- data.frame(beta1 = NA,
                       sd_beta1 = NA,
                       beta2 = NA,
                       sd_beta2 = NA)

Next, we will create a nested for() loop to construct our simulations.

  • The outer loop begins by creating the multivariate normal distribution with rmvnorm(), so we first specify a correlation matrix using the vector of correlation coefficients we created above. Once we have the two correlated variables we can pass them into the inner loop.
  • The inner loop is where we create a regression equation for the given correlation between the two variables. We create 100 regression simulations for each correlation coefficient.
  • Once the inner loop is finished running we store the results at the bottom of the outer loop. Instead of storing all of the results, we take the average of the 100 beta coefficients and their respective standard errors for each correlation coefficient.
## loop
for(j in 1:length(cor_coefs)){
  
  ## Create a correlation matrix between beta1 and beta2
  beta_corr <- matrix(c(1, cor_coefs[j], cor_coefs[j], 1), nrow = 2, ncol = 2)
  
  ## create a multivariate normal distribution 
  cor_df <- rmvnorm(n = n, mean = c(0, 0), sigma = beta_corr)
  X1 <- cor_df[, 1]
  X2 <- cor_df[, 2]
  
  ## simulate 100 regression simulations
  for(i in 1:100){
    
    # simulate the outcome variable from the true model plus random noise
    y_hat <- intercept + beta1*X1 + beta2*X2 + rnorm(n = n, mean = 0, sd = 1)
    
    # construct a regression equation
    model <- lm(y_hat ~ X1 + X2)
    
    # store the variance-covariance matrix
    vcv <- vcov(model)
    
    # estimates for the intercept
    sim_params[i, 1] <- model$coef[1]
  
    # estimates for the beta1
    sim_params[i, 3] <- model$coef[2]
    
    # estimates for beta2
    sim_params[i, 5] <- model$coef[3]
  
    # SE for the intercept
    sim_params[i, 2] <- sqrt(diag(vcv)[1])
    
    # SE for beta1
    sim_params[i, 4] <- sqrt(diag(vcv)[2])
    
    # SE for beta2
    sim_params[i, 6] <- sqrt(diag(vcv)[3])
  
    # model RSE
    sim_params[i, 7] <- summary(model)$sigma
  }

  mean_betas[j, ] <- c(mean(sim_params[, 3]), mean(sim_params[, 4]), mean(sim_params[, 5]), mean(sim_params[, 6]))

}

The final result is a data frame with 11 rows, one for each correlation coefficient, containing the average of 100 regression simulations for the beta coefficients and their standard errors.

# Add the correlation coefficients to the final data frame
mean_betas$cor_coef <- cor_coefs

mean_betas %>%
  knitr::kable()

Finally, we plot each of the results with the correlation coefficients on the x-axis.

# Plot the results
par(mfrow = c(2,2))
plot(x = mean_betas$cor_coef,
     y = mean_betas$beta1,
     main = "Beta1",
     lwd = 3,
     type = "b",
     ylim = c(2, 7),
     xlab = "Correlation betwen X1 and X2",
     ylab = "Beta1")
plot(x = mean_betas$cor_coef,
     y = mean_betas$sd_beta1,
     main = "Beta1 Standard Error",
     lwd = 3,
     type = "b",
     xlab = "Correlation betwen X1 and X2",
     ylab = "SE of B1")
plot(x = mean_betas$cor_coef,
     y = mean_betas$beta2,
     main = "Beta2",
     lwd = 3,
     type = "b",
     ylim = c(7, 13),
     xlab = "Correlation betwen X1 and X2",
     ylab = "Beta 2")
plot(x = mean_betas$cor_coef,
     y = mean_betas$sd_beta2,
     main = "Beta2 Standard Error",
     lwd = 3,
     type = "b",
     xlab = "Correlation betwen X1 and X2",
     ylab = "SE of B2")

The beta coefficients themselves remain relatively unchanged in our simulation across the various correlation levels. However, once the correlation between the two independent variables reaches about 0.7, the standard errors around the beta coefficients begin to increase exponentially, increasing our uncertainty about the true parameter values.

Wrapping Up

We’ve now covered using simulation to investigate two assumptions of linear regression. Our next installment will investigate another linear regression assumption before we proceed on to simulating other types of models.

For the full code, check out my GITHUB page.

Comparing Tidymodels in R to Scikit Learn in Python

I did a previous blog providing a side-by-side comparison of R’s {tidyverse} to Python for various data manipulation techniques. I’ve also talked extensively about using {tidymodels} for model fitting (see HERE and HERE). Today, we will work through a tutorial on how to fit the same random forest model in {tidymodels} and Scikit Learn.

This will be a side-by-side view of both coding languages. The tutorial will cover:

  • Loading the data
  • Basic exploratory data analysis
  • Creating a train/test split
  • Hyperparameter tuning by creating cross-validated folds on the training data
  • Identifying the optimal hyperparameters and fitting the final model
  • Applying the final model to the test data and evaluating model performance
  • Saving the model for downstream use
  • Loading the saved model and applying it to new data

To get the full code for each language and follow along with the tutorial visit my GITHUB page.

The Data

The data comes from the tidytuesday project from 4/4/2023. The data set is Premier League match data (2021 – 2022) that provides a series of features with the goal of predicting the final result (Full Time Result, FTR): whether the home team won, the away team won, or the match ended in a draw.

Load Data & Packages

First, we load the data directly from the tidytuesday website in both languages.

Exploratory Data Analysis

Next, we perform some exploratory data analysis to understand the potential features for our model.

  • Check each column for NAs
  • Plot a count of the outcome variable across the three levels (H = home team wins, A = away team wins, D = draw)
  • Select a few features for our model and then create box plots for each feature relative to the 3 levels of our outcome variable

Train/Test Split

We begin the model building process by creating a train/test split of the data.
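On the R side, a minimal sketch of the split (pl_matches is a placeholder name for the loaded match data; stratifying on the outcome keeps the class balance similar in both sets):

set.seed(2023)
pl_split <- initial_split(pl_matches, prop = 0.75, strata = FTR)
pl_train <- training(pl_split)
pl_test <- testing(pl_split)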

Create a Random Forest Classifier Instance

This is basically telling R and python that we want to build a random forest classifier. In {tidymodels} this is referred to as “specifying the model engine”.
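In {tidymodels}, a sketch of that specification might be the following (the {ranger} engine and the impurity importance option are assumptions; importance is set here so we can pull a variable importance plot later):

rf_spec <- rand_forest(mtry = tune(), trees = tune()) %>%
  set_engine("ranger", importance = "impurity") %>%
  set_mode("classification")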

Hyperparameter Tuning on Cross Validated Folds

The two random forest hyperparameters we will tune are:

  1. The number of variables randomly sampled as candidates at each split (R calls this mtry while Python calls it max_features)
  2. The number of trees to grow (R calls this trees and Python calls it n_estimators)

In {tidymodels} we will specify 5 cross validated folds on the training set, set up a recipe, which explains the model we want (predicting FTR from all of the other variables in the data), put all of this into a single workflow and then set up our tuning parameter grid.

In Scikit Learn, we set up a dictionary of parameters (NOTE: they must be stored in list format) and we will pass them into a cross validation structure that performs 5-fold cross-validation in parallel (to speed up the process). We then pass this into the GridSearchCV() function where we specify the model we are fitting (random forest), the parameter grid that we’ve specified, and how we want to compare the random forest models (scoring). Additionally, we’ll set n_jobs = -1 to allow Python to use all of the cores on our machine.

While the code looks different, we’ve essentially set up the same process in both languages.
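For reference, the {tidymodels} side of that setup might look something like this (the grid values are assumptions):

## 5 cross validation folds on the training data
pl_cv <- vfold_cv(pl_train, v = 5)

## recipe: predict FTR from everything else (FTR needs to be a factor for classification)
pl_rec <- recipe(FTR ~ ., data = pl_train)

## bundle the recipe and model into a workflow
rf_wf <- workflow() %>%
  add_recipe(pl_rec) %>%
  add_model(rf_spec)

## tuning grid
rf_grid <- expand.grid(
  mtry = c(2, 4, 6),
  trees = c(200, 400, 600)
)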

Tune the model on the training data

We can now tune the hyperparameters by applying the cross-validated folds procedure to the training data.

Above, we indicated to Python that we wanted some parallel processing to speed things up. In {tidymodels} we specify parallel processing by registering the number of cores we’d like to use on our machine. Additionally, we will want to save the results of each cross-validated iteration, so we use the control_grid() function to do this. All of these steps were specified in Python, above, so we are ready to now apply cross-validation to our training data set and tune the hyperparameters.
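A sketch of that tuning step on the R side:

library(doParallel)

## register parallel processing
cl <- makePSOCKcluster(parallel::detectCores() - 1)
registerDoParallel(cl)

## tune across the grid, saving the predictions from each resample
rf_tune <- tune_grid(
  rf_wf,
  resamples = pl_cv,
  grid = rf_grid,
  control = control_grid(save_pred = TRUE)
)

stopCluster(cl)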

Get the best parameters

Both R and Python provide numerous objects to explore the output for each of the cross-validated folds. I’ve placed some examples in the respective code on the GITHUB page. For our purposes, we are most interested in the optimal number of variables and trees. Both coding languages found 4 and 400 to be the optimal number of variables and trees, respectively.
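In {tidymodels}, pulling the best combination might look like this (choosing accuracy as the comparison metric is an assumption):

best_params <- select_best(rf_tune, metric = "accuracy")
best_params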

Fitting the Final Model

Now that we have the optimal hyperparameter values, we can refit the model. In both {tidymodels} and Scikit learn, we’ll just refit a random forest with those optimal values specified.
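For example, on the R side:

final_rf_wf <- finalize_workflow(rf_wf, best_params)

## fit the finalized workflow to the full training set
final_rf_fit <- fit(final_rf_wf, data = pl_train)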

Variable Importance Plot

It’s helpful to see which variables were the most important contributors to the model’s predictions.
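On the R side, a quick sketch using the {vip} package (this assumes the importance option was set on the {ranger} engine, as above):

library(vip)

final_rf_fit %>%
  extract_fit_parsnip() %>%
  vip(num_features = 10)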

Side Note: This takes more code in python than in R. This is one of the drawbacks I’ve found with python compared to R. I can do things more efficiently and with less code in R than in python. I often find I have to work a lot harder in Scikit Learn to get model outputs and information about the model fit. It’s all in there but it is not clearly accessible (to me at least) and plotting in matplotlib is not as clean as plotting in ggplot2.

Get Model Predictions on the Test Set

Both languages offer some out of the box options for describing the model fit info. If you want more than this (which you should, because this isn’t much to go off of), then you’ll have to extract the predicted probabilities and the actual outcomes and code some additional analysis (potentially a future blog article).
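A rough sketch of pulling those predictions in R with {yardstick}:

## class and probability predictions on the test set
pl_test_preds <- predict(final_rf_fit, new_data = pl_test) %>%
  bind_cols(predict(final_rf_fit, new_data = pl_test, type = "prob")) %>%
  bind_cols(pl_test %>% select(FTR))

## overall accuracy and a confusion matrix
pl_test_preds %>%
  accuracy(truth = FTR, estimate = .pred_class)

pl_test_preds %>%
  conf_mat(truth = FTR, estimate = .pred_class)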

Save The Model

If we want to use this model for any downstream analysis we will need to save it.
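One simple option on the R side is to save the fitted workflow as an .rds file (the file name here is a placeholder):

saveRDS(final_rf_fit, "final_rf_workflow.rds")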

Load the Model and Make Predictions

Once we have the model saved we can load it and apply it to any new data that comes in. Here, our new data will just be a selection of rows from the original data set (we will pretend it is new).

NOTE: Python is 0 indexed while R is indexed starting at 1. So keep that in mind if selecting rows from the original data to make the same comparison in both languages.
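A sketch of that last step in R, reusing a few rows of the original data as our pretend new matches:

loaded_rf <- readRDS("final_rf_workflow.rds")

## pretend these rows are new matches
new_matches <- pl_matches[1:10, ]

## class and probability predictions on the new data
predict(loaded_rf, new_data = new_matches) %>%
  bind_cols(predict(loaded_rf, new_data = new_matches, type = "prob"))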

Wrapping Up

Both {tidymodels} and Scikit Learn provide users with powerful machine learning frameworks for conducting analysis. While the code syntax differs, the general concepts are the same, so bouncing between the two languages shouldn’t be too cumbersome. Hopefully this tutorial provided a nice overview of how to conduct the same analysis in both languages, offering a bridge for those trying to learn Python from R and vice versa.

All code is available on my GITHUB page.