TidyX 126: Keeping duplicates when pivoting wider

Working more on data engineering/data cleaning steps, Ellis and I talk about the pivot_wider() function within the {tidyverse}. One issue is that when pivoting a data set to a wide format, duplicate rows in the id columns cause pivot_wider() to collapse the values into list-columns. This can be problematic if you need to retain all rows of the original data set. So, Ellis and I discuss a method for using pivot_wider() while retaining all rows of data, even when there are duplicate values in the id_cols.
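
As a quick illustration of the problem (a minimal sketch with made-up data, and not necessarily the exact approach from the episode), pivot_wider() falls back to list-columns whenever the id columns don't uniquely identify each row, and numbering the repeats within each id is one way to keep every row:

library(tidyverse)

## toy data: athlete A has two sprint times, so the id column is duplicated
dat <- tibble(
  athlete = c("A", "A", "B"),
  test = c("sprint", "sprint", "sprint"),
  value = c(4.5, 4.6, 4.7)
)

## pivot_wider() warns and returns a list-column because of the duplicates
dat %>%
  pivot_wider(names_from = test, values_from = value)

## one possible fix: number the repeats within each id so every row stays unique
dat %>%
  group_by(athlete, test) %>%
  mutate(rep = row_number()) %>%
  ungroup() %>%
  pivot_wider(names_from = test, values_from = value)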

To view the screen cast, CLICK HERE.

To access our code, CLICK HERE.

The Role of Skill vs Luck in Team Sport Winning

This week on the Wharton Moneyball Podcast, the hosts were discussing World Cup results following the group play stage.

At one point, they talked about the variance in performance and the role that luck can play in winning or losing. They felt like they didn’t have as good an intuition about how variable the games were and one of the hosts, Eric Bradlow, said that he’d try and look into it and have an answer for next week’s episode.

The discussion reminded me of a chapter in one of my favorite books, The Success Equation by Michael Mauboussin. The book goes into nice detail about the role of skill and luck in sports, investing, and life. On page 78, Mauboussin provides the following equation:

Skill = Observed Outcome – Luck

From there, he explains the steps for determining the contribution of luck to winning in a variety of team sports. Basically, the contribution of luck is represented as the ratio of the variance of luck to the variance of observed outcomes.

To calculate the variance of observed outcomes, we find the win% of each team in a given season and calculate the standard deviation of that win%. Squaring this value gives us the variance of observed outcomes.

When calculating the variance of luck we first find the average win% of all of the teams and, treating this as a binomial, we calculate the standard deviation as sqrt((win% * (1 – win%)) / n_games), where n_games is the number of games in a season.
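
To make the arithmetic concrete, here is a minimal sketch with made-up numbers (a hypothetical league, not any of the leagues analyzed below):

## hypothetical league: average win% of 0.50, 82 games per season,
## and an observed standard deviation of team win% of 0.12
avg_win_pct <- 0.5
n_games <- 82

var_obs <- 0.12^2                                        # variance of observed win%
var_luck <- (avg_win_pct * (1 - avg_win_pct)) / n_games  # binomial variance of win%

## contribution of luck to winning
var_luck / var_obs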

I scraped data for the previous 3 complete seasons for the Premier League, NHL, NBA, NFL, and MLB from sports-reference.com. All of the data and code for calculating the contribution of luck to winning for these sports is available on my GitHub page.

Let’s walk through an example

Since the guys on Wharton Moneyball Podcast were talking about Soccer, I’ll use the Premier League as an example.

First, we create a function that does all the calculations for us. All we need to do is obtain the relevant summary statistics to feed into the function and then let it do all the work.

league_perf <- function(avg_win_pct, obs_sd, luck_sd, league){
  
  ## convert standard deviations to variance
  var_obs <- obs_sd^2
  var_luck <- luck_sd^2
  
  ## calculate the variation due to skill (variance of observed outcomes minus variance of luck)
  var_skill <- var_obs - var_luck
  
  ## calculate the contribution of luck to winning
  luck_contribution <- var_luck / var_obs
  
  ## Results table
  output <- tibble(
    league = {{league}},
    avg_win_pct = avg_win_pct,
    luck_contribution = luck_contribution
  )
  
  return(output)
  
}


Using the Premier League data set we calculate the games per season, league average win%, the observed standard deviation, and the luck standard deviation.

NOTE: I’m not much of a soccer guy, so I wasn’t sure how to handle draws. I read somewhere that they equate to about 1/3 of a win, so that is what I use here to credit a team for a draw.

## get info for function
# NOTE: Treating Draws as 1/3 of a win, since that reflects the points the team is awarded in the standings
premier_lg %>%
  select(Squad, MP, W, D) %>%
  mutate(win_pct = (W + 1/3*D) / MP) %>%
  summarize(games = max(MP),
            avg_win_pct = mean(win_pct),
            obs_sd = sd(win_pct),
            luck_sd = sqrt((avg_win_pct * (1 - avg_win_pct)) / games))

## Run the function
premier_lg_output <- league_perf(avg_win_pct = 0.462,
                                 obs_sd = 0.155,
                                 luck_sd = 0.0809,
                                 league = "Premier League")

premier_lg_output

Looking at our summary table, it appears that teams have had an average win% over the past 3 seasons of 46.2% (not quite the 50% we will see in the NBA, MLB, or NFL, since draws happen so frequently in soccer). The standard deviation of team win% was 15.5% (this is the observed standard deviation), while the luck standard deviation, 8.1%, is the binomial standard deviation using the average win% and the 38 games in a season.

Feeding these values into the function created above, we find that luck contributes approximately 27.2% to winning in the Premier League. In his book, Mauboussin found that the contribution of luck to winning in the Premier League was 31% from 2007 – 2011. I don’t feel like 27.2% is too far off of that, though I don’t have a good sense for what magnitude of change would be surprising. That said, the change could be due to improved skill in the Premier League (already one of the most skillful leagues in the world) or perhaps to how he handled draws, which was not discussed in the book and may differ from the approach I took here.

Let’s look at the table of results for all sports

Again, you can get the code for calculating the contribution of luck to winning in these sports from my GitHub page, so no need to rehash it all here. Instead, let’s go directly to the results.

  • NBA appears to be the most skill driven league with only about 15% of the contribution to winning coming from luck. Mauboussin found this value to be 12% from 2007 – 2011.
  • The NFL appears to be the most luck-driven league, with luck contributing 39% to winning. This is identical to what Mauboussin observed using NFL data from 2007 – 2011.
  • The two most surprising outcomes here are the MLB and NHL. Mauboussin found the MLB to have a 34% contribution of luck (2007 – 2011) and the NHL a 53% contribution of luck (2008 – 2012). However, in my table below it would appear that the contribution of luck has decreased substantially in these two sports, using data from the previous 3 seasons (NOTE: throwing out the COVID year doesn’t alter this drastically enough to make it close to what Mauboussin showed).

Digging Deeper into MLB and NHL

I pulled data from the exact same seasons that Mauboussin used in his book and obtained the following results.

  • The results I observed for the MLB are identical to what Mauboussin had in his book (pg. 80).
  • The results for the NHL are slightly off (47.7% compared to 53%), but this might have to do with how I am handling overtime wins and losses (which award teams points in the standings), as I don’t know enough about hockey to determine what value to assign to them. Perhaps Mauboussin addressed these two outcomes in his calculations in a more specific way.

Additional hypotheses:

  • Maybe skill has improved in hockey over the past decade and a half?
  • Maybe a change in tactics (e.g., the shift) or strategy (e.g., hitters trying to hit more home runs instead of just making contact or pitchers trying to explicitly train to increase max velocity) has taken some of the luck events out of baseball and turned it into more of a zero-sum duel between the batter and pitcher, making game outcomes more dependent on the skill of both players?
  • Maybe I have an error somewhere…let me know if you spot one!

Wrapping Up

Although we marvel at the skill of the athletes we watch in their arena of competition, it’s important to recognize that luck plays a role in the outcomes, and for some sports it plays more of a role than others. But luck is also what keeps us watching and makes things fun! Sometimes the ball bounces your way and you can steal a win! The one thing I always appreciate from the discussions on the Wharton Moneyball Podcast is that the hosts don’t shy away from explaining things like luck, randomness, variance, regression to the mean, and weighting observed outcomes with prior knowledge. This way of thinking isn’t just about sport; it’s about life. In this sense, we may consider sport to be the vehicle through which these hosts are teaching us about our world.

TidyX 125: Combining Multiple Conditions – Follow Up

In our previous episode, Ellis Hughes and I discussed a few approaches to creating a new column in your data which records multiple conditions that are based on data in a different column.

This week, we are following up on that episode because we got some great viewer feedback. One viewer provided an alternative approach to solving the problem and another viewer had a follow-up question about a problem he was solving, which had to do with making some joins to a separate table.

If you have questions or want some help on data engineering/data cleaning issues you are dealing with, be sure to reach out by liking and subscribing on the YouTube channel and dropping us a comment. We will try and answer your question in an episode!

To watch the screen cast, CLICK HERE.

To access our code, CLICK HERE.

tidymodels train/test set without initial_split

Introduction

In {tidymodels} it is often recommended to split the data using the initial_split() function. This is useful when you are interested in a random sample of the data, and the initial_split() function produces an object that is used downstream in the model fitting and model prediction process. However, sometimes we want to fit a model to a training set that we define ourselves and then test it on a test set that we define; for example, training a model on years 2010-2015 and then testing it on years 2016-2019.

This tutorial walks through creating your own bespoke train/test sets, fitting a model, and then making predictions, while circumventing the issues that may arise from not having the initial_split() object.

Load the AirQuality Data

library(tidyverse)
library(tidymodels)
library(datasets)

data("airquality")

airquality %>% count(Month)


Train/Test Split

We want to use {tidymodels} to build a model on months 5-7 and test the model on months 8 and 9.

Currently the initial_split() function only takes a random sample of the data.

set.seed(192)
split_rand <- initial_split(airquality, prop = 3/4)
split_rand

train_rand <- training(split_rand)
test_rand <- testing(split_rand)

train_rand %>%
  count(Month)

test_rand %>%
  count(Month)

The strata argument within initial_split() only ensures that we get an even sample across our strata (in this case, Month).

split_strata <- initial_split(airquality, prop = 3/4, strata = Month)
split_strata

train_strata <- training(split_strata)
test_strata <- testing(split_strata)

train_strata %>%
  count(Month)

test_strata %>%
  count(Month)

  • Create our own train/test split, unique to the conditions we are interested in specifying.
train <- airquality %>%
  filter(Month < 8)

test <- airquality %>%
  filter(Month >= 8)
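
As a quick sanity check (mirroring the counts above), we can confirm that the custom split puts months 5-7 in the training set and months 8-9 in the test set:

train %>%
  count(Month)

test %>%
  count(Month)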
  • Create 5-fold cross validation for tuning our random forest model
set.seed(567)
cv_folds <- vfold_cv(data = train, v = 5)

Set up the model specification

  • We will use a random forest
## model specification
aq_rf <- rand_forest(mtry = tune()) %>%
  set_engine("randomForest") %>%
  set_mode("regression")

Create a model recipe

There are some NA’s in a few of the columns. We will impute those and we will also normalize the three numeric predictors in our model.

## recipe
aq_recipe <- recipe(Ozone ~ Solar.R + Wind + Temp + Month, data = train) %>%
  step_impute_median(Ozone, Solar.R) %>%
  step_normalize(Solar.R, Wind, Temp)

aq_recipe

## check that normalization and NA imputation occurred in the training data
aq_recipe %>%
  prep() %>%
  bake(new_data = NULL)

## check that normalization and NA imputation occurred in the testing data
aq_recipe %>%
  prep() %>%
  bake(new_data = test)


Set up workflow

  • Compile all of the components above into a single workflow.
## Workflow
aq_workflow <- workflow() %>%
  add_model(aq_rf) %>%
  add_recipe(aq_recipe)

aq_workflow

Tune the random forest model

  • We set up one hyperparameter to tune, mtry, in our model specification.
## tuning grid
rf_tune_grid <- grid_regular(
  mtry(range = c(1, 4))
)

rf_tune_grid

rf_tune <- tune_grid(
  aq_workflow,
  resamples = cv_folds,
  grid = rf_tune_grid
)

rf_tune

Get the model with the optimum mtry

## view model metrics
collect_metrics(rf_tune)

## Which is the best model?
select_best(rf_tune, metric = "rmse")

  • Looks like mtry = 1 was the best option, as it had the lowest RMSE and highest R-squared.
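
As a side note (an alternative I'm adding here, not part of the original walkthrough), {tune} also provides finalize_workflow(), which plugs the selected hyperparameters straight into the existing workflow and should be equivalent to re-specifying the model by hand as done below:

## alternative: finalize the tuned workflow directly from the tuning results
best_mtry <- select_best(rf_tune, metric = "rmse")

aq_workflow_best <- finalize_workflow(aq_workflow, best_mtry)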

Fit the final tuned model

  • model specification with mtry = 1
aq_rf_tuned <- rand_forest(mtry = 1) %>%
  set_engine("randomForest") %>%
  set_mode("regression")

Tuned Workflow

  • the recipe steps are the same
aq_workflow_tuned <- workflow() %>%
  add_model(aq_rf_tuned) %>%
  add_recipe(aq_recipe) 

aq_workflow_tuned

Final Fit

aq_final <- aq_workflow_tuned %>%
  fit(data = train)

Evaluate the final model

aq_final %>%
  extract_fit_parsnip()

Predict on test set

ozone_pred_rf <- predict(
  aq_final,
  test
)

ozone_pred_rf
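
As an additional check (an assumption on my part; the original walkthrough stops at the predictions), we can score the random forest predictions against the observed Ozone values in the test months, dropping the rows where Ozone is missing since they can't be evaluated:

## bind the predictions to the test set and compute regression metrics
test %>%
  bind_cols(ozone_pred_rf) %>%
  filter(!is.na(Ozone)) %>%
  metrics(truth = Ozone, estimate = .pred)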

Conclusion

It is pretty easy to fit a model to a bespoke train/test split without using the {tidymodels} initial_split() function. Simply construct the model, do any hyperparameter tuning, fit a final model, and make predictions.

Below I’ve added the code for anyone interested in seeing this same process using linear regression, which is easier than the random forest model since there are no hyperparameters to tune.

If you’d like to have the full code in one concise place, check out my GITHUB page.

Doing the same tasks with linear regression

  • This is a bit easier since it doesn’t require hyperparameter tuning.


## Model specification
aq_linear <- linear_reg() %>%
  set_engine("lm") %>%
  set_mode("regression")

## Model Recipe (same as above)
aq_recipe <- recipe(Ozone ~ Solar.R + Wind + Temp + Month, data = train) %>%
  step_impute_median(Ozone, Solar.R) %>% 
  step_normalize(Solar.R, Wind, Temp)

## Workflow
aq_wf_linear <- workflow() %>%
  add_recipe(aq_recipe) %>%
  add_model(aq_linear)

## Fit the model to the training data
lm_fit <- aq_wf_linear %>%
  fit(data = train)

## Get the model output
lm_fit %>%
  extract_fit_parsnip()

## Model output with traditional summary() function
lm_fit %>%
  extract_fit_parsnip() %>% 
  .$fit %>%
  summary()

## Model output in tidy format  
lm_fit %>%
  tidy()

## Make predictions on test set
ozone_pred_lm <- predict(
  lm_fit,
  test
  )

ozone_pred_lm

TidyX Episode 124: Combining Multiple Conditions in a Single Column

Some of the best TidyX episodes are born out of questions from viewers or colleagues. This week, Ellis Hughes and I dip into the mailbag and answer a question from a colleague about identifying multiple conditions.

The colleague had a data set with specific observations in one column. Their goal was to create a new column which put those observations into one of several conditional groups. The issue they were running into was that case_when() and ifelse() only return the first condition that is met. The colleague was looking for a way to build the conditional groups column as a comma-separated string, so that some of the observations could match multiple conditional groups.

In this episode, we solve the problem in three different ways.
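
As a rough illustration of the general idea (a minimal sketch with made-up data, and likely not identical to any of the three approaches in the episode), one way to get a comma-separated column of every matching group is to flag each condition separately and then unite the matching labels:

library(tidyverse)

## toy data: exercises that may belong to more than one movement group
dat <- tibble(
  exercise = c("back squat", "bench press", "power clean")
)

dat %>%
  mutate(
    lower_body = if_else(str_detect(exercise, "squat|clean"), "lower body", NA_character_),
    upper_body = if_else(str_detect(exercise, "press"), "upper body", NA_character_),
    explosive = if_else(str_detect(exercise, "clean"), "explosive", NA_character_)
  ) %>%
  unite("groups", lower_body:explosive, sep = ", ", na.rm = TRUE)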

To watch our screen cast, CLICK HERE.

To access our code, CLICK HERE.