Author Archives: Patrick

TidyX Episode 176: Extracting all predictions from a Random Forest, Building Uncertainty Intervals, and Creating a Player Compare Tool

In this installment of predicting whether MLB pitchers will make the Hall of Fame, Ellis Hughes and I take the Random Forest model we built in Episode 175 and discuss how you can extract the predictions from each of the individual trees and then build uncertainty intervals so that you can construct an entire distribution of predictions. We then take these predictions and build a custom function for comparing the predicted distribution of two players and whether they will make it to the Hall of Fame.

To watch the screen cast, CLICK HERE.

To access our code, CLICK HERE.

TidyX Episode 175: Predicting Hall of Fame Pitchers using Random Forests

Ellis Hughes and I continue to work with the MLB pitcher data, courtesy of the {Lahman} baseball package.

This week we walk through using a random forest model to calculate the probability a pitcher will make it to the Hall of Fame given several different performance stats.

In this episode we cover:

  • Splitting data into training and testing sets
  • Splitting training sets into cross validated folds
  • Using {tidyverse} and {purrr} to construct a tuning grid and tune the random forest models to identify the optimal mtry and ntrees for the prediction task
  • Fitting a final model with the optimized parameters and exploring predictions

To watch our screen cast, CLICK HERE.

To access our code, CLICK HERE.

Bayes Theorem: p(A|B) = p(A) * p(B|A) / p(B) — But why?

A student recently asked me if I could show them why Bayes Theorem looks the way it does. If we are trying to determine the probability A given B, how did we arrive at the theorem in the title?

Let’s create some data and see if we can sort it out. We will simulate a 2×2 table, similar to one we might find in sports medicine journals when looking at some type of test (positive or negative) and some type of outcome (disease or no disease).

dat_tbl <- data.frame(
  test = c("positive", "negative", "total"),
  disease = c(25, 5, 30),
  no_disease = c(3, 40, 43),
  total = c(28, 45, 73)
)

dat_tbl

A logical question we’d like to answer here is, “What is the probability of disease given a positive test.” Written in probability notation we are asking, p(Disease | Positive).

Using the data in the table, we can quickly compute this as 25 people who were positive and had the disease divided by 28 total positive tests. 25/28 = 89.3%

Of course, we could also compute this using Bayes Theorem:

p(Disease | Positive) = p(Disease) * p(Positive | Disease) / p(Positive)

We store the necessary values in R objects and then compute the result

### What we want to know: p(Disease | Positive)
# p(Disease | Positive) = p(Disease) * p(Positive | Disease) / p(Positive)
# p(A | B) = p(A) * p(B|A) / p(B)

p_disease <- 30/73
p_positive_given_disease <- 25/30
p_positive <- 28/73

p_no_disease <- 43/73
p_positive_given_no_disease <- 3/43

p_disease_given_positive <- (p_disease * p_positive_given_disease) / (p_disease * p_positive_given_disease + p_no_disease * p_positive_given_no_disease) 
p_disease_given_positive

This checks out. We get the exact same result as when we did 25/28.

Okay, how did we get here? Why does it work out like that?

The math works out because we start with two different joint probabilities, p(A n B) and p(B n A). Or, in our case, p(Disease n Positive) and p(Positive n Disease). Formally, we can write these as follows:

We’ve already stored the necessary probabilities in specific elements, above. But here it what they both look like using our 2×2. First I’ll calculate it with the counts directly from the table and then calculate it with the R elements that we stored. You’ll see the answers are the same.

## Joint Probability 1: p(Positive n Disease) = p(Positive | Disease) * p(Disease)

25/30 * 30/73

p_positive_given_disease * p_disease

## Joint Probability 2: p(Disease n Positive) = p(Disease | Positive) * p(Positive)

25/28 * 28/73

p_disease_given_positive * p_positive

Well, would you look at that! The two joint probabilities are equal to each other. We can formally test that they are equal to each other by setting up a logical equation in R.

## These two joint probabilities are equal!
#  p(Positive n Disease) = p(Disease n Positive)

(p_positive_given_disease * p_disease) == (p_disease_given_positive * p_positive)

So, if they are equal, what we are saying is this:

p(Disease | Positive) * p(Positive) = p(Positive | Disease) * p(Disease)

Now, with some algebra, we can divide the right side of the equation by p(Positive) and we are left with Bayes Theorem for our problem:

p(Disease | Positive) = p(Disease) * p(Positive | Disease) / p(Positive)

Putting it altogether, it looks like this:

So, all we’ve done is taken two joint probabilities and used some algebra to arrange the terms in order to get us to the conditional probability we were interested in, p(Disease | Positive) and we end up with Bayes Theorem.

TidyX Episode 173: Predicting Hall of Fame MLB Pitchers with Bayesian Logistic Regression

Last week, Ellis Hughes and I did our “Teach me something in R in 20min or less” series on predicting hall of fame MLB pitchers using logistic regression. This week, we complete the same task but instead do it using Bayesian logistic regression with {rstanarm}. We don’t get into priors or anything like that (simply use the default, weakly informative, priors that are specified in the function) since it is a 20min or less episode. However, we do cover:

  • Specifying the syntax of the model
  • Plotting the model results
  • Making out of sample predictions
  • Plotting posterior distributions for individual players

To watch the screen cast, CLICK HERE.

To access the code, CLICK HERE.