# TidyX Episode 175: Predicting Hall of Fame Pitchers using Random Forests

Ellis Hughes and I continue to work with the MLB pitcher data, courtesy of the {Lahman} baseball package.

This week we walk through using a random forest model to calculate the probability that a pitcher will make the Hall of Fame, given several different performance stats.

In this episode we cover:

• Splitting data into training and testing sets
• Splitting training sets into cross validated folds
• Using {tidyverse} and {purrr} to construct a tuning grid and tune the random forest models to identify the optimal mtry and ntrees for the prediction task
• Fitting a final model with the optimized parameters and exploring predictions
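The splitting and tuning-grid steps can be sketched in base R. This is a minimal sketch with a simulated stand-in data frame (hypothetical `x` and `hof` columns); the episode itself uses {tidyverse} and {purrr} to map model fits over the grid.

```r
set.seed(42)

# Simulated stand-in for the pitcher data (hypothetical columns)
dat <- data.frame(x = rnorm(100), hof = rbinom(100, 1, 0.3))

# Split into training (75%) and testing (25%) sets
train_idx <- sample(seq_len(nrow(dat)), size = floor(0.75 * nrow(dat)))
train_set <- dat[train_idx, ]
test_set  <- dat[-train_idx, ]

# Assign each training row to one of 5 cross-validation folds
train_set$fold <- sample(rep(1:5, length.out = nrow(train_set)))

# Tuning grid of candidate mtry and ntree values to iterate over
tune_grid <- expand.grid(mtry = 1:3, ntree = c(100, 300, 500))
nrow(tune_grid)  # 9 parameter combinations
```

Each row of `tune_grid` would then be fit against the held-out fold and scored, with the best-scoring `mtry`/`ntree` pair used for the final model.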

# TidyX Episode 174: Forecasting MLB Hall of Fame Pitchers using AI

Continuing to build on our modeling strategy from the previous few weeks, this week Ellis Hughes and I show how to set up a neural network prediction model in R using {tensorflow}.

# Bayes Theorem: p(A|B) = p(A) * p(B|A) / p(B) — But why?

A student recently asked me if I could show them why Bayes Theorem looks the way it does. If we are trying to determine the probability A given B, how did we arrive at the theorem in the title?

Let’s create some data and see if we can sort it out. We will simulate a 2×2 table, similar to one we might find in sports medicine journals when looking at some type of test (positive or negative) and some type of outcome (disease or no disease).

```r
dat_tbl <- data.frame(
  test = c("positive", "negative", "total"),
  disease = c(25, 5, 30),
  no_disease = c(3, 40, 43),
  total = c(28, 45, 73)
)

dat_tbl
```

A logical question we’d like to answer here is, “What is the probability of disease given a positive test?” Written in probability notation, we are asking for p(Disease | Positive).

Using the data in the table, we can quickly compute this as the 25 people who tested positive and had the disease divided by the 28 total positive tests: 25/28 ≈ 89.3%.
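That direct calculation is a one-liner from the table’s counts:

```r
# p(Disease | Positive): 25 positives with the disease out of 28 positive tests
round(25 / 28, 3)  # 0.893
```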

Of course, we could also compute this using Bayes Theorem:

p(Disease | Positive) = p(Disease) * p(Positive | Disease) / p(Positive)

We store the necessary values in R objects and then compute the result:

```r
### What we want to know: p(Disease | Positive)
# p(Disease | Positive) = p(Disease) * p(Positive | Disease) / p(Positive)
# p(A | B) = p(A) * p(B | A) / p(B)

p_disease <- 30/73
p_positive_given_disease <- 25/30
p_positive <- 28/73

p_no_disease <- 43/73
p_positive_given_no_disease <- 3/43

# The denominator expands p(Positive) via the law of total probability:
# p(Positive) = p(Positive | Disease) * p(Disease) + p(Positive | No Disease) * p(No Disease)
p_disease_given_positive <- (p_disease * p_positive_given_disease) /
  (p_disease * p_positive_given_disease + p_no_disease * p_positive_given_no_disease)
p_disease_given_positive
```

This checks out. We get the exact same result as when we did 25/28.

Okay, how did we get here? Why does it work out like that?

The math works out because we start with two factorizations of the same joint probability, p(A n B) and p(B n A). Or, in our case, p(Disease n Positive) and p(Positive n Disease). Formally, we can write these as follows:

p(Positive n Disease) = p(Positive | Disease) * p(Disease)

p(Disease n Positive) = p(Disease | Positive) * p(Positive)

We’ve already stored the necessary probabilities in R objects above, but here is what they both look like using our 2×2 table. First I’ll calculate each joint probability with the counts directly from the table, and then with the R objects we stored. You’ll see the answers are the same.

```r
## Joint Probability 1: p(Positive n Disease) = p(Positive | Disease) * p(Disease)

25/30 * 30/73

p_positive_given_disease * p_disease

## Joint Probability 2: p(Disease n Positive) = p(Disease | Positive) * p(Positive)

25/28 * 28/73

p_disease_given_positive * p_positive
```

Well, would you look at that! The two joint probabilities are equal to each other. We can formally test that they are equal to each other by setting up a logical equation in R.

```r
## These two joint probabilities are equal!
#  p(Positive n Disease) = p(Disease n Positive)

# all.equal() guards against floating-point rounding, which can make == return FALSE
isTRUE(all.equal(p_positive_given_disease * p_disease,
                 p_disease_given_positive * p_positive))
```

So, if they are equal, what we are saying is this:

p(Disease | Positive) * p(Positive) = p(Positive | Disease) * p(Disease)

Now, with some algebra, we can divide both sides of the equation by p(Positive) and we are left with Bayes Theorem for our problem:

p(Disease | Positive) = p(Disease) * p(Positive | Disease) / p(Positive)

Putting it all together:

p(Disease | Positive) * p(Positive) = p(Positive | Disease) * p(Disease)

p(Disease | Positive) = p(Disease) * p(Positive | Disease) / p(Positive)

So, all we’ve done is take two joint probabilities and use some algebra to rearrange the terms until we reach the conditional probability we were interested in, p(Disease | Positive). We end up with Bayes Theorem.
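The whole rearrangement can be verified numerically with the same probabilities from the table:

```r
# Probabilities from the 2x2 table
p_disease <- 30/73
p_positive_given_disease <- 25/30
p_positive <- 28/73

# Bayes Theorem: p(Disease | Positive) = p(Disease) * p(Positive | Disease) / p(Positive)
bayes_result <- p_disease * p_positive_given_disease / p_positive

# Matches the direct calculation from the table, 25/28
isTRUE(all.equal(bayes_result, 25/28))  # TRUE
```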

# TidyX Episode 173: Predicting Hall of Fame MLB Pitchers with Bayesian Logistic Regression

Last week, Ellis Hughes and I did our “Teach me something in R in 20min or less” series on predicting Hall of Fame MLB pitchers using logistic regression. This week, we complete the same task but instead do it using Bayesian logistic regression with {rstanarm}. We don’t get into priors or anything like that (we simply use the default, weakly informative priors specified in the function) since it is a 20min-or-less episode. However, we do cover:

• Specifying the syntax of the model
• Plotting the model results
• Making out-of-sample predictions
• Plotting posterior distributions for individual players