Bayes' Theorem: p(A|B) = p(A) * p(B|A) / p(B). But why?

A student recently asked me if I could show them why Bayes' Theorem looks the way it does. If we are trying to determine the probability of A given B, how did we arrive at the theorem in the title?

Let’s create some data and see if we can sort it out. We will simulate a 2×2 table, similar to one we might find in sports medicine journals when looking at some type of test (positive or negative) and some type of outcome (disease or no disease).

dat_tbl <- data.frame(
  test = c("positive", "negative", "total"),
  disease = c(25, 5, 30),
  no_disease = c(3, 40, 43),
  total = c(28, 45, 73)
)
 
dat_tbl

A logical question we’d like to answer here is, “What is the probability of disease given a positive test?” Written in probability notation, we are asking for p(Disease | Positive).

Using the data in the table, we can quickly compute this as the 25 people who tested positive and had the disease divided by the 28 total positive tests: 25/28 = 89.3%.
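
We can also pull those counts straight from the dat_tbl object we created, as a quick check:

## p(Disease | Positive) computed directly from the 2x2 table
with(dat_tbl, disease[test == "positive"] / total[test == "positive"])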

Of course, we could also compute this using Bayes' Theorem:

p(Disease | Positive) = p(Disease) * p(Positive | Disease) / p(Positive)

We can store the necessary values in R objects and then compute the result:

### What we want to know: p(Disease | Positive)
# p(Disease | Positive) = p(Disease) * p(Positive | Disease) / p(Positive)
# p(A | B) = p(A) * p(B|A) / p(B)

p_disease <- 30/73
p_positive_given_disease <- 25/30
p_positive <- 28/73

p_no_disease <- 43/73
p_positive_given_no_disease <- 3/43

## The denominator below is p(Positive) expanded via the law of total probability:
## p(Positive) = p(Positive | Disease) * p(Disease) + p(Positive | No Disease) * p(No Disease)
p_disease_given_positive <- (p_disease * p_positive_given_disease) / (p_disease * p_positive_given_disease + p_no_disease * p_positive_given_no_disease)
p_disease_given_positive

This checks out. We get the exact same result as when we did 25/28.

Okay, how did we get here? Why does it work out like that?

The math works out because we start with two different joint probabilities, p(A ∩ B) and p(B ∩ A). Or, in our case, p(Disease ∩ Positive) and p(Positive ∩ Disease). Formally, we can write these as follows:

p(Positive ∩ Disease) = p(Positive | Disease) * p(Disease)

p(Disease ∩ Positive) = p(Disease | Positive) * p(Positive)

We’ve already stored the necessary probabilities in specific R objects, above. But here is what they both look like using our 2×2 table. First I’ll calculate each one with the counts directly from the table and then with the R objects that we stored. You’ll see the answers are the same.

## Joint Probability 1: p(Positive n Disease) = p(Positive | Disease) * p(Disease)
 
25/30 * 30/73
 
p_positive_given_disease * p_disease
 
## Joint Probability 2: p(Disease n Positive) = p(Disease | Positive) * p(Positive)
 
25/28 * 28/73
 
p_disease_given_positive * p_positive

Well, would you look at that! The two joint probabilities are equal. We can formally test this by setting up a logical comparison in R.

## These two joint probabilities are equal!
#  p(Positive n Disease) = p(Disease n Positive)

## all.equal() is used instead of == to avoid floating point precision issues
isTRUE(all.equal(p_positive_given_disease * p_disease, p_disease_given_positive * p_positive))

So, if they are equal, what we are saying is this:

p(Disease | Positive) * p(Positive) = p(Positive | Disease) * p(Disease)

Now, with some algebra, we can divide both sides of the equation by p(Positive) and we are left with Bayes' Theorem for our problem:

p(Disease | Positive) = p(Disease) * p(Positive | Disease) / p(Positive)
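
Since we stored p_positive earlier, we can confirm this form of the equation directly in R. It returns the same 25/28 = 89.3% as before:

## Bayes' Theorem with p(Positive) used directly in the denominator
(p_disease * p_positive_given_disease) / p_positive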

Putting it all together, it looks like this:

p(Disease ∩ Positive) = p(Positive ∩ Disease)

p(Disease | Positive) * p(Positive) = p(Positive | Disease) * p(Disease)

p(Disease | Positive) = p(Positive | Disease) * p(Disease) / p(Positive)

So, all we’ve done is taken two joint probabilities and used some algebra to rearrange the terms to get to the conditional probability we were interested in, p(Disease | Positive), and we end up with Bayes' Theorem.

TidyX Episode 173: Predicting Hall of Fame MLB Pitchers with Bayesian Logistic Regression

Last week, Ellis Hughes and I did our “Teach me something in R in 20min or less” series on predicting Hall of Fame MLB pitchers using logistic regression. This week, we complete the same task but instead do it using Bayesian logistic regression with {rstanarm}. We don’t get into priors or anything like that (we simply use the default, weakly informative priors specified in the function) since it is a 20min-or-less episode. However, we do cover the following (a minimal sketch of the workflow follows the list):

  • Specifying the syntax of the model
  • Plotting the model results
  • Making out-of-sample predictions
  • Plotting posterior distributions for individual players
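
Here is that minimal sketch. This is not the episode code; the pitchers data frame, the hof outcome, and the era and so predictors are hypothetical stand-ins, and the model keeps the {rstanarm} defaults:

library(rstanarm)

## Hypothetical toy data standing in for the real pitcher data in the episode
set.seed(1)
pitchers <- data.frame(
  hof = rbinom(100, size = 1, prob = 0.2),
  era = rnorm(100, mean = 3.8, sd = 0.6),
  so = rnorm(100, mean = 1500, sd = 600)
)

## Bayesian logistic regression, keeping the default weakly informative priors
fit <- stan_glm(hof ~ era + so,
                data = pitchers,
                family = binomial(link = "logit"))

## Plot the model results (posterior intervals for the coefficients)
plot(fit)

## Out-of-sample predictions: posterior draws of p(Hall of Fame) for new players
new_pitchers <- data.frame(era = c(2.9, 4.1), so = c(3100, 1500))
post_probs <- posterior_epred(fit, newdata = new_pitchers)

## Posterior distribution of the predicted probability for an individual player
hist(post_probs[, 1], main = "Posterior p(HOF) for Player 1", xlab = "Probability")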

To watch the screencast, CLICK HERE.

To access the code, CLICK HERE.

Calculating Confidence Intervals in R

A favorite paper of mine is the 1986 paper by Gardner and Altman regarding confidence intervals and estimation as a more useful way of reporting data than a dichotomous p-value:

Gardner MJ, Altman DG (1986). Confidence intervals rather than P values: Estimation rather than hypothesis testing. Br Med J; 292:746-750.

In this paper, Gardner and Altman discuss three main points for either moving away from or supplementing statistical reporting with p-values:

  1. Research often focuses on null hypothesis significance testing with the goal being to identify statistically significant results.
  2. However, we are often interested in the magnitude of the factor of interest.
  3. Given that research deals with samples of a broader population, readers are interested not only in the observed magnitude of the estimand but also in its variability and the plausible range of values for the population. This variability can be quantified using confidence intervals.

Aside from the paper providing a clear explanation of the issue at hand, their appendix offers the equations to calculate confidence intervals for means, differences in means, proportions, and differences in proportions. Thus, I decided to compile the appendix in an R script for those looking to code confidence intervals (and not have to rely on pre-built functions).

All of this code is available on my GITHUB page.

Confidence Intervals for Means and Differences

Single Sample

  1. Obtain the mean
  2. Calculate the standard error (SE) of the mean as SE = SD/sqrt(N)
  3. Multiply the SE by a t-critical value specific to the level of confidence of interest and the degrees of freedom (DF) for the single sample, DF = N – 1

The confidence intervals are calculated as:

Low = mean – t_{crit} * SE
High = mean + t_{crit} * SE

Example

We collect data on 30 participants on a special test of strength and observe a mean of 40 and standard deviation of 10. We want to calculate the 90% confidence interval.

The steps above can easily be computed in R. First, we write down the known info (the data provided to us). We then calculate the standard error and the degrees of freedom (N – 1). To obtain the critical value for our desired level of confidence, we use the t-distribution specific to the degrees of freedom of our data.

## Known info
N <- 30
avg <- 40
SD <- 10
 
## Calculate DF and SE
SE <- SD / sqrt(N)
DF <- N - 1
 
## Calculate the confidence level of interest (90%), the amount of data in each of the two tails ((1 - level of interest) / 2), and the t-critical value
level_of_interest <- 0.90
tail_value <- (1 - level_of_interest)/2
 
t_crit <- abs(qt(tail_value, df = DF))
t_crit
 
## Calculate the 90% CI
low90 <- round(avg - t_crit * SE, 1)
high90 <- round(avg + t_crit * SE, 1)
 
cat("The 90% Confidence Interval is:", low90, " to ", high90)

Two Samples

  1. Obtain the sample means and standard deviations for the two samples
  2. Pool the estimate of the standard deviation: s = sqrt(((n_1 – 1)*s_1^2 + (n_2 – 1)*s_2^2) / (n_1 + n_2 – 2))
  3. Calculate the SE for the difference: SE_{diff} = s * sqrt(1/n_1 + 1/n_2)
  4. Calculate the confidence interval as:

Low = (x_1 – x_2) – t_{crit} * SE_{diff}
High = (x_1 – x_2) + t_{crit} * SE_{diff}

Example

The example in the paper provides the following info:

  • Blood pressure levels were measured in 100 diabetic and 100 non-diabetic men aged 40-49 years old.
  • Mean systolic blood pressure was 146.4 mmHg (SD = 18.5) in diabetics and 140.4 mmHg (SD = 16.8) in non-diabetics.

Calculate the 95% CI.

## store the known data
N_diabetic <- 100
N_non_diabetic <- 100
diabetic_avg <- 146.4
diabetic_sd <- 18.5
non_diabetic_avg <- 140.4
non_diabetic_sd <- 16.8
 
## calculate the difference in means, the pooled SD, and the SE of diff
group_diff <- diabetic_avg - non_diabetic_avg
 
pooled_sd <- sqrt(((N_diabetic - 1)*diabetic_sd^2 + (N_non_diabetic - 1)*non_diabetic_sd^2) / (N_diabetic + N_non_diabetic - 2))
 
se_diff <- pooled_sd * sqrt(1/N_diabetic + 1/N_non_diabetic)
 
## Calculate the confidence level of interest (95%), the amount of data in each of the two tails ((1 - level of interest) / 2), and the t-critical value
level_of_interest <- 0.95
tail_value <- (1 - level_of_interest)/2
 
t_crit <- abs(qt(tail_value, df = N_diabetic + N_non_diabetic - 2))
t_crit
 
## Calculate the 95% CI
low95 <- round(group_diff - t_crit * se_diff, 1)
high95 <- round(group_diff + t_crit * se_diff, 1)
 
cat("The 95% Confidence Interval is:", low95, " to ", high95)

Confidence Intervals for Proportions

Single Sample

  1. Obtain the proportion for the sample
  2. Calculate the SE of the proportion, SE = sqrt((p * (1-p)) / N)
  3. Obtain the z-critical value from a standard normal distribution for the level of confidence of interest (unlike the t-critical value used for means, this does not depend on the sample size)
  4. Calculate the confidence interval:

low = p – z_{crit} * SE
high = p + z_{crit} * SE

Example

We observe a basketball player with 80 field goal attempts and a FG% of 39%. Calculate the 95% CI.

## Store the known info
N <- 80
fg_pct <- 0.39
 
## Calculate SE
se <- sqrt((fg_pct * (1 - fg_pct)) / N)
 
## Calculate z-critical value for 95% confidence
level_of_interest <- 0.95
tail_value <- (1 - level_of_interest) / 2
z_crit <- qnorm(p = tail_value, lower.tail = FALSE)
 
## Calculate the 95% CI
low95 <- round(fg_pct - z_crit * se, 3)
high95 <- round(fg_pct + z_crit * se, 3)
 
cat("The 95% Confidence Interval is:", low95, " to ", high95)

Two Samples

  1. Calculate the difference in proportions between the two groups
  2. Calculate the SE of the difference in proportions: SE_{diff} = sqrt(((p_1 * (1-p_1)) / n_1) + ((p_2 * (1 – p_2)) / n_2))
  3. Calculate the z-critical value for the level of interest
  4. Calculate the confidence interval as:

low = (p_1 – p_2) – (z_{crit} * se_{diff})
high = (p_1 – p_2) + (z_{crit} * se_{diff})

Example of two unpaired samples

The study provides the following table of example data:

data.frame(
  response = c("improvement", "no improvement", "total"),
  treatment_A = c(61, 19, 80),
  treatment_B = c(45, 35, 80)
)

The difference we are interested in is between the proportion who improved on treatment A and the proportion who improved on treatment B.

## Obtain the two proportions from the table and total sample sizes
pr_A <- 61/80
n_A <- 80
pr_B <- 45/80
n_B <- 80
 
## calculate the difference in proportions
diff_pr <- pr_A - pr_B
 
## Calculate the SE
se_diff <- sqrt((pr_A * (1 - pr_A))/n_A + (pr_B * (1 - pr_B))/n_B)
 
## Get z-critical value for 95% confidence
level_of_interest <- 0.95
tail_value <- (1 - level_of_interest) / 2
z_crit <- qnorm(p = tail_value, lower.tail = FALSE)
 
## Calculate the 95% CI
low95 <- round(diff_pr - z_crit * se_diff, 3)
high95 <- round(diff_pr + z_crit * se_diff, 3)
 
cat("The 95% Confidence Interval is:", low95, " to ", high95)

Example of two paired samples

We can organize the data in a table like this:

data.frame(
  test_1 = c("Present", "Present", "Absent", "Absent"),
  test_2 = c("Present", "Absent", "Present", "Absent"),
  number_of_subjects = c("a", "b", "c", "d")
)

Let’s say we measured a group of subjects for a specific disease twice in a study. A subject either has the disease (Present) or does not (Absent) at each of the two time points. We observe the following data:

dat <- data.frame(
  test_1 = c("Present", "Present", "Absent", "Absent"),
  test_2 = c("Present", "Absent", "Present", "Absent"),
  number_of_subjects = c(10, 25, 45, 5)
)
 
dat
 
## total sample size
N <- sum(dat$number_of_subjects)

If we want to compare the proportion with the disease present on Test 1 to the proportion with it present on Test 2, we calculate the two proportions as:

p_1 = (a + b) / N
p_2 = (a + c) / N
Diff = p_1 – p_2

The SE of the difference is:

SE_{diff} = 1/N * sqrt(b + c – (b-c)^2/N)

## Obtain the info we need from the table
p1 <- (10 + 25) / N
p2 <- (10 + 45) / N
diff_prop <- p1 - p2
diff_prop
 
## Calculate the SE of the difference
se_diff <- 1 / N * sqrt(25 + 45 - (25 - 45)^2 / N)  # b = 25, c = 45; note the formula uses (b - c)^2
 
## Get z-critical value for 95% confidence
level_of_interest <- 0.95
tail_value <- (1 - level_of_interest) / 2
z_crit <- qnorm(p = tail_value, lower.tail = FALSE)
 
## Calculate the 95% CI
low95 <- round(diff_prop - z_crit * se_diff, 3)
high95 <- round(diff_prop + z_crit * se_diff, 3)
 
cat("The 95% Confidence Interval is:", low95, " to ", high95)

As always, all of this code is available on my GITHUB page.

And, if you’d like to read the full paper, you can find it here:

Gardner MJ, Altman DG (1986). Confidence intervals rather than P values: Estimation rather than hypothesis testing. Br Med J; 292:746-750.