In this installment of “What can we learn in R in 20min?”, Ellis Hughes and I cover building a logistic regression model to predict Hall of Fame MLB Pitchers.
To watch the screen cast, CLICK HERE.
To access our code, CLICK HERE.
A favorite paper of mine is the 1986 paper by Gardner and Altman regarding confidence intervals and estimation as a more useful way of reporting data than a dichotomous p-value:
In this paper, Gardner and Altman discuss their main arguments for either moving away from, or supplementing, statistical reporting based solely on p-values.
Aside from the paper providing a clear explanation of the issue at hand, their appendix offers the equations to calculate confidence intervals for means, differences in means, proportions, and differences in proportions. Thus, I decided to compile the appendix in an R script for those looking to code confidence intervals (and not have to rely on pre-built functions).
All of this code is available on my GITHUB page.
Confidence Intervals for Means and Differences
Single Sample
The confidence intervals are calculated as:
Low = mean - t_{crit} * SE
High = mean + t_{crit} * SE
Example
We collect data on 30 participants on a special test of strength and observe a mean of 40 and standard deviation of 10. We want to calculate the 90% confidence interval.
The steps above can easily be computed in R. First, we write down the known info (the data that is provided to us). We then calculate the standard error and degrees of freedom (N - 1). To obtain the critical value for our desired confidence level, we use the t-distribution specific to the degrees of freedom of our data.
## Known info
N <- 30
avg <- 40
SD <- 10
## Calculate DF and SE
SE <- SD / sqrt(N)
DF <- N - 1
## Calculate the confidence level of interest (90%), the amount of data in each of the two tails ((1 - level of interest) / 2), and the t-critical value
level_of_interest <- 0.90
tail_value <- (1 - level_of_interest)/2
t_crit <- abs(qt(tail_value, df = DF))
t_crit
## Calculate the 90% CI
low90 <- round(avg - t_crit * SE, 1)
high90 <- round(avg + t_crit * SE, 1)
cat("The 90% Confidence Interval is:", low90, " to ", high90)
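The steps above can also be wrapped into a small reusable function. Below is a sketch of my own; mean_ci() is a name I made up, not a base R function or anything from the paper:

```r
## Sketch of a helper that wraps the single-sample CI steps into one function
## (mean_ci is my own name, not from the paper or base R)
mean_ci <- function(avg, SD, N, level = 0.90) {
  SE <- SD / sqrt(N)                              # standard error of the mean
  t_crit <- abs(qt((1 - level) / 2, df = N - 1))  # t-critical value
  c(low = avg - t_crit * SE, high = avg + t_crit * SE)
}

mean_ci(avg = 40, SD = 10, N = 30, level = 0.90)
```

This reproduces the 36.9 to 43.1 interval from the step-by-step code above.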
Two Samples
Low = (x_1 - x_2) - t_{crit} * SE_{diff}
High = (x_1 - x_2) + t_{crit} * SE_{diff}
Here, SE_{diff} = SD_{pooled} * sqrt(1/N_1 + 1/N_2), where the pooled standard deviation is SD_{pooled} = sqrt(((N_1 - 1) * SD_1^2 + (N_2 - 1) * SD_2^2) / (N_1 + N_2 - 2)).
Example
The example in the paper compares two independent groups of 100 subjects each: a diabetic group with a mean of 146.4 (SD = 18.5) and a non-diabetic group with a mean of 140.4 (SD = 16.8). Calculate the 95% CI for the difference in means.
## store the known data
N_diabetic <- 100
N_non_diabetic <- 100
diabetic_avg <- 146.4
diabetic_sd <- 18.5
non_diabetic_avg <- 140.4
non_diabetic_sd <- 16.8
## calculate the difference in means, the pooled SD, and the SE of diff
group_diff <- diabetic_avg - non_diabetic_avg
pooled_sd <- sqrt(((N_diabetic - 1)*diabetic_sd^2 + (N_non_diabetic - 1)*non_diabetic_sd^2) / (N_diabetic + N_non_diabetic - 2))
se_diff <- pooled_sd * sqrt(1/N_diabetic + 1/N_non_diabetic)
## Calculate the confidence level of interest (95%), the amount of data in each of the two tails ((1 - level of interest) / 2), and the t-critical value
level_of_interest <- 0.95
tail_value <- (1 - level_of_interest)/2
t_crit <- abs(qt(tail_value, df = N_diabetic + N_non_diabetic - 2))
t_crit
## Calculate the 95% CI
low95 <- round(group_diff - t_crit * se_diff, 1)
high95 <- round(group_diff + t_crit * se_diff, 1)
cat("The 95% Confidence Interval is:", low95, " to ", high95)
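These two-sample steps can likewise be collected into a single function. Again, this is a sketch of my own; mean_diff_ci() is my name, not from the paper:

```r
## Sketch of a two-sample CI helper (mean_diff_ci is my own name)
mean_diff_ci <- function(avg1, sd1, n1, avg2, sd2, n2, level = 0.95) {
  diff <- avg1 - avg2
  ## pooled SD across the two groups
  pooled_sd <- sqrt(((n1 - 1) * sd1^2 + (n2 - 1) * sd2^2) / (n1 + n2 - 2))
  se_diff <- pooled_sd * sqrt(1/n1 + 1/n2)
  t_crit <- abs(qt((1 - level) / 2, df = n1 + n2 - 2))
  c(low = diff - t_crit * se_diff, high = diff + t_crit * se_diff)
}

mean_diff_ci(146.4, 18.5, 100, 140.4, 16.8, 100)
```

This reproduces the 1.1 to 10.9 interval from the step-by-step code above.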
Confidence Intervals for Proportions
Single Sample
low = p - z_{crit} * SE
high = p + z_{crit} * SE
Example
We observe a basketball player with 80 field goal attempts and a FG% of 39%. Calculate the 95% CI.
## Store the known info
N <- 80
fg_pct <- 0.39
## Calculate SE
se <- sqrt((fg_pct * (1 - fg_pct)) / N)
## Calculate z-critical value for 95% confidence
level_of_interest <- 0.95
tail_value <- (1 - level_of_interest) / 2
z_crit <- qnorm(p = tail_value, lower.tail = FALSE)
## Calculate the 95% CI
low95 <- round(fg_pct - z_crit * se, 3)
high95 <- round(fg_pct + z_crit * se, 3)
cat("The 95% Confidence Interval is:", low95, " to ", high95)
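As with the means, we can fold the proportion steps into a small helper. A sketch, with prop_ci() being my own name:

```r
## Sketch of a single-sample proportion CI helper (prop_ci is my own name)
prop_ci <- function(p, n, level = 0.95) {
  se <- sqrt((p * (1 - p)) / n)                        # SE of a proportion
  z_crit <- qnorm((1 - level) / 2, lower.tail = FALSE) # z-critical value
  c(low = p - z_crit * se, high = p + z_crit * se)
}

prop_ci(p = 0.39, n = 80)
```

This reproduces the 0.283 to 0.497 interval from the step-by-step code above.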
Two Samples
low = (p_1 - p_2) - (z_{crit} * se_{diff})
high = (p_1 - p_2) + (z_{crit} * se_{diff})
Example of two unpaired samples
The study provides the following table of example data:
data.frame(
response = c("improvement", "no improvement", "total"),
treatment_A = c(61, 19, 80),
treatment_B = c(45, 35, 80)
)
The difference we are interested in is between the proportion who improved in treatment A and the proportion of those who improved in treatment B.
## Obtain the two proportions from the table and total sample sizes
pr_A <- 61/80
n_A <- 80
pr_B <- 45/80
n_B <- 80
## calculate the difference in proportions
diff_pr <- pr_A - pr_B
## Calculate the SE
se_diff <- sqrt((pr_A * (1 - pr_A))/n_A + (pr_B * (1 - pr_B))/n_B)
## Get z-critical value for 95% confidence
level_of_interest <- 0.95
tail_value <- (1 - level_of_interest) / 2
z_crit <- qnorm(p = tail_value, lower.tail = FALSE)
## Calculate the 95% CI
low95 <- round(diff_pr - z_crit * se_diff, 3)
high95 <- round(diff_pr + z_crit * se_diff, 3)
cat("The 95% Confidence Interval is:", low95, " to ", high95)
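The unpaired two-proportion steps can be packaged up the same way. A sketch of my own, with prop_diff_ci() as a made-up name:

```r
## Sketch of an unpaired two-proportion CI helper (prop_diff_ci is my own name)
prop_diff_ci <- function(p1, n1, p2, n2, level = 0.95) {
  diff <- p1 - p2
  ## SE of the difference between two independent proportions
  se_diff <- sqrt((p1 * (1 - p1)) / n1 + (p2 * (1 - p2)) / n2)
  z_crit <- qnorm((1 - level) / 2, lower.tail = FALSE)
  c(low = diff - z_crit * se_diff, high = diff + z_crit * se_diff)
}

prop_diff_ci(p1 = 61/80, n1 = 80, p2 = 45/80, n2 = 80)
```

This reproduces the 0.057 to 0.343 interval from the step-by-step code above.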
Example for two paired samples
We can organize the data in a table like this:
data.frame(
test_1 = c("Present", "Present", "Absent", "Absent"),
test_2 = c("Present", "Absent", "Present", "Absent"),
number_of_subjects = c("a", "b", "c", "d")
)
Let’s say we measured a group of subjects for a specific disease twice in a study. A subject either has the disease (present) or does not (absent) in the two time points. We observe the following data:
dat <- data.frame(
test_1 = c("Present", "Present", "Absent", "Absent"),
test_2 = c("Present", "Absent", "Present", "Absent"),
number_of_subjects = c(10, 25, 45, 5)
)
dat
## total sample size
N <- sum(dat$number_of_subjects)
If we want to compare the proportion of subjects with the disease present at Test 1, p_1, to the proportion with the disease present at Test 2, p_2, we calculate them as:
p_1 = (a + b) / N
p_2 = (a + c) / N
Diff = p_1 - p_2
The SE of the difference is:
SE_{diff} = 1/N * sqrt(b + c - (b - c)^2 / N)
## Obtain the info we need from the table
p1 <- (10 + 25) / N
p2 <- (10 + 45) / N
diff_prop <- p1 - p2
diff_prop
## Calculate the SE of the difference
se_diff <- 1 / N * sqrt(25 + 45 - (25 - 45)^2 / N)
## Get z-critical value for 95% confidence
level_of_interest <- 0.95
tail_value <- (1 - level_of_interest) / 2
z_crit <- qnorm(p = tail_value, lower.tail = FALSE)
## Calculate the 95% CI
low95 <- round(diff_prop - z_crit * se_diff, 3)
high95 <- round(diff_prop + z_crit * se_diff, 3)
cat("The 95% Confidence Interval is:", low95, " to ", high95)
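Finally, the paired case can be written as a function that takes the four cell counts directly. This is a sketch of my own (paired_prop_ci is my name), following the SE formula given above:

```r
## Sketch of a paired two-proportion CI helper (paired_prop_ci is my own name)
## a, b, c, d follow the 2x2 table layout above (Test 1 x Test 2)
paired_prop_ci <- function(a, b, c, d, level = 0.95) {
  N <- a + b + c + d
  diff <- (a + b) / N - (a + c) / N          # p_1 - p_2
  se_diff <- 1 / N * sqrt(b + c - (b - c)^2 / N)
  z_crit <- qnorm((1 - level) / 2, lower.tail = FALSE)
  c(low = diff - z_crit * se_diff, high = diff + z_crit * se_diff)
}

paired_prop_ci(a = 10, b = 25, c = 45, d = 5)
```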
As always, all of this code is available on my GITHUB page.
And, if you’d like to read the full paper, you can find it here:
The “learn X in 20min” series seems to be popular amongst viewers, so Ellis Hughes and I decided to put together another one. This week, we walk through the basic code for a Bayesian regression model using the {rstanarm} package. Because we keep it to 20min, we use the default priors and don’t get into anything complex about Bayes. We do, however, discuss making predictions of point estimates and posterior distributions, and show how you can explore model uncertainty.
To watch the screen cast, CLICK HERE.
To access the code, CLICK HERE.
In our final episode of Base R plotting, Ellis Hughes and I use the {Lahman} baseball package to explore Hall of Fame Baseball players. In this episode we show how to make your Base R plots more interactive, allowing the user to click on the points within the plot and obtain information specific to the player.
To watch our screen cast, CLICK HERE.
To access our code, CLICK HERE.
Ellis Hughes and I continue our work building plots in Base R. This week, we turn our attention to various plotting approaches for visualizing the distribution of your data. We cover box plots, bar plots, histograms, and density plots.
To access our code, CLICK HERE.
To watch our screen cast, CLICK HERE.