suppressPackageStartupMessages(library(tidyverse))
library(Lahman)
# Change the data set from 'Batting' to df, to shorten the name for coding purposes
df <- Batting
head(df)
## playerID yearID stint teamID lgID G AB R H X2B X3B HR RBI SB CS BB
## 1 abercda01 1871 1 TRO NA 1 4 0 0 0 0 0 0 0 0 0
## 2 addybo01 1871 1 RC1 NA 25 118 30 32 6 0 0 13 8 1 4
## 3 allisar01 1871 1 CL1 NA 29 137 28 40 4 5 0 19 3 1 2
## 4 allisdo01 1871 1 WS3 NA 27 133 28 44 10 2 2 27 1 1 0
## 5 ansonca01 1871 1 RC1 NA 25 120 29 39 11 3 0 16 6 2 2
## 6 armstbo01 1871 1 FW1 NA 12 49 9 11 2 1 0 5 0 1 0
## SO IBB HBP SH SF GIDP
## 1 0 NA NA NA NA NA
## 2 0 NA NA NA NA NA
## 3 5 NA NA NA NA NA
## 4 2 NA NA NA NA NA
## 5 1 NA NA NA NA NA
## 6 1 NA NA NA NA NA
# check the number of NAs in the H and RBI column
sum(is.na(df$H))
## [1] 0
sum(is.na(df$RBI))
## [1] 424
Looks like there are no NA’s in the H column but there are 424 in the RBI column. See what seasons these missing values are located in.
df %>%
filter(is.na(RBI)) %>%
count(yearID)
## # A tibble: 2 x 2
## yearID n
## <int> <int>
## 1 1882 92
## 2 1884 332
Looks like the missing values are only located in years 1882 and 1884.
For the purposes of this analysis we will look at the mopre modern years and constrain ourselves to seasons 2010-2016
df <- df %>% filter(yearID > 2009)
df %>% dim()
## [1] 9966 22
# We are dealing with 9966 rows and 22 columns of data
It’s possible that some players may only have a few at bats (AB). We should evaluate this to see if we need to have an inclusion criteria.
boxplot(df$AB, horizontal = T,
xlab = "At Bats",
main = "Distribution of At Bats from 2010-2016\nRed Line = Avg AB",
adj = 0,
col = "light grey")
abline(v = mean(df$AB), col = "red", lwd = 2)
quantile(df$AB)
## 0% 25% 50% 75% 100%
## 0 0 17 171 684
We see that the data is vert right skewed, with a large number of players with a small number of ABs and then a bunch of players with a lot of ABs. This is why the median (thick black line insde of the box, representing the IQR) is so low relative to the mean (thick red line).
Let’s just concentracte on players with greater than or equal to 171 ABs (the 75th percentile). Obviously this is going to change how we interpret our outcome given that players with lots of ABs will have more opportunities for hits and potentially more opportunities to generate runs.
df <- df %>% filter(AB >= 171)
df %>% dim()
## [1] 2495 22
# We now have 2495 rows and 22 columns to work with
## All seasons grouped together
# Hits
df %>%
ggplot(aes(x = H)) +
geom_density(fill = "green", alpha = 0.6) +
theme_bw() +
geom_vline(aes(xintercept = mean(H)), color = "red", size = 1.2) +
xlim(0, 300) +
ggtitle("Season Hit Totals for Players with >= 171 AB (Seasons 2010-2016)",
subtitle = "Red Line = Average Hits")
# RBI
df %>%
ggplot(aes(x = RBI)) +
geom_density(fill = "blue", alpha = 0.6) +
theme_bw() +
geom_vline(aes(xintercept = mean(RBI)), color = "red", size = 1.2) +
xlim(0, 140) +
ggtitle("Season RBI Totals for Players with >= 171 AB (Seasons 2010-2016)",
subtitle = "Red Line = Average Hits")
ggplot(df, aes(x = H, y = RBI)) +
geom_jitter(color = "grey", alpha = 0.8) +
geom_smooth(method = "lm", fill = "red") +
ggtitle("Relationship between H and RBI",
subtitle = "Seasons 2010-2016") +
theme_light()
Okay, so there is a relationship between H and RBI Is this relationship at all influened by Season (perhaps changings in the scoring environment could influence this relationship in some way from one season to the next?)
# Hits by season
df %>%
ggplot(aes(x = H)) +
geom_density(fill = "green", alpha = 0.6) +
theme_bw() +
geom_vline(aes(xintercept = mean(H)), color = "red", size = 1.2) +
xlim(0, 300) +
ggtitle("Season Hit Totals for Players with >= 171 AB (Seasons 2010-2016)",
subtitle = "Red Line = Average Hits") +
facet_wrap(~yearID)
# RBI by season
df %>%
ggplot(aes(x = RBI)) +
geom_density(fill = "blue", alpha = 0.6) +
theme_bw() +
geom_vline(aes(xintercept = mean(RBI)), color = "red", size = 1.2) +
xlim(0, 140) +
ggtitle("Season RBI Totals for Players with >= 171 AB (Seasons 2010-2016)",
subtitle = "Red Line = Average Hits") +
facet_wrap(~yearID)
# Relationship between H and RBI by season
ggplot(df, aes(x = H, y = RBI)) +
geom_jitter(color = "grey", alpha = 0.8) +
geom_smooth(method = "lm", fill = "red") +
ggtitle("Relationship between H and RBI",
subtitle = "Seasons 2010-2016") +
theme_light() +
facet_wrap(~yearID)
Everything seems relatively consistent from year to year
df %>%
summarize_at(c("H", "RBI"), .funs = funs(mean, sd))
## Warning: funs() is soft deprecated as of dplyr 0.8.0
## Please use a list of either functions or lambdas:
##
## # Simple named list:
## list(mean = mean, median = median)
##
## # Auto named with `tibble::lst()`:
## tibble::lst(mean, median)
##
## # Using lambdas
## list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
## This warning is displayed once per session.
## H_mean RBI_mean H_sd RBI_sd
## 1 102.9351 48.9006 43.31572 25.69956
Mean & SD
H = 103 ± 43
RBI = 49 ± 26
with(df, cor.test(H, RBI))
##
## Pearson's product-moment correlation
##
## data: H and RBI
## t = 70.246, df = 2493, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.8014749 0.8278444
## sample estimates:
## cor
## 0.8150814
The correlation between H and AB is 0.82 [95% CI: 0.80, 0.83]
fit_lm <- lm(RBI ~ H, data = df)
summary(fit_lm)
##
## Call:
## lm(formula = RBI ~ H, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -60.103 -8.830 -0.490 8.892 58.118
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.878201 0.768794 -1.142 0.253
## H 0.483594 0.006884 70.246 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.89 on 2493 degrees of freedom
## Multiple R-squared: 0.6644, Adjusted R-squared: 0.6642
## F-statistic: 4935 on 1 and 2493 DF, p-value: < 2.2e-16
The model explains about 66% of the variability observed in RBI.
The coefficient for H suggests that for a one unit increase in H, on average we would see an approximate 0.48 increase in RBIs. Thus, 10 extra AB could lead to about 5 extra RBI (10 * 0.48).
par(mfrow = c(2,2))
plot(fit_lm)
A little fanning out in the pattern of the residuals againsts predicted values plot (upper left) indicating that the variance is not consistent across the range values.
The upper right plot suggests the residuals are relatively normally distributed
Other diagnostic criteria can be checked. For example, hat values (aka, leverage – bottom right) and Cook’s distance, which can inform us about how an individual case’s may be influencing the model’s ability to make predictions across all cases in the data.
hist(resid(fit_lm), col = "grey", main = "Model Residuals", xlab = "Residuals")
df %>%
mutate(Pred = fitted(fit_lm)) %>%
ggplot(aes(x = Pred, y = RBI)) +
geom_jitter(aes(color = H), alpha = 0.8) +
geom_abline(intercept = 0, slope = 1, color = "red", size = 2) +
theme_light() +
labs(title = "Predicted vs Actual RBI") +
expand_limits(y = 0, x = 0)
Here we look at the residuals in a histogram to see if they are normally distributed.
We also produced a plot of the predicted RBI vs the actual RBI. The plot contains a line of equality (if the model prediction was 100% correct, every single dot would fall directly on that line). I’ve colored the dots by the number of H a batter had so that we can better see where the model makes some mistakes. The lighter colored dots indicate more H and we can see that to the far right of the plot, when thre are more H, the errors between predicted and actual RBI begin to increase.
rmse <- sqrt(mean(fit_lm$resid^2))
rmse
## [1] 14.88596
The research question was to examine the relationship between H and RBI in major league baseball players. It appears that there is a relationship between these two variables with more hits yielding a higher number of Runs. However, with an increase in hits the model predictions become more noisy.
Better predictions could be made with more data that can provide context around hits (situation, pitching, line up order, etc) and should be collected for future analysis.