suppressPackageStartupMessages(library(tidyverse))
library(Lahman)
# Change the data set from 'Batting' to df, to shorten the name for coding purposes
df <- Batting
head(df)
## playerID yearID stint teamID lgID G AB R H X2B X3B HR RBI SB CS BB
## 1 abercda01 1871 1 TRO NA 1 4 0 0 0 0 0 0 0 0 0
## 2 addybo01 1871 1 RC1 NA 25 118 30 32 6 0 0 13 8 1 4
## 3 allisar01 1871 1 CL1 NA 29 137 28 40 4 5 0 19 3 1 2
## 4 allisdo01 1871 1 WS3 NA 27 133 28 44 10 2 2 27 1 1 0
## 5 ansonca01 1871 1 RC1 NA 25 120 29 39 11 3 0 16 6 2 2
## 6 armstbo01 1871 1 FW1 NA 12 49 9 11 2 1 0 5 0 1 0
## SO IBB HBP SH SF GIDP
## 1 0 NA NA NA NA NA
## 2 0 NA NA NA NA NA
## 3 5 NA NA NA NA NA
## 4 2 NA NA NA NA NA
## 5 1 NA NA NA NA NA
## 6 1 NA NA NA NA NA
# Check the number of NAs in the H and RBI columns
sum(is.na(df$H))
## [1] 0
sum(is.na(df$RBI))
## [1] 424
Looks like there are no NAs in the H column, but there are 424 in the RBI column. Let’s see which seasons these missing values fall in.
df %>%
filter(is.na(RBI)) %>%
count(yearID)
## # A tibble: 2 x 2
## yearID n
## <int> <int>
## 1 1882 92
## 2 1884 332
Looks like the missing values are only located in years 1882 and 1884.
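The same check can be run on every column at once rather than one column at a time. This is a minimal sketch using base R's `colSums()` on the unfiltered `Batting` table; the name `na_counts` is just an illustrative choice, and the counts you see may differ depending on the installed version of the Lahman package.

```r
library(Lahman)

# Count missing values in every column of Batting at once
na_counts <- colSums(is.na(Batting))

# Show only the columns that contain at least one NA, largest count first
sort(na_counts[na_counts > 0], decreasing = TRUE)
```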
For the purposes of this analysis we will look at the more modern years and constrain ourselves to seasons 2010-2016.
df <- df %>% filter(yearID >= 2010, yearID <= 2016)
df %>% dim()
## [1] 9966 22
# We are dealing with 9966 rows and 22 columns of data
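One thing worth keeping in mind: `Batting` holds one row per player, season, and stint (see the `stint` column in the `head()` output), so a player who changed teams mid-season appears more than once in a given year. If season totals per player are what you want, a sketch along these lines would collapse the stints first. The name `season_totals` is hypothetical, and this is an alternative to the row-level approach used in the rest of this walkthrough, not a replacement for it.

```r
library(dplyr)
library(Lahman)

# One row per player and season, summing counting stats across stints
season_totals <- Batting %>%
  filter(yearID >= 2010, yearID <= 2016) %>%
  group_by(playerID, yearID) %>%
  summarise(across(c(AB, H, RBI), sum), stints = n(), .groups = "drop")

# How many player-seasons were made up of 1, 2, ... stints?
season_totals %>% count(stints)
```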
It’s possible that some players may only have a few at bats (AB). We should evaluate this to see if we need an inclusion criterion.
boxplot(df$AB, horizontal = TRUE,
xlab = "At Bats",
main = "Distribution of At Bats from 2010-2016\nRed Line = Avg AB",
adj = 0,
col = "light grey")
abline(v = mean(df$AB), col = "red", lwd = 2)
quantile(df$AB)
## 0% 25% 50% 75% 100%
## 0 0 17 171 684
We see that the data is very right skewed: a large number of players have very few ABs (at least a quarter have none at all), while a smaller group has several hundred. This is why the median (the thick black line inside the box, where the box itself spans the IQR) is so low relative to the mean (the thick red line).
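To put numbers on the skew, the mean and median can be compared directly, along with the share of rows that have no at bats at all. This is a small sketch that rebuilds the 2010-2016 subset from `Batting` so it runs on its own; `ab` is just an illustrative name.

```r
library(Lahman)

# Same season window as the text (2010-2016)
ab <- Batting$AB[Batting$yearID >= 2010 & Batting$yearID <= 2016]

mean(ab)       # pulled upward by the long right tail
median(ab)     # sits well below the mean when data are right skewed
mean(ab == 0)  # proportion of rows with zero at bats
```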
Let’s just concentrate on players with greater than or equal to 171 ABs (the 75th percentile). Obviously this is going to change how we interpret our outcome, given that players with lots of ABs will have more opportunities for hits and potentially more opportunities to generate runs.
df <- df %>% filter(AB >= 171)
df %>% dim()
## [1] 2495 22
# We now have 2495 rows and 22 columns to work with
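Rather than typing the cutoff in by hand, it can be derived from the data so that it stays in step if the season window changes. The following is a sketch under that assumption; `batting_recent`, `ab_cutoff`, and `batting_regulars` are names made up for this example, and the exact cutoff and row count depend on the installed version of the Lahman package.

```r
library(dplyr)
library(Lahman)

batting_recent <- Batting %>% filter(yearID >= 2010, yearID <= 2016)

# 75th percentile of AB; names = FALSE returns a plain number
ab_cutoff <- quantile(batting_recent$AB, probs = 0.75, names = FALSE)

# Keep only rows at or above the cutoff
batting_regulars <- batting_recent %>% filter(AB >= ab_cutoff)
dim(batting_regulars)
```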
## All seasons grouped together
# Hits
df %>%
ggplot(aes(x = H)) +
geom_density(fill = "green", alpha = 0.6) +
theme_bw() +
geom_vline(aes(xintercept = mean(H)), color = "red", size = 1.2) +
xlim(0, 300) +
ggtitle("Season Hit Totals for Players with >= 171 AB (Seasons 2010-2016)",
subtitle = "Red Line = Average Hits")
# RBI
df %>%
ggplot(aes(x = RBI)) +
geom_density(fill = "blue", alpha = 0.6) +
theme_bw() +
geom_vline(aes(xintercept = mean(RBI)), color = "red", size = 1.2) +
xlim(0, 140) +
ggtitle("Season RBI Totals for Players with >= 171 AB (Seasons 2010-2016)",
subtitle = "Red Line = Average RBI")
ggplot(df, aes(x = H, y = RBI)) +
geom_jitter(color = "grey", alpha = 0.8) +
geom_smooth(method = "lm", fill = "red") +
ggtitle("Relationship between H and RBI",
subtitle = "Seasons 2010-2016") +
theme_light()
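The scatter plot shows the trend visually; to attach numbers to it, one option is to compute the correlation and fit the same straight line that `geom_smooth(method = "lm")` draws. This is a minimal sketch that rebuilds the filtered data from `Batting` so it runs on its own. The names `regulars` and `fit` are illustrative, and no results are quoted here because they depend on the installed Lahman version.

```r
library(dplyr)
library(Lahman)

# Same filters as above: seasons 2010-2016, at least 171 AB
regulars <- Batting %>%
  filter(yearID >= 2010, yearID <= 2016, AB >= 171)

# Pearson correlation between hits and RBI
cor(regulars$H, regulars$RBI)

# Simple linear regression of RBI on hits
fit <- lm(RBI ~ H, data = regulars)
coef(fit)  # intercept and slope (expected change in RBI per additional hit)
```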