Load Packages

suppressPackageStartupMessages(library(tidyverse))
library(Lahman)

Step 1: Research Question/Problem Statement

  1. What is the relationship between hits (H) and runs batted in (RBI) among Major League Baseball players?
NOTES:
  • Hypothesis: A higher number of H will lead to greater RBI in a season
  • Potential limitations: Other variables may influence the relationship between H and RBI, requiring additional data for future analysis. For example, where in the batting order the batter hits, the number of opportunities the batter has to hit with runners in scoring position, the type of pitching the batter faced that season, etc.

Step 2: Data Collection/Measurement Strategy

  1. What type of data is required
  2. Collection/Measurement
  3. Data Cleaning
# Copy the 'Batting' data set to df to shorten the name for coding purposes

df <- Batting
head(df)
##    playerID yearID stint teamID lgID  G  AB  R  H X2B X3B HR RBI SB CS BB
## 1 abercda01   1871     1    TRO   NA  1   4  0  0   0   0  0   0  0  0  0
## 2  addybo01   1871     1    RC1   NA 25 118 30 32   6   0  0  13  8  1  4
## 3 allisar01   1871     1    CL1   NA 29 137 28 40   4   5  0  19  3  1  2
## 4 allisdo01   1871     1    WS3   NA 27 133 28 44  10   2  2  27  1  1  0
## 5 ansonca01   1871     1    RC1   NA 25 120 29 39  11   3  0  16  6  2  2
## 6 armstbo01   1871     1    FW1   NA 12  49  9 11   2   1  0   5  0  1  0
##   SO IBB HBP SH SF GIDP
## 1  0  NA  NA NA NA   NA
## 2  0  NA  NA NA NA   NA
## 3  5  NA  NA NA NA   NA
## 4  2  NA  NA NA NA   NA
## 5  1  NA  NA NA NA   NA
## 6  1  NA  NA NA NA   NA
# Check the number of NAs in the H and RBI columns

sum(is.na(df$H))
## [1] 0
sum(is.na(df$RBI))
## [1] 424

There are no NAs in the H column, but there are 424 in the RBI column. Let’s see which seasons these missing values fall in.

df %>%
  filter(is.na(RBI)) %>%
  count(yearID)
## # A tibble: 2 x 2
##   yearID     n
##    <int> <int>
## 1   1882    92
## 2   1884   332

The missing values are confined to the 1882 and 1884 seasons.

For the purposes of this analysis, we will look at more modern years and constrain ourselves to the 2010-2016 seasons.
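
As a quick cross-check, the same missing-value count can be taken for every column at once rather than one column at a time. The sketch below uses a small made-up data frame (`toy`, not the real Batting data) so that it runs in base R without any packages; the column names simply mirror the ones used above.

```r
# Toy stand-in for the Batting table; the values are illustrative only
toy <- data.frame(
  yearID = c(1882, 1882, 1884, 2010),
  H      = c(10, 25, 7, 150),
  RBI    = c(NA, NA, 3, 80)
)

# Count the NAs in every column in one pass
na_counts <- colSums(is.na(toy))
na_counts

# Tabulate which seasons the missing RBI values fall in
na_by_year <- table(toy$yearID[is.na(toy$RBI)])
na_by_year
```

On the real data, replacing `toy` with `df` should give the per-column counts directly, assuming `df` is still the unfiltered copy of `Batting`.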

df <- df %>% filter(between(yearID, 2010, 2016))
df %>% dim()
## [1] 9966   22
# We are dealing with 9966 rows and 22 columns of data

It’s possible that some players have only a few at bats (AB). We should evaluate this to see whether we need an inclusion criterion.

boxplot(df$AB, horizontal = TRUE,
        xlab = "At Bats",
        main = "Distribution of At Bats from 2010-2016\nRed Line = Avg AB",
        adj = 0,
        col = "light grey")
abline(v = mean(df$AB), col = "red", lwd = 2)

quantile(df$AB)
##   0%  25%  50%  75% 100% 
##    0    0   17  171  684

We see that the data is very right skewed: a large number of players have only a few ABs, while a smaller group has a lot of ABs. This is why the median (thick black line inside the box, which spans the IQR) is so low relative to the mean (thick red line).
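
To make the mean-versus-median point concrete, here is a minimal base R sketch. It uses the five quantile values printed above as a stand-in sample (`ab_toy`), which is an assumption made purely for illustration and not the full AB column, so the resulting mean is not the mean of the real data.

```r
# Stand-in sample: the five quantile values reported above, NOT the full data
ab_toy <- c(0, 0, 17, 171, 684)

ab_mean   <- mean(ab_toy)
ab_median <- median(ab_toy)

# In a right-skewed sample the mean sits above the median
c(mean = ab_mean, median = ab_median)
ab_mean > ab_median
```

The single large value pulls the mean far above the median, which is the same pattern the boxplot shows for the real AB distribution.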

Let’s concentrate on players with at least 171 ABs (the 75th percentile). This changes how we interpret our outcome, since players with lots of ABs have more opportunities for hits and potentially more opportunities to drive in runs.

df <- df %>% filter(AB >= 171)
df %>% dim()
## [1] 2495   22
# We now have 2495 rows and 22 columns to work with
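
The cutoff of 171 was read off the quantile output by hand. One possible alternative, sketched below in base R, is to compute the 75th percentile and filter on that value so the threshold follows the data if the season range changes. The data frame `toy`, the names `ab_cutoff` and `toy_kept`, and the player IDs are all hypothetical; the AB values reuse the quantiles shown earlier.

```r
# Toy stand-in for the season-filtered data; values are illustrative only
toy <- data.frame(
  playerID = c("a", "b", "c", "d", "e"),
  AB       = c(0, 0, 17, 171, 684)
)

# Derive the cutoff from the data instead of typing it in by hand
ab_cutoff <- quantile(toy$AB, probs = 0.75, names = FALSE)

# Keep rows at or above the cutoff (same logic as filter(AB >= 171))
toy_kept <- toy[toy$AB >= ab_cutoff, ]
ab_cutoff
nrow(toy_kept)
```

Whether a data-driven or a fixed cutoff is preferable depends on the analysis: a fixed value is easier to report, while a computed one adapts when the underlying seasons change.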

Step 3: Visualize & Summarize Data

## All seasons grouped together

# Hits
df %>%
  ggplot(aes(x = H)) +
  geom_density(fill = "green", alpha = 0.6) +
  theme_bw() +
  geom_vline(aes(xintercept = mean(H)), color = "red", linewidth = 1.2) +
  xlim(0, 300) +
  ggtitle("Season Hit Totals for Players with >= 171 AB (Seasons 2010-2016)", 
          subtitle = "Red Line = Average Hits")

# RBI
df %>%
  ggplot(aes(x = RBI)) +
  geom_density(fill = "blue", alpha = 0.6) +
  theme_bw() +
  geom_vline(aes(xintercept = mean(RBI)), color = "red", linewidth = 1.2) +
  xlim(0, 140) +
  ggtitle("Season RBI Totals for Players with >= 171 AB (Seasons 2010-2016)", 
          subtitle = "Red Line = Average RBI")

ggplot(df, aes(x = H, y = RBI)) +
  geom_jitter(color = "grey", alpha = 0.8) +
  geom_smooth(method = "lm", fill = "red") +
  ggtitle("Relationship between H and RBI",
          subtitle = "Seasons 2010-2016") +
  theme_light()
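
The fitted line in the scatter plot comes from an ordinary least squares fit of RBI on H, which is what `geom_smooth(method = "lm")` draws. As a hedged sketch of the numbers behind such a line, the base R snippet below computes a Pearson correlation and the fitted slope on a tiny made-up data set (`toy`). The values are invented for illustration and say nothing about the real 2010-2016 results.

```r
# Made-up hit and RBI totals; illustrative only, not taken from Batting
toy <- data.frame(
  H   = c(50, 80, 110, 140, 170),
  RBI = c(22, 33, 52, 63, 80)
)

# Strength and direction of the linear association
r_toy <- cor(toy$H, toy$RBI)

# Least squares fit of RBI on H, the same model form as the plotted line
fit_toy   <- lm(RBI ~ H, data = toy)
slope_toy <- coef(fit_toy)[["H"]]

r_toy
slope_toy
```

The slope reads as the expected change in RBI for each additional hit. Swapping `toy` for the filtered `df` would give the corresponding estimates for the real data.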