```
suppressPackageStartupMessages(library(tidyverse))
library(Lahman)
```

- What is the relationship between hits (H) and runs batted in (RBI) in major league baseball players?

- Hypothesis: A higher number of H will lead to greater RBI in a season
- Potential limitations: Other variables may influence the relationship between H and RBI, requiring additional data for future analysis. For example, where in the batting order the batter hits, the number of opportunities the batter has to hit with runners in scoring position, the type of pitching the batter faced that season, etc.

- What type of data is required

- Data sources: Lahman Databases Batting table
- Data is available in the ‘Lahman’ package within R
- Data issues: Batting data is provided in the table from 1871 through 2016. Older seasons may be missing data.

- Collection/Measurement

- No additional data needs to be collected at this time

- Data Cleaning

```
# Change the data set from 'Batting' to df, to shorten the name for coding purposes
df <- Batting
head(df)
```

```
## playerID yearID stint teamID lgID G AB R H X2B X3B HR RBI SB CS BB
## 1 abercda01 1871 1 TRO NA 1 4 0 0 0 0 0 0 0 0 0
## 2 addybo01 1871 1 RC1 NA 25 118 30 32 6 0 0 13 8 1 4
## 3 allisar01 1871 1 CL1 NA 29 137 28 40 4 5 0 19 3 1 2
## 4 allisdo01 1871 1 WS3 NA 27 133 28 44 10 2 2 27 1 1 0
## 5 ansonca01 1871 1 RC1 NA 25 120 29 39 11 3 0 16 6 2 2
## 6 armstbo01 1871 1 FW1 NA 12 49 9 11 2 1 0 5 0 1 0
## SO IBB HBP SH SF GIDP
## 1 0 NA NA NA NA NA
## 2 0 NA NA NA NA NA
## 3 5 NA NA NA NA NA
## 4 2 NA NA NA NA NA
## 5 1 NA NA NA NA NA
## 6 1 NA NA NA NA NA
```

```
# check the number of NAs in the H and RBI column
sum(is.na(df$H))
```

`## [1] 0`

`sum(is.na(df$RBI))`

`## [1] 424`

There are no NAs in the H column, but there are 424 in the RBI column. Let's see which seasons these missing values fall in.

```
df %>%
  filter(is.na(RBI)) %>%
  count(yearID)
```

```
## # A tibble: 2 x 2
## yearID n
## <int> <int>
## 1 1882 92
## 2 1884 332
```

Looks like the missing values are only located in years 1882 and 1884.

For the purposes of this analysis we will look at the more modern years and constrain ourselves to seasons 2010-2016.

```
df <- df %>% filter(yearID >= 2010, yearID <= 2016)
df %>% dim()
```

`## [1] 9966 22`

`# We are dealing with 9966 rows and 22 columns of data`

It’s possible that some players only have a few at bats (AB). We should evaluate this to see if we need an inclusion criterion.

```
boxplot(df$AB, horizontal = TRUE,
        xlab = "At Bats",
        main = "Distribution of At Bats from 2010-2016\nRed Line = Avg AB",
        adj = 0,
        col = "light grey")
abline(v = mean(df$AB), col = "red", lwd = 2)
```

`quantile(df$AB)`

```
## 0% 25% 50% 75% 100%
## 0 0 17 171 684
```

We see that the data is very right skewed: a large number of players have a small number of ABs, and then a bunch of players have a lot of ABs. This is why the median (the thick black line inside the box, where the box itself spans the IQR) is so low relative to the mean (thick red line).
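To put numbers on the skew described above, here is a minimal sketch that compares the mean and median of AB and checks how many rows have no at bats at all. It rebuilds the 2010-2016 subset directly from `Batting` so it can run on its own; the helper vector `ab` is a name introduced here for illustration, not part of the original analysis.

```r
library(Lahman)

# Same subset the post works with: 2010-2016 rows from the Batting table
# (`ab` is a hypothetical helper name used only in this sketch)
ab <- Batting$AB[Batting$yearID >= 2010 & Batting$yearID <= 2016]

# A mean well above the median is the numeric signature of right skew
c(mean = mean(ab, na.rm = TRUE), median = median(ab, na.rm = TRUE))

# Share of rows with zero at bats
mean(ab == 0, na.rm = TRUE)
```

If the mean comes out several times larger than the median, that matches what the boxplot shows.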

Let’s just concentrate on players with greater than or equal to 171 ABs (the 75th percentile). Obviously this is going to change how we interpret our outcome, given that players with lots of ABs will have more opportunities for hits and potentially more opportunities to generate runs.

```
df <- df %>% filter(AB >= 171)
df %>% dim()
```

`## [1] 2495 22`

`# We now have 2495 rows and 22 columns to work with`
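One thing worth keeping in mind: the `stint` column in the `head()` output means a player who changed teams mid-season appears on more than one row for that year, so the AB filter above is applied per stint, not per season. The sketch below is an optional alternative, not what the rest of this post uses. It assumes we want one row per player per season, and `season_totals` and `n_stints` are names made up for this example.

```r
suppressPackageStartupMessages(library(dplyr))
library(Lahman)

# Assumption: sum AB, H and RBI across stints to get one row per player-season
season_totals <- Batting %>%
  filter(yearID >= 2010, yearID <= 2016) %>%
  group_by(playerID, yearID) %>%
  summarise(AB = sum(AB), H = sum(H), RBI = sum(RBI),
            n_stints = n(), .groups = "drop")

# How many player-seasons are spread over more than one stint?
sum(season_totals$n_stints > 1)

# Player-seasons that clear the 171 AB cutoff once stints are combined
season_totals %>% filter(AB >= 171) %>% nrow()
```

Comparing that last count with the per-stint row count above gives a sense of how much the stint structure matters for the inclusion criterion.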

- Visuals of H and RBI

```
## All seasons grouped together
# Hits
df %>%
  ggplot(aes(x = H)) +
  geom_density(fill = "green", alpha = 0.6) +
  theme_bw() +
  geom_vline(aes(xintercept = mean(H)), color = "red", size = 1.2) +
  xlim(0, 300) +
  ggtitle("Season Hit Totals for Players with >= 171 AB (Seasons 2010-2016)",
          subtitle = "Red Line = Average Hits")
```

```
# RBI
df %>%
  ggplot(aes(x = RBI)) +
  geom_density(fill = "blue", alpha = 0.6) +
  theme_bw() +
  geom_vline(aes(xintercept = mean(RBI)), color = "red", size = 1.2) +
  xlim(0, 140) +
  ggtitle("Season RBI Totals for Players with >= 171 AB (Seasons 2010-2016)",
          subtitle = "Red Line = Average RBI")
```

- Looks like the average number of hits in a season is around 100 and the average number of RBI is around 50.
- Plot their relationship to each other

```
ggplot(df, aes(x = H, y = RBI)) +
  geom_jitter(color = "grey", alpha = 0.8) +
  geom_smooth(method = "lm", fill = "red") +
  ggtitle("Relationship between H and RBI",
          subtitle = "Seasons 2010-2016") +
  theme_light()
```
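The plot gives a visual impression of the relationship. A minimal sketch of how it could be quantified is below: a Pearson correlation plus the same simple linear regression that `geom_smooth(method = "lm")` draws. The data set is rebuilt from `Batting` with the filters used above so the block runs on its own, and `fit` is a name chosen for this example. This only describes the association; it does not address the limitations listed at the top, such as batting order or opportunities with runners on base.

```r
suppressPackageStartupMessages(library(dplyr))
library(Lahman)

# Rebuild the filtered data set used in the plot (2010-2016, AB >= 171)
df <- Batting %>%
  filter(yearID >= 2010, yearID <= 2016, AB >= 171)

# Pearson correlation between hits and RBI
cor(df$H, df$RBI)

# Simple linear regression of RBI on hits
# (`fit` is a hypothetical name used only in this sketch)
fit <- lm(RBI ~ H, data = df)
summary(fit)
```

The slope on `H` is the estimated change in RBI per additional hit, and the R-squared in the summary equals the squared correlation from `cor()`.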