Data Analysis Framework Example

Load Packages

suppressPackageStartupMessages(library(tidyverse))
library(Lahman)

Step 1: Research Question/Problem Statement

What is the relationship between hits (H) and runs batted in (RBI) in major league baseball players?

NOTES:

Hypothesis: A higher number of H will lead to greater RBI in a season
Potential limitations: Other variables may influence the relationship between H and RBI, requiring additional data for future analysis. For example, where in the batting order the batter hits, the number of opportunities the batter has to hit with runners in scoring position, the type of pitching the batter faced that season, etc.

Step 2: Data Collection/Measurement Strategy

What type of data is required

Data sources: Lahman Databases Batting table
Data is available in the ‘Lahman’ package within R
Data issues: Batting data is provided in the table from 1871 through 2016. Older seasons may be missing data.

Collection/Measurement

No additional data needs to be collected at this time

Data Cleaning

# Change the data set from 'Batting' to df, to shorten the name for coding purposes

df <- Batting
head(df)

##    playerID yearID stint teamID lgID  G  AB  R  H X2B X3B HR RBI SB CS BB
## 1 abercda01   1871     1    TRO   NA  1   4  0  0   0   0  0   0  0  0  0
## 2  addybo01   1871     1    RC1   NA 25 118 30 32   6   0  0  13  8  1  4
## 3 allisar01   1871     1    CL1   NA 29 137 28 40   4   5  0  19  3  1  2
## 4 allisdo01   1871     1    WS3   NA 27 133 28 44  10   2  2  27  1  1  0
## 5 ansonca01   1871     1    RC1   NA 25 120 29 39  11   3  0  16  6  2  2
## 6 armstbo01   1871     1    FW1   NA 12  49  9 11   2   1  0   5  0  1  0
##   SO IBB HBP SH SF GIDP
## 1  0  NA  NA NA NA   NA
## 2  0  NA  NA NA NA   NA
## 3  5  NA  NA NA NA   NA
## 4  2  NA  NA NA NA   NA
## 5  1  NA  NA NA NA   NA
## 6  1  NA  NA NA NA   NA

# check the number of NAs in the H and RBI column

sum(is.na(df$H))

## [1] 0

sum(is.na(df$RBI))

## [1] 424

Looks like there are no NA’s in the H column but there are 424 in the RBI column. See what seasons these missing values are located in.

df %>%
  filter(is.na(RBI)) %>%
  count(yearID)

## # A tibble: 2 x 2
##   yearID     n
##    <int> <int>
## 1   1882    92
## 2   1884   332

Looks like the missing values are only located in years 1882 and 1884.

For the purposes of this analysis we will look at the mopre modern years and constrain ourselves to seasons 2010-2016

df <- df %>% filter(yearID > 2009)
df %>% dim()

## [1] 9966   22

# We are dealing with 9966 rows and 22 columns of data

It’s possible that some players may only have a few at bats (AB). We should evaluate this to see if we need to have an inclusion criteria.

boxplot(df$AB, horizontal = T,
        xlab = "At Bats",
        main = "Distribution of At Bats from 2010-2016\nRed Line = Avg AB",
        adj = 0,
        col = "light grey")
abline(v = mean(df$AB), col = "red", lwd = 2)

quantile(df$AB)

##   0%  25%  50%  75% 100% 
##    0    0   17  171  684

We see that the data is vert right skewed, with a large number of players with a small number of ABs and then a bunch of players with a lot of ABs. This is why the median (thick black line insde of the box, representing the IQR) is so low relative to the mean (thick red line).

Let’s just concentracte on players with greater than or equal to 171 ABs (the 75th percentile). Obviously this is going to change how we interpret our outcome given that players with lots of ABs will have more opportunities for hits and potentially more opportunities to generate runs.

df <- df %>% filter(AB >= 171)
df %>% dim()

## [1] 2495   22

# We now have 2495 rows and 22 columns to work with

Step 3: Visualize & Summarize Data

Visuals of H and RBI

## All seasons grouped together

# Hits
df %>%
  ggplot(aes(x = H)) +
  geom_density(fill = "green", alpha = 0.6) +
  theme_bw() +
  geom_vline(aes(xintercept = mean(H)), color = "red", size = 1.2) +
  xlim(0, 300) +
  ggtitle("Season Hit Totals for Players with >= 171 AB (Seasons 2010-2016)", 
          subtitle = "Red Line = Average Hits")

# RBI
df %>%
  ggplot(aes(x = RBI)) +
  geom_density(fill = "blue", alpha = 0.6) +
  theme_bw() +
  geom_vline(aes(xintercept = mean(RBI)), color = "red", size = 1.2) +
  xlim(0, 140) +
  ggtitle("Season RBI Totals for Players with >= 171 AB (Seasons 2010-2016)", 
          subtitle = "Red Line = Average Hits")

Looks like the average number of hits in a season is around 100 and the average numnber of RBI is around 50.
Plot their relationship to each other

ggplot(df, aes(x = H, y = RBI)) +
  geom_jitter(color = "grey", alpha = 0.8) +
  geom_smooth(method = "lm", fill = "red") +
  ggtitle("Relationship between H and RBI",
          subtitle = "Seasons 2010-2016") +
  theme_light()

Okay, so there is a relationship between H and RBI Is this relationship at all influened by Season (perhaps changings in the scoring environment could influence this relationship in some way from one season to the next?)

# Hits by season
df %>%
  ggplot(aes(x = H)) +
  geom_density(fill = "green", alpha = 0.6) +
  theme_bw() +
  geom_vline(aes(xintercept = mean(H)), color = "red", size = 1.2) +
  xlim(0, 300) +
  ggtitle("Season Hit Totals for Players with >= 171 AB (Seasons 2010-2016)", 
          subtitle = "Red Line = Average Hits") +
  facet_wrap(~yearID)

# RBI by season
df %>%
  ggplot(aes(x = RBI)) +
  geom_density(fill = "blue", alpha = 0.6) +
  theme_bw() +
  geom_vline(aes(xintercept = mean(RBI)), color = "red", size = 1.2) +
  xlim(0, 140) +
  ggtitle("Season RBI Totals for Players with >= 171 AB (Seasons 2010-2016)", 
          subtitle = "Red Line = Average Hits") +
  facet_wrap(~yearID)

# Relationship between H and RBI by season
ggplot(df, aes(x = H, y = RBI)) +
  geom_jitter(color = "grey", alpha = 0.8) +
  geom_smooth(method = "lm", fill = "red") +
  ggtitle("Relationship between H and RBI",
          subtitle = "Seasons 2010-2016") +
  theme_light() +
  facet_wrap(~yearID)

Everything seems relatively consistent from year to year

Produce summary statistics for H and RBI

df %>%
  summarize_at(c("H", "RBI"), .funs = funs(mean, sd))

## Warning: funs() is soft deprecated as of dplyr 0.8.0
## Please use a list of either functions or lambdas: 
## 
##   # Simple named list: 
##   list(mean = mean, median = median)
## 
##   # Auto named with `tibble::lst()`: 
##   tibble::lst(mean, median)
## 
##   # Using lambdas
##   list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
## This warning is displayed once per session.

##     H_mean RBI_mean     H_sd   RBI_sd
## 1 102.9351  48.9006 43.31572 25.69956

Mean & SD

H = 103 ± 43

RBI = 49 ± 26

Calculate the correlation

with(df, cor.test(H, RBI))

## 
##  Pearson's product-moment correlation
## 
## data:  H and RBI
## t = 70.246, df = 2493, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.8014749 0.8278444
## sample estimates:
##       cor 
## 0.8150814

The correlation between H and AB is 0.82 [95% CI: 0.80, 0.83]

Step 4: Model Development/Interpretation

We will just build a simple regression model for this example

fit_lm <- lm(RBI ~ H, data = df)
summary(fit_lm)

## 
## Call:
## lm(formula = RBI ~ H, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -60.103  -8.830  -0.490   8.892  58.118 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.878201   0.768794  -1.142    0.253    
## H            0.483594   0.006884  70.246   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.89 on 2493 degrees of freedom
## Multiple R-squared:  0.6644, Adjusted R-squared:  0.6642 
## F-statistic:  4935 on 1 and 2493 DF,  p-value: < 2.2e-16

The model explains about 66% of the variability observed in RBI.

The coefficient for H suggests that for a one unit increase in H, on average we would see an approximate 0.48 increase in RBIs. Thus, 10 extra AB could lead to about 5 extra RBI (10 * 0.48).

Step 5: Model Evaluation

Plot model diagnostics

par(mfrow = c(2,2))
plot(fit_lm)

A little fanning out in the pattern of the residuals againsts predicted values plot (upper left) indicating that the variance is not consistent across the range values.

The upper right plot suggests the residuals are relatively normally distributed

Other diagnostic criteria can be checked. For example, hat values (aka, leverage – bottom right) and Cook’s distance, which can inform us about how an individual case’s may be influencing the model’s ability to make predictions across all cases in the data.

Other ways of plotting that can help understand the model

hist(resid(fit_lm), col = "grey", main = "Model Residuals", xlab = "Residuals")

df %>%
  mutate(Pred = fitted(fit_lm)) %>% 
  ggplot(aes(x = Pred, y = RBI)) +
  geom_jitter(aes(color = H), alpha = 0.8) +
  geom_abline(intercept = 0, slope = 1, color = "red", size = 2) +
  theme_light() +
  labs(title = "Predicted vs Actual RBI") +
  expand_limits(y = 0, x = 0)

Here we look at the residuals in a histogram to see if they are normally distributed.

We also produced a plot of the predicted RBI vs the actual RBI. The plot contains a line of equality (if the model prediction was 100% correct, every single dot would fall directly on that line). I’ve colored the dots by the number of H a batter had so that we can better see where the model makes some mistakes. The lighter colored dots indicate more H and we can see that to the far right of the plot, when thre are more H, the errors between predicted and actual RBI begin to increase.

Calculate model RMSE

rmse <- sqrt(mean(fit_lm$resid^2))
rmse

## [1] 14.88596

Step 6: Communication of Results

The research question was to examine the relationship between H and RBI in major league baseball players. It appears that there is a relationship between these two variables with more hits yielding a higher number of Runs. However, with an increase in hits the model predictions become more noisy.
Better predictions could be made with more data that can provide context around hits (situation, pitching, line up order, etc) and should be collected for future analysis.

Data Analysis Framework Example – MLB Hitting

Patrick Ward

12/14/2019

Load Packages

Step 1: Research Question/Problem Statement

NOTES:

Step 2: Data Collection/Measurement Strategy

Step 3: Visualize & Summarize Data

Step 4: Model Development/Interpretation

Step 5: Model Evaluation

Step 6: Communication of Results