Model Building in R | Patrick Ward, PhD

I’m currently re-reading Michael Mauboussin’s Success Equation. The book is a discussion about the roll both skill and luck play in business and sport success. On page 118, Mauboussin discusses the Mathew Effect. The Mathew Effect, termed by sociologist Robert Merton, comes from a phrase in the bible written in the Gospel of Matthew:

“For whosoever hath, to him shall be given, and he shall have more abundance: but whosoever hath not, from him shall be taken away even that he hath.”

In a nutshell, the Mathew Effect is describing the phenomenon, “the rich get richer and the poor get poorer”.

Mauboussin goes on to provide an example of two graduate students, both with equal ability. Following graduation, the two students are applying for faculty positions. One is hired by an Ivey League university while the other goes to work at a less prestigious university. The Ivey League professor has a wonderful opportunity with perhaps more qualified students, high caliber faculty peers, and more funding for research. Such an opportunity leads to more scientific publications and greater notoriety and accolades in comparison to their peer at the less prestigious university.

As Mauboussin says, “initial conditions matter”. Both students had the same level of skill but different levels of luck. Student one’s initial condition of obtaining a faculty position at an Ivey League university set her up for better opportunities in her professional career, despite not being any more talented than student two.

Such an example applies in many areas of our lives, not just sport and business. For example, in the educational sector, some students may grow up in areas of the country where the public school environment does not provide the same educational experience that more affluent regions might. These students may not be any less intelligent than their peers, however, their initial conditions are not the same, ultimately having an influence in how the rest of their life opportunities turn out and how things look at the finish line.

Luck ends up playing a big role in our lives and the starting line isn’t the same for everyone. Mauboussin refers to this as preferential attachment, whereby the more connections you start with in life, the more new connections you are able to make. To show this concept, Mauboussin creates a simple game of drawing marbles from a jar (pg. 119):

We have a jar filled with the following marbles:

5 red
4 black
3 yellow
2 green
1 blue

You close your eyes and select a marble at random. You then place that marble back in the jar and add one more marble of the same color. For example, let’s say you reach in and grab a yellow marble. You put the yellow marble back in the jar and add one more yellow marble so that there are now 4 yellow marbles in the jar. You repeat this game 100 times.

We can clearly see that starting out, some marbles have a higher chance of being selected than others. For example, there is a 33.3% chance (5/15) of selecting a red marble and only a 6.7% (1/15) chance of selecting a blue marble. The kicker is that, because of the difference in starting points as you select red marbles you end up also adding more red marbles, increasing the probability of selecting future red marbles even further! The red and black marbles begin with a higher number of connections than the other marbles and thus overtime their wealth in connections grows larger.

Let’s see what this looks like in an R simulation!

First, we create our initial starting values for the marbles in the jar:

Let’s play the game one time and see how it works. We reach in, grab a marble at random, and whatever color we get, we will add an additional marble of that same color back to the jar.

In this trial, we selected a green marble. Therefore, there are now 3 green marbles in the jar instead of 2.

If we were to do this 100 times, it would be pretty tedious. Instead, we will write a for() loop that can play out the game for us, each time selecting a marble at random and then adding an additional marble to the jar of the same color.

After running the loop of 100 trials, we end up observing the following number and proportion for each marble color:

Notice that when we started 26.7% of the marbles were black and 6.7% were blue. After 100 trials of our game, black now makes up 32% of the population while blue is only at 7%. Remember, these are random samples, so it is pure chance as to which marble we select in each trial of the game. However, the initial conditions were more favorable for black and less favorable for blue, creating different ending points.

We can take our simulated data and build a plot of the trials, recreating the plot that Mauboussin shows on page 121:

The visual is not exactly what you see on pg. 121 because this is a random simulation. But we can see how each marble grows overtime based on their starting point (which you will notice is different on the y-axis at trial number 0 – the initial number of marbles in the jar).

If you run this code yourself, you will get a slightly different outcome as well. Try it a few times and see how random luck changes the final position of each marble. Increase the number of trials from 100 to 1,000 or 10,000 and see what happens! Simulations like this provide an interesting opportunity to understand the world around us.

The code for creating the simulation and visual are available on my GITHUB page.

In our most recent TidyX screen cast we talked a bit about correlation, which compelled me to expand on Pearson’s correlation to show how you can do things from first principles in R, by writing your own custom functions. This tutorial covers:

A brief explanation of Pearson’s correlation
R functions available to obtain Pearson’s correlation coefficient
How to write your own custom functions to understand the underlying principles of Pearson’s correlation coefficient and it’s confidence intervals

Pearson’s Correlation

Read almost any scientific journal article and you are sure to encounter Pearson’s correlation coefficient (denoted as r), as it is a commonly used metric for describing relationships within ones data. Take any basic stats course and you are bound to hear the phrase, “Correlation does not equal causation”, due to the fact that just because things are correlated does not mean that one definitely caused the other (such a relationship would need to be teased out of more specifically and the natural world is full of all sorts of random correlations).

Pearson’s correlation is a descriptive statistic used to quantify the linear relationship between two variables. It does so by measuring how much two variables covary with each other. The correlation coefficient is scaled between -1 and 1 where 1 means that the two variables have a perfect positive relationship (as one variable increases the other variable also always increases) while -1 means that the two variables have a perfect negative relationship (as one variable decreases the other variable also always decreases). A correlation coefficient of 0 suggests that the two variables do not share a linear relationship. It is rare that we see completely perfect correlations (either 1 or -1) and often our data will present with some scatter suggesting a specific trend (positive or negative) but with some amount of variability.

Data Preparation

R offers a few convenient functions for obtaining the correlation coefficient between two variables. For this tutorial, we will work with the Lahman baseball data set, which is freely available in R once you have installed the {Lahman} package. We will also use the {tidyverse} package for data manipulation and visualization.

We will use the Batting data set, which contains historic batting statistics for players, and the Master data set, which contains meta data on the players (e.g., birth year, death year, hometown, Major League debut, etc).

I do a little bit of data cleaning to obtain only the batting statistics from the 2016 season for those that had at least 200 at bats and I join that data with the player’s birth year from the Master data set. The two hitting variables I’ll use for this tutorial are Hits and RBI. One final thing I do is create a quantile bin for the player’s age so that I can look at correlation across age groups later in the tutorial.

Here is what the data looks like so far:

Data Visualization

Once we have pre-processed the data we can visualize it to see how things look. We will plot two visuals: (1) the number of players per age group; and, (2), a plot of the linear relationship between Hits and RBI.

Correlation in R

R offers a few convenient functions for obtaining the correlation coefficient between two variables. The two we will use are cor() and cor.test(). The only arguments you need to pass to these two functions is the X and Y variables you are interested in. The first function will produce a single correlation coefficient while the latter function will produce the correlation coefficient along with confidence intervals and information about the hypothesis test. Here is what their respective outputs looks like for the correlation between Hits and RBI:

Pretty easy! We can see that the correlation coefficient is the same between both functions, as it should be (r = 0.82). In the bottom output we can see that the 95% Confidence Intervals range from r = 0.78 to r = 0.85. The correlation is high between these two variables but it is not perfect, which we can better appreciate from the scatter plot above showing variability around the regression line.

Correlation from First Principles

One of the best ways to understand what is going on behind the custom functions in your stats program (no matter if it is R, Python, SPSS, SAS, or even Excel) is to try and build your own functions by hand. Doing so gives you an appreciation for some of the inner workings of these statistics and will help you better understand more complex statistics latter on.

I’ll build two functions:

A function to calculate the Pearson’s correlation coefficient
A function to calculate the confidence intervals around the correlation coefficient

Pearson’s Correlation Coefficient Function

Similar to the built in R functions, this function will take inputs of an X and Y variable.
You can see the math for calculating the correlation coefficient in the function below. We start by subtracting the mean for each column from each observation. We then multiply the differences for each row and then produce a column of squared differences for each variable. Those values provide the inputs for the correlation coefficient in the second to last line of the function.

cor_function <- function(x, y){
  
  dat <- data.frame(x, y)
  dat <- dat %>%
    mutate(diff_x = x - mean(x),
           diff_y = y - mean(y),
           xy = diff_x * diff_y,
           x2 = diff_x^2,
           y2 = diff_y^2)
  
  r <- sum(dat$xy) / sqrt(sum(dat$x2) * sum(dat$y2))
  return(r)
}

Confidence Interval for Pearson’s R

This function takes three inputs: (1) the correlation coefficient between two variables (calculated above); (2) the sample size (the number of observations between X and Y); and, (3) The Confidence Level of Interest.
NOTE: To input a confidence level of interest I only set it up for three options, 0.9, 0.95, or 0.99, for the 90%, 95% and 99% Confidence Interval respectively. I could set it up to take any level of confidence and calculate the appropriate critical value but the function (as you can see) was already getting long and messy so I decided to cut it off and keep it simple for illustration purposes here.
This function is a bit more involved than the previous one and that’s because we require some transformations of the data. We have to transform the correlation coefficient using Fisher’s Z-Transformation in order to create a normal distribution. From there we can calculate our confidence intervals and then back transform the values so that they are on the scale of r.

cor.CI <- function(r, N, CI){
  
  fisher.Z <- .5*log((1+r)/(1-r))
  
  se.Z <- sqrt(1/(N-3))
  
  if(CI == .9){
    MOE <- 1.65*se.Z}
  else {
    if(CI == .95){
      MOE <- 1.95*se.Z}
    else {
      if(CI ==.99){
        MOE <- 2.58*se.Z}
      else{
        NA
      }
    }
  }
  
  Lower.Z <- fisher.Z - MOE
  Upper.Z <- fisher.Z + MOE
  Lower.cor.CI <- (exp(2*Lower.Z)-1)/(exp(2*Lower.Z)+1)
  Upper.cor.CI <- (exp(2*Upper.Z)-1)/(exp(2*Upper.Z)+1)
  Correlation.Coefficient.CI <- data.frame(r, Lower.cor.CI, Upper.cor.CI)
  Correlation.Coefficient.CI <- round(Correlation.Coefficient.CI, 3)
  return(Correlation.Coefficient.CI)
  }

Seeing our Functions in Action

We’ve obtained similar results to what was produced from the custom R functions!

We can also look at correlation across age bins. To do this, we will use {tidyverse} so that we can group_by() the age bins that we predefined.

First with the built in R functions

Using the built in cor.test() function we need to call the confident interval specifically from the output and specifying [1] and [2] tells R that we want the first value (lower confidence interval) and the second value (higher confidence interval), respectively.

yr2016 %>%
  group_by(AgeBin) %>%
  summarize(COR = cor(H, RBI),
            COR_Low_CI = cor.test(H, RBI)$conf.int[1],
            COR_High_CI = cor.test(H, RBI)$conf.int[2])

Now with our custom R functions

Similar to above, we need to extract the confidence interval out of the confidence interval function’s output. In this case, if you recall, there were three outputs (correlation coefficient, Low CI, and High CI). As such, we want to indicate [2] and [3] for the lower and upper confidence interval, respectively, since that is where they are located in the model output.

yr2016 %>%
  group_by(AgeBin) %>%
  summarize(COR = cor_function(H, RBI),
            cor.CI(r = COR,
                   N = n(),
                   CI = 0.95)[2],
            cor.CI(r = COR,
                   N = n(),
                   CI = 0.95)[3])

Conclusion

Hopefully this tutorial was useful in helping you to understand Pearson’s correlation and how easy it is to write functions in R that allow us to explore our data from first principles. All of the code for this tutorial is available on my GitHub page.

Patrick Ward, PhD

Patrick Ward, PhD | Sports Science & Analytics

Category Archives: Model Building in R

Simulating preferential attachment in R

R Tips & Tricks: Pearson Correlation from First Principles