Simulating preferential attachment in R

I’m currently re-reading Michael Mauboussin’s The Success Equation. The book discusses the roles that skill and luck play in success in business and sport. On page 118, Mauboussin discusses the Matthew Effect. The term, coined by sociologist Robert Merton, comes from a passage in the Gospel of Matthew:

“For whosoever hath, to him shall be given, and he shall have more abundance: but whosoever hath not, from him shall be taken away even that he hath.”

In a nutshell, the Matthew Effect describes the phenomenon of “the rich get richer and the poor get poorer”.

Mauboussin goes on to provide an example of two graduate students of equal ability. Following graduation, both apply for faculty positions. One is hired by an Ivy League university while the other goes to work at a less prestigious university. The Ivy League professor has a wonderful opportunity, with perhaps more qualified students, high-caliber faculty peers, and more funding for research. Such an opportunity leads to more scientific publications and greater recognition and accolades than her peer at the less prestigious university receives.

As Mauboussin says, “initial conditions matter”. Both students had the same level of skill but different levels of luck. Student one’s initial condition of obtaining a faculty position at an Ivy League university set her up for better opportunities in her professional career, despite not being any more talented than student two.

Such an example applies in many areas of our lives, not just sport and business. In education, for example, some students grow up in areas of the country where the public schools do not provide the same educational experience as schools in more affluent regions. These students may be no less intelligent than their peers, but their initial conditions are not the same, and that difference influences how the rest of their life opportunities turn out and how things look at the finish line.

Luck ends up playing a big role in our lives and the starting line isn’t the same for everyone. Mauboussin refers to this as preferential attachment, whereby the more connections you start with in life, the more new connections you are able to make. To show this concept, Mauboussin creates a simple game of drawing marbles from a jar (pg. 119):

We have a jar filled with the following marbles:

  • 5 red
  • 4 black
  • 3 yellow
  • 2 green
  • 1 blue

You close your eyes and select a marble at random. You then place that marble back in the jar and add one more marble of the same color. For example, let’s say you reach in and grab a yellow marble. You put the yellow marble back in the jar and add one more yellow marble so that there are now 4 yellow marbles in the jar. You repeat this game 100 times.

We can clearly see that, starting out, some marbles have a higher chance of being selected than others. For example, there is a 33.3% chance (5/15) of selecting a red marble and only a 6.7% chance (1/15) of selecting a blue marble. The kicker is that, because of the difference in starting points, every red marble you select also adds another red marble to the jar, increasing the probability of selecting red in future draws even further! The red and black marbles begin with more connections than the other marbles, and thus over time their wealth in connections grows larger.

Let’s see what this looks like in an R simulation!

First, we create our initial starting values for the marbles in the jar:
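
The original code for this step is not reproduced here, so what follows is a minimal sketch of one way to set it up. The object name `jar` is my own choice; the counts come straight from the list above.

```r
# Starting number of marbles of each color, as listed above
jar <- c(red = 5, black = 4, yellow = 3, green = 2, blue = 1)

# Initial probability of drawing each color (count / total of 15)
start_prob <- jar / sum(jar)
round(start_prob, 3)
```

A named numeric vector keeps the color labels attached to the counts, which makes the later updates easy to read.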

Let’s play the game one time and see how it works. We reach in, grab a marble at random, and whatever color we get, we will add an additional marble of that same color back to the jar.
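
A single round of the game might be sketched like this. The seed is arbitrary and only makes the draw repeatable; the color you get depends on it, so it need not match the trial described here.

```r
# Starting jar (same counts as the list above)
jar <- c(red = 5, black = 4, yellow = 3, green = 2, blue = 1)

set.seed(1)  # arbitrary seed, used only so the draw is repeatable

# Draw one color with probability proportional to its current count
picked <- sample(names(jar), size = 1, prob = jar)

# Put the marble back and add one more of the same color
jar[picked] <- jar[picked] + 1

picked
jar
```

`sample()` rescales the `prob` weights internally, so the raw counts can be passed in directly.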

 

In this trial, we selected a green marble. Therefore, there are now 3 green marbles in the jar instead of 2.

If we were to do this 100 times, it would be pretty tedious. Instead, we will write a for() loop that can play out the game for us, each time selecting a marble at random and then adding an additional marble to the jar of the same color.
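
Here is a rough sketch of such a loop. The names `n_trials` and `history` and the seed are assumptions of mine, not taken from the original code.

```r
jar <- c(red = 5, black = 4, yellow = 3, green = 2, blue = 1)
n_trials <- 100

# One row per state of the jar: row 1 is the start, row i + 1 is after trial i
history <- matrix(
  NA_real_,
  nrow = n_trials + 1,
  ncol = length(jar),
  dimnames = list(NULL, names(jar))
)
history[1, ] <- jar

set.seed(2022)  # arbitrary seed for repeatability

for (i in seq_len(n_trials)) {
  # Draw a color in proportion to the current contents of the jar
  picked <- sample(names(jar), size = 1, prob = jar)

  # Return it and add one more marble of that color
  jar[picked] <- jar[picked] + 1

  # Record the state of the jar after this trial
  history[i + 1, ] <- jar
}

jar
```

Storing every intermediate state in `history` costs little and makes it possible to plot the growth of each color afterwards.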

After running the loop of 100 trials, we end up observing the following number and proportion for each marble color:
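
A summary table of counts and percentages could be built along these lines. `simulate_jar()` is a hypothetical helper written for this sketch, and because the seed is arbitrary your numbers will differ from the ones quoted in this post.

```r
# Hypothetical helper: play the game n_trials times and return the final jar
simulate_jar <- function(start, n_trials) {
  jar <- start
  for (i in seq_len(n_trials)) {
    picked <- sample(names(jar), size = 1, prob = jar)
    jar[picked] <- jar[picked] + 1
  }
  jar
}

start <- c(red = 5, black = 4, yellow = 3, green = 2, blue = 1)

set.seed(2022)  # arbitrary seed
final <- simulate_jar(start, n_trials = 100)

# Counts and percentages before and after the game
results <- data.frame(
  color     = names(final),
  start_n   = as.vector(start),
  start_pct = round(100 * as.vector(start) / sum(start), 1),
  final_n   = as.vector(final),
  final_pct = round(100 * as.vector(final) / sum(final), 1)
)
results
```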

Notice that when we started, 26.7% of the marbles were black and 6.7% were blue. After 100 trials of our game, black now makes up 32% of the population while blue is only at 7%. Remember, these are random samples, so it is pure chance as to which marble we select in each trial of the game. However, the initial conditions were more favorable for black and less favorable for blue, creating different ending points.

We can take our simulated data and build a plot of the trials, recreating the plot that Mauboussin shows on page 121:
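
The original plotting code is not shown here. As a stand-in, this sketch uses base R's `matplot()` so that it runs without extra packages; the line colors, labels, and seed are my own choices.

```r
start <- c(red = 5, black = 4, yellow = 3, green = 2, blue = 1)
n_trials <- 100

# Re-run the game, recording the jar after every trial (row 1 is the start)
jar <- start
history <- matrix(
  NA_real_,
  nrow = n_trials + 1,
  ncol = length(jar),
  dimnames = list(NULL, names(jar))
)
history[1, ] <- jar

set.seed(2022)  # arbitrary seed
for (i in seq_len(n_trials)) {
  picked <- sample(names(jar), size = 1, prob = jar)
  jar[picked] <- jar[picked] + 1
  history[i + 1, ] <- jar
}

# One line per color; x = 0 is the initial jar
line_cols <- c("red", "black", "gold", "darkgreen", "blue")
matplot(
  x = 0:n_trials, y = history,
  type = "l", lty = 1, lwd = 2, col = line_cols,
  xlab = "Trial", ylab = "Number of marbles in the jar",
  main = "Preferential attachment: marbles by color"
)
legend("topleft", legend = colnames(history),
       col = line_cols, lty = 1, lwd = 2, bty = "n")
```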

The visual is not exactly what you see on pg. 121 because this is a random simulation. But we can see how each marble color grows over time from its starting point (which you will notice differs on the y-axis at trial number 0, the initial number of marbles in the jar).

If you run this code yourself, you will get a slightly different outcome as well. Try it a few times and see how random luck changes the final position of each marble. Increase the number of trials from 100 to 1,000 or 10,000 and see what happens! Simulations like this provide an interesting opportunity to understand the world around us.

The code for creating the simulation and visual is available on my GitHub page.

TidyX 77: Intro to tidymodels

Ellis Hughes and I just wrapped up our series on using SQL in R and have decided to move on to doing a series on tidymodels.

For those who don’t know, tidymodels is an approach to building machine learning models in R using tidyverse principles. Up until this point, most of our model building has been in either the native package for the given model or the caret package (which tidymodels has now replaced). So, we are super excited to get into the tidymodels framework and learn along with you! Each week we will try to build on a different component of modeling within tidymodels.

This first week is a basic introduction to tidymodels and the broom package (which is automatically loaded with tidymodels). We cover how to set up the model and obtain the model outputs in a nice tidy manner.
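
To give a flavor of that workflow, here is a small sketch. It is not the code from the episode: the `mtcars` data and the `mpg ~ wt` formula are arbitrary choices of mine, and it assumes the tidymodels package is installed.

```r
library(tidymodels)

# Model specification: linear regression fit with base R's lm() engine
lm_spec <- linear_reg() %>%
  set_engine("lm")

# Fit on mtcars (ships with R); the formula is only an illustration
lm_fit <- lm_spec %>%
  fit(mpg ~ wt, data = mtcars)

# Coefficients as a tidy tibble, one row per term
coef_table <- tidy(lm_fit)
coef_table

# One-row model summary (R-squared, AIC, etc.) from the underlying lm object
fit_stats <- glance(lm_fit$fit)
fit_stats
```

`tidy()` and `glance()` come from broom, which is why the outputs are data frames rather than printed summaries.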

To watch our screencast, CLICK HERE.

To access our code, CLICK HERE.

TidyX 76: Polling databases for a multi-user interactive shiny table

Ellis Hughes and I wrap up our SQL database/shiny series by taking a question from one of our viewers.

In TidyX 75, we built a {shiny} app that allowed the user to update a table and save the results back to the database. One of the viewers asked if we could address the issue of multiple users editing the table simultaneously, ultimately canceling out the notes that they are both writing to the database. So, we have addressed this in our recent episode.
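
One common way to poll a database from {shiny} is `reactivePoll()`. The sketch below illustrates that idea and may not match what we do in the episode. The `notes` table, its columns, and the in-memory SQLite database are all hypothetical stand-ins, and it assumes the shiny, DBI, and RSQLite packages are installed.

```r
library(shiny)
library(DBI)

# Hypothetical in-memory SQLite database standing in for the real one
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "notes", data.frame(
  id = 1:2,
  note = c("first note", "second note"),
  updated_at = c(1, 2)  # hypothetical change counter bumped on every write
))

# Cheap check, run on every poll: has anything changed?
last_update <- function() {
  dbGetQuery(con, "SELECT MAX(updated_at) AS last_update FROM notes")$last_update
}

# Full read, run only when last_update() returns a new value
read_notes <- function() {
  dbGetQuery(con, "SELECT id, note, updated_at FROM notes ORDER BY id")
}

ui <- fluidPage(tableOutput("notes"))

server <- function(input, output, session) {
  # Each user session re-checks the database every 2 seconds
  notes <- reactivePoll(
    intervalMillis = 2000,
    session = session,
    checkFunc = last_update,
    valueFunc = read_notes
  )
  output$notes <- renderTable(notes())
}

# shinyApp(ui, server)  # uncomment to launch the app
```

Polling only keeps every user's view current. Deciding what happens when two users edit the same row at once still needs its own rule in the write path.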

To watch the screencast, CLICK HERE.

To access the code, CLICK HERE.

The Nordic Hamstring Exercise, Hamstring Strains, & the Scientific Process

Doing science in the lab is hard.

Doing science in the applied environment is hard — maybe even harder than the lab, at times, due to all of the variables you are unable to control for.

Reading and understanding science is also hard.

Let’s face it, science is tough! So tough, in fact, that scientists themselves have a difficult time with all of the above, and they do this stuff for a living! As such, to keep things in check, science applies a peer-review process to ensure that a certain level of standard is upheld.

Science has a very adversarial quality to it. One group of researchers formulates a hypothesis, conducts some research, and makes a claim. Another group of researchers looks at that claim and says, “Yeah, but…”, and then goes to work trying to poke holes in it, looking to answer the question from a different angle, or trying to refute it altogether. The process continues until the scientific community reaches some type of consensus based on all of the available evidence.

This back-and-forth tennis match of science has a lot to teach those looking to improve their ability to read and understand research. Reading methodological papers and letters to the editor offers a glimpse into how other, more seasoned, scientists think and approach a problem. You get to see how they construct an argument, deal with a rebuttal, and discuss the limitations of both the work they are questioning and the work they are conducting themselves.

All of this brings me to a recent publication from Franco Impellizzeri, Alan McCall, and Maarten van Smeden, Why Methods Matter in a Meta-Analysis: A reappraisal showed inconclusive injury prevention effect of Nordic hamstring exercise.

The paper is directed at a prior meta-analysis which aggregated the findings of several studies with the aim of understanding the role of the Nordic hamstring exercise (NHE) in reducing the risk of hamstring strain injuries in athletes. In a nutshell, Impellizzeri and colleagues felt that the conclusions and claims made in the original meta-analysis were too optimistic, as the title of the paper suggested that performing the NHE can halve the rate of hamstring injuries (Risk Ratio: 0.49, 95% CI: 0.32 to 0.74). Moreover, Impellizzeri et al. identified some methodological flaws in how the meta-analysis was performed.

The reason I like this paper is that it steps you through the thought process of Impellizzeri and colleagues. First, it discusses the limitations that they feel are present in the previous meta-analysis. They then conduct their own meta-analysis by re-analyzing the data. However, they apply stricter inclusion criteria and, therefore, only include 5 papers from the original study (they also identified and included a newer study that met their criteria). In analyzing the original five papers, they found prediction intervals ranging from 0.06 to 5.14. (Side Note: Another cool piece of this paper is the discussion and reporting of both confidence intervals and prediction intervals, the latter of which are rarely discussed in the sport science literature).
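
To make the difference between the two intervals concrete, here is a small sketch using a commonly used approximation for a random-effects prediction interval: the pooled estimate plus or minus a t quantile (k - 2 degrees of freedom) times the square root of tau-squared plus the squared standard error. I am not claiming this is the exact method used in the paper, and the inputs below are made-up numbers for illustration, not values from either meta-analysis.

```r
# Confidence interval for the pooled risk ratio (normal approximation, log scale)
confidence_interval <- function(log_rr, se, level = 0.95) {
  z <- qnorm(1 - (1 - level) / 2)
  exp(c(lower = log_rr - z * se, upper = log_rr + z * se))
}

# Approximate prediction interval for the effect in a new study:
# adds the between-study variance (tau2) and uses a t quantile with k - 2 df
prediction_interval <- function(log_rr, se, tau2, k, level = 0.95) {
  stopifnot(k >= 3)
  t_crit <- qt(1 - (1 - level) / 2, df = k - 2)
  half_width <- t_crit * sqrt(tau2 + se^2)
  exp(c(lower = log_rr - half_width, upper = log_rr + half_width))
}

# Hypothetical inputs, NOT taken from the paper
log_rr <- log(0.6)  # pooled log risk ratio
se     <- 0.25      # standard error of the pooled log risk ratio
tau2   <- 0.30      # between-study variance
k      <- 6         # number of studies

ci <- confidence_interval(log_rr, se)
pi <- prediction_interval(log_rr, se, tau2, k)
round(rbind(confidence = ci, prediction = pi), 2)
```

The confidence interval describes uncertainty in the average effect, while the prediction interval describes where the effect in a new setting might plausibly fall, so it widens as between-study heterogeneity grows.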

The paper is a nice read if you want to see the thought process around how scientists read research and go about the scientific process of challenging a claim.

Some general thoughts

  • Wide prediction intervals leave us with a lot of uncertainty around the effectiveness of NHE in reducing the risk of hamstring strain injuries. At the lower end of the interval, NHE could be beneficial and protective for some, while at the upper end it could be harmful to others.
  • A possible reason for the large uncertainty in the relationship between NHE and hamstring strain injury is that we might be missing key information about the individual that could indicate whether the exercise would be helpful or harmful. For example, context around their previous injury history (hamstring strains in particular), their training age, or other variables within the training program, all might be useful information to help paint a clearer picture about who might benefit the most or the least.
  • Injury is highly complex and multi-faceted. Trying to pin a decrease in injury risk to a single exercise or a single type of intervention seems like a bit of a stretch.
  • No exercise is a panacea and we shouldn’t treat them as such. If you like doing NHE for yourself (I actually enjoy it!) then do it. If you don’t, then do something else. Let’s not believe that certain exercises have magical properties and let’s not fight athletes about what they are potentially missing from not including an exercise in their program when we don’t even have good certainty on whether it is beneficial or not.