R Tips & Tricks: Recreating within-column iteration as you would do in Excel

One of the easiest things to do in Excel is within-column iteration. By this I mean creating a new column whose starting value is 0 (or a value taken from another column) and where every following value in that column depends on the value immediately preceding it.

For example, in the table below we have a Value for each corresponding ID. New Value is calculated as the most recent observation of Value + lag(New Value) - 2. This is true for all observations except the first, which simply takes the Value of the first ID. So, for ID 2 we get New Value = 7 + 4 - 2 = 9, and for ID 3 we get New Value = 3 + 9 - 2 = 10.

ID   Value   New Value
1    4       4
2    7       9
3    3       10

This type of function is pretty common in Excel, but it can be a little tricky in R. I’ve been meaning to write a blog post about this after a few questions I’ve gotten, and Aaron Pearson reminded me about it last night, so let’s tackle it.

Creating Data

We will create two fake data sets:

  • Data set 1 will be a larger data set with multiple subjects.
  • Data set 2 will contain only one subject: a smaller data set for us to first get an understanding of what we are doing before performing the function over multiple people.

library(tidyverse)

## simulate data
set.seed(1)
subject <- rep(LETTERS[1:3], each = 50)
day <- rep(1:50, times = 3)
## values: each subject gets 50 observations drawn in three blocks (n = 20, 10, 20)
value <- c(
  round(rnorm(n = 20, mean = 120, sd = 40), 2),
  round(rnorm(n = 10, mean = 150, sd = 20), 2),
  round(rnorm(n = 20, mean = 110, sd = 30), 2),
  round(rnorm(n = 20, mean = 120, sd = 40), 2),
  round(rnorm(n = 10, mean = 150, sd = 20), 2),
  round(rnorm(n = 20, mean = 110, sd = 30), 2),
  round(rnorm(n = 20, mean = 120, sd = 40), 2),
  round(rnorm(n = 10, mean = 150, sd = 20), 2),
  round(rnorm(n = 20, mean = 110, sd = 30), 2))

df_1 <- data.frame(subject, day, value)

df_1 %>% head()

### Create a data frame of one subject for a simple example
df_2 <- df_1 %>%
  filter(subject == "A")

Exponentially Weighted Moving Average (EWMA)

We will apply an exponentially weighted moving average (EWMA) to the data, as this type of equation requires within-column iteration.

EWMA is calculated as:

EWMA_t = lambda * x_t + (1 - lambda) * Z_(t-1)

Where:

  • EWMA_t = the exponentially weighted moving average value at time t
  • lambda = the weighting factor (a value between 0 and 1)
  • x_t = the most recent observation
  • Z_(t-1) = the lagged EWMA value (the EWMA of the previous row)
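To make the recursion concrete, here is a minimal sketch of the EWMA calculation using a base R for() loop. The lambda of 0.3 is an arbitrary choice for illustration, not a value taken from the original post:

ewma_loop <- function(x, lambda = 0.3) {
  z <- numeric(length(x))
  z[1] <- x[1] # seed the series with the first observation
  for (i in 2:length(x)) {
    z[i] <- lambda * x[i] + (1 - lambda) * z[i - 1]
  }
  z
}

ewma_loop(c(10, 12, 9, 11))

This works, but a loop is clunky to use inside a data frame pipeline. accumulate() gives us the same recursion in a tidyverse-friendly way.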

accumulate()

Within {tidyverse} we will use the accumulate() function from {purrr}, which allows us to create this type of within-column iteration (a toy example follows the list below). The key pieces are:

  • The first argument is the column of observations we want to iterate over
  • .f is the function to apply at each step, written as a one-sided formula (in this example, the EWMA equation)
  • Inside that formula, .y represents the value of our most recent observation
  • Inside that formula, .x provides us with the lagged value of the new column we are creating
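To build intuition before applying this to our data, here is a toy example: a running sum, where .x carries the accumulated total forward and .y is the next element of the vector.

library(purrr) # loaded automatically as part of {tidyverse}

accumulate(1:5, ~ .x + .y) # returns 1 3 6 10 15, i.e., a cumulative sum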

Here is what it looks like in our smaller data set, df_2:


lambda <- 0.3 # weighting factor; the post doesn't show the value used, so 0.3 is an arbitrary choice

df_2 <- df_2 %>%
  mutate(ewma = accumulate(value, ~ lambda * .y + (1 - lambda) * .x))

Here, we are using mutate() to create a new column called ewma. We pass accumulate() the value column, which holds our observations, along with our function for calculating the EWMA, which follows the tilde.

Within this function we see .y, the most recent observation, and .x, the lagged ewma value. By default, the first row of the new ewma column will be the first observation in the value column. Here is what the first few rows of the data look like:

Now that the new column has been created we can visualize the observed values and the EWMA values:
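The original figure isn't reproduced here, but a plot along these lines shows the raw observations with the smoother EWMA line over the top (ggplot2 is loaded with {tidyverse}):

df_2 %>%
  ggplot(aes(x = day)) +
  geom_line(aes(y = value), color = "grey") +
  geom_line(aes(y = ewma), color = "red", size = 1.2) +
  labs(y = "value", title = "Observed values and EWMA, subject A")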

Applying the approach to all of the subjects

To apply this approach to all of the subjects in our data we simply need to use the group_by() function to tell R that we want to have the algorithm start over whenever it encounters a new subject ID.


df_1 <- df_1 %>%
  group_by(subject) %>%
  mutate(ewma = accumulate(value, ~ lambda * .y + (1 - lambda) * .x))

And then we can plot the outcome:
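Again, the original figure isn't shown here, but the same plot code works for the full data set with a facet per subject:

df_1 %>%
  ggplot(aes(x = day)) +
  geom_line(aes(y = value), color = "grey") +
  geom_line(aes(y = ewma), color = "red", size = 1.2) +
  facet_wrap(~subject)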

Pretty easy!

What if we want the start value to be 0 (or something else) instead of the first observation?

This is a quick fix within the accumulate() function using the .init argument: simply pass it whatever value you want the new column to begin with. Be aware, however, that .init adds an additional observation to the output vector, so we need to drop one observation from the input to keep the column lengths equal; otherwise {tidyverse} will throw an error. To accomplish this, when I pass the value column to the function I index it with brackets and remove the last element using n(), the total count of observations in that column.

df_2 %>%
  mutate(ewma = accumulate(value[-n()], ~ lambda * .y + (1 - lambda) * .x, .init = 0)) %>%
  head()

Now we see that the first value in ewma is 0 instead of 94.94, which of course changes all of the values that follow it, since the equation uses the lagged ewma value (.x).

For the complete code, please see my GitHub Page.


TidyX 80: Tuning Decision Trees in tidymodels

Ellis Hughes and I discuss how to tune decision trees for regression within the {tidymodels} framework. We cover:

* Pre-processing data
* Splitting data into training and test sets
* Setting tuning parameters and a tuning grid
* Fitting models and gathering model evaluation metrics
* Selecting the final model following tuning and fitting that model to the test data set
* Visualizing your outcomes

To watch our screen cast, CLICK HERE.

To access our code, CLICK HERE.

Simulating preferential attachment in R

I’m currently re-reading Michael Mauboussin’s Success Equation. The book is a discussion of the role both skill and luck play in business and sport success. On page 118, Mauboussin discusses the Matthew Effect. The Matthew Effect, a term coined by sociologist Robert Merton, comes from a passage in the Bible, in the Gospel of Matthew:

“For whosoever hath, to him shall be given, and he shall have more abundance: but whosoever hath not, from him shall be taken away even that he hath.”

In a nutshell, the Matthew Effect describes the phenomenon of “the rich get richer and the poor get poorer”.

Mauboussin goes on to provide an example of two graduate students, both with equal ability. Following graduation, the two students apply for faculty positions. One is hired by an Ivy League university while the other goes to work at a less prestigious university. The Ivy League professor has a wonderful opportunity, with perhaps more qualified students, high-caliber faculty peers, and more funding for research. Such an opportunity leads to more scientific publications and greater notoriety and accolades in comparison to their peer at the less prestigious university.

As Mauboussin says, “initial conditions matter”. Both students had the same level of skill but different levels of luck. Student one’s initial condition of obtaining a faculty position at an Ivy League university set her up for better opportunities in her professional career, despite not being any more talented than student two.

Such an example applies in many areas of our lives, not just sport and business. In the educational sector, for example, some students grow up in areas of the country where the public school environment does not provide the same educational experience that more affluent regions offer. These students may not be any less intelligent than their peers; however, their initial conditions are not the same, ultimately influencing the rest of their life opportunities and how things look at the finish line.

Luck ends up playing a big role in our lives and the starting line isn’t the same for everyone. Mauboussin refers to this as preferential attachment, whereby the more connections you start with in life, the more new connections you are able to make. To show this concept, Mauboussin creates a simple game of drawing marbles from a jar (pg. 119):

We have a jar filled with the following marbles:

  • 5 red
  • 4 black
  • 3 yellow
  • 2 green
  • 1 blue

You close your eyes and select a marble at random. You then place that marble back in the jar and add one more marble of the same color. For example, let’s say you reach in and grab a yellow marble. You put the yellow marble back in the jar and add one more yellow marble so that there are now 4 yellow marbles in the jar. You repeat this game 100 times.

We can clearly see that, starting out, some marbles have a higher chance of being selected than others. For example, there is a 33.3% chance (5/15) of selecting a red marble and only a 6.7% chance (1/15) of selecting a blue marble. The kicker is that, because of this difference in starting points, as you select red marbles you also add more red marbles, increasing the probability of selecting red in the future even further! The red and black marbles begin with a higher number of connections than the other marbles, and thus over time their wealth in connections grows larger.

Let’s see what this looks like in an R simulation!

First, we create our initial starting values for the marbles in the jar:
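The original code lives on my GitHub page; one simple representation of the jar is a character vector with one element per marble:

library(tidyverse)

## starting state of the jar: 5 red, 4 black, 3 yellow, 2 green, 1 blue
jar <- c(rep("red", 5), rep("black", 4), rep("yellow", 3),
         rep("green", 2), "blue")

table(jar)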

Let’s play the game one time and see how it works. We reach in, grab a marble at random, and whatever color we get, we will add an additional marble of that same color back to the jar.
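A single trial looks something like this. No seed is set, so your draw will almost certainly differ from the green marble described below:

## draw one marble at random, then return it to the jar
## along with one more marble of the same color
draw <- sample(jar, size = 1)
jar <- c(jar, draw)

draw
table(jar)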


In this trial, we selected a green marble. Therefore, there are now 3 green marbles in the jar instead of 2.

If we were to do this 100 times by hand, it would be pretty tedious. Instead, we will write a for() loop that plays the game for us, each time selecting a marble at random and then adding an additional marble of the same color to the jar.
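Here is one way to write that loop. The seed is arbitrary (the post's seed isn't shown), and we store the count of each color after every trial so we can plot the growth later:

set.seed(2) # arbitrary seed; your results will differ from the post's run

n_trials <- 100
colors <- c("red", "black", "yellow", "green", "blue")

## reset the jar to its starting state before running the full game
jar <- c(rep("red", 5), rep("black", 4), rep("yellow", 3),
         rep("green", 2), "blue")

## rows = starting state plus one row per trial, columns = marble colors
history <- matrix(NA, nrow = n_trials + 1, ncol = length(colors),
                  dimnames = list(NULL, colors))
history[1, ] <- table(factor(jar, levels = colors)) # trial 0: initial counts

for (i in 1:n_trials) {
  draw <- sample(jar, size = 1) # select a marble at random
  jar <- c(jar, draw)           # return it plus one more of the same color
  history[i + 1, ] <- table(factor(jar, levels = colors))
}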

After running the loop of 100 trials, we end up observing the following number and proportion for each marble color:
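Using the history matrix from the loop above, the final counts and proportions can be pulled from the last row (your numbers will vary with the seed):

## counts and proportions after the final trial
final <- history[n_trials + 1, ]
final
round(final / sum(final), 3)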

Notice that when we started, 26.7% of the marbles were black and 6.7% were blue. After 100 trials of our game, black now makes up 32% of the population while blue is only at 7%. Remember, these are random samples, so it is pure chance which marble we select in each trial of the game. However, the initial conditions were more favorable for black and less favorable for blue, creating different ending points.

We can take our simulated data and build a plot of the trials, recreating the plot that Mauboussin shows on page 121:
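Here is a sketch of that plot, using the history matrix recorded in the loop above:

as.data.frame(history) %>%
  mutate(trial = 0:n_trials) %>%
  pivot_longer(cols = -trial, names_to = "marble", values_to = "count") %>%
  ggplot(aes(x = trial, y = count, color = marble)) +
  geom_line(size = 1.2) +
  labs(x = "Trial", y = "Number of marbles in the jar")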

The visual is not exactly what you see on pg. 121 because this is a random simulation. But we can see how each marble grows over time based on its starting point (which, you will notice, differs on the y-axis at trial number 0: the initial number of marbles in the jar).

If you run this code yourself, you will get a slightly different outcome as well. Try it a few times and see how random luck changes the final position of each marble. Increase the number of trials from 100 to 1,000 or 10,000 and see what happens! Simulations like this provide an interesting opportunity to understand the world around us.

The code for creating the simulation and visual is available on my GitHub page.