TidyX Episode 19: {formattable} for Dashboard Design

This week, Ellis Hughes and I explain coded written by Lauren Pandori, who built a nice dashboard for data on astronaut missions provided by the TidyTuesday Project.

Lauren had a few questions for us regarding other elements she wanted to add to her dashboard so Ellis and I tackle those and how you how you can build your own dashboard using the {formattable} and {sparkline} packages.

The end product is a dashboard that has trend lines, conditional formatting, and conditional bar plots all in one. The nice thing is that we show you how you can save the dashboard to an HTML so that it is easy to share with colleagues and co-workers. Finally, during the data exploration phase we walk through a cool way to use {plotly} to visualize trend lines in an interactive manner.

To watch our screen cast, CLICK HERE.

To get our code, CLICK HERE.

R Tips & Tricks: Force-Velocity-Power Profile Graphs in R Shiny

A colleague of mine was asking about how he could produce plots of force-velocity-power profiles for his coaches based on their Gymaware testing. Rather than plotting static plots for this stuff it is sometimes easier to build out a nice shiny app so that the coaches can interact with it or the practitioner can quickly change between players or position groups when giving a presentation to the staff.

All code and data is available at my GITHUB page (R Script and Data).

This tutorial will cover:

1) Building polynomial regression to represent the average team trend
2) Iterative approaches to building static plots
3) Iterative approaches to building Shiny web apps

This tutorial has a number of different iterations of plots and web apps so working through the R-code on your own is advised so you can see how every step is performed.

Data

After loading the two required packages, {tidyverse} and {shiny}, we load in the data set and we see that it consists of Force, Power, and Velocity values across 5 different external loads for 3 different athletes:

If you want to use the R-script to produce reports for yourself going forward just ensure that your data has the above columns (named the same way, since R is case-sensitive). If you have data with different column names, your two choices are: (1) change my code to match the column names of your data; or, (2) once you pull the data into R change the columns names in your code to match mine.

Average Trend Line (2nd Order Polynomial)

Eventually, we are going to want to compare our athletes to the team average (or position group average if your sport is more heterogeneous). This type of data is often modeled as a. 2nd order polynomial regression. Thus, we will build this type of regression to predict both Velocity and Power from Force. Once I have these two regressions built I can create a data frame that consists of a sequence of Force values (from the minimum to the maximum observed in the team’s data) and predict the velocity and power at each unit of force.

fit_velo <- lm(Velocity ~ poly(Force, 2), data = fv_profile)
fit_power <- lm(Power ~ poly(Force, 2), data = fv_profile)

avg_tbl <- data.frame(Force = seq(from = min(fv_profile$Force),
                                  to = max(fv_profile$Force),
                                  by = 0.5))

avg_tbl$Velocity_Avg <- predict(fit_velo, newdata = avg_tbl)
avg_tbl$Power_Avg <- predict(fit_power, newdata = avg_tbl)
colnames(avg_tbl)[1] <- "Force_Grp"

 

Static Plots

I’ll walk through a few iterations of static plots. See the GITHUB code to walk through how these were produced.

1) All athletes with an average team trend line

This plot gives us a sense for the trend in Velocity as it relates to increasing amounts of force. Below you will see one with a solid trend line and one with a dashed trend line. Feel free to use which ever one you prefer.

2) Team trend for Velocity and Power

The common way this type of data is visualized in the sports science literature it is a bit tricky in R because it requires a dual y-axis. To obtain this, within my ggplot call I shrink Power down to a scale more similar to Velocity (the main y-axis) by dividing it by 1000. Then when I call the second y-axis with the sec_axis() function I multiply Power by 1000 to put it back on its normal scale.

3) Accounting for individuals

The above plots are a look at the entire team (in this case only 3 athletes). However, we may want to look at the individuals more explicitly. As such, we build four plots to show individual differences:

1) All athletes all on the same plot with their corresponding individualized trend lines (NOTE: if you have a lot of athletes, this plot can get pretty busy and ultimately become useless).
2) Plot each individual as a facet to look at the athletes separately.
3) Create the same plot as #2 but add in the group average trend line (which we created using our 2nd order polynomial regression) to allow us to compare each athlete to the groupx.
4) Plot each individual as a facet with velocity and power on separate y-axes.

Shiny App Development

The still figures above are nice but making things interactive is much more useful. I have four {shiny} web iterations that we can go through. Again, using the R code while reading this will help you understand what is going on under the hood. Additionally, you’ll want to run these in R so that R can open up the webpage on your computer and allow you to play with the app. Below is just still shots of each app iteration.

1) Version 1: Independent Player Plots

This version allows you to select one player at a time.

2) Version 2: Add Players to Facets

This version lets you select however many players you want and it will add them in as facets. This is a useful app if you are presenting to the staff and want to select or de-select several players to show how they compare to each other.

3) Version 3: Same as Version 2 but add Power to the Plot on a second y-axis

4) Version 4: Combine all plot details

Version 4 combines everything from this tutorial in one single web app. You can select the  force-velocity-power profile for individual athletes (added to the plot as facets) and see the team average trend line (added to the plot as a dashed line) for velocity and power, to allow you to make comparisons between each player and to the group average.

Conclusion

{tidyverse} makes it incredibly easy to manipulate data and quickly iterate plots to our liking while {shiny} offers an easy way for us to turn our plots into an interactive webpage.

Again, for the code and data see my GITHUB page (R Script and Data).

 

Python Tips & Tricks: Random Forest Classifier for TidyX Episode 18

As a means of working on improving some of my Python skills, I decided I’ll attempt to re-create different elements from some of our TidyX Screen Casts.

This past week, we did an episode on building a random forest classifier for coffee ratings (CLICK HERE). I’ve recreated almost all of the steps that we did in R in Python Code.

1) Loading the data from the TidyTuesday github page.

2) Data pre-processing

3) Exploratory data analysis

4) Random Forest classifier development and model testing

 

You can access the full Jupyter Notebook on my GITHUB page. I’m still trying to get the hang of Python so if there are any Pythonistas out there that have feedback or see errors in my code, I’m all ears!

 

 

TidyX Episode 18: Random Forests

In this weeks episode of TidyX, Ellis Hughes and breakdown the code that Dr. Nyssa Silbiger wrote to produce some lollipop plots with little coffee beans at the end of each one in order to get int the theme of the coffee rating data set provided by the TidyTuesday Project this week.

After that, we discuss Random Forests and, using the {randomForest} package in R, we create a Random Forest classifier to classify the coffee ratings of each cup based on a number of features. We walk through:

1. Data prep/cleaning
2. Exploratory analysis with visualizations
3. Random splitting of data into training and testing sets
4. Model building
5. Model testing and evaluation

If you’d like to watch the full screen cast, CLICK HERE.

For the code we used in the analysis, CLICK HERE.

R Tips & Tricks: Pearson Correlation from First Principles

In our most recent TidyX screen cast we talked a bit about correlation, which compelled me to expand on Pearson’s correlation to show how you can do things from first principles in R, by writing your own custom functions. This tutorial covers:

  • A brief explanation of Pearson’s correlation
  • R functions available to obtain Pearson’s correlation coefficient
  • How to write your own custom functions to understand the underlying principles of Pearson’s correlation coefficient and it’s confidence intervals

Pearson’s Correlation

Read almost any scientific journal article and you are sure to encounter Pearson’s correlation coefficient (denoted as r), as it is a commonly used metric for describing relationships within ones data. Take any basic stats course and you are bound to hear the phrase, “Correlation does not equal causation”, due to the fact that just because things are correlated does not mean that one definitely caused the other (such a relationship would need to be teased out of more specifically and the natural world is full of all sorts of random correlations).

Pearson’s correlation is a descriptive statistic used to quantify the linear relationship between two variables. It does so by measuring how much two variables covary with each other. The correlation coefficient is scaled between -1 and 1 where 1 means that the two variables have a perfect positive relationship (as one variable increases the other variable also always increases) while -1 means that the two variables have a perfect negative relationship (as one variable decreases the other variable also always decreases). A correlation coefficient of 0 suggests that the two variables do not share a linear relationship. It is rare that we see completely perfect correlations (either 1 or -1) and often our data will present with some scatter suggesting a specific trend (positive or negative) but with some amount of variability.

Data Preparation

R offers a few convenient functions for obtaining the correlation coefficient between two variables. For this tutorial, we will work with the Lahman baseball data set, which is freely available in R once you have installed the {Lahman} package. We will also use the {tidyverse} package for data manipulation and visualization.

We will use the Batting data set, which contains historic batting statistics for players, and the Master data set, which contains meta data on the players (e.g., birth year, death year, hometown, Major League debut, etc).

I do a little bit of data cleaning to obtain only the batting statistics from the 2016 season for those that had at least 200 at bats and I join that data with the player’s birth year from the Master data set. The two hitting variables I’ll use for this tutorial are Hits and RBI. One final thing I do is create a quantile bin for the player’s age so that I can look at correlation across age groups later in the tutorial.

Here is what the data looks like so far:

Data Visualization

Once we have pre-processed the data we can visualize it to see how things look. We will plot two visuals: (1) the number of players per age group; and, (2), a plot of the linear relationship between Hits and RBI.


Correlation in R

R offers a few convenient functions for obtaining the correlation coefficient between two variables. The two we will use are cor() and cor.test(). The only arguments you need to pass to these two functions is the X and Y variables you are interested in. The first function will produce a single correlation coefficient while the latter function will produce the correlation coefficient along with confidence intervals and information about the hypothesis test. Here is what their respective outputs looks like for the correlation between Hits and RBI:

Pretty easy! We can see that the correlation coefficient is the same between both functions, as it should be (r = 0.82). In the bottom output we can see that the 95% Confidence Intervals range from r = 0.78 to r = 0.85. The correlation is high between these two variables but it is not perfect, which we can better appreciate from the scatter plot above showing variability around the regression line.

Correlation from First Principles

One of the best ways to understand what is going on behind the custom functions in your stats program (no matter if it is R, Python, SPSS, SAS, or even Excel) is to try and build your own functions by hand. Doing so gives you an appreciation for some of the inner workings of these statistics and will help you better understand more complex statistics latter on.

I’ll build two functions:

  1.  A function to calculate the Pearson’s correlation coefficient
  2. A function to calculate the confidence intervals around the correlation coefficient

Pearson’s Correlation Coefficient Function

  • Similar to the built in R functions, this function will take inputs of an X and Y variable.
  • You can see the math for calculating the correlation coefficient in the function below. We start by subtracting the mean for each column from each observation. We then multiply the differences for each row and then produce a column of squared differences for each variable. Those values provide the inputs for the correlation coefficient in the second to last line of the function.
cor_function <- function(x, y){
  
  dat <- data.frame(x, y)
  dat <- dat %>%
    mutate(diff_x = x - mean(x),
           diff_y = y - mean(y),
           xy = diff_x * diff_y,
           x2 = diff_x^2,
           y2 = diff_y^2)
  
  r <- sum(dat$xy) / sqrt(sum(dat$x2) * sum(dat$y2))
  return(r)
}

Confidence Interval for Pearson’s R

  • This function takes three inputs: (1) the correlation coefficient between two variables (calculated above); (2) the sample size (the number of observations between X and Y); and, (3) The Confidence Level of Interest.
  • NOTE: To input a confidence level of interest I only set it up for three options, 0.9, 0.95, or 0.99, for the 90%, 95% and 99% Confidence Interval respectively. I could set it up to take any level of confidence and calculate the appropriate critical value but the function (as you can see) was already getting long and messy so I decided to cut it off and keep it simple for illustration purposes here.
  • This function is a bit more involved than the previous one and that’s because we require some transformations of the data. We have to transform the correlation coefficient using Fisher’s Z-Transformation in order to create a normal distribution. From there we can calculate our confidence intervals and then back transform the values so that they are on the scale of r.
cor.CI <- function(r, N, CI){
  
  fisher.Z <- .5*log((1+r)/(1-r))
  
  se.Z <- sqrt(1/(N-3))
  
  if(CI == .9){
    MOE <- 1.65*se.Z}
  else {
    if(CI == .95){
      MOE <- 1.95*se.Z}
    else {
      if(CI ==.99){
        MOE <- 2.58*se.Z}
      else{
        NA
      }
    }
  }
  
  Lower.Z <- fisher.Z - MOE
  Upper.Z <- fisher.Z + MOE
  Lower.cor.CI <- (exp(2*Lower.Z)-1)/(exp(2*Lower.Z)+1)
  Upper.cor.CI <- (exp(2*Upper.Z)-1)/(exp(2*Upper.Z)+1)
  Correlation.Coefficient.CI <- data.frame(r, Lower.cor.CI, Upper.cor.CI)
  Correlation.Coefficient.CI <- round(Correlation.Coefficient.CI, 3)
  return(Correlation.Coefficient.CI)
  }


Seeing our Functions in Action

We’ve obtained similar results to what was produced from the custom R functions!

We can also look at correlation across age bins. To do this, we will use {tidyverse} so that we can group_by() the age bins that we predefined.

First with the built in R functions

  • Using the built in cor.test() function we need to call the confident interval specifically from the output and specifying [1] and [2] tells R that we want the first value (lower confidence interval) and the second value (higher confidence interval), respectively.
yr2016 %>%
  group_by(AgeBin) %>%
  summarize(COR = cor(H, RBI),
            COR_Low_CI = cor.test(H, RBI)$conf.int[1],
            COR_High_CI = cor.test(H, RBI)$conf.int[2])

Now with our custom R functions

  • Similar to above, we need to extract the confidence interval out of the confidence interval function’s output. In this case, if you recall, there were three outputs (correlation coefficient, Low CI, and High CI). As such, we want to indicate [2] and [3] for the lower and upper confidence interval, respectively, since that is where they are located in the model output.
yr2016 %>%
  group_by(AgeBin) %>%
  summarize(COR = cor_function(H, RBI),
            cor.CI(r = COR,
                   N = n(),
                   CI = 0.95)[2],
            cor.CI(r = COR,
                   N = n(),
                   CI = 0.95)[3])

 

Conclusion

Hopefully this tutorial was useful in helping you to understand Pearson’s correlation and how easy it is to write functions in R that allow us to explore our data from first principles. All of the code for this tutorial is available on my GitHub page.