Category Archives: TidyX Screen Cast

TidyX Episode 21: Pairs Plots for Exploring Data

Pairs plots arrange plots of data in a matrix format so that you can visualize as many numeric relationships as you would like at once. The data are often plotted as scatter plots, but the format can be extended to include other visualizations such as histograms, boxplots, and even correlation coefficients.

This week, Ellis Hughes and I explore these types of plots using the {palmerpenguins} data set pulled together by Allison Horst for the TidyTuesday Project.

This is a really fun data set for starting out in R because the data are very clean and a few columns have nice relationships that can be exploited with statistical models. Ellis and I discuss some potential use cases for modeling these data and then create a number of plots to show various statistical relationships. We create pairs plots using the {GGally} package, build some interactive {plotly} graphics, and explain how to build interactive visualizations of regression coefficients from a linear model.
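
For readers who want to try a pairs plot before watching, here is a minimal sketch using {GGally} and {palmerpenguins}. It is not the code from the screen cast: the choice of columns, the coloring by species, and the object names `penguins_complete`, `numeric_cols`, and `pairs_plot` are my own assumptions for illustration.

```r
# Assumes {palmerpenguins}, {GGally}, and {ggplot2} are installed.
library(palmerpenguins)
library(GGally)
library(ggplot2)

# Drop rows with missing values so every panel uses the same birds.
penguins_complete <- na.omit(penguins)

# Numeric measurements to compare against one another (illustrative choice).
numeric_cols <- c("bill_length_mm", "bill_depth_mm",
                  "flipper_length_mm", "body_mass_g")

# Build the matrix of plots, colored by species.
pairs_plot <- ggpairs(
  penguins_complete,
  columns = numeric_cols,
  mapping = aes(colour = species)
)

pairs_plot
```

With the default settings, `ggpairs()` draws scatter plots below the diagonal, densities on the diagonal, and correlation coefficients above it, which covers the mix of panel types described at the top of this post.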

Finally, we wrap up by going through the code of Roman Link. Roman created his own package, {corrmorant}, for creating pairs plots. This package offers a ton of flexibility and allows you to style the visualization in a more customized way. We had a lot of fun playing around with the package, and Roman used it for his final project with the {palmerpenguins} data set.

To listen to the screen cast, CLICK HERE.

To check out our code, CLICK HERE.

TidyX Episode 20: Special Guest David Robinson

This week, Ellis Hughes and I are joined by someone who has had a big influence on both of us, David Robinson!

For those who don’t know, David has developed several R packages, such as {broom}, {tidytext}, {fuzzyjoin}, and {widyr}. Additionally, each week David does a live screen cast of himself working through the TidyTuesday data set from scratch. If you have never watched these, they are excellent. David covers a lot of ground in each 60-minute screen cast and shows you how he quickly extracts as much information as possible from a data set that he is seeing for the first time.

Today, we walk through one of David’s TidyTuesday screen casts where he does some text analysis of a data set consisting of cocktail ingredients. The screen cast features the use of his newest R package, {widyr}. After some exploratory data analysis and data cleaning, David calculates correlations between categorical variables (phi coefficient) and shows us how to plot the results in a network graph. He then shows us how {widyr} can be used for Principal Components Analysis. We wrap up with a short discussion of David’s journey into data science and how blogging and public work can be incredibly valuable for developing your professional network and career.
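
To make the phi coefficient a little more concrete: it is the Pearson correlation between two 0/1 indicators, here "does this drink contain ingredient X?" and "does this drink contain ingredient Y?". The sketch below computes it in base R rather than with {widyr}, and the tiny `cocktails` data frame is made up purely for illustration. It is not the TidyTuesday data and not David's code.

```r
# Hypothetical toy data: one row per (drink, ingredient) pair.
cocktails <- data.frame(
  drink      = c("A", "A", "B", "B", "C", "C", "D"),
  ingredient = c("gin", "lime", "gin", "lime", "rum", "lime", "rum")
)

# Drink-by-ingredient indicator matrix: 1 if the ingredient is in the drink.
counts    <- unclass(table(cocktails$drink, cocktails$ingredient))
incidence <- (counts > 0) * 1

# The Pearson correlation of 0/1 columns is the phi coefficient
# for each pair of ingredients.
phi <- cor(incidence)

round(phi, 2)
```

In this toy example gin and rum never appear in the same drink, so their phi coefficient is strongly negative, while gin and lime often appear together and get a positive value.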

To check out the screen cast, CLICK HERE.

To check out David’s blog, CLICK HERE.

TidyX Episode 19: {formattable} for Dashboard Design

This week, Ellis Hughes and I explain code written by Lauren Pandori, who built a nice dashboard for data on astronaut missions provided by the TidyTuesday Project.

Lauren had a few questions for us regarding other elements she wanted to add to her dashboard, so Ellis and I tackle those and show you how you can build your own dashboard using the {formattable} and {sparkline} packages.

The end product is a dashboard that has trend lines, conditional formatting, and conditional bar plots all in one. The nice thing is that we show you how you can save the dashboard as an HTML file so that it is easy to share with colleagues and co-workers. Finally, during the data exploration phase we walk through a cool way to use {plotly} to visualize trend lines in an interactive manner.

To watch our screen cast, CLICK HERE.

To get our code, CLICK HERE.

TidyX Episode 18: Random Forests

In this week’s episode of TidyX, Ellis Hughes and I break down the code that Dr. Nyssa Silbiger wrote to produce some lollipop plots with little coffee beans at the end of each one, in keeping with the theme of the coffee rating data set provided by the TidyTuesday Project this week.

After that, we discuss Random Forests and, using the {randomForest} package in R, we create a Random Forest classifier to classify the coffee ratings of each cup based on a number of features. We walk through:

1. Data prep/cleaning
2. Exploratory analysis with visualizations
3. Random splitting of data into training and testing sets
4. Model building
5. Model testing and evaluation
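
The steps above can be sketched roughly as follows. This is not the code from the episode: the coffee ratings data are not reproduced here, so the built-in `iris` data set stands in for them, and the 70/30 split, `ntree = 500`, and the object names are my own assumptions.

```r
# Assumes the {randomForest} package is installed.
library(randomForest)

set.seed(42)  # make the random split and the forest reproducible

# Stand-in data: iris instead of the coffee ratings data.
dat <- iris

# Random split into training (70%) and testing (30%) sets.
train_idx <- sample(nrow(dat), size = floor(0.7 * nrow(dat)))
train <- dat[train_idx, ]
test  <- dat[-train_idx, ]

# Model building: classify Species from all other columns.
fit <- randomForest(Species ~ ., data = train, ntree = 500, importance = TRUE)

# Model testing and evaluation on the held-out data.
pred     <- predict(fit, newdata = test)
conf_mat <- table(predicted = pred, actual = test$Species)
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)

conf_mat
accuracy
importance(fit)  # which features the forest relied on most
```

Evaluating on rows the model never saw during training is the point of the split: accuracy computed on the training rows would be overly optimistic.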

If you’d like to watch the full screen cast, CLICK HERE.

For the code we used in the analysis, CLICK HERE.

TidyX Episode 17: Regression, KMeans Clustering, & PCA

This week, Ellis Hughes and I explain the code of Rebecca Stevick, who shows us how to plot a linear regression model with the regression line, regression equation, and correlation coefficient all conveniently visualized on the plot. The plot was created using data on The Uncanny X-Men comic books, supplied by the TidyTuesday Project.
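
As a rough idea of what such a plot involves, here is a base R sketch that fits a simple linear regression and writes the equation and correlation coefficient onto the plot. It is not Rebecca's code: the built-in `mtcars` data set stands in for the X-Men data, and the label format and object names are assumptions of mine.

```r
# Stand-in data: mtcars instead of the X-Men data.
fit   <- lm(mpg ~ wt, data = mtcars)
coefs <- coef(fit)                     # intercept and slope
r     <- cor(mtcars$wt, mtcars$mpg)    # Pearson correlation coefficient

# Build a text label with the regression equation and r.
eq_label <- sprintf("y = %.2f %+.2f x, r = %.2f", coefs[1], coefs[2], r)

# Scatter plot, fitted line, and the label in the corner.
plot(mtcars$wt, mtcars$mpg,
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon", pch = 19)
abline(fit, col = "blue", lwd = 2)
legend("topright", legend = eq_label, bty = "n")
```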

Following Rebecca’s code, we delve into other ways of looking at the regression equation and discuss using Ellis’ R package, {colortable}, to produce conditionally formatted tables for model outputs. We then move on to using the X-Men data to build and visualize a KMeans clustering and a PCA.
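
For anyone who has not combined these two techniques before, here is a small base R sketch of KMeans clustering visualized on the first two principal components. It is not the code from the episode: the numeric columns of the built-in `iris` data set stand in for the X-Men data, and the choice of three clusters is an assumption for illustration.

```r
# Stand-in data: the four numeric iris columns, centered and scaled
# so that no single measurement dominates the distances.
measurements <- scale(iris[, 1:4])

set.seed(123)  # kmeans starts from random centers

# KMeans with 3 clusters (an illustrative choice), 25 random restarts.
km <- kmeans(measurements, centers = 3, nstart = 25)

# PCA on the same scaled data.
pca <- prcomp(measurements)

# Share of total variance captured by each component.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

# Plot the observations on PC1 vs PC2, colored by cluster.
scores <- as.data.frame(pca$x[, 1:2])
scores$cluster <- factor(km$cluster)
plot(scores$PC1, scores$PC2, col = scores$cluster, pch = 19,
     xlab = "PC1", ylab = "PC2")
```

Plotting the clusters in the space of the first two components is a convenient way to see cluster separation when the original data have more than two dimensions.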

The episode is a little longer than usual (50 minutes) but combines a number of different thoughts around coding and visualizing statistical models in R.