Author Archives: Patrick

TidyX Episode 140: Data restructuring via splits

This week, Ellis Hughes and I answer a viewer question about data restructuring using splits in the data.

The viewer has some data sets that include information about competitions, the name of the athlete, the team the athlete is on, and the athlete gender. The goal was to help the viewer write some code that could output this information into a clean looking list with all of the information arranged appropriately within each competition.

Ellis and I both approached it (and interpreted it) in different ways, so you get to see multiple different ways of solving the problem.

To watch the screen cast, CLICK HERE.

To access our code, CLICK HERE.

From tidyverse to python

Anyone who reads this blog or watches our Tidy Explained Screen Cast knows that I am a massive R user and R fan. I can work fast in it and I find it to be one of the best tools for building statistical models. That said, there are times when python comes in handy and there are a number of software and web applications that interact and play nicer with python compared to R. For example, R can be run within AWS Sagemaker, but python seems to be more efficient. I’ve recently been doing a few projects in Databricks as well and, while I can use R within their system, I’ve been trying to code the projects using python.

For those of us trying to learn a bit of python to be somewhat useful in that language (or for pythonistas who may need to learn a little bit of R) I’ve put together the following tutorial that shows how to do some of the common stuff you’d use R’s tidyverse for, in python.

The codes for both the RMarkdown file and the Jupyter Notebook are available on my GITHUB page. The codes has many more examples than I will go over here (for space reasons), so be sure to check it out!

Load Libraries and Data

We will be using the famous Palmer Penguins data. Here is a side-by-side look at how we load the libraries and data set in tidyverse and what those same steps look like in Python.

Exploratory Data Analysis

One of the most popular features of the suite of tidyverse libraries is the ability to nicely summarize and plot data.

I won’t go through every possible EDA example contained in the notebooks but here are a few side-by-side.

Create a Count of the Number of Samplers for Each Species

Create a barplot of the count of Species types

Scatter plot of flipper length and body mass colored by sex with linear regression lines

Group By Summarize

In tidyverse we often will group our data by a specific condition and then summarize data points relative to that condition. In tidyverse we use pipes, %>%, to connect together lines of code. In the pandas library (the common python library for working with data frames) we use dots to connect these lines of code.

Group By Mutate

In addition to summarizing by group, sometimes we need to add additional columns to the existing data set but those columns need to have the new data conditional on a specific grouping variable. In tidyverse we use mutate and in pandas we use transform.

We can also build columns using grouping and custom functions. In tidyverse we do this inside of the mutate but in pandas we need to set up a lambda function. For example, we can build the z-score of each sample grouped by Species (meaning the observation is divided by the mean and standard deviation for that Species population).

ifelse / case_when

Another task commonly performed in data cleaning is to assign values to specific cases. For example, we have three Islands in the data set and we want to assign them Island1, Island2, and Island3. In tidyverse we could use either ifelse or case_when() to solve this task. In pandas we need to either set up a custom function and then map that function to the data set or we can use the numpy library, which has a function called where(), which behaves like case_when() in tidyverse.

Linear Regression Model

To finish, I’ll provide some code to write a linear model in both languages.

Wrapping Up

Hopefully the tutorial will help with folks going from R to Python or vice versa. Generally, I suggest picking one or the other and trying to really dig into being good at it. But, there are times where you might need to delve into another language and produce some code. This tutorial provides a mirror image of code between R and Python.

As stated earlier, there are a number of extra code examples in the RMarkdown and Jupyter Notebook files available on my GITHUB page.

TidyX Episode 139: Custom z-score function with pre-specified population parameters

This week, Ellis Hughes and I work through a custom function for calculating z-scores conditional on specific population parameters.

Often, when building models, it is common to normalize the data to get all of the features onto the same scale. Occasionally we are dealing with a situation where we want to scale the most recent data (e.g., this year’s data) with the mean and standard deviation of prior year’s data. Another example would be scaling our testing set to the mean and standard deviation of a training set. Thus, we create a custom function to handle this task. We go step-by-step through our process of construction the function and show the errors along the way and the iterations we went through.

To watch the screen cast, CLICK HERE.

To access our code, CLICK HERE.

Can I please be introduced to the Non-Applied Sport Scientist?

A recent discussion on Twitter spurred some thoughts that I had with respect to titles and roles in sport and in particular the title/role of Applied Sport Scientist.

@ScientistSport posed the following question:

It’s an interesting question to ponder. Given that sport science was originally born out of physiologists attempting to study human performance in Olympic sport athletes (which then eventually bled into team sport athletes) the question makes sense. Moreover, it seems like people generally think of sport science as something directed at helping the team “train better” – monitoring training loads, testing strength, power and conditioning, and even entering into the discussion of return to play following injury. Such a role has led many teams to employ an Applied Sport Scientist.

Titles in sport are weird. What does an Applied Sport Scientist do? What is the description of the role? More importantly, is there a Non-Applied Sport Scientist? If so, what are they doing?

Generally, when I’ve been introduced to the Applied Sport Scientist at a team when I’ve found is they are an assistant strength coach or assistant athletic trainer that has been tasked with turning on GPS units, conducting force plate jumps with the players, and coordinating the reports from the team’s Athlete Management System (AMS).

No doubt these are important tasks and critical to helping the staff plan and manage the team’s training! But, why is this a science role? What’s scientific about it? Is the individual ensuring data quality and integrity is being maintained before it is stored in the AMS? Is the individual conducting scientific inquiry of the data within the AMS to understand the measurements being made and determining if the measures are valid, reliable, or responsive? More importantly, how is the individual using the abundance of data being collected to answer larger questions that are relevant to the entire organization?

Perhaps the role shouldn’t be called Applied Sport Scientist? Maybe it should be Data Collection Coordinator or something more descriptive of the task at hand? Titles matter! They define what we do and how we do it. Again, if there is an Applied Sport Scientist is there a Non-Applied Sport Scientist? Maybe the latter is the one doing the real scientific work – identifying the pertinent research questions, planning applied science studies, structuring and establishing best practice data collection methods, analyzing data, and communicating the results to the end users.

What should the role of an Applied Sport Scientist be?

While some may feel like my argument is a bit pedantic here is why it matters.

The aim of the Applied Sport Scientist or the Sport Science Department should be to answer questions across the entire sporting organization. This shouldn’t simply be limited to matters of strength and conditioning. Rather, the goal should be to apply the scientific method to any and all questions in sport – training, return to play, performance evaluation, player acquisition, team tactics, etc. – and work at the intersection of such topics to provide analysis that helps the key stakeholders make decisions. A few colleagues and I wrote a paper about the parallels between Business Intelligence and Sports Science a few years ago <CLICK HERE>.

Science isn’t just a title; it is a framework and process for asking and answering questions. Or, as David Salsburg states, in his brilliant book The Lady Tasting Tea: How Statistics Revolutionized Science in the Twentieth Century, “Science, we are often taught, is measurement. We make careful measurements and use them to find mathematical formulas that describe nature.” Consequently, someone that is given the title Applied Sport Scientist should actually have scientific training. The concept of framing a question, collecting data, doing basic statistics, knowing basic physiology and biomechanics, understanding how to run a simple reliability study, etc., are things that should be fundamental skills for this individual. Calling someone a Sport Scientist who doesn’t have these skills – even though they might be a really smart person and they might know a good deal about whatever technology they are using – is like calling me a strength coach. Sure, I can write a program and I can train and coach people. But, that’s probably not why you would hire me. Just as the strength coach can collect data and print reports, but you aren’t hiring them to conduct scientific investigations. You’re also not hiring the Physical Therapist to run the nutrition program.

Being smart and hardworking are important qualities in sport and everyone can help out in various areas of the organization. But titles should matter because they in some way define roles and responsibilities. The best organizations find the right people, with the right skill sets, to work together and create a super team.

As I like to say, Success boils down to four things:

  1. Knowing what you know.
  2. Working to be really good at what you know.
  3. Knowing what you don’t know.
  4. Knowing enough about what you don’t know to ask the right questions to get people in who can help you out with that thing.

 

 

TidyX Episode 137: Magically Multiplying Tabbed Reports

This week, Ellis Hughes and I answer a view question about plotting in different tabs within an {Rmarkdown} report. The issue the viewer was running into was that they had a lot of groups that they wanted to create the same plot for and have a tab representing each of those individual groups. Rather than copying and pasting the tab code hundreds of times (and risk messing up and not changing the data for one of the groups) we walk through a quick approach to have R do this for you in a loop. In literally 20 lines of code you can plot as many tabs as you’d like!

To watch the screen cast, CLICK HERE.

To access our code, CLICK HERE.