Category Archives: Python Tips & Tricks

Approximating a Bayesian Posterior Prediction

This past week on the Wharton Moneyball Podcast, during Quarter 2 of the show the hosts got into a discussion about Bayesian methods, simulating uncertainty, and frequentist approaches to statistical analysis. The show hosts are all strong Bayesian advocates but at one point in the discussion, Eric Bradlow mentioned that “frequenstist can answer similar questions by building distributions from their model predictions.” (paraphrasing)

This comment reminded me of Chapter 7 in Gelman and Hill’s brilliant book, Data Analysis Using Regression and Multilevel/Hierarchical Models. In this chapter, the authors do what they call an informal Bayesian approach by simulating the predictions from a linear regression model. It’s an interesting (and easy) approach that can be helpful for getting a sense for how Bayesian posterior predictions work without building a full Bayesian model. I call it, “building a bridge to Bayes”.

Since the entire book is coded in R, I decided to code an example in Python.

The Jupyter notebook is accessible on my GITHUB page.

(Special Material: In the Jupyter Notebook I also include additional material on how to calculate prediction intervals and confidence intervals by hand in python. I wont go over those entire sections here as I don’t want the post to get too long.)

Libraries & Data

We will start by loading up the libraries that we need and the data. I’ll use the iris data set, since it is conveniently available from the sklearn library. We will build the regression model using the statsmodels.api library.

## import libraries

from sklearn import datasets
import pandas as pd
import numpy as np
import statsmodels.api as smf
from scipy import stats
import matplotlib.pyplot as plt

## iris data set
iris = datasets.load_iris()

## convert to pandas sata frame
data = iris['data']
target = iris['target']

iris_df = pd.DataFrame(data)
iris_df.columns = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
iris_df.head(3)

Build a linear regression model

Next, we build a simple ordinary least squares regression model to predict petal_width from petal_length.

 

## Set up our X and y variables
X = iris_df['petal_length']
y = iris_df['petal_width']

# NOTE: statsmodels does not automatically add an intercept, so you need to do that manually

X = smf.add_constant(X)

# Build regression model
# NOTE: the X and y variables are reversed in the function compared to sklearn

fit_lm = smf.OLS(y, X).fit()

# Get an R-like output of the model

fit_lm.summary()

Simulate a distribution around the slope coefficient

The slope coefficient for theĀ petal_length variable, from the model output above, is 0.4158 with a standard error of 0.01. We will store these two values in their own variables and then use them to create a simulation of 10,000 samples and plot the distribution.

## get summary stats
mu_slope = 0.4158
se_slope = 0.01

## create simulation
n_samples = 10000
coef_sim = np.random.normal(loc = mu_slope, scale = se_slope, size = n_samples)

## plot simulated distribution

plt.hist(coef_sim, bins = 60)
plt.show()

We can also grab the summary statistics from the simulated distribution. We will snag the mean and the 90% quantile interval.

## get summary stats from our simulation
summary_stats = {
    'Mean': coef_sim.mean(),
    'Low90': np.quantile(coef_sim, 0.05),
    'High90': np.quantile(coef_sim, 0.95)
}

summary_stats

Making a prediction on a new observation and building a posterior predictive distribution

Now that we’ve gotten a good sense for how to create a simulation in python, we can create a new observation of petal_length and make a prediction about what the petal_width would be based on our model. In addition, we will get the prediction intervals from the output and use them to calculate a standard error for the prediction, which we will use for the posterior predictive simulation.

Technical Note: In the prediction output you will see that python returns a mean_ci_lower and mean_ci_upper and an obs_ci_lower and obs_ci_higher. The latter two are the prediction intervalsĀ  while the former two are the confidence intervals. I previously discussed the difference HERE and this is confirmed in the Jupyter Notebook where I calculate these values by hand.

# create a new petal_length observation
new_dat = np.array([[1, 1.7]])    # add a 1 upfront to indicate the model intercept

# make prediction of petal_width using the model
prediction = fit_lm.get_prediction(new_dat)

# get summary of the prediction
prediction.summary_frame()

Store the predicted value (0.343709) and then calculate the standard error from the lower and upper prediction intervals. Run a simulation and then plot the distribution of predictions.

mu_pred = 0.343709
se_pred = 0.754956 - 0.343709     # subtract the upper prediction interval from the mean to get the variability
n_sims = 10000

pred_obs = np.random.normal(loc = mu_pred, scale = se_pred, size = n_sims)

plt.hist(pred_obs, bins = 60)
plt.show()

Just as we did for the simulation of the slope coefficient we can extract our summary statistics (mean and 90% quantile intervals).

## get summary stats from our simulation
summary_stats = {
    'Mean': pred_obs.mean(),
    'Low90': np.quantile(pred_obs, 0.05),
    'High90': np.quantile(pred_obs, 0.95)
}

summary_stats

Wrapping Up

That is a pretty easy way to get a sense for approximating a Bayesian posterior predictive distribution. Rather than simply reporting the predicted value and a confidence interval or prediction interval, it is sometimes nice to build an entire distribution. Aside from it being visually appealing, it allows us to answer other questions we might have, for example, what percentage of the data is greater or less than 0.5 (or any other threshold value you might be interested in)?

As stated earlier, all of this code is accessible on my GITHUB page and the Jupyter notebook also has additional sections on how to calculate confidence intervals and prediction intervals by hand.

If you notice any errors, let me know!

t-test…ANOVA…It’s linear regression all the way down!

I had someone ask me a question the other day about t-tests. The question was regarding how to get the residuals from a t-test. The thing we need to remember about t-tests and ANOVA is that they are general linear models. As such, an easier way of thinking about them is that they are a different way of looking at a regression output. In this sense, a t-test is just a simple linear regression with a single categorical predictor (independent) variableĀ  that has two levels (e.g., Male & Female) while ANOVA is a simple linear regression with a single predictor variable that has more than two levels (e.g., Cat, Dog, Fish).

Let’s look at an example!

Complete code is available on my GITHUB page.

Load Data

The data we will use is the iris data set, available in the numpy library.

Exploratory Data Analysis

The Jupyter Notebook I’ve made available on GITHUB has a number of EDA steps. For this tutorial the variable we will look at is Sepal Length, which appears to different between Species.

T-test

We will start by conducting a t-test. Since a t-test is a comparison of means between two groups, I’ll create a data set with only the setosa and versicolor species.

## get two groups to compare
two_groups = ["setosa", 'versicolor']

## create a data frame of the two groups
df2 = df[df['species'].isin(two_groups)]

First I build the t-test in two common stats libraries in python, statsmodels and scipy.

## t-test in statsmodels.api
smf.stats.ttest_ind(x1 = df2[df['species'] == 'setosa']['sepal_length'],
                    x2 = df2[df['species'] == 'versicolor']['sepal_length'],
                   alternative="two-sided")

## t-test in scipy
stats.ttest_ind(a = df2[df['species'] == 'setosa']['sepal_length'],
                b = df2[df['species'] == 'versicolor']['sepal_length'],
                alternative="two-sided")

Unfortunately, the output of both of these approaches leaves a lot to be desired. They simply return the t-statistics, p-value, and degrees of freedom.

To get a better look at the underlying comparison, I’ll instead fit the t-test using the researchpy library.

## t-test in reserachpy
rp.ttest(group1 = df2[df['species'] == 'setosa']['sepal_length'],
         group2 = df2[df['species'] == 'versicolor']['sepal_length'])

This output is more informative. We can see the summary stats for both groups at the top. the t-test results follow below. We see the observed difference, versicolor has a sepal length 0.93 (5.006 – 5.936) longer that setosa, on average. We also get the degrees of freedom, t-statistic, and p-value, along with several measures of effect size.

Linear Regression

Now that we see what the output looks like, let’s confirm that this is indeed just linear regression!

We fit our model using the statsmodels library.

## Linear model to compare results with t-test (convert the species types of dummy variables)
X = df2[['species']]
X = pd.get_dummies(X['species'], drop_first = True)

y = df2[['sepal_length']]

## add an intercept constant, since it isn't done automatically
X = smf.add_constant(X)

# Build regression model
fit_lm = smf.OLS(y, X).fit()

# Get model output

fit_lm.summary()

Notice that the slope coefficient for versicolor is 0.93, indicating it’s sepal length is, on average, 0.93 greater than setosa’s sepal length. This is the same result we obtained with our t-test above.

The intercept coefficient is 5.006, which means that when versicolor is set to “0” in the model (0 * 0.93 = 0) all we are left with is the intercept, which is the mean value for setosa’s sepal length, the same as we saw in our t-test.

What about the residuals?

The question original question was about residuals from the t-test. Recall, the residuals are the difference between actual/observed value and the predicted value. When we have a simple linear regression with two levels (a t-test) the predicted value is simply the overall mean value for that group.

We can add predictions from the linear regression model into our data set and calculate the residuals, plot the residuals, and then calculate the mean squared error.

## Add the predictions back to the data set
df2['preds'] = fit_lm.predict(X)

## Calculate the residual
resid = df2['sepal_length'] - df2['preds']

## plot the residuals
sns.kdeplot(resid,shade = True)
plt.axvline(x = 0,linewidth=4, linestyle = '--', color='r')

In the code, you will find the same approach taken by just applying the group mean as the “predicted” value, which is the same value that the model predicts. At the bottom of the code, we will find that the outcome of the MSE is the same.

Wrapping Up

In summary, whenever you think t-test or ANOVA, just think linear regression. The intercept will end up reflecting the mean value for the reference class while the coefficient(s) for the other classes of that variable will represent the difference between their mean value and the reference class. If you want to make a comparison to a different reference class, you can change the reference class before specifying your model and you will obtain a different set of coefficients (given that they are compared to a new reference class) but the predicted values, residuals, and MSE will end up being the same.

Again, if you’d like the full code, you can access it HERE.

TidyX Episode 99: Intro to Bayes

Ellis Hughes and I continue our series on simulation and sampling by moving into a basic introduction to Bayesian thinking. We code two of the common Bayesian problems that get discussed in intro courses/textbooks: (1) The probability of disease given a positive test; and, (2) the probability that a coin is biased towards heads or tails.

To watch our screen cast, CLICK HERE.

To access our code, CLICK HERE.

If you’d like to see the code in Python (I wrote a script for that too), CLICK HERE.

 

Doing things in Python that you’d normally do in Excel – Data Analysis for Strength & Conditioning Coaches (Turner et al., 2015)

My previous blog on, Doing things in Python that you would normally do in Excel, got some nice feedback and seemed to be useful to a number of folks. As such, I’ve decided to continue on with that theme and put together a Python approach for constructing the stats in Anthony Turner‘s paper, Data Analysis for Strength and Conditioning Coaches: Using Excel to Analyze Reliability, Differences and Relationships.

In the paper, Anthony works through a few examples of calculating things like Smallest Worthwhile Change, Typical Error Measurement, and Cohen’s d Effect Size for CMJ, RSI, and Pro Agility. For the sake of brevity, my tutorial will only work through the CMJ data.

In his paper, Anthony walks through how a strength coach can apply this type of analysis very simply in Excel. I use his data to construct the same analysis in Python and end up with a data frame that represents each athlete on the team, their 3 CMJ’s, a goal/training target based on their SWC, and their Error Measurement.

 

Some things I cover in the tutorial that might be of interest to folks just starting out with Python:

    1. Adding new columns to a data frame while calculating summary summary statistics row-wise (e.g., working across each individual athlete’s row instead of calculating summary statistics over an entire column, which is how we commonly do it).
    2. Writing a custom function. In this tutorial I write a custom function to calculate the average and standard deviation across rows so that we only need one line of code to extract the information we are interested in.
    3. Step-by-step calculation of Cohen’s d Effect Size in Python.

Building your own data frames. I manually build Anthony’s data from the paper into a data frame. I also build a data frame at the end of the tutorial for Cohen’s d Effect Size interpretation.

To access my Jupyter Notebook, go to my Github page, HERE.

Doing things in Python that you would normally do in Excel

Learning a new coding language is always a challenge. One thing that helps me is to create a short tutorial for myself of some of the basic data tasks that I might do when I initially sit down with a data set. I try and think through this in relationship to stuff I might have done (a long, long time ago) in Excel with regards to summarizing data, adding new features, and creating pivot tables.

Since I’m not that great in Python, here is my Doing things in Python that you would normally do in Excel tutorial that may help others looking to get started with this coding language.

The tasks that I cover are:

  1. Exploring features of the data
  2. Sorting the columns
  3. Filtering the columns
  4. Creating new features
  5. Calculating summary statistics
  6. Building pivot tables
  7. Data visualization

The data I use comes form the pybaseball library, freely available for install in python. I’ll be using the pitchers dataset from years 2012 to 2016.

The entire jupyter notebook is accessible on my GitHub page.

Libraries and Data

The libraries I use are:

  • pandas — for working with data frames
  • numpy — for additional computational support
  • matplotlib & seaborn — for plotting data
  • pybaseball — for data acquisition

I called the data set pitchers. The data consists of 408 rows and 334 columns. After doing a bit of exploring the data set (seeing how large it is, checking columns for NA values, etc), we begin by sorting the columns.

Sort Columns

Sorting columns is done by calling the name of the data set and using the sort_values() function, passing it the column you’d like to sort on (in this case, sorting alphabetically by pitcher name)

## Sort the data by pitcher name

pitchers.sort_values(by='Name')

If you have a specific direction you’d like to sort by, set the ‘ascending’ argument to either True or False. In this case, setting ascending = False allows us to sort the ERA from highest to lowest (descending order).

## Sort the data by ERA from highest to lowest

pitchers.sort_values(by = 'ERA', ascending = False)

(For additional sorting examples, see the GitHub code)

Filtering Columns

Filtering can be performed by explicitly stating the value you’d like to filter on within the square brackets. Here, I call the data frame (pitchers) and add the .loc function after the data frame name in order to access the rows that are specific to the condition of interest. Here, I’m only interested in looking at the 2012 season. Additionally, I only want a few columns (rather than the 334 from the full data set). As such, I specify those columns AFTER the comma within the square brackets. Everything to the left of the comma is specific to the rows I want to filter (Season == 2012) and everything to the right of the comma represents the columns of interest I’d like returned.


## Filter to only see the 2012 season
# Keep only columns: Name, Team, Age, G, W, L, WAR, and ERA

season2012 = pitchers.loc[pitchers['Season']== 2012, ['Name', 'Team', 'Age', 'G', 'W', 'L', 'WAR', 'ERA']]
season2012.head()

If I only want to look at Clayton Kershaw over this time period, I can filter him out like so:


## Filter Clayton Kershaw's seasons
# Keep only columns: Season, Name, Team, Age, G, W, L, WAR, and ERA
# arrange the data set from earliest season to latest

kershaw = pitchers.loc[pitchers['Name']=='Clayton Kershaw', ['Season', 'Name', 'Team', 'Age', 'G', 'W', 'L', 'WAR', 'ERA']].sort_values('Season', ascending = True)
kershaw.head()

To make the data set more palatable for the rest of the tutorial, I’m going to create a smaller data set, with fewer columns (pitchers_small).


## Create a smaller data set
# Keep only columns: Season, Name, Team, Age, G, W, L, WAR, ERA, Start-IP, and Relief-IP
# arrange the data set from earliest season to latest for each pitcher

pitchers_small = pitchers[['Season', 'Name', 'Team', 'Age', 'G', 'W', 'L', 'WAR', 'ERA', 'Start-IP','Relief-IP']].sort_values(['Name', 'Season'], ascending = True)
pitchers_small.head(30)


Creating New Features

I create three new features in the data set:

  1. A sequence counter that counts the season number of each pitcher from 1 to N seasons that they are in the data set.This is done by simply grouping the data by the pitcher name and then cumulatively counting each row that the pitcher is seen in the data. Notice I add “+1” to the end of the code because python begins counter at “0”. NOTE: To make this work properly, ensure that the data is ordered by pitcher name and season. I did this at the end of my code, in the previous step.
  2. An ‘age group’ feature that groups the ages of the pitchers in 5 year bins.To accomplish this task, I use a 3 step process. First, I specify where I want the age bins to occur and assign it to the bins variable. I then create the labels I would like to correspond to each of the bins and assign that to the age_group variable. Then I use the np.select() function to combine this information, assigning it to a new column in my data set called ‘age_group‘.
  3. A ‘pitcher type’ feature that considers anyone who’s starter innings pitched was greater or equal to the median number of starting innings pitched as a ‘starter’ and all others as ‘relievers’.To create the pitcher_type column, I use the np.where() function, which works like ifelse() or case_when() in R or like IF() in excel. The first argument is the condition I’d like checked (“did this pitcher have starter innings pitched that were greater than or equal to the median number of starter innings pitched?). If the condition is met, the function will assign the pitcher in that row as a “starter”. If the condition is not met, then the pitcher in that row is designated as a “reliever”.

 

## Add a sequence counter for each season for each pitcher
pitchers_small['season_id'] = pitchers_small.groupby(['Name']).cumcount() + 1

## Create a new column called 'age_group'
# create conditions for the age_group bins
bins = [
    (pitchers_small['Age'] <= 25), (pitchers_small['Age'] > 25) & (pitchers_small['Age'] <= 30), (pitchers_small['Age'] > 30) & (pitchers_small['Age'] <= 35), (pitchers_small['Age'] > 35) & (pitchers_small['Age'] <= 40), (pitchers_small['Age'] > 40)
]

# create the age_group names to be assigned to each bin
age_group = ['<= 25', '25 to 30', '31 to 35', '36 to 40', '> 40']

# add the age_group bins into the data
pitchers_small['age_group'] = np.select(bins, age_group)

## Create a pitcher_type column which makes a distinction between starters and relievers
pitchers_small['pitcher_type'] = np.where(pitchers_small['Start-IP'] >= pitchers_small['Start-IP'].median(), 'starter', 'reliever') 

pitchers_small.head()



Calculating Summary Statistics

The GItHub repo for this post offers a few ways of obtaining the mean, standard deviation, and counts of values for different columns. For simplicity, I’ll show a convenient way to get summary stats over an entire data set using the describe() function, which is called following the name of the data frame and a period.


## Get summary stats for each column

pitchers_small.describe()

Pivot Tables

There are a few ways to create a pivot table in python. One way is to use the groupby() function for the class you’d like to summarize over and then call the mathematical operation (e.g., mean) you are interested in. I have an example of this in the GitHub post in code chuck 42 as well as several examples of other pivot table options. Another way is to use the pivot_table() function from the pandas library.

Below is a pivot table of the mean and standard deviation of pitcher age across the 5 seasons in the data set.

## Pivot Table of Average and Standard Deviation of Wins by Season
# Round the results to 1 significant digit

round(pitchers_small.pivot_table(values = ['Age'], index = ['Season'], aggfunc = (np.mean, np.std)), ndigits = 1)

You can also make more complicated pivot tables by setting the columns to a second grouping variable, as one would do in Excel. Below, we look at the average WAR across all 5 seasons within the 5 age groups (which we created in the previous section).

## Calculate the average WAR per season across age group

pitchers_small.pivot_table(values = ['WAR'], index = ['Season'], columns = ['age_group'], aggfunc = np.mean)


Data Visualization

In the GitHub post I walk through how to plot 8 different plots:

  1. Histogram
  2. Density plots by group
  3. Boxplots by group
  4. Scatter plot
  5. Scatter plot highlighting groups
  6. Bar plots for counting values
  7. Line plot for a single player over time
  8. Line plots for multiple players over time