t-test…ANOVA…It’s linear regression all the way down!

I had someone ask me a question the other day about t-tests: how do you get the residuals from a t-test? The thing to remember about t-tests and ANOVA is that they are both general linear models, so an easier way of thinking about them is as a different way of looking at a regression output. In this sense, a t-test is just a linear regression with a single categorical predictor (independent) variable that has two levels (e.g., Male & Female), while a one-way ANOVA is a linear regression with a single categorical predictor that has more than two levels (e.g., Cat, Dog, Fish).
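To make the claim concrete, here is a sketch of the two-group case in regression notation. The symbols are illustrative shorthand rather than anything taken from a library: x_i is a 0/1 dummy variable marking group membership, and \bar{y}_0 and \bar{y}_1 are the sample means of the reference group and the other group.

```latex
y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \qquad x_i \in \{0, 1\}

\hat{\beta}_0 = \bar{y}_0, \qquad \hat{\beta}_1 = \bar{y}_1 - \bar{y}_0
```

With more than two levels (the ANOVA case) the same idea holds, with one dummy variable for each non-reference level.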

Let’s look at an example!

Complete code is available on my GitHub page.

Load Data

The data we will use is the iris data set, available in the seaborn library (sns.load_dataset("iris")) and loaded here as a pandas data frame named df.

Exploratory Data Analysis

The Jupyter Notebook I’ve made available on GitHub has a number of EDA steps. For this tutorial the variable we will look at is Sepal Length, which appears to differ between Species.

T-test

We will start by conducting a t-test. Since a t-test is a comparison of means between two groups, I’ll create a data set with only the setosa and versicolor species.

## get two groups to compare
two_groups = ['setosa', 'versicolor']

## create a data frame of the two groups
## (.copy() lets us add columns later without a SettingWithCopyWarning)
df2 = df[df['species'].isin(two_groups)].copy()

First I build the t-test in two common stats libraries in Python, statsmodels and scipy.

## t-test in statsmodels.api (imported under the alias smf, as used throughout this post)
import statsmodels.api as smf

smf.stats.ttest_ind(x1 = df2[df2['species'] == 'setosa']['sepal_length'],
                    x2 = df2[df2['species'] == 'versicolor']['sepal_length'],
                    alternative = "two-sided")

## t-test in scipy
from scipy import stats

stats.ttest_ind(a = df2[df2['species'] == 'setosa']['sepal_length'],
                b = df2[df2['species'] == 'versicolor']['sepal_length'],
                alternative = "two-sided")

Unfortunately, the output of both of these approaches leaves a lot to be desired. They simply return the t-statistic, p-value, and degrees of freedom.
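As an aside, both calls assume equal variances by default, so what they report is the pooled-variance (Student's) t-test. Here is a minimal sketch of that arithmetic using only the Python standard library. The function name pooled_t and the toy numbers are made up for illustration; they are not the iris values.

```python
import math
import statistics

def pooled_t(a, b):
    """Student's two-sample t-statistic and degrees of freedom (equal variances assumed)."""
    n1, n2 = len(a), len(b)
    dof = n1 + n2 - 2
    # Pooled variance: the two sample variances weighted by their degrees of freedom
    sp2 = ((n1 - 1) * statistics.variance(a) + (n2 - 1) * statistics.variance(b)) / dof
    # Standard error of the difference in means
    se = math.sqrt(sp2 * (1 / n1 + 1 / n2))
    t = (statistics.mean(a) - statistics.mean(b)) / se
    return t, dof

group_a = [4.0, 5.0, 6.0]   # toy values, not iris data
group_b = [6.0, 7.0, 8.0]   # toy values, not iris data

t_stat, dof = pooled_t(group_a, group_b)
print(t_stat, dof)
```

The sign of the t-statistic only reflects which group is listed first.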

To get a better look at the underlying comparison, I’ll instead fit the t-test using the researchpy library.

## t-test in researchpy
import researchpy as rp

rp.ttest(group1 = df2[df2['species'] == 'setosa']['sepal_length'],
         group2 = df2[df2['species'] == 'versicolor']['sepal_length'])

This output is more informative. We can see the summary stats for both groups at the top, and the t-test results follow below. We see the observed difference: versicolor has a sepal length that is, on average, 0.93 longer than setosa (5.006 – 5.936 = -0.93). We also get the degrees of freedom, t-statistic, and p-value, along with several measures of effect size.

Linear Regression

Now that we see what the output looks like, let’s confirm that this is indeed just linear regression!

We fit our model using the statsmodels library.

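## Linear model to compare results with t-test (convert the species types to dummy variables)
import pandas as pd

X = df2[['species']]
## dtype = float keeps the dummy column numeric (newer pandas versions return booleans by default)
X = pd.get_dummies(X['species'], drop_first = True, dtype = float)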

y = df2[['sepal_length']]

## add an intercept constant, since it isn't done automatically
X = smf.add_constant(X)

# Build regression model
fit_lm = smf.OLS(y, X).fit()

# Get model output

fit_lm.summary()

Notice that the slope coefficient for versicolor is 0.93, indicating its sepal length is, on average, 0.93 greater than setosa’s sepal length. This is the same result we obtained with our t-test above.

The intercept coefficient is 5.006, which means that when versicolor is set to “0” in the model (0 * 0.93 = 0) all we are left with is the intercept, which is the mean value for setosa’s sepal length, the same as we saw in our t-test.
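The same relationship can be checked by hand on a tiny example. The sketch below fits a dummy-coded regression with the closed-form least-squares formulas, using only the standard library. The helper simple_ols and the toy numbers are hypothetical and stand in for the statsmodels fit and the iris data.

```python
import math
import statistics

def simple_ols(x, y):
    """Closed-form least-squares fit of y = b0 + b1 * x, plus the slope's t-statistic."""
    n = len(x)
    x_bar, y_bar = statistics.mean(x), statistics.mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    # Residual variance uses n - 2 degrees of freedom (two estimated coefficients)
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    t_slope = b1 / math.sqrt(sse / (n - 2) / sxx)
    return b0, b1, t_slope

ref = [4.0, 5.0, 6.0]      # toy reference group, coded 0
other = [6.0, 7.0, 8.0]    # toy comparison group, coded 1
x = [0.0] * len(ref) + [1.0] * len(other)
y = ref + other

intercept, slope, t_slope = simple_ols(x, y)
print(intercept, statistics.mean(ref))                        # intercept vs. reference-group mean
print(slope, statistics.mean(other) - statistics.mean(ref))   # slope vs. difference in means
print(t_slope)                                                # t-statistic for the slope
```

The intercept equals the reference group's mean, the slope equals the difference in means, and the slope's t-statistic has the same magnitude as the pooled two-sample t-statistic for these two groups, with n - 2 degrees of freedom in both cases.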

What about the residuals?

The original question was about residuals from the t-test. Recall that a residual is the difference between the actual/observed value and the predicted value. When we have a linear regression with a single two-level predictor (a t-test), the predicted value for each observation is simply the mean value of its group.

We can add the predictions from the linear regression model to our data set, calculate and plot the residuals, and then calculate the mean squared error. (The fitted statsmodels results object also stores the residuals directly, as fit_lm.resid.)

## Add the predictions back to the data set
df2['preds'] = fit_lm.predict(X)

## Calculate the residual
resid = df2['sepal_length'] - df2['preds']

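## plot the residuals (fill replaces the shade argument, which newer seaborn versions no longer accept)
import matplotlib.pyplot as plt
import seaborn as sns

sns.kdeplot(resid, fill = True)
plt.axvline(x = 0, linewidth = 4, linestyle = '--', color = 'r')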

The full code also takes the alternative approach of using each group's mean as the “predicted” value, which is exactly the value the model predicts. At the bottom of the code, you will see that the MSE comes out the same either way.
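A minimal sketch of that comparison, assuming toy numbers rather than the iris values and using only the standard library (the helper mse and all variable names are illustrative, not taken from the notebook):

```python
import statistics

# Toy values for illustration only (not the iris measurements)
groups = {"ref": [4.0, 5.0, 6.0], "other": [6.0, 7.0, 8.0]}

# Approach 1: "predict" every observation with its own group mean
resid_means = [v - statistics.mean(vals) for vals in groups.values() for v in vals]

# Approach 2: dummy-coded least squares (ref = 0, other = 1), closed form
x = [0.0] * len(groups["ref"]) + [1.0] * len(groups["other"])
y = groups["ref"] + groups["other"]
x_bar, y_bar = statistics.mean(x), statistics.mean(y)
slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
         / sum((xi - x_bar) ** 2 for xi in x))
intercept = y_bar - slope * x_bar
resid_lm = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

def mse(residuals):
    """Mean squared error: the average of the squared residuals."""
    return sum(r ** 2 for r in residuals) / len(residuals)

print(mse(resid_means), mse(resid_lm))
```

Both approaches give the same residuals, observation by observation, so any summary of them (MSE included) matches as well.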

Wrapping Up

In summary, whenever you think t-test or ANOVA, just think linear regression. The intercept will end up reflecting the mean value for the reference class, while the coefficient(s) for the other classes of that variable will represent the difference between their mean value and that of the reference class. If you want to make a comparison to a different reference class, you can change the reference class before specifying your model. You will obtain a different set of coefficients (since they are now compared to a new reference class), but the predicted values, residuals, and MSE will end up being the same.
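The point about reference classes can be sketched on a two-group toy example by flipping the 0/1 coding. This uses only the standard library; the helper simple_ols and the numbers are hypothetical and are not the iris values.

```python
import statistics

def simple_ols(x, y):
    """Closed-form least-squares fit of y = b0 + b1 * x; returns (b0, b1)."""
    x_bar, y_bar = statistics.mean(x), statistics.mean(y)
    b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
          / sum((xi - x_bar) ** 2 for xi in x))
    return y_bar - b1 * x_bar, b1

a = [4.0, 5.0, 6.0]   # toy group "a"
b = [6.0, 7.0, 8.0]   # toy group "b"
y = a + b

# Reference class "a": the dummy is 1 for observations in group "b"
dummy_b = [0.0] * len(a) + [1.0] * len(b)
# Reference class "b": the dummy is 1 for observations in group "a"
dummy_a = [1.0 - d for d in dummy_b]

coef_ref_a = simple_ols(dummy_b, y)
coef_ref_b = simple_ols(dummy_a, y)

preds_ref_a = [coef_ref_a[0] + coef_ref_a[1] * d for d in dummy_b]
preds_ref_b = [coef_ref_b[0] + coef_ref_b[1] * d for d in dummy_a]

print(coef_ref_a, coef_ref_b)      # different intercepts, slopes with opposite signs
print(preds_ref_a == preds_ref_b)  # predictions are the group means either way
```

The intercept moves to the new reference group's mean and the slope flips sign, while every observation still gets its own group mean as its prediction.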

Again, if you’d like the full code, you can access it HERE.