{"id":2430,"date":"2022-05-16T00:55:11","date_gmt":"2022-05-16T00:55:11","guid":{"rendered":"http:\/\/optimumsportsperformance.com\/blog\/?p=2430"},"modified":"2022-11-08T03:36:41","modified_gmt":"2022-11-08T03:36:41","slug":"t-test-anova-its-linear-regression-all-the-way-down","status":"publish","type":"post","link":"https:\/\/optimumsportsperformance.com\/blog\/t-test-anova-its-linear-regression-all-the-way-down\/","title":{"rendered":"t-test&#8230;ANOVA&#8230;It&#8217;s linear regression all the way down!"},"content":{"rendered":"<p>I had someone ask me a question the other day about t-tests. The question was how to get the residuals from a t-test. The thing we need to remember about t-tests and ANOVA is that they are general linear models. As such, an easier way of thinking about them is that they are a different way of looking at a regression output. In this sense, a t-test is just a simple linear regression with a single categorical predictor (independent) variable that has two levels (e.g., Male &amp; Female) while ANOVA is a simple linear regression with a single predictor variable that has more than two levels (e.g., Cat, Dog, Fish).<\/p>\n<p>Let&#8217;s look at an example!<\/p>\n<p>Complete code is available on my <span style=\"color: #0000ff;\"><strong><a style=\"color: #0000ff;\" href=\"https:\/\/github.com\/pw2\/Python-Tips-and-Tricks\/blob\/master\/General%20Linear%20Models%20-%20t-tests%20is%20just%20linear%20regression.ipynb\">GITHUB page<\/a><\/strong><\/span>.<\/p>\n<p><strong>Load Data<\/strong><\/p>\n<p>The data we will use is the iris data set, available in the <strong>seaborn<\/strong> library.<\/p>\n<p><a href=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2022\/05\/Screen-Shot-2022-05-15-at-5.17.06-PM.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-large wp-image-2431\" 
src=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2022\/05\/Screen-Shot-2022-05-15-at-5.17.06-PM-1024x386.png\" alt=\"\" width=\"625\" height=\"236\" srcset=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2022\/05\/Screen-Shot-2022-05-15-at-5.17.06-PM-1024x386.png 1024w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2022\/05\/Screen-Shot-2022-05-15-at-5.17.06-PM-300x113.png 300w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2022\/05\/Screen-Shot-2022-05-15-at-5.17.06-PM-768x290.png 768w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2022\/05\/Screen-Shot-2022-05-15-at-5.17.06-PM-624x235.png 624w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2022\/05\/Screen-Shot-2022-05-15-at-5.17.06-PM.png 1044w\" sizes=\"auto, (max-width: 625px) 100vw, 625px\" \/><\/a><\/p>\n<p><strong>Exploratory Data Analysis<\/strong><\/p>\n<p>The Jupyter Notebook I&#8217;ve made available on <strong><span style=\"color: #0000ff;\"><a style=\"color: #0000ff;\" href=\"https:\/\/github.com\/pw2\/Python-Tips-and-Tricks\/blob\/master\/General%20Linear%20Models%20-%20t-tests%20is%20just%20linear%20regression.ipynb\">GITHUB<\/a><\/span><\/strong> has a number of EDA steps. 
For this tutorial, the variable we will look at is sepal length, which appears to differ between species.<\/p>\n<p><a href=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2022\/05\/Screen-Shot-2022-05-15-at-5.17.36-PM.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-large wp-image-2432\" src=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2022\/05\/Screen-Shot-2022-05-15-at-5.17.36-PM-1024x637.png\" alt=\"\" width=\"625\" height=\"389\" srcset=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2022\/05\/Screen-Shot-2022-05-15-at-5.17.36-PM-1024x637.png 1024w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2022\/05\/Screen-Shot-2022-05-15-at-5.17.36-PM-300x187.png 300w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2022\/05\/Screen-Shot-2022-05-15-at-5.17.36-PM-768x478.png 768w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2022\/05\/Screen-Shot-2022-05-15-at-5.17.36-PM-624x388.png 624w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2022\/05\/Screen-Shot-2022-05-15-at-5.17.36-PM.png 1434w\" sizes=\"auto, (max-width: 625px) 100vw, 625px\" \/><\/a><\/p>\n<p><strong>T-test<\/strong><\/p>\n<p>We will start by conducting a t-test. 
Since a t-test is a comparison of means between two groups, I&#8217;ll create a data set with only the setosa and versicolor species.<\/p>\n<pre class=\"brush: python; title: ; notranslate\" title=\"\">\r\n## get two groups to compare\r\ntwo_groups = &#x5B;'setosa', 'versicolor']\r\n\r\n## create a data frame of the two groups\r\n## (.copy() so columns can be added later without a SettingWithCopyWarning)\r\ndf2 = df&#x5B;df&#x5B;'species'].isin(two_groups)].copy()\r\n<\/pre>\n<p>First I build the t-test in two common stats libraries in Python, <strong>statsmodels<\/strong> and <strong>scipy<\/strong>.<\/p>\n<pre class=\"brush: python; title: ; notranslate\" title=\"\">\r\n## t-test in statsmodels.api\r\nsmf.stats.ttest_ind(x1 = df2&#x5B;df2&#x5B;'species'] == 'setosa']&#x5B;'sepal_length'],\r\n                    x2 = df2&#x5B;df2&#x5B;'species'] == 'versicolor']&#x5B;'sepal_length'],\r\n                    alternative=&quot;two-sided&quot;)\r\n\r\n## t-test in scipy\r\nstats.ttest_ind(a = df2&#x5B;df2&#x5B;'species'] == 'setosa']&#x5B;'sepal_length'],\r\n                b = df2&#x5B;df2&#x5B;'species'] == 'versicolor']&#x5B;'sepal_length'],\r\n                alternative=&quot;two-sided&quot;)\r\n<\/pre>\n<p>Unfortunately, the output of both of these approaches leaves a lot to be desired. 
They simply return the t-statistic, p-value, and degrees of freedom.<\/p>\n<p>To get a better look at the underlying comparison, I&#8217;ll instead fit the t-test using the <strong>researchpy <\/strong>library.<\/p>\n<pre class=\"brush: python; title: ; notranslate\" title=\"\">\r\n## t-test in researchpy\r\nrp.ttest(group1 = df2&#x5B;df2&#x5B;'species'] == 'setosa']&#x5B;'sepal_length'],\r\n         group2 = df2&#x5B;df2&#x5B;'species'] == 'versicolor']&#x5B;'sepal_length'])\r\n<\/pre>\n<p><a href=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2022\/05\/Screen-Shot-2022-05-15-at-5.39.03-PM.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-large wp-image-2433\" src=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2022\/05\/Screen-Shot-2022-05-15-at-5.39.03-PM-1024x434.png\" alt=\"\" width=\"625\" height=\"265\" srcset=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2022\/05\/Screen-Shot-2022-05-15-at-5.39.03-PM-1024x434.png 1024w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2022\/05\/Screen-Shot-2022-05-15-at-5.39.03-PM-300x127.png 300w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2022\/05\/Screen-Shot-2022-05-15-at-5.39.03-PM-768x326.png 768w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2022\/05\/Screen-Shot-2022-05-15-at-5.39.03-PM-624x265.png 624w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2022\/05\/Screen-Shot-2022-05-15-at-5.39.03-PM.png 1976w\" sizes=\"auto, (max-width: 625px) 100vw, 625px\" \/><\/a><\/p>\n<p>This output is more informative. We can see the summary stats for both groups at the top, and the t-test results follow below. The observed difference (5.006 &#8211; 5.936 = -0.93) tells us that versicolor&#8217;s sepal length is, on average, 0.93 longer than setosa&#8217;s. 
We also get the degrees of freedom, t-statistic, and p-value, along with several measures of effect size.<\/p>\n<p><strong>Linear Regression<\/strong><\/p>\n<p>Now that we see what the output looks like, let&#8217;s confirm that this is indeed just linear regression!<\/p>\n<p>We fit our model using the <strong>statsmodels <\/strong>library.<\/p>\n<pre class=\"brush: python; title: ; notranslate\" title=\"\">\r\n## Linear model to compare results with t-test (convert the species types to dummy variables)\r\n## dtype = float keeps the dummies numeric; newer pandas versions return booleans by default\r\nX = df2&#x5B;&#x5B;'species']]\r\nX = pd.get_dummies(X&#x5B;'species'], drop_first = True, dtype = float)\r\n\r\ny = df2&#x5B;&#x5B;'sepal_length']]\r\n\r\n## add an intercept constant, since it isn't done automatically\r\nX = smf.add_constant(X)\r\n\r\n## Build regression model\r\nfit_lm = smf.OLS(y, X).fit()\r\n\r\n## Get model output\r\nfit_lm.summary()\r\n<\/pre>\n<p><a href=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2022\/05\/Screen-Shot-2022-05-15-at-5.43.18-PM.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-large wp-image-2434\" src=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2022\/05\/Screen-Shot-2022-05-15-at-5.43.18-PM-1024x874.png\" alt=\"\" width=\"625\" height=\"533\" srcset=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2022\/05\/Screen-Shot-2022-05-15-at-5.43.18-PM-1024x874.png 1024w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2022\/05\/Screen-Shot-2022-05-15-at-5.43.18-PM-300x256.png 300w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2022\/05\/Screen-Shot-2022-05-15-at-5.43.18-PM-768x655.png 768w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2022\/05\/Screen-Shot-2022-05-15-at-5.43.18-PM-624x533.png 624w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2022\/05\/Screen-Shot-2022-05-15-at-5.43.18-PM.png 1256w\" sizes=\"auto, (max-width: 625px) 100vw, 625px\" 
\/><\/a><\/p>\n<p>Notice that the slope coefficient for versicolor is 0.93, indicating its sepal length is, on average, 0.93 greater than setosa&#8217;s sepal length. This is the same result we obtained with our t-test above.<\/p>\n<p>The intercept coefficient is 5.006, which means that when versicolor is set to &#8220;0&#8221; in the model (0 * 0.93 = 0) all we are left with is the intercept, which is the mean value for setosa&#8217;s sepal length, the same as we saw in our t-test.<\/p>\n<p><strong>What about the residuals?<\/strong><\/p>\n<p>The original question was about residuals from the t-test. Recall that a residual is the difference between the actual\/observed value and the predicted value. When we have a simple linear regression with a single two-level predictor (a t-test), the predicted value for an observation is simply the mean value of the group it belongs to.<\/p>\n<p>We can add predictions from the linear regression model into our data set and calculate the residuals, plot the residuals, and then calculate the mean squared error.<\/p>\n<pre class=\"brush: python; title: ; notranslate\" title=\"\">\r\n## Add the predictions back to the data set\r\ndf2&#x5B;'preds'] = fit_lm.predict(X)\r\n\r\n## Calculate the residual\r\nresid = df2&#x5B;'sepal_length'] - df2&#x5B;'preds']\r\n\r\n## plot the residuals (fill replaces the deprecated shade argument)\r\nsns.kdeplot(resid, fill = True)\r\nplt.axvline(x = 0, linewidth = 4, linestyle = '--', color = 'r')\r\n<\/pre>\n<p><a href=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2022\/05\/Screen-Shot-2022-05-15-at-5.48.29-PM.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-2435\" src=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2022\/05\/Screen-Shot-2022-05-15-at-5.48.29-PM.png\" alt=\"\" width=\"531\" height=\"439\" srcset=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2022\/05\/Screen-Shot-2022-05-15-at-5.48.29-PM.png 936w, 
https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2022\/05\/Screen-Shot-2022-05-15-at-5.48.29-PM-300x248.png 300w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2022\/05\/Screen-Shot-2022-05-15-at-5.48.29-PM-768x635.png 768w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2022\/05\/Screen-Shot-2022-05-15-at-5.48.29-PM-624x516.png 624w\" sizes=\"auto, (max-width: 531px) 100vw, 531px\" \/><\/a><\/p>\n<p>In the code, you will find the same approach taken by just applying the group mean as the &#8220;predicted&#8221; value, which is the same value that the model predicts. At the bottom of the code, we will find that the outcome of the MSE is the same.<\/p>\n<p><a href=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2022\/05\/Screen-Shot-2022-05-15-at-5.51.04-PM.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-2436\" src=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2022\/05\/Screen-Shot-2022-05-15-at-5.51.04-PM.png\" alt=\"\" width=\"383\" height=\"162\" srcset=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2022\/05\/Screen-Shot-2022-05-15-at-5.51.04-PM.png 478w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2022\/05\/Screen-Shot-2022-05-15-at-5.51.04-PM-300x127.png 300w\" sizes=\"auto, (max-width: 383px) 100vw, 383px\" \/><\/a><\/p>\n<p><strong>Wrapping Up<\/strong><\/p>\n<p>In summary, whenever you think t-test or ANOVA, just think linear regression. The intercept will end up reflecting the mean value for the reference class while the coefficient(s) for the other classes of that variable will represent the difference between their mean value and the reference class. 
If you want to make a comparison to a different reference class, you can change the reference class before specifying your model and you will obtain a different set of coefficients (given that they are compared to a new reference class) but the predicted values, residuals, and MSE will end up being the same.<\/p>\n<p>Again, if you&#8217;d like the full code, you can access it <strong><span style=\"color: #0000ff;\"><a style=\"color: #0000ff;\" href=\"https:\/\/github.com\/pw2\/Python-Tips-and-Tricks\/blob\/master\/General%20Linear%20Models%20-%20t-tests%20is%20just%20linear%20regression.ipynb\">HERE<\/a><\/span><\/strong>.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>I had someone ask me a question the other day about t-tests. The question was regarding how to get the residuals from a t-test. The thing we need to remember about t-tests and ANOVA is that they are general linear models. As such, an easier way of thinking about them is that they are a [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[48,46,43],"tags":[],"class_list":["post-2430","post","type-post","status-publish","format-standard","hentry","category-model-building-in-python","category-python-tips-tricks","category-sports-analytics"],"_links":{"self":[{"href":"https:\/\/optimumsportsperformance.com\/blog\/wp-json\/wp\/v2\/posts\/2430","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/optimumsportsperformance.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/optimumsportsperformance.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/optimumsportsperformance.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/optimumsportsperformance.com\/blog\/wp-json\/wp\/v2\/comments?post=2430"}],"version-history":[{"count":3,"href":"https:\/\/optimumspo
rtsperformance.com\/blog\/wp-json\/wp\/v2\/posts\/2430\/revisions"}],"predecessor-version":[{"id":2439,"href":"https:\/\/optimumsportsperformance.com\/blog\/wp-json\/wp\/v2\/posts\/2430\/revisions\/2439"}],"wp:attachment":[{"href":"https:\/\/optimumsportsperformance.com\/blog\/wp-json\/wp\/v2\/media?parent=2430"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/optimumsportsperformance.com\/blog\/wp-json\/wp\/v2\/categories?post=2430"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/optimumsportsperformance.com\/blog\/wp-json\/wp\/v2\/tags?post=2430"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}