{"id":2670,"date":"2022-10-17T01:43:44","date_gmt":"2022-10-17T01:43:44","guid":{"rendered":"http:\/\/optimumsportsperformance.com\/blog\/?p=2670"},"modified":"2022-11-08T03:32:58","modified_gmt":"2022-11-08T03:32:58","slug":"approximating-a-bayesian-posterior-prediction","status":"publish","type":"post","link":"https:\/\/optimumsportsperformance.com\/blog\/approximating-a-bayesian-posterior-prediction\/","title":{"rendered":"Approximating a Bayesian Posterior Prediction"},"content":{"rendered":"<p>This past week on the <span style=\"color: #0000ff;\"><strong><a style=\"color: #0000ff;\" href=\"https:\/\/shows.acast.com\/wharton-moneyball\/episodes\/10122022-mlb-cfb-nfl-poe-dennehy-naughton\">Wharton Moneyball Podcast<\/a><\/strong><\/span>, during Quarter 2 of the show the hosts got into a discussion about Bayesian methods, simulating uncertainty, and frequentist approaches to statistical analysis. The show hosts are all strong Bayesian advocates but at one point in the discussion, Eric Bradlow mentioned that &#8220;<em>frequenstist can answer similar questions by building distributions from their model predictions.&#8221; <\/em>(paraphrasing)<\/p>\n<p>This comment reminded me of Chapter 7 in Gelman and Hill&#8217;s brilliant book, <strong><span style=\"color: #0000ff;\"><a style=\"color: #0000ff;\" href=\"https:\/\/www.amazon.com\/Analysis-Regression-Multilevel-Hierarchical-Models\/dp\/052168689X\">Data Analysis Using Regression and Multilevel\/Hierarchical Models<\/a><\/span><\/strong>. In this chapter, the authors do what they call an <em>informal Bayesian approach<\/em> by simulating the predictions from a linear regression model. It&#8217;s an interesting (and easy) approach that can be helpful for getting a sense for how Bayesian posterior predictions work without building a full Bayesian model. I call it, <em>&#8220;building a bridge to Bayes&#8221;.<\/em><\/p>\n<p>Since the entire book is coded in R, I decided to code an example in Python.<\/p>\n<p>The Jupyter notebook is accessible on my <strong><span style=\"color: #0000ff;\"><a style=\"color: #0000ff;\" href=\"https:\/\/github.com\/pw2\/Python-Tips-and-Tricks\/blob\/master\/Approximating%20a%20Bayesian%20Posterior%20Prediction.ipynb\">GITHUB page<\/a><\/span><\/strong>.<\/p>\n<p>(<em><strong>Special Material:<\/strong> In the <strong><span style=\"color: #0000ff;\"><a style=\"color: #0000ff;\" href=\"https:\/\/github.com\/pw2\/Python-Tips-and-Tricks\/blob\/master\/Approximating%20a%20Bayesian%20Posterior%20Prediction.ipynb\">Jupyter Notebook<\/a><\/span><\/strong> I also include additional material on how to calculate prediction intervals and confidence intervals by hand in python. I wont go over those entire sections here as I don&#8217;t want the post to get too long.<\/em>)<\/p>\n<p><span style=\"text-decoration: underline;\"><strong>Libraries &amp; Data<\/strong><\/span><\/p>\n<p>We will start by loading up the libraries that we need and the data. I&#8217;ll use the iris data set, since it is conveniently available from the <strong>sklearn <\/strong>library. We will build the regression model using the <strong>statsmodels.api<\/strong> library.<\/p>\n<pre class=\"brush: python; title: ; notranslate\" title=\"\">\r\n## import libraries\r\n\r\nfrom sklearn import datasets\r\nimport pandas as pd\r\nimport numpy as np\r\nimport statsmodels.api as smf\r\nfrom scipy import stats\r\nimport matplotlib.pyplot as plt\r\n\r\n## iris data set\r\niris = datasets.load_iris()\r\n\r\n## convert to pandas sata frame\r\ndata = iris&#x5B;'data']\r\ntarget = iris&#x5B;'target']\r\n\r\niris_df = pd.DataFrame(data)\r\niris_df.columns = &#x5B;'sepal_length', 'sepal_width', 'petal_length', 'petal_width']\r\niris_df.head(3)\r\n<\/pre>\n<p><a href=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2022\/10\/Screen-Shot-2022-10-16-at-6.42.02-PM.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-2671\" src=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2022\/10\/Screen-Shot-2022-10-16-at-6.42.02-PM.png\" alt=\"\" width=\"446\" height=\"123\" srcset=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2022\/10\/Screen-Shot-2022-10-16-at-6.42.02-PM.png 446w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2022\/10\/Screen-Shot-2022-10-16-at-6.42.02-PM-300x83.png 300w\" sizes=\"auto, (max-width: 446px) 100vw, 446px\" \/><\/a><\/p>\n<p><span style=\"text-decoration: underline;\"><strong>Build a linear regression model<\/strong><\/span><\/p>\n<p>Next, we build a simple ordinary least squares regression model to predict <strong>petal_width<\/strong> from <strong>petal_length<\/strong>.<\/p>\n<p>&nbsp;<\/p>\n<pre class=\"brush: python; title: ; notranslate\" title=\"\">\r\n## Set up our X and y variables\r\nX = iris_df&#x5B;'petal_length']\r\ny = iris_df&#x5B;'petal_width']\r\n\r\n# NOTE: statsmodels does not automatically add an intercept, so you need to do that manually\r\n\r\nX = smf.add_constant(X)\r\n\r\n# Build regression model\r\n# NOTE: the X and y variables are reversed in the function compared to sklearn\r\n\r\nfit_lm = smf.OLS(y, X).fit()\r\n\r\n# Get an R-like output of the model\r\n\r\nfit_lm.summary()\r\n<\/pre>\n<p><a href=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2022\/10\/Screen-Shot-2022-10-16-at-6.44.09-PM.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-2673\" src=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2022\/10\/Screen-Shot-2022-10-16-at-6.44.09-PM.png\" alt=\"\" width=\"549\" height=\"480\" srcset=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2022\/10\/Screen-Shot-2022-10-16-at-6.44.09-PM.png 549w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2022\/10\/Screen-Shot-2022-10-16-at-6.44.09-PM-300x262.png 300w\" sizes=\"auto, (max-width: 549px) 100vw, 549px\" \/><\/a><\/p>\n<p><span style=\"text-decoration: underline;\"><strong>Simulate a distribution around the slope coefficient<\/strong><\/span><\/p>\n<p>The slope coefficient for the\u00a0<strong>petal_length<\/strong> variable, from the model output above, is 0.4158 with a standard error of 0.01. We will store these two values in their own variables and then use them to create a simulation of 10,000 samples and plot the distribution.<\/p>\n<pre class=\"brush: python; title: ; notranslate\" title=\"\">\r\n## get summary stats\r\nmu_slope = 0.4158\r\nse_slope = 0.01\r\n\r\n## create simulation\r\nn_samples = 10000\r\ncoef_sim = np.random.normal(loc = mu_slope, scale = se_slope, size = n_samples)\r\n\r\n## plot simulated distribution\r\n\r\nplt.hist(coef_sim, bins = 60)\r\nplt.show()\r\n<\/pre>\n<p><a href=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2022\/10\/Screen-Shot-2022-10-16-at-6.51.14-PM.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-2675\" src=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2022\/10\/Screen-Shot-2022-10-16-at-6.51.14-PM.png\" alt=\"\" width=\"365\" height=\"235\" srcset=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2022\/10\/Screen-Shot-2022-10-16-at-6.51.14-PM.png 365w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2022\/10\/Screen-Shot-2022-10-16-at-6.51.14-PM-300x193.png 300w\" sizes=\"auto, (max-width: 365px) 100vw, 365px\" \/><\/a><\/p>\n<p>We can also grab the summary statistics from the simulated distribution. We will snag the mean and the 90% quantile interval.<\/p>\n<pre class=\"brush: python; title: ; notranslate\" title=\"\">\r\n## get summary stats from our simulation\r\nsummary_stats = {\r\n    'Mean': coef_sim.mean(),\r\n    'Low90': np.quantile(coef_sim, 0.05),\r\n    'High90': np.quantile(coef_sim, 0.95)\r\n}\r\n\r\nsummary_stats\r\n<\/pre>\n<p><a href=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2022\/10\/Screen-Shot-2022-10-16-at-6.52.54-PM.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-2676\" src=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2022\/10\/Screen-Shot-2022-10-16-at-6.52.54-PM.png\" alt=\"\" width=\"360\" height=\"64\" srcset=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2022\/10\/Screen-Shot-2022-10-16-at-6.52.54-PM.png 360w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2022\/10\/Screen-Shot-2022-10-16-at-6.52.54-PM-300x53.png 300w\" sizes=\"auto, (max-width: 360px) 100vw, 360px\" \/><\/a><\/p>\n<p><span style=\"text-decoration: underline;\"><strong>Making a prediction on a new observation and building a posterior predictive distribution<\/strong><\/span><\/p>\n<p>Now that we&#8217;ve gotten a good sense for how to create a simulation in python, we can create a new observation of <strong>petal_length<\/strong> and make a prediction about what the <strong>petal_width<\/strong> would be based on our model. In addition, we will get the prediction intervals from the output and use them to calculate a standard error for the prediction, which we will use for the posterior predictive simulation.<\/p>\n<p><em><strong>Technical Note:<\/strong> In the prediction output you will see that python returns a <strong>mean_ci_lower<\/strong><\/em> and <em><strong>mean_ci_upper<\/strong><\/em><em> and an <strong>obs_ci_lower <\/strong>and<strong> obs_ci_higher<\/strong>. The latter two are the prediction intervals\u00a0 while the former two are the confidence intervals. I previously discussed the difference <strong><span style=\"color: #0000ff;\"><a style=\"color: #0000ff;\" href=\"https:\/\/optimumsportsperformance.com\/blog\/confidence-intervals-vs-prediction-intervals-a-frequentist-bayesian-example\/\">HERE<\/a><\/span><\/strong><\/em> <em>and this is confirmed in the Jupyter Notebook where I calculate these values by hand.<\/em><\/p>\n<pre class=\"brush: python; title: ; notranslate\" title=\"\">\r\n# create a new petal_length observation\r\nnew_dat = np.array(&#x5B;&#x5B;1, 1.7]])    # add a 1 upfront to indicate the model intercept\r\n\r\n# make prediction of petal_width using the model\r\nprediction = fit_lm.get_prediction(new_dat)\r\n\r\n# get summary of the prediction\r\nprediction.summary_frame()\r\n<\/pre>\n<p><a href=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2022\/10\/Screen-Shot-2022-10-16-at-7.00.43-PM.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-2677\" src=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2022\/10\/Screen-Shot-2022-10-16-at-7.00.43-PM.png\" alt=\"\" width=\"562\" height=\"62\" srcset=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2022\/10\/Screen-Shot-2022-10-16-at-7.00.43-PM.png 562w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2022\/10\/Screen-Shot-2022-10-16-at-7.00.43-PM-300x33.png 300w\" sizes=\"auto, (max-width: 562px) 100vw, 562px\" \/><\/a><\/p>\n<p>Store the predicted value (0.343709) and then calculate the standard error from the lower and upper prediction intervals. Run a simulation and then plot the distribution of predictions.<\/p>\n<pre class=\"brush: python; title: ; notranslate\" title=\"\">\r\nmu_pred = 0.343709\r\nse_pred = 0.754956 - 0.343709     # subtract the upper prediction interval from the mean to get the variability\r\nn_sims = 10000\r\n\r\npred_obs = np.random.normal(loc = mu_pred, scale = se_pred, size = n_sims)\r\n\r\nplt.hist(pred_obs, bins = 60)\r\nplt.show()\r\n<\/pre>\n<p><a href=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2022\/10\/Screen-Shot-2022-10-16-at-7.02.32-PM.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-2678\" src=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2022\/10\/Screen-Shot-2022-10-16-at-7.02.32-PM.png\" alt=\"\" width=\"374\" height=\"229\" srcset=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2022\/10\/Screen-Shot-2022-10-16-at-7.02.32-PM.png 374w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2022\/10\/Screen-Shot-2022-10-16-at-7.02.32-PM-300x184.png 300w\" sizes=\"auto, (max-width: 374px) 100vw, 374px\" \/><\/a><\/p>\n<p>Just as we did for the simulation of the slope coefficient we can extract our summary statistics (mean and 90% quantile intervals).<\/p>\n<pre class=\"brush: python; title: ; notranslate\" title=\"\">\r\n## get summary stats from our simulation\r\nsummary_stats = {\r\n    'Mean': pred_obs.mean(),\r\n    'Low90': np.quantile(pred_obs, 0.05),\r\n    'High90': np.quantile(pred_obs, 0.95)\r\n}\r\n\r\nsummary_stats\r\n<\/pre>\n<p><a href=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2022\/10\/Screen-Shot-2022-10-16-at-7.02.38-PM.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-2679\" src=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2022\/10\/Screen-Shot-2022-10-16-at-7.02.38-PM.png\" alt=\"\" width=\"313\" height=\"60\" srcset=\"https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2022\/10\/Screen-Shot-2022-10-16-at-7.02.38-PM.png 313w, https:\/\/optimumsportsperformance.com\/blog\/wp-content\/uploads\/2022\/10\/Screen-Shot-2022-10-16-at-7.02.38-PM-300x58.png 300w\" sizes=\"auto, (max-width: 313px) 100vw, 313px\" \/><\/a><span style=\"text-decoration: underline;\"><strong>Wrapping Up<\/strong><\/span><\/p>\n<p>That is a pretty easy way to get a sense for approximating a Bayesian posterior predictive distribution. Rather than simply reporting the predicted value and a confidence interval or prediction interval, it is sometimes nice to build an entire distribution. Aside from it being visually appealing, it allows us to answer other questions we might have, for example, <em>what percentage of the data is greater or less than 0.5 (or any other threshold value you might be interested in)?<\/em><\/p>\n<p>As stated earlier, all of this code is accessible on my <strong><span style=\"color: #0000ff;\"><a style=\"color: #0000ff;\" href=\"https:\/\/github.com\/pw2\/Python-Tips-and-Tricks\/blob\/master\/Approximating%20a%20Bayesian%20Posterior%20Prediction.ipynb\">GITHUB page<\/a><\/span><\/strong> and the Jupyter notebook also has additional sections on how to calculate confidence intervals and prediction intervals by hand.<\/p>\n<p>If you notice any errors, let me know!<\/p>\n","protected":false},"excerpt":{"rendered":"<p>This past week on the Wharton Moneyball Podcast, during Quarter 2 of the show the hosts got into a discussion about Bayesian methods, simulating uncertainty, and frequentist approaches to statistical analysis. The show hosts are all strong Bayesian advocates but at one point in the discussion, Eric Bradlow mentioned that &#8220;frequenstist can answer similar questions [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[49,46,43,42],"tags":[],"class_list":["post-2670","post","type-post","status-publish","format-standard","hentry","category-bayesian-model-building","category-python-tips-tricks","category-sports-analytics","category-sports-science"],"_links":{"self":[{"href":"https:\/\/optimumsportsperformance.com\/blog\/wp-json\/wp\/v2\/posts\/2670","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/optimumsportsperformance.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/optimumsportsperformance.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/optimumsportsperformance.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/optimumsportsperformance.com\/blog\/wp-json\/wp\/v2\/comments?post=2670"}],"version-history":[{"count":3,"href":"https:\/\/optimumsportsperformance.com\/blog\/wp-json\/wp\/v2\/posts\/2670\/revisions"}],"predecessor-version":[{"id":2680,"href":"https:\/\/optimumsportsperformance.com\/blog\/wp-json\/wp\/v2\/posts\/2670\/revisions\/2680"}],"wp:attachment":[{"href":"https:\/\/optimumsportsperformance.com\/blog\/wp-json\/wp\/v2\/media?parent=2670"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/optimumsportsperformance.com\/blog\/wp-json\/wp\/v2\/categories?post=2670"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/optimumsportsperformance.com\/blog\/wp-json\/wp\/v2\/tags?post=2670"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}