Bayesian Modeling Primer
Well, dear reader, I know I haven’t been posting very much lately. That’s because I’ve been busy moving to a new city and working a new DS gig and learning some new things, including Bayesian modeling. In particular I’ve been reading Richard McElreath’s excellent book Statistical Rethinking, which I recommend to you as well. As a dedicated reader of this blog, I’m sure you’re perfectly capable of digesting a 600 page statistics textbook on your own, but just for fun, today I present to you my Bayesian statistics crash course.
My primary goal is to illuminate the major steps in the Bayesian workflow, so that you have a mental framework in which to store and contextualize new pieces of information as you learn. My secondary goal is to give you an intuitive understanding of Bayesian modeling from two interconnected perspectives: a mathematical formulation based primarily in probability theory and a probabilistic programming approach based on writing code to generate random data. Each perspective supports the other, and they are both necessary to grasp the full picture. I will attempt to weave these two perspectives throughout the description of the workflow, which is motivated by a toy example we’ll use throughout the post.
Let’s do this! ➡️
🪨📄✂️ The Rock Paper Scissors Pro
I spent a summer as an intern at RAND Corporation during my PhD. It was a fascinating place full of fascinating characters. One of the researchers, Fritz R, liked to take each cohort of interns out for drinks at some point in the summer. After picking up our first round himself, Fritz offered to buy a second drink for any of the interns who could beat him in a rock paper scissors (RPS) match, warning us that he was “pretty good at it.”
Let’s fact check his claim. We’d like to know something about his actual RPS win rate, but we can’t observe it directly. We can, however, observe some match outcomes and use them to infer what his actual win rate might plausibly be.
Let’s say that after facing off with the 10 interns, Fritz racks up the following match outcomes.
observed_outcomes = [1, 1, 0, 1, 0, 0, 1, 1, 1, 1]
He won 7 out of 10 matches—not bad. But is his performance the result of skill or simply a lucky round? We’re going to address this question using Bayesian statistical analysis.
🛠️ The Bayesian Workflow in 3 Steps
I consider the Bayesian workflow to have 3 major steps:
- Modeling – specify the data generating process as a generative model
- Inference – use the model, the observed data, and some inference algorithm to infer the values of unknown model parameters
- Interpretation – summarize and interpret the inferred model parameters to answer your analysis questions
⚙️ Step 1. Modeling
Modeling the Data Generating Process
In this step, we’re going to build a generative model, i.e. a model that can simulate data similar to our observed data. If you’re coming from ML, the key mental shift is to think about modeling the data generating process (DGP), rather than curve-fitting the data itself. Practically this means our model is a set of random variables which relate to one another in some way and from which we can draw realizations… random realizations, that is. You can invent a DGP as follows:
- Identify the key variables in the system.
- Define each variable as a draw from some probability distribution, or in terms of the other variables.
- Use unknown parameters as needed in the probability distributions or in the functional relationships among the key variables.
In our RPS example, there is one key variable—Fritz’s match outcome. We can define the match outcome variable as a random draw from some distribution, e.g. a Bernoulli distribution. The Bernoulli distribution has one parameter—the success probability—which corresponds here to Fritz’s actual true win rate. Given some true win rate, we can simulate match outcomes by drawing realizations from the Bernoulli distribution.
$$y_i \sim \text{Bernoulli}(\theta)$$

where $y_i = 0$ if Fritz loses to intern $i$ and $y_i = 1$ if he wins, and where $\theta \in [0, 1]$. In this DGP, the parameter $\theta$ corresponds to Fritz’s true win rate.
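For instance, here’s a quick sketch of what this looks like in code if we hypothetically assume a true win rate of 0.6 (a value chosen purely for illustration):

```python
from scipy.stats import bernoulli

# hypothetical true win rate, chosen purely for illustration
theta_hypothetical = 0.6

# simulate 10 match outcomes: 1 = Fritz wins, 0 = Fritz loses
simulated_outcomes = bernoulli.rvs(theta_hypothetical, size=10)
print(simulated_outcomes)  # an array of ten 0/1 values
```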
This is a good start, but we can’t simulate data from this model yet because $\theta$ has no particular value. So, what value should we use?
Probability as Relative Plausibility
One of the key ideas in Bayesian modeling is that we can represent the relative plausibility of potential values of any unobserved variable using a probability distribution. Highly plausible values get higher probability, and less plausible values get lower probability.
It is this view of probability as a measure of relative plausibility that distinguishes Bayesian statistics from Frequentist statistics, which views probability as the relative frequency of events.
We don’t know the true value of Fritz’s RPS win rate, but even before collecting any data, we might have some contextual knowledge about how the world works which can provide some prior information about the relative plausibility of its possible values. For me it’s easiest to think in terms of how surprising a given true value would be. I wouldn’t be surprised at all if his win rate was near 0.5, but I would be shocked if it was 0.9 or 0.1, hence 0.5 has higher relative plausibility than 0.9 or 0.1.
Let’s represent the prior relative plausibility of values of Fritz’s RPS win rate with a probability distribution. Below are a few different probability distributions defined over the possible values $\theta \in [0, 1]$.
```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import beta, bernoulli

# set colors for later
prior_color = "C0"
post_color = "C1"

# prior beta parameters
parameters = [
    (4, 4),
    (7, 7),
    (10, 10)
]

# plot the prior
x = np.linspace(0, 1, 100)
plt.figure()
for i, (alpha, beta_val) in enumerate(parameters):
    y = beta.pdf(x, alpha, beta_val)
    label = f'Beta($\\alpha$={alpha}, $\\beta$={beta_val})'
    plt.plot(x, y, label=label)
plt.xlabel("win rate (theta)")
plt.ylabel("Probability Density")
plt.title('Relative Plausibility of RPS Win Rate')
plt.legend()
```
Each of these PDFs has a mode at $\theta = 0.5$ and decreases toward 0 and 1. They’re all aligned with the relative plausibilities we discussed earlier.
You can check the relative plausibility between two possible values of $\theta$ implied by a given pdf by taking the ratio of the height of the pdf at one value of $\theta$ versus the height at another value of $\theta$.
For example, let’s compare $\theta = 0.5$ to $\theta = 0.7$ for a Beta(10, 10) prior.
beta.pdf(0.5, 10, 10) / beta.pdf(0.7, 10, 10)
np.float64(4.802710683776413)
The Beta(10, 10) distribution implies that a 0.5 win rate is about 5 times more plausible than a 0.7 win rate, which sounds, ahem, plausible.
Priors
We can include this prior information about the relative plausibility of values of $\theta$ in our model as follows.

$$y_i \sim \text{Bernoulli}(\theta)$$
$$\theta \sim \text{Beta}(10, 10)$$

In Bayesian parlance, we call the probability distribution that represents the relative plausibilities of an unobserved parameter its prior distribution, or simply its prior. Notice that with the addition of the prior for $\theta$, our model is now fully generative.
Implementing the generative model
Let’s implement the DGP using random variables from scipy.
```python
# Implementing the DGP as a generative model
def draw_from_prior(alpha_param, beta_param):
    return beta.rvs(alpha_param, beta_param)

def simulate_one_outcome(theta, N):
    y = bernoulli.rvs(theta, size=N)
    return {"theta": theta, "y": y, "sum_y": np.sum(y)}

def simulate_outcomes(n_outcomes, alpha_param, beta_param, N):
    return pd.DataFrame([
        simulate_one_outcome(theta=draw_from_prior(alpha_param, beta_param), N=N)
        for _ in range(n_outcomes)
    ])

# set DGP parameters
alpha_param, beta_param = 10, 10
N = 10

# simulate outcomes from the generative model
outcome_df = simulate_outcomes(1_000, alpha_param, beta_param, N)
outcome_df.head()
```
|   | theta | y | sum_y |
|---|-------|---|-------|
| 0 | 0.451620 | [0, 0, 1, 1, 1, 0, 0, 1, 0, 0] | 4 |
| 1 | 0.460305 | [0, 1, 1, 0, 1, 1, 1, 0, 1, 1] | 7 |
| 2 | 0.555594 | [0, 0, 1, 1, 1, 1, 0, 1, 0, 0] | 5 |
| 3 | 0.518724 | [1, 0, 0, 0, 1, 1, 1, 0, 1, 0] | 5 |
| 4 | 0.569247 | [1, 0, 1, 1, 0, 1, 1, 1, 1, 1] | 8 |
Each time you run this simulation, you first draw a new value of $\theta$ from its prior, then that value is used in the Bernoulli distribution to draw an array of binary match win/loss observations. To help us summarize the match observations in each simulated outcome, we also compute the sum of the match values, i.e. the number of wins.
Prior Predictive Check
But how do we know that the prior we chose is reasonable? There are two places we can look: (1) at the parameter itself and (2) at the downstream variables it influences. We already looked at the parameter itself by inspecting its pdf and thinking about the relative plausibilities it implies. To look at its impact on the downstream variables, we can simply run simulations from the model and inspect the outcome data it produces. If we see it’s generating lots of highly implausible outcomes, then we know something isn’t right. This process is called a prior predictive check, because we’re checking the simulated outcomes (a.k.a. predictions) implied by the prior. Let’s run our model simulation 1000 times and have a look at the distribution of the number of wins out of 10 matches that it predicts, i.e. the sum of the `y` variable from each simulation.
```python
outcome_df.hist("sum_y", bins=50, color=prior_color)
plt.xlabel('sum(y)')
plt.ylabel('count')
plt.title('Prior Predictive Check');
```
The histogram shows most of the simulations yield between 3 and 7 wins, with very few outcomes less than 3 or greater than 7. That seems pretty reasonable.
Let’s look at what the prior predictive check might look like when things aren’t quite right.
```python
(
    simulate_outcomes(1_000, alpha_param=0.5, beta_param=0.5, N=10)
    .hist("sum_y", bins=50, color=prior_color)
)
plt.xlabel('sum(y)')
plt.ylabel('count')
plt.title('Prior Predictive Check');
```
In this simulation, many of the outcomes are close to 0 or 10 wins out of 10. From our prior knowledge about RPS, we know it would be possible but very unusual for someone to win either 0/10 or 10/10 matches. This tips us off that something isn’t right with our priors. At this point we would iterate on our priors until we find something reasonable like our Beta(10, 10).
Once we’ve got our generative model and its priors nailed down, we’re ready to move from the modeling step to the inference step!
🧮 Step 2. Inference
The Goal of Bayesian Inference
In the inference step, we use observed outcome data to infer the plausible values of the unobserved parameters. Whereas simulation passes information forward from parameters to outcomes, inference passes it backwards from observed outcomes to parameters. It’s analogous to model fitting or training in machine learning; it’s the part where we use data to learn about the model parameters. The specific output of inference is the updated relative plausibility of the unknown model parameters. Whereas we represent the prior relative plausibilities with the prior distribution, we represent the posterior relative plausibilities (after incorporating information from the data) with the posterior distribution, or simply, the posterior. Like the prior, our model’s posterior distribution is a probability density defined over the possible values of $\theta$, where larger values indicate higher relative plausibility.
Analytical Formulation of Bayesian Inference
Let’s nail down the mathematical formulation of Bayesian inference. We have data $y$ and parameter(s) $\theta$. These have a joint probability density $p(y, \theta)$. This joint distribution of data and parameters is defined by our generative model of the system—simulating data from our DGP is equivalent to drawing realizations from the joint distribution $p(y, \theta)$. Using the definition of conditional probability, we can write the joint distribution as:

$$p(y, \theta) = p(y \mid \theta)\, p(\theta)$$

where

- $p(y, \theta)$ is the joint distribution of parameter $\theta$ and data $y$
- $p(\theta)$ is the prior distribution of the parameter
- $p(y \mid \theta)$ is the likelihood—the conditional distribution of observed data $y$ given parameter $\theta$.
When we do inference we are interested in the relative plausibility of the unknown parameter $\theta$ given data $y$, which we quantify as the conditional distribution of $\theta$ given $y$. Using Bayes’ Rule, we can write the posterior as

$$p(\theta \mid y) = \frac{p(y \mid \theta)\, p(\theta)}{p(y)}$$

where

- $p(\theta \mid y)$ is the posterior distribution of the parameter
- $p(y)$ is the marginal likelihood of the data (to be explained soon)
Technically, the joint distribution and the posterior are functions of both the parameter $\theta$ and the data $y$. But in practice when we compute the posterior, we’ll have some actual observed data—say $y_{\text{obs}}$—so that $y$ is actually fixed at $y = y_{\text{obs}}$. Substituting the fixed value $y_{\text{obs}}$ in the posterior, we get

$$p(\theta \mid y_{\text{obs}}) = \frac{p(y_{\text{obs}} \mid \theta)\, p(\theta)}{p(y_{\text{obs}})}$$

If we view $y$ as fixed at $y_{\text{obs}}$, then the posterior can be interpreted as just the slice of the joint distribution where $y = y_{\text{obs}}$. To get a proper conditional probability distribution, we just need to divide the sliced joint density function by the area under $p(\theta, y_{\text{obs}})$ along the $\theta$ axis. And guess what? That’s exactly what the marginal likelihood is doing: $p(y_{\text{obs}})$ is just the area under the sliced joint density, and it’s there in the denominator to normalize the sliced joint density so that we get a proper conditional distribution for the posterior.
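To make the “area under the slice” picture concrete, the marginal likelihood for a continuous parameter like ours is the standard integral (this is just the law of total probability, nothing specific to our example):

$$p(y_{\text{obs}}) = \int_0^1 p(\theta, y_{\text{obs}})\, d\theta = \int_0^1 p(y_{\text{obs}} \mid \theta)\, p(\theta)\, d\theta$$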
Computing the Posterior using Grid Approximation
Let’s compute the posterior using the formulas we cooked up in the previous section. Earlier when we wrote down our generative model, we already identified all the pieces we need:
- the prior—since $\theta \sim \text{Beta}(10, 10)$, $p(\theta)$ is the probability density function of a $\text{Beta}(10, 10)$ random variable.
- the likelihood—since $y_i \sim \text{Bernoulli}(\theta)$, the likelihood $p(y \mid \theta)$ is the probability mass function of a Bernoulli random variable with parameter $\theta$… well, almost.
The one remaining detail to iron out is that our observed data consists of $N = 10$ observations of the binary match outcomes. Our likelihood needs to reflect the conditional probability of the entire dataset given $\theta$, not just a single observation. We know from probability theory that the joint probability of two independent events is the product of their individual probabilities. Therefore, assuming independence among our observations, the joint likelihood of the full dataset is the product of the likelihoods of the individual observations.
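For reference, for our Bernoulli model this product has a simple closed form (a standard identity, not something new to the model):

$$p(y_{\text{obs}} \mid \theta) = \prod_{i=1}^{N} \theta^{y_i} (1 - \theta)^{1 - y_i} = \theta^{\sum_i y_i} (1 - \theta)^{N - \sum_i y_i}$$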
Let’s implement the prior, the likelihood, the joint distribution, and the posterior in python and plot out the prior and the posterior distribution of the parameter $\theta$.
```python
# 0. Defining the observed data
y_obs = np.array(observed_outcomes)
sum_y_obs = np.sum(y_obs)

# 1. Functions for Prior and Likelihood
def prior(theta):
    return beta.pdf(theta, alpha_param, beta_param)

def likelihood(theta, y):
    product_of_likelihoods = 1.0
    for y_i in y:
        product_of_likelihoods *= bernoulli.pmf(y_i, p=theta)
    return product_of_likelihoods

# 2. Function for Joint Density p(y, theta)
def joint_density(theta, y):
    return likelihood(theta, y) * prior(theta)

# 3. Computing the Posterior by "Slicing" and Normalizing
def posterior_from_joint_slice(theta_values, y_observed_data):
    # compute grid of p(theta, y_obs) over values of theta and fixed y_obs
    unnormalized_posterior_values = np.array([
        joint_density(theta, y_observed_data) for theta in theta_values
    ])
    # numerical integration to get marginal likelihood p(y_obs)
    delta_theta = theta_values[1] - theta_values[0]
    marginal_likelihood_approx = np.sum(unnormalized_posterior_values * delta_theta)
    # p(theta | y_obs) = p(theta, y_obs) / p(y_obs)
    normalized_posterior_values = unnormalized_posterior_values / marginal_likelihood_approx
    return normalized_posterior_values
```
```python
# Define a grid of theta values
theta_grid = np.linspace(0.001, 0.999, 500)

# Calculate prior over the grid of theta values
prior_values = np.array([prior(theta) for theta in theta_grid])

# Calculate likelihood values over the grid of theta values and fixed y_obs
likelihood_values = np.array([likelihood(theta, y_obs) for theta in theta_grid])

# Calculate the posterior over the grid of theta values and fixed y_obs
posterior_values = posterior_from_joint_slice(theta_grid, y_obs)
```
```python
# Plotting
plt.plot(theta_grid, prior_values,
         label=f'Prior p(theta) ~ Beta({alpha_param},{beta_param})',
         linestyle='--', color=prior_color)
plt.plot(theta_grid, posterior_values,
         label='Posterior p(theta|y_obs)',
         color=post_color, linewidth=2)
plt.title('Bayesian Inference: Prior, Likelihood, and Posterior')
plt.xlabel(r'$\theta$ (Probability of Match Win)')
plt.ylabel('Density')
plt.legend()
plt.grid(True, linestyle=':', alpha=0.7)
plt.xlim(0, 1)
plt.ylim(bottom=0)
plt.tight_layout()
```
From the figure we can see that while the prior is centered at $\theta = 0.5$, the posterior is pulled slightly toward larger values of $\theta$ by the observed data, indicating increased relative plausibility of win rates greater than 0.5.
It’s nice to see that the math works and that we can successfully implement it in code, but grid approximation is a pedagogical endeavor. In practice, when models start to get complicated, we’ll need a more flexible approach for finding the posterior.
Sampling from the Posterior
It turns out that we can do inference using our generative model and our observed data without having to compute the likelihood directly. But while the forward problem of simulating data from the model is quite straightforward, it’s less obvious how to approach the inverse problem of doing inference. There is no silver bullet here. In fact there are tons of different algorithms for doing inference on probabilistic models—sampling-based methods like Metropolis-Hastings and Hamiltonian Monte Carlo, as well as approximation-based methods like variational inference. Luckily, no matter which algorithm we use under the hood, we’ll typically end up with the same kind of output at the end—samples of the unknown parameters drawn (exactly or approximately) from the posterior.
To get some intuition for how we can do inference by sampling from the generative model, we’re going to implement a very simple inference algorithm called rejection sampling. The key idea is based on the insight we discussed earlier: the posterior is essentially a slice through the joint distribution where the data is fixed to what we actually observed. Since simulating from the generative model is equivalent to drawing samples from the joint distribution, isolating simulation outcomes where the simulated data is equal to the observed data is equivalent to sampling from the joint distribution along the slice, and hence equivalent to sampling from the posterior.
The algorithm for doing Bayesian inference via rejection sampling is as follows (a minimal sketch of this loop in code appears below):

1. Generate a sample $(\theta^*, y^*)$ from the generative model.
2. Keep the sample if $y^* = y_{\text{obs}}$, otherwise discard it.
3. Repeat 1 and 2 until the desired number of retained samples is collected.
4. The retained samples of $\theta^*$ can be interpreted as samples from the posterior.
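Before scaling this up with pandas, here’s a minimal sketch of the loop in plain python, reusing the `draw_from_prior` and `simulate_one_outcome` helpers defined earlier. Since the order of wins and losses won’t matter for our model (more on this just below), the sketch matches simulated outcomes to the observed data by the number of wins; the target sample count is an arbitrary illustrative choice.

```python
# rejection sampling sketch: keep only simulations that reproduce the observed number of wins
def rejection_sample(n_samples, alpha_param, beta_param, N, sum_y_obs):
    retained = []
    while len(retained) < n_samples:
        theta_star = draw_from_prior(alpha_param, beta_param)  # step 1: draw theta* from the prior
        outcome = simulate_one_outcome(theta_star, N)          # step 1: simulate y* given theta*
        if outcome["sum_y"] == sum_y_obs:                      # step 2: keep only if y* matches y_obs
            retained.append(theta_star)                        # step 3: repeat until enough are kept
    return np.array(retained)                                  # step 4: posterior samples of theta

posterior_samples_loop = rejection_sample(500, alpha_param, beta_param, N, sum_y_obs)
```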
Now let’s do this at scale using the pandas-based simulation machinery we built earlier. In our example, the order of the wins and losses over the match observations doesn’t matter, so we’ll focus on the number of wins. Since $\sum_i y_{\text{obs},i} = 7$ in our observed data, we’ll isolate the simulation outcomes where `sum_y` equals 7.
```python
fig, ax = plt.subplots()
outcome_df.query('sum_y != @sum_y_obs').plot(
    x="sum_y", y="theta", kind="scatter", color="black", alpha=0.2,
    label="outcomes where $y^* \\ne y_{\\text{obs}}$", ax=ax
)
outcome_df.query('sum_y == @sum_y_obs').plot(
    x="sum_y", y="theta", kind="scatter", color=post_color, alpha=0.4,
    label="outcomes where $y^* = y_{\\text{obs}}$", ax=ax
)
ax.axvline(x=sum_y_obs - 0.25, color="black", linestyle='--')
ax.axvline(x=sum_y_obs + 0.25, color="black", linestyle='--')
plt.title("Samples from the Generative Model");
```
Boom! By isolating the outcomes where $y^* = y_{\text{obs}}$, we effectively have samples from the posterior. Let’s draw a larger number of samples so we get an adequate sample from the posterior.
```python
# drawing a large number of samples and isolating outcomes where y = y_obs
rejection_sampling_outcome_df = (
    simulate_outcomes(10_000, alpha_param, beta_param, N)
    .query('sum_y == @sum_y_obs')
)

# posterior samples from rejection sampling
posterior_samples = rejection_sampling_outcome_df["theta"]
```
```python
posterior_samples.hist(bins=30, color=post_color)
plt.xlabel("theta")
plt.title("Samples from the Posterior");
```
Now let’s put all the pieces together.
```python
fig, ax = plt.subplots()

prior_samples = outcome_df["theta"]
prior_samples.hist(density=True, bins=30, alpha=0.7, color=prior_color,
                   label="prior samples", ax=ax)
posterior_samples.hist(density=True, bins=30, alpha=0.7, color=post_color,
                       label="posterior samples", ax=ax)

ax.plot(theta_grid, prior_values,
        label=f'Prior p(theta) ~ Beta({alpha_param},{beta_param})',
        linestyle='--', color=prior_color)
ax.plot(theta_grid, posterior_values,
        label='Posterior p(theta|y_obs)',
        color=post_color, linewidth=2)

plt.title('Bayesian Inference: Prior and Posterior')
plt.xlabel(r'$\theta$ (Probability of Match Win)')
plt.ylabel('Density')
plt.legend()
plt.grid(True, linestyle=':', alpha=0.7)
plt.xlim(0, 1)
plt.ylim(bottom=0)
plt.tight_layout();
```
In this figure we have:
- the functional form of the prior
- the functional form of the posterior from grid approximation
- samples from the prior drawn directly from the generative model
- and samples of the posterior obtained by applying rejection sampling to the generative model.
Great! Our sampling algorithms are generating samples from the prior and the posterior which are consistent with the functional forms we computed earlier!
While we used rejection sampling here, regardless of what sampling algorithm we choose, we’ll end up with the same thing after inference—a set of samples from the posterior distribution for each unknown parameter. Once we have those samples, we’re ready to move to the interpretation and analysis step.
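For comparison, here’s roughly what the same model and inference might look like in a probabilistic programming language like PyMC—a sketch assuming PyMC version 5 is installed; its default sampler is an MCMC algorithm (NUTS), but the output is still just posterior samples of theta:

```python
import pymc as pm

with pm.Model() as rps_model:
    # prior: theta ~ Beta(10, 10)
    theta = pm.Beta("theta", alpha=10, beta=10)
    # likelihood: y_i ~ Bernoulli(theta), conditioned on the observed match outcomes
    y = pm.Bernoulli("y", p=theta, observed=observed_outcomes)
    # run MCMC; idata.posterior["theta"] holds the posterior samples
    idata = pm.sample()
```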
🔬 Step 3. Interpretation
So how do we get insight into our analysis questions from this `posterior_samples` array? Well, we’ve got samples from the posterior distribution of $\theta$, which represents our updated beliefs about the relative plausibility of different values of Fritz’s actual underlying RPS win rate. We can use them just like any other dataset to answer questions about his win rate.
Let’s start with getting a point estimate of his true win rate. We can simply take the mean of the samples.
print(f'point estimate of theta: {np.mean(posterior_samples)}')
point estimate of theta: 0.5655783348185507
To get an interval estimate—the Bayesian analogue of a confidence interval, called a credible interval—we can just pull the quantiles of the posterior distribution. Note there are fancier ways to do this, e.g. computing highest posterior density intervals (HPDIs—sketched below), but conceptually we’re basically just looking at the quantiles of the sample.
print(f'89% credible interval of theta: {np.quantile(posterior_samples, [0.055, 0.945])}')

89% credible interval of theta: [0.42005907 0.70367915]
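For reference, here’s a minimal sketch of how an HPDI could be computed directly from the samples—it’s just the narrowest interval containing the requested fraction of them. In practice a library such as ArviZ provides this; the function below is purely illustrative.

```python
def hpdi(samples, prob=0.89):
    # narrowest interval containing `prob` of the samples
    sorted_samples = np.sort(np.asarray(samples))
    n = len(sorted_samples)
    window = int(np.ceil(prob * n))
    # width of every candidate interval spanning `window` consecutive sorted samples
    widths = sorted_samples[window - 1:] - sorted_samples[:n - window + 1]
    best = int(np.argmin(widths))
    return sorted_samples[best], sorted_samples[best + window - 1]

print(f'89% HPDI of theta: {hpdi(posterior_samples, 0.89)}')
```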
What’s the probability that Fritz’s win rate is actually really good, say greater than 75%? We can just check the samples directly for the proportion greater than 0.75.
print(f'P[theta > 0.75]: {np.mean(posterior_samples > 0.75)}')
P[theta > 0.75]: 0.011933174224343675
This means our analysis implies there’s only about a 1% chance that his true win rate is larger than 75%.
Now that we’ve taken a look at interpreting the posterior samples to get some insight into the unknown parameter representing Fritz’s actual RPS win rate, we can take it one step further and make some predictions.
Posterior Predictive Distribution
We have one more character to meet in this cast of Bayesian players—the posterior predictive distribution.
It represents the distribution of new, not-yet-observed data $\tilde{y}$ given that we observed data $y_{\text{obs}}$, i.e. it is based on the posterior rather than the prior. Mathematically it is obtained by integrating the product of the likelihood for the new data and the posterior distribution over the parameter space:

$$p(\tilde{y} \mid y_{\text{obs}}) = \int p(\tilde{y} \mid \theta)\, p(\theta \mid y_{\text{obs}})\, d\theta$$

where

- $p(\tilde{y} \mid \theta)$ is the likelihood of the new data, assuming a specific parameter value $\theta$. It’s the same likelihood function we used before, evaluated on $\tilde{y}$.
- $p(\theta \mid y_{\text{obs}})$ is just the posterior distribution
- the integral effectively “averages” the likelihood of the new data over all possible values of $\theta$, weighted by how plausible each value is according to the posterior.
In terms of the DGP, we can generate samples from the posterior predictive by

- Drawing a sample of $\theta$ from the posterior distribution.
- Using this value of $\theta$ to generate a value of $\tilde{y}$ from the generative model.
Let’s implement this in python.
```python
def simulate_posterior_predictive(posterior_samples, N):
    return pd.DataFrame([
        simulate_one_outcome(theta=theta, N=N)
        for theta in posterior_samples
    ])

posterior_predictive_df = simulate_posterior_predictive(posterior_samples, N)
```
Here again, we can inspect the distribution using any familiar techniques for working with samples of data. Here’s a histogram of the posterior predictive.
```python
posterior_predictive_df["sum_y"].hist(density=True, bins=50, color=post_color)
plt.title("Posterior Predictive Distribution")
plt.xlabel("sum(y)")
plt.ylabel("probability mass");
```
We can use this for forecasting, e.g. what’s the probability that Fritz wins at least 7 games in his next round?
print(f'Probability of winning >= 7 in next round: {np.mean(posterior_predictive_df["sum_y"] >= 7)}')
Probability of winning >= 7 in next round: 0.32537788385043753
Here’s where things can get interesting. In addition to forecasting the outcome itself, we can also compute the probabilities of events that depend on the outcome. For example, let’s say Fritz has only $50 left in his wallet, and he wants to know the probability that he can cover his bill after the next round. Let’s assume each drink costs $12. We can compute that probability as follows.
```python
# probability that Fritz's next round bill is less than or equal to $50
cost_per_drink = 12.0
posterior_predictive_df = (
    posterior_predictive_df
    .assign(losses = lambda x: N - x.sum_y)
    .assign(bill = lambda x: cost_per_drink * x.losses)
)
print(f'Probability next round bill <= $50: {np.mean(posterior_predictive_df["bill"] <= 50)}')
```
Probability next round bill <= $50: 0.5369928400954654
Summary of the Bayesian Workflow
Wow, we covered a lot of ground today. Let’s summarize the key points.
The Bayesian Analysis Workflow has three major steps:
- Modeling
- We build a generative model that describes the relationships among key variables and unknown parameters to represent the data generating process.
- We encode our prior knowledge about the relative plausibility of different parameter values in the prior distribution of the parameters.
- We can use prior predictive checks to simulate outcome data from the generative model to sanity check our modeling assumptions.
- Inference
- Based on our modeling assumptions, we can use observed data to infer the posterior distribution, which quantifies the relative plausibility of different values of the unknown parameters after observing data.
- We can view simulations from the generative model as sampling from the joint distribution of data and parameters, and we can view the posterior as the result of conditioning the joint distribution on the data we actually observed.
- We can do inference by using sampling-based inference algorithms like rejection sampling, which uses logic on top of our generative model to isolate samples from the posterior distribution.
- Interpretation
- After inference we can use data analysis tools to summarize the samples from the posterior distribution to compute point estimates, intervals, and probabilities of interest.
- If we simulate data from our generative model using the posterior rather than the prior, we get samples from the posterior predictive distribution, which predicts future outcomes given observed data.
- Again, we can analyze these posterior predictive samples to compute point estimates, intervals, or probabilities of future outcomes.
Phew! Hopefully that’s a helpful introduction to the Bayesian workflow and the major ideas behind it. The techniques we looked at here are mostly for pedagogical purposes; if you want to apply Bayesian methods to practical problems, you’ll want to use a probabilistic programming language like PyMC, pyro, or stan. Maybe we’ll get into some of those tools in future posts. See you then!
Resources
- Statistical Rethinking – This page links to where you can obtain the book and also to a number of repos where folks have ported the R and Stan code examples to python libraries like PyMC and pyro.
Reader Exercises
You didn’t think you’d get away without homework, did you? Here are a couple of suggestions for exercises.
- Compute the posterior predictive distribution $p(\tilde{y} \mid y_{\text{obs}})$ for the RPS example using grid approximation.
- Suppose that each RPS match was played as best out of three. Rewrite the generative model to generate both sub-match and match outcomes. Do inference with rejection sampling. Use your model to find the probability that Fritz wins his next match by winning two sub-matches in a row.