Newbie questions on how Stan weights the likelihood by the prior + implications for visualization

A lot of people, including me, like to visualize both the prior and the posterior of a parameter in the same plot. This seems like a particularly helpful thread on how to do it. One thing puzzles me though.

For the most part, people seem to be plotting empirical prior densities, obtained e.g. by using brms with sample_prior = TRUE or by sampling the prior distribution post-hoc through the use of R’s random sampling functions (e.g. rnorm()). But isn’t this mathematically inaccurate? Aren’t the posterior densities derived by first computing the likelihoods of all parameter values and then weighting each likelihood value by its exact prior density¹ as determined by the formula for the prior distribution’s PDF? Surely the prior weights of the likelihood values are not determined by first sampling the prior, then calculating empirical prior weights by using kernel density estimation (KDE) on those random draws, and finally weighting the likelihood by the results of that procedure? That would seem absurd.
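To make concrete what I mean by “weighting”, here is a toy grid sketch (the data and the simple normal model are made up by me, purely for illustration):

# Grid approximation: weight the likelihood at each parameter value by the
# exact prior density, then normalise. No prior sampling or KDE involved.
y       <- c(0.8, 1.3, 0.2)                          # made-up data
mu_grid <- seq(-4, 4, by = 0.01)
lik     <- sapply(mu_grid, function(m) prod(dnorm(y, mean = m, sd = 1)))
prior   <- dnorm(mu_grid, mean = 0, sd = 1)          # exact prior PDF
post    <- lik * prior / (sum(lik * prior) * 0.01)   # normalise to integrate to ~1
plot(mu_grid, post, type = "l"); lines(mu_grid, prior, col = "red")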

And yet, it is empirical prior sampling distributions that I see people plotting alongside their posteriors, not the exact prior PDF. Why?

Is it just for practical reasons? Is it because it’s more convenient, when preparing data for visualization, to use the same KDE software on both the prior and the posterior, rather than using the theoretical PDF for the prior and KDE for the posterior?

I personally prefer plotting the posterior together with the theoretical prior PDF rather than its randomly sampled approximation. This makes more mathematical sense to my limited brain, plus the curves are smoother and prettier. Am I wrong to do this?

¹As far as I can understand, the fact that this exact prior density might need to be scaled down by a constant because the posterior must integrate to 1 is not relevant here.

Well, I think that for practical reasons it is rare that you really get to define the “exact PDF”. Sampling from the prior does not necessarily give you a simple output like N(\mu, \sigma); in reality there are interactions between the priors of all the underlying parameters involved in your model. You will notice this very early on in a “high-dimensional” parameter space, but depending on the restrictions on parameter priors or on transformations, it can be visible with even just a few parameters. The KDE is obviously just a tool to get a smoother visualisation, and I don’t think people would claim it to be perfect, especially for outcomes with boundaries. However, it does a good job in most cases.

What? Now I’m bewildered. Surely the prior always has an exact PDF? (or perhaps “analytical” is a better word.) We know the formulae for the normal PDF, the exponential PDF, the Cauchy PDF, etc. It’s the posterior that needs to be approximated by sampling, not the prior densities used to weight the likelihood…or else I’m spectacularly confused.

Sorry I messed up my first example…

Consider a simple model with a truncated parameter \mu whose mean is given by the hyperparameter \theta. The prior for \theta is a standard normal.

data {
  int<lower=0> N;
  int<lower=0, upper=1> priorOnly; // 1 = sample from the prior only
  vector[N] y;
}

parameters {
  real<lower=0> mu;    // truncated at zero
  real<lower=0> sigma;
  real theta;          // hyperparameter for the mean of mu
}

model {
  theta ~ normal(0, 1);
  sigma ~ cauchy(0, 1);
  mu ~ normal(theta, 1);
  if (priorOnly == 0) { // include the likelihood only when not sampling the prior
    y ~ normal(mu, sigma);
  }
}

generated quantities {
  real y_rep = normal_rng(mu, sigma);
}
library(cmdstanr)

TestModel <- cmdstan_model(stan_file = "stan/TestStan.stan")
TestFit <- TestModel$sample(data = list(N = 1, y = 0, priorOnly = 1),
                            iter_sampling = 1e4, chains = 4, parallel_chains = 4)
bayesplot::mcmc_dens(TestFit$draws())

You will see that, due to the restrictions in the model, \theta is not just standard normal. It exists in the model with other parameters influencing it; it is conditional on the other parameters.

 variable  mean median   sd  mad    q5   q95 rhat ess_bulk ess_tail
    lp__  -3.15  -2.80 1.42 1.19 -5.99 -1.57 1.00    10666    12934
    mu     0.49   0.31 0.54 0.36  0.01  1.62 1.00     9355     9938
    sigma  1.10   0.64 2.27 0.63  0.05  3.28 1.00    11013     9858
    theta  0.24   0.24 0.75 0.75 -0.98  1.50 1.00    14102    14287
    y_rep  0.49   0.24 2.30 0.62 -1.31  3.00 1.00    28126    29014

Thank you for the example. I guess this would indeed explain why so many people put sampled priors in their plots, rather than the analytical function curve.

I don’t understand it at all though — how on earth can a prior distribution be influenced by “other parameters” when it is explicitly defined as being determined solely by the \mu and \sigma of a normal PDF? Does not compute. I wonder if it’s related to the ‘violent’ forcing of a hard boundary onto \mu (as opposed to using an inherently bounded prior). Probably not though.

Anyway, in my case this is a side issue. The priors I want to plot next to posteriors belong to the SDs of group-level effects in logistic regression (categorical outcome), and they are Exponential with rate = 2. By re-fitting one of my (large number of) models with sample_prior = TRUE, I have verified that those priors are indeed identical to the output of R’s rexp(N, rate = 2) — notwithstanding minor random variability which does not affect the shape but does inconvenience the visual comparison of multiple parameters to their priors.
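For illustration, this is the kind of comparison I mean; a sketch where prior_draws merely stands in for the draws I actually extract from the refitted brms model:

prior_draws <- rexp(4e4, rate = 2)  # placeholder for the sample_prior = TRUE draws
hist(prior_draws, breaks = 100, freq = FALSE, main = "prior draws vs. analytic PDF")
curve(dexp(x, rate = 2), from = 0, add = TRUE, col = "red", lwd = 2)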

If anyone would like to try and explain the phenomenon further (preferably with brms or something else other than Stan code, which I don’t read well), I would appreciate it. Otherwise, I’ll accept the solution after having waited for a few days at least.

Just to clarify: in my example there is only an interaction between \theta and \mu, due to the restriction, which also means that this constrains the possible posteriors.

[pairs plot of the prior draws, showing the marginal densities of \theta and \mu and their joint scatter]
If your parameters are completely independent of each other (as \sigma and \mu are) then this does not apply. However, it is sometimes tricky to see where these dependencies occur in a higher-dimensional space. But I hope someone can give a more comprehensive explanation :)


When you say “the restrictions”, could you specify which restriction it is that causes this behavior?

And is the top right-hand panel of your latest plot the correlation between the prior draws of theta and mu? If so, I’m guessing that’s where the answer lies.

With “restriction” I mean the truncation of \mu. As a consequence, the possible parameter space for \theta is limited, as you rightly pointed out in the top right-hand panel (and bottom left-hand, for that matter). And when you think about it, it makes sense that \theta has to fulfill the condition \mu > 0. But in this case we have really just a few parameters, and sometimes, when using certain transformations etc., things can get very weird. The geometry of the parameter space can potentially get very tight. At least for me it is easier/safer to spit out prior draws in a more complicated model, just to understand whether there are weird geometries. Sometimes you might be able to correct for these with reparameterisations, but sometimes that is not the case.
Edit: Too many edits now because I haven’t had coffee yet… Will check again later to be sure.


I don’t yet see how it “has to fulfill” that condition. :) 50% of a standard normal variable should be negative. And given that \mu's mean is \theta, 50% of \mu should lie exactly at 0. Like in this simple R simulation:

set.seed(2023)
N <- 1e4
theta <- mu <- numeric(length = N)

for (i in 1:N) {
  theta[i] <- rnorm(1, mean = 0, sd = 1)
  mu[i]    <- rnorm(1, mean = theta[i], sd = 1)
  if (mu[i] < 0) { mu[i] <- 0 }  # clamp negative draws to zero
}

summary(theta)
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
-3.555994 -0.680682 -0.010867 -0.001956  0.675468  3.546898 
# ^ exactly standard normal

summary(mu)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.0000  0.0000  0.0000  0.5720  0.9613  5.3918 
# ^ This looks as expected

mean(mu == 0)
[1] 0.5017 # Half the mu values are zero.

plot(theta, mu)

My plot looks almost the same as yours, except for the thick band of zero \mu's at the bottom, which is exactly what I would have expected.

Where does the simulation go wrong?

Perhaps this is the core confusion – we cannot always assume that we know the analytic distribution of all the marginals in the prior. Suppose I have observable quantity Y and two parameters \theta_1, \theta_2 and a model something like

Y \sim t_{5}(0, \theta_{1}), \quad \theta_{1} \sim \text{Cauchy}_{+}(0, \theta_{2}), \quad \theta_{2} \sim N_{+}(0, 1)

where t_{5}(0, \theta_{1}) is a t-distribution with \nu = 5 degrees of freedom, mean 0, and scale parameter \theta_{1}; and \text{Cauchy}_{+}(0, \theta_{2}) is a Cauchy distribution with location parameter 0, scale parameter \theta_{2} truncated to [0, \infty). We truncate the normal distribution in the same way (Note that this truncation does not result in any extra probability mass/density at 0 – we normalise the >0 part of the distribution so that the area under the density function integrates to 1).

We have specified a joint model p(Y, \theta_{1}, \theta_{2}) by specifying the conditional distributions for Y and \theta_{1} and the marginal for \theta_{2}: p(Y, \theta_{1}, \theta_{2}) = p(Y \mid \theta_{1}) p(\theta_{1} \mid \theta_{2}) p(\theta_{2}).

Sampling p(Y, \theta_{1}, \theta_{2}) either via MCMC (using sample_prior = TRUE in brms) or by standard Monte Carlo sampling (i.e. using rnorm to sample \theta_{2} \sim p(\theta_{2}), then using rt(df = 1, ...) to sample p(\theta_{1} \mid \theta_{2}), etc.) yields, for example, samples of \theta_{1} from its marginal distribution p(\theta_{1}).

So we have samples from the marginal p(\theta_{1}), but when writing out our model we only specified the conditional p(\theta_{1} \mid \theta_{2}). To find the analytic marginal p(\theta_{1}) we need to do a touch of integration, as p(\theta_{1}) = \int p(\theta_{1} \mid \theta_{2}) p(\theta_{2}) \text{d}\theta_{2}. For the model we have written down above, this integration is painful enough that it might as well be non-analytic (meaning we do not “know” the mathematical form of the marginal distribution for \theta_{1}). So, for visualisation purposes, we use our samples from the prior and a KDE to estimate/plot \hat{p}(\theta_{1}). This problem is not present for the “top most” parameter \theta_{2} because we specify its prior unconditionally.
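For concreteness, here is a minimal R sketch of that Monte Carlo route (because the truncated distributions above all have location 0, truncating at zero is the same as taking absolute values):

n      <- 1e5
theta2 <- abs(rnorm(n, 0, 1))           # theta_2 ~ N_+(0, 1)
theta1 <- abs(theta2 * rt(n, df = 1))   # theta_1 | theta_2 ~ Cauchy_+(0, theta_2); t_1 is a Cauchy
Y      <- theta1 * rt(n, df = 5)        # Y | theta_1 ~ t_5(0, theta_1)
plot(density(theta1, from = 0, to = 10), main = "KDE of the marginal p(theta_1)")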

“computing the likelihood of all parameter values” as written would involve an infinite amount of computation – for a model with a single continuous parameter, there are an infinite number of likelihood values. The posterior is a mathematical object: we start from our joint model p(Y, \theta_{1}, \theta_{2}), and then we collect/observe our data Y. We then apply the laws of conditional probability:

p(\theta_{1}, \theta_{2} \mid Y) = \frac{ p(Y, \theta_{1}, \theta_{2}) } { p(Y) } = \frac{ p(Y \mid \theta_{1}) p(\theta_{1} \mid \theta_{2}) p(\theta_{2}) } { p(Y) }

and usually we can’t compute p(Y) (for exactly the same reason we couldn’t compute the marginal p(\theta_{1}) in the prior), so we drop the equality for proportionality:

p(\theta_{1}, \theta_{2} \mid Y) \propto p(Y \mid \theta_{1}) p(\theta_{1} \mid \theta_{2}) p(\theta_{2}).

We like the RHS of this expression because we can actually evaluate it (it’s just our model, even though it’s just proportional to the posterior), and we have fancy algorithms (the MCMC algorithm inside Stan, for example) that generate samples from p(\theta_{1}, \theta_{2} \mid Y) just using the RHS of the above expression.
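To make “we can actually evaluate it” concrete, here is the log of that RHS as a plain R function (a sketch; the log(2) terms from the truncations are constants and irrelevant for proportionality):

log_rhs <- function(theta1, theta2, Y) {
  if (theta1 <= 0 || theta2 <= 0) return(-Inf)              # outside the truncated support
  sum(dt(Y / theta1, df = 5, log = TRUE) - log(theta1)) +   # log p(Y | theta_1), a scaled t
    dcauchy(theta1, location = 0, scale = theta2, log = TRUE) + log(2) +  # log p(theta_1 | theta_2)
    dnorm(theta2, mean = 0, sd = 1, log = TRUE) + log(2)    # log p(theta_2)
}
log_rhs(theta1 = 1.2, theta2 = 0.8, Y = c(-0.3, 1.1, 0.4))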

Here you are choosing to assign half of the samples to be exactly 0. This is a modelling choice. For continuous quantities like \mu it doesn’t make a lot of sense, as the probability \Pr(\mu =0) is 0 by definition. When we truncate a continuous quantity, we renormalise the distribution function over the non-truncated region to ensure the density integrates to 1. For example, when truncating a standard normal to [0, \infty), we have (as you correctly note) removed half of the probability mass from (-\infty, 0]. In a typical half-normal distribution we renormalise by multiplying the density function by 2 over [0, \infty).
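Two quick numerical checks of that renormalisation (a sketch):

integrate(function(x) 2 * dnorm(x), lower = 0, upper = Inf)  # the half-normal density integrates to 1
mean(abs(rnorm(1e6)) == 0)                                   # returns 0: no point mass at zero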

As an experiment, try sampling possible sds for mu[i] from various distributions (gamma, inverse gamma, chi-squared, half-normal) and look at the corresponding histograms of mu. It will look far from normal.
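One way to run that experiment (the particular distributions and shape/rate values below are my own picks):

set.seed(1)
n     <- 1e5
theta <- rnorm(n)
sds   <- list(gamma       = rgamma(n, shape = 2, rate = 2),
              inv_gamma   = 1 / rgamma(n, shape = 2, rate = 2),
              chi_squared = rchisq(n, df = 1),
              half_normal = abs(rnorm(n)))
par(mfrow = c(2, 2))
for (nm in names(sds)) {
  mu <- rnorm(n, mean = theta, sd = sds[[nm]])  # mu[i] gets its own random sd
  plot(density(mu), xlim = c(-10, 10), main = paste("sd ~", nm))
}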

\mu has location/mean parameter \theta, but its actual mean/expected value is something else because of the truncation.


Exactly. The simulation is doing something entirely different. If we use the analogy of a skater in a pool for how Stan samples the parameter space: the skater is limited to the inside of the pool and can’t just leave. @blokeman’s simulation, however, kicks the skater out of the pool and then claims he is standing at its edge.

Thanks for making such arduous efforts to educate me.

So \mu in danielparthier’s original model: is it a half-normal distribution (which sounds like a different beast entirely) or is it just “the other half” of a standard normal distribution that has been split down the middle?

In this very simple scenario we are still in the Gaussian world, so \mu will be the “other half” (a truncation) of a normal with mean parameter \mu_{\theta} and standard deviation \sqrt{\sigma_{\mu}^2+\sigma_{\theta}^2}. But I encourage you to change the distribution of \theta to see what happens.
//edit: when using \mu_{\theta}=0 we have the special case of the half-Gaussian. However, we could have chosen something else as \mu_{\theta}.
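A quick check of the Gaussian claim first (a minimal sketch, with \sigma_{\mu}=\sigma_{\theta}=1 so that the truncated normal has scale \sqrt{2}):

set.seed(1)
thetaG <- rnorm(1e6)
muG    <- rnorm(1e6, mean = thetaG, sd = 1)
muG    <- muG[muG > 0]  # the <lower=0> constraint acts like a rejection step
muDens <- density(muG, from = 0)
plot(muDens$x, muDens$y, type = "l")
curve(2 * dnorm(x, mean = 0, sd = sqrt(2)), from = 0, add = TRUE, col = "red")

And now changing the prior of \theta to an exponential: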

# Rejection-sample the joint prior: theta ~ Exponential(1), mu ~ N(theta, 1), keep mu > 0
mu <- vector(mode = "numeric")
theta <- vector(mode = "numeric")
while (length(mu) < 1e6) {
  thetaTmp <- rexp(n = 1e6)
  muTmp <- rnorm(n = 1e6, mean = thetaTmp, sd = 1)
  mu <- c(mu, muTmp[muTmp > 0])
  theta <- c(theta, thetaTmp[muTmp > 0])
}
mu <- mu[1:1e6]
theta <- theta[1:1e6]

# plot theta vs mu
plot(theta, mu, pch = ".")

# generate mu density and compare to a half-normal density with the same mean
muDensity <- density(mu)
plot(muDensity$x, y = muDensity$y / max(muDensity$y), type = "l")
halfNormalDensity <- density(abs(rnorm(n = 1e6, sd = mean(mu) * sqrt(pi) / sqrt(2))))
lines(x = halfNormalDensity$x, y = halfNormalDensity$y / max(halfNormalDensity$y), col = "red")

# look at theta and compare to the density of an exponential distribution with rate 1
thetaDensity <- density(theta)
plot(thetaDensity$x, y = thetaDensity$y / max(thetaDensity$y), type = "l")
thetaExp <- rexp(n = 1e6)
expDensity <- density(thetaExp)
lines(x = expDensity$x, y = expDensity$y / max(expDensity$y), col = "red")

\mu will not be normally distributed anymore, and neither will \theta be exponentially distributed. All of this shows what @hhau mentioned before: there might not be an easy analytical solution.

To circle back to your main question: why do people plot simulated draws rather than analytical solutions? Sometimes (I would argue more often than not) we simply cannot write down the marginal priors/posteriors of the model in closed form, whereas simulating from the joint model is doable. “If in doubt, simulate.” This is also the power of Stan and of other approaches that were not really available for a long time post-Gauss.


Thank you for your patience and for the new simulation example. I might have understood something new here, but let’s see. Here’s what I make of it now:

Each parameter’s pre-specified prior distribution is not necessarily its final observed prior distribution, but only the prior distribution of its “proposal values”. Every “proposed set” of parameter values which collectively fails to meet a prespecified criterion (in the original example, \mu_{proposal} \geq 0) will be rejected. And given that “proposals” in which \theta is negative are overall likelier to result in \mu_{proposal} < 0 (and thus get rejected) than “proposals” in which \theta is nonnegative, this ends up skewing the actual, observed prior distribution of \theta to the right of zero.
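A quick check of this in R, modifying my earlier simulation to reject whole draws instead of clamping \mu to zero (the expected mean follows from the accepted \theta's density being proportional to \varphi(\theta)\Phi(\theta)):

set.seed(2023)
theta <- rnorm(1e6)
mu    <- rnorm(1e6, mean = theta, sd = 1)
keep  <- mu > 0     # reject entire proposals instead of setting mu to 0
mean(theta[keep])   # about 0.56 (1/sqrt(pi)): skewed to the right of zero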

Does this sound correct now?


I would say that sounds like a fair summary.

What I would maybe just add to the discussion is that playing with examples like these, and simulating from models while developing them, gives you an intuition for how they work. It might also help you identify issues in the model itself, by revealing very extreme geometries which can be difficult to sample from. I have to admit that I haven’t used brms enough to know how much is taken care of for you, but it’s always useful to do sanity checks.


Many thanks to both of you for your patience! This wasn’t easy to grasp, but it was an absolutely exhilarating epiphany!

EDIT: Now I have a hard time deciding which answer to credit with the “solution”. Truth is, they each contributed to nudging me along the path to comprehension.


If you have to choose, go for @hhau. His answer was very comprehensive :)


Just in case you still want to understand more about the advantages of simulating from the model (prior/posterior) and want a satisfying approach:
Go and check out Gabry et al. 2019 if you haven’t already done that. Working like this will not only help you to understand your priors, but may also make you rethink your model completely and feel much better about the final product.
Since @jonah will not advertise it himself in this context, I will happily do it.
