Closed form of posterior


Although Stan is very powerful computationally, I am curious whether we could find the closed-form posterior. I know that most of the time conjugacy gives us a closed-form posterior, but is there any general reference or guideline on whether a closed-form solution exists? For example, I am fitting hierarchical models with normal mixture priors and have spent much time calculating the closed form, but I am not sure whether I will succeed, and this is frustrating…

You could use Stan to check your closed form. That is, simulate a small data set (possibly running SBC to check that you’ve implemented things properly), run it through your closed-form solution, then run lots of chains with lots of warmup and sampling iterations to get a stable set of posterior samples, and finally compare these samples to the expectations derived from your closed-form solution. If they are consistent, this suggests that you might reasonably apply the closed-form solution to larger datasets where MCMC is presumably too slow.
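As an illustration of that workflow, here is a minimal Python sketch (not Stan, and with a made-up Bernoulli/Beta model, the simplest conjugate case) that compares a closed-form posterior mean against a brute-force numerical approximation built directly from the unnormalized kernel:

```python
# Sketch: checking a closed-form posterior against a brute-force numerical
# approximation. Assumed toy model: y_i ~ Bernoulli(theta),
# theta ~ Beta(a0, b0), so the conjugate posterior is
# Beta(a0 + sum(y), b0 + n - sum(y)).
import random

random.seed(1)
a0, b0 = 2.0, 2.0
y = [1 if random.random() < 0.3 else 0 for _ in range(50)]  # simulated data
s, n = sum(y), len(y)

# Closed-form posterior mean of a Beta(a0 + s, b0 + n - s) distribution.
closed_mean = (a0 + s) / (a0 + b0 + n)

# Brute-force check: midpoint-rule sums over the unnormalized kernel.
grid = [(i + 0.5) / 2000 for i in range(2000)]
kernel = [t ** (a0 + s - 1) * (1 - t) ** (b0 + n - s - 1) for t in grid]
z = sum(kernel)
grid_mean = sum(t * k for t, k in zip(grid, kernel)) / z

assert abs(closed_mean - grid_mean) < 1e-3
```

If the two means disagree, either the closed-form derivation or its implementation is suspect; the same comparison works with MCMC draws in place of the grid sums.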

Oops, I guess I didn’t really address your actual question as to how to determine whether a closed form solution could exist. Since that’s a more general Stats Q that’s not Stan specific, maybe ask over on Cross-Validated? Be sure to link your post back here so that others can follow what you find there.


In general, if you have a likelihood L(\theta | y), a prior \pi(\theta | \alpha), and a hyperprior \pi(\alpha | \beta), you can always write out the joint posterior kernel:

p(\theta, \alpha | y, \beta) \propto L(\theta | y) \pi(\theta | \alpha) \pi(\alpha | \beta)

But you will not be able to find the normalizing constant and/or marginal posterior of \theta, p(\theta | y) in general (it will only work when the likelihood, prior, and hyperprior are all normal).

Here’s a quick example: Suppose y \sim \text{Ber}(\theta), \theta \sim \text{Beta}(\alpha, \beta), and \beta \sim \pi(\beta). Then the joint posterior is

\begin{align} p(\theta, \beta | y, \alpha) &\propto \theta^y(1 - \theta)^{1-y} \frac{1}{B(\alpha, \beta)} \theta^{\alpha - 1}(1 - \theta)^{\beta - 1} \pi(\beta) \\ &= \frac{1}{B(\alpha, \beta)} \theta^{y + \alpha - 1} (1 - \theta)^{1 - y + \beta - 1} \pi(\beta) \end{align}

This resembles the kernel of a beta density, but it is not one, because of the normalizing constant B(\alpha, \beta). You cannot ignore this normalizing constant, because it is a function of the hyperparameter \beta, which is treated as random.
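To make that point concrete, here is a small Python sketch (a toy numerical check, nothing Stan-specific) showing that \log B(\alpha, \beta) genuinely varies with \beta, so it cannot be absorbed into the proportionality constant when \beta is a parameter:

```python
# Sketch: why B(alpha, beta) cannot be dropped when beta is random.
# A "constant" may only be ignored if it does not depend on any parameter;
# here log B(alpha, beta) clearly moves as beta moves.
import math

def log_B(a, b):
    # log Beta function via log-gamma: log B(a,b) = lgamma(a)+lgamma(b)-lgamma(a+b)
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

alpha = 2.0
vals = [log_B(alpha, b) for b in (0.5, 1.0, 2.0, 4.0)]

# All four values are distinct, so the term reshapes the joint kernel
# as beta varies and must be kept.
assert len({round(v, 8) for v in vals}) == 4
```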

Using Stan (or any MCMC method), you can, however, simulate draws from the posterior density. This is the whole purpose of MCMC. MCMC methods only require that the kernel of the target density may be evaluated (or more generally, that the target density may be evaluated up to a normalizing constant).

Thanks, and could you explain more about ‘But you will not be able to find the normalizing constant and/or marginal posterior of \theta, p(\theta|y) in general (it will only work when the likelihood, prior, and hyperprior are all normal).’ Do we still have conjugacy even if we don’t have a normal likelihood & prior?


You have conjugacy between the likelihood and prior only. When you add the hyperprior, you lose conjugacy because of the normalizing constant. That is, the posterior does not have the same distribution as the prior. See the example above.

Thanks, so if I have hyperparameters, do you mean that I can get a closed-form posterior only if the likelihood, the prior, and the prior for the hyperparameters are all normal?

Again, it depends on what you mean by “closed form posterior”. If you mean that you can write down the entire density function, including normalizing constants, then in the vast majority of cases you will not be able to do this.

Think about what conjugacy means. If \pi(\theta) is a conjugate prior to a likelihood L(\theta | y), then the posterior p(\theta | y) is from the same family as \pi. If you can’t write it out as such, then it’s not a conjugate posterior.
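To see that conjugacy is not restricted to normal models, here is a minimal Python sketch of a made-up Gamma-Poisson example, where the posterior stays in the Gamma family in closed form (a likelihood-prior pair only, with no hyperprior):

```python
# Sketch of conjugacy with a non-normal family. Assumed toy model:
# y_i ~ Poisson(lam), lam ~ Gamma(a0, b0) with shape a0 and rate b0.
# The posterior is Gamma(a0 + sum(y), b0 + n), still in the Gamma family.
a0, b0 = 2.0, 1.0                 # Gamma prior: shape a0, rate b0
y = [3, 1, 4, 2, 0, 5]            # made-up Poisson counts

a_post = a0 + sum(y)              # posterior shape
b_post = b0 + len(y)              # posterior rate
post_mean = a_post / b_post       # closed-form posterior mean

assert abs(post_mean - 17.0 / 7.0) < 1e-12
```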

Finally, let me echo @mike-lawrence that this question is out of the scope of this forum and belongs on a site like Cross Validated or another Stack Exchange community.

A question like this hits on a lot of subtle probability theory that is left out of most introductory statistics classes. Consequently, arriving at a satisfying answer may very well take a good bit of time.

Conjugacy is a property of probability density functions, which are just one way to represent an abstract probability distribution (for a much more careful explanation see Probability Theory (For Scientists and Engineers)). In particular, conjugacy depends on the parameterization of the model, and not the model itself.

Often when people refer to a “closed form posterior” they are referring to an explicit, analytic form of a normalized posterior probability density function. In one dimension this is reasonable – we can plot the density function and summarize our results – but in higher dimensional parameter spaces the concept of a closed-form posterior becomes much, much more complicated.

The full posterior density function is actually straightforward to construct up to a normalizing constant. One just multiplies the prior density function by the likelihood function; this is exact up to that normalization constant. Indeed, a Stan program completely specifies this unnormalized posterior density function before any “fitting” is done!
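In code, that construction is just an addition of log densities, which is what a Stan model block accumulates. A Python sketch with an assumed toy model, y_i ~ Normal(mu, 1) and mu ~ Normal(0, 10):

```python
# Sketch: the unnormalized log posterior is just log prior plus log
# likelihood. Assumed toy model: y_i ~ Normal(mu, 1), mu ~ Normal(0, 10).
y = [1.2, -0.3, 0.8]  # made-up data

def log_post_unnorm(mu):
    lp = -0.5 * (mu / 10.0) ** 2                      # log prior, up to a constant
    lp += sum(-0.5 * (yi - mu) ** 2 for yi in y)      # log likelihood, up to a constant
    return lp

# This conjugate normal model has a known posterior mode sum(y) / (n + 1/100);
# a crude grid search over the unnormalized log posterior recovers it, with
# no knowledge of the normalizing constant required.
grid = [-3 + 6 * i / 600 for i in range(601)]
mode = max(grid, key=log_post_unnorm)
exact = sum(y) / (len(y) + 1.0 / 100)
assert abs(mode - exact) < 0.02
```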

That normalization constant, however, doesn’t really do anything. Take for example a one-dimensional model. The normalization constant just scales the y-axis which doesn’t really change how we would interpret the plot. What we really care about is the relative density in different parts of parameter space, and this is unaffected by any global scaling.
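A tiny Python sketch of this point, using a made-up Beta(3, 5) posterior whose normalizing constant 1/B(3, 5) = 105 happens to be known exactly:

```python
# Sketch: a global normalizing constant never changes density ratios.
# Made-up example: the Beta(3, 5) density, with normalizing constant
# 1 / B(3, 5) = 105.
def unnorm(t):
    return t ** 2 * (1 - t) ** 4      # unnormalized Beta(3, 5) kernel

def norm(t):
    return 105.0 * unnorm(t)          # fully normalized density

# The relative density between any two points is identical either way,
# so the constant carries no extra interpretive information.
r_unnorm = unnorm(0.3) / unnorm(0.6)
r_norm = norm(0.3) / norm(0.6)
assert abs(r_unnorm - r_norm) < 1e-12
```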

Unfortunately there aren’t that many useful one-dimensional models. Most models relevant in applied settings are high-dimensional. Once we construct our unnormalized posterior density function, how are we going to plot it on a two-dimensional screen?

Easy, you might say. We’ll just plot a marginal of our posterior distribution that summarizes what happens along one axis. That makes sense theoretically, but how does one construct a marginal density function? With lots and lots of integrals that we probably won’t be able to do analytically.
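Even in two dimensions, where brute force is still feasible, marginalization means integration. A Python sketch with a made-up unnormalized bivariate kernel exp(-(x^2 + x*y + y^2)), whose x-marginal is proportional to exp(-3x^2/4) analytically:

```python
# Sketch: building a marginal density by brute-force numerical integration.
# Made-up two-dimensional unnormalized kernel: exp(-(x^2 + x*y + y^2)).
import math

xs = [-4 + 8 * (i + 0.5) / 200 for i in range(200)]  # midpoint grid on [-4, 4]
ys = xs

# "Integrate out" y with a midpoint sum for each grid value of x.
marg = [sum(math.exp(-(x * x + x * y + y * y)) for y in ys) for x in xs]
z = sum(marg)
marg = [m / z for m in marg]          # normalize on the grid

# Check a density ratio against the analytic marginal, which is
# proportional to exp(-3 * x^2 / 4).
ratio = marg[100] / marg[150]
expected = math.exp(0.75 * (xs[150] ** 2 - xs[100] ** 2))
assert abs(ratio / expected - 1) < 0.02
```

In a few dimensions these grids are manageable; in the dozens or hundreds of dimensions typical of applied models, they are hopeless, which is exactly why we need algorithms like MCMC.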

And that is the heart of probabilistic computation. Constructing an (unnormalized) posterior density function is straightforward, but extracting summary information from that density function requires complex integrals. Indeed, mathematically, probability density functions literally exist to be integrated. They’re not well-defined outside of an integral!

Conjugacy is a bit of a red herring, because even if you can construct an analytic, normalized conjugate posterior density function you won’t be able to integrate anything but the simplest functions. To take full advantage of our posterior distribution we need to be able to integrate the posterior density function with respect to lots of different, complex functions.

These posterior integrals are more formally known as expectation values; again, see Probability Theory (For Scientists and Engineers) for a conceptual introduction. The algorithms like Markov chain Monte Carlo that we use in Stan? They’re just general methods for approximating expectation values! All the work that Stan does when you run it is just accumulating as much information as possible to approximate posterior expectation values.
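To close the loop, here is a bare-bones random-walk Metropolis sampler in Python (a toy stand-in for Stan's far more sophisticated HMC, with a made-up target) approximating one such expectation value from nothing but an unnormalized kernel:

```python
# Sketch: random-walk Metropolis approximating a posterior expectation.
# Made-up target: the unnormalized Beta(3, 5) kernel, whose exact mean
# is 3 / 8 = 0.375. Only the kernel, never the normalizing constant,
# is evaluated.
import math
import random

random.seed(0)

def log_kernel(t):
    if t <= 0.0 or t >= 1.0:
        return -math.inf                           # zero density off-support
    return 2.0 * math.log(t) + 4.0 * math.log(1.0 - t)

theta, draws = 0.5, []
for i in range(20000):
    prop = theta + random.gauss(0.0, 0.1)          # random-walk proposal
    accept_prob = math.exp(min(0.0, log_kernel(prop) - log_kernel(theta)))
    if random.random() < accept_prob:
        theta = prop
    if i >= 2000:                                  # discard warmup draws
        draws.append(theta)

mcmc_mean = sum(draws) / len(draws)                # approximates E[theta]
assert abs(mcmc_mean - 0.375) < 0.03
```

Replace `theta` with the mean, a quantile indicator, or any other function of the parameters, and the same draws approximate that expectation too; that generality is the point.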