A question like this hits on *a lot* of subtle probability theory that is left out of most introductory statistics classes. Consequently, arriving at a satisfying answer may well take a good bit of time.

Conjugacy is a property of *probability density functions*, which are just one way to represent an abstract probability distribution (for a much more careful explanation see Probability Theory (For Scientists and Engineers)). In particular, conjugacy depends on the *parameterization* of the model, not on the model itself.

Often when people refer to a “closed form posterior” they are referring to an explicit, analytic form of a normalized posterior probability density function. In one dimension this is reasonable – we can plot the density function and summarize our results – but in higher dimensional parameter spaces the concept of a closed-form posterior becomes much, much more complicated.

The full posterior density function is actually straightforward to construct up to a normalizing constant. One just multiplies the prior density function by the likelihood function; this is *exact* up to that normalization constant. Indeed a Stan program completely specifies this unnormalized posterior density function before any “fitting” is done!
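To make that concrete, here is a minimal sketch of an unnormalized posterior for a hypothetical coin-flip model (the Beta(2, 2) prior and the 7-heads-in-10-flips data are made up for illustration); on the log scale the prior-times-likelihood product becomes a sum:

```python
import math

def log_prior(theta):
    # Beta(2, 2) prior density, dropping its normalizing constant
    return math.log(theta) + math.log(1 - theta)

def log_likelihood(theta, heads=7, flips=10):
    # Binomial likelihood, dropping the constant binomial coefficient
    return heads * math.log(theta) + (flips - heads) * math.log(1 - theta)

def log_unnormalized_posterior(theta):
    # prior times likelihood, i.e. a sum on the log scale --
    # exact up to the unknown normalization constant
    return log_prior(theta) + log_likelihood(theta)
```

This is essentially all a Stan program's `model` block specifies: an unnormalized log posterior density, before any fitting happens.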

That normalization constant, however, doesn’t really do anything. Take, for example, a one-dimensional model. The normalization constant just scales the y-axis, which doesn’t change how we would interpret the plot. What we really care about is the *relative* density in different parts of parameter space, and this is unaffected by any global scaling.
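A quick numerical check of that claim, using a hypothetical density `exp(-x**2)` whose true normalization divides by `sqrt(pi)`: the relative density between any two points is identical with or without the constant.

```python
import math

def unnormalized(x):
    # hypothetical unnormalized density
    return math.exp(-x ** 2)

def normalized(x):
    # the same density with its normalizing constant included
    return math.exp(-x ** 2) / math.sqrt(math.pi)

# Relative density between two points is unaffected by the
# global scaling:
ratio_unnorm = unnormalized(0.5) / unnormalized(1.5)
ratio_norm = normalized(0.5) / normalized(1.5)
```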

Unfortunately there aren’t that many useful one-dimensional models. Most models relevant in applied settings are high-dimensional. Once we construct our unnormalized posterior density function, how are we going to plot it on a two-dimensional screen?

Easy, you might say. We’ll just plot a *marginal* of our posterior distribution that summarizes what happens along one axis. That makes sense theoretically, but how does one construct a marginal density function? With lots and lots of integrals that we probably won’t be able to do analytically.
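Even in just two dimensions the cost is visible. Here is a toy sketch, with a made-up unnormalized density: recovering the marginal along the x-axis means integrating out y, done here with a crude Riemann sum because no analytic answer is assumed.

```python
import math

def density(x, y):
    # hypothetical two-dimensional unnormalized density
    return math.exp(-(x ** 2 + y ** 2 + x * y))

def marginal_x(x, y_min=-10.0, y_max=10.0, n=2000):
    # marginalize out y with a midpoint Riemann sum --
    # one numerical integral per plotted point
    dy = (y_max - y_min) / n
    return sum(density(x, y_min + (i + 0.5) * dy) for i in range(n)) * dy
```

Every point on a marginal density plot costs one of these integrals, and in a model with dozens of parameters each such integral is itself high-dimensional.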

And *that* is the heart of probabilistic computation. Constructing an (unnormalized) posterior density function is straightforward, but *extracting summary information* from that density function requires complex integrals. Indeed, mathematically probability density functions literally *exist* to be integrated. They’re not well-defined outside of an integral!

Conjugacy is a bit of a red herring because even if you can construct an analytic, normalized conjugate posterior density function, you won’t be able to integrate anything but the simplest functions against it. To take full advantage of our posterior distribution we need to be able to integrate the posterior density function with respect to lots of different, complex functions.
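A sketch of that limitation, using the hypothetical Beta-binomial setup from before: the conjugate update and the posterior mean are closed form, but the expectation of a generic function under that same closed-form posterior still falls back to numerical integration.

```python
import math

def beta_posterior(a, b, heads, flips):
    # the conjugate update itself is trivially closed form
    return a + heads, b + flips - heads

a_post, b_post = beta_posterior(2.0, 2.0, 7, 10)   # Beta(9, 5)
posterior_mean = a_post / (a_post + b_post)        # closed form: 9 / 14

def expectation(g, a, b, n=10000):
    # E[g(theta)] under Beta(a, b) for a generic g: no closed form,
    # so approximate with a midpoint Riemann sum over [0, 1]
    norm = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    dx = 1.0 / n
    total = 0.0
    for i in range(n):
        x = (i + 0.5) * dx
        total += g(x) * norm * x ** (a - 1) * (1 - x) ** (b - 1) * dx
    return total

# sanity check: the numerical machinery recovers the analytic mean
mean_check = expectation(lambda t: t, a_post, b_post)
```

Swap `lambda t: t` for any function of applied interest and the analytic conjugate form buys you nothing; you're back to integrating numerically.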

These posterior integrals are more formally known as *expectation values*; again, see Probability Theory (For Scientists and Engineers) for a conceptual introduction. Algorithms like Markov chain Monte Carlo that we use in Stan? They’re just general methods for approximating expectation values! All that work Stan does when you run a program is just accumulating as much information as possible to approximate posterior expectation values.
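The idea can be sketched in a few lines. This is a bare-bones random-walk Metropolis sampler (not Stan's actual Hamiltonian Monte Carlo, just the simplest member of the family) targeting a hypothetical unnormalized density; averaging functions of the samples approximates the corresponding expectation values.

```python
import math
import random

random.seed(1)

def log_density(x):
    # hypothetical unnormalized target: a standard normal
    return -0.5 * x ** 2

def metropolis(n_samples, step=1.0):
    # random-walk Metropolis: propose, then accept or reject
    # based on the ratio of unnormalized densities
    x = 0.0
    samples = []
    for _ in range(n_samples):
        proposal = x + random.gauss(0.0, step)
        if math.log(random.random()) < log_density(proposal) - log_density(x):
            x = proposal
        samples.append(x)
    return samples

samples = metropolis(50000)

# Monte Carlo estimates of two posterior expectation values:
mean_estimate = sum(samples) / len(samples)
second_moment = sum(s * s for s in samples) / len(samples)
```

Note that the sampler only ever evaluates the *unnormalized* log density, which is exactly why the missing normalization constant never gets in the way.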