Approximating posterior expectations by likelihood expectations


This is more of a general statistical question, but addressing it may give some insight into Stan models, so I appeal to the statisticians in the forum. I realize that the term “likelihood expectations” that I used in the title is a misnomer since the likelihood is generally not even a pdf, but hopefully you get the idea (maybe a better term would be likelihood importance sampling). Basically, I wonder when the expression below will be a good approximation to posterior expectations:

$$
E[\theta \mid y] \approx \frac{1}{\sum_{i=1}^{N} p(y \mid \theta^{(i)})} \sum_{i=1}^{N} \theta^{(i)} \, p(y \mid \theta^{(i)}),
$$

where the $\theta^{(i)}$ are draws from the prior $p(\theta)$ and $p(y \mid \theta^{(i)})$ is the likelihood.

Basically, this weights prior draws by the likelihood function. I tested this in a very simple model with one parameter, and the estimates I get using the expression above are virtually identical to the true posterior estimates (both the mean and the SD).
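For concreteness, here is a minimal sketch of that kind of check, assuming a normal-normal conjugate model (my choice of example; the original post does not say which model was used). The prior is $\theta \sim N(0,1)$, a single observation $y \sim N(\theta, 1)$ is fixed at $y = 1$, so the exact posterior is $N(0.5, \sqrt{1/2})$:

```python
import numpy as np

# Hypothetical check: normal-normal conjugate model with one observation.
# Prior: theta ~ N(0, 1); likelihood: y ~ N(theta, 1); observed y = 1.
# Exact posterior: N(0.5, sqrt(1/2)) ~= N(0.5, 0.707).
rng = np.random.default_rng(0)
N = 100_000
y = 1.0

theta = rng.normal(0.0, 1.0, size=N)   # draws from the prior
log_w = -0.5 * (y - theta) ** 2        # log-likelihood, up to a constant
w = np.exp(log_w - log_w.max())        # subtract max for numerical stability
w /= w.sum()                           # self-normalized weights

post_mean = np.sum(w * theta)
post_sd = np.sqrt(np.sum(w * (theta - post_mean) ** 2))

print(post_mean, post_sd)  # close to 0.5 and 0.707
```

Note that the weights only need the likelihood up to a constant, since the normalization cancels in the ratio.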

Any thoughts on when this approximation may fall apart? Thanks.

For finite N, this approximation will fall apart whenever there are insufficient prior draws within the region of “important” posterior probability to reliably characterize the posterior. Note that as the dimensionality of the parameter space increases, the proportion of prior draws that are in regions of high posterior probability quickly gets very small, even if the posterior is not much narrower than the prior.
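A quick back-of-the-envelope illustration of that scaling, under the assumption (mine, for illustration) that the prior is $N(0,1)$ in each dimension and the "important" posterior region per dimension is $|\theta_j| < 1$, i.e. the posterior is about half as wide as the prior. A prior draw is only useful if every coordinate lands inside, so the useful fraction shrinks geometrically with dimension:

```python
from math import erf

# P(|Z| < 1) for Z ~ N(0, 1), i.e. the chance one coordinate of a prior
# draw lands in the important posterior region (about 0.683).
p_one_dim = erf(1 / 2 ** 0.5)

# Fraction of prior draws landing in the important region in d dimensions.
for d in (1, 5, 10, 50):
    print(d, p_one_dim ** d)
```

By 10 dimensions only about 2% of prior draws are useful, and by 50 dimensions the fraction is on the order of $10^{-9}$, even though the posterior is only modestly narrower than the prior in each coordinate.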


This approach is a special case of importance sampling estimation. I go into much more detail on importance sampling estimators and their error in Section 3.2.2 of Probabilistic Computation but I’ll make a few comments here.

The weight function in importance sampling estimators is a Radon-Nikodym derivative between the target distribution and the sampling distribution being used. If we’re trying to estimate posterior expectation values from prior samples then we need the Radon-Nikodym derivative between the two which is…exactly the realized likelihood function up to normalization! For more on interpreting the likelihood function as an unnormalized Radon-Nikodym derivative see Probabilistic Modeling and Statistical Inference.

Importance sampling estimators are asymptotically consistent: they converge to the exact expectation value as N approaches infinity, subject to some reasonable conditions (the posterior distribution has to be absolutely continuous with respect to the prior distribution so that the weight function is never infinite, the exact expectation value has to exist, etc.). The problem is what happens for finite N.

The error of importance sampling estimators for finite N is determined by the overlap between the sampling distribution (again the prior here) and the target distribution (the posterior). The less we learn from an observation, the more diffuse the likelihood function, the more the prior and posterior distributions will overlap, and the more accurate importance sampling estimators will be. As the likelihood function becomes more and more informative however, the estimators will become less and less accurate.
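One common way to quantify this degradation is the effective sample size of the importance weights, $\mathrm{ESS} = (\sum_i w_i)^2 / \sum_i w_i^2$. A sketch (my own toy setup, not from the posts above): the prior is $\theta \sim N(0,1)$ and the realized likelihood is approximated as a Gaussian bump $N(1, \sigma^2)$ in $\theta$, with smaller $\sigma$ meaning a more informative observation:

```python
import numpy as np

# Prior draws: theta ~ N(0, 1).
rng = np.random.default_rng(1)
N = 50_000
theta = rng.normal(0.0, 1.0, size=N)

# As the likelihood narrows (sigma shrinks), fewer prior draws carry
# non-negligible weight and the effective sample size collapses.
ess_values = []
for sigma in (2.0, 1.0, 0.5, 0.1):
    log_w = -0.5 * ((1.0 - theta) / sigma) ** 2   # log-likelihood bump
    w = np.exp(log_w - log_w.max())
    ess = w.sum() ** 2 / (w * w).sum()
    ess_values.append(ess)
    print(sigma, round(ess))  # ESS shrinks as sigma decreases
```

With a diffuse likelihood nearly all N draws contribute; with an informative one, only a small fraction of them do, and the estimator's error grows accordingly.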

As @jsocolar notes, the overlap is also affected by the dimensionality of the model configuration space. The higher the dimensionality, the more distributions concentrate into typical sets, and the less any two distributions will tend to overlap, even when they're only small perturbations of each other; see, for example, the figure at the end of Section 3.2.2 in the first link. Because of this, importance sampling estimators become exponentially more expensive as the dimensionality of the model configuration space increases. Very quickly they become unusable.


Thanks so much for your input, jsocolar and betanalpha. What you said makes sense to me and numerical experiments confirm that the performance of this approximation deteriorates as dimensionality increases. Basic intuition was telling me that such a simple approximation could not work in all cases, otherwise Stan would be somehow pointless, but it’s good to get a more formal explanation of why and when it becomes unusable.


Just another way to look at the intuition here:

What we need are evaluations of (something proportional to) the posterior density throughout the region of important posterior mass. The more of these we have, the better we’ll be able to characterize the posterior.

One approach is to put a grid over all of parameter space, and evaluate the posterior density everywhere. But parameter space is too big, so this doesn’t work.

Your approach is to put a grid over all of parameter space, but then only to evaluate the posterior density in regions of high prior density. You achieve this by letting the prior samples themselves stand in for the grid (the density of samples in a region reflects the prior density there) and evaluating the likelihood at each prior sample. But in most applications this doesn't work either, because the prior is still too big compared to the posterior. Too few of the likelihood evaluations happen in the region of important posterior mass.

Stan’s approach, and that of MCMC methods generally, is to come up with a way to drastically increase the proportion of likelihood evaluations that happen in the region of important posterior mass (in HMC samplers like Stan, essentially all of the likelihood evaluations happen in the region of important posterior mass, but the cost is that each sample requires many gradient evaluations to produce). That’s the whole point, and magic, of MCMC: it provides a way to target our likelihood evaluations towards the region of important posterior mass while at the same time (ideally) avoiding any bias associated with preferentially sampling some regions of parameter space and not others. (I say ideally; this is why we have convergence diagnostics, and note that the convergence diagnostics associated with HMC samplers like Stan are uniquely good).