Choosing a high-density point from the posterior

My Stan model produces a posterior distribution over the space of parameters. Now I want to pick one point in that space to simulate new data and verify that it looks similar to the experimental data.

The question is, how would I choose that one point in the parameter space? It doesn’t make sense to take the mean, for instance: if the posterior were concentrated near the unit circle, the mean would be (0,0), which is not on the circle.

Ideally I’d like to choose a point in a region of high density, but is there an easy way to do that?

I could also just pick a single random sample generated by Stan, and I understand that with 95% probability it’ll be in the 95% credible region. But it seems that, given the available data, we should be able to do much better than that and not rely on luck.

Posterior predictives are what you want. Use the generated quantities block in your Stan model to produce them.

This just takes the parameters from each draw the sampler generates and uses them to simulate predictions. In your case, it sounds like you want to generate new data and verify that it looks like the original data, which is exactly what this is built to do.
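For example, something along these lines (a minimal sketch, assuming a data vector y of length N and a normal(mu, sigma) likelihood; swap in whatever your model’s actual likelihood is):

```stan
generated quantities {
  vector[N] y_rep;  // one replicated data set per posterior draw
  for (n in 1:N)
    y_rep[n] = normal_rng(mu, sigma);  // mirror the likelihood from the model block
}
```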

It is not obvious to me that this is what I want. In my case, I want to build a specific histogram for the true and the predicted data and compare the two histograms. If I generate predictions using all the posterior samples at once, it seems that would lead to overdispersion compared to any particular parameter sample.

If you want to look for the single most likely thing, then you’re looking to do a MAP (maximum a posteriori) estimate or something.

The bulk of the posterior isn’t necessarily going to be near the highest-density point in parameter space anyway. The standard example everyone uses on here is r = \sum_i x_i^2 where all the x_i are \mathcal{N}(0, 1). The distribution of r is chi-squared, and its mass sits away from zero even though the single highest-density point is at r = 0.
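Here’s a toy program that shows it (the 10 dimensions are arbitrary, just for illustration):

```stan
// The joint density of x is maximized at x = 0, where r = 0,
// yet the sampled values of r concentrate near the number of dimensions.
parameters {
  vector[10] x;
}
model {
  x ~ normal(0, 1);
}
generated quantities {
  real r = dot_self(x);  // sum of squares; r is chi-squared(10) and piles up near 10, far from 0
}
```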

Of course! Now I feel dumb for not thinking about it :) Thanks a lot!

That’s a good point too. So ideally I’d generate a histogram per sample, then find a high-density point in the space of histograms rather than in the space of parameters. But that sounds rather complicated, so I’ll settle for the MAP for now.

Go look up “posterior predictive checks” (PPC); this case is exactly what they’re for. There are even packages that automate generating the observed-vs-simulated histograms you’re thinking about.


That “overdispersion” is a manifestation of your posterior uncertainty. Ignoring that uncertainty by choosing a single point both leads to poor predictions (especially when generalizing out of sample) and fails to faithfully represent what the model is trying to tell you (so when you get bad results you prematurely try to fix a model that may not have been broken in the first place).

MAPs are bad for this reason and more.

If you need to report a single summary from the posterior then it’s highly advisable to take a decision-theoretic perspective and ask first “what properties do I most want to preserve?” and then “which point estimates preserve those properties as well as possible?”. For example, squared-error loss (\theta - \hat{\theta})^2 leads to the posterior mean and absolute loss |\theta - \hat{\theta}| leads to the posterior median, whereas the MAP only arises as the degenerate limit of a 0-1 loss concentrated on a single point. Mathematically MAPs will never be the result of a formal process like this.


Thanks for the response, but I’m not sure I follow, or perhaps I didn’t express myself clearly. I don’t intend to use the MAP as a summary, but rather to run something like a PPC with it, as others noted. Let me give you an example (a very contrived one, admittedly).

Let’s say I’m modelling my data as \mathcal{N}(\mu,1). Say I know that the variance of my data is 1, but (this is the contrived part) I don’t know what the variance of \mathcal{N}(\mu,1) is. I’d like to check that my model is adequate by comparing the variance of my data to the variance of data generated by the model. If I pick any particular \mu and simulate from it, the check will pass, but if I generate a new y for each posterior draw of \mu and pool them all together, it will fail.
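Concretely, the per-\mu version of that check would look something like this (a rough sketch; y is the observed data vector of length N, and var_y_rep is one value per posterior draw to compare against variance(y)):

```stan
data {
  int<lower=1> N;
  vector[N] y;        // observed data
}
parameters {
  real mu;
}
model {
  y ~ normal(mu, 1);
}
generated quantities {
  vector[N] y_rep;    // one replicated data set for this draw of mu
  real var_y_rep;     // its variance, to compare against variance(y)
  for (n in 1:N)
    y_rep[n] = normal_rng(mu, 1);
  var_y_rep = variance(y_rep);
}
```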

You cannot compare the variance of the data to that of the model because you work with only a single data set and not the entire ensemble of possible observations.

Maximum likelihood presumes that the single data set is large enough that it mimics the behavior of the entire ensemble, but if the data set is not large enough then it will be a poor representation of the ensemble and the comparisons will be misleading. Only in the large data limit will the “variance of the data” be a reasonable proxy for the “variance of the data generating process”.

Bayesian methods explicitly acknowledge the uncertainty inherent in trying to draw conclusions from only finite data, which manifests as the breadth of the posterior. In the large-data limit the posterior concentrates around the true value and that breadth becomes negligible (assuming various regularity conditions).

Mike Lawrence: thanks, I read about posterior predictive checks in BDA some time ago. I don’t have the book handy right now, but from what I could google, it matches what I seem to remember. So yes, this does broadly fall into the framework of PPC, but I’m not sure what value I can extract from that observation.

Here’s roughly what I’m doing. I have a mixture model where two probabilities, p and q, differ across components. My PPC histogram is obtained as follows: I generate a set of samples (a,b), where a\sim \operatorname{Bernoulli}(p) and b\sim \operatorname{Binomial}(m, q), where m is some constant. Then I group the samples by b, calculate \operatorname{mean}(a) inside each group, and visualize the result as a barplot of (b, \operatorname{mean}(a)).

To address Michael Betancourt’s concern: yes, I have enough data (I even have to subsample it before passing it to Stan so that it doesn’t take too long). This barplot, computed on the original data, shows a specific pattern, and whether I can recreate that pattern hopefully tells me whether the mixture model is appropriate here.
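In case it’s useful, here’s roughly how that grouped statistic could be computed once per draw in generated quantities (a sketch only; theta, p, q, m and N stand in for whatever the real model calls its mixture weights, component probabilities, binomial size and number of replicated observations):

```stan
generated quantities {
  array[N] int a_rep;                                // replicated Bernoulli outcomes
  array[N] int b_rep;                                // replicated Binomial outcomes
  vector[m + 1] mean_a_by_b = rep_vector(0, m + 1);  // entry j holds mean(a) for b = j - 1
  vector[m + 1] count_b = rep_vector(0, m + 1);
  for (n in 1:N) {
    int k = categorical_rng(theta);                  // pick a mixture component for this sample
    a_rep[n] = bernoulli_rng(p[k]);
    b_rep[n] = binomial_rng(m, q[k]);
    count_b[b_rep[n] + 1] += 1;
    mean_a_by_b[b_rep[n] + 1] += a_rep[n];
  }
  for (j in 1:(m + 1))
    if (count_b[j] > 0)
      mean_a_by_b[j] /= count_b[j];                  // mean(a) within each b group
}
```

That gives one (b, \operatorname{mean}(a)) barplot per posterior draw, which can be overlaid against the barplot from the observed data.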