I wonder how I can find potential signs/evidence of multimodality posterior distribution. I tried to plot the bivariate distribution of MCMC samples for two parameters. The contour plot is shown below
My question is:
Can I use the contour plot as evidence of multimodal posterior distribution?
If I find the multimodal posterior distribution in (1), can I reason that DIC may not be a good criterion for model selection?
How can multimodal posterior distribution relate to frequentist local maxima?
As far having samples of a density that has more than one maximum, it is some evidence of multimodality; however, there are some nuanced aspects to it, in addition to the concept of multimodality not being a vey formal one (do you need strictly one peak, how big should the others be to count as additional peaks, etc). Some things to consider:
If there are more parameters, the lower (or even the higher) peak has very low probability when other dimensions are considered and doesn’t really represent a “proper” mode in the higher-dimensional joint distribution;
Is the effective sample size large enough that the distributions are we characterized?
Are they an artefact of poor tuning of the method that ends up being stuck in a low-probability region (maybe this is more likely the other way around).
DIC is generally not a great criterion since its based on the mean estimate instead of the entire distribution, but I guess maybe in this case it is especially problematic because the mean is even less representative and low probability (other criteria may not be adequate in this case either, but the source of the problem would be something else).
The only difference between local maxima and multimodal posteriors is that posteriors are what you get when multiplying the priors, but as far as being a probability distribution for p(\theta | D ) the maxima are just discrete points in the distribution. It’s the same relationship between MLE (or MAP more precisely) and a unimodal posterior, just multimodal surfaces are messier from an inferences standpoint.
Thank you so much for the detailed explanation. It’s very helpful.
I looked into the output. And found that the effective sample sizes for these parameters that have multimodal posterior distribution are all comparative smaller than other unimodal parameters in the model.
Can we attribute the multimodality to the small n_eff?
Without thinking too much I’d say no (but maybe someone else thought things like this through and has a better assessment); it’s a bit like proving a negative – if you don’t have enough samples you cannot really make confident statements either way.
Still, while the samples are random (and correlated draws), the likelihood/posterior function is fixed so the values computed will not change; instead what may happen if you drew more samples and get a better resolution on the surface is that you may find that the “peak” is sharper, flatter, or not a peak at all.
That’s a statement about the likelihood function conditioned on the data \mathcal{L} = p(\theta|D) not the approximation of it you manage to sample and the associated convergence metrics like \hat{R}. For a given data set \mathcal{L} it’s fixed for any choice of parameter values.
Multimodality in the posterior density function itself isn’t a problem. The real problem is that most samplers have trouble fully exploring across the multiple modes, at worst getting stuck in quasistatonary exploration around a single mode and at best slowly moving between the two modes, which compromises the accuracy of the posterior quantification.
Here you can see multimodality manifest in the variable X29, but there not enough information to know how the sampler is performing.
For example if the plot was generated from a single Markov chain then the sampler is able to jump between the modes but only slowly. This will result in accurate but imprecise computation, with large autocorrelations and low effective sample size for the variable X29 relative to X30.
On the other hand if the plot was generated from a collection of Markov chains then one will have to separate out the Markov chains to determine the sampler behavior. If each Markov chain concentrates around a single mode – and none cross between the two modes – then you won’t be able to determine the relative weights of the two modes. In other words your posterior computation will be inaccurate. Sometimes you’ll see Rhat warnings in this circumstance but it’s not guaranteed.
If each Markov chain in the collection is able to transition between the two modes then you’re back in the “accurate but imprecise computation” case and your only option is to run your Markov chains longer to better resolve the posterior structure.