When obtaining point estimates and credible intervals for parameters in an MCMC setup, the most common approach is to do this marginally, using only the samples of one parameter, which corresponds to integrating the other parameters out of the posterior (a short sketch of what I mean is at the end of this post).
However, another approach could be to jointly maximize the posterior.
Of course, it is much easier to maximize the marginal densities, but is there any theory on why this makes sense, and on whether the joint or the marginal approach leads to better results?
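For concreteness, here is the kind of marginal summary I mean, as a minimal sketch with NumPy; the `draws` array is just a hypothetical stand-in for whatever the sampler actually produced:

```python
import numpy as np

# Hypothetical MCMC output: S draws of a D-dimensional parameter vector
# (a stand-in for the real output of a sampler).
rng = np.random.default_rng(0)
draws = rng.normal(size=(4000, 3))  # shape (S, D)

# Marginal summaries: each column (parameter) is summarized on its own,
# which corresponds to integrating the other parameters out.
post_mean = draws.mean(axis=0)                        # posterior means
post_median = np.median(draws, axis=0)                # posterior medians
ci_95 = np.percentile(draws, [2.5, 97.5], axis=0)     # central 95% intervals

print("means:   ", post_mean)
print("medians: ", post_median)
print("95% CIs: ", ci_95.T)  # one (lower, upper) row per parameter
```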
The typical point estimate for a parameter in Bayesian statistics is the posterior mean, because if the model is correct, it minimizes the expected squared error of the estimate. The median minimizes the expected absolute error. The posterior mean is the same whether you compute it one parameter at a time or for all the parameters jointly.
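One way to spell out the last claim, writing $y$ for the data and $\theta$ for the parameter vector: the expectation is computed componentwise, so the joint posterior mean is just the vector of marginal posterior means,

$$
\bigl(\mathbb{E}[\theta \mid y]\bigr)_k \;=\; \int \theta_k \, p(\theta \mid y)\, d\theta \;=\; \int \theta_k \, p(\theta_k \mid y)\, d\theta_k \;=\; \mathbb{E}[\theta_k \mid y].
$$

Averaging the draws of one parameter at a time therefore gives the same answer as averaging the joint draws and reading off the components.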
For finding a maximum likelihood or maximum a posteriori (MAP) estimate, we take the joint optimum. That is, for the MAP estimate, we take

$$
\hat{\theta}_{\text{MAP}} \;=\; \arg\max_{\theta}\, p(\theta \mid y) \;=\; \arg\max_{\theta}\, p(y \mid \theta)\, p(\theta).
$$
You cannot derive an MLE or MAP estimate by running MCMC, though I've seen people in applied fields try to take the highest-density draw from the sampled posterior. This doesn't work. MCMC will never get anywhere near the maximum; it samples from the region where the posterior mass concentrates (the typical set), which in high dimensions is far from the mode. You can see this by sampling a high-dimensional standard normal. I demonstrate the idea in my Getting started with Stan tutorial and in my case study on the curse of dimensionality and typical sets.
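Here is a rough sketch of that experiment (not the code from the tutorial or the case study; the dimension and number of draws are arbitrary). The log density of the draws concentrates hundreds of nats below the log density at the mode, and the draws sit at distance roughly $\sqrt{D}$ from the mode, never near it:

```python
import numpy as np

# Draws from a D-dimensional standard normal never come close to the mode:
# their log density concentrates far below the log density at the origin.
rng = np.random.default_rng(0)
D, S = 1000, 4000
draws = rng.normal(size=(S, D))

log_p_mode = -0.5 * D * np.log(2 * np.pi)                  # log density at the mode (origin)
log_p_draws = log_p_mode - 0.5 * np.sum(draws**2, axis=1)  # log density of each draw

print("log p at the mode:      ", log_p_mode)
print("best draw's log p:      ", log_p_draws.max())       # hundreds of nats lower
print("typical distance from 0:", np.linalg.norm(draws, axis=1).mean(),
      "vs sqrt(D) =", np.sqrt(D))
```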
Thank you very much for this very interesting reply. Can you suggest good references on the minimization of the expected square error and the absolute error?
Regarding the MAP estimator: Assuming that we can somehow obtain the joint mode using some other method, is there any theory on how that estimate compares to the mean/median/marginal mode obtained by MCMC?
And as a follow-up: Can MCMC samples of a distribution be used to obtain estimates of the corresponding distribution function by relative frequencies/empirical distribution function? Does this depend on the dimensionality of the distribution?
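To make that concrete, the estimator I have in mind (in my own notation) is the marginal empirical distribution function built from the draws $\theta^{(1)}, \dots, \theta^{(S)}$,

$$
\hat{F}_k(t) \;=\; \frac{1}{S} \sum_{s=1}^{S} \mathbf{1}\{\theta_k^{(s)} \le t\},
$$

and analogously, with the indicator of a rectangle taken componentwise, for the joint distribution function.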
The basic idea is that you define a loss function to minimize, such as squared error or absolute error, and then do some math. You can start with Wikipedia.
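For squared error, the math is only a couple of lines (sketched here in generic notation rather than taken from any particular reference): minimize the expected posterior loss over the reported value $a$,

$$
\frac{\partial}{\partial a}\, \mathbb{E}\bigl[(\theta - a)^2 \mid y\bigr] \;=\; -2\bigl(\mathbb{E}[\theta \mid y] - a\bigr) \;=\; 0
\quad\Longrightarrow\quad
a = \mathbb{E}[\theta \mid y].
$$

The same recipe with absolute-error loss $\mathbb{E}[\,|\theta - a| \mid y]$ leads to the posterior median.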
Well, the first problem is that the MAP only exists for a subset of models. If you have hierarchical priors, the MAP isn't defined because the posterior density isn't bounded. Under some conditions where the MAP exists, you can show that the MAP and the posterior mean and median converge to the same value in the limit of infinite data.
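For example (a standard sketch, not tied to any specific model in this thread), take a hierarchical normal model

$$
y_j \sim \mathrm{N}(\theta_j, \sigma^2), \qquad \theta_j \sim \mathrm{N}(\mu, \tau^2), \qquad j = 1, \dots, J.
$$

The posterior density contains the factor $\prod_{j} \mathrm{N}(\theta_j \mid \mu, \tau^2) \propto \tau^{-J}$ along the ridge $\theta_1 = \dots = \theta_J = \mu$, so letting $\tau \to 0$ sends the density to infinity unless the prior on $\tau$ decays fast enough to cancel it; there is no joint mode to report.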
“Under some conditions where the MAP exists, you can show that the MAP and the posterior mean and median converge to the same value in the limit of infinite data.”
Could you recommend literature on this topic?