When obtaining point estimates and credible intervals for parameters in an MCMC setup, the most common approach is to do this marginally, using only the samples of one parameter, which corresponds to integrating the other parameters out of the posterior (a short sketch of what I mean is at the end of this post).
However, another approach could be to jointly maximize the posterior.
Of course, it is much easier to maximize the marginal densities, but is there any theory on why this makes sense, and on whether the joint or the marginal approach leads to better results?
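For concreteness, here is the kind of marginal summary I mean, as a minimal sketch with NumPy; the `draws` array is just a hypothetical stand-in for whatever the sampler actually produced:

```python
import numpy as np

# Hypothetical MCMC output: S draws of a D-dimensional parameter vector
# (a stand-in for the real output of a sampler).
rng = np.random.default_rng(0)
draws = rng.normal(size=(4000, 3))  # shape (S, D)

# Marginal summaries: each column (parameter) is summarized on its own,
# which corresponds to integrating the other parameters out.
post_mean = draws.mean(axis=0)                        # posterior means
post_median = np.median(draws, axis=0)                # posterior medians
ci_95 = np.percentile(draws, [2.5, 97.5], axis=0)     # central 95% intervals

print("means:   ", post_mean)
print("medians: ", post_median)
print("95% CIs: ", ci_95.T)  # one (lower, upper) row per parameter
```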
The typical point estimate for a parameter in Bayesian statistics is the posterior mean, because if the model is correct, it minimizes the expected squared error of the estimate. The median minimizes the expected absolute error. The posterior mean is the same whether you compute it one parameter at a time or for all the parameters jointly.
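One way to spell out the last claim, writing $y$ for the data and $\theta$ for the parameter vector: the expectation is computed componentwise, so the joint posterior mean is just the vector of marginal posterior means,

$$
\bigl(\mathbb{E}[\theta \mid y]\bigr)_k \;=\; \int \theta_k \, p(\theta \mid y)\, d\theta \;=\; \int \theta_k \, p(\theta_k \mid y)\, d\theta_k \;=\; \mathbb{E}[\theta_k \mid y].
$$

Averaging the draws of one parameter at a time therefore gives the same answer as averaging the joint draws and reading off the components.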
For finding a maximum likelihood or maximum a posteriori (MAP) estimate, we take the joint optimum. That is, for the MAP estimate, we take

$$
\hat{\theta}_{\text{MAP}} \;=\; \arg\max_{\theta}\, p(\theta \mid y) \;=\; \arg\max_{\theta}\, p(y \mid \theta)\, p(\theta).
$$
You cannot derive an MLE or MAP estimate by running MCMC, though I've seen people in applied fields try to take the highest-density draw from the sampled posterior. This doesn't work. MCMC will never get anywhere near the maximum; it samples from the region where the posterior mass concentrates (the typical set), which in high dimensions is far from the mode. You can see this by sampling a high-dimensional standard normal. I demonstrate the idea in my Getting started with Stan tutorial and in my case study on the curse of dimensionality and typical sets.
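Here is a rough sketch of that experiment (not the code from the tutorial or the case study; the dimension and number of draws are arbitrary). The log density of the draws concentrates hundreds of nats below the log density at the mode, and the draws sit at distance roughly $\sqrt{D}$ from the mode, never near it:

```python
import numpy as np

# Draws from a D-dimensional standard normal never come close to the mode:
# their log density concentrates far below the log density at the origin.
rng = np.random.default_rng(0)
D, S = 1000, 4000
draws = rng.normal(size=(S, D))

log_p_mode = -0.5 * D * np.log(2 * np.pi)                  # log density at the mode (origin)
log_p_draws = log_p_mode - 0.5 * np.sum(draws**2, axis=1)  # log density of each draw

print("log p at the mode:      ", log_p_mode)
print("best draw's log p:      ", log_p_draws.max())       # hundreds of nats lower
print("typical distance from 0:", np.linalg.norm(draws, axis=1).mean(),
      "vs sqrt(D) =", np.sqrt(D))
```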
Thank you very much for this very interesting reply. Can you suggest good references on the minimization of the expected square error and the absolute error?
Regarding the MAP estimator: Assuming that we can somehow obtain the joint mode using some other method, is there any theory on how that estimate compares to the mean/median/marginal mode obtained by MCMC?
And as a follow-up: Can MCMC samples of a distribution be used to obtain estimates of the corresponding distribution function by relative frequencies/empirical distribution function? Does this depend on the dimensionality of the distribution?
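To make that concrete, the estimator I have in mind (in my own notation) is the marginal empirical distribution function built from the draws $\theta^{(1)}, \dots, \theta^{(S)}$,

$$
\hat{F}_k(t) \;=\; \frac{1}{S} \sum_{s=1}^{S} \mathbf{1}\{\theta_k^{(s)} \le t\},
$$

and analogously, with the indicator of a rectangle taken componentwise, for the joint distribution function.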
The basic idea is that you define a loss function to minimize, such as squared error or absolute error, and then do some math. You can start with Wikipedia.
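For squared error, the math is only a couple of lines (sketched here in generic notation rather than taken from any particular reference): minimize the expected posterior loss over the reported value $a$,

$$
\frac{\partial}{\partial a}\, \mathbb{E}\bigl[(\theta - a)^2 \mid y\bigr] \;=\; -2\bigl(\mathbb{E}[\theta \mid y] - a\bigr) \;=\; 0
\quad\Longrightarrow\quad
a = \mathbb{E}[\theta \mid y].
$$

The same recipe with absolute-error loss $\mathbb{E}[\,|\theta - a| \mid y]$ leads to the posterior median.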
Well, the first problem is that the MAP only exists for a subset of models. If you have hierarchical priors, the MAP isn't defined because the posterior density isn't bounded. Under some conditions where the MAP exists, you can show that the MAP and the posterior mean and median converge to the same value in the limit of infinite data.
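For example (a standard sketch, not tied to any specific model in this thread), take a hierarchical normal model

$$
y_j \sim \mathrm{N}(\theta_j, \sigma^2), \qquad \theta_j \sim \mathrm{N}(\mu, \tau^2), \qquad j = 1, \dots, J.
$$

The posterior density contains the factor $\prod_{j} \mathrm{N}(\theta_j \mid \mu, \tau^2) \propto \tau^{-J}$ along the ridge $\theta_1 = \dots = \theta_J = \mu$, so letting $\tau \to 0$ sends the density to infinity unless the prior on $\tau$ decays fast enough to cancel it; there is no joint mode to report.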
“Under some conditions where the MAP exists, you can show that the MAP and the posterior mean and median converge to the same value in the limit of infinite data.”
Could you recommend literature on this topic?