Question about lp__ as weight?

Some of the models I estimate are problematic mixture models. Priors and constraints help identify them, but sometimes a chain gets stuck in a region near a constraint boundary because it never “finds” the “better” posterior mode.

In principle, this shouldn’t be ignored, since multimodal posteriors pretty well state that there are indeed multiple solution sets.
However, more practically, on simulated data I see that chains with the greatest log-posterior recover the generative parameters very well.
In developing this new model, I’ve taken a pragmatic approach based on that observation and basically select the chain with the greatest lp__, which I’m sure is a statistical sin (even if it is rooted in performance tests).

I realized last night that if I didn’t want to ignore these other regions, I could include them, but weight them according to the lp__ in that chain (or per sample?).
I tested this, and it worked fairly well. For example, two chains “found” the solution set I had generated the data from, and their mean lp__ was about -640. The other two chains were stuck in a mode near a boundary meant to identify the model, and their lp__ was about -690. On the log scale that is, of course, a fairly large difference. So I computed the mean for a given parameter, weighting each sample by the posterior probability at that sample.
Or, in code:

samps <- as.data.frame(stanOut) # Extract samples (includes the lp__ column)
w <- exp(samps$lp__ - max(samps$lp__)) # Subtract max(lp__) before exponentiating to avoid numerical underflow
sum(samps$parameter * w) / sum(w) # lp__-weighted mean of the parameter

The computed means, of course, lined up fairly well with the generative parameters, since the lower-lp__ chains carried much, much less weight on the antilog scale.
From a performance-only point of view, this worked well.
But is this a catastrophically bad idea, and more sinful than selecting the highest lp__ chain when some multimodality exists?

You are right, it’s the wrong thing to do.

You should weight them according to the posterior mass in each mode, not by the mean lp__, but estimating that mass, i.e. the marginal posterior, is difficult. A simple importance sampling estimate would lead to the harmonic mean of the likelihood values, 1/mean(1/exp(lp__)). This already illustrates that the weight would be something other than mean lp__. Unfortunately this harmonic mean estimator of the marginal likelihood is unstable. You could use bridge sampling http://www.sciencedirect.com/science/article/pii/S0022249617300640 or stacking [1704.02030] Using stacking to average Bayesian predictive distributions
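For concreteness, a minimal sketch of that (unstable) harmonic mean estimate computed per chain, on the log scale to avoid overflow. It assumes stanOut is the stanfit object from your post and that rstan is available; since lp__ is the log posterior up to a constant, the per-chain estimates are only comparable relative to each other:

log_sum_exp <- function(x) { m <- max(x); m + log(sum(exp(x - m))) } # stable log(sum(exp(x)))
lp <- rstan::extract(stanOut, pars = "lp__", permuted = FALSE)[, , 1] # iterations x chains matrix of lp__
log_marglik <- apply(lp, 2, function(lp_chain) {
  # harmonic mean estimator on the log scale: log(1 / mean(1 / exp(lp__)))
  log(length(lp_chain)) - log_sum_exp(-lp_chain)
})
log_marglik # one (noisy, unstable) log marginal estimate per chain; normalizing exp() of these would give mode weights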

It seems that although you wrote mean lp__ for the mode (region), here you are weighting each draw with its density. This is wrong: the MCMC sample already contains more draws where the density is high, so you are in effect double weighting your expectation.
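To make the double weighting concrete, here is a toy sketch with a standard normal target (an assumption for illustration, not your mixture model). The plain mean of the draws is already the posterior expectation, while reweighting each draw by its density pulls the estimate toward the mode:

set.seed(1)
draws <- rnorm(1e5) # pretend these are MCMC draws from a N(0, 1) posterior
lp <- dnorm(draws, log = TRUE) # log density at each draw, playing the role of lp__
mean(draws^2) # ~1, the correct posterior expectation of theta^2
w <- exp(lp - max(lp))
sum(draws^2 * w) / sum(w) # ~0.5, the density-weighted version is biased toward the mode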

I mentioned bridge sampling http://www.sciencedirect.com/science/article/pii/S0022249617300640 and stacking [1704.02030] Using stacking to average Bayesian predictive distributions, but note that these assume you have found all the relevant modes and that the modes do not overlap too much, which might be a big problem with your mixture models.

I wasn’t sure whether the mean lp__ or the per-sample lp__ would be more appropriate (the latter being the one I had the concern you mentioned about: it double weights). Turns out neither is :p. Thank you for your response.

I’ve played with the bridge sampling package, so I may try applying that to the problem at hand.
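In case it’s useful to anyone following along, here is roughly how I’d expect that to look with the bridgesampling package, assuming each mode is isolated into its own stanfit (e.g. by re-running with inits or constraints that keep the chains in one region; fit_mode1 and fit_mode2 are hypothetical objects, not from my actual code):

library(bridgesampling)
b1 <- bridge_sampler(fit_mode1) # log marginal likelihood of the fit confined to mode 1
b2 <- bridge_sampler(fit_mode2) # log marginal likelihood of the fit confined to mode 2
post_prob(b1, b2) # posterior probability of each mode, usable as mixture weights
error_measures(b1) # approximate error of the marginal likelihood estimate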

Practically speaking, based on my sims at least, the chains’ posterior probabilities are so vastly different on the antilog scale that I expect the method to basically suggest what the wrong methods above already do, even if it does so in a more principled, correct manner. Thanks again for your response; I’ll consider my options.