Stability of model averaging- taking into account between sample variation?

I’m trying to compare a few candidate models for many different sets of data. I want to obtain a set of weights for model averaging that are reasonably stable. I’ve noticed that even with small amounts of divergence (rhat for all parameters <1.1), many weighting systems can give quite drastically different weights between different markov chains because the weight is in some way dependent on the exponential of the calculated elppd.

I’m wondering if there’s a robust way to account for between sample variation in weights (i.e. the effect that chain divergence has on WAIC, LOO etc). I know that it’s possible to compute standard errors for WAIC and LOO, but I’m not certain how these relate to the variance between chains. Is there a good way to account for this so that the obtained weights are consistent?


I have a model that I am trying to fit that has a fairly unusual structure. A series of experiments are performed, and each experiment may individually conform to one of a small number of candidate models. I want to know the value of a hyperparameter, common to all experiments. However, this hyperparameter does not inform the data for the experiment in the same way between models.

If we have a set of parameters \theta_j for m models M_j and a hyperparameter \phi and we fit these to a set of n experiments with data y_j, then I believe that the likelihood in this kind of model should look something like:


where P(M_j) are our weights obtained from model averaging at the individual experiment level. As the value of \phi is what we’re really interested in, it’s important that the obtained weights are consistent, or at least have consistent distributions that can be used for informative priors.

Hi, sorry for taking too long to respond.

I don’t think any of this can be done - even with “small amounts of divergence” we know the sampling didn’t work and so the computed posterior can be arbitrarily far away from the actual posterior. In fact, there is no “safe” amount of diagnostic failures (Rhat > 1.01, low n_eff, divergent iterations, …). I’ve seen cases where the model had a single divergent iteration, but upon inspection turned out to be completely off the true values.

I would suggest you build one large model including all the smaller models and average them explicitly, i.e. have p_{i,j} = P(M_{i} = j) as explicit parameters in this larger model. Fitting the models separately to complete datasets cannot IMHO be made to work.

Hope that helps and good luck with your project!

Maybe @yuling has useful advice too?

1 Like

I think it would help better if I described exactly my model and how it works.
For each experiment, we have two data vectors, let’s call them y_i and z_i.
We can fit these vectors using the model y_i=k f(x_i,\theta), z_i=c(1-f(x_i,\theta)). Our parameter of interest is B=c/k.

One of our alternative models is y_i= k f(x_i,\theta_1), z_i=c(1-f(x_i,\theta_2)). In this case our parameter of interest B\neq c/k. Another way of thinking about this is that this experiment ‘Failed’ under this model and so its estimate of c/k should be unpooled with the other experiments when modelling for all y_i,z_i simultaneously.

This presents some problems if we want to have the model itself choose the weights for specimens as I don’t think you can fit both models simultaneously under this framework and have the B parameter be estimated correctly without doing reversible jumps for each experiment which seems like it would be very hard to set up a sampler for.

As an illustration, here are both of the models mentioned here with model fits in a graphical format. For this experiment, it is apparent that model 1 fit significantly better than model 0 meaning that this experiment should have near negligible influence on the hyperparameter B. In the top two figures, we have x_i plotted against y_i and z_i and in the bottom two figures we have y_i plotted against z_i.

I suspect by these fits that the posterior being obtained is not far from the true posterior. I suspect that the uncertainty in elppd arising from divergence can be assessed by looking at the elppd estimated by several different chains.

One easy way to make your weighting more “robust” is to use bootstrap to take into account the sampling uncertainty, see section 3.4 of

1 Like

First thing:

are you having divergent transitions fitting each of the models separately? Or do you use “divergence” to mean another warning/pathology in the model? In any case, unless you know very well that you are doing and have other tests in place to ensure you can ignore the diagnostics, you cannot trust your estimates. Even when the visual fit looks good, elppd can be orders of magnitude wrong.

Isn’t model 0 strictly contained in model 1? E.g. for \theta_1 \simeq \theta_2 you get the simpler model, right? If you know model 0 is insufficient for some data, what is the reason you don’t want to use model 1 for all of the data?

I don’t know what you mean by “reversible jumps” (but I am not strong on MCMC theory so probably my fault), so I can’t react to this directly. But in the case when one model is fully contained in another, you usually cannot estimate both simultaneously. However, you can set up your prior structure to favor the sparser model (e.g. by putting a narrow prior on \theta_1 - \theta_2.

Does this make sense?