Stacking or pseudo BMA for a mixture of Negative Binomials

@avehtari

Greetings,

We are fitting a model where we try to estimate the size of an epidemic outbreak based on the number of mutations observed in samples of foot-and-mouth disease virus from infected farms. The main idea behind the model is that the number of mutations is described by a negative binomial with parameters “z” and “shape”. If we observe all infected farms in an epidemic, the distribution of observed mutations should follow that NB. However, if we miss one farm in a chain of transmissions, the number of mutations after the missing farm will be distributed as NB(2z, 2shape); if we miss two, as NB(3z, 3shape); and so on. Thus, the data is a mixture of these NBs, and we estimate the proportion that each one contributes to the mix.

Now, the issue is how to decide how many of these NBs to consider. What we have done is to fit models with just one NB, two, three, and up to eight. We fit the models with Stan, compared them with LooIC, and found that the different models were not too far from each other. These are the delta LooIC values:

model1 10.75
model2 10.84
model3 3.30
model4 0.00
model5 0.42
model6 1.97
model7 3.56
model8 4.56

So we decided to use model averaging, and our doubt is whether to use stacking or pseudo-BMA, as they produce very different weights.
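For reference, here is roughly how both sets of weights can be computed from the fitted models; a minimal sketch using ArviZ, assuming each fit is available as an InferenceData object with a log_likelihood group (the idata1 … idata8 names are hypothetical):

```python
import arviz as az

# idata1 ... idata8: one InferenceData per fitted model (hypothetical names),
# each containing the pointwise log-likelihood needed for LOO.
models = {f"model{k}": idata
          for k, idata in enumerate([idata1, idata2, idata3, idata4,
                                     idata5, idata6, idata7, idata8], start=1)}

# Stacking: weights chosen to maximize the LOO predictive density of the mixture.
print(az.compare(models, ic="loo", method="stacking")["weight"])

# Pseudo-BMA+ (with Bayesian bootstrap): weights proportional to exp(elpd_loo).
print(az.compare(models, ic="loo", method="BB-pseudo-BMA")["weight"])
```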

With stacking we get:
model1 0.108
model2 0.000
model3 0.035
model4 0.496
model5 0.361
model6 0.000
model7 0.000
model8 0.000

and with pseudo-BMA:
model1 0.026
model2 0.009
model3 0.113
model4 0.328
model5 0.290
model6 0.130
model7 0.062
model8 0.042

It is not totally clear to me why we get these differences and whether one option should be preferred over the other.

I understand that this may not have a simple answer! In any case, I’d really appreciate any comments about this. Thanks a lot in advance.

  1. I don’t like LOOIC, as it’s just an unnecessary multiplication by -2 (see CV-FAQ 21).
  2. It would also be helpful to see the elpd_diff SEs to get a better understanding of whether the differences are big or not (see CV-FAQ 15); a sketch of this computation is after this list.
  3. On the elpd scale, differences of less than 4 are usually small (see CV-FAQ 11 and Uncertainty in Bayesian Leave-One-Out Cross-Validation Based Model Comparison). On the “LOOIC” scale this would be less than 8, so most of the differences are small. The nice thing is that the estimates don’t seem to be very noisy, as they form a U-shape quite nicely.
  4. Stacking weights are better than pseudo-BMA weights, as they take into account how similar the models are to each other: if a model is very similar to others but slightly worse, its weight is likely to be 0, but if a model is very different and only slightly worse than the others, it may get a non-zero weight. Non-zero stacking weights for several models also indicate that it’s likely that none of the models is the true model, or that there is not yet enough information to decide which model is the true model. The fact that model1 gets a non-zero weight indicates that the priors for the later models might be too wide. Otherwise, the stacking weights indicate that there is not enough information to decide on one model, but that averaging over model4 and model5 gives better predictions than using any single model. Models 6, 7, and 8 are not able to improve the prediction, so their predictions are likely to be similar to, but slightly worse than, model5’s. Pseudo-BMA weights reflect the elpd differences much more directly, but do not take into account how similar the models are. See also the example in Using stacking to average Bayesian predictive distributions.
  5. Although models 4 and 5 have the largest stacking weights (and the best elpd_loo values), this doesn’t guarantee that a certain number of farms has been missed if the NB distribution is not actually the true distribution.
  6. Have you tested what happens if you repeat the experiment with simulated data, in which case you can be sure that NB is the true distribution and you know the number of missing farms? You might still see that more than one model gets a non-zero weight. See the simulation sketch after this list.
  7. Instead of predictive performance measures, can you see something useful in the posterior distribution of the biggest model?
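
Regarding points 1 and 2: a minimal sketch of the elpd_diff and SE computation, assuming the pointwise elpd_loo values of two models on the same observations are available as arrays (loo4_pointwise and loo5_pointwise are hypothetical names):

```python
import numpy as np

def elpd_diff_se(pointwise_a, pointwise_b):
    """Paired elpd_loo difference and its standard error."""
    diff = np.asarray(pointwise_a) - np.asarray(pointwise_b)
    return diff.sum(), np.sqrt(diff.size * diff.var(ddof=1))

# LOOIC is just -2 * elpd_loo (point 1), so a LOOIC difference of 8
# corresponds to an elpd difference of 4.
d, se = elpd_diff_se(loo4_pointwise, loo5_pointwise)
print(f"elpd_diff = {d:.1f} (SE {se:.1f})")
```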
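
And for point 6, a minimal simulation sketch under the assumed generative process, using the mean–shape parameterization of the NB (as in Stan’s neg_binomial_2); all parameter values below are made up for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

z, shape = 4.0, 2.0          # per-transmission mean and shape (made up)
p_steps = [0.6, 0.3, 0.1]    # proportions of 1, 2, 3 transmission steps,
                             # i.e. 0, 1, 2 missing farms (made up)
n_obs = 500

# The sum of k independent NB(z, shape) variables is NB(k*z, k*shape),
# so k - 1 missing farms give mutation counts distributed NB(k*z, k*shape).
k = rng.choice(len(p_steps), size=n_obs, p=p_steps) + 1
mu, phi = k * z, k * shape

# scipy's nbinom uses (n, p) with n = shape and p = shape / (shape + mean).
mutations = stats.nbinom.rvs(phi, phi / (phi + mu), random_state=rng)
```

Fitting the one- to eight-component models to data like this, where the true number of components is known, would show how much weight spreads over neighboring models even when the NB assumption holds.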