Model stacking and LOO (brms models)

Thanks for the clear question. It’s obvious we need to clarify the vignette and help texts on this.

Akaike/WAIC/LOO/Pseudo-BMA(+) weights have a different interpretation than stacking weights.
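For reference, here is a minimal sketch of how to compute both kinds of weights for brms models, assuming model1, ..., model5 are brmsfit objects fit to the same data (the `method` and `BB` arguments are passed on to `loo::loo_model_weights`):

```r
library(brms)

# Stacking weights: optimal weights for combining the predictive
# distributions of the models.
loo_model_weights(model1, model2, model3, model4, model5,
                  method = "stacking")

# Pseudo-BMA+ weights: log-score based weights, with the Bayesian
# bootstrap (BB = TRUE) taking the uncertainty of the LOO estimates
# into account.
loo_model_weights(model1, model2, model3, model4, model5,
                  method = "pseudobma", BB = TRUE)
```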

Let’s start with Akaike’s interpretation:

  1. Note that “the probability that the model will make the best predictions on new data” is different from “the probability that the model is the true model” (the latter makes sense only in the M-closed case, that is, assuming the true model is included in the set of models).
  2. Akaike’s justification was an asymptotic heuristic that ignores the uncertainty term. In the nested-model case, even if the true model is included and it is not the most complex model, the variance term is asymptotically large enough (see, e.g., @anon75146577’s excellent recent blog post I am the supercargo | Statistical Modeling, Causal Inference, and Social Science) that it should be taken into account, as in Pseudo-BMA+ (see the details in the stacking paper).
  3. When two models give exactly the same predictions, the weight is 0.5 for each and you can choose either model. When m models give exactly the same predictions, the weight is 1/m for each model and you can choose any one of them (see the sketch after this list). If the predictions are similar, the weights are still close to these values, and when we take the uncertainty into account as in Pseudo-BMA+, the weights tend to stay away from 0 and 1 unless one of the models has much better predictive accuracy.
  4. In the case of nested models, *IC and Pseudo-BMA(+) weights are relevant for model selection in the sense that models with weight 0 can be removed from the model list. For the remaining models you know that they have similar predictive performance, but you need other justifications to choose one, or you can do model averaging instead of choosing any one of them.
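To illustrate point 3, here is a small self-contained sketch with the loo package, using a fake matrix of pointwise log predictive densities (not your models) in which two models make exactly the same predictions:

```r
library(loo)
set.seed(1)

# Fake pointwise log predictive densities for two models whose
# predictions are exactly identical (the two columns are the same).
lpd <- rnorm(100, mean = -1)
lpd_point <- cbind(lpd, lpd)

# Pseudo-BMA+ (the Bayesian bootstrap is on by default):
# each model gets weight 0.5.
pseudobma_weights(lpd_point)
```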

For your models, the interpretation of the Pseudo-BMA+ weights is that model5 is the best, but there is a non-negligible probability that model3 or model4 could give better predictions on new data.

Stacking is different, as there the goal is to find the best weights for combining the predictive distributions. For your models the interpretation is that you can get better predictions than from any single model by combining the predictive distributions of model1, model3, and model5 with the given stacking weights. Since model1 and model3 are also nested within model5, this suggests that the prior for model5 is not as good as it could be. Clearly the interaction term is useful, but the fact that a bit of model3 is added hints that the prior for x1 is too vague.
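In code, brms can do this averaging directly (a sketch; newdata stands for a hypothetical data frame of new observations to predict):

```r
# Average the posterior predictive distributions of model1, model3, and
# model5, weighting them by their stacking weights (computed internally).
pp_average(model1, model3, model5,
           weights = "stacking", newdata = newdata)
```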

Did this help? I’m hoping to get feedback so that we know what we should add to the documentation.

