LOO Visualization / Stacking

I would like to present the result of a model comparison to a broad audience and am looking into different options for presenting the differences between models.

The comparison consists of six simple Bernoulli models fit with stan_glm, each with one fixed effect: one of the six measurement metrics being evaluated. All models are fit to the same data set of 110k observations, and the intent is to compare the probable explanatory effect of each metric.

Each model's covariate has been scaled to the unit interval and consistently has a coefficient of 0.06 to 0.07 with a tight interval; the intercept tends to be around 0.7, again with a very tight interval. The models are fit with a normal(0, 2) prior on the intercept and a normal(0, 0.12) prior on the covariate, with 4000 draws from the posterior. The models ran and loo was calculated for each model without any errors reported by either function. The loo values are almost identical to the WAIC values calculated by INLA.
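For reference, here is a minimal sketch of how each model was fit and evaluated; the data frame `dat`, outcome `y`, and covariate `metric1` are placeholder names, and the logit link is an assumption:

```r
library(rstanarm)
library(loo)

# One of the six models: Bernoulli outcome, one covariate scaled to [0, 1]
model1 <- stan_glm(
  y ~ metric1,
  family          = binomial(link = "logit"),
  data            = dat,
  prior_intercept = normal(0, 2),    # normal(0, 2) prior on the intercept
  prior           = normal(0, 0.12), # normal(0, 0.12) prior on the covariate
  chains = 4, iter = 2000            # 4 x 1000 post-warmup = 4000 draws
)

loo1 <- loo(model1)  # PSIS-LOO, repeated for each of the seven models
```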

I have two questions on which I would appreciate the community’s views. The first is whether there is a convenient graphical representation of the following loo model comparison table, which includes both the six models of interest and a seventh intercept-only model, described as “BASE”:

| model  | elpd_diff | se_diff |
|--------|----------:|--------:|
| model3 |      -0.0 |     0.0 |
| model5 |      -4.4 |     7.9 |
| model1 |      -4.7 |     9.1 |
| model2 |      -8.7 |     9.4 |
| model4 |     -16.6 |     8.9 |
| model6 |     -21.3 |     8.0 |
| BASE   |     -41.4 |     9.1 |
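For context, this is the kind of table produced by loo_compare() on the individual loo objects; a sketch with placeholder object names:

```r
# loo_base, loo1, ..., loo6 are the loo objects computed for each model
comp <- loo_compare(loo_base, loo1, loo2, loo3, loo4, loo5, loo6)
print(comp)  # elpd_diff and se_diff relative to the best model (model3 here)
```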

For example, to compare distributions with a standard error I would ordinarily present overlapping (presumably Gaussian) density plots, so viewers could see how some distributions overlap and others do not. I suspect LOO comparisons may not be readily amenable to this treatment, since the quantities come from a tail and from a predictive distribution that may not even be Gaussian. But I wanted to check whether anyone felt differently or could report success in visualizing LOO predictive distribution comparisons for a lay audience.

The second and more important question arises from strange behavior I am seeing from the stacking algorithm included with loo depending on whether the intercept-only (“BASE”) model is included.

Below, the first table reports the stacking (stack_weight), pseudo-BMA with bootstrap (bmaplus_weight), and pseudo-BMA without bootstrap (bma_weight) weights calculated by the loo_model_weights function for the six models in question plus the BASE intercept-only model. The second table reports the same weights but excludes the BASE model. Note that the models are listed in name order rather than in descending quality order as in the previous table.

With BASE included:

| model  | stack_weight | bmaplus_weight | bma_weight |
|--------|-------------:|---------------:|-----------:|
| model1 |         .097 |           .216 |       .009 |
| model2 |         .221 |           .034 |       .000 |
| model3 |         .347 |           .551 |       .979 |
| model4 |         .130 |           .003 |       .000 |
| model5 |         .204 |           .196 |       .012 |
| model6 |         .000 |           .000 |       .000 |
| BASE   |         .000 |           .000 |       .000 |

With BASE excluded:

| model  | stack_weight | bmaplus_weight | bma_weight |
|--------|-------------:|---------------:|-----------:|
| model1 |         .301 |           .213 |       .009 |
| model2 |         .090 |           .029 |       .000 |
| model3 |         .418 |           .586 |       .979 |
| model4 |         .000 |           .006 |       .000 |
| model5 |         .191 |           .166 |       .012 |
| model6 |         .000 |           .000 |       .000 |
| BASE   |     not incl |       not incl |   not incl |
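For reference, a sketch of the calls used to produce the two tables; the named list `loo_list` and the loo objects are placeholders, and aside from the method argument everything is left at its defaults:

```r
loo_list <- list(model1 = loo1, model2 = loo2, model3 = loo3,
                 model4 = loo4, model5 = loo5, model6 = loo6,
                 BASE   = loo_base)

# First table: all seven models
w_stack   <- loo_model_weights(loo_list, method = "stacking")
w_bmaplus <- loo_model_weights(loo_list, method = "pseudobma", BB = TRUE)
w_bma     <- loo_model_weights(loo_list, method = "pseudobma", BB = FALSE)

# Second table: same calls with the BASE model dropped from the list
w_stack2  <- loo_model_weights(loo_list[1:6], method = "stacking")
```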

Comparing the two tables: the pseudo-BMA (no bootstrap) weights are identical, and the pseudo-BMA+ (with bootstrap) weights are slightly different but very similar. The stacking weights, although they correctly assign zero weight to the BASE model, change dramatically for reasons I cannot ascertain. Why would adding a model that receives (essentially) no weight change the weights of the models that are substantially weighted?

My understanding of stacking, from reading the Yao et al. paper, is that it is intended to avoid exactly this kind of behavior and to be more robust to it than the pseudo-BMA methods shown above. Aside from specifying the weighting method, I left all function defaults in place.

Any thoughts on what might be happening?

Our resident expert on loo, and on model comparison generally, is @avehtari. If you're lucky he may have time to provide some guidance.

  1. Model 3 is the baseline at 0, and for all of the others you could plot elpd_diff with an interval based on se_diff (see the sketch after this list).
  2. Or you could make BASE the baseline to show whether the other models differ from it. You would need to compute those differences yourself (but this gives me an idea: we could add a feature that makes it easy to choose the baseline model).
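For option 1, something along these lines would work (a rough ggplot2 sketch using the numbers from the comparison table above):

```r
library(ggplot2)

comp_df <- data.frame(
  model     = c("model3", "model5", "model1", "model2", "model4", "model6", "BASE"),
  elpd_diff = c( 0.0, -4.4, -4.7, -8.7, -16.6, -21.3, -41.4),
  se_diff   = c( 0.0,  7.9,  9.1,  9.4,   8.9,   8.0,   9.1)
)

# Point estimates with +/- 2 SE intervals; model3 sits at the zero baseline
ggplot(comp_df, aes(x = reorder(model, elpd_diff), y = elpd_diff)) +
  geom_pointrange(aes(ymin = elpd_diff - 2 * se_diff,
                      ymax = elpd_diff + 2 * se_diff)) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  coord_flip() +
  labs(x = NULL, y = "elpd difference vs. model3 (+/- 2 SE)")
```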

With this amount of data, it’s likely that the related uncertainties are close to Gaussian.

I don’t understand what you mean by this. It feels like a misunderstanding, so if you can elaborate we can check it out.

Me neither. Looks like a bug or weak identifiability of weights and unstable optimization. This should be investigated more. Pinging @yuling

Non-zero pseudo-BMA+ weights indicate that the models have similar predictive performance, and non-zero stacking weights indicate that, in addition, the predictive distributions differ enough that the combination predicts better than any single model.

Aki, no misunderstanding; I was just speculating on possible reasons why the distributions might not be approximately Gaussian. Since it sounds like they probably are, this becomes a non-issue.

I’m happy to backchannel with my code if there is interest in investigating this issue. Many thanks. In the meantime, I will run with Pseudo-BMA+ for my analysis, as it seems to be providing reasonable results.

It depends on which distribution you are talking about. The pointwise elpd values can have a highly non-Gaussian distribution, but under certain conditions (the usual CLT conditions) the uncertainty in the expectation is well modelled as Gaussian. We’ll soon have a new paper discussing this in more detail.

Yes, we should be aware of the failure modes or fix the bugs in the code.