I would like to present the result of a model comparison to a broad audience and am looking into different options for presenting the differences between models.
The comparison consists of six simple Bernoulli models, each fit with `stan_glm` and a single fixed effect: one of the six measurement metrics being evaluated. All models are fit to the same data set of 110k observations, and the intent is to compare the probable explanatory effect of each metric.
The covariate for each model has been scaled to the unit interval. Its coefficient is consistently 0.06 to 0.07 with a tight interval; the intercept for each model tends to be around 0.7, again with a very tight interval. The models are fit with a normal(0, 2) prior on the intercept and a normal(0, 0.12) prior on the covariate coefficient, with 4000 draws from the posterior. The models ran, and loo was calculated for each model, without any reported errors from either function. The loo values are almost identical to the WAIC values calculated by INLA.
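To give a sense of scale for a lay audience, here is a minimal Python sketch of what these coefficients imply on the probability scale. It assumes a logit link (my models are Bernoulli, but the link is an assumption here) and uses the round values quoted above rather than the actual posterior estimates; `inv_logit` and `probability_range` are illustrative helper names.

```python
import math


def inv_logit(eta: float) -> float:
    """Map a linear predictor on the log-odds scale to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))


def probability_range(intercept: float, slope: float) -> tuple[float, float]:
    """Fitted probability at covariate = 0 and at covariate = 1."""
    return inv_logit(intercept), inv_logit(intercept + slope)


# Round values taken from the text; the logit link is an assumption.
INTERCEPT = 0.7
SLOPES = (0.06, 0.07)

for slope in SLOPES:
    p_low, p_high = probability_range(INTERCEPT, slope)
    print(
        f"slope={slope:.2f}: p(x=0)={p_low:.3f}, p(x=1)={p_high:.3f}, "
        f"change={p_high - p_low:.3f}"
    )
```

Under these assumptions, moving a covariate across its whole unit range shifts the fitted probability by only one to two percentage points, which is worth keeping in mind when reading the comparisons.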
I have two questions on which I would appreciate the community’s views. The first is whether there is a convenient graphical representation of the following loo model comparison table, which includes both the six models of interest and a seventh intercept-only model, described as “BASE”:
For example, ordinarily to compare distributions with a standard error, I would present overlapping (presumably Gaussian) density plots so viewers could visualize how some distributions overlap and others do not. I suspect that LOO comparisons may not be readily amenable to this treatment as the samples are from a tail and from a predictive distribution that may not even be Gaussian in nature. But I wanted to check and see if anyone felt differently or could report success in visualizing LOO predictive distribution comparisons to a lay audience.
The second and more important question arises from strange behavior I am seeing from the stacking algorithm included with loo depending on whether the intercept-only (“BASE”) model is included.
Below, the first table reports the stacking, bma_plus (pseudo-BMA with bootstrap), and bma (pseudo-BMA, no bootstrap) weights calculated by the `loo_model_weights` function for the six models in question plus the BASE intercept-only model. The second table reports the same weights with the BASE model excluded. Note that the models have been placed in name order rather than descending quality order as in the previous table.
|BASE||not incl||not incl||not incl|
When comparing the two tables, the pseudo-BMA weights without bootstrap are identical. The pseudo-BMA weights with bootstrap differ slightly but remain very similar. The stacking weights, although they correctly assign zero weight to the BASE model, change dramatically for reasons I cannot ascertain. Why would adding a model that receives (essentially) no weight affect the weights of models that are being (substantially) weighted?
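For reference, my understanding is that pseudo-BMA weights without the bootstrap are simply a softmax of the elpd_loo estimates, which would explain why a far worse model leaves them unchanged. A minimal Python sketch of that calculation follows. The elpd values and model names are hypothetical placeholders, not the values from my tables, and `pseudo_bma_weights` is an illustrative helper rather than the loo implementation.

```python
import math


def pseudo_bma_weights(elpd: dict[str, float]) -> dict[str, float]:
    """Softmax of elpd values: w_k = exp(elpd_k) / sum_j exp(elpd_j)."""
    best = max(elpd.values())  # subtract the max for numerical stability
    unnorm = {name: math.exp(value - best) for name, value in elpd.items()}
    total = sum(unnorm.values())
    return {name: u / total for name, u in unnorm.items()}


# Hypothetical elpd_loo values, NOT the values from my tables.
elpd_models = {"M1": -60000.0, "M2": -60001.5, "M3": -60003.0}
elpd_with_base = dict(elpd_models, BASE=-60100.0)

w_without = pseudo_bma_weights(elpd_models)
w_with = pseudo_bma_weights(elpd_with_base)

for name in elpd_models:
    print(f"{name}: without BASE={w_without[name]:.6f}, with BASE={w_with[name]:.6f}")
print(f"BASE: {w_with['BASE']:.3e}")
```

Because the BASE term contributes almost nothing to the normalizing sum, the remaining weights agree to many decimal places in this sketch.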
My understanding of stacking from reading the paper of Yao et al. is that stacking is intended to avoid this kind of behavior and to be more robust to it than the pseudo-BMA methods shown above. Aside from specifying the weighting method, I left all function defaults in place.
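As I read the paper, stacking chooses weights on the simplex that maximize the log score of the weighted mixture of the models' pointwise leave-one-out predictive densities, so the weights are determined jointly rather than model by model. The Python sketch below illustrates that objective. It is only a sketch under stated assumptions: the density matrix is made up, and the fixed-point (EM-style) update is a stand-in I chose for simplicity, not the optimizer that loo uses.

```python
import math


def stacking_objective(weights, dens):
    """Mean log score of the weighted mixture of pointwise predictive densities."""
    return sum(
        math.log(sum(w * p for w, p in zip(weights, row))) for row in dens
    ) / len(dens)


def stacking_weights_em(dens, n_iter=500):
    """EM-style fixed-point updates for mixture weights; stays on the simplex."""
    k = len(dens[0])
    weights = [1.0 / k] * k  # start from uniform weights
    for _ in range(n_iter):
        # Mixture density at each observation under the current weights.
        mix = [sum(w * p for w, p in zip(weights, row)) for row in dens]
        # Reweight each model by its average share of the mixture density.
        weights = [
            weights[j] * sum(row[j] / m for row, m in zip(dens, mix)) / len(dens)
            for j in range(k)
        ]
    return weights


# Hypothetical pointwise LOO predictive densities (rows: observations,
# columns: models). The third column plays the role of a weaker model.
dens = [
    [0.70, 0.69, 0.50],
    [0.30, 0.32, 0.50],
    [0.68, 0.70, 0.50],
    [0.71, 0.66, 0.50],
    [0.33, 0.31, 0.50],
    [0.69, 0.71, 0.50],
]

uniform = [1.0 / 3] * 3
weights = stacking_weights_em(dens)
print("weights:", [round(w, 4) for w in weights])
print("objective (uniform):", stacking_objective(uniform, dens))
print("objective (fitted): ", stacking_objective(weights, dens))
```

Running the same function on the matrix with and without one column is a cheap way to see how the fitted weights respond to adding or removing a model in a toy setting.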
Any thoughts on what might be happening?