Model stacking and LOO (brms models)

loo

#1

I’m using brms to fit five linear regression models. While trying different ways to evaluate my models, it seems like LOO comparisons and model stacking are providing conflicting information and I’d like to get some insight into why (and confirm that my interpretations are correct). Here’s what I did after fitting my models in brms (m1, m2, m3, m4, m5):

> # Add ICs
> loo1 <- add_ic(m1,ic=c("loo","R2"))
> loo2 <- add_ic(m2,ic=c("loo","R2"))
> loo3 <- add_ic(m3,ic=c("loo","R2"))
> loo4 <- add_ic(m4,ic=c("loo","R2"))
> loo5 <- add_ic(m5,ic=c("loo","R2"))
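(Side note for readers on newer brms versions: `add_ic()` was later deprecated in favor of `add_criterion()`. A hedged sketch of what the equivalent call might look like, with criterion names assumed from current brms documentation:)

```r
# Assumes a fitted brms model `m1`; not run here.
# "loo" and "bayes_R2" are criterion names in recent brms versions.
loo1 <- add_criterion(m1, criterion = c("loo", "bayes_R2"))
```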

Next, I used model stacking to compare the models:

loo_list <- list(loo1$loo, loo2$loo,loo3$loo,loo4$loo,loo5$loo)
loo_model_weights(loo_list,method = c("stacking"))
Method: stacking
------
       weight
model1 0.027 
model2 0.000 
model3 0.345 
model4 0.000 
model5 0.628 
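To make concrete what stacking is doing, here is a sketch of the objective it optimizes (notation assumed, not the loo package's internal code): given an n x K matrix of pointwise LOO log predictive densities, stacking seeks simplex weights that maximize the log score of the weighted mixture of predictive distributions.

```r
# `lpd` is an n x K matrix of pointwise LOO log predictive densities.
# Stacking maximizes this objective over simplex weights w.
stacking_objective <- function(w, lpd) {
  sum(log(exp(lpd) %*% w))  # sum over points of the log mixture density
}

# Toy 3-point, 2-model example with made-up densities (not from the
# models above)
lpd <- matrix(c(-1.0, -1.2,
                -0.8, -1.5,
                -1.1, -0.9), nrow = 3, byrow = TRUE)
stacking_objective(c(0.5, 0.5), lpd)  # mixture with equal weights
stacking_objective(c(1, 0), lpd)      # all weight on model 1
```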

I was then thinking that there’s at least some evidence that m5 is the preferred model but that m3 is a good candidate too. Just to look at things another way, I then tried pseudo-BMA and got basically the same result, except now m3 and m4 share similar weights:

> loo_model_weights(loo_list,method = "pseudobma")
    Method: pseudo-BMA+ with Bayesian bootstrap
    ------
           weight
    model1 0.000 
    model2 0.000 
    model3 0.122 
    model4 0.133 
    model5 0.745 
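For contrast with stacking, here is a sketch of plain pseudo-BMA weighting (without the Bayesian-bootstrap regularization that makes it "pseudo-BMA+"): each model's weight is proportional to exp(elpd_loo), so weights depend on the magnitude of the elpd differences.

```r
# Plain pseudo-BMA weights from a vector of elpd_loo estimates.
pseudo_bma <- function(elpd) {
  w <- exp(elpd - max(elpd))  # shift by the max for numerical stability
  w / sum(w)
}

# Toy elpd_loo values (made up): a 5-point gap already dominates
pseudo_bma(c(-13570, -13565))
```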

Finally, just to see what would happen I compared the models using LOO:

> compare_ic(loo1,loo2,loo3,loo4,loo5)
           LOOIC     SE
m1      33613.97 310.00
m2      33359.18 312.84
m3      27141.53 374.63
m4      27135.29 374.62
m5      27127.36 374.28
m1 - m2   254.79  32.00
m1 - m3  6472.44 139.52
m1 - m4  6478.68 139.63
m1 - m5  6486.61 139.37
m2 - m3  6217.65 140.22
m2 - m4  6223.89 139.68
m2 - m5  6231.81 139.48
m3 - m4     6.24   8.65
m3 - m5    14.16  13.46
m4 - m5     7.92   8.71
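As a sketch of where the SE column comes from (shown here on the elpd scale; compare_ic reports -2 * elpd): the SE of a pairwise difference is computed from the pointwise differences between the two models' LOO log predictive densities, not from the two models' individual SEs.

```r
# Pairwise elpd difference and its SE from pointwise LOO values.
elpd_diff <- function(lpd_a, lpd_b) {
  d <- lpd_a - lpd_b  # pointwise elpd differences
  c(diff = sum(d), se = sqrt(length(d)) * sd(d))
}
```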

Here’s where I’m confused and surprised: It looks to me like m4 and m5 aren’t really improving the predictive accuracy at all over m3, which seems surprising given the stacking and pseudo-BMA results. Is this unusual?

I’m posting this with the assumption that I might be just misinterpreting how I should be using the brms functions (maybe this isn’t really a brms issue though).

  • Operating System: Windows 10
  • brms Version: 3.2.1

#2

I guess that’s a question for @avehtari.

Just one point from me: you can also obtain model weights based on LOO using

model_weights(model1, model2, ..., weights = "loo")

When looking at those model weights, are the results provided still conflicting?


#3

Great, thanks for letting me know another route for getting model weights. The results are still conflicting using the approach you suggested:

model_weights(m1,m2,m3,m4,m5,weights = "loo")
          m1           m2           m3           m4           m5 
0.0000000000 0.0000000000 0.0008244602 0.0186562072 0.9805193326

Perhaps one thing that’s unusual about my models is that they contain five outcome variables. That is, they are of the form:

brm(cbind(y1,y2,y3,y4,y5) ~ x)

Could this perhaps be a source of divergence that I’m seeing?


#4

Check this vignette: http://mc-stan.org/loo/articles/loo2-weights.html. It has related examples and discussion of when LOO comparisons / pseudo-BMA+ weights differ from stacking weights. In your case it is likely that models 3 and 5 have differently shaped predictive distributions, and stacking them produces a better predictive distribution, while model 4 is likely very similar to model 3 or 5 and can therefore be dropped in stacking. Stacking optimizes the weights jointly to produce the best combined predictive distribution, while LOO comparisons / pseudo-BMA+ weights consider only the mean and SD of the LOO log predictive densities. Based on the pseudo-BMA+ weights, it’s likely that models 3 and 4 are very similar and share the weight. The above-mentioned vignette has similar examples. Please ask again if this explanation did not clarify the difference between the two approaches.

First I want to check: do all five models have the same five outcome variables?
Otherwise I guess that’s a question for @paul.buerkner, that is, how the log_lik values are computed in this case.


#5

In a multivariate model, the log-likelihood in brms is pointwise per observation; that is, the joint likelihood for each point is computed across all response variables. If you model the responses as uncorrelated (set_rescor(FALSE)), this joint likelihood is just the product of the single likelihoods per response variable (or the sum of the log-likelihoods). If you model them as correlated (set_rescor(TRUE)), which is the default for normal or Student-t models, we use the (log of the) multivariate normal / Student-t density to obtain the desired pointwise likelihood.
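The uncorrelated case can be sketched with toy numbers: the pointwise log-likelihood of one observation is just the sum of the per-response Gaussian log-likelihoods (all values below are made up for illustration).

```r
# One observation with three responses (assuming set_rescor(FALSE)).
y     <- c(0.2, -0.5, 1.1)  # observed responses (made up)
mu    <- c(0.0,  0.0, 1.0)  # fitted means (made up)
sigma <- c(1.0,  1.0, 1.0)  # residual SDs (made up)

# Sum of per-response log-likelihoods = log of the product of likelihoods
joint_ll <- sum(dnorm(y, mu, sigma, log = TRUE))
```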


#6

I still think I might need a bit more clarification, because I thought the weights were going to be relevant to model selection. The reason I thought this is that WAIC weights are discussed in Statistical Rethinking Ch. 6 as being relevant to model selection. Specifically, it says:

"But what do these weights mean? There actually isn’t a consensus about that. But here’s Akaike’s interpretation, which is common.

A model’s weight is an estimate of the probability that the model will make the best predictions on new data, conditional on the set of models considered."

Applied to my own data, I interpreted the stacking weights as indicating that model 5 has roughly a .6 to .7 probability of making the best predictions on new data (compared to my other models). Under this logic, I was thinking that if a model had to be selected, model 5 might be the best. However, am I correct in saying that these weights are just useful for model averaging, as you said (they are the weights for combining models to get a better predictive distribution), and are irrelevant for model selection? By contrast, the LOO comparisons (which are helpful for model selection, no?) suggest that models 4 and 5 don’t really improve upon the predictions made by model 3 (i.e., model 3 can be selected).

I fit a series of increasingly complex models using the same 5 outcome variables:

  1. Null: brm(cbind(y1,y2,y3,y4,y5) ~ 1)
  2. x1 only: brm(cbind(y1,y2,y3,y4,y5) ~ x1)
  3. x2 only: brm(cbind(y1,y2,y3,y4,y5) ~ x2)
  4. Both x1 and x2: brm(cbind(y1,y2,y3,y4,y5) ~ x1 + x2)
  5. Interaction between x1 and x2: brm(cbind(y1,y2,y3,y4,y5) ~ x1*x2)

#7

Thanks for the clear question. It’s obvious we need to clarify the vignette and help texts on this.

Akaike/WAIC/LOO/Pseudo-BMA(+) weights have different interpretation from stacking weights.

Let’s start with Akaike’s interpretation

  1. Note “the probability that the model will make the best predictions on new data”, which is different from “the probability that the model is the true model” (the latter making sense only in the M-closed case, assuming the true model is included in the set of models).
  2. Akaike’s justification was an asymptotic heuristic missing the uncertainty term. In the nested-model case, even if the true model is included and is not the most complex model, the variance term is asymptotically large enough (see, e.g., @Daniel_Simpson’s excellent recent blog post http://andrewgelman.com/2018/06/21/i-am-the-supercargo/) that it should be taken into account, as in Pseudo-BMA+ (see details in the stacking paper).
  3. When two models give exactly the same predictions, the weight is 0.5 each and you can choose either model. When m models give exactly the same predictions, the weight is 1/m for each, and you can choose any one of them. If the predictions are similar, the weights are still close, and when we take the uncertainty into account as in Pseudo-BMA+, the weights tend to stay away from 0 and 1 unless one of the models has much better predictive accuracy.
  4. In the case of nested models, *IC and Pseudo-BMA(+) weights are relevant for model selection in that models with weight 0 can be removed from the model list. For the remaining models you know they have similar predictive performance, but you need other justifications to choose one, or you can do model averaging instead of choosing any one of them.

For your models, the interpretation of the Pseudo-BMA+ weights is that model 5 is the best, but there is a non-negligible probability that model 3 or model 4 could give better predictions on new data.

Stacking is different: there the goal is to find the best weights for combining the predictive distributions. For your models the interpretation is that you can get better predictions than from any single model by combining the predictive distributions of models 1, 3, and 5 with the given stacking weights. Since models 1 and 3 are nested in model 5, this means that the prior for model 5 is not as good as it could be. Clearly the interaction term is useful, but the fact that adding a bit of model 3 helps hints that the prior is too vague for x1.

Did this help? I’m hoping to get feedback so that we know what we should add to the documentation.



#8

I’m not certain this will help with changing the documentation, but below I’ll note what I found to be clear/unclear. Keep in mind this is coming from a non-statistician who’s new to Bayesian stats (but perhaps it’s helpful to have this viewpoint when writing the documentation?).

I found these points to be helpful and clear.

While this certainly clarifies how I should be interpreting the Pseudo-BMA+ weights, one outstanding question is how I should think about why these weights seem to diverge from the LOO comparisons. That is, can I gain any useful insight from knowing that, for example, model 5 is given the majority of the weight by the Pseudo-BMA+ (and stacking) methods but represents only a small improvement (about 1 SE) over model 3 in the LOO comparison?

This is helpful and clear.

This seems to be an important point, but I’m not quite following how you reached this conclusion (i.e., that the prior for model 5 isn’t as good as it could be and that the prior is too vague for x1). Could you say a bit more? I’d like to understand so I can keep these considerations in mind when fitting models in the future.

In conclusion, I think it’s really helpful to get some additional high-level insight into how WAIC/LOO/stacking weights/Pseudo-BMA+ weights relate to each other and how comparing them can provide insight into issues and concerns with the models (like the issue you brought up about the priors being suboptimal for model 5). Your assistance is greatly appreciated!


#9

Thanks

Based on what you posted in your first post, it seems they agree quite well. m5 is best, m4 the second best, m3 the third best, and the uncertainties in pairwise comparisons are such that comparisons agree with the weights.

When you look at the comparison and the roughly 1 SE improvement, you are considering only whether m3 - m5 is positive (by the way, personally I don’t like the IC-scale multiplication by -2), while Pseudo-BMA also takes the magnitude of the difference into account. Thus it’s not easy to calculate the weights in your head from the pairwise differences and SEs for multiple models. In both cases you should consider these comparisons to have a lot of uncertainty, and only trust differences of several SEs or weights of 0.000.

Model 5 includes model 3, and thus a model equivalent to the weighted average of models 5 and 3 could be formed by increasing the coefficient for x2 in model 5. There are two possible reasons why model 5 is not doing that already: 1) the prior for the coefficients is narrow, or 2) model 5 also has the additive term x1, which is not that useful alone given the predictive performance of model 2 (but seems to be helpful in the interaction); if the prior for all coefficients is vague, then the coefficient for x1 is fitting to noise in model 5, which is not happening in model 3. So we might say that averaging models 5 and 3 corresponds to adding more mass at zero for the coefficient of covariate x1 and the interaction term x1:x2.

Show your priors and tell me how much data you have, and I can say more. It would also help a lot to see the loo output for model 5 alone (for rstan and rstanarm models this includes p_loo and the Pareto k diagnostics; I’m not certain what loo prints for brms models).


#10

I understand my error now. In my head I was thinking that a weight of .60 to .70 felt like much stronger support for a model than a 1 SE LOO difference. I now see that this conceptualization is incorrect.

This is super helpful. I understand now.

I only set priors on the regression intercepts and coefficients (both N(0,5)). I scaled my regression inputs by 2 SDs (to facilitate comparisons between the coefficients, since x1 is dichotomous and x2 is continuous). My 5 outcome measures are logit-transformed probabilities (multivariate beta regression isn’t possible or I would have done it) that haven’t otherwise been transformed. I have a sample size of 4,524.
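(For reference, the prior setup described above could be written in brms along these lines; this is a hedged sketch, and the data frame name `d` is an assumption:)

```r
# N(0, 5) priors on intercepts and coefficients, as described above
priors <- c(
  set_prior("normal(0, 5)", class = "Intercept"),
  set_prior("normal(0, 5)", class = "b")
)
# m5 <- brm(cbind(y1, y2, y3, y4, y5) ~ x1 * x2, data = d, prior = priors)
```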

Here’s the loo output for model 5:

Computed from 4000 by 4524 log-likelihood matrix

         Estimate    SE
elpd_loo -13563.7 187.1
p_loo        45.2   2.0
looic     27127.4 374.3
------
Monte Carlo SE of elpd_loo is 0.1.

All Pareto k estimates are good (k < 0.5).
See help('pareto-k-diagnostic') for details.

#11

After the rescaling, that prior is really wide. Even though you don’t have many parameters, a horseshoe (hs) prior might be better for model 5. I would expect only small differences in the results, but for example in stacking I would guess that model 5 would get a bigger weight. In general, I would favor N(0, sigma) priors with unknown sigma, producing a prior with more mass near zero than N(0,5).
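(A hedged sketch of what the suggested horseshoe prior might look like in brms; exact arguments are assumptions, and refitting with new priors requires recompilation:)

```r
# Horseshoe prior on the regression coefficients, as suggested above
hs_prior <- set_prior(horseshoe(1), class = "b")
# m5_hs <- update(m5, prior = hs_prior)  # assumes the fitted model m5
```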

If I count correctly, model 5 should have 5*5 = 25 parameters, so I’m a bit confused by p_loo = 45 being much higher, but maybe I misunderstood how the multi-output model is coded. It’s good that all k < 0.5. It would also be good to do posterior predictive checking. @paul.buerkner, does bayesplot ppc work also for multi-output models?


#12

bayesplot pp checks work with multivariate models. Just go for

pp_check(<model>, resp = "<variable>")