Model comparison methods


#1

Hello. I am an ecologist trying to get to grips with Bayesian model comparison for an analysis, but I’m finding it quite difficult to decide on the appropriate method for my needs.

I was wondering if you would be willing to offer some advice if I outline our aims and the methods I have tried so far?

Our overall question is whether species traits can predict the global impact of pathogen species within a genus of plant pathogens (Phytophthora).
Within this analysis, I would also like to ask whether using trait syndromes as predictors (e.g. some axes of trait-space capturing important covariance among traits) can outperform models with individual traits. The key point here is that I want to compare both nested and non-nested models within the same framework.

I currently have a total of 659 candidate models including all possible subsets of predictors (with either individual traits or trait syndromes as predictors), and I would like to rank these models based on their predictive performance. The models were fitted in R using Paul Buerkner’s brms package as an interface to Stan. The models have a negative binomial response (impact). They include phylogenetically correlated random intercepts to account for non-independence due to phylogenetic relatedness among species. The observations are also weighted, to account for the fact that our knowledge about species impacts tends to accumulate over time and is usually less reliable for more recently described species; I therefore wanted to place greater weight on observations for species that we have known about for longer.

Initially I followed the guidance in this paper https://arxiv.org/abs/1507.04544 to use PSIS-LOO, but the Pareto diagnostics suggested the LOO-CV approximations wouldn’t be robust for a lot of my models. I then used WAIC to rank these models, but your paper here https://arxiv.org/abs/1503.08650 suggested this could introduce bias when many models are compared. I was now thinking of using Bayes factors to compare models instead. What would you recommend? Given the structure of my models (weighting and a correlated random effect), are there any issues I should be aware of when choosing and interpreting these model comparison metrics?

I would be happy to provide any additional information about the models if that would be helpful?

Best wishes and thank you in advance.


#2

This is not a good process. The main reason to use PSIS-LOO is that it tells you when its assumptions are not satisfied. When the diagnostics indicate the approximation is unreliable, the next course of action is to reconsider the model, not to try another model comparison tool that does not even tell you when its assumptions are not satisfied.
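For reference, those diagnostics are interpretable per observation; here is a minimal, language-agnostic sketch in Python of the Pareto k-hat threshold logic from the PSIS-LOO paper (a hypothetical helper for illustration, not part of the loo package):

```python
def classify_pareto_k(k_values):
    """Count observations per diagnostic category, using the k-hat
    thresholds from the PSIS-LOO paper (arXiv:1507.04544):
    k <= 0.5 good, k <= 0.7 ok, k <= 1 bad, k > 1 very bad."""
    counts = {"good": 0, "ok": 0, "bad": 0, "very bad": 0}
    for k in k_values:
        if k <= 0.5:
            counts["good"] += 1
        elif k <= 0.7:
            counts["ok"] += 1
        elif k <= 1:
            counts["bad"] += 1
        else:
            counts["very bad"] += 1
    return counts
```

Any observation in the "bad" or "very bad" bins means the importance-sampling approximation for that point cannot be trusted, which is exactly the warning WAIC would never give you.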

Also, the point about bias when there are many models being compared applies to PSIS-LOO and Bayes factors, as well as WAIC. Comparing models by Bayes factors is completely different from comparing models by PSIS-LOO (or a less robust information criterion), so you should generally decide which route to take before even doing the analysis. PSIS-LOO attempts to answer the question “Which model is expected to best predict new data, given the existing data?” The Bayes factor is largely driven by the pre-data expectation of the likelihood function with respect to the prior distribution, so it is sort of a contest to see which model has the best priors associated with it.
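To make that contrast concrete, here is a toy Python sketch (a hypothetical normal-mean example with closed forms, nothing to do with your actual models): two models differing only in prior scale have nearly identical exact LOO predictive performance, while the marginal likelihood, and hence the Bayes factor, strongly penalizes the vaguer prior.

```python
import math

def log_ml(y, tau):
    """Exact log marginal likelihood for y_i ~ N(theta, 1), theta ~ N(0, tau^2).
    Jointly y ~ N(0, I + tau^2 * J); determinant is 1 + n*tau^2 and the
    quadratic form follows from the Sherman-Morrison inverse."""
    n = len(y)
    s, ss = sum(y), sum(v * v for v in y)
    quad = ss - tau ** 2 * s ** 2 / (1 + n * tau ** 2)
    return -0.5 * (n * math.log(2 * math.pi)
                   + math.log(1 + n * tau ** 2) + quad)

def elpd_loo_exact(y, tau):
    """Exact leave-one-out log predictive density for the same model."""
    n, s = len(y), sum(y)
    elpd = 0.0
    for yi in y:
        prec = (n - 1) + 1 / tau ** 2        # posterior precision from y_{-i}
        m, v = (s - yi) / prec, 1 / prec     # posterior mean and variance
        pv = 1 + v                           # predictive variance, new point
        elpd += -0.5 * (math.log(2 * math.pi * pv) + (yi - m) ** 2 / pv)
    return elpd
```

With data scattered near zero, `log_ml` differs by units between tau = 1 and tau = 10 (the Bayes factor punishes the wide prior before seeing any data), while `elpd_loo_exact` differs only in the second decimal, because the two posteriors, and hence their predictions, are almost the same.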

In your case, where you “have a total of 659 candidate models including all possible subsets of predictors”, none of the above approaches really make sense. There is a procedure for this situation outlined in

https://link.springer.com/article/10.1007/s11222-016-9649-y

but the implementation in

https://github.com/stan-dev/projpred

currently only works for models estimated by stan_glm in the rstanarm R package, and not for the negative binomial. You could try it with a Poisson likelihood and dummy variables for all the intercept shifts, but you are going to need to use a prior distribution that regularizes heavily.


#3

This looks like a cool study.

First of all, do you really need to build all subsets of models? Wouldn’t it also make sense to build models based on different trait classes and compare predictive ability between these classes? For example, build a set of models containing root traits, another set containing leaf traits, do some kind of model selection within each class, and then compare the “best” models of the different trait categories?

The last time PSIS-LOO failed for me I could still go through with k-fold cross-validation; of course this takes much more time and might be prohibitive for such a large set of models.
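The mechanics of that fallback are simple, even if the refits are expensive. A toy Python sketch (using a trivial normal “model” refit on each training split purely to show how the held-out log predictive density is accumulated; in practice each refit would be a full brms run):

```python
import math
import random

def kfold_elpd(y, k=10, seed=1):
    """K-fold CV: shuffle indices into K folds, refit on K-1 folds,
    and sum the log predictive density over each held-out fold.
    The 'model' here is just a normal with the training fold's
    mean and standard deviation, as a stand-in for a real refit."""
    idx = list(range(len(y)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    elpd = 0.0
    for fold in folds:
        train = [y[i] for i in idx if i not in fold]
        mu = sum(train) / len(train)
        sd = math.sqrt(sum((v - mu) ** 2 for v in train) / (len(train) - 1))
        for i in fold:                     # held-out log predictive density
            elpd += (-0.5 * math.log(2 * math.pi * sd ** 2)
                     - (y[i] - mu) ** 2 / (2 * sd ** 2))
    return elpd
```

Unlike PSIS-LOO, nothing is approximated here, which is why it remains valid when the Pareto diagnostics fail; the cost is K full refits per candidate model.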

Otherwise you could follow Ben Goodrich’s advice and turn to stan_glm with an offset to account for knowledge accumulating over time, and dummy variables (related to https://peerj.com/articles/616/) but included as “fixed” terms with strong priors attached to these coefficients (see https://cran.r-project.org/web/packages/rstanarm/vignettes/priors.html). Then the classical posterior predictive checks (see pp_check in bayesplot) should tell you whether the overdispersion is well accounted for. For the phylogenetic signal, I guess plots of phylogenetic distance vs. model residuals should tell you if things are fine.

Hope this helps.


#4

Ben and Lionel already gave good answers; I will just add a bit more.

With this many models, the projection predictive approach mentioned by Ben, https://link.springer.com/article/10.1007/s11222-016-9649-y, is the best way to avoid overfitting in the selection process. As Ben mentioned, the projpred package https://github.com/stan-dev/projpred doesn’t yet support the negative binomial, and it also currently works only in the nested case, where the encompassing model has all the covariates. The projection predictive approach requires an encompassing model, which in the non-nested case requires model averaging over the non-nested models, too. You could do the model averaging using Bayesian stacking https://arxiv.org/abs/1704.02030 However, it would be better to reduce the number of models so that you have just a few non-nested models, each of which is the encompassing model for one nested model set. For the encompassing models, the regularized horseshoe prior could be a good choice https://arxiv.org/abs/1707.01694 (soon in rstanarm on CRAN, too). Then the model averaging over the non-nested models would be easier. You may also consider whether you could use continuous model expansion to include the different non-nested cases in the same model.

If the Pareto diagnostics give a warning, then WAIC should not be trusted either. We need to add this comment to the output of WAIC (i.e., run loo for the diagnostics, and if the diagnostics give a warning, don’t use WAIC).
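The reason the two share a fate is that WAIC is computed from the very same pointwise log-likelihood matrix that loo uses, just without the per-observation diagnostic. A minimal Python sketch of elpd_waic = lppd − p_waic, following the formulas in the PSIS-LOO paper (arXiv:1507.04544):

```python
import math
from statistics import variance

def elpd_waic(log_lik):
    """WAIC from an S x N matrix of pointwise log-likelihoods
    (S posterior draws, N observations):
      lppd   = sum_i log( mean_s exp(log_lik[s][i]) )
      p_waic = sum_i var_s( log_lik[s][i] )
    and elpd_waic = lppd - p_waic."""
    S, N = len(log_lik), len(log_lik[0])
    lppd = sum(
        math.log(sum(math.exp(log_lik[s][i]) for s in range(S)) / S)
        for i in range(N))
    p_waic = sum(variance([log_lik[s][i] for s in range(S)])
                 for i in range(N))
    return lppd - p_waic
```

Nothing in this computation flags an observation whose posterior variance of the log-likelihood is unreliable, which is exactly why a loo run with its Pareto diagnostics should gate any use of WAIC.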