Should I use pseudo-BMA+ (pseudo Bayesian model averaging) for comparing nested models?

Hi folks,

I’ve made a couple of models using brms because I want to know whether/how various life history traits (ecoregion, dietary guild, body mass) influence habitat loss (y). None of these traits were highly collinear according to a quick VIF check. The formulae for my models, which all share the same family and priors, are:

y ~ eco
y ~ mass
y ~ diet
y ~ eco + mass
y ~ eco + mass + diet

I ran loo() (and reloo as necessary) on the models. Here are a couple of the metrics I obtained to evaluate them, using loo_model_weights and bayes_R2:
[image: table of PBMA+ weights from loo_model_weights and Bayesian R2 estimates from bayes_R2]
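For completeness, my workflow was roughly along these lines (a sketch only; `d` and the fit object names are placeholders, and the shared family and priors are omitted):

```r
library(brms)

# Candidate models; "d" is a placeholder for the data frame, and the
# shared family and priors are omitted here
fit_eco  <- brm(y ~ eco,               data = d)
fit_mass <- brm(y ~ mass,              data = d)
fit_diet <- brm(y ~ diet,              data = d)
fit_em   <- brm(y ~ eco + mass,        data = d)
fit_full <- brm(y ~ eco + mass + diet, data = d)

# LOO for each model, refitting problematic observations where needed
loo_full <- loo(fit_full, reloo = TRUE)

# Pseudo-BMA+ weights and Bayesian R2
loo_model_weights(fit_eco, fit_mass, fit_diet, fit_em, fit_full,
                  method = "pseudobma")
bayes_R2(fit_full)
```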

My current understanding, after reading very helpful comments by @avehtari in Model stacking and LOO (brms models), is that it doesn’t make much sense to use model stacking weights to compare nested models, but that PBMA+ might make sense with a better understanding than I currently have. Here are my questions:

  1. My ELPDs are all essentially the same. How can these models have such different PBMA+ weights? Also, does the y ~ eco model actually predict new data better, or describe the current fit better, more often than the full model?

  2. What is going on with those approximate Bayesian R^2 values? I know one of the advantages of Bayesian R^2 is that overfitting is less of an issue, but why should the eco + mass model have a worse R^2 than either the eco or the mass model on its own?

Could you

  • tell the number of observations (I guess n is small),
  • show parameter estimates for y ~ eco + mass + diet (I guess there is not much to see),
  • show Bayesian-R2 distribution (I guess it’s very wide),
  • compute LOO-R2 (I guess it’s much closer to 0),
  • plot posterior predictive checking (I guess there is a model misspecification),
  • add model y ~ 1 (this will help to have a baseline result for ELPD),
  • show full loo output for y ~ eco + mass + diet (can help to diagnose model misspecification),
  • show elpd_diff_se’s when comparing to the best model or full model (when comparing models, this is better than elpd_se)

Overfitting is still an issue for the point estimate of Bayesian-R2. Showing the distribution of R2 helps to recognize the related uncertainty.
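For example, several of these can be obtained in brms like this (object names such as fit_full and fit_null are placeholders):

```r
# Baseline intercept-only model
fit_null <- brm(y ~ 1, data = d)

# LOO-R2 and the full distribution of Bayesian R2
loo_R2(fit_full)
hist(bayes_R2(fit_full, summary = FALSE))

# Posterior predictive check (argument is nsamples in older brms versions)
pp_check(fit_full, ndraws = 100)

# Full loo output for the largest model
loo(fit_full)
```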

Hi Aki,

Thanks! I have 95 observations.

The parameter estimates for y ~ eco + mass + diet are:

This is what I have for my Bayesian R2 estimate:
[image: Bayesian R2 estimate]

My LOO R2 is much closer to zero: 0.02234089

Here’s the posterior predictive check for the model. It looks a lot better than with some of the other priors I tried, but I’d be happy to improve it:

Here’s the table I showed previously, with the intercept-only (y ~ 1) model included. I’m sorry I’m not showing elpd_diff_se - I couldn’t figure out how to calculate that for my models or loo objects. If you can give me a hint on how to get that, I would be extra grateful.
[image: updated model comparison table, now including the intercept-only model]

Finally, here’s the full model loo output:
[image: loo output for the y ~ eco + mass + diet model]

I’m not surprised that these models have such weak explanatory power - habitat loss has a lot to do with human activity, not just species’ properties. I just don’t know how to reconcile the different options (ELPD, PBMA+ weights) for deciding whether any of these models is better than the others, which is something I need to know if I’m going to report them in a paper.

This means your model doesn’t have any predictive ability.

The parameter estimates are also very uncertain, supporting this conclusion.

Is y strictly positive? If it is, it would be good to model log(y) or to use a log-normal or log-t model for y. Or is y restricted between 0 and 1, e.g. loss as a percentage, in which case a logistic transformation or beta regression could be used.
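For example, a beta model in brms would look roughly like this (`d` is a placeholder data frame; percentages rescaled to (0, 1) first):

```r
library(brms)

# Beta regression for a response restricted to (0, 1);
# percentages would first be divided by 100
fit_beta <- brm(y ~ eco + mass + diet, data = d, family = Beta())
```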

Use loo_compare. For brms, see the loo_compare.brmsfit documentation (“Model comparison with the 'loo' package”) in brms: Bayesian Regression Models using 'Stan'.
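For example (with placeholder fit names):

```r
# Store the LOO criterion in each fit, then compare;
# the output gives elpd_diff and se_diff relative to the best model
fit_eco  <- add_criterion(fit_eco,  "loo")
fit_full <- add_criterion(fit_full, "loo")
loo_compare(fit_eco, fit_full, criterion = "loo")
```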

p_loo is larger than the number of parameters in the model, which indicates strong model misspecification, which can explain the high variation in point estimates of R2. Min n_eff is suspiciously low. Can you also show convergence diagnostics (Rhat and n_eff (ESS in more recent versions))?
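These can be pulled directly from the brmsfit, e.g. (placeholder object name):

```r
# summary() reports Rhat and effective sample sizes;
# rhat() and neff_ratio() give per-parameter values
summary(fit_full)
rhat(fit_full)
neff_ratio(fit_full)
```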

Currently it seems all your models are bad; none of them should be used, and comparison is not sensible before you have better models.

Well, that is certainly critical to know, thank you! My data are percentages, but I used the fitdist function in the fitdistrplus package to compare WAIC values for Weibull, gamma, beta, and some other common fits. Weibull came out with the best WAIC score, so that’s what I proceeded with. However, it sounds like I should use the beta family since these are known percentages, right?

Here are my Rhats:
[image: Rhat values and sampler settings]

What should I do to improve the specification of my model?

Thank you for all your help so far!

You should have mentioned that also in the first post. You mentioned only the formulas, but not that you didn’t use the default family.

Probably.

Wow, thin=200! Usually Stan performs well for this kind of model with thin=1 and a total of 4000 post-warmup samples.

Also, you had not mentioned before that you also have group-level effects! It’s difficult to help if you keep hiding information. The group-level effects likely explain the high p_loo, and the covariates don’t contribute much. Group-level effects are also likely to mess up R2 and posterior predictive checks, making it look like there is a non-zero R2 and a reasonable ppc fit. LOO-R2 agrees with the coefficient estimates, i.e., there is not much happening.

Try the beta family, but it is possible that the covariates do not have any predictive information. Compute LOO-R2 for all models, as it is easier to interpret the practical relevance from it than from elpd.
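E.g. over all candidate fits (placeholder names):

```r
# LOO-R2 for each candidate model, which is easier to compare for
# practical relevance than raw elpd values
fits <- list(eco = fit_eco, mass = fit_mass, diet = fit_diet,
             em = fit_em, full = fit_full)
lapply(fits, loo_R2)
```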

A couple of clarifications to my previous message.

The convergence diagnostics look great and there is no problem with MCMC.

phylo has 95 levels, that is, there is an additional parameter for each observation. You are now explaining the data mostly with the (1 | phylo) term. Is there some additional prior for this term? If not, then you could try without that term (and possibly with beta regression).

As each phylo group has one observation, removing that observation changes the corresponding parameter posterior a lot, and loo has problems. It is surprising that all k-hat values are below 0.7, but high p_loo and low n_eff’s indicate additional problems. Thus it is possible that small differences between elpds are explained by failing loo. When loo fails, elpd estimates tend to be optimistic (too high).

Hi Aki,

I thought that since the family and group-level term were constant across models, they wouldn’t affect model weights. Now that we know I have misspecification, they’re clearly important. Thank you for your patience - I’m still very new to Bayesian stats.

The phylo term accounts for the fact that the species/samples are not phylogenetically independent. I followed this vignette to set it up. I am not sure what kind of prior I’d assign to it.

I did run the full model with the beta family and better thinning/warmup numbers. Here’s how I specified it, for completeness:
[image: brm() call specifying the beta-family model]
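Roughly, the call follows the brms phylogenetics vignette (a sketch; `tree` and `d` are placeholder names, and the exact priors and sampler settings are omitted):

```r
library(ape)
library(brms)

# Phylogenetic covariance matrix from the tree, as in the brms
# phylogenetics vignette
A <- ape::vcv.phylo(tree)

fit_beta_full <- brm(
  y ~ eco + mass + diet + (1 | gr(phylo, cov = A)),
  data   = d,
  data2  = list(A = A),
  family = Beta()
  # chains/iter/warmup/thin as used in the actual run
)
```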
LOO R2 = 0.0521, so it’s still very bad at explaining the data.

The posterior predictive plots look like this:

[images: posterior predictive check plots]

Is this sufficiently well specified to state that the model isn’t really predictive, and that phylogeny is the only thing with any signal at all? Either way, what do you look for to feel confident about your models’ specifications?

No problem, happy to help.

Understanding the phylo term is important for understanding how much the model is explaining the data with the random effects and how much with the covariates.

I assume A includes the phylogenetic dependencies?

Yes.

Now that you changed thin=1, you can also reduce the number of iterations from 110000 to 8000 and still get very good sampling, but much faster.
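For example, using update() so the model does not need to be recompiled (placeholder fit name):

```r
# With thin = 1, far fewer iterations are enough; update() reuses
# the already compiled model, so the refit is much faster
fit_beta_full <- update(fit_beta_full, iter = 8000, warmup = 4000, thin = 1)
```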

loo now works better and is more reliable.

Given the information you have provided, I would agree with “the model isn’t really predictive, and that phylogeny is the only thing with any signal at all”. Also, it seems there is no model misspecification that the diagnostics you have shown would reveal.


Thanks so much for your help, Aki! I am so excited to feel good about these models and my results. Before you go, what’s a good n_eff rule of thumb, for future reference? I know I was too low before, but what’s a good target number to meet?

Also, should I keep warmup = 4000 with iter = 8000? Thanks again!