Importance of model selection (ELPD/LOOIC) with variables of a priori interest

igwill · September 5, 2024, 10:36am

Hi all,

This is more of a general conceptual question about statistical modeling, but it is based on brms fits and downstream model selection methods.

In a nutshell, I have an animal that can be infected by two pathogens and is subjected to four different environmental treatments. Often, you can get a good sense of host mortality (dependent var) simply from which pathogen is present in which environment (categorical independent var).
But we also had hypotheses about whether the actual abundance of that pathogen (continuous independent var) can better predict the host outcome. As a rough sketch this looks like: Host_mortality ~ Environment*Pathogen_abundance + (1|batch), family = binomial. I have a couple of models; all converge fine and look OK, but not amazing, based on pp_check().

In some cases, we find Environment-by-Pathogen_abundance interactions that seem to have a clear effect (the 95% posterior credible interval does not include zero), but without any major or broad effect at the reference level or across environment types. This is interesting and not unreasonable biologically (although one effect in one model does seem odd).

I followed this with some LOOIC model comparisons testing ~Environment*Pathogen vs. ~Environment only (and vs. a ~1 null). Here, I found LOOIC differences of around 40 favoring the full models (by both difference and weight).
But digging deeper, I saw that ELPD_se and ELPD_difference were actually about the same in magnitude in most cases. Also, I had to use reloo = TRUE, which meant about 4-10 refits per model (with sample sizes around 40), because of pareto_k > 0.7 warnings (this, sensibly, just made the “leading” model even less clearly the leader).
I also have WAIC, which doesn’t give me any kind of error or confidence measurement that I’m aware of; it puts a lot of weight on the full models. But my sense is that LOOIC/ELPD is preferred, and I quite like having an error measurement.
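
For reference, the scales involved can be made concrete with a minimal base-R sketch. It relies on two facts: LOOIC is -2 times ELPD (so a LOOIC difference of about 40 is an ELPD difference of about 20), and the standard error of an ELPD difference is computed from the pointwise differences as sqrt(n) * sd(differences). The numbers below are simulated for illustration only and are not the data from this thread; the variable names (elpd_env, elpd_full) are hypothetical.

```r
# Simulated pointwise ELPD values for two models evaluated on the same
# n = 40 observations. With real fits these would come from
# loo(fit)$pointwise[, "elpd_loo"]; here they are made up.
set.seed(1)
n <- 40
elpd_env  <- rnorm(n, mean = -1.2, sd = 0.8)          # ~ Environment only
elpd_full <- elpd_env + rnorm(n, mean = 0.5, sd = 3)  # ~ Environment * Pathogen_abundance

# Pointwise differences: full model minus reduced model.
d <- elpd_full - elpd_env

elpd_diff  <- sum(d)           # total ELPD difference
se_diff    <- sqrt(n) * sd(d)  # standard error of that difference
looic_diff <- -2 * elpd_diff   # the same comparison on the LOOIC scale

# A ratio near 1 means the difference is about one standard error from zero.
round(c(elpd_diff = elpd_diff, se_diff = se_diff,
        looic_diff = looic_diff, ratio = elpd_diff / se_diff), 2)
```

When the ratio of the difference to its standard error is close to 1, as described above, the difference is usually read as not clearly distinguishable from zero, however large the raw LOOIC gap looks.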

My take-home message is that, in general, knowing Pathogen_abundance doesn’t contribute in an efficient or meaningful way to building better predictive models of host mortality. But in a few specific cases, Pathogen_abundance does tell you about host outcomes, and these effects are statistically robust, just only applicable in certain situations. Since we cared about this Pathogen_abundance effect ahead of time, let’s talk about it.
Or … is the model selection here telling me that the effect is probably so small or wobbly that we should be very suspicious about Pathogen_abundance and just say that we can’t clearly say it matters at all?

[LOOIC and WAIC initially given by compare_performance() from the performance package. But ELPD and updated LOOIC given by loo_compare() after add_criterion() on my models.]
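
The add_criterion() and loo_compare() steps mentioned in the note above can be sketched roughly as follows. This is only an illustration under assumptions: the helper name compare_by_loo and its argument names are hypothetical, and the three arguments are assumed to be already-fitted brmsfit objects for the full, Environment-only, and null models.

```r
# Hypothetical helper: compares three already-fitted brmsfit objects by
# PSIS-LOO. The function and argument names are placeholders.
compare_by_loo <- function(fit_full, fit_env, fit_null) {
  # reloo = TRUE refits the model once for each observation whose
  # Pareto k exceeds the threshold (0.7 by default in brms), replacing
  # the unreliable importance-sampling estimate with an exact one.
  fit_full <- brms::add_criterion(fit_full, "loo", reloo = TRUE)
  fit_env  <- brms::add_criterion(fit_env,  "loo", reloo = TRUE)
  fit_null <- brms::add_criterion(fit_null, "loo", reloo = TRUE)

  # Returns a table with elpd_diff and se_diff relative to the best model.
  brms::loo_compare(fit_full, fit_env, fit_null, criterion = "loo")
}
```

Calling compare_by_loo() on the three fits would return the comparison table whose elpd_diff and se_diff columns are the quantities discussed above.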

Thanks for your thoughts!

avehtari · September 7, 2024, 2:59pm

See the Nabiximols case study for an illustration of when LOO-CV can be weak at distinguishing models. In that case study, it’s easy to look at the posterior because the treatment is independent of the other predictors. It can get more complicated if the predictors are collinear. That case study also shows that we can improve LOO-CV’s ability to distinguish between models if we remove the aleatoric uncertainty part and look at the expected predictions.

avehtari · September 7, 2024, 3:02pm
