I have two models that share a random effect of species but differ in their fixed-effects structure. PSIS-LOO indicates that model_2 is dramatically better than model_1. I want to know whether model_2 is clearly superior for every species, or whether there are some species for which model_1 performs better. A couple of questions:
I'm pretty sure that I can get the inference I need by just re-running LOO on the log-likelihoods for only the points belonging to a given species. Does this sound right, @avehtari?
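For concreteness, here's a rough sketch of what I have in mind (placeholder names throughout, assuming a species column in the model data and 4 chains of 1000 post-warmup draws; this ignores the moment-matching issue I raise next):

```r
library(brms)
library(loo)

# pointwise log-likelihood matrix (posterior draws x observations)
ll <- log_lik(fit)

# columns belonging to one species (placeholder species label)
idx <- which(fit$data$species == "sp_01")

# relative efficiencies for the subset, assuming 4 chains x 1000 draws
r_eff <- relative_eff(exp(ll[, idx]), chain_id = rep(1:4, each = 1000))

# PSIS-LOO for just that species' observations
loo_sp <- loo(ll[, idx], r_eff = r_eff)
```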
Because I need to use moment matching, it is substantially easier to use brms than to manually prepare a fitted model for LOO. I think that I can do the following; does this look right, @paul.buerkner?
One thing that I notice (and I need to double-check this before I claim with certainty that it's true) is that looping step 2 over all 51 species in my data takes substantially less time than doing LOO on the full model object without subsetting (i.e. myfit <- add_criterion(brm(...), "loo", moment_match = T)). Is this to be expected, or might it suggest that a big speedup may be lurking somewhere inside brms and/or loo (or, alternatively, that my whole approach is missing the mark)?
Edited to add: I think I must be doing something wrong, either conceptually or in implementation, because if I sum the elpd_diff values from the 51 per-species comparisons computed via step 2 above (all of which favor model_2), I get a number several times larger than the elpd_diff from comparing the full models (i.e. all species at once). I'd really appreciate getting set straight here.
I struggled a bit to think through this precise point about k-fold versus LOO. My interest isn't really in how well the relationships estimated from the remaining 50 species generalize to a single held-out species, but rather in whether the relationships modeled for all 51 species together yield improved predictive performance uniformly across all 51 species, or whether the improvements are concentrated in a subset of species. I think this question is properly answered by looking at LOO from the full model, but evaluating the predictive performance one species at a time.
In other news, I realized that loo accepts a newdata argument, and when I use it I get a different answer than I get by just swapping the new data into the $data part of the brmsfit (which was just a shot in the dark). I'm redoing the analysis described above right now and will report back on whether the results look more reasonable.
Edit: I can report that when I pass the single-species data to loo (via add_criterion), the elpd summed over species matches the elpd for the full model, and the timings are more comparable as well. I suspect the issue is that the model (originally fit via cmdstanr) was getting recompiled, but rather than just recompiling the generated Stan code, new Stan code was being generated that yielded a different model based on the updated $data (maybe I had threading turned on when I originally fit the model?). So I was previously getting nonsense, which was fixed by sensibly passing the new data via newdata.
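For the record, here's roughly what the corrected loop looks like (a sketch with placeholder names; species stands in for my actual grouping column):

```r
# loop over species, passing each subset via `newdata` instead of
# overwriting myfit$data
species_loos <- lapply(unique(myfit$data$species), function(sp) {
  nd <- myfit$data[myfit$data$species == sp, ]
  loo(myfit, newdata = nd, moment_match = TRUE)
})

# sanity check: the summed per-species elpd should match the full-model elpd
sum(sapply(species_loos, function(x) x$estimates["elpd_loo", "Estimate"]))
```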
You can also run LOO just once (on the full data) and then examine the pointwise results in the resulting loo object. Gather the pointwise elpds for each species and compute the elpd difference and diff_se for each species separately. Moment matching will also provide the pointwise elpds.
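A rough sketch of what I mean (loo1 and loo2 are the full-data loo objects for model_1 and model_2, and species is your grouping column; names are placeholders):

```r
# pointwise elpds from the full-model loo objects
elpd1 <- loo1$pointwise[, "elpd_loo"]
elpd2 <- loo2$pointwise[, "elpd_loo"]

# group the pointwise differences by species (same row order as the data)
d <- split(elpd2 - elpd1, fit$data$species)

# per-species elpd_diff and its standard error
per_species <- t(sapply(d, function(di) {
  c(elpd_diff = sum(di), diff_se = sd(di) * sqrt(length(di)))
}))
```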