LOO/WAIC for 'multi-species' [distribution] models


Not focused on any specific modeling problem–more curious regarding some considerations for LOO/WAIC in the context of what are commonly called 'multi-species occupancy [abundance, etc.] models]–multivariate species (s) incidence/count data replicated across spatial locations or units (i) with further sampling replication at these locations (j) employed to deal with observation error (missing a species or individual organism that’s there). I know some folks active here are familiar with these models, but will try to quickly summarize the models at the bottom.

Anyway, a common circumstance is that some species are widespread and easily detectible, while others are very sparsely observed because they are rare/cryptic/etc. I guess my intuition is that in most applications, models tend to overfit the ‘rare’ species and underfit the ‘common’ species. I’m curious if anybody has used LOO-PSIS to compare different models based on the log likelihood at the species*site grouping and noticed Pareto’s k flagging sparse (or common) species? If so, just thinking about some of the advice here, is there some other strategy beyond changing some of the model structure to allow sparsity to vary across species? Was also wondering whether anybody had instead focused on LOGO at the species level or site level, or perhaps grouped the log-likelihood differently for estimating WAIC or LOO-PSIS. E.g., would summing LL_i across species (grouping by site) for each iteration be sensible in certain circumstances (maybe accepting that not all species can be fit adequately, but hoping that the the lpd across the assemblage is relatively stable or something)?

Thanks for your patience with this non-specific post and my naiveté on the subject. Again, not a pressing issue–more a set of half-formed thoughts or concerns about some of these models. I apologize if this type of post falls outside the purview of the board.


(brief model description below)

A common structure for these sorts of models (below, focusing on species occupancy at sites) is:
logit(\psi_{i,s}) =inprod(\beta_s,X_i)
\beta_s \sim N(\mu_\beta, \sigma_\beta)
\alpha_s \sim N(\mu_\alpha, \sigma_\alpha)

with data likelihood \prod_{s=1}^{NSpec}\prod_{i=1}^{NSites}\psi_{i,s}[y_{i,.,s}|p_{i,s}] + I(y_{i,.,s}=0)(1-\psi_{i,s}).

For what it’s worth, this is not my intuition. I think that in many applications the hierarchical prior (usually Guassian random effects) actually is approximately well calibrated, and is providing the “correct” amount of shrinkage for both rare and common species. I’ve thrown mixed predictive checks at some pretty huge multispecies models and have generally been unable to detect important deviations from Gaussian random effect assumptions. Edit: but this is in the context of fairly elaborate trait- and phylogeny-based predictors that succeed in mopping up a lot of variation.

Species that have extremely few observations will tend to get flagged because leaving out one point with an observation of that species can substantially alter the posterior. Likewise, ubiquitous species that are detected at nearly every point can get flagged because leaving out a single point where they are never detected can substantially alter the posterior. We usually don’t think about this second case because it’s relatively infrequent in real biological survey data.

LOGO over species will generally lead to poor LOO-PSIS diagnostics and performance, because the LOO posterior will be missing all of the information that informs the random effect for a given species and will tend to diverge massively from the posterior informed by all the data. Leaving one site out is fair game, and might be useful, but it’s not immediately obvious to me when it would be more informative than leaving out individual species-by-site combinations.

Finally, note that in the absence of visit-specific detection covariates (e.g. time-of-day, as opposed to properties of sites), the occupancy model is precisely a zero-inflated binomial regression (optionally with random effects on the binomial probability and the zero-inflation probability, corresponding to the multispecies case). This is useful to notice because the machinery for cross-validating and LOO-ing zero-inflated binomial regression is relatively more discoverable than the machinery for occupancy models. For example, you can look at what brms does for the binomial regression case, and what you’ll find is the equivalent of holdouts at the species-by-site level.

A tangential but important aside: Note that you want to compute the likelihood conditional on the occupancy probability \psi, and NOT conditional on the posterior distribution of the latent occupancy state Z.


Thanks a bunch—thats all very helpful. Should have thought about looking under the hood of brms.