Group-level effects for phylogenetic mixed model in brms - groups having different number of observations

Hi there, first-time user of brms (and Bayes) with what I suspect are some basic questions I was hoping for advice on, please.

I’m working with a comparative dataset of multiple species at multiple sites. For ~2/3rds of species I have > 1 row (observations; max 90) per species; the remaining ~1/3rd have just 1 row (observation). For the 497 sites, ~2/3rds have a single row while the other 1/3rd have > 1 row (max. 49).

I think I ought to be accounting for multiple observations for some species and sites (i.e., ‘species ID’ and ‘group ID’ should be group effects). But I have lots of single rows for the majority of sites, and for a good number of species too…
Q1: will these models cope OK with that? Based on one of the package’s papers I think one early sense check might be to assess the SDs of ‘site ID’ and ‘species ID’ as group-level effects - sound sensible?

Based on the vignette on phylo multi-level models, I can see it’s recommended to include the phylogenetic covariance (i.e., the phylogeny) AND species ID as separate group-level effects, the rationale for the latter being that it would account for effects independent of phylogenetic effects.
Q2: out of curiosity, do folk find this is necessary, i.e., that there really are species-level effects that ought to be accounted for, completely unrelated to phylogeny? And should one first check (via the group-level SDs) that ‘species ID’ really is explaining something that the phylogenetic term is not?

Thank you!

  • Operating System: Microsoft Windows 10 Enterprise
  • brms Version: 2.21.0

Yes, you can include group-level intercepts for site and species. This would be my preferred approach for highly unbalanced data like yours. The estimates for these effects will benefit from “partial pooling” which will be skeptical of sites/species that appear to have extreme deviations from the average site/species (but weak evidence that the deviation is really that extreme because of low replication). Treating site/species as a population-level effect won’t do this, and it becomes very easy to overfit the data and/or to have huge uncertainties around their effects.

There are two reasons to include a group-level effect for species. First, it avoids the problem of pseudoreplication from repeated measures on the same species. This wouldn’t be necessary if you had built a phylogeny of individuals that also had relationships within species, but I suspect you have a species-level tree. Second, I think there’s often species-level variation that can’t be explained by phylogeny. Think about a trait that is evolutionarily labile. Variation that’s not explained by phylogeny will be common.

1 Like

I agree with @wpetry. I would just include both the phylogenetic and non-phylogenetic effect of species. I would typically expect that there is more variation explained by species than just the effect of phylogeny, and so both effects are important. If one of them turns out to be small, that is fine, you can leave that small effect in the model and it will not influence the other parameters very much.

Something to think about is whether the effect of site is expected to be equal for all species, or whether is should be allowed to vary.

1 Like

Great point! @Ax3man, I’m curious how you think it’s best to model varying site effects by species. Would you nest the group-level terms? If so, is there an advantage to putting species within site (1|site/species) or vice versa (1|species/site)?

Hi both, thanks so much for your helpful replies!

@wpetry re. ‘partial pooling’: thanks for the suggestion - before I head off down that particular rabbit hole, do you have a favourite source I should head to for reference, please?

Something to think about is whether the effect of site is expected to be equal for all species, or whether is should be allowed to vary.

@Ax3man thanks for this too. I wouldn’t have thought of this myself, so sitting tight to hear your response to wpetry’s questions on this. If it’s at all informative for your response, for some sites I have data for multiple species (other sites just a single species)

Thank you!

Richard McElreath’s Statistical Rethinking book and lecture series are fantastic. The first 30 min of this video covers partial pooling. It’s fast-paced and rich, but I found it very helpful for building intuition about how/why this works. The text would be a good follow-up to solidify the concepts. Solomon Kurz has a really nice brms translation of the examples, including reedfrogs.

1 Like

Yes, these are good questions. Not sure if I have general recommendation, it depends on the system and questions.

If we do not have an a priori reason that sites explain a lot variation regardless of species, I think a model with both (1 | species) and (1 | species:site) could work well*. Then we have all the partial pooling for species, and I’d usually be happy to have the site effects not be pooled across species.

If there is plenty of data and/or interest in the random effects, one could try to fit the full model (1 | species) + (1 | site) + (1| species:site), which I think will better allow for partial pooling within sites across species.

Does this sounds right to you? Curious what others think.

(*) this is all in addition to the phylogenetic random effect, to be clear.
(**) I always forget the direction of (1 | a/b) so I usually just write out both the terms.

3 Likes

This feels right to me. When sites have effects that are correlated across species, those effects are almost never in the same direction for all species. This is what motivates the joint species distribution models based on a latent factor approach (see https://www.sciencedirect.com/science/article/abs/pii/S0169534715002402), which can be thought of as a technique for capturing a reduced-rank approximation to the covariances between species across sites, but these models can be difficult to fit and are often overkill.

1 Like

Thanks for the further info, both.

My interest in getting the group-level terms ‘right’ (aside from doing robust stats), is that:
a) I’d like to be confident that results I have from my population-level effects really are broad overall effects across observations (and not explained by species and/or site)
b) I am curious re. how much variance is explained by species- and/or site-level effects

If we do not have an a priori reason that sites explain a lot variation regardless of species, I think a model with both (1 | species) and (1 | species:site) could work well*. Then we have all the partial pooling for species, and I’d usually be happy to have the site effects not be pooled across species.

If there is plenty of data and/or interest in the random effects, one could try to fit the full model (1 | species) + (1 | site) + (1| species:site), which I think will better allow for partial pooling within sites across species.

Thank you! With these suggestions I come back round to an earlier question, please: would you say it’s important, then, during the early stages of modelling to assess which group-level effects ought to be included, e.g., by comparing intercept-only models with various combinations of group-level effects*, [e.g., (1 | species) v (1 | species) + (1 | site) v (1 | species) + (1 | site) + (1| species:site)] ?

*in addition to the phylo effect

Thank you!

Sorry for the slow response, I was traveling.

I generally try to fit the model that contains 1) the effects I am interested in estimating, and 2) the effects I want to adjust for. Then I do my inference from that maximal model. Again, if effects turn out to small, that’s okay. If you have trouble with the model converging, you can try a simpler model. Which model to fit requires thinking carefully about the biology. E.g. what do you think the site effects look like? Should they be the same across species?

Alternatively, you can fit a few models are compare them something like wAIC, and select the “best” model. This may be a good solution if you truly don’t know what effects could be important. But, after model selection, the model you choose is now biased. One way to think about this, is that the credible intervals of the parameters of interest now assume that you know what effects are important, even though you did not (you had to compare models to find out). While in the first approach, the uncertainty about those effects is still in the model, as the posterior for a random effect with low variance can still include scenarios where the effect is larger. But there are a lot more informed opinions about model selection than mine, so read around.

1 Like