I’m hoping to confirm that the model I’m specifying is indeed what I’ve been conceptualizing. I feel as though I’m struggling with something that’s rather simple, as I’m sure that mine is a common situation when estimating parameters using hierarchical models. And there are a couple of other threads I’ve found that seem to address this issue (most notably here), yet I still don’t feel confident that I’m specifying my model as I intend.
Data:
I’m working with avian malaria infection data, and am attempting to explain variation in infection at the species level while accounting for certain individual-level variables. The data include an individual-level binary response y
(infection status), a categorical individual-level predictor age_sex
(with three levels), and 6 predictors measured at the group (species) level (foraging stratum
, average group_size
, nest
structure, centrality
in a mixed-species flock network, maximum longevity
, and an indicator of whether or not a species is migratory
). The group-level data include both categorical and continuous predictors, and each is necessarily invariant within a particular species).
Model:
My chief interest is in understanding how these higher-level predictors influence the prevalence of infection across species. At the same time, I also need to account for the age and sex of sampled individuals, as this is known to affect parasite exposure and susceptibility. Essentially, my conceptualization of the likelihood is something like the following:
where y_{ij} is the infection status of the i^{\text{th}} individual of species j, x_1 and x_2 are dummy variables corresponding to contrasts on the age_sex
variable, and \beta_{0,j} is a species-specific intercept which is itself a function of the group-level predictors listed above. My first choice for specifying this model in brm()
is to do something like:
y ~ age_sex + stratum + group_size + nest + centrality + longevity + migratory + (1|species)
In addition to the group-level predictors, I include a group intercept term to capture variation in species’ infection rates due to other, unobserved variables at the species level (e.g., immune investment). However, I don’t necessarily expect that the effect of (e.g.) group_size
varies between species, so I don’t include any random slope terms.
The part that I’m struggling to understand is whether, by including a species intercept, I’ve functionally accounted for all species-level variation in the response without actually allowing any of the species-level data (e.g., group_size
) to explain any of that variation. That is to say, because the species
grouping factor perfectly predicts all of the species-level measures, it seems as though it can consume all species-level variation in the response without partitioning any of it to the predictors in hand.
Update:
After taking another look at the worked examples provided in the brms
paper and vignette, I’m now beginning to doubt whether the group term is appropriate and needed (or helpful) for answering my questions (how does an individual’s probability of infection change with variation in [e.g.] group_size
). It seems that in most (if not all) of those examples, although population coefficients and intercepts are specified to vary at the group level, the covariates themselves are measured at the individual level.
On one hand, I essentially have repeated measures (of species), and I do expect that there is likely to be correlation in the response among species that is not captured by the species-level covariates in hand (e.g., due to differences in immune investment between species). For these reasons, it seems appropriate and necessary to specify the model as stated above.
On the other hand, as most of my predictors are measured at the group (species) level, and so could be perfectly predicted by species (i.e., once species is known, all values of the predictors are known), it would seem that including the group effect may not be allowing the group-level predictors to be fully utilized in explaining variation in the response.