`brms` formula for population-level effects measured at the group level

I’m hoping to confirm that the model I’m specifying is indeed what I’ve been conceptualizing. I feel as though I’m struggling with something that’s rather simple, as I’m sure that mine is a common situation when estimating parameters using hierarchical models. And there are a couple of other threads I’ve found that seem to address this issue (most notably here), yet I still don’t feel confident that I’m specifying my model as I intend.

Data:

I’m working with avian malaria infection data, and am attempting to explain variation in infection at the species level while accounting for certain individual-level variables. The data include an individual-level binary response y (infection status), a categorical individual-level predictor age_sex (with three levels), and 6 predictors measured at the group (species) level (foraging stratum, average group_size, nest structure, centrality in a mixed-species flock network, maximum longevity, and an indicator of whether or not a species is migratory). The group-level data include both categorical and continuous predictors, and each is necessarily invariant within a particular species).

Model:

My chief interest is in understanding how these higher-level predictors influence the prevalence of infection across species. At the same time, I also need to account for the age and sex of sampled individuals, as this is known to affect parasite exposure and susceptibility. Essentially, my conceptualization of the likelihood is something like the following:

\text{Bernoulli}\left(y_{ij} \mid \text{ilogit}(\beta_{0,j} + \beta_1 x_{1,ij} + \beta_2 x_{2,ij} )\right),

where y_{ij} is the infection status of the i^{\text{th}} individual of species j, x_1 and x_2 are dummy variables corresponding to contrasts on the age_sex variable, and \beta_{0,j} is a species-specific intercept which is itself a function of the group-level predictors listed above. My first choice for specifying this model in brm() is to do something like:

y ~ age_sex + stratum + group_size + nest + centrality + longevity + migratory + (1|species)

In addition to the group-level predictors, I include a group intercept term to capture variation in species’ infection rates due to other, unobserved variables at the species level (e.g., immune investment). However, I don’t necessarily expect that the effect of (e.g.) group_size varies between species, so I don’t include any random slope terms.

The part that I’m struggling to understand is whether, by including a species intercept, I’ve functionally accounted for all species-level variation in the response without actually allowing any of the species-level data (e.g., group_size) to explain any of that variation. That is to say, because the species grouping factor perfectly predicts all of the species-level measures, it seems as though it can consume all species-level variation in the response without partitioning any of it to the predictors in hand.

Update:

After taking another look at the worked examples provided in the brms paper and vignette, I’m now beginning to doubt whether the group term is appropriate and needed (or helpful) for answering my questions (how does an individual’s probability of infection change with variation in [e.g.] group_size). It seems that in most (if not all) of those examples, although population coefficients and intercepts are specified to vary at the group level, the covariates themselves are measured at the individual level.

On one hand, I essentially have repeated measures (of species), and I do expect that there is likely to be correlation in the response among species that is not captured by the species-level covariates in hand (e.g., due to differences in immune investment between species). For these reasons, it seems appropriate and necessary to specify the model as stated above.

On the other hand, as most of my predictors are measured at the group (species) level, and so could be perfectly predicted by species (i.e., once species is known, all values of the predictors are known), it would seem that including the group effect may not be allowing the group-level predictors to be fully utilized in explaining variation in the response.

1 Like

Welcome to discourse, and thanks for a really clear and well-written question!

tl;dr

The modeling aim of including these random effects is conceptually sound, but in practice it works only if the data are strong enough to sufficiently constrain the true probabilities for a sufficient number of groups to reliably estimate the random-effect variance. As long as the model can constrain the random effect variance, you don’t need to worry that the model will get too trigger happy about using the random effect to mop variation that is best attributed to the covariates, because (assuming there is not prior/data conflict) there is no penalty for adjusting the covariates to their true values, but there is a penalty for unnecessarily broadening the random effect distribution.

Details

I think it might be useful to strip away some of the complexity in your model and try to expose the core nugget of the question, and then build intuition from there.

Suppose I have just one covariate, that varies between groups but not within groups. And to begin, assume that I have just one observation per group. Standard logistic regression assumes that the true logit-probabilities (e.g. of an individual bird having malaria) are given exactly by a linear combination of my covariates. As you’ve noted, this might not be true. Suppose that we know the true probabilities with certainty. Then we might think to regress their logits against our single covariate in a standard linear regression, i.e. logit(p_i) = a + bx_i + \epsilon_i, where \epsilon_i \sim normal(0, \sigma).

To add a layer of realism, now consider the case where we don’t observe the true probabilities, but instead observe Bernoulli samples from the true probabilities. Just one observation per group. Now we have:
y_i \sim Bernoulli(p_i)
logit(p_i) = a + bx_i + \epsilon_i
\epsilon_i \sim normal(0, \sigma)

This is exactly what we (attempt to) fit with brm(y ~ x + (1 | group), family = "Bernoulli"), where each observation has its own group.

So this sort of model makes perfect sense conceptually. We can think of it as a latent regression of the logit-probabilities against the covariate, followed by Bernoulli sampling. But does this model work in practice? And what are the practical consequences of leaving out the random effect?

Well, the practical consequences of ignoring the random effect aren’t ideal: consistent underestimation of the slope term.

library(brms)

# Simulation:
N <- 10000 # number of observations; we'll use a very large number of observations
           # so that some stochastic aspects of the simulation get smoothed over
covariate <- rnorm(N)
alpha <- 0 # logit-scale intercept
beta <- 1 # logit-scale slope
sigma <- 1  # logit-scale residual sd in a regression of the true latent probability 
            # on the covariate
probabilities <- boot::inv.logit(rnorm(N, alpha + beta*covariate, sigma))
observations <- rbinom(N, 1, probabilities)

# Plotting:
plot(observations ~ covariate)
points(probabilities ~ covariate, pch = 16)

# Analysis:
test_data <- data.frame(x = covariate, y = observations, group = as.factor(1:N))
test_model <- brm(y ~ x, data = test_data, family = "Bernoulli", backend = 'cmdstanr', cores=4)
summary(test_model)

Can we fix the bad slope estimate by using the true generative model? Well no. We run into divergences, and the standard deviation parameter doesn’t seem to be identified in the model. The below code takes a few minutes to run. You can see the same problems with much smaller N, but it’s nice to run the large-N as well to see that the problems don’t go away even with large sample sizes.

test_model2 <- brm(y ~ x + (1|group), data = test_data, family = "Bernoulli", backend = 'cmdstanr', cores=4)
summary(test_model2)

The problem here is that a single Bernoulli trial just doesn’t give us enough information about the true underlying Bernoulli probability, which makes it virtually impossible for the model to constrain the random effect variance. What if we constrain the random effect variance to be near its true value via the prior? Then we get good inference on the slope!

iprior <- c(prior_(~normal(1,.01), class = ~sd))
test_model3 <- brm(y ~ x + (1|group), data = test_data, family = "Bernoulli", backend = 'cmdstanr', cores=4, prior = iprior)
summary(test_model3)

So this framework makes a lot of sense as long as we can constrain the random effect variance to be near its true value. We’ve seen two options for doing this. We can constrain the variance via the prior, or we can allow the model to constrain the variance by allowing the model to be sufficiently certain about the true probabilities (the model has no problems if the true probabilities are known exactly–then we have a standard linear regression).

If we don’t have strong prior knowledge about the random effect variance, our best avenue forward is to allow the model to constrain the probabilities by giving it more data per group. For example, ten observations per groups is more than enough:

N <- 100
n_trial <- 10
covariate <- rnorm(N)
alpha <- 0 # logit-scale intercept
beta <- 1 # logit-scale slope
sigma <- 1  # logit-scale error in a regression of the 
# true latent infection probability on the 
# covariate
probabilities <- boot::inv.logit(rnorm(N, alpha + beta*covariate, sigma))
observations <- rbinom(N, n_trial, probabilities)
expectations <- 10*probabilities
plot(observations ~ covariate)
points(expectations ~ covariate, pch = 16)
test_data4 <- data.frame(y = observations, trials = rep(n_trial, N), x = covariate, group = as.factor(1:N))
test_model4 <- brm(y | trials(trials) ~ x + (1 | group), data = test_data2, family = "binomial", backend = 'cmdstanr', cores=4)
summary(test_model4)

So my practical recommendation is to give your model a try, but to be sensitive going in that it’s possible that the model just won’t work. If you’ve got enough data per group, you’re in with a good chance. But if you don’t then no amount of clever reparameterization tricks can save you; you just won’t be able to estimate the random effect.

5 Likes

Thanks for your reply, @jsocolar ! The explanation and worked example were extremely helpful. One thing I neglected to mention in my original post was that I had already successfully fit the model I was considering (with some strengthening of the default priors, see below). However, the results from sampling the posterior (spoiler alert: all my coefficients are belong to zero [1]), together with my lack of experience, led me to doubt whether I was specifying the model appropriately.

In case anyone is interested, I’ve reproduced my workflow and thought process for fitting the model below. I’d be more than happy to hear any thoughts or critiques you may have regarding the choices I’ve made.

Data

birds.csv (67.6 KB)

I have a decent sample—905 individuals of 30 species. For a few species, sample size is >100; many are between 30 and 100. Several species were poorly sampled, with <10 samples/species.

library(brms)
library(bayesplot)
br <- read.csv("tmp/birds.csv")                                     # import data and set factor references
br$age_sex <- factor(br$age_sex, levels=c("HY","AHY-F","AHY-M"))
br$stratum <- factor(br$stratum, levels=c("G","B","LC","HC"))

hist(table(br$species))

In a maximum likelihood framework I’d be inclined to drop these samples, considering it’s difficult to estimate species-level prevalence with so few trials. However, my understanding of the Bayesian approach is that it shouldn’t be detrimental to keep those species; they’ll help to shape the posterior for population effects, and I simply won’t try to make any strong inferences about prevalence in those particular species. Ultimately, due to missing predictor data, I end up losing a few of the poorly-sampled species anyway, dropping to 894 samples from 25 species.

Priors

                                                                    # define model structure and priors
model_form <- bf(y ~ age_sex + stratum + group_size + nest + centrality + longevity + migratory + (1|species))
model_priors <- c(set_prior("normal(0,1.5)","b"), set_prior("normal(0,1.5)","Intercept"))

I defined the model structure as described in my earlier post. My first attempt to fit this with the default (flat) priors that brms places on the \beta coefficients led to divergent transitions and warnings about effective sample sizes. Following a suggestion made in another thread, I tried placing somewhat tighter priors (\text{Normal}(0,1.5)) on both the \beta coefficients and the intercept. This seemed to look better than the defaults when folded back into probability space, whereas the default prior on the intercept seems to place a lot of density near 0 and 1 (though perhaps I’m thinking about this incorrectly?).

hist(plogis(rnorm(1e3,0,1.5)))                                      # quick prior check
hist(plogis(rstudent_t(1e3,3,0,2.5)))                               # default prior for the intercept and scale parameters

Sampling from the prior still showed a lot of density around 0 and 1, but there was at least some coverage of all reasonably-expected values, and I proceeded with sampling from the posterior.

                                                                    # draw samples from prior
prior_samples <- brm(model_form, data=br, family=bernoulli(), chains=1, cores=1, prior=model_priors, seed=1138, sample_prior="only")

br_y <- get_y(prior_samples)                                        # prior predictive checks
br_yrep_naive <- posterior_predict(prior_samples)
br_grps <- prior_samples$data$species
prevalence <- function(y) mean(y==1)                                # prevalence computation for ppc_stat()

pp_check(prior_samples)
ppc_stat(br_y, br_yrep_naive, stat="prevalence")
ppc_stat_grouped(br_y, br_yrep_naive, br_grps, stat="prevalence")

Posterior

Using the more informed priors allowed the chains to converge without issue. There were no warnings about divergences or effective sample sizes as when using the naïve priors. Additionally, it seems that the likelihood was able to dominate the posterior, which leads me to believe that the priors were not inappropriate (or if they were, then at least they were not too strong).

                                                                    # sample posterior
post_samples <- update(prior_samples, chains=4, cores=4, sample_prior="no")

br_yrep <- posterior_predict(post_samples)                          # posterior predictive checks

pp_check(post_samples)
ppc_stat(br_y, br_yrep, stat="prevalence")
ppc_stat_grouped(br_y, br_yrep, br_grps, stat="prevalence")

Conclusion

summary(post_samples)

All of this was simply a prelude to finding that there is a lot of posterior density around 0 for all of the group-level effects. This wasn’t totally unexpected, particularly considering that prevalence was extremely low in the majority of species sampled (including several of the best-sampled species). I’m happy to accept that result and conclude a lack of evidence for any strong effects from the predictors—provided that I am confident that I’ve specified the model appropriately to address that question. But, I didn’t want to draw that conclusion prematurely if it turned that the reason for it was a model mis-specification. I think, now, that it wasn’t, and that the results can be accepted and interpreted accordingly.

Thanks again for your help, Jacob!

2 Likes