Handling data with reported population- and group-level observations


I’m currently attempting to model proportion data measured across numerous studies using a beta regression. The data are organized such that proportions are measured at numerous sites within a study. Here’s an example of the data:

Study N label prop
Study1 1 site 0.9
Study1 1 site 0.93
Study1 1 site 0.89
Study2 1 site 0.8
Study2 1 site 0.82
Study3 5 study-average 0.7

If all observations were site-level within a study, I would simply run a random-intercept model to account for study-to-study variability:

mod.rnd <- brm( bf(prop ~ (1|Study), phi ~ (1|Study), family=Beta()), data=data)

However, the only observation available for Study3 is a study-level average of 5 sites and I don’t have access to the individual sites that make up that average proportion of 0.7.

What is the appropriate way to handle this type of heterogeneous data in brms? My first thought was to include a nested random intercept using the label factor:

mod.nst <- brm( bf(prop ~ (1|Study) + (1|Study:label), phi ~ (1|Study) + (1|Study:label)), family=Beta(), data=data)

Using the shorthand nesting syntax, this would simply be (1|Study/label).

My basic understanding of nested random-effect models is that this would estimate a random intercept for each study and for each label within a study, accounting for variability at both the site and study levels. However, I wanted to check whether this is the appropriate way to handle this scenario, or whether there is a better way to handle this type of data.
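One thing that may be worth checking before fitting is whether the nested grouping adds any levels beyond Study. The base-R sketch below rebuilds the example rows from above and counts the grouping levels; the object names (`dat`, `n_study`, `n_nested`) are only illustrative. It assumes the six rows shown are representative, i.e. that every study reports either site-level values or a single study average, never both.

```r
# Example rows copied from the table above.
dat <- data.frame(
  Study = c("Study1", "Study1", "Study1", "Study2", "Study2", "Study3"),
  N     = c(1, 1, 1, 1, 1, 5),
  label = c("site", "site", "site", "site", "site", "study-average"),
  prop  = c(0.90, 0.93, 0.89, 0.80, 0.82, 0.70)
)

# Number of levels for the Study grouping.
n_study <- length(unique(dat$Study))

# Number of levels for the label-within-Study grouping,
# i.e. the second term implied by (1|Study/label).
n_nested <- length(unique(interaction(dat$Study, dat$label, drop = TRUE)))

# If the two counts are equal, label is constant within each study and the
# nested intercept cannot be distinguished from the Study intercept.
c(n_study = n_study, n_nested = n_nested)
```

If the two counts match in the full data set as well, the nested term would presumably not capture anything that (1|Study) does not already capture.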

Thank you very much for the help with this problem.

  • Operating System: RHEL 8
  • brms Version: 2.20.4

Hey @zult, did you ever find out anything here?
(I’ve been following this thread for a while because I’m interested in approaches to the problem.)
My first thought was that you’re describing a meta-analysis model, but one where you sometimes (more often than not?) have access to the primary data rather than only the summary stat at the study level. So naturally I thought of brms’s functionality for adding se() on the response side of the formula, where the standard error encodes variation at the study level. However, even if that did work for mixed primary/study-level data, you wouldn’t be able to use the Beta() family.
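For reference, a minimal sketch of what that se() syntax looks like in a brms formula. Note the assumptions: `prop_se` is a hypothetical column of standard errors that the data in this thread do not contain, and the family is Gaussian rather than Beta, which is exactly the limitation mentioned above. The fit is wrapped in a function (`fit_meta`, also just an illustrative name) so nothing is compiled until it is called.

```r
library(brms)

# Meta-analysis style formula: each observation carries a known standard error.
# `prop_se` is a HYPOTHETICAL column; no standard errors are reported in the thread.
f_meta <- bf(prop | se(prop_se) ~ 1 + (1 | Study))

# Wrapped in a function so that no model is compiled or sampled until called.
# gaussian() is used here because this is a sketch of the se() approach,
# not of the Beta regression from the original question.
fit_meta <- function(data) {
  brm(f_meta, family = gaussian(), data = data)
}
```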
All that to say: I dunno!
My gut says this must be possible somehow though.

I’m not sure if your random effects formula really respects the fact that the primary vs. study-level observations (site vs. average) should carry different weights (e.g., maybe one study average is worth 3 observations at site level), but it’s probably something to start with.
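One way to experiment with unequal weights is the weights() addition term in brms, which scales each row's log-likelihood contribution. The sketch below uses the N column as the weight, so the Study3 average counts as much as 5 site-level rows. That choice is purely an assumption for illustration; whether N, something smaller, or a different approach entirely is appropriate is an open question. The names `dat`, `f_weighted`, and `fit_weighted` are hypothetical.

```r
library(brms)

# Example rows copied from the original question.
dat <- data.frame(
  Study = c("Study1", "Study1", "Study1", "Study2", "Study2", "Study3"),
  N     = c(1, 1, 1, 1, 1, 5),
  label = c("site", "site", "site", "site", "site", "study-average"),
  prop  = c(0.90, 0.93, 0.89, 0.80, 0.82, 0.70)
)

# Same random-intercept structure as the original model, plus weights(N):
# each row's log-likelihood contribution is multiplied by its N.
# Using N as the weight is an ASSUMPTION, not an established recommendation.
f_weighted <- bf(
  prop | weights(N) ~ 1 + (1 | Study),
  phi ~ (1 | Study)
)

# Wrapped in a function so that no model is compiled or sampled until called.
fit_weighted <- function(data) {
  brm(f_weighted, family = Beta(), data = data)
}
```

Calling `fit_weighted(dat)` would then fit the model. With only six rows this is just a syntax check, not something to interpret.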