I’m attempting to model proportion data measured across many studies using beta regression. The data are organized so that proportions are measured at multiple sites within each study. Here’s an example of the data:
| Study | N | label | prop |
|---|---|---|---|
| Study1 | 1 | site | 0.9 |
| Study1 | 1 | site | 0.93 |
| Study1 | 1 | site | 0.89 |
| Study2 | 1 | site | 0.8 |
| Study2 | 1 | site | 0.82 |
| Study3 | 5 | study-average | 0.7 |
If all observations were site-level within a study, I would simply run a random-intercept model to account for study-to-study variability:
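The model code did not survive in this post, so here is a minimal sketch of what a random-intercept beta regression could look like in brms. The data frame name `dat`, the object name `fit_ri`, the intercept-only fixed part, and the sampler settings are all assumptions for illustration; only the column names come from the table above.

```r
library(brms)

# Example data copied from the table above (`dat` is a hypothetical name)
dat <- data.frame(
  Study = c("Study1", "Study1", "Study1", "Study2", "Study2", "Study3"),
  N     = c(1, 1, 1, 1, 1, 5),
  label = c("site", "site", "site", "site", "site", "study-average"),
  prop  = c(0.90, 0.93, 0.89, 0.80, 0.82, 0.70)
)

# Intercept-only beta regression with one random intercept per study.
# Beta() uses a logit link for the mean by default.
fit_ri <- brm(
  prop ~ 1 + (1 | Study),
  data   = dat,
  family = Beta(),
  chains = 4,
  cores  = 4
)

summary(fit_ri)
```

With only six rows this is purely illustrative; the real data would have hundreds of site-level rows and possibly predictors in the fixed part.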
However, the only observation available for Study3 is a study-level average of 5 sites and I don’t have access to the individual sites that make up that average proportion of 0.7.
What is the appropriate way to handle this type of heterogeneous data in brms? My first thought was to include a nested random intercept using the label factor:
The implicit nesting would simply be (1|Study/label)
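For reference, a short sketch of the two equivalent ways to write the nesting. In brms (as in lme4), `(1 | Study/label)` is shorthand for a study intercept plus an intercept for each label-within-study combination. The object names `f_implicit` and `f_explicit` and the intercept-only fixed part are assumptions.

```r
library(brms)

# Implicit nesting: label nested within Study
f_implicit <- bf(prop ~ 1 + (1 | Study/label), family = Beta())

# Explicit form of the same structure
f_explicit <- bf(prop ~ 1 + (1 | Study) + (1 | Study:label), family = Beta())
```

One thing to keep in mind: in the example data every study has a single value of label, so `Study:label` has exactly as many levels as `Study`. In that situation the two group-level standard deviations would probably be hard to tell apart from the data alone.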
My basic understanding of nested random-effects models is that this would estimate random intercepts for each study and for each level of label within a study, and so would account for variability at both the site and study level. However, I wanted to check whether this is the appropriate way to handle this scenario, or whether there might be a better way to handle this type of data.
Thank you very much for the help with this problem.
hey @zult , did you ever find out anything here?
(I’ve been following this thread for a while because I’m interested in approaches to the problem.)
My first thought was that you’re describing a meta-analysis model, except that you sometimes (more often?) have access to the primary data rather than only the study-level summary statistic. So naturally I thought of brms’s support for se() on the response side of the formula, where the se encodes variation at the study level. However, even if that did work for mixed primary/study-level data, you wouldn’t be able to use the Beta() family.
All that to say: I dunno!
My gut says this must be possible somehow though.
I’m not sure if your random effects formula really respects the fact that the primary vs. study-level observations (site vs. average) should carry different weights (e.g., maybe one study average is worth 3 observations at site level), but it’s probably something to start with.
Thanks for thinking it over @zacho. Unfortunately, I didn’t come up with anything very satisfying and decided to just proceed with the original formulation:
Most of the time, I only have one or two study-average observations among hundreds of individual site observations. I figure this approach at least updates the mean of prop given the study-average, even though that observation won’t contribute to the estimate of the group-level effects.
One other option I tried was using the weights to at least count that mean N times:
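The weighted formula was also lost from this post. A rough sketch of what it could look like, using brms’s `weights()` addition term with the `N` column; the names `dat`, `f_w`, and `fit_w`, the intercept-only fixed part, and the `adapt_delta` value are assumptions, not the settings actually used.

```r
library(brms)

# Same example data as in the table at the top of the thread
dat <- data.frame(
  Study = c("Study1", "Study1", "Study1", "Study2", "Study2", "Study3"),
  N     = c(1, 1, 1, 1, 1, 5),
  prop  = c(0.90, 0.93, 0.89, 0.80, 0.82, 0.70)
)

# weights(N) scales each row's likelihood contribution by N,
# so the Study3 average counts five times and site rows count once
f_w <- bf(prop | weights(N) ~ 1 + (1 | Study), family = Beta())

# Raising adapt_delta is a common first thing to try against divergences;
# whether it would be enough for this model is not known
fit_w <- brm(
  f_w,
  data    = dat,
  chains  = 4,
  cores   = 4,
  control = list(adapt_delta = 0.99)
)
```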
From what I understand, this will have the study-average mean contribute N times to the likelihood. Unfortunately, this resulted in excess divergences that I wasn’t able to iron out.
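Written out, and assuming the usual mean/precision parameterization of the beta distribution with mean $\mu_i$ and precision $\phi$, the weighted log-likelihood would be roughly:

$$
\log L = \sum_{i} w_i \, \log \mathrm{Beta}\!\left(y_i \mid \mu_i \phi,\ (1 - \mu_i)\,\phi\right), \qquad w_i = N_i
$$

For the example table, the five site rows enter with $w_i = 1$ and the Study3 row enters with $w_i = 5$. Note that this treats the average as if five sites had each been observed at exactly 0.7, which is not quite the same thing as a mean of five sites that vary around 0.7.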
All that to say, I’m just going with the original approach to not let perfect be the enemy of good. I’m hoping it’s okay for this dataset, but it’s definitely something that I’d be interested in figuring out down the road. Sorry for the unsatisfying answer there.
I hear ya. At least it sounds like your data is dominated by individual observations, so your first model should do pretty well.
Didn’t someone quip that “a model is never finished, merely abandoned”…