Predicting a new study in a brms model using subset()

I am presently conducting an epidemiological meta-analysis looking at the prevalence of a disorder that is sometimes measured in the current moment (“Do they have it now?”), in a particular period (“Did they have it in the last 6 months?”), or over the entire lifespan (“Have they ever had it?”). Few studies report all three time windows – most report only one, a few report two, and only a couple report all three. I have been asked to estimate the prevalence within each time window and to evaluate predictors. I decided this might be a good time to use brms’ amazing multivariate features. I did so using the following code:

#lp = lifetime prevalence estimate
#cp = current prevalence estimate
#pp = period prevalence estimate

m1 = bf(lp | trials(N) + subset(is_lt) ~ iv + (iv | S | sample), family=binomial)
m2 = bf(cp | trials(N) + subset(is_c) ~ iv + (iv | S | sample), family=binomial)
m3 = bf(pp | trials(N) + subset(is_p) ~ iv + (iv | S | sample), family=binomial)

mv_mod = brm(m1 + m2 + m3,
             data = dat, iter = 10000,
             cores = 4, chains = 4, save_all_pars = TRUE,
             control = list(adapt_delta = .999, max_treedepth = 20),
             prior = c(
               set_prior("normal(-3, 1)", class = "Intercept", resp = "lp"),
               set_prior("normal(0, 1)", class = "b", resp = "lp"),
               set_prior("normal(0, 1)", class = "sd", resp = "lp"),
               set_prior("normal(-3, 1)", class = "Intercept", resp = "pp"),
               set_prior("normal(0, 1)", class = "b", resp = "pp"),
               set_prior("normal(0, 1)", class = "sd", resp = "pp"),
               set_prior("normal(-3, 1)", class = "Intercept", resp = "cp"),
               set_prior("normal(0, 1)", class = "b", resp = "cp"),
               set_prior("normal(0, 1)", class = "sd", resp = "cp"),
               set_prior("lkj(4)", class = "cor")
             ))

I apply the subset function to a stacked data frame to make it all work. The variables is_lt, is_c and is_p indicate which rows belong to which analysis (0 = exclude, 1 = include).
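For readers unfamiliar with this layout, here is a minimal sketch of what such a stacked data frame might look like. The column names match the model formulas above, but the studies and values are invented purely for illustration; unused response cells are filled with 0 rather than NA because brms checks for NAs even in rows the subset excludes.

```r
# Hypothetical stacked data: one row per study-by-time-window estimate.
# A study reporting two time windows contributes two rows; the is_*
# indicator columns flag which response each row informs.
dat = data.frame(
  sample = c("study1", "study1", "study2", "study3"),
  iv     = c("a", "a", "b", "a"),
  N      = c(500, 500, 1200, 800),   # sample size per estimate
  lp     = c(40, 0, 96, 0),          # lifetime cases (0 where unreported)
  cp     = c(0, 15, 0, 24),          # current cases
  pp     = c(0, 0, 0, 0),            # period cases
  is_lt  = c(1, 0, 1, 0),
  is_c   = c(0, 1, 0, 1),
  is_p   = c(0, 0, 0, 0)
)
```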

It works great! And the output looks nice too. Modelling it this way seems to provide slightly tighter confidence intervals for several parameters compared to the univariate models; I assume this is because estimates across the different time windows co-inform each other. There is one exception, however. If I try to predict a new study using the following code:

new_study = posterior_linpred(mv_mod,
                              newdata = data.frame(N = 1, iv = c("a", "b"),
                                                   is_lt = 1, is_p = 1, is_c = 1),
                              transform = TRUE, allow_new_levels = TRUE, resp = "lp")
posterior_summary(new_study, robust = TRUE) * 100

The resulting interval is much broader for the multivariate model than for the univariate one. This is odd to me because the confidence intervals for each parameter seem slightly narrower in the multivariate model. I am worried I am doing something wrong in how I am predicting this new study. It took some trial and error to get the preceding code to run, and I was surprised that I needed to set is_lt, is_p and is_c each to 1 no matter which response I was predicting. I had instinctively assumed that when predicting resp = "lp" I would set is_p and is_c to 0. If those variables are excluded, or if any of them are set to 0, I get:

Error: All rows of 'data' were removed via 'subset'. Please make sure that variables do not contain NAs even in rows unused by the subsetted model.

Am I missing something? The only other explanation I can think of is that imprecision in the correlations between the response variables is somehow coming into play… which would be unfortunate!


  • Operating System: MacOS 10.14.3
  • brms Version: 2.8.0

The post-processing functionality of subset models is currently a little clunky, hence the need to specify all the is_ variables. In the end, they won’t be used for the predictions if you just want to get results for a specific response.

The residual correlations are not included in the predictions via `posterior_linpred` (but they are in `posterior_predict`). Can you give me a minimal reproducible example of the behavior you find weird so that I can have a look?
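To make the distinction concrete, the two post-processing calls can be compared on the same newdata. This is a sketch only, reusing the `mv_mod` fit and the newdata from the question above:

```r
# Same newdata as in the question: all is_* indicators set to 1,
# even though only resp = "lp" is actually being predicted.
nd = data.frame(N = 1, iv = c("a", "b"),
                is_lt = 1, is_p = 1, is_c = 1)

# Draws of the fitted probabilities only; correlations between the
# responses do not enter here.
eta = posterior_linpred(mv_mod, newdata = nd, transform = TRUE,
                        allow_new_levels = TRUE, resp = "lp")

# Additionally draws binomial outcomes, so it propagates all sources
# of predictive uncertainty, including between-response correlations.
yrep = posterior_predict(mv_mod, newdata = nd,
                         allow_new_levels = TRUE, resp = "lp")
```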