I am involved in the project, so I’ll try to provide some details.
As Tiril described, we have 4 outcomes for each participant, each measured 3 times. The data are in a data frame `my_data` with the columns:
- `outcometype`
- `time`
- `outcome` (sum score values)
- `exposure`
- `adjustment` (variables we want to correct for)
- `max` (the maximum possible sum score for an outcome)
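To make the layout concrete, here is a hypothetical toy version of the data frame (column names as above; values are made up purely for illustration and are not our data):

```r
# Hypothetical toy data, only to illustrate the column layout (values are made up)
my_data_toy <- data.frame(
  outcometype = rep(1:4, each = 3),               # 4 outcome types
  time        = rep(1:3, times = 4),              # 3 measurement occasions
  outcome     = c(12, 15,  9, 20, 18, 22,         # observed sum scores
                   5,  7,  6, 30, 28, 31),
  exposure    = rnorm(12),                        # exposure of interest
  adjustment  = rnorm(12),                        # adjustment variable(s)
  max         = rep(c(25L, 25L, 10L, 40L), each = 3)  # maximum possible sum score
)
```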
Because the outcomes are sum scores from rating scales, we use brms' custom family feature (`custom_family()`) to implement a beta-binomial family.
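For reference, `beta_binomial2` and `stan_funs` are set up essentially as in the brms vignette on custom response distributions; a rough sketch (the exact code we use may differ in detail):

```r
library(brms)

# Beta-binomial as a custom family with mean mu and precision phi;
# the number of trials is passed in via the "trials" stanvar defined below.
beta_binomial2 <- custom_family(
  "beta_binomial2", dpars = c("mu", "phi"),
  links = c("logit", "log"), lb = c(NA, 0),
  type = "int", vars = "trials[n]"
)

# Stan functions implementing the log-PMF and RNG for the custom family
stan_funs <- "
  real beta_binomial2_lpmf(int y, real mu, real phi, int T) {
    return beta_binomial_lpmf(y | T, mu * phi, (1 - mu) * phi);
  }
  int beta_binomial2_rng(real mu, real phi, int T) {
    return beta_binomial_rng(T, mu * phi, (1 - mu) * phi);
  }
"
```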
In the simple analysis, the brms model is like this:
```r
my_prior = c(set_prior("normal(0,2)", class = "b"),
             set_prior("normal(0,3)", class = "Intercept"),
             set_prior("normal(0,3)", class = "Intercept", dpar = "phi"))

my_data_o1t1 = my_data[my_data$outcometype == 1 & my_data$time == 1, ]

my_stan_var = stanvar(as.integer(my_data_o1t1$max),
                      name = "trials")

my_stan_ctrl = list(adapt_delta = 0.99,
                    max_treedepth = 15)

sm = brm(outcome ~ exposure + adjustment,
         prior = my_prior,
         data = my_data_o1t1,
         family = beta_binomial2,  # defined as in the brms vignette
         stan_funs = stan_funs,    # defined as in the brms vignette
         stanvars = my_stan_var,
         control = my_stan_ctrl)
```
This model lets us estimate the effect of the exposure on the outcomes in 12 separate analyses (`outcometype1_time1`, `outcometype1_time2`, …, `outcometype4_time3`), where only the outcome changes and the exposure and adjustment variables stay the same.
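Concretely, this amounts to a loop over the 12 subsets, roughly like this (a sketch using the objects defined above; the list `separate_fits` is just a hypothetical name):

```r
# Sketch: fit one model per outcometype x time combination
separate_fits <- list()
for (ot in 1:4) {
  for (tp in 1:3) {
    d <- my_data[my_data$outcometype == ot & my_data$time == tp, ]
    separate_fits[[paste0("outcometype", ot, "_time", tp)]] <-
      brm(outcome ~ exposure + adjustment,
          prior = my_prior,
          data = d,
          family = beta_binomial2,
          stan_funs = stan_funs,
          stanvars = stanvar(as.integer(d$max), name = "trials"),
          control = my_stan_ctrl)
  }
}
```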
However, we would rather analyse all exposure-outcome associations together, because the outcomes are correlated. A simple way to do this is a hierarchical model in which the association of exposure and outcome varies across outcome types and time points. The following model should do this:
```r
my_stan_var = stanvar(as.integer(my_data$max),
                      name = "trials")

hm = brm(bf(outcome ~ exposure + adjustment + (exposure | outcometype:time),
            phi ~ (1 | outcometype:time)),
         prior = my_prior,
         data = my_data,
         family = beta_binomial2,
         stan_funs = stan_funs,
         stanvars = my_stan_var,
         control = my_stan_ctrl)
```
This model is so slow for the complete data set (N > 20000) that we are not getting anywhere. It converges on a subset of the data, but with the full data sampling slows down much more than expected. That is, if the model works with 10% of the data, I would expect the full model to be around 10 times slower, but not 50 times slower (does this make sense?).
So we have 2 main questions at the moment:
- Does the hierarchical model look OK?
- Are there reasons to expect the relationship between the amount of data and sampling speed to be non-linear?
Thanks in advance!