So one way to do this with multivariate data in brms
is to convert it to “long format” (each dimension-value pair gets its own row). Than one can do “sort-of” multivariate model by having a varying intercept per each row in the original data. The model is not fully identified by default, but one can fix the sd
of the varying intercept to make it identifiable.
So assuming simulated data like:
rcumulative_logit <- function(N, mean, thresholds) {
raw <- rlogis(N) + mean
res <- rep(1, N)
for(t in thresholds) {
res <- res + (t < raw)
}
res
}
set.seed(58224522)
N_patients <- 7
simple_test <- data.frame(patient = factor(1:N_patients)) %>%
crossing(time_from_diagnosis = c(0, 3, 5)) %>%
mutate(q1 = rcumulative_logit(n(), rnorm(N_patients)[as.integer(patient)] + time_from_diagnosis * 0.5, thresholds = c(-1,-0.3,0,1)),
q2 = rcumulative_logit(n(), rnorm(N_patients, sd = 0.5)[as.integer(patient)] + time_from_diagnosis * 0.1, thresholds = c(-0.4,-0.1,0.5,1.5)),
q3 = rcumulative_logit(n(), rnorm(N_patients, sd = 2)[as.integer(patient)] + time_from_diagnosis * 0.3, thresholds = c(-2,-0.8,-0.2,1.1)),
obs_id = 1:n()
)
simple_test_long <- simple_test %>%
pivot_longer(starts_with("q"), names_to = "question", values_to = "answer")
obs_id
now binds all the response that were previously the same row.
The idea is that we add a varying intercept that corresponds to the correlated residuals, i.e. we add one parameter per observed answer as a varying intercept of the form (0 + question | obs_id)
where obs_id
uniquely identifies each row in the wide dataset. To keep the model identified, we fix the sd
of this varying intercept.
We then increase the disc
parameter of the cumulative("logit")
distribution (which I currently do by adding disc ~ 1
to the formula and setting a constant prior on this intercept. This is almost the same as using actual multivariate distribution, the difference is that the observation noise now has two components: one independent from the logistic distribution of ordered_logistic
(with roughly sd = 1/disc
) and one correlated from the varying intercept (where we set the sd = 1
). So as disc -> inf
we get a closer and closer approximation to multivariate probit, but also increase risk of computational issues. disc = 10
seems to work quite OK.
Note: it does not really matter which link we put in the cumulative
response (as we are trying to minimize it’s effect), but the logit
link is much more numerically stable than the probit
link, so we use it. Still, this is a probit
model, because the main noise component is normally distributed.
fit_thres_rescor_approx <- brm(bf(
answer | thres(gr = question) ~ time_from_diagnosis + (1 | patient) + (0 + question | obs_id), disc ~ 1, family = cumulative(link = "logit", link_disc = "identity")),
prior = c(
prior(constant(1), class = "sd", group = "obs_id"),
prior(student_t(3, 0, 5*2.5), class = "Intercept"),
prior(constant(10), class = "Intercept", dpar = "disc")
),
data = simple_test_long
)
I know this is not a very good approximation if you have many questions (roughly more than 5, but didn’t investigate thoroughly), still it models some correlation even in the higher-dimensional case. The model will also have trouble initializing properly for large data - reducing the constant(10)
makes it better behaved but also further from a pure multivariate probit. But is multivariate, ordinal, correlated, works with brms
out of the box and I verified it is a faithful approximation to multivariate probit for 3 questions.
For fitting, you simply drop the rows with missing question-answer pairs from the long dataset. To impute them, you predict for the whole dataset.
This code was developed and tested as a part of ongoing project with @jroon, so if you end up using it, some form of acknowledgement would be great :-) (I feel a bit shameful asking for this, but my employer is recently being a bit more inquisitive about my publication output).
You can hack brms
to work with latent variables - actually working on this in the project, but I don’t need to handle missing values which would complicate things another bit. However, it gets quite complex quite fast and requires you to be comfortable with both brms
internals and coding in Stan, so if the above version works well for you, I would stick with it.
I also took the liberty to move this to a new thread as I think it is getting far from the original question.