Multivariate formula with different number of observations


I would like to model two response variables (y1, y2) using brms multivariate syntax to (1) model each response with a different distribution family, and (2) model the random effects g as correlated. The number of obs for y1 and y2 differ.

The idea is to have something like:

bf_1 <- bf(y1 ~ 1 + (1|ID|g)) + normal()
bf_2 <- bf(y2 ~ 1 + (1|ID|g)) + lognormal()

According to the brms multivariate vignette, it seems that we need the same number of observations for each component in the formula (in the vignette tarsus and back are two columns in the dataset). Can this assumption be relaxed? A previous user raised the same question here, but the suggested solution does not make specifying different families possible.

Would you be able to help?


Sorry, short on time, so just a quick notice - If I understand the problem correctly the subset term should let you use only some rows for each outcome (see the help for brmsformula for details).

No worries. Thank you for taking the time! I’ll try to use subset and I’ll report back.

It worked! Using subset was the trick. Thank you 🙏🏻

1 Like

I’m curious to see the solution using subset. Can you post the brms formula you used?

Found an example here in the last post:


Thanks for posting!

I was hoping that I could have a different number of observations for two different outcomes and still fit the models in brms using the multivariate normal. But, rescor still requires the outcomes to have the same number of observations :( I need rescor = TRUE so I can model the covariances between predictors and residuals.

For now, I’ll keep trying to do this in Stan… but it’s still beyond my skill level… I might be ready to give it another few-month break…

1 Like

Oh I didn’t spot set_rescor(FALSE) there. Also not useful for me so since I too need to model rescor = TRUE 😭

Haha, sorry to be bearer the bad news! Well, if you figure out a solution for rStan, be sure to let me know! :) (or find a vignette… but I’ve searched high and low for one…)

1 Like

haha no worries - better you told me than I did a bunch of work before noticing.

Hmm you could possibly treat the outcomes with ‘absent rows’ as missing values, but I have not tried this myself as yet.

See the vignette here:

Ya, I’ve thought of that, but the data aren’t truly “missing”. I do behavioral experiments where the unbalanced nature is occurs as part of the experiments. e.g., if I’m typically only interested in correct trial data (response times). The number of errors is never constant across participants…

But does that matter if it is a dependent variable? When you extend the model as in the example there, my understanding is that the ‘missing values’ are treated as parameters - so for missing outcome variables it is similar to a doing a prediction for new data. If you want you could make a model with this parameters included in order to include your uneven outcomes, and then ignore these parameters when interpreting your results… I don’t think it can effect your coefficients other than by allowing you to include all the data

Why is it that you have different number of observations?

I have unequal observations, because I look at response time data based on response accuracy (and only trials where a participant actually responds.

For example, one person could have 300 correct trials and 20 error trials and another could have 290 correct trials and 25 error trials (with 5 discarded non-response trials).

@paul.buerkner you actually helped me with a similar problem before (Modeling unbalanced multivariate outcomes in brms), which was referenced above.

My hope is to get an identical output to using
bf(congruent ~ 0 + (1+trial|p|id)) +
bf(incongruent ~ 0 + (1+trial|q|id) + rescor(TRUE)

In this case it would be something like
bf(correct ~ 0 + (1+trial|p|id)) +
bf(error ~ 0 + (1+trial|q|id) + rescor(TRUE)

That output of this brms fit has all of the (co)variance components that I need, and none that I don’t :) I don’t know if that helps to clarify what I’m after… I can only get it to work if I have the same number of correct and error trials across participants.

@jroon That is an interesting idea… maybe the issue is that I don’t understand what’s happening when mi() is used, as in… bf(congruent | mi() ~ 0 + (1+trial|p|id))

What keeps you from stacking correct and error trials into the same variable (long format) and then modeling that?

I couldn’t figure out how to get the covariance between residual variances for the two model types… :)

I want the sd(correct_Intercept), sd(error_Intercept), sd(correct_trial), sd(error_trial) and the covariances among them. I also need the sigma_correct, sigma_error, and rescor(correct, error).

1 Like

Is this the same model however?

If I understand you correctly you mean make one long format output variable and an indicator variable to say which outcome in the wide format that each row corresponds to. Then when you fit that model all outcomes are considered to come from the same univariate distribution (e.g. normal) with the indicator having a moderating effect, whereas in the multivariate version the outcomes are considered to come from a multivariate normal ?

I’ve seen this done in e.g. jags models in the past - but I’ve never understood are these things truly equivalent.

You can only define a correlation between two variables when there is a one to one correspondence between values. This is not a brms limitation but how correlations work. So i am unsure how you would define this residual correlation in the first place.

Hm, all participants have observations for each event type, so it’s not like any participants don’t have a person-specific estimate of correct or error trial. I’m interested in the group-level effects, so I figured as long as everyone had at least a few good observations of each trial type it should be doable…