Unbalanced Multivariate Related-Outcomes Multilevel Model - trouble with different number measurements


I’m trying to fit a multivariate related-outcomes multilevel model. One purpose of such a model is to obtain covariances between random effects and covariances between residual errors across my two outcomes (repeated measurements: meas1, meas2) (if interested, see Baldwin et al. (2014)).

The model below works fine as long as my data have the same number of observations per participant for meas1 and meas2. (You can see that I’ve been trying to set it up to allow for unbalanced data though.) All participants have multiple observations for meas1 and meas2.

stan_code <- "
data { 
  int<lower=1> NOBS1; //number of obs for meas 1  
  int<lower=1> NOBS2; //number of obs for meas 2
  int<lower=1> NSUB; //number of subjects

  int<lower=1,upper=NSUB> id1[NOBS1]; //id membership for meas 1 
  int<lower=1,upper=NSUB> id2[NOBS2]; //id membership for meas 2
  int<lower=1> meas1_trl[NOBS1]; //trl membership for meas 1
  int<lower=1> meas2_trl[NOBS1]; //trl membership for meas 2

  vector[NOBS1] meas1; //scores for meas 1
  vector[NOBS2] meas2; //scores for meas 2

transformed data { 
  vector[NOBS1] meas1_idv = rep_vector(1,NOBS1); //random effs int meas 1
  vector[NOBS2] meas2_idv = rep_vector(1,NOBS2); //random effs int meas 2
  vector[2] Y[NOBS1]; //store meas1 and meas2 into same vector for multi_normal_cholesky
                      //only works b/c NOBS1=NOBS2

  for (n in 1:NOBS1) //only works b/c NOBS1=NOBS2
    Y[n] = [meas1[n], meas2[n]]';

parameters { 
  vector[2] pop_int; //population-level intercepts
  vector<lower=0>[2] sig_err; //population-level residual
  cholesky_factor_corr[2] Lrescor; //population-level residuals
  cholesky_factor_corr[4] L_omega; //random effects

  vector<lower=0>[4] L_sigma; //person-level standard deviations 
  matrix[4, NSUB] ind_groeffs; //unscaled person-level effects 

transformed parameters { 
  matrix[NSUB, 4] rs = (diag_pre_multiply(L_sigma, L_omega) * ind_groeffs)'; 
    //correlations for random effects
  vector[2] L_sigma_res = [sig_err[1], sig_err[2]]';

model { 
  //vector[NOBS1] mu1; 
  //vector[NOBS2] mu2;

  vector[2] Mu[NOBS1]; //only possible b/c NOBS1=NOBS2
  for (n in 1:NOBS1) { 
    Mu[n,1] = pop_int[1] + 
      (rs[,1][id1[n]]) * meas1_idv[n] + 
      (rs[,2][id1[n]]) * meas1_trl[n]; 
  for (n in 1:NOBS2) { 
    Mu[n,2] = pop_int[2] + 
      (rs[,3][id2[n]]) * meas2_idv[n] + 
      (rs[,4][id2[n]]) * meas2_trl[n]; 

  //set priors
  pop_int ~ normal(0,5);
  sig_err ~ cauchy(0,10);
  L_sigma ~ cauchy(0,10);
  Lrescor ~  lkj_corr_cholesky(2);
  L_omega ~  lkj_corr_cholesky(2);
  to_vector(ind_groeffs) ~ normal(0,1);
  Y ~ multi_normal_cholesky(Mu, diag_pre_multiply(L_sigma_res, Lrescor));

generated quantities { 
  matrix[4,4] omega_ind;
  matrix[4,4] sigma_ind;
  matrix[2,2] omega_res;
  matrix[2,2] sigma_res;
  omega_ind = multiply_lower_tri_self_transpose(L_omega);
  sigma_ind = quad_form_diag(omega_ind, L_sigma);
  omega_res = multiply_lower_tri_self_transpose(Lrescor);
  sigma_res = quad_form_diag(omega_res, L_sigma_res);

The problem is that I can’t figure out how to use the multi_normal_cholesky line without making Y contain an equal number of observations for both measurements.

Y ~ multi_normal_cholesky(Mu, diag_pre_multiply(L_sigma_res, Lrescor));

If I try to estimate them separately the errors separately using something like…

meas1 ~ normal(mu1,sig_err[1]);
meas2 ~ normal(mu2,sig_err[2]);

… then I’m unable to estimate the covariance between sig_err[1] and sig_err[2], which I need.

Does anyone have any suggestions about the code I need to use in order to get the covariances between residuals once I put in a different number of observations for meas1 and meas2?


If you want to use:

y ~ multi_normal(..., ...)

where you don’t have all the values of y, that sounds like a missing data problem to me.

And unless I’m being a goof and missing something, I think you can marginalize out your missing data and just do the thing in section 11.5, “Missing Multivariate Data” in the 2.17 manual.

Does that sound like what you’re doing?

Thanks for the response.

I saw that section of the manual and just re-read it. I’m not sure it applies, but I might be missing something.

(I should have explained this in my previous post.) These are repeated measurements taken during behavioral experiments, and whether a response is included as a measurement has to do with the accuracy of the participant’s response. So, if a participant has more of one type of measurement (say, meas1), that’s simply due to having better accuracy.

I don’t quite think of this as a missing-data issue, but I should have explained the nature of the measurements in my initial post. I can see how it was misleading. Or, am I missing something else about 11.5?


:) I’ve continued to ponder this. I just read footnote 2 (pasted below) in section 11.5, where it talks about this applying to situations in which you don’t necessarily want to impute data.

“Note that this is not the same as missing components of a multivariate predictor in a regression problem; in that case, you will need to represent the missing data as a parameter and impute missing values in order to feed them into the regression.”

I’m going to give this a shot and let you know how things play out.

Thanks for the help.

Is there a connection between meas1 and meas2 here? Like, is it simply that experiments were repeated and sometimes meas1 was taken, sometimes meas2, and sometimes both?

If there’s some sort of measurement process at play here, then we’d wanna work that into the model.

Or maybe writing things out like this looks more like your problem:

mu ~ multi_normal(..., ...);
meas1 ~ normal(mu[,1], sigma1);
meas2 ~ normal(mu[,2], sigma2); // There's no way I got the indexing totally right here

Where sigma1 and sigma2 are just measurement noise of some sort? So mu is some underlying latent variable, and you just made various numbers of measurements of meas1 and meas2 for each subject in each case?

This case works out to be the same thing as the marginalized missing data in 11.5 just parameterized a bit differently (and you wouldn’t want to do it like this cause sampling mu is unnecessary).

I do psychophsyiological experiments; I use EEG. All of my participants have EEG1 for meas1 and EEG2 for meas2. These measurements were recorded in the same paradigm, but represent different outcome variables of interest. I predict each outcome using the same predictors (id/person [my grouping variable] and trial in my case here). So, both measurements were always taken for each participant, but each participant does not have the same number of observations for meas1 and meas2.

If a participant commits more errors, then he might end up with more observations of meas1 than of meas2. If a participant commits fewer errors, then he might end up with more observations of meas2 than of meas1. However, it’s not an exact 1 to 1 relationship (adding one observation from meas1 does not mean subtracting one observations from meas2). There are a few other data processing factors that impact whether a measurement observation is included in the model.

That being said, I might still be able to use the following code you just posted

But without using multi_normal, I don’t understand how to get the covariance between sigma1 and sigma2 (more or less the measurement noise from each measurement), in your example.

Thanks for the continued help! I hope I’m not being a headache :)


I hope I’m not being wrong :P

This is just coming from saying you have data is generated from some correlated underlying process with additive independent Gaussian noise:

y = \mu + \xi


\mu \sim \mathcal{N}(q, \Sigma)


\xi \sim \mathcal{N}(0, \sigma^2 I)


y \sim \mathcal{N}(q, \Sigma + \sigma^2 I)

Something like that? Just start with the top thing with all your indices and then get to the bottom thing and see what pops out.

Ok, I’ll give it a shot. I should have some time to work through this later this afternoon. I’ll let you know how things play out.


Ok. I’m sorry. I thought I followed you this morning. But, I’m a bit lost. I attempted to read section 18 of the manual, but I had trouble making sense of that as well :) I’m much more of a basic, narrow user of Stan… this is my first foray in mutlivariate outcomes.

Do the formulas for the additive independent Gaussian noise assume that noise is the same for both measurement equations? Because, that would not be the case for my application (especially if I use an EEG and some behavioral measure).

If so, then I’ll try to build my model after the examples in section 18 of the manual.


The GP section is something different (is that the section 18 stuff?).

I think you have a situation like this:

y_i = \mu_{\text{group}(i)} + \xi_i

\mu \sim \mathcal{N}(q, \Sigma)

\xi_i \sim \mathcal{N}(0, \sigma_1)


p(y_i | \mu_{\text{group}(i)}) = \text{normal_pdf}(\mu_{\text{group}(i)} | \sigma_1)

Write out:
p(y, \mu) = p(\mu) \prod_i p(y_i | \mu_{\text{group}(i)})

And the thing you want to write your model is:

Which means you’d really like to be able to analytically integrate out \mu

p(y) = \int p(y, \mu) d\mu

But you might not need to depending on how many $\mu$s you have. You can just leave them in the model as parameters and sample them with Stan, and then if you ignore them in the posterior that’s the same as integrating over them.

I’m scared I confused things more but best thing to do is figure out what the generating process looks like. I’m thinking it’s close to what I wrote above, but maybe with more indices. Once you have that, wiggle the eqs around and see if you can get anything to pop out (edit: reworded to sound less annoying).

@Peter_Clayson, Did you mange to sort out the challenge? I am in a similar situation. I have 2 responses one is recorded every day while the other is recorded every 10 days for each participant. Unlike your case, the number of measurements for each response is fixed by design and want to fit the same model like the one you shared us.

Looking forward to hear from you

Hi @belay. Unfortunately, I wasn’t able to figure it out. I also had to put it on the back burner, because I had some pressing deadlines. Hopefully I’ll get back to sorting this out in the future.