Correct way of sampling correlated multidimensional vectors


#1

I’m still getting used to Stan’s syntax and I wanted to make sure that I’m doing this correctly.

I’m working with a hierarchical regression where I’m modeling the contribution of latent factors (\vec{\mu_a}, \vec{\mu_b}, etc.) to participants’ responses. If I want to sample a JxK vector of weights (\vec{\beta}) from a normal distribution given a K-d vector of means (\vec{\mu}) and a covariance matrix (\Sigma), e.g. \vec{\beta} \sim \mathcal{N}( \vec{\mu}, \Sigma), where \vec{\mu} is the sum of the latent factors, is the right way to do that to split the variables up like this (full model below). The model seems to be working but some of the CI are much larger than I was expecting from visual inspection so I wanted to check the model is doing what I think it is. Thanks a ton for the help (feedback on any part of the model is super welcome).

parameters { 
  vector[K] mu_a;
  vector[K] mu_b;
  vector[K] mu_c;
  matrix[K, J] z;
  vector<lower=0>[K] sd;
  cholesky_factor_corr[K] L;
}
transformed parameters { 
  matrix[J, K] beta = (diag_pre_multiply(sd, L) * z)';
}
model { 
  vector[K] mu[J];
  for (j in 1:J) {
    mu[j] = a[j] * mu_a + b[j] * mu_b + c[j] * mu_c;
    // a[j] is scaler, mu_a is a K-d vector
  }

  for (j in 1:J) {
    for (k in 1:K) {
      target += normal_lpdf(z[k,j] | mu[j][k], 1);
    }
  }
}

full model:

data { 
  int<lower=1> N;  // number of observations 
  int Y[N];  // response variable 
  int<lower=2> ncat;  // number of categories 
  
  int<lower=0> instructions[N]; // dummy coded var

  // data for stimuli
  int<lower=1> stimuli[N];
  int<lower=1> n_stimuli;
  int<lower=1> M_1;
  int<lower=1> NC_1;
  
  // data for participants
  int<lower=1> participants[N];
  int<lower=1> n_participants;
  int<lower=0> diagnosis_bysubject[n_participants]; // dummy coded
  real working_memory_errors_zscore_bysubject[n_participants];
  int<lower=1> M_2;
  int<lower=1> NC_2;
} 
transformed data { 
  vector[M_2] mu_participant_blank[n_participants];
  vector[N] mu_response_blank;

  for (j in 1:n_participants) {
    for (k in 1:M_2) {
      mu_participant_blank[j][k] = 0; // initialize 
    }
  }
  
  for (n in 1:N) {
    mu_response_blank[n] = 0; // initialize 
  }
} 
parameters { 

  ordered[ncat-1] temp_Intercept;  // temporary thresholds 

  vector[M_2] mu_p; // <mu_p0, mu_pr> values
  vector[M_2] mu_pd; // <mu_pd0, mu_pdr> values
  vector[M_2] mu_pw; // <mu_pw0, mu_pwr> values
  vector[M_2] mu_pdw; // <mu_pdw0, mu_pdwr> values
  
  matrix[M_1, n_stimuli] z_1;  // unscaled group-level effects
  vector<lower=0>[M_1] sd_1;  // group-level standard deviations
  cholesky_factor_corr[M_1] L_1;  // cholesky factor of correlation matrix

  matrix[M_2, n_participants] z_2;  // unscaled group-level effects
  vector<lower=0>[M_2] sd_2;  // group-level standard deviations
  cholesky_factor_corr[M_2] L_2;  // cholesky factor of correlation matrix

} 
transformed parameters { 

  // group-level effects 1 -- stimuli
  matrix[n_stimuli, M_1] beta_x = (diag_pre_multiply(sd_1, L_1) * z_1)'; // (n_stimuli x 2) matrix of beta weights 
  vector[n_stimuli] beta_x0 = beta_x[, 1]; // rand intercepts (n_stimuli x 1)
  vector[n_stimuli] beta_xr = beta_x[, 2]; // rand slopes (n_stimuli x 1)

  // group-level effects 2 -- participants
  matrix[n_participants, M_2] beta_p = (diag_pre_multiply(sd_2, L_2) * z_2)'; // (n_participants x 2) matrix of beta weights
  vector[n_participants] beta_p0 = beta_p[, 1]; // rand intercepts (n_participants x 1)
  vector[n_participants] beta_pr = beta_p[, 2]; // rand slopes (n_participants x 1)
  
} 
model { 
  vector[M_2] mu_participant[n_participants] = mu_participant_blank; // n-d vector of <mu_p0, mu_pr> values
  vector[N] mu_response = mu_response_blank;

  ////////////////////// update beta_x (stimuli)
 
  target += normal_lpdf(to_vector(z_1) | 0, 1);
  target += student_t_lpdf(sd_1 | 3, 0, 10)
    - 2 * student_t_lccdf(0 | 3, 0, 10); 
  target += lkj_corr_cholesky_lpdf(L_1 | 1); 

  ////////////////////// update beta_p (participants)

  for (j in 1:n_participants) {
    mu_participant[j] = 
      mu_p + 
      diagnosis_bysubject[j] * mu_pd + 
      working_memory_errors_zscore_bysubject[j] * mu_pw + 
      (diagnosis_bysubject[j] * working_memory_errors_zscore_bysubject[j]) * mu_pdw;
  }

  for (j in 1:n_participants) {
    for (k in 1:M_2) {
      target += normal_lpdf(z_2[k,j] | mu_participant[j][k], 1);
    }
  }

  target += student_t_lpdf(sd_2 | 3, 0, 10)
    - 2 * student_t_lccdf(0 | 3, 0, 10); 
  target += lkj_corr_cholesky_lpdf(L_2 | 1); 

  //////////////////////

  for (n in 1:N) { 
    mu_response[n] = 
      beta_x0[stimuli[n]] * 1 + // stimulus intercept
      beta_xr[stimuli[n]] * instructions[n] + // stimulus slope
      beta_p0[participants[n]] * 1 + // participant intercept
      beta_pr[participants[n]] * instructions[n]; // participant slope
  } 

  target += student_t_lpdf(temp_Intercept | 3, 0, 10); 

  // likelihood including all constants 
  for (n in 1:N) {
    target += ordered_logistic_lpmf(Y[n] | mu_response[n], temp_Intercept);
  }

} 
generated quantities { 

  corr_matrix[M_1] Cor_1 = multiply_lower_tri_self_transpose(L_1);
  vector<lower=-1,upper=1>[NC_1] cor_1;

  corr_matrix[M_2] Cor_2 = multiply_lower_tri_self_transpose(L_2);
  vector<lower=-1,upper=1>[NC_2] cor_2;

  // take only relevant parts of correlation matrix
  cor_1[1] = Cor_1[1,2]; 
  
  // take only relevant parts of correlation matrix
  cor_2[1] = Cor_2[1,2]; 
} 

#2

What do you think of this:

parameters { 
  vector[K] mu_a;
  vector[K] mu_b;
  vector[K] mu_c;
  matrix[K, J] z;
  cholesky_factor_corr[K] L;
}
transformed parameters { 
  matrix[K, J] beta = diag_pre_multiply(sd, L) * z;

  for(j in 1:J) {
    beta[, j] = beta[, j] + a[j] * mu_a + b[j] * mu_b + c[j] * mu_c;
  }
}
model {
  to_array_1d(z) ~ normal(0, 1);
}

Beta should have the prior:

\beta \sim \mathcal{N}(A \mu_a + B \mu_b + C \mu_c, \sigma^2 L L^T)

Where A is a diagonal matrix with the elements a[j], B is a diagonal matrix with the elements b[j], and C is a diagonal matrix with the elements c[j].

Watch out cause I transposed your beta in the transformed parameters. Is that what you were after?

I don’t think you want to do:

target += normal_lpdf(z[k,j] | mu[j][k], 1);

If you want to do the Cholesky parameterization here:

matrix[J, K] beta = (diag_pre_multiply(sd, L) * z)'

For this to be the prior you probably want, the zs need to be zero mean random variables.


#3

This is super helpful. Thank you. Did I get the beta prior right with multi_normal_cholesky()? (I read that well-formed stan models usually don’t use a multi_normal but thought I’d make sure I had the function before worrying about form.)

parameters { 
  vector[K] mu_a;
  vector[K] mu_b;
  vector[K] mu_c;

  matrix[K, J] z;
  vector<lower=0>[K] sd;
  cholesky_factor_corr[K] L;
} 
transformed parameters { 
  matrix[K, J] beta = diag_pre_multiply(sd, L) * z;

  for(j in 1:J) {  // Add mu weights to random effects
    beta[, j] = 
      beta[, j] +
      a[j] * mu_a + 
      b[j] * mu_b + 
      c[j] * mu_c
  }
} 
model { 
  to_array_1d(z) ~ normal(0, 1);  // participant-level random effects

  target += student_t_lpdf(sd | 3, 0, 10)
    - 2 * student_t_lccdf(0 | 3, 0, 10); 
  target += lkj_corr_cholesky_lpdf(L | 1); 

  for(j in 1:J) {
    beta[, j] ~ multi_normal_cholesky(
      diag_matrix(rep_vector(a[j], K)) * mu_a + 
      diag_matrix(rep_vector(b[j], K)) * mu_b + 
      diag_matrix(rep_vector(c[j], K)) * mu_c,
      // 
      diag_pre_multiply(sd, L));
  }
} 

and the full model

data { 
  int<lower=1> N;  // total number of observations 
  int<lower=1,upper=5> Y[N];  // response variable 
  int<lower=0,upper=1> instructions[N];

  // data for group-level effects of ID 1 -- stimuli
  int<lower=1> stimuli[N];
  int<lower=1> n_stimuli;
  int<lower=1> K_1;
  int<lower=1> NC_1;
  
  // data for group-level effects of ID 2
  int<lower=1> participants[N];
  int<lower=1> n_participants;
  int<lower=0,upper=1> diagnosis_bysubject[n_participants];
  real working_memory_errors_zscore_bysubject[n_participants];
  int<lower=1> K_2;
  int<lower=1> NC_2;
} 
parameters { 
  ordered[4] temp_Intercept;  // temporary thresholds 

  vector[K_2] mu_p; // <mu_pd0, mu_pdr> values
  vector[K_2] mu_pd; // <mu_pd0, mu_pdr> values
  vector[K_2] mu_pw; // <mu_pw0, mu_pwr> values
  vector[K_2] mu_pdw; // <mu_pdw0, mu_pdwr> values
  
  matrix[K_1, n_stimuli] z_1;  // unscaled group-level effects
  vector<lower=0>[K_1] sd_1;  // group-level standard deviations
  cholesky_factor_corr[K_1] L_1;  // cholesky factor of correlation matrix

  matrix[K_2, n_participants] z_2;  // unscaled group-level effects
  vector<lower=0>[K_2] sd_2;  // group-level standard deviations
  cholesky_factor_corr[K_2] L_2;  // cholesky factor of correlation matrix
} 
transformed parameters { 
  // group-level effects 1 -- stimuli
  matrix[K_1, n_stimuli] beta_x = diag_pre_multiply(sd_1, L_1) * z_1; // (n_stimuli x 2) item random effects part of matrix of beta weights
  vector[n_stimuli] beta_x0 = beta_x[1, ]'; // rand intercepts (n_stimuli x 1)
  vector[n_stimuli] beta_xr = beta_x[2, ]'; // rand slopes (n_stimuli x 1)

  // group-level effects 2 -- participants
  matrix[K_2, n_participants] beta_p = diag_pre_multiply(sd_2, L_2) * z_2; // (n_participants x 2) participant random effects part of matrix of beta weights
  vector[n_participants] beta_p0 = beta_p[1, ]'; // rand intercepts (n_participants x 1)
  vector[n_participants] beta_pr = beta_p[2, ]'; // rand slopes (n_participants x 1)

  // Add mu weights to participants' random effects
  for(j in 1:n_participants) {
    beta_p[, j] = 
      beta_p[, j] + // participant's random effects (intercept and slope, with same correlation as mu latent factors)
      1 * mu_p + 
      diagnosis_bysubject[j] * mu_pd + 
      working_memory_errors_zscore_bysubject[j] * mu_pw + 
      (diagnosis_bysubject[j] * working_memory_errors_zscore_bysubject[j]) * mu_pdw;
  }
} 
model { 
  vector[N] mu_response = rep_vector(0,N);

  ////////////////////// update beta_x
 
  target += normal_lpdf(to_vector(z_1) | 0, 1);  // item-level random effects (no latent factors)

  target += student_t_lpdf(sd_1 | 3, 0, 10)
    - 2 * student_t_lccdf(0 | 3, 0, 10); 
  target += lkj_corr_cholesky_lpdf(L_1 | 1); 


  ////////////////////// update beta_p

  to_array_1d(z_2) ~ normal(0, 1);  // participant-level random effects

  target += student_t_lpdf(sd_2 | 3, 0, 10)
    - 2 * student_t_lccdf(0 | 3, 0, 10); 
  target += lkj_corr_cholesky_lpdf(L_2 | 1); 

  //////////////////////

  for(j in 1:n_participants) {
    beta_p[, j] ~ multi_normal_cholesky(
      1 * mu_p + 
      diag_matrix(rep_vector(diagnosis_bysubject[j], K_2)) * mu_pd + 
      diag_matrix(rep_vector(working_memory_errors_zscore_bysubject[j], K_2)) * mu_pw + 
      diag_matrix(rep_vector(diagnosis_bysubject[j] * working_memory_errors_zscore_bysubject[j], K_2)) * mu_pdw,
      // 
      diag_pre_multiply(sd_2, L_2));
  }

  for (n in 1:N) { 
    mu_response[n] = 
      beta_x0[stimuli[n]] * 1 + // stimulus intercept
      beta_xr[stimuli[n]] * instructions[n] + // stimulus slope
      beta_p0[participants[n]] * 1 + // participant intercept
      beta_pr[participants[n]] * instructions[n]; // participant slope
  } 

  target += student_t_lpdf(temp_Intercept | 3, 0, 10); 

  // likelihood including all constants 
  for (n in 1:N) {
    target += ordered_logistic_lpmf(Y[n] | mu_response[n], temp_Intercept);
  }
} 
generated quantities { 

  corr_matrix[K_1] Cor_1 = multiply_lower_tri_self_transpose(L_1);
  vector<lower=-1,upper=1>[NC_1] cor_1;

  corr_matrix[K_2] Cor_2 = multiply_lower_tri_self_transpose(L_2);
  vector<lower=-1,upper=1>[NC_2] cor_2;

  // take only relevant parts of correlation matrix
  cor_1[1] = Cor_1[1,2]; 
  
  // take only relevant parts of correlation matrix
  cor_2[1] = Cor_2[1,2]; 
} 

#4

The good news is, you don’t need the multi_normal_cholesky at all :D.

parameters { 
  vector[K] mu_a;
  vector[K] mu_b;
  vector[K] mu_c;

  matrix[K, J] z;
  vector<lower=0>[K] sd;
  cholesky_factor_corr[K] L;
} 
transformed parameters { 
  matrix[K, J] beta = diag_pre_multiply(sd, L) * z;

  for(j in 1:J) {  // Add mu weights to random effects
    beta[, j] = 
      beta[, j] +
      a[j] * mu_a + 
      b[j] * mu_b + 
      c[j] * mu_c
  }
} 
model { 
  to_array_1d(z) ~ normal(0, 1);  // participant-level random effects
}

Implies a distribution on beta equivalent to:

parameters { 
  vector[K] mu_a;
  vector[K] mu_b;
  vector[K] mu_c;

  matrix[K, J] beta
  vector<lower=0>[K] sd;
  cholesky_factor_corr[K] L;
} 

model { for(j in 1:J) {
  matrix[K, J] mu;

  for(j in 1:J) {  // Add mu weights to random effects
    mu[, j] =  a[j] * mu_a +  b[j] * mu_b +  c[j] * mu_c;
  }

  beta ~ multi_normal_cholesky(mu, diag_pre_multiply(sd, L));
}

You do one thing or the other, depending on the situation. Sometimes one is easier to sample than the other, but probability-wise, you can answer the same questions with either one.

The difference is in the second model you are sampling beta directly, but in the first beta is a transformed parameter (and you’re sampling z). The way you transform z implies a prior distribution on beta (but you aren’t sampling it directly)

This is the difference in a centered (second) vs. non-centered (first) parameterization of a multivariate normal. Here’s the explanation of why it’s useful in the single variable case: https://mc-stan.org/users/documentation/case-studies/divergences_and_bias.html . There’s some discussion of the multivariate case in the Gaussian process section of the manual: https://mc-stan.org/docs/2_18/stan-users-guide/gaussian-processes-chapter.html

That make any sense? Also, I could have easily made a typo in these codes. I didn’t double check them. So if you see something weird ask about it.

Edit: Accidentally posted too early