Correct way of sampling correlated multidimensional vectors

I’m still getting used to Stan’s syntax and I wanted to make sure that I’m doing this correctly.

I’m working with a hierarchical regression where I’m modeling the contribution of latent factors (\vec{\mu_a}, \vec{\mu_b}, etc.) to participants’ responses. If I want to sample a JxK vector of weights (\vec{\beta}) from a normal distribution given a K-d vector of means (\vec{\mu}) and a covariance matrix (\Sigma), e.g. \vec{\beta} \sim \mathcal{N}( \vec{\mu}, \Sigma), where \vec{\mu} is the sum of the latent factors, is the right way to do that to split the variables up like this (full model below). The model seems to be working but some of the CI are much larger than I was expecting from visual inspection so I wanted to check the model is doing what I think it is. Thanks a ton for the help (feedback on any part of the model is super welcome).

parameters {
vector[K] mu_a;
vector[K] mu_b;
vector[K] mu_c;
matrix[K, J] z;
vector<lower=0>[K] sd;
cholesky_factor_corr[K] L;
}
transformed parameters {
matrix[J, K] beta = (diag_pre_multiply(sd, L) * z)';
}
model {
vector[K] mu[J];
for (j in 1:J) {
mu[j] = a[j] * mu_a + b[j] * mu_b + c[j] * mu_c;
// a[j] is scaler, mu_a is a K-d vector
}

for (j in 1:J) {
for (k in 1:K) {
target += normal_lpdf(z[k,j] | mu[j][k], 1);
}
}
}


full model:

data {
int<lower=1> N;  // number of observations
int Y[N];  // response variable
int<lower=2> ncat;  // number of categories

int<lower=0> instructions[N]; // dummy coded var

// data for stimuli
int<lower=1> stimuli[N];
int<lower=1> n_stimuli;
int<lower=1> M_1;
int<lower=1> NC_1;

// data for participants
int<lower=1> participants[N];
int<lower=1> n_participants;
int<lower=0> diagnosis_bysubject[n_participants]; // dummy coded
real working_memory_errors_zscore_bysubject[n_participants];
int<lower=1> M_2;
int<lower=1> NC_2;
}
transformed data {
vector[M_2] mu_participant_blank[n_participants];
vector[N] mu_response_blank;

for (j in 1:n_participants) {
for (k in 1:M_2) {
mu_participant_blank[j][k] = 0; // initialize
}
}

for (n in 1:N) {
mu_response_blank[n] = 0; // initialize
}
}
parameters {

ordered[ncat-1] temp_Intercept;  // temporary thresholds

vector[M_2] mu_p; // <mu_p0, mu_pr> values
vector[M_2] mu_pd; // <mu_pd0, mu_pdr> values
vector[M_2] mu_pw; // <mu_pw0, mu_pwr> values
vector[M_2] mu_pdw; // <mu_pdw0, mu_pdwr> values

matrix[M_1, n_stimuli] z_1;  // unscaled group-level effects
vector<lower=0>[M_1] sd_1;  // group-level standard deviations
cholesky_factor_corr[M_1] L_1;  // cholesky factor of correlation matrix

matrix[M_2, n_participants] z_2;  // unscaled group-level effects
vector<lower=0>[M_2] sd_2;  // group-level standard deviations
cholesky_factor_corr[M_2] L_2;  // cholesky factor of correlation matrix

}
transformed parameters {

// group-level effects 1 -- stimuli
matrix[n_stimuli, M_1] beta_x = (diag_pre_multiply(sd_1, L_1) * z_1)'; // (n_stimuli x 2) matrix of beta weights
vector[n_stimuli] beta_x0 = beta_x[, 1]; // rand intercepts (n_stimuli x 1)
vector[n_stimuli] beta_xr = beta_x[, 2]; // rand slopes (n_stimuli x 1)

// group-level effects 2 -- participants
matrix[n_participants, M_2] beta_p = (diag_pre_multiply(sd_2, L_2) * z_2)'; // (n_participants x 2) matrix of beta weights
vector[n_participants] beta_p0 = beta_p[, 1]; // rand intercepts (n_participants x 1)
vector[n_participants] beta_pr = beta_p[, 2]; // rand slopes (n_participants x 1)

}
model {
vector[M_2] mu_participant[n_participants] = mu_participant_blank; // n-d vector of <mu_p0, mu_pr> values
vector[N] mu_response = mu_response_blank;

////////////////////// update beta_x (stimuli)

target += normal_lpdf(to_vector(z_1) | 0, 1);
target += student_t_lpdf(sd_1 | 3, 0, 10)
- 2 * student_t_lccdf(0 | 3, 0, 10);
target += lkj_corr_cholesky_lpdf(L_1 | 1);

////////////////////// update beta_p (participants)

for (j in 1:n_participants) {
mu_participant[j] =
mu_p +
diagnosis_bysubject[j] * mu_pd +
working_memory_errors_zscore_bysubject[j] * mu_pw +
(diagnosis_bysubject[j] * working_memory_errors_zscore_bysubject[j]) * mu_pdw;
}

for (j in 1:n_participants) {
for (k in 1:M_2) {
target += normal_lpdf(z_2[k,j] | mu_participant[j][k], 1);
}
}

target += student_t_lpdf(sd_2 | 3, 0, 10)
- 2 * student_t_lccdf(0 | 3, 0, 10);
target += lkj_corr_cholesky_lpdf(L_2 | 1);

//////////////////////

for (n in 1:N) {
mu_response[n] =
beta_x0[stimuli[n]] * 1 + // stimulus intercept
beta_xr[stimuli[n]] * instructions[n] + // stimulus slope
beta_p0[participants[n]] * 1 + // participant intercept
beta_pr[participants[n]] * instructions[n]; // participant slope
}

target += student_t_lpdf(temp_Intercept | 3, 0, 10);

// likelihood including all constants
for (n in 1:N) {
target += ordered_logistic_lpmf(Y[n] | mu_response[n], temp_Intercept);
}

}
generated quantities {

corr_matrix[M_1] Cor_1 = multiply_lower_tri_self_transpose(L_1);
vector<lower=-1,upper=1>[NC_1] cor_1;

corr_matrix[M_2] Cor_2 = multiply_lower_tri_self_transpose(L_2);
vector<lower=-1,upper=1>[NC_2] cor_2;

// take only relevant parts of correlation matrix
cor_1[1] = Cor_1[1,2];

// take only relevant parts of correlation matrix
cor_2[1] = Cor_2[1,2];
}

1 Like

What do you think of this:

parameters {
vector[K] mu_a;
vector[K] mu_b;
vector[K] mu_c;
matrix[K, J] z;
cholesky_factor_corr[K] L;
}
transformed parameters {
matrix[K, J] beta = diag_pre_multiply(sd, L) * z;

for(j in 1:J) {
beta[, j] = beta[, j] + a[j] * mu_a + b[j] * mu_b + c[j] * mu_c;
}
}
model {
to_array_1d(z) ~ normal(0, 1);
}


Beta should have the prior:

\beta \sim \mathcal{N}(A \mu_a + B \mu_b + C \mu_c, \sigma^2 L L^T)

Where A is a diagonal matrix with the elements a[j], B is a diagonal matrix with the elements b[j], and C is a diagonal matrix with the elements c[j].

Watch out cause I transposed your beta in the transformed parameters. Is that what you were after?

I don’t think you want to do:

target += normal_lpdf(z[k,j] | mu[j][k], 1);


If you want to do the Cholesky parameterization here:

matrix[J, K] beta = (diag_pre_multiply(sd, L) * z)'


For this to be the prior you probably want, the zs need to be zero mean random variables.

1 Like

This is super helpful. Thank you. Did I get the beta prior right with multi_normal_cholesky()? (I read that well-formed stan models usually don’t use a multi_normal but thought I’d make sure I had the function before worrying about form.)

parameters {
vector[K] mu_a;
vector[K] mu_b;
vector[K] mu_c;

matrix[K, J] z;
vector<lower=0>[K] sd;
cholesky_factor_corr[K] L;
}
transformed parameters {
matrix[K, J] beta = diag_pre_multiply(sd, L) * z;

for(j in 1:J) {  // Add mu weights to random effects
beta[, j] =
beta[, j] +
a[j] * mu_a +
b[j] * mu_b +
c[j] * mu_c
}
}
model {
to_array_1d(z) ~ normal(0, 1);  // participant-level random effects

target += student_t_lpdf(sd | 3, 0, 10)
- 2 * student_t_lccdf(0 | 3, 0, 10);
target += lkj_corr_cholesky_lpdf(L | 1);

for(j in 1:J) {
beta[, j] ~ multi_normal_cholesky(
diag_matrix(rep_vector(a[j], K)) * mu_a +
diag_matrix(rep_vector(b[j], K)) * mu_b +
diag_matrix(rep_vector(c[j], K)) * mu_c,
//
diag_pre_multiply(sd, L));
}
}


and the full model

data {
int<lower=1> N;  // total number of observations
int<lower=1,upper=5> Y[N];  // response variable
int<lower=0,upper=1> instructions[N];

// data for group-level effects of ID 1 -- stimuli
int<lower=1> stimuli[N];
int<lower=1> n_stimuli;
int<lower=1> K_1;
int<lower=1> NC_1;

// data for group-level effects of ID 2
int<lower=1> participants[N];
int<lower=1> n_participants;
int<lower=0,upper=1> diagnosis_bysubject[n_participants];
real working_memory_errors_zscore_bysubject[n_participants];
int<lower=1> K_2;
int<lower=1> NC_2;
}
parameters {
ordered[4] temp_Intercept;  // temporary thresholds

vector[K_2] mu_p; // <mu_pd0, mu_pdr> values
vector[K_2] mu_pd; // <mu_pd0, mu_pdr> values
vector[K_2] mu_pw; // <mu_pw0, mu_pwr> values
vector[K_2] mu_pdw; // <mu_pdw0, mu_pdwr> values

matrix[K_1, n_stimuli] z_1;  // unscaled group-level effects
vector<lower=0>[K_1] sd_1;  // group-level standard deviations
cholesky_factor_corr[K_1] L_1;  // cholesky factor of correlation matrix

matrix[K_2, n_participants] z_2;  // unscaled group-level effects
vector<lower=0>[K_2] sd_2;  // group-level standard deviations
cholesky_factor_corr[K_2] L_2;  // cholesky factor of correlation matrix
}
transformed parameters {
// group-level effects 1 -- stimuli
matrix[K_1, n_stimuli] beta_x = diag_pre_multiply(sd_1, L_1) * z_1; // (n_stimuli x 2) item random effects part of matrix of beta weights
vector[n_stimuli] beta_x0 = beta_x[1, ]'; // rand intercepts (n_stimuli x 1)
vector[n_stimuli] beta_xr = beta_x[2, ]'; // rand slopes (n_stimuli x 1)

// group-level effects 2 -- participants
matrix[K_2, n_participants] beta_p = diag_pre_multiply(sd_2, L_2) * z_2; // (n_participants x 2) participant random effects part of matrix of beta weights
vector[n_participants] beta_p0 = beta_p[1, ]'; // rand intercepts (n_participants x 1)
vector[n_participants] beta_pr = beta_p[2, ]'; // rand slopes (n_participants x 1)

// Add mu weights to participants' random effects
for(j in 1:n_participants) {
beta_p[, j] =
beta_p[, j] + // participant's random effects (intercept and slope, with same correlation as mu latent factors)
1 * mu_p +
diagnosis_bysubject[j] * mu_pd +
working_memory_errors_zscore_bysubject[j] * mu_pw +
(diagnosis_bysubject[j] * working_memory_errors_zscore_bysubject[j]) * mu_pdw;
}
}
model {
vector[N] mu_response = rep_vector(0,N);

////////////////////// update beta_x

target += normal_lpdf(to_vector(z_1) | 0, 1);  // item-level random effects (no latent factors)

target += student_t_lpdf(sd_1 | 3, 0, 10)
- 2 * student_t_lccdf(0 | 3, 0, 10);
target += lkj_corr_cholesky_lpdf(L_1 | 1);

////////////////////// update beta_p

to_array_1d(z_2) ~ normal(0, 1);  // participant-level random effects

target += student_t_lpdf(sd_2 | 3, 0, 10)
- 2 * student_t_lccdf(0 | 3, 0, 10);
target += lkj_corr_cholesky_lpdf(L_2 | 1);

//////////////////////

for(j in 1:n_participants) {
beta_p[, j] ~ multi_normal_cholesky(
1 * mu_p +
diag_matrix(rep_vector(diagnosis_bysubject[j], K_2)) * mu_pd +
diag_matrix(rep_vector(working_memory_errors_zscore_bysubject[j], K_2)) * mu_pw +
diag_matrix(rep_vector(diagnosis_bysubject[j] * working_memory_errors_zscore_bysubject[j], K_2)) * mu_pdw,
//
diag_pre_multiply(sd_2, L_2));
}

for (n in 1:N) {
mu_response[n] =
beta_x0[stimuli[n]] * 1 + // stimulus intercept
beta_xr[stimuli[n]] * instructions[n] + // stimulus slope
beta_p0[participants[n]] * 1 + // participant intercept
beta_pr[participants[n]] * instructions[n]; // participant slope
}

target += student_t_lpdf(temp_Intercept | 3, 0, 10);

// likelihood including all constants
for (n in 1:N) {
target += ordered_logistic_lpmf(Y[n] | mu_response[n], temp_Intercept);
}
}
generated quantities {

corr_matrix[K_1] Cor_1 = multiply_lower_tri_self_transpose(L_1);
vector<lower=-1,upper=1>[NC_1] cor_1;

corr_matrix[K_2] Cor_2 = multiply_lower_tri_self_transpose(L_2);
vector<lower=-1,upper=1>[NC_2] cor_2;

// take only relevant parts of correlation matrix
cor_1[1] = Cor_1[1,2];

// take only relevant parts of correlation matrix
cor_2[1] = Cor_2[1,2];
}


The good news is, you don’t need the multi_normal_cholesky at all :D.

parameters {
vector[K] mu_a;
vector[K] mu_b;
vector[K] mu_c;

matrix[K, J] z;
vector<lower=0>[K] sd;
cholesky_factor_corr[K] L;
}
transformed parameters {
matrix[K, J] beta = diag_pre_multiply(sd, L) * z;

for(j in 1:J) {  // Add mu weights to random effects
beta[, j] =
beta[, j] +
a[j] * mu_a +
b[j] * mu_b +
c[j] * mu_c
}
}
model {
to_array_1d(z) ~ normal(0, 1);  // participant-level random effects
}


Implies a distribution on beta equivalent to:

parameters {
vector[K] mu_a;
vector[K] mu_b;
vector[K] mu_c;

matrix[K, J] beta
vector<lower=0>[K] sd;
cholesky_factor_corr[K] L;
}

model { for(j in 1:J) {
matrix[K, J] mu;

for(j in 1:J) {  // Add mu weights to random effects
mu[, j] =  a[j] * mu_a +  b[j] * mu_b +  c[j] * mu_c;
}

beta ~ multi_normal_cholesky(mu, diag_pre_multiply(sd, L));
}


You do one thing or the other, depending on the situation. Sometimes one is easier to sample than the other, but probability-wise, you can answer the same questions with either one.

The difference is in the second model you are sampling beta directly, but in the first beta is a transformed parameter (and you’re sampling z). The way you transform z implies a prior distribution on beta (but you aren’t sampling it directly)

This is the difference in a centered (second) vs. non-centered (first) parameterization of a multivariate normal. Here’s the explanation of why it’s useful in the single variable case: https://mc-stan.org/users/documentation/case-studies/divergences_and_bias.html . There’s some discussion of the multivariate case in the Gaussian process section of the manual: https://mc-stan.org/docs/2_18/stan-users-guide/gaussian-processes-chapter.html

That make any sense? Also, I could have easily made a typo in these codes. I didn’t double check them. So if you see something weird ask about it.

Edit: Accidentally posted too early

1 Like