# Multivariate priors for hierarchical models - Group-level predictors for prior scale

Section 1.13 in the Stan User’s Guide states (https://mc-stan.org/docs/stan-users-guide/multivariate-hierarchical-priors.html):

“As for the prior means, if there is information about the scale of variation of coefficients across groups, it should be incorporated into the prior for tau. For large numbers of exchangeable coefficients, the components of tau itself (perhaps excluding the intercept) may themselves be given a hierarchical prior.”

How would the Stan code in this example look, if extended to inlcude group-level predictors for the prior scale as well? Suppose we use the same transposed preditor matrix matrix[L, J] u that is also used for the prior mean.

The easiest thing to do is model the log scales hierarchically in exactly the same way. Then just transform to the scales using exp(). This itself is a form of seemingly unrelated regression, so you could also model covariance among the log scales.

So that’d be something like a vector \log(\tau) of scales, with a prior

\log(\tau) \sim \textrm{multiNormalCholesky}(...\textrm{some regression}..., L_{\Sigma^\tau}).
\tau = \exp(\log \tau).

And it’s elephants all the way down in that you now need to specify a covariance matrix \Sigma^\tau as a prior (it may just be diagonal, in which case you’d just use \textrm{normal} rather than \textrm{MultiNormalCholesky}).

1 Like

Thanks, that’s helpful. However, using the transposed predictor matrix matrix[L, J] to model “the scale of variation of coefficients across groups” means that the resulting \tau_k is not a vector, but a matrix or an array of vectors, as it varies over j, right? How does that work with diag_pre_multiply?

You can sometimes run into problems where you need a container to be in two different shapes for different operations and the only thing to do is to manually transpose, etc. But here, we have

(\textrm{diag}(u) \cdot v)^{\top} = v^{\top} \cdot \textrm{diag}(u)^\top = v^\top \cdot \textrm{diag}(u),

so we just diag_post_multiply instead.

Sorry if I’m missing something fundamental here. I still count on my fingers for matrices.

So, this runs, but plenty of divergences and almost all transitions hit maximum treedepth:

data {
int<lower=0> N;              // num individuals
int<lower=1> K;              // num ind predictors
int<lower=1> J;              // num groups
int<lower=1> L;              // num group predictors
array[N] int<lower=1, upper=J> jj;  // group for individual
matrix[N, K] x;              // individual predictors
matrix[L, J] u;              // group predictors
vector[N] y;                 // outcomes
}

parameters {
matrix[K, J] z;
cholesky_factor_corr[K] L_Omega;
cholesky_factor_corr[K] L_Omega_tau;
vector[K] log_tau;                         // prior scale logged
matrix[K, L] gamma;                        // group coeffs
matrix[K, L] zeta;                         //
real<lower=0> sigma;                       // prediction error scale
vector<lower=0, upper=pi() / 2>[K] L_log_tau_unif;
}

transformed parameters {
vector<lower=0>[K] L_log_tau = 2.5 * tan(L_log_tau_unif);
vector<lower=0>[K] tau = exp(log_tau);
matrix[K, J] beta = gamma * u + diag_pre_multiply(tau, L_Omega) * z;
}

model {
vector[N] mu;
array[J] vector[K] mu_tau;
matrix[K, K] L_Sigma_log_tau;
for(n in 1:N) {
mu[n] = x[n, ] * beta[, jj[n]];
}
to_vector(z) ~ std_normal();
L_Omega ~ lkj_corr_cholesky(2);
to_vector(gamma) ~ normal(0, 1);
for (j in 1:J) {
mu_tau[j] = zeta * to_vector(u'[j,]);
}
L_Sigma_log_tau = diag_pre_multiply(L_log_tau, L_Omega_tau);
to_vector(zeta) ~ normal(0, 1);
L_Omega_tau ~ lkj_corr_cholesky(2);
log_tau ~ multi_normal_cholesky(mu_tau, L_Sigma_log_tau);
y ~ normal(mu, sigma);
}


Figured that the non-centered parameterization pasted below might help get rid of the divergences etc. However, the line:

vector[K] log_tau = mu_tau + diag_pre_multiply(L_log_tau, L_Omega_tau) * z_2;


produces an “Ill-typed arguments supplied to infix operator +” error, which makes sense, as mu_tau is an array[] vector, not a vector.

data {
int<lower=0> N;                     // num individuals
int<lower=1> K;                     // num ind predictors
int<lower=1> J;                     // num groups
int<lower=1> L;                     // num group predictors
array[N] int<lower=1, upper=J> jj;  // group for individual
matrix[N, K] x;                     // individual predictors
matrix[L, J] u;                     // group predictors
array[J] vector[L] s;               // group predictors
vector[N] y;                        // outcomes
}

parameters {
matrix[K, J] z;
vector[K] z_2;
cholesky_factor_corr[K] L_Omega;
cholesky_factor_corr[K] L_Omega_tau;
matrix[K, L] gamma;                        // group coeffs - location
matrix[K, L] zeta;                         // group coeffs - scale
real<lower=0> sigma;                       // prediction error scale
vector<lower=0, upper=pi() / 2>[K] L_log_tau_unif;
}

transformed parameters {
vector<lower=0>[K] L_log_tau = 2.5 * tan(L_log_tau_unif);
array[J] vector[K] mu_tau;
for (j in 1:J) {
mu_tau[j] = zeta * s[j];
}
vector[K] log_tau = mu_tau + diag_pre_multiply(L_log_tau, L_Omega_tau) * z_2;
matrix[K, J] beta = gamma * u + diag_pre_multiply(exp(log_tau), L_Omega) * z;
}

model {
vector[N] mu;
matrix[K, K] L_Sigma_log_tau;
for(n in 1:N) {
mu[n] = x[n, ] * beta[, jj[n]];
}
to_vector(z) ~ std_normal();
z_2 ~ std_normal();
L_Omega ~ lkj_corr_cholesky(2);
L_Omega_tau ~ lkj_corr_cholesky(2);
to_vector(gamma) ~ normal(0, 0.5);
to_vector(zeta) ~ normal(0, 0.5);
y ~ normal(mu, sigma);
}


Appreciate any ideas on how to implement this/get rid of the divergences and treedepth warnings.