Multivariate priors for hierarchical models - Group-level predictors for prior scale

Section 1.13 in the Stan User’s Guide states (https://mc-stan.org/docs/stan-users-guide/multivariate-hierarchical-priors.html):

“As for the prior means, if there is information about the scale of variation of coefficients across groups, it should be incorporated into the prior for tau. For large numbers of exchangeable coefficients, the components of tau itself (perhaps excluding the intercept) may themselves be given a hierarchical prior.”

How would the Stan code in this example look if extended to include group-level predictors for the prior scale as well? Suppose we use the same transposed predictor matrix matrix[L, J] u that is also used for the prior mean.

The easiest thing to do is model the log scales hierarchically in exactly the same way. Then just transform to the scales using exp(). This itself is a form of seemingly unrelated regression, so you could also model covariance among the log scales.

So that’d be something like a vector \log(\tau) of scales, with a prior

\log \tau \sim \textrm{multiNormalCholesky}(\ldots\textrm{some regression}\ldots,\ L_{\Sigma^\tau}),
\tau = \exp(\log \tau).

And it’s elephants all the way down in that you now need to specify a prior on the covariance matrix \Sigma^\tau (it may just be diagonal, in which case you’d just use \textrm{normal} rather than \textrm{multiNormalCholesky}).
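A minimal Stan fragment of that suggestion might look like the following. This is only a hypothetical sketch: every name here (mu_log_tau, z_log_tau, sd_log_tau, L_Omega_tau) is a placeholder, and mu_log_tau stands in for whatever "...some regression..." on group-level predictors you choose.

```stan
parameters {
  vector[K] mu_log_tau;                 // stand-in for "...some regression..."
  vector[K] z_log_tau;                  // standardized residuals for log scales
  vector<lower=0>[K] sd_log_tau;        // marginal sds of the log scales
  cholesky_factor_corr[K] L_Omega_tau;  // correlation among the log scales
}
transformed parameters {
  // non-centered multivariate normal prior on the log scales
  vector[K] log_tau
    = mu_log_tau + diag_pre_multiply(sd_log_tau, L_Omega_tau) * z_log_tau;
  vector<lower=0>[K] tau = exp(log_tau);  // transform back to positive scales
}
```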


Thanks, that’s helpful. However, using the transposed predictor matrix matrix[L, J] u to model “the scale of variation of coefficients across groups” means that the resulting \tau is not a vector but a matrix (or an array of vectors), as it varies over j, right? How does that work with diag_pre_multiply?

You can sometimes run into problems where you need a container to be in two different shapes for different operations and the only thing to do is to manually transpose, etc. But here, we have

(\textrm{diag}(u) \cdot v)^{\top} = v^{\top} \cdot \textrm{diag}(u)^\top = v^\top \cdot \textrm{diag}(u),

so we can just use diag_post_multiply instead.
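Concretely, for a vector u of size K and a matrix V of size K x J, the two results below are transposes of each other (an illustrative fragment, not from the model above):

```stan
matrix[K, J] a = diag_pre_multiply(u, V);    // diag(u) * V
matrix[J, K] b = diag_post_multiply(V', u);  // V' * diag(u), i.e. b == a'
```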

Sorry if I’m missing something fundamental here. I still count on my fingers for matrices.

So, this runs, but with plenty of divergences, and almost all transitions hit the maximum treedepth:

data {
  int<lower=0> N;              // num individuals
  int<lower=1> K;              // num ind predictors
  int<lower=1> J;              // num groups
  int<lower=1> L;              // num group predictors
  array[N] int<lower=1, upper=J> jj;  // group for individual
  matrix[N, K] x;              // individual predictors
  matrix[L, J] u;              // group predictors
  vector[N] y;                 // outcomes
}

parameters {
  matrix[K, J] z;
  cholesky_factor_corr[K] L_Omega;
  cholesky_factor_corr[K] L_Omega_tau;
  vector[K] log_tau;                         // prior scale logged
  matrix[K, L] gamma;                        // group coeffs
  matrix[K, L] zeta;                         // group coeffs - scale
  real<lower=0> sigma;                       // prediction error scale
  vector<lower=0, upper=pi() / 2>[K] L_log_tau_unif;
}

transformed parameters {
  vector<lower=0>[K] L_log_tau = 2.5 * tan(L_log_tau_unif);
  vector<lower=0>[K] tau = exp(log_tau);
  matrix[K, J] beta = gamma * u + diag_pre_multiply(tau, L_Omega) * z;
}

model {
  vector[N] mu;
  array[J] vector[K] mu_tau;
  matrix[K, K] L_Sigma_log_tau;
  for(n in 1:N) {
    mu[n] = x[n, ] * beta[, jj[n]];
  }
  to_vector(z) ~ std_normal();
  L_Omega ~ lkj_corr_cholesky(2);
  to_vector(gamma) ~ normal(0, 1);
  for (j in 1:J) {
    mu_tau[j] = zeta * u[, j];  // column j of u holds group j's predictors
  }
  L_Sigma_log_tau = diag_pre_multiply(L_log_tau, L_Omega_tau);
  to_vector(zeta) ~ normal(0, 1);
  L_Omega_tau ~ lkj_corr_cholesky(2);
  log_tau ~ multi_normal_cholesky(mu_tau, L_Sigma_log_tau);
  y ~ normal(mu, sigma);
}

I figured that the non-centered parameterization pasted below might help get rid of the divergences etc. However, the line:

vector[K] log_tau = mu_tau + diag_pre_multiply(L_log_tau, L_Omega_tau) * z_2;

produces an “Ill-typed arguments supplied to infix operator +” error, which makes sense, as mu_tau is an array[J] vector[K], not a vector.

data {
  int<lower=0> N;                     // num individuals
  int<lower=1> K;                     // num ind predictors
  int<lower=1> J;                     // num groups
  int<lower=1> L;                     // num group predictors
  array[N] int<lower=1, upper=J> jj;  // group for individual
  matrix[N, K] x;                     // individual predictors
  matrix[L, J] u;                     // group predictors
  array[J] vector[L] s;               // group predictors, one vector per group
  vector[N] y;                        // outcomes
}

parameters {
  matrix[K, J] z;
  vector[K] z_2;
  cholesky_factor_corr[K] L_Omega;
  cholesky_factor_corr[K] L_Omega_tau;
  matrix[K, L] gamma;                        // group coeffs - location
  matrix[K, L] zeta;                         // group coeffs - scale
  real<lower=0> sigma;                       // prediction error scale
  vector<lower=0, upper=pi() / 2>[K] L_log_tau_unif;
}

transformed parameters {
  vector<lower=0>[K] L_log_tau = 2.5 * tan(L_log_tau_unif);
  array[J] vector[K] mu_tau;
  for (j in 1:J) {
    mu_tau[j] = zeta * s[j];
  }
  vector[K] log_tau = mu_tau + diag_pre_multiply(L_log_tau, L_Omega_tau) * z_2;
  matrix[K, J] beta = gamma * u + diag_pre_multiply(exp(log_tau), L_Omega) * z;
}

model {
  vector[N] mu;
  matrix[K, K] L_Sigma_log_tau;
  for(n in 1:N) {
    mu[n] = x[n, ] * beta[, jj[n]];
  }
  to_vector(z) ~ std_normal();
  z_2 ~ std_normal();
  L_Omega ~ lkj_corr_cholesky(2);
  L_Omega_tau ~ lkj_corr_cholesky(2);
  to_vector(gamma) ~ normal(0, 0.5);
  to_vector(zeta) ~ normal(0, 0.5);
  y ~ normal(mu, sigma);
}

I’d appreciate any ideas on how to implement this and get rid of the divergences and treedepth warnings.
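One possible way to make the shapes consistent, sketched and untested on the radon data: since the scales now vary by group, declare log_tau as a matrix[K, J] (one column of log scales per group), widen z_2 to matrix[K, J] so each group gets its own residual, and build beta column by column.

```stan
transformed parameters {
  // column j holds group j's log scales; assumes z_2 is matrix[K, J]
  matrix[K, J] log_tau
    = zeta * u + diag_pre_multiply(L_log_tau, L_Omega_tau) * z_2;
  matrix[K, J] beta;
  for (j in 1:J) {
    // per-group scaling: diag(exp(log_tau[, j])) * L_Omega * z[, j]
    beta[, j] = gamma * u[, j] + exp(log_tau[, j]) .* (L_Omega * z[, j]);
  }
}
```

Note that zeta * u here plays the same role as the per-group mu_tau[j] = zeta * s[j] loop above, just computed for all groups at once.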
radon_dat.RData (4.0 KB)
Data attached, also available here https://github.com/stan-dev/example-models/blob/master/ARM/Ch.17/radon_multi_varying_coef_17.2.data.R