Hierarchical model stubborn after adding multivariate prior

As part of a demographic modeling exercise, I’ve switched from independent normal priors on 8 varying coefficients to an 8-variate normal prior. Since then, I get serious problems with divergences (some chains up to 50%), many iterations hitting the maximum treedepth (set at 10), high autocorrelation (> 0.5 at lag 20), and low E-BFMI (between 0.1 and 0.15). Playing around with initial values, seeds, adapt_delta, step sizes, or changing from the centered to the non-centered parameterization hasn’t helped so far. Interestingly, in almost all runs at least one of the chains does converge and gives very plausible posteriors (and, as expected, very close to the posteriors of the implementation with independent normal priors).

Since, besides moving from independent to dependent priors on the varying coefficients, I also changed a bit of syntax, I am wondering whether you can spot any immediate problems or inefficiencies in my code.

Some background:

  • The demographic model actually has 4 parameters (let’s call them “demographic parameters”), but these are fit separately by subgroup in each country/cluster. Thus we arrive at 8 parameters per cluster in my model.
  • There are 75 clusters/countries, and about 400 to 800 observations for each (~50,000 in total).

My personal best guess of the potential problem:

The positive correlations I expect in the 8x8 correlation matrix are quite extreme, especially between the same “demographic parameter” across the two subgroups (these cells should be > 0.7). At the same time, I expect negative correlations between the different “demographic parameters” of the same country and subgroup, because they can partially compensate for each other in the demographic process. So I have a lot of prior knowledge about the covariance structure (going back to the independent priors therefore also seems undesirable) but cannot really put it into the LKJ prior. Apparently, the “non-informative” LKJ(1) cannot deal with this situation well?
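
(For reference, the LKJ density is proportional to \det(\Omega)^{\eta - 1}, which is invariant to reordering the variables; it treats all off-diagonal cells exchangeably and therefore cannot encode cell-specific prior information.)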

data {
  int<lower=0> N; // Number of data points
  int<lower=1> K; // Number of clusters
  array[N] int<lower=0> x_chd; // Observed data
  array[N] int<lower=18, upper=49> x_age; // Vector of ages
  array[N] int<lower=0, upper=1> x_geq_h; // Dummies for subgroup
  array[N] int<lower=0, upper=1> x_geq_l;
  array[N] int<lower=1, upper=K> x_cluster; // Vector of clusters
}

transformed data{
  int max_age = max(x_age);
}

parameters {

  // country-specific parameters
  array[K] vector[8] alpha;
  // population average parameters
  vector[8] alpha_0;
  // Cholesky factor for the correlation matrix
  cholesky_factor_corr[8] L_Omega;
  // SD among countries of each parameter
  vector<lower=0>[8] L_std;

}

transformed parameters{
  
  array[K] real<lower=0> c1_h, c1_l, mu_h, mu_l, sigma1_h, sigma1_l, sigma2_h, sigma2_l;
  array[K, max_age] real<lower=0> f_h, f_l, lambda_h, lambda_l;
  
  // obtain the 8 demographic parameters from underlying components on log scale
  for (k in 1:K) {

    c1_h[k] = exp(alpha[k][1] + alpha_0[1]);
    c1_l[k] = exp(alpha[k][2] + alpha_0[2]);
    mu_h[k] = exp(alpha[k][3] + alpha_0[3]);
    mu_l[k] = exp(alpha[k][4] + alpha_0[4]);
    sigma1_h[k] = exp(alpha[k][5] + alpha_0[5]);
    sigma1_l[k] = exp(alpha[k][6] + alpha_0[6]);
    sigma2_h[k] = exp(alpha[k][7] + alpha_0[7]);
    sigma2_l[k] = exp(alpha[k][8] + alpha_0[8]);

    // compute age-dependent means
    for (j in 1:max_age) {
      f_h[k, j] = c1_h[k] * exp(-pow((j - mu_h[k]) / ((j <= mu_h[k]) * sigma1_h[k] + (j > mu_h[k]) * sigma2_h[k]), 2));
      f_l[k, j] = c1_l[k] * exp(-pow((j - mu_l[k]) / ((j <= mu_l[k]) * sigma1_l[k] + (j > mu_l[k]) * sigma2_l[k]), 2));
    }

    // sum up age-dependent means to obtain the expectation parameter for the Poisson likelihood
    lambda_h[k] = cumulative_sum(f_h[k]);
    lambda_l[k] = cumulative_sum(f_l[k]);
  }

}

model {

  // likelihood
  for (i in 1:N)
    x_chd[i] ~ poisson(lambda_h[x_cluster[i], x_age[i]] * x_geq_h[i] + lambda_l[x_cluster[i], x_age[i]] * x_geq_l[i]);
  
  // prior on population average parameters on the log scale, informed by prior literature
  alpha_0[1:2] ~ normal(log(0.1), 1);
  alpha_0[3:4] ~ normal(log(27.5), 1);
  alpha_0[5:8] ~ normal(log(7.5), 1);

  // Prior on Cholesky decomposition of correlation matrix
  L_Omega ~ lkj_corr_cholesky(1);
  // Prior on standard deviations for each parameter
  L_std ~ normal(0, 1);

  alpha ~ multi_normal_cholesky(rep_vector(0,8), diag_pre_multiply(L_std, L_Omega));

}

generated quantities {
  // correlation matrix
  matrix[8,8] Omega;
  Omega = multiply_lower_tri_self_transpose(L_Omega);
}
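
For reference, the non-centered variant I tried looks roughly like this (a sketch of only the parts that change; everything else stays as above):

parameters {
  // standard normal raw offsets instead of alpha
  array[K] vector[8] alpha_raw;
  vector[8] alpha_0;
  cholesky_factor_corr[8] L_Omega;
  vector<lower=0>[8] L_std;
}

transformed parameters {
  // alpha becomes a deterministic transform of the raw offsets
  array[K] vector[8] alpha;
  for (k in 1:K)
    alpha[k] = diag_pre_multiply(L_std, L_Omega) * alpha_raw[k];
  // ... demographic parameters computed from alpha as above ...
}

model {
  for (k in 1:K)
    alpha_raw[k] ~ std_normal();
  // ... remaining priors and likelihood as above, without the multi_normal_cholesky statement ...
}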

I happen to be comparing a bunch of correlation matrix parameterizations, so I have some code which may help. In the LKJ paper they parameterize the correlation matrix through a C-vine and mention putting different beta priors on the larger correlations.

In the code below, the beta distribution is shifted and scaled to be over (-1, 1). You can add more informative priors here. You can remove eta, as it is only there to show the equivalence with an LKJ prior.

functions{
  vector lb_ub_lp (vector y, real lb, real ub) {
    target += log(ub - lb) + log_inv_logit(y) + log1m_inv_logit(y);
    
    return lb + (ub - lb) * inv_logit(y);
  }
  
  real lb_ub_lp (real y, real lb, real ub) {
    target += log(ub - lb) + log_inv_logit(y) + log1m_inv_logit(y);
    
    return lb + (ub - lb) * inv_logit(y);
  }
  
  matrix corr_constrain_cvine_lp (vector col_one_raw, vector off_raw, real eta) {
    int K = num_elements(col_one_raw) + 1;
    vector[K - 1] z = lb_ub_lp(col_one_raw, -1, 1);
    matrix[K, K] C = identity_matrix(K);
    matrix[K, K] P;
    C[1, 1] = 1;
    C[2:K, 1] = z[1:K - 1];
    C[1, 2:K] = C[2:K, 1]';
    P = C;
    int cnt = 1;
    
    target += beta_lpdf(0.5 * (z + 1) | eta + 0.5 * (K - 1), eta + 0.5 * (K - 1));
    target += -0.5 * log1m(square(P[1, 2:K]));
    
    for (i in 2:K - 1) {
      for (j in i + 1:K) {
        P[i, j] = lb_ub_lp(off_raw[cnt], -1, 1);
        cnt += 1;
        real tem = P[i, j];
         target += -0.5 * i * log1m(P[i, j]^2);
         target += beta_lpdf(0.5 * (P[i, j] + 1) | eta + 0.5 * (K - 1), eta + 0.5 * (K - 1));
        int m = i;
        for (k in 1:i - 1) {
          m += -1;
          tem = P[m, i] * P[m, j] + tem * sqrt( (1 - P[m, i]^2) * (1 - P[m, j]^2) ); 
        }
        
        C[i, j] = tem;
        C[j, i] = C[i, j];
      }
    }
    return C;
  }
}
data {
  int<lower=2> K; // dimension of correlation matrix
  real<lower=0> eta;
}
transformed data {
  int k_choose_2 = (K * (K - 1)) %/% 2;
  int km1_choose_2 = ((K - 1) * (K - 2)) %/% 2;
}
parameters {
  vector[K - 1] col_one_raw;
  vector[km1_choose_2] off_raw;
}
transformed parameters {
  matrix[K, K] Omega = corr_constrain_cvine_lp(col_one_raw, off_raw, eta);
}
model {

}
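
To plug this into the hierarchical model above, the vine-constructed Omega would take the place of the lkj_corr_cholesky prior, roughly as in the untested sketch below (eta is passed as data or fixed; note that K in my demo is the dimension of the correlation matrix, i.e. 8 in your model, not your number of clusters):

parameters {
  vector[7] col_one_raw;   // 8 - 1 raw parameters for the first column
  vector[21] off_raw;      // (8 - 1) * (8 - 2) / 2 remaining raw parameters
  // ... alpha, alpha_0, L_std as before; L_Omega is no longer needed ...
}

transformed parameters {
  matrix[8, 8] Omega = corr_constrain_cvine_lp(col_one_raw, off_raw, eta);
  matrix[8, 8] L = cholesky_decompose(Omega);
  // ... demographic parameters computed from alpha as before ...
}

model {
  // the target += statements inside corr_constrain_cvine_lp act as the prior on Omega
  alpha ~ multi_normal_cholesky(rep_vector(0, 8), diag_pre_multiply(L_std, L));
  // ... remaining priors and likelihood as before ...
}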

This is very interesting, thank you. I’ll need to dive into this code. Stronger priors on the correlations could maybe indeed solve the issue.

Where in your code would you assign individual, cell-wise scaled beta priors?


In the LKJ paper you’ll see that when the \alpha and \beta parameters of the beta distribution are equal, it leads to a nice representation of the LKJ distribution as proportional to a power of the determinant of the correlation matrix. Note that the beta priors are on the partial correlations, not the Pearson correlations. The novelty of the LKJ paper is showing the bijection between the partial correlations and the correlation coefficients.

In this code, you can modify the beta distribution parameters. Again, note that these are on the partial correlations, which then flow through to the correlation coefficients, but I don’t have a good intuition for what changes to \alpha and \beta would induce exactly a specific correlation.

// this line influences the correlation parameters in the first column
target += beta_lpdf(0.5 * (z + 1) | eta + 0.5 * (K - 1), eta + 0.5 * (K - 1));

...
// this line influences all the other correlations
 target += beta_lpdf(0.5 * (P[i, j] + 1) | eta + 0.5 * (K - 1), eta + 0.5 * (K - 1));
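
To make these cell-wise, you could pass matrices of shape parameters in as data and index them inside the construction, along these lines (a_mat and b_mat are hypothetical K x K matrices of beta shape parameters; untested):

// first column of partial correlations (pairs (1, j)):
for (j in 2:K)
  target += beta_lpdf(0.5 * (z[j - 1] + 1) | a_mat[1, j], b_mat[1, j]);

// all the other partial correlations, inside the (i, j) loop:
target += beta_lpdf(0.5 * (P[i, j] + 1) | a_mat[i, j], b_mat[i, j]);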

Thanks! I will report back on my experiences.