# Comparing Wishart prior against LKJ prior

I am trying to fit a 3-dimensional multivariate normal model, assuming either a Wishart or an LKJ prior for the covariance matrix. I chose standard hyperparameters for each: nu = 4 and Sigma = diag(3) for the Wishart, and nu = 2 with independent half-Cauchy(0, 2.5) priors on the taus for the LKJ. It runs fine with the Wishart but always gives divergent-transitions-after-warmup warnings with the LKJ prior. Has anyone had a similar experience? I tried changing the parameters of the LKJ prior but still had no luck getting rid of the issue. What else should I try for the LKJ prior, and under which circumstances would it perform worse than the Wishart?


Can you share the models that you're using for comparison here? Generally we would expect the opposite result (fewer divergences under the LKJ).

```
data {
  int<lower=0> n;
  vector[3] y[n];
  real<lower=0> se[1, 3, n];
  vector<lower=-1,upper=1>[3] rw;
}
parameters {
  vector[3] mu[n];
  vector[3] beta;
  corr_matrix[3] Omega;
  vector<lower=0>[3] tau;
}
transformed parameters {
  matrix[3,3] S[n];
  for (j in 1:n) {
    S[j,1,1] = square(se[1,1,j]);
    S[j,1,2] = rw[1] * se[1,1,j] * se[1,2,j];
    S[j,1,3] = rw[2] * se[1,1,j] * se[1,3,j];
    S[j,2,1] = S[j,1,2];
    S[j,2,2] = square(se[1,2,j]);
    S[j,2,3] = rw[3] * se[1,2,j] * se[1,3,j];
    S[j,3,1] = S[j,1,3];
    S[j,3,2] = S[j,2,3];
    S[j,3,3] = square(se[1,3,j]);
  }
}
model {
  // Priors
  for (m in 1:n) {
    mu[m] ~ multi_normal(beta, quad_form_diag(Omega, tau));
  }
  Omega ~ lkj_corr(2);
  tau ~ cauchy(0, 2.5);

  beta ~ normal(0, 10);

  // Likelihood
  for (i in 1:n) {
    y[i] ~ multi_normal(mu[i], S[i]);
  }
}
```

For your LKJ question specifically, it's recommended to use the `cholesky_factor_corr` parameter type with the `lkj_corr_cholesky` distribution for more efficient and stable sampling. This would then be passed to `multi_normal_cholesky` instead of `multi_normal`.

But I do have some other suggestions for your model more generally. In your construction of the covariance matrix `S`, you're only using inputs passed as data, no parameters. This should be moved to the `transformed data` block so that it is only constructed once (at the start of sampling). By constructing it in the `transformed parameters` block, it is being re-created at every iteration.

Additionally, the `multi_normal_cholesky` likelihood will sample more efficiently (as it has analytic gradients), so you can create the covariance matrix and then take its Cholesky factor to pass to `multi_normal_cholesky`:

```
transformed data {
  matrix[3,3] L_S[n];
  for (j in 1:n) {
    matrix[3,3] S;

    S[1,1] = square(se[1,1,j]);
    S[1,2] = rw[1] * se[1,1,j] * se[1,2,j];
    S[1,3] = rw[2] * se[1,1,j] * se[1,3,j];
    S[2,1] = S[1,2];
    S[2,2] = square(se[1,2,j]);
    S[2,3] = rw[3] * se[1,2,j] * se[1,3,j];
    S[3,1] = S[1,3];
    S[3,2] = S[2,3];
    S[3,3] = square(se[1,3,j]);

    L_S[j] = cholesky_decompose(S);
  }
}
```
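To see what the Cholesky route buys you, here is a small numpy/scipy sketch (with made-up `se` and `rw` values standing in for one observation's data) showing that the factor reproduces the covariance matrix and that the Cholesky-based log density equals the usual multivariate normal log density:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical standard errors and known correlations for one observation,
# standing in for se[1, :, j] and rw from the Stan model's data block.
se = np.array([0.5, 0.8, 1.2])
rw = np.array([0.3, -0.2, 0.5])

# Build the 3x3 covariance matrix S exactly as the transformed data block does.
S = np.diag(se**2)
S[0, 1] = S[1, 0] = rw[0] * se[0] * se[1]
S[0, 2] = S[2, 0] = rw[1] * se[0] * se[2]
S[1, 2] = S[2, 1] = rw[2] * se[1] * se[2]

# Cholesky factor: L is lower-triangular with L @ L.T == S.
L = np.linalg.cholesky(S)
assert np.allclose(L @ L.T, S)

# multi_normal_cholesky evaluates the same density via the factor:
# a triangular solve replaces the full inverse, and
# log|S| = 2 * sum(log(diag(L))).
mu = np.zeros(3)
y = np.array([0.4, -0.1, 0.9])
z = np.linalg.solve(L, y - mu)
logp_chol = (-0.5 * z @ z
             - np.sum(np.log(np.diag(L)))
             - 1.5 * np.log(2 * np.pi))
assert np.isclose(logp_chol, multivariate_normal(mu, S).logpdf(y))
```

Since `S` here is data, the decomposition is done once per observation up front, and every leapfrog step only pays for the cheap triangular solve.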

Additionally, a likely source of your divergences is the construction of `mu`. Hierarchical models like these can result in some tricky posterior geometry that the sampler can have trouble with. To avoid this, it's recommended to use the non-centered parameterisation: construct the hierarchical parameters from a transformed standard normal rather than sampling them directly.
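A quick numpy sketch of why the non-centered parameterisation gives the same distribution (the `beta`, `tau`, and `Omega` values below are arbitrary, chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical hierarchical-prior parameters (not from any actual fit).
beta = np.array([1.0, -0.5, 2.0])
tau = np.array([0.5, 1.0, 1.5])
Omega = np.array([[1.0, 0.4, 0.2],
                  [0.4, 1.0, 0.3],
                  [0.2, 0.3, 1.0]])       # correlation matrix
L_Omega = np.linalg.cholesky(Omega)      # cholesky_factor_corr analogue
mu_Cov = np.diag(tau) @ L_Omega          # diag_pre_multiply(tau, L_Omega)

# Non-centered: sample standard normals, then shift and scale.
z = rng.standard_normal((100_000, 3))    # mu_raw ~ std_normal()
mu = beta + z @ mu_Cov.T                 # mu = beta + mu_Cov * mu_raw

# Implied distribution is multi_normal(beta, diag(tau) * Omega * diag(tau)).
target_cov = np.diag(tau) @ Omega @ np.diag(tau)
assert np.allclose(mu.mean(axis=0), beta, atol=0.02)
assert np.allclose(np.cov(mu.T), target_cov, atol=0.05)
```

The sampler then explores the well-behaved standard normal `mu_raw` instead of `mu` itself, which decouples the geometry of the hierarchical parameters from `tau` and `L_Omega`.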

Putting this all together, your model would look like:

```
data {
  int<lower=0> n;
  vector[3] y[n];
  real<lower=0> se[1, 3, n];
  vector<lower=-1,upper=1>[3] rw;
}

transformed data {
  matrix[3,3] L_S[n];
  for (j in 1:n) {
    matrix[3,3] S;

    S[1,1] = square(se[1,1,j]);
    S[1,2] = rw[1] * se[1,1,j] * se[1,2,j];
    S[1,3] = rw[2] * se[1,1,j] * se[1,3,j];
    S[2,1] = S[1,2];
    S[2,2] = square(se[1,2,j]);
    S[2,3] = rw[3] * se[1,2,j] * se[1,3,j];
    S[3,1] = S[1,3];
    S[3,2] = S[2,3];
    S[3,3] = square(se[1,3,j]);

    L_S[j] = cholesky_decompose(S);
  }
}

parameters {
  vector[3] mu_raw[n];
  vector[3] beta;
  cholesky_factor_corr[3] L_Omega;
  vector<lower=0>[3] tau;
}

transformed parameters {
  vector[3] mu[n];
  matrix[3,3] mu_Cov = diag_pre_multiply(tau, L_Omega);

  for (m in 1:n) {
    // Implies mu[m] ~ multi_normal_cholesky(beta, mu_Cov)
    mu[m] = beta + mu_Cov * mu_raw[m];
  }
}

model {
  // Priors
  for (m in 1:n) {
    mu_raw[m] ~ std_normal();
  }
  L_Omega ~ lkj_corr_cholesky(2);
  tau ~ cauchy(0, 2.5);

  beta ~ normal(0, 10);

  // Likelihood
  for (i in 1:n) {
    y[i] ~ multi_normal_cholesky(mu[i], L_S[i]);
  }
}
```

Thank you so very much for all the suggestions. I tried running the revised model but am still having convergence issues: the divergent-transition warnings show up even after I adjusted adapt_delta and max_treedepth. I wonder if the problem is coming from the model structure/data itself, that is, the model doesn't capture the inherent nature of the data? But that still wouldn't explain why the Wishart prior performs better than the LKJ prior. Is it possible to use an LKJ prior in BUGS or JAGS? I can't seem to find an example in BUGS or JAGS.

The exact same model in BUGS or JAGS will suffer the same pathology that Stan is warning you about now. The only difference is that BUGS and JAGS won't/cannot tell you about it, as their MCMC diagnostics are not as sensitive as those available for HMC in Stan. I would therefore recommend that you stick with the model as is, first simulate some data from it, and then try to recover the parameter values by fitting the model to the faux data. Or try a simpler version of the model first (I haven't looked at the details of the model to make specific recommendations).
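As a rough numpy sketch of that fake-data check (all "true" parameter values and the `se`/`rw` inputs below are arbitrary stand-ins; your real data would supply the latter):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200

# Arbitrary "true" values chosen for the simulation check.
beta_true = np.array([1.0, 0.0, -1.0])
tau_true = np.array([0.5, 0.5, 0.5])
Omega_true = np.eye(3)

# Hypothetical known measurement SDs and correlations (stand-ins for se, rw).
se = rng.uniform(0.2, 0.6, size=(n, 3))
rw = np.array([0.3, -0.2, 0.5])

# Hierarchical layer: mu[j] ~ multi_normal(beta, diag(tau) Omega diag(tau)).
mu_cov = np.diag(tau_true) @ Omega_true @ np.diag(tau_true)
mu = rng.multivariate_normal(beta_true, mu_cov, size=n)

# Observation layer: y[j] ~ multi_normal(mu[j], S[j]) with S built as in the model.
y = np.empty((n, 3))
for j in range(n):
    S = np.diag(se[j]**2)
    S[0, 1] = S[1, 0] = rw[0] * se[j, 0] * se[j, 1]
    S[0, 2] = S[2, 0] = rw[1] * se[j, 0] * se[j, 2]
    S[1, 2] = S[2, 1] = rw[2] * se[j, 1] * se[j, 2]
    y[j] = rng.multivariate_normal(mu[j], S)

# Sanity checks on the faux data before fitting the Stan model to it.
assert y.shape == (n, 3)
assert np.allclose(y.mean(axis=0), beta_true, atol=0.25)
```

If fitting the Stan model to this `y` (with the matching `se` and `rw` as data) recovers `beta_true` and `tau_true` without divergences, the model itself is fine and the issue lies with the real data; if it diverges even here, the parameterisation needs more work.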

Sorry, this is very general advice, but it usually applies everywhere and every time ;)