# Comparing Wishart prior against LKJ prior

I am trying to fit a 3-dimensional multivariate normal model, assuming either a Wishart or an LKJ prior for the covariance matrix. I chose standard hyperparameters for each: nu = 4 and Sigma = diag(3) for the Wishart, and nu = 2 with independent half-Cauchy(0, 2.5) priors on the taus for the LKJ. It runs fine with the Wishart but always gives divergent-transitions-after-warmup warnings with the LKJ prior. Has anyone had a similar experience? I tried changing the parameters of the LKJ prior but still had no luck getting rid of the issue. What else should I try for the LKJ prior, and under which circumstances would it perform worse than the Wishart?


Can you share the models that you're using for comparison here? Generally we would expect the opposite result (fewer divergences under the LKJ).

```
data {
  int<lower=0> n;
  vector[3] y[n];
  real<lower=0> se[1, 3, n];
  vector<lower=-1,upper=1>[3] rw;
}
parameters {
  vector[3] mu[n];
  vector[3] beta;
  corr_matrix[3] Omega;
  vector<lower=0>[3] tau;
}
transformed parameters {
  matrix[3,3] S[n];
  for (j in 1:n) {
    S[j,1,1] = square(se[1,1,j]);
    S[j,1,2] = rw[1] * se[1,1,j] * se[1,2,j];
    S[j,1,3] = rw[2] * se[1,1,j] * se[1,3,j];
    S[j,2,1] = S[j,1,2];
    S[j,2,2] = square(se[1,2,j]);
    S[j,2,3] = rw[3] * se[1,2,j] * se[1,3,j];
    S[j,3,1] = S[j,1,3];
    S[j,3,2] = S[j,2,3];
    S[j,3,3] = square(se[1,3,j]);
  }
}
model {
  // Priors
  for (m in 1:n) {
    mu[m] ~ multi_normal(beta, quad_form_diag(Omega, tau));
  }
  Omega ~ lkj_corr(2);
  tau ~ cauchy(0, 2.5);

  beta ~ normal(0, 10);

  // Likelihood
  for (i in 1:n) {
    y[i] ~ multi_normal(mu[i], S[i]);
  }
}
```

For your LKJ question specifically, it's recommended to use the `cholesky_factor_corr` parameter type with the `lkj_corr_cholesky` distribution for more efficient and stable sampling. This would then be passed to `multi_normal_cholesky` instead of `multi_normal`.

But I do have some other suggestions for your model more generally. In your construction of the covariance matrix `S`, you're only using inputs passed as data, no parameters. This should be moved to the `transformed data` block so that it is only constructed once (at the start of sampling). By constructing it in the `transformed parameters` block, it is being re-created at every iteration.

Additionally, the `multi_normal_cholesky` likelihood will sample more efficiently (as it has analytic gradients), so you can create the covariance matrix and then take its Cholesky factor to pass to `multi_normal_cholesky`:

```
transformed data {
  matrix[3,3] L_S[n];
  for (j in 1:n) {
    matrix[3,3] S;

    S[1,1] = square(se[1,1,j]);
    S[1,2] = rw[1] * se[1,1,j] * se[1,2,j];
    S[1,3] = rw[2] * se[1,1,j] * se[1,3,j];
    S[2,1] = S[1,2];
    S[2,2] = square(se[1,2,j]);
    S[2,3] = rw[3] * se[1,2,j] * se[1,3,j];
    S[3,1] = S[1,3];
    S[3,2] = S[2,3];
    S[3,3] = square(se[1,3,j]);

    L_S[j] = cholesky_decompose(S);
  }
}
```
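To see what the Cholesky route buys you, here is a small numpy/scipy sketch (with made-up `se` and `rw` values standing in for one observation's data) showing that the factor reproduces the covariance matrix and that the Cholesky-based log density equals the usual multivariate normal log density:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical standard errors and known correlations for one observation,
# standing in for se[1, :, j] and rw from the Stan model's data block.
se = np.array([0.5, 0.8, 1.2])
rw = np.array([0.3, -0.2, 0.5])

# Build the 3x3 covariance matrix S exactly as the transformed data block does.
S = np.diag(se**2)
S[0, 1] = S[1, 0] = rw[0] * se[0] * se[1]
S[0, 2] = S[2, 0] = rw[1] * se[0] * se[2]
S[1, 2] = S[2, 1] = rw[2] * se[1] * se[2]

# Cholesky factor: L is lower-triangular with L @ L.T == S.
L = np.linalg.cholesky(S)
assert np.allclose(L @ L.T, S)

# multi_normal_cholesky evaluates the same density via the factor:
# a triangular solve replaces the full inverse, and
# log|S| = 2 * sum(log(diag(L))).
mu = np.zeros(3)
y = np.array([0.4, -0.1, 0.9])
z = np.linalg.solve(L, y - mu)
logp_chol = (-0.5 * z @ z
             - np.sum(np.log(np.diag(L)))
             - 1.5 * np.log(2 * np.pi))
assert np.isclose(logp_chol, multivariate_normal(mu, S).logpdf(y))
```

Since `S` here is data, the decomposition is done once per observation up front, and every leapfrog step only pays for the cheap triangular solve.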

Additionally, a likely source of your divergences is the construction of `mu`. Hierarchical models like these can result in some tricky posterior geometry that the sampler can have trouble with. To avoid this, it's recommended to use the non-centered parameterisation: construct the hierarchical parameters from a transformed standard normal rather than sampling them directly.
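A quick numpy sketch of why the non-centered parameterisation gives the same distribution (the `beta`, `tau`, and `Omega` values below are arbitrary, chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical hierarchical-prior parameters (not from any actual fit).
beta = np.array([1.0, -0.5, 2.0])
tau = np.array([0.5, 1.0, 1.5])
Omega = np.array([[1.0, 0.4, 0.2],
                  [0.4, 1.0, 0.3],
                  [0.2, 0.3, 1.0]])       # correlation matrix
L_Omega = np.linalg.cholesky(Omega)      # cholesky_factor_corr analogue
mu_Cov = np.diag(tau) @ L_Omega          # diag_pre_multiply(tau, L_Omega)

# Non-centered: sample standard normals, then shift and scale.
z = rng.standard_normal((100_000, 3))    # mu_raw ~ std_normal()
mu = beta + z @ mu_Cov.T                 # mu = beta + mu_Cov * mu_raw

# Implied distribution is multi_normal(beta, diag(tau) * Omega * diag(tau)).
target_cov = np.diag(tau) @ Omega @ np.diag(tau)
assert np.allclose(mu.mean(axis=0), beta, atol=0.02)
assert np.allclose(np.cov(mu.T), target_cov, atol=0.05)
```

The sampler then explores the well-behaved standard normal `mu_raw` instead of `mu` itself, which decouples the geometry of the hierarchical parameters from `tau` and `L_Omega`.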

Putting this all together, your model would look like:

```
data {
  int<lower=0> n;
  vector[3] y[n];
  real<lower=0> se[1, 3, n];
  vector<lower=-1,upper=1>[3] rw;
}

transformed data {
  matrix[3,3] L_S[n];
  for (j in 1:n) {
    matrix[3,3] S;

    S[1,1] = square(se[1,1,j]);
    S[1,2] = rw[1] * se[1,1,j] * se[1,2,j];
    S[1,3] = rw[2] * se[1,1,j] * se[1,3,j];
    S[2,1] = S[1,2];
    S[2,2] = square(se[1,2,j]);
    S[2,3] = rw[3] * se[1,2,j] * se[1,3,j];
    S[3,1] = S[1,3];
    S[3,2] = S[2,3];
    S[3,3] = square(se[1,3,j]);

    L_S[j] = cholesky_decompose(S);
  }
}

parameters {
  vector[3] mu_raw[n];
  vector[3] beta;
  cholesky_factor_corr[3] L_Omega;
  vector<lower=0>[3] tau;
}

transformed parameters {
  vector[3] mu[n];
  matrix[3,3] mu_Cov = diag_pre_multiply(tau, L_Omega);

  for (m in 1:n) {
    // Implies mu[m] ~ multi_normal_cholesky(beta, mu_Cov)
    mu[m] = beta + mu_Cov * mu_raw[m];
  }
}

model {
  // Priors
  for (m in 1:n) {
    mu_raw[m] ~ std_normal();
  }
  L_Omega ~ lkj_corr_cholesky(2);
  tau ~ cauchy(0, 2.5);

  beta ~ normal(0, 10);

  // Likelihood
  for (i in 1:n) {
    y[i] ~ multi_normal_cholesky(mu[i], L_S[i]);
  }
}
```

Thank you so very much for all the suggestions. I tried running the revised model but am still having convergence issues: the divergent-transition warnings show up even after I adjusted adapt_delta and max_treedepth. I wonder if the problem is coming from the model structure/data itself, that is, the model doesn't capture the inherent nature of the data? But that still wouldn't explain why the Wishart prior performs better than the LKJ prior. Is it possible to use an LKJ prior in BUGS or JAGS? I can't seem to find an example in BUGS or JAGS.

The exact same model in BUGS or JAGS will suffer the same pathology that Stan is warning you about now. The only difference is that BUGS and JAGS won't/cannot tell you about it, as their MCMC diagnostics are not as sensitive as those available for HMC in Stan. I would therefore recommend that you stick with the model as is, first simulate some data from it, and then try to recover the parameter values by fitting the model to the faux data. Or try a simpler version of the model first (I haven't looked at the details of the model to make specific recommendations).
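As a rough numpy sketch of that fake-data check (all "true" parameter values and the `se`/`rw` inputs below are arbitrary stand-ins; your real data would supply the latter):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200

# Arbitrary "true" values chosen for the simulation check.
beta_true = np.array([1.0, 0.0, -1.0])
tau_true = np.array([0.5, 0.5, 0.5])
Omega_true = np.eye(3)

# Hypothetical known measurement SDs and correlations (stand-ins for se, rw).
se = rng.uniform(0.2, 0.6, size=(n, 3))
rw = np.array([0.3, -0.2, 0.5])

# Hierarchical layer: mu[j] ~ multi_normal(beta, diag(tau) Omega diag(tau)).
mu_cov = np.diag(tau_true) @ Omega_true @ np.diag(tau_true)
mu = rng.multivariate_normal(beta_true, mu_cov, size=n)

# Observation layer: y[j] ~ multi_normal(mu[j], S[j]) with S built as in the model.
y = np.empty((n, 3))
for j in range(n):
    S = np.diag(se[j]**2)
    S[0, 1] = S[1, 0] = rw[0] * se[j, 0] * se[j, 1]
    S[0, 2] = S[2, 0] = rw[1] * se[j, 0] * se[j, 2]
    S[1, 2] = S[2, 1] = rw[2] * se[j, 1] * se[j, 2]
    y[j] = rng.multivariate_normal(mu[j], S)

# Sanity checks on the faux data before fitting the Stan model to it.
assert y.shape == (n, 3)
assert np.allclose(y.mean(axis=0), beta_true, atol=0.25)
```

If fitting the Stan model to this `y` (with the matching `se` and `rw` as data) recovers `beta_true` and `tau_true` without divergences, the model itself is fine and the issue lies with the real data; if it diverges even here, the parameterisation needs more work.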

Sorry, this is very general advice, but it usually applies everywhere and every time ;)