I’m quite new to Stan and have been struggling with a hierarchical regression lately. The model is essentially a standard hierarchical regression with two additions:

- random intercepts at two higher hierarchy levels, assumed to be normally distributed around 0, and
- a third hierarchy level whose errors are also spatially autocorrelated.
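Written out, the data-generating process I am assuming is the following (notation chosen to match the Stan code below; $r(i)$, $c(i)$, and $n(i)$ denote the region, client, and order of observation $i$):

$$
\begin{aligned}
y_i &\sim \mathcal{N}\!\left(\text{intercept} + X1_i\beta_1 + X2_i\beta_2 + X3_i\beta_3 + \alpha_{r(i)} + \alpha3_{c(i)} + \alpha2_{n(i)},\; \sigma_1^2\right) \\
\alpha2_n &\sim \mathcal{N}(0, \sigma_2^2), \qquad \alpha3_m \sim \mathcal{N}(0, \sigma_3^2) \\
(I - \lambda W)\,(\alpha - \mu_\alpha) &= \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, \tau^2 I), \qquad \mu_\alpha = Z\gamma + W Z \theta
\end{aligned}
$$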

I generated data according to this specification (8000 obs. at the lowest level) with parameters in the allowed range; X1, X2, X3, and Z are matrices of independent variables drawn uniformly from (-1, 1). The spatial model requires evaluating log(det(I - lambda*W)), which increased estimation time enormously, so I precomputed the log-determinant over a grid of lambda values and look up the appropriate entry based on the current value of lambda.
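For reference, the precomputation step looks roughly like this in Python (a sketch, not my exact code: the random row-standardized W and the grid of 2000 equally spaced lambda values starting at -1 are assumptions chosen to mirror the data block; `np.linalg.slogdet` is used for numerical stability):

```python
import numpy as np

rng = np.random.default_rng(1)
J = 200  # number of regions, as in the data block

# hypothetical row-standardized spatial weighting matrix W
W = rng.random((J, J))
np.fill_diagonal(W, 0.0)
W /= W.sum(axis=1, keepdims=True)

# grid of 2000 lambda values on [-1, 1), starting at -1
lambdas = np.linspace(-1.0, 1.0, 2000, endpoint=False)

# column 1: lambda grid point; column 2: log|det(I - lambda*W)|
logdet_A_precomp = np.column_stack([
    lambdas,
    [np.linalg.slogdet(np.eye(J) - lam * W)[1] for lam in lambdas],
])
```

The resulting 2000 x 2 matrix is what gets passed in as `logdet_A_precomp`.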

When I estimate the model, however, I get n_eff values of around 2 or 3 (even though there are 8000 obs.), and out of 4 chains, 3 converge to wrong values for gamma and theta (judging by the trace plots, only one chain, the violet one, recovers the true values):

```
data {
int<lower=1> P; // no. products = 8000
int<lower=1> N; // no. orders = 5000
int<lower=1> M; // no. clients = 2000
int<lower=1> J; // no. regions = 200
int<lower=1> K_p; //no. predictors product = 3
int<lower=1> K_n; //no. predictors orders = 3
int<lower=1> K_c; //no. predictors clients = 3
int<lower=1> K_r; //no. predictors regions = 3
matrix[P, K_p] X1; // product data
matrix[P, K_n] X2; // order data
matrix[P, K_c] X3; // client data
matrix[J, K_r] Z; // regional data
int grouping_regions[P];
int grouping_clients[P];
int grouping_orders[P];
matrix[J,J] W; // weighting matrix for local spatial influence
vector[P] y; // length = no. obs. on lowest level
matrix[2000,2] logdet_A_precomp; // col 1: grid of lambda values on (-1,1), starting at -1; col 2: log(det(I - lambda*W))
}
transformed data {
matrix[J,J] I_J;
I_J = diag_matrix(rep_vector(1.0,J));
}
parameters {
real intercept;
vector[K_p] beta1;
vector[K_n] beta2;
vector[K_c] beta3;
vector[K_r] gamma;
vector[K_r] theta;
real<lower=0> sigma1;
real<lower=0> sigma2;
real<lower=0> sigma3;
real<lower=0> tau;
real<lower=0, upper=1> lambda;
vector[J] alpha;
vector[M] alpha3;
vector[N] alpha2;
}
model {
vector[J] alpha_mu;
vector[P] y_mu;
matrix[J,J] A;
real logdet_A;
int index;
A = I_J - lambda*W;
alpha_mu = Z*gamma + W*Z*theta;
// vectorized; equivalent to looping over i in 1:P
y_mu = intercept + X1*beta1 + X2*beta2 + X3*beta3 + alpha[grouping_regions] + alpha3[grouping_clients] + alpha2[grouping_orders];
index = 1;
while (lambda > logdet_A_precomp[index,1]) { // find first grid point >= lambda
index = index + 1;
}
logdet_A = logdet_A_precomp[index,2];
target += logdet_A + normal_lpdf(A*alpha | A*alpha_mu, tau);
alpha3 ~ normal(0, sigma3);
alpha2 ~ normal(0, sigma2);
y ~ normal(y_mu, sigma1);
}
```
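To make the target increment explicit: `target += logdet_A + normal_lpdf(A*alpha | A*alpha_mu, tau)` is the change-of-variables form of a multivariate normal on alpha with mean alpha_mu and covariance tau^2 * (A'A)^{-1}. A small numerical sanity check (pure NumPy, with a hypothetical small row-standardized W) confirms that the two forms agree:

```python
import numpy as np

rng = np.random.default_rng(0)
J = 5
W = rng.random((J, J))
np.fill_diagonal(W, 0.0)
W /= W.sum(axis=1, keepdims=True)  # row-standardized weights

lam, tau = 0.4, 1.3
A = np.eye(J) - lam * W
alpha_mu = rng.normal(size=J)
alpha = rng.normal(size=J)

# form used in the Stan model block: log|det A| + sum of univariate normal lpdfs
resid = A @ (alpha - alpha_mu)
lp_stan = (np.linalg.slogdet(A)[1]
           - 0.5 * J * np.log(2 * np.pi * tau**2)
           - 0.5 * resid @ resid / tau**2)

# direct multivariate normal: alpha ~ N(alpha_mu, tau^2 * (A'A)^{-1})
cov = tau**2 * np.linalg.inv(A.T @ A)
diff = alpha - alpha_mu
lp_direct = (-0.5 * J * np.log(2 * np.pi)
             - 0.5 * np.linalg.slogdet(cov)[1]
             - 0.5 * diff @ np.linalg.solve(cov, diff))

print(np.isclose(lp_stan, lp_direct))  # prints True
```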

Do you have any ideas on how to make the model converge better? These are the results for now:

```
n_eff Rhat mean mcse sd 2.5% 25% 50% 75% 97.5%
intercept 3 3.6 3.2 0.1 0.2 2.8 3.1 3.2 3.4 3.6
beta1[1] 4 2.1 1.1 0 0.1 1 1 1 1.1 1.2
beta1[2] 3 2.2 -2 0 0.1 -2.1 -2 -2 -2 -1.8
beta1[3] 2 2.9 -0.7 0 0.1 -0.8 -0.8 -0.7 -0.7 -0.6
beta2[1] 2 3.9 -0.5 0.1 0.1 -0.6 -0.5 -0.5 -0.4 -0.3
beta2[2] 2 3.6 3 0.1 0.1 2.8 3 3 3.1 3.1
beta2[3] 3 2.7 1 0 0.1 0.9 1 1 1.1 1.2
beta3[1] 2 5.9 -2.8 0.1 0.1 -3.1 -2.9 -2.8 -2.7 -2.7
beta3[2] 4 2.6 2 0 0.1 1.8 1.9 2 2 2.1
beta3[3] 3 2.1 5.9 0 0.1 5.8 5.9 6 6 6.1
gamma[1] 2 54.6 1 2.7 3.8 -4.1 -0.6 0.7 2.4 6.8
gamma[2] 2 61.5 -0.3 4.6 6.6 -5.2 -4.4 -3.6 0.4 11.1
gamma[3] 2 39.1 1.8 2.3 3.3 -3.1 0.2 1.9 3.6 6.5
theta[1] 2 38.7 1.6 2.6 3.8 -3.3 -0.5 1.1 3.4 7.3
theta[2] 2 27.8 0.8 2.4 3.5 -2.4 -1.7 -0.5 2 6.7
theta[3] 2 47.6 1.5 3.1 4.3 -4.1 -0.8 0.8 3 8.2
sigma1 2 42.3 0.9 0.4 0.5 0.2 0.5 1 1.4 1.6
sigma2 2 31.8 0.8 0.2 0.3 0.3 0.7 1 1.1 1.1
sigma3 2 15.6 0.8 0.2 0.2 0.5 0.7 0.9 1 1.1
tau 2 13.2 2.9 2 2.9 0.5 0.5 1.6 4.1 8.7
lambda 2 8.2 0.6 0.2 0.2 0.2 0.4 0.6 0.8 0.8
```

Furthermore, in the end I will need a Bernoulli likelihood with a logit link for y, which I am not using yet because convergence was far worse in that case. Is it plausible that convergence with bernoulli_logit is harder to achieve than with normally distributed data?