Hierarchical spatial regression: very low n_eff, chains stuck at wrong values

I’m quite new to Stan and I’ve been struggling with a hierarchical regression lately. The model is basically a normal hierarchical regression that additionally has:

  • two random intercepts for two higher hierarchy levels, assumed to be normally distributed around 0
  • a third hierarchy level, also with spatially autocorrelated errors.

I generated data according to this specification (8000 obs. at the lowest level) with parameters in the allowed range, and X1/X2/X3/Z are matrices of independent variables uniformly distributed between -1 and 1. The spatial model requires computing a log-determinant, which increased estimation time enormously, so I precomputed its values on a grid of lambda and, in the model block, look up the entry corresponding to the current lambda.
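For reference, here is a minimal sketch of how such a grid could instead be built inside Stan's transformed data block (the grid size of 2000 and the spacing are my assumptions; computing the table outside Stan and passing it in as data, as the model below does, is equally valid):

transformed data {
  // Hypothetical alternative: tabulate log(det(I - lambda*W)) on a fixed grid of
  // lambda values so that the model block only needs a lookup.
  // Assumes J and W are declared in the data block.
  matrix[2000, 2] logdet_A_grid;
  matrix[J, J] I_J;
  I_J = diag_matrix(rep_vector(1.0, J));
  for (g in 1:2000) {
    real lam;
    lam = -1 + 2.0 * (g - 1) / 2000;  // grid from -1 up to just below 1
    logdet_A_grid[g, 1] = lam;        // column 1: lambda grid point
    logdet_A_grid[g, 2] = log_determinant(I_J - lam * W);  // column 2: log-determinant (may fail where I - lambda*W is singular)
  }
}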

When I estimate the model, however, I get n_eff values around 2 or 3 (despite the 8000 observations), and out of 4 chains, 3 converged to the wrong values for gamma and theta (in the traceplot, only the violet chain is correct):

data {
  int<lower=1> P; // no. products = 8000
  int<lower=1> N; // no. orders = 5000
  int<lower=1> M; // no. clients = 2000
  int<lower=1> J; // no. regions = 200
  int<lower=1> K_p; //no. predictors product = 3
  int<lower=1> K_n; //no. predictors orders = 3
  int<lower=1> K_c; //no. predictors clients = 3
  int<lower=1> K_r; //no. predictors regions = 3
  matrix[P, K_p] X1; // product data
  matrix[P, K_n] X2; // order data
  matrix[P, K_c] X3; // client data
  matrix[J, K_r] Z; // regional data
  int grouping_regions[P];
  int grouping_clients[P];
  int grouping_orders[P];
  matrix[J,J] W; // weighting matrix for local spatial influence
  vector[P] y; // length = no. obs. on lowest level
  
  matrix[2000,2] logdet_A_precomp; // column 1: lambda grid over (-1,1), starting at -1; column 2: the corresponding log(det(A)) values
}

transformed data {
  matrix[J,J] I_J;
  I_J = diag_matrix(rep_vector(1.0,J));
}

parameters {
  real intercept;
  vector[K_p] beta1;
  vector[K_n] beta2;
  vector[K_c] beta3;
  vector[K_r] gamma;
  vector[K_r] theta;
  real<lower=0> sigma1;
  real<lower=0> sigma2;
  real<lower=0> sigma3;
  real<lower=0> tau;
  real<lower=0, upper=1> lambda;
  
  vector[J] alpha;
  vector[M] alpha3;
  vector[N] alpha2;
}

model {
  vector[J] alpha_mu;
  vector[P] y_mu;
  
  matrix[J,J] A;
  real logdet_A;
  int index;
  
  A = I_J - lambda*W;
  alpha_mu = Z*gamma + W*Z*theta;
  for (i in 1:P) {
    y_mu[i] = intercept + X1[i]*beta1 + X2[i]*beta2 + X3[i]*beta3 + alpha[grouping_regions[i]] + alpha3[grouping_clients[i]] + alpha2[grouping_orders[i]];
  }
  
  // look up the precomputed log(det(A)) at the grid point just above the current lambda
  index = 1;
  while (lambda > logdet_A_precomp[index,1]) {
      index = index + 1;
  }
  logdet_A = logdet_A_precomp[index,2];
  
  target += logdet_A + normal_lpdf(A*alpha | A*alpha_mu, tau);
  alpha3 ~ normal(0, sigma3);
  alpha2 ~ normal(0, sigma2);
  y ~ normal(y_mu, sigma1);
}

Do you have any ideas on how to make the model converge better? These are the results for now:

    	n_eff	Rhat	mean	mcse	sd	2.5%	25%	50%	75%	97.5%
intercept	3	3.6	3.2	0.1	0.2	2.8	3.1	3.2	3.4	3.6
beta1[1]	4	2.1	1.1	0	0.1	1	1	1	1.1	1.2
beta1[2]	3	2.2	-2	0	0.1	-2.1	-2	-2	-2	-1.8
beta1[3]	2	2.9	-0.7	0	0.1	-0.8	-0.8	-0.7	-0.7	-0.6
beta2[1]	2	3.9	-0.5	0.1	0.1	-0.6	-0.5	-0.5	-0.4	-0.3
beta2[2]	2	3.6	3	0.1	0.1	2.8	3	3	3.1	3.1
beta2[3]	3	2.7	1	0	0.1	0.9	1	1	1.1	1.2
beta3[1]	2	5.9	-2.8	0.1	0.1	-3.1	-2.9	-2.8	-2.7	-2.7
beta3[2]	4	2.6	2	0	0.1	1.8	1.9	2	2	2.1
beta3[3]	3	2.1	5.9	0	0.1	5.8	5.9	6	6	6.1
gamma[1]	2	54.6	1	2.7	3.8	-4.1	-0.6	0.7	2.4	6.8
gamma[2]	2	61.5	-0.3	4.6	6.6	-5.2	-4.4	-3.6	0.4	11.1
gamma[3]	2	39.1	1.8	2.3	3.3	-3.1	0.2	1.9	3.6	6.5
theta[1]	2	38.7	1.6	2.6	3.8	-3.3	-0.5	1.1	3.4	7.3
theta[2]	2	27.8	0.8	2.4	3.5	-2.4	-1.7	-0.5	2	6.7
theta[3]	2	47.6	1.5	3.1	4.3	-4.1	-0.8	0.8	3	8.2
sigma1	2	42.3	0.9	0.4	0.5	0.2	0.5	1	1.4	1.6
sigma2	2	31.8	0.8	0.2	0.3	0.3	0.7	1	1.1	1.1
sigma3	2	15.6	0.8	0.2	0.2	0.5	0.7	0.9	1	1.1
tau	2	13.2	2.9	2	2.9	0.5	0.5	1.6	4.1	8.7
lambda	2	8.2	0.6	0.2	0.2	0.2	0.4	0.6	0.8	0.8

Furthermore, in the end, I would need to use a bernoulli/logit distribution for Y, which I currently do not because convergence was far worse in that case. Is it plausible that convergence for bernoulli_logit is harder to achieve than for normally distributed data?
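For context, the change I have in mind is roughly the following sketch (only the likelihood swap is shown; eta stands in for the linear predictor y_mu, and the hierarchy is omitted):

data {
  int<lower=1> P;
  int<lower=0, upper=1> y[P];  // y becomes a binary integer array instead of vector[P] y
  vector[P] eta;               // stand-in for the linear predictor y_mu
}
model {
  y ~ bernoulli_logit(eta);    // replaces y ~ normal(y_mu, sigma1); sigma1 drops out
}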

I don’t really understand your model, but there are two plausible causes:

  1. You have an error in the implementation of your model. A model this large is hard to debug, however. Instead of starting with the full model, it is often faster to start with a simple model, check that it works (i.e. recovers parameters from simulated data), and then add layers of complexity one at a time. In your case this might mean starting with only a single product, single client, and single region; once that works, allow multiple regions, then multiple products, clients, the hierarchical levels, etc.
  2. The model is actually multimodal (there are multiple local maxima in the posterior density). This breaks sampling, and you need to figure out how to get rid of the unwanted modes, either by narrowing the priors or by reformulating the model. There is no general solution; you have to think hard about your model and understand it. The trivariate plot in ShinyStan (with log density as one axis) may show the multimodality quite well.

It may also make sense to use a non-centered parameterization (see http://mc-stan.org/users/documentation/case-studies/divergences_and_bias.html and many posts on this forum).
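For illustration, here is a minimal sketch of what non-centering one of the random intercepts could look like (variable names mirror your model, but the surrounding blocks are stripped down, so this is not a drop-in replacement):

data {
  int<lower=1> P;
  int<lower=1> M;
  int grouping_clients[P];
  vector[P] y;
}
parameters {
  real intercept;
  real<lower=0> sigma1;
  real<lower=0> sigma3;
  vector[M] alpha3_raw;          // standardized client effects
}
transformed parameters {
  vector[M] alpha3;
  alpha3 = sigma3 * alpha3_raw;  // implies alpha3 ~ normal(0, sigma3)
}
model {
  alpha3_raw ~ normal(0, 1);     // non-centered parameterization
  y ~ normal(intercept + alpha3[grouping_clients], sigma1);
}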


Thank you for your advice! I actually simplified the model quite a bit and formulated it more in terms of the priors (i.e. u ~ normal(0, sigma)) instead of in terms of the resulting normal posterior (i.e. u_mu = Z*gamma + W*Z*theta and target += log(det(A)) + normal_lpdf(A*u | A*u_mu, tau)):

model {
  vector[P] y_mu;
  vector[J] u_corr;
  
  matrix[J,J] A;
  
  A = I_J - lambda*W;
  u_corr = A \ u;   // spatially smoothed regional effects: (I - lambda*W)^{-1} * u
  y_mu = intercept + X1*beta1 + X2*beta2 + X3*beta3 + Z*gamma + WZ*theta + u_corr[grouping_regions] + epsilon3[grouping_clients] + epsilon2[grouping_orders];
  
  y ~ normal(y_mu, sigma1);
  epsilon2 ~ normal(0, sigma2);
  epsilon3 ~ normal(0, sigma3);
  u ~ normal(0, tau);
}

Now the convergence is much, much better! However, for one or two parameters n_eff is still extremely low:

           mean se_mean   sd  2.5%   25%   50%   75% 97.5% n_eff Rhat
intercept  3.05    0.04 0.27  2.56  2.87  3.05  3.23  3.60    47 1.07
beta1[1]   1.00    0.00 0.01  0.98  0.99  1.00  1.01  1.02   400 1.00
beta1[2]  -2.02    0.00 0.01 -2.04 -2.03 -2.02 -2.01 -2.00   400 1.00
beta1[3]  -0.74    0.00 0.01 -0.76 -0.74 -0.74 -0.73 -0.72   400 0.99
beta2[1]  -0.49    0.00 0.02 -0.53 -0.50 -0.49 -0.47 -0.45   400 1.00
beta2[2]   3.00    0.00 0.02  2.96  2.99  3.00  3.01  3.04   400 0.99
beta2[3]   1.00    0.00 0.02  0.96  0.99  1.00  1.02  1.04   400 1.00
beta3[1]  -2.99    0.00 0.04 -3.07 -3.02 -2.99 -2.96 -2.90   289 1.01
beta3[2]   1.98    0.00 0.05  1.90  1.95  1.98  2.02  2.08   349 0.99
beta3[3]   6.00    0.00 0.05  5.90  5.97  5.99  6.03  6.08   277 1.00
sigma1     0.20    0.00 0.00  0.19  0.20  0.20  0.20  0.21   110 1.03
sigma2     0.30    0.00 0.01  0.29  0.30  0.30  0.30  0.31   188 1.00
sigma3     0.49    0.00 0.01  0.47  0.48  0.49  0.50  0.51   245 1.02
tau        0.52    0.02 0.04  0.45  0.49  0.51  0.54  0.60     6 1.29
gamma[1]   1.00    0.01 0.16  0.66  0.90  1.01  1.11  1.30   172 1.03
gamma[2]  -4.96    0.01 0.15 -5.28 -5.05 -4.94 -4.86 -4.67   196 1.03
gamma[3]   2.28    0.01 0.15  1.99  2.19  2.28  2.37  2.54   144 1.00
theta[1]   1.90    0.03 0.23  1.44  1.75  1.89  2.05  2.33    63 1.05
theta[2]  -1.06    0.03 0.22 -1.46 -1.21 -1.06 -0.91 -0.65    59 1.06
theta[3]   1.19    0.02 0.21  0.77  1.04  1.19  1.34  1.61   171 1.03
lambda     0.24    0.11 0.18  0.00  0.00  0.30  0.38  0.50     3 2.18

Lambda in particular is worrisome. It is already constrained to (-1,1) in the parameters block. Would it additionally help to put an explicit prior on it?

To debug this, it helps to see your full model; just seeing the model block leaves too much up to guesswork. The posted traceplot seems to indicate multimodal solutions. Is that what you expect here?

If the chains are mixing, then running twice as many sampling iterations should roughly double your effective sample sizes. An effective sample size of 3 definitely signals problems and almost certainly means the chains aren’t mixing.

Yes, priors can help to remove multimodal solutions.
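For example, here is a minimal sketch of a weakly informative prior that pulls lambda gently towards 0 and away from the boundaries of its support (the normal(0, 0.5) scale is only an assumption; tune it to your spatial setting):

parameters {
  real<lower=-1, upper=1> lambda;  // constrained to (-1, 1) as described above
}
model {
  lambda ~ normal(0, 0.5);  // implicitly truncated to (-1, 1); the bounds are fixed, so no extra normalization is needed
}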