Trace plot got stuck for some chains

I am fitting a multilevel Weibull regression model, and the following convergence problem shows up quite often: some chains converge, but others just get stuck. I'd like to know what causes this and how to fix it. Right now I'm just changing the random seed and hoping all chains converge.

An example of problematic traceplot:
[traceplot image]

Stan code:

data {
  int<lower=0> N;
  int<lower=0> Nevent;
  int<lower=0> K;
  int<lower=0> L; //levels
  vector[N] y;
  matrix[N, K] X;   // predictor matrix
  int<lower=1,upper=L> ll[N];//level index
}
transformed data {
  real<lower=0> tau_al;
  vector[Nevent] yobs;
  vector[N-Nevent] ycen;
  matrix[N, K] Q_ast;
  matrix[K, K] R_ast;
  matrix[K, K] R_ast_inverse;
  matrix[Nevent, K] Xobs;
  matrix[N-Nevent, K] Xcen;
  tau_al = 10.0;
  yobs = y[1:Nevent];
  ycen = y[(Nevent+1):N];
  // thin and scale the QR decomposition
  Q_ast = qr_Q(X)[, 1:K] * sqrt(N - 1);
  R_ast = qr_R(X)[1:K, ] / sqrt(N - 1);
  R_ast_inverse = inverse(R_ast);
}
parameters {
  real mu_alpha;
  real<lower=0> tau_alpha;
  vector[L] alpha_raw;
  // slope
  real mu_theta[K];
  real<lower=0> tau_theta[K];
  vector[K] theta[L];
  // intercept
  real mu_a;
  real<lower=0> tau_a;
  real a[L];
}
transformed parameters {
  vector[L] alpha;
  alpha = exp(mu_alpha + tau_alpha * alpha_raw);
}
model {
  vector[N] linear_part;
  vector[N] shape_part;
  mu_theta ~ normal(0, 10);
  tau_theta ~ student_t(4, 0, 1);
  mu_a ~ normal(0, 100);
  tau_a ~ student_t(4, 0, 1);
  mu_alpha ~ normal(0, 100);
  tau_alpha ~ student_t(4, 0, 1);
  // priors
  a ~ normal(mu_a, tau_a);
  for (l in 1:L){
    theta[l] ~ normal(mu_theta, tau_theta);
  }
  for (n in 1:N){
    shape_part[n] = alpha[ll[n]];
    linear_part[n] = (a[ll[n]] + Q_ast[n, 1:K] * theta[ll[n]])/alpha[ll[n]];
  }
  yobs ~ weibull(shape_part[1:Nevent], exp(linear_part[1:Nevent]));
  target += weibull_lccdf(ycen | shape_part[(Nevent+1):N], exp(linear_part[(Nevent+1):N]));
  alpha_raw ~ normal(0.0, 1.0);
}
generated quantities {
  vector[K] theta_out[L];
  for (l in 1:L){
    theta_out[l] = R_ast_inverse * theta[l]; // coefficients on x
  }
}
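The thin-and-scale QR trick in `transformed data` decorrelates the predictors so the sampler works on the better-conditioned `Q_ast` scale, then `generated quantities` maps `theta` back to coefficients on the original `X` scale. A minimal numpy sketch of the same transformation (hypothetical data, not the model's):

```python
import numpy as np

# Hypothetical predictor matrix standing in for X in the Stan model.
rng = np.random.default_rng(0)
N, K = 50, 3
X = rng.normal(size=(N, K))

# Same as the Stan transformed data block:
# Q_ast = qr_Q(X)[, 1:K] * sqrt(N - 1);  R_ast = qr_R(X)[1:K, ] / sqrt(N - 1)
Q, R = np.linalg.qr(X)            # numpy's default 'reduced' mode is already thin
Q_ast = Q * np.sqrt(N - 1)
R_ast = R / np.sqrt(N - 1)
R_ast_inv = np.linalg.inv(R_ast)

# Q_ast @ R_ast reproduces X, so theta fit on the Q_ast scale maps back to
# coefficients on the X scale via beta = R_ast_inv @ theta, exactly as in
# the generated quantities block.
assert np.allclose(Q_ast @ R_ast, X)
theta = rng.normal(size=K)
beta = R_ast_inv @ theta
assert np.allclose(Q_ast @ theta, X @ beta)
```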

Initial methods:

import numpy as np

def get_inits():
    # L and K here match the corresponding dimensions passed in stan_data
    inits = {
        'mu_alpha': 0.01*np.random.normal(),
        'tau_alpha': np.random.uniform(0.01, 0.05),
        'alpha_raw': 0.01*np.random.normal(size=L),
        'mu_theta': 0.1*np.random.normal(size=K),
        'tau_theta': np.random.uniform(0.05, 0.2, K),
        'theta': 0.1*np.random.normal(size=(L, K)),
        'mu_a': 0.1*np.random.normal(loc=3.5),
        'tau_a': np.random.uniform(0.2, 0.4),
        'a': 0.1*np.random.normal(size=L)
    }
    return inits

Calling methods:

n_chain = 2
Nsamples = 4000
nuts_control = {
    'adapt_delta': 0.9,
    'max_treedepth': 12
}
# sm is the compiled pystan.StanModel; stan_data holds the data block inputs
fit = sm.sampling(data=stan_data, iter=Nsamples, chains=n_chain, init=get_inits, n_jobs=n_chain,
                  control=nuts_control, seed=1)

I'm just guessing, but any chance

alpha = exp(mu_alpha + tau_alpha * alpha_raw);

is blowing up to infinity? Any time a model has explicit exps, it's pretty easy to get numerical problems, which could cause weirdness like this.
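For what it's worth, `exp` overflows quickly in double precision: anything much above about 709 is already `inf`, so `mu_alpha + tau_alpha * alpha_raw` doesn't need to be extreme before the Weibull shape becomes infinite. A quick numpy demonstration:

```python
import numpy as np

# In double precision, exp(x) overflows to inf for x > ~709.78, so the
# linear predictor inside exp() doesn't have to be very large before
# alpha (the Weibull shape here) becomes infinite.
with np.errstate(over="ignore"):
    print(np.exp(np.float64(709.0)))   # large but finite (~8.2e307)
    print(np.exp(np.float64(710.0)))   # inf
assert np.isfinite(np.exp(np.float64(709.0)))
with np.errstate(over="ignore"):
    assert np.isinf(np.exp(np.float64(710.0)))
```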

I will try to make the priors more concentrated to prevent overflow to infinity. In the meantime, I tried increasing adapt_delta to 0.98 and it helps, but it took a really long time to get results, and still with warnings:

WARNING:pystan:35 of 4000 iterations ended with a divergence (0.875 %).
WARNING:pystan:Try running with adapt_delta larger than 0.98 to remove the divergences.
WARNING:pystan:101 of 4000 iterations saturated the maximum tree depth of 13 (2.52 %)
WARNING:pystan:Run again with max_treedepth larger than 13 to avoid saturation
WARNING:pystan:Chain 1: E-BFMI = 0.148
WARNING:pystan:Chain 2: E-BFMI = 0.174
WARNING:pystan:E-BFMI below 0.2 indicates you may need to reparameterize your model
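On the E-BFMI point: `alpha` is already non-centered via `alpha_raw`, but `a` and `theta` are still centered, which is a common source of funnel geometry in hierarchical models. A hedged, untested sketch of non-centering them too (the `theta_raw`/`a_raw` names are my own; only the changed pieces are shown):

```stan
parameters {
  // ... other parameters unchanged ...
  vector[K] theta_raw[L];  // replaces theta
  real a_raw[L];           // replaces a
}
transformed parameters {
  vector[K] theta[L];
  real a[L];
  for (l in 1:L) {
    // non-centered: theta[l] = mu_theta + tau_theta .* theta_raw[l]
    theta[l] = to_vector(mu_theta) + to_vector(tau_theta) .* theta_raw[l];
    a[l] = mu_a + tau_a * a_raw[l];
  }
}
model {
  // standard-normal priors on the raw parameters replace
  // a ~ normal(mu_a, tau_a) and theta[l] ~ normal(mu_theta, tau_theta)
  a_raw ~ normal(0, 1);
  for (l in 1:L)
    theta_raw[l] ~ normal(0, 1);
  // ... rest of the model unchanged ...
}
```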

The mixing is much better:
[traceplot image]

Not a bad idea. Even if you can't specifically identify what's blowing up, this is something to try. But check the plots if you can.

Check for correlated variables. Identifiability problems are one way to get lots of treedepth warnings like this.

Yeah, it definitely looks better, but once you move past having clearly stalled-out chains, it's best to lean on n_eff/Rhat and the other diagnostics to figure out what the chains are doing.
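To make the Rhat suggestion concrete, here is a minimal sketch of the split-Rhat idea those diagnostics are based on (a simplified version of the Gelman et al. formulation; in practice use Stan's own summary, this is just for intuition):

```python
import numpy as np

def split_rhat(chains):
    """Split-Rhat for an array of shape (n_chains, n_draws).

    Each chain is split in half, so a stuck or slowly drifting chain
    shows up even when the full-chain means happen to agree.
    """
    n_chains, n_draws = chains.shape
    half = n_draws // 2
    halves = chains[:, : 2 * half].reshape(2 * n_chains, half)
    w = halves.var(axis=1, ddof=1).mean()        # within-chain variance
    b = half * halves.mean(axis=1).var(ddof=1)   # between-chain variance
    var_plus = (half - 1) / half * w + b / half  # pooled variance estimate
    return np.sqrt(var_plus / w)

rng = np.random.default_rng(1)
mixed = rng.normal(size=(4, 1000))   # four well-mixed chains
stuck = mixed.copy()
stuck[0] += 5.0                      # one chain stuck in a different region
print(split_rhat(mixed))   # close to 1
print(split_rhat(stuck))   # well above the usual ~1.01 threshold
```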