Lack of convergence in spatial model when group-level prior used / not when hard-coded

Hi everyone,

I’m having difficulty achieving convergence on my parameters of interest when I use group-level priors (but not when I hard-code them).

The core of the model is a spatial model, where the outcome is a count variable modelled, through a log link, as a function of an individual-level intercept \alpha_i, a target-level intercept \gamma_j, and the squared distance between the two latent positions of interest, \theta_{ig} and \zeta_j:

Y_{ijg} \sim \text{NegBin}\left(\exp\left(\alpha_i + \gamma_j - ||\theta_{ig} - \zeta_j||^2\right), \omega_i\right).

The prior on \zeta_j is simply \text{N}(0, 1). Because individuals i belong to groups g, however, I put separate group-specific priors on the individual positions: \theta_{ig} \sim \text{N}(\mu_g, \sigma_g). The \theta_{ig}, \mu_{g}, and \sigma_{g} don’t converge: the estimates get somewhat close to each other across multiple chains, but nevertheless end up in clearly distinct locations. On multiple runs, some chains might center \mu_{g} at, say, 0.77, and others at, say, 0.65, as if there are two modes. When I drop the group-level priors and just set \theta_{ig} \sim \text{N}(0, 1) for everyone, the theta and zeta converge as expected. Any help would be greatly appreciated! The Stan code is the following:

data {

  int verbose;
  int N;
  int M;
  int G;
  int group[N];
  int Y[N, M];

}

parameters {

  // User-level intercept
  vector[N] alpha;
  real alpha_mu;
  real<lower = 0> alpha_sigma;

  // Target-level intercept
  vector[M] gamma;
  real<lower = 0> gamma_sigma;

  // User-level variance
  vector<lower = 0>[N] omega;
  real<lower = 0> omega_a;
  real<lower = 0> omega_b;

  // Latent spatial position of user
  vector[G] theta_mu;
  vector<lower = 0>[G] theta_sigma;
  vector[N] theta;

  // Latent spatial position of target
  vector[M] zeta;
  
}

transformed parameters {
}

model {
  
  // Linear predictor
  real C[N, M];

  // Priors
  alpha_mu ~ normal(0, 2.5);
  alpha_sigma ~ normal(0, 2.5);
  alpha ~ normal(alpha_mu, alpha_sigma);

  gamma_sigma ~ normal(0, 2.5);
  gamma ~ normal(0, gamma_sigma);
  
  omega ~ inv_gamma(omega_a, omega_b);

  theta_mu ~ normal(0, 1);
  theta_sigma ~ normal(0, 1);
  theta ~ normal(theta_mu[group], theta_sigma[group]);
  zeta ~ normal(0, 1);

  // Model
  for(i in 1:N)
    for(j in 1:M)
      C[i, j] = alpha[i] + gamma[j] - square(theta[i] - zeta[j]);

  for(i in 1:N) Y[i] ~ neg_binomial_2_log(C[i], omega[i]);

}

generated quantities {
}

An update on this. I was able to get this model working by using a non-centered parameterization for the parameters \alpha_i, \gamma_j, \theta_{ig}, and \zeta_j. The effective sample size, n_eff, is relatively low, however, and the model takes 4-5 times as long to fit as the centered parameterization (around 5 hours versus 1 hour). If anyone has any suggestions on how I might speed this up or increase the effective sample size, I’d really appreciate it!
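
For reference, here is roughly what the non-centered parameterization looks like for \theta (a minimal sketch of the changed blocks, not the full program; theta_raw is just my name for the standardized innovations, and the other non-centered parameters follow the same pattern):

parameters {

  // ... other parameters as before ...

  // Latent spatial position of user, non-centered
  vector[G] theta_mu;
  vector<lower = 0>[G] theta_sigma;
  vector[N] theta_raw;

}

transformed parameters {

  // Shift and scale the standardized draws by group
  vector[N] theta = theta_mu[group] + theta_sigma[group] .* theta_raw;

}

model {

  theta_mu ~ normal(0, 1);
  theta_sigma ~ normal(0, 1);
  theta_raw ~ normal(0, 1);  // implies theta ~ normal(theta_mu[group], theta_sigma[group])

  // ... rest of the model as before ...

}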

Thanks!


Maybe I’m messing something up, but maybe hard-coding them makes sense? Maybe they just aren’t identified well or something?

There are several identifiability problems here.

  1. Only one intercept should get a location. There are location priors for both alpha and gamma, and because the two are added together in the linear predictor, their locations aren’t jointly identified. Instead, center both at zero and add a single overall intercept term (see the sketch after this list).

  2. theta and zeta aren’t identified, but the normal(0, 1) prior on zeta should identify the location.

  3. Dispersion terms are hard to identify even in the simplest cases, and when they vary by individual, as omega does here, they’re super tricky. You may need stronger priors on omega.
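
For point 1, a minimal sketch of what I mean (intercept is a name I’m introducing here for the overall location; everything else as in your model):

parameters {

  real intercept;   // single overall intercept (new)
  vector[N] alpha;  // user effects, now centered at zero
  vector[M] gamma;  // target effects, already centered at zero

  // ... other parameters as before ...

}

model {

  // Linear predictor, as before
  real C[N, M];

  intercept ~ normal(0, 2.5);
  alpha ~ normal(0, alpha_sigma);
  gamma ~ normal(0, gamma_sigma);

  // ... other priors as before ...

  for(i in 1:N)
    for(j in 1:M)
      C[i, j] = intercept + alpha[i] + gamma[j] - square(theta[i] - zeta[j]);

  // ... likelihood as before ...

}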

What we usually recommend is fixing the hyperparameters rather than estimating them, and then gradually introducing more model flexibility. That way you can see when things start to go wrong, for example when you add a predictor or a hierarchical intercept that breaks identifiability.
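
Concretely, that means starting from something like the following in the model block (the constants here are placeholders I picked for illustration, not recommended values):

  // Fix the dispersion hyperparameters instead of estimating omega_a, omega_b
  omega ~ inv_gamma(2, 2);

  // Fix the group-level scale while debugging; free it up once this fits cleanly
  theta ~ normal(theta_mu[group], 1);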

Hi Bob,

  1. The second intercept, \gamma_j, is a random effect centered at zero precisely to avoid identification problems from \gamma_j and \alpha_i both floating around: \gamma_j \sim \text{N}(0, \sigma_{\gamma}) versus \alpha_i \sim \text{N}(\mu_{\alpha}, \sigma_{\alpha}). So identification shouldn’t be a problem here. In any case, the n_eff on these parameters is good.

  2. \theta_i and \zeta_j are identified because \zeta_j is centered at zero. They “scale themselves” (with the variance parameters estimated from the data) because I don’t have a free coefficient on the spatial term (e.g. I don’t fit \lambda||\theta_i - \zeta_j||^2); see the identity below this list.

  3. The n_eff on the \omega_i parameters is actually really good; it’s only low for \zeta_j and \theta_i. Overall, though, I have it working well enough that with 2,000 post-warmup draws I get around n_eff = 200 on those parameters, which I think is sufficient.
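
To spell out the scale point in 2: a free coefficient \lambda would make the scale of the positions unidentified, since for any c > 0,

\lambda ||\theta_i - \zeta_j||^2 = \frac{\lambda}{c^2} ||c\,\theta_i - c\,\zeta_j||^2,

so rescaling every position by c while shrinking \lambda by c^2 leaves the likelihood unchanged. With the coefficient fixed at 1 and \zeta_j \sim \text{N}(0, 1), that freedom is gone.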

If I’m off on any of this though, please let me know!

Looks good. I missed that the \gamma were centered at zero. So the only problem is that it’s slow?

You might want to try increasing adapt_delta and spending more time warming up if things take a long time. Scaling the parameters so that their posteriors are roughly unit scale (e.g., normal(0, 1)) helps a lot with warmup and mixing.