Why would a chain stabilize in a parameter space with worse likelihood?

I am estimating a multinomial logistic regression with random parameters. The model’s likelihood is unimodal and it has a 1-1 relationship with the parameters.

My goal is to estimate the same model with two different datasets. With one the model is fine, but with the other, some chains stabilize around an area that has a much worse likelihood:

likelihood per chain.pdf (45.9 KB)

What could prevent a chain from exploring better areas of the parameter space? My best guess is that it got stuck in a pathological region. The posterior is very close to -1 or 1 for some correlation elements and very close to zero for one variance (<lower=0>).

Also, what’s the correct thing to do here? Should I pick better initial values? I’m not very eager to make the priors even more informative, and I assume that the right thing to do isn’t just “tossing” the worse chain.

Here’s the Stan code:

data {
  int<lower=1> N; // n of observations (number of individuals times number of choice scenarios per individual) 
  int<lower=1> I; // n of individuals 
  int<lower=2> J; // n of alternatives 
  int<lower=0> K; // n of choice covariates, random 
  int<lower=0> Kf; // n of choice covariates, not random 
  array[N] int<lower=1, upper=J> choices; // choice made for each observation 
  array[N] matrix[J, K] X; // choice covariates for each choice situation (random)
  array[N] matrix[J, Kf] Xf; // choice covariates for each choice situation (not random)
  array[N] int<lower=1, upper=I> task_individual; // for each choice situation N, who was the corresponding individual? 
parameters {
  vector[J - 1] alpha_raw; // J-1 free alphas, one alpha is always = 0 
  vector[K] beta; // choice covariates: mean of random betas
  vector[Kf] beta_f; // choice covariates: non-random betas
  matrix[I, J] z; // individual specific random effects 
  vector<lower=0>[K] var_xi; // SD of the random effects 
  cholesky_factor_corr[K] L_Xi_corr; // the cholesky factor of the correlation matrix of betas
transformed parameters {
  vector[J] alpha = append_row(0, alpha_raw);
  matrix[K,K] L_Xi = diag_pre_multiply(var_xi, L_Xi_corr);
model {
  matrix[J,J] util_cov;
  matrix[J,J] L_utilcov;
  vector[J] curr_utility;
  alpha_raw ~ normal(0, 2); // prior for alpha_raw 
  beta ~ normal(0, 2); // prior for beta 
  beta_f ~ normal(0, 2); // prior for beta 
  to_vector(z) ~ std_normal();
  var_xi ~ normal(0, 2);
  L_Xi_corr ~ lkj_corr_cholesky(1);
  for (i in 1:N) {
    util_cov = tcrossprod(X[i] * L_Xi);
    L_utilcov = cholesky_decompose(add_diag(util_cov, rep_vector(1e-6, J)));
    curr_utility = alpha + X[i] * beta + Xf[i] * beta_f + L_utilcov * z[task_individual[i]]';
    target += categorical_logit_lpmf(choices[i] | curr_utility);

The L_utilcov part is just the reduced form of the random utility (I talk about utility because I study consumers’ discrete choices, but I know it has other names):

\text{choices} \sim \text{categorical logit}(u_i) \\ u_{i} = \alpha + X_i \beta_i + X_i^f \beta^f \\ \beta_i = \bar{\beta} + \zeta_i, \quad \zeta_i \sim N(0, \Xi) \\ u_i = \alpha + X_i \bar{\beta} + X_i^f \beta^f + X_i \zeta_i \\ \therefore u_i \sim N(\alpha + X_i \bar{\beta} + X_i^f \beta^f, X_i \Xi X_i^t)

The data have the following dimensions: N \approx 14,000, I \approx 2,000, J = 3, K = 6, K^f = 4.

Thanks a lot for your help!

How can you be sure that is indeed the case? Unless you can derive it analytically, the shape of the likelihood/posterior is not known a priori (no pun intended) and it is through sampling that you will approximate it and gain some knowledge about its topology — it’s unlikely that a chain will converge to a region where there isn’t some sort of mode (although as far as I know the terminology uni/bi/multimodal is not really formal anyway).

I’m not sure what you mean with this, technically this would only be strictly true if the model has a single parameter.

Are both data sets similar in size and otherwise, or is there a reason why you would expect them to behave differently? You could check the space metric (“mass matrix”) and step size, and see if for some reason the chain that is off is not optimizing them properly.

Yes. Although that’s an overly broad answer to an ever broader question. “Pathological” is also a vague shorthand for all sorts considerations about the geometrical features of the posterior and the ability of different methods to move around it.

I’d say yes, you can try to constrain initial values or parameter ranges and see if that solves the problem, the important point is not to make initial values “too similar” and making sure the parameters aren’t “artificially” constrained (these are in quotes because they are vaguely defined too).

I can’t really go into the code here, but for multilogit models there are often identifiability issues and one category may be fixed as a reference, so something like this or similar could help in your specific case.

A not-uncommon problem that can occur for models with constrained parameters are likelihood functions, and hence possibly posterior distributions, that exhibit weak modes concentrating on the boundaries in addition to larger modes concentration in the interior. These weak modes are often negligible but if a Markov chain is initialized too close them then it can be captured by that mode for arbitrarily long amounts of time regardless of how small it is relative to the main mode(s). This is an especially likely scenario if you’re running multiple Markov chains and only some are being stuck on the boundaries.

If your prior model allocates only limited probability near the boundaries then an initialization more compatible with the prior model should stay away from the boundaries and avoid this capture. Then again if you do not have domain expertise that contrasts boundary behavior then you can’t ignore those weak modes so easily.

1 Like