I am estimating a multinomial logistic regression with random parameters. The model’s likelihood is unimodal and it has a 1-1 relationship with the parameters.
My goal is to estimate the same model with two different datasets. With one the model is fine, but with the other, some chains stabilize around an area that has a much worse likelihood:
likelihood per chain.pdf (45.9 KB)
What could prevent a chain from exploring better areas of the parameter space? My best guess is that it got stuck in a pathological region. The posterior is very close to -1 or 1 for some correlation elements and very close to zero for one variance (<lower=0>
).
Also, what’s the correct thing to do here? Should I pick better initial values? I’m not very eager to make the priors even more informative, and I assume that the right thing to do isn’t just “tossing” the worse chain.
Here’s the Stan code:
data {
int<lower=1> N; // n of observations (number of individuals times number of choice scenarios per individual)
int<lower=1> I; // n of individuals
int<lower=2> J; // n of alternatives
int<lower=0> K; // n of choice covariates, random
int<lower=0> Kf; // n of choice covariates, not random
array[N] int<lower=1, upper=J> choices; // choice made for each observation
array[N] matrix[J, K] X; // choice covariates for each choice situation (random)
array[N] matrix[J, Kf] Xf; // choice covariates for each choice situation (not random)
array[N] int<lower=1, upper=I> task_individual; // for each choice situation N, who was the corresponding individual?
}
parameters {
vector[J - 1] alpha_raw; // J-1 free alphas, one alpha is always = 0
vector[K] beta; // choice covariates: mean of random betas
vector[Kf] beta_f; // choice covariates: non-random betas
matrix[I, J] z; // individual specific random effects
vector<lower=0>[K] var_xi; // SD of the random effects
cholesky_factor_corr[K] L_Xi_corr; // the cholesky factor of the correlation matrix of betas
}
transformed parameters {
vector[J] alpha = append_row(0, alpha_raw);
matrix[K,K] L_Xi = diag_pre_multiply(var_xi, L_Xi_corr);
}
model {
matrix[J,J] util_cov;
matrix[J,J] L_utilcov;
vector[J] curr_utility;
alpha_raw ~ normal(0, 2); // prior for alpha_raw
beta ~ normal(0, 2); // prior for beta
beta_f ~ normal(0, 2); // prior for beta
to_vector(z) ~ std_normal();
var_xi ~ normal(0, 2);
L_Xi_corr ~ lkj_corr_cholesky(1);
for (i in 1:N) {
util_cov = tcrossprod(X[i] * L_Xi);
L_utilcov = cholesky_decompose(add_diag(util_cov, rep_vector(1e-6, J)));
curr_utility = alpha + X[i] * beta + Xf[i] * beta_f + L_utilcov * z[task_individual[i]]';
target += categorical_logit_lpmf(choices[i] | curr_utility);
}
}
The L_utilcov
part is just the reduced form of the random utility (I talk about utility because I study consumers’ discrete choices, but I know it has other names):
The data have the following dimensions: N \approx 14,000, I \approx 2,000, J = 3, K = 6, K^f = 4.
Thanks a lot for your help!