Dear Stan community,
I have trouble fitting a mixture Generalized Partial Credit Model (belonging to the IRT class of models). This is going to be a rather long post, so I added sections.
1. The model
Let I, K, J\in\mathbb{N}, which can be thought of as number of subjects, items and thresholds per item, respectively. Furthermore, let C\in\mathbb{N} be the number of latent classes (mixture components) which partition the subjects exhaustively.
In every class c, each subject i is assigned an ability parameter \theta_{ic} and each item k is assigned a discrimination parameter \alpha_{kc} and j threshold parameters \beta_{jkc}. Furthermore, each class occurs with probability \lambda_c.
Now let y_{i}\in\{0,...,J\}^K be the response vector of subject i. Marginalizing out class membership, the likelihood for each response vector is then given by
p(y_i\vert\theta_i,\beta,\lambda)=\sum_{c=1}^{C}\prod_{k=1}^{K}\lambda_c\frac{\exp\left(\sum_{j=1}^{y_{ik}}\alpha_{kc}(\theta_{ic}\beta_{jkc})\right)}{\sum_{l=0}^{J}\exp\left(\sum_{j=1}^{l}\alpha_{kc}(\theta_{ic}\beta_{jkc})\right)}.
For identification issues (in the classical sense), we need to fix the location and scale of the parameters in each class (however, I can neither proof nor disproof that the mixture model is identified), which in practice is most often done by either constraining the \beta parameters to sum to zero and setting \alpha_{1c}=1 in each class or assuming a standard normal distribution for the ability parameters \theta_{ic}.
2. Data
I generated two datasets, both containing 15 items with 5 categories (thus 4 thresholds) and either 500 or 1000 subjects, split into two classes with sizes 60%/40%. The ability parameters are normally distributed and with the standardization of \alpha and \beta described above, they have mean \pm1.25 and standard deviation 0.75 and are therefore well separated (i.e. the distribution is nicely bimodal).
If I set C=1 and fit the classes separately, the parameters are well recovered, fitting the mixture model is a lot harder though. I have read a lot of the posts and papers relevant to my problem, including Michael Betancourt’s many highly informative blog posts.
3. Parameterizations
I tried nearly 30 different reparameterizations so far which led to a lot of different pathologies. What seems to be a common phenomenon is that when the step size is too low the chain will wiggle around in a very tiny interval around some value of \lambda (see fig1.pdf (134.0 KB) fig2.pdf (132.5 KB) fig3.pdf (123.4 KB) attached).
I set adapt_delta
to 0.99 and max_treedepth
to 15 (here I went from starting out with stan > following the advice to increase adapt_delta given in the output to reduce divergences > learning more and finding out that this is a last resort and setting adapt_delta back to 0.8 > nothing else works so setting it back to 0.99 as a last resort).
The data block I use is always the same:
data {
// Counts
int<lower=1> I; // No. persons
int<lower=1> K; // No. items
int<lower=1> J; // No. parameters per item
int<lower=1> C; // No. classes
// Data
int<lower=1, upper=J+1> y[I, K]; // Data as persons by items Matrix
}
Here are three parameterizations that have worked the best so far:
#7:
parameters {
// Person Parameters
vector[I] theta[C];
// Item Parameters
// Level 1
vector<lower=0>[K] alpha[C];
vector[J] alpha_times_beta_raw[C, K];
// Level 2
vector[K] alpha_times_mu_beta[C];
// Mixture Parameters
vector[2 * 16] lambda_raw[C];
}
transformed parameters {
// Mixture Parameters
simplex[C] lambda;
// Item Parameters
vector[J] alpha_times_beta[C, K];
ordered[C] mu_beta;
for (c in 1:C) {
lambda[c] = dot_self(lambda_raw[c]); // Sumofsquares
mu_beta[c] = sum(alpha_times_mu_beta[c]) / sum(alpha[c]); // Weighted average
for (k in 1:K) {
alpha_times_beta[c, k] = alpha_times_mu_beta[c, k] + alpha_times_beta_raw[c, k];
}
}
lambda = lambda / sum(lambda);
}
model {
// Priors ///////////////////////////////////////////
for (c in 1:C) {
// Person Parameters
target += std_normal_lpdf(theta[c]);
// Item Parameters
target += lognormal_lpdf(alpha[c]  0, 0.5); // Implies median(alpha) = 1
for (k in 1:K) {
target += normal_lpdf(alpha_times_beta_raw[c, k]  0, alpha[c, k]); // Implies beta_raw ~ N(0, 1)
target += normal_lpdf(alpha_times_mu_beta[c, k]  0, alpha[c, k]); // Implies mu_beta ~ N(0, 1)
}
// Mixture Parameters
target += std_normal_lpdf(lambda_raw[c]);
}
// Likelihood ///////////////////////////////////////////
for (i in 1:I) {
vector[C] ll_by_pers = rep_vector(0, C);
for (c in 1:C) {
for (k in 1:K) {
ll_by_pers[c] += categorical_logit_lpmf(
y[i, k] 
append_row(0, cumulative_sum(alpha[c, k] * theta[c, i]  alpha_times_beta[c, k]))
);
}
}
target += log_sum_exp(log(lambda) + ll_by_pers);
}
}
The class size parameter \lambda is reparameterized following this (https://mcstan.org/docs/2_18/stanusersguide/reparameterizationsection.html) section from the Stan manual. Furthermore, the parameterization of \alpha\beta as one parameter follows Michael Betancourts post on identifiability (https://mcstan.org/docs/2_18/stanusersguide/reparameterizationsection.html) leading to better estimation results as I already enquired about in an earlier post (Convergence depending on parameterization of discrimination parameter in the GPCM).
Since the standard normal prior on \theta determines the scale, I set priors on \beta and \alpha to be on that same scale. In all fairness, the narrow priors, especially on the \alpha parameters are necessary to ensure convergence. Also I found that adding any kind of hierarchy (e.g. estimating mean and/or variance of the \alpha parameters) induces instability and complicates convergence. This is why the parameter ensuring the avoidance of label switching, mu_beta
is a transformed parameter and not actually estimated.
fig4_par7.pdf (272.9 KB) shows the traceplot of the \lambda parameter and the traceplot of the step size (truncated after 1000 iterations since step size doesn’t change after warmup). Even is this is the best I have got so far (and there are no divergent transitions after warmup) on the small dataset, the ESSs and PSRFs are still far from perfect (for 82000 warmup and 82000 sampling iterations):
mean se_mean sd 2.5% 50% 97.5% n_eff Rhat
alpha[2,6] 1.01 0.02 0.28 0.55 0.98 1.63 199 1.04
theta[1,432] 0.86 0.06 1.21 2.38 1.34 1.72 404 1.03
theta[2,40] 0.93 0.07 1.32 1.70 1.16 2.65 360 1.03
theta[2,155] 0.96 0.06 1.02 1.52 1.37 2.12 301 1.03
alpha[2,3] 0.68 0.01 0.17 0.40 0.66 1.06 267 1.03
alpha[2,7] 0.86 0.01 0.22 0.52 0.83 1.38 285 1.03
alpha[2,13] 1.95 0.04 0.64 0.95 1.88 3.44 266 1.03
alpha_times_beta_raw[2,10,4] 0.68 0.05 1.00 1.19 0.63 2.81 386 1.03
lambda[1] 0.63 0.00 0.04 0.55 0.63 0.71 231 1.03
lambda[2] 0.37 0.00 0.04 0.29 0.37 0.45 231 1.03
The other two parameterizations are:
#5:
parameters {
// Person Parameters
vector[I] theta[C];
// Item Parameters
// Level 1
vector<lower=0>[K] alpha[C];
vector[J] alpha_times_beta_raw[C, K];
// Level 2
ordered[C] alpha_1_times_mu_beta;
// Mixture Parameters
vector[2 * lambda_prior] lambda_raw[C];
}
transformed parameters {
// Mixture Parameters
simplex[C] lambda;
// Item Parameters
vector[J] alpha_times_beta[C, K];
vector[K] alpha_times_mu_beta[C];
for (c in 1:C) {
lambda[c] = dot_self(lambda_raw[c]); // sumofsquares
alpha_times_mu_beta[c] = append_row(alpha_1_times_mu_beta[c], alpha_1_times_mu_beta[c] / alpha[c, 1] * alpha[c, 2:K]);
for (k in 1:K) {
alpha_times_beta[c, k] = alpha_times_mu_beta[c, k] + alpha_times_beta_raw[c, k];
}
}
lambda = lambda / sum(lambda);
}
model {
// Priors ///////////////////////////////////////////
for (c in 1:C) {
// Person Parameters
target += std_normal_lpdf(theta[c]);
// Item Parameters
target += lognormal_lpdf(alpha[c]  alpha_prior_mu, alpha_prior_sigma);
target += normal_lpdf(alpha_1_times_mu_beta[c]  0, alpha[c, 1]);
for (k in 1:K) {
target += normal_lpdf(alpha_times_beta_raw[c, k]  0, alpha[c, k]);
}
// Mixture Parameters
target += std_normal_lpdf(lambda_raw[c]);
}
// Likelihood ///////////////////////////////////////////
for (i in 1:I) {
vector[C] ll_by_pers = rep_vector(0, C);
for (c in 1:C) {
for (k in 1:K) {
ll_by_pers[c] += categorical_logit_lpmf(
y[i, k] 
append_row(0, cumulative_sum(alpha[c, k] * theta[c, i]  alpha_times_beta[c, k]))
);
}
}
target += log_sum_exp(log(lambda) + ll_by_pers);
}
}
Traceplot in fig5_par5.pdf (270.7 KB) and with posterior summary:
mean se_mean sd 2.5% 50% 97.5% n_eff Rhat
lambda[1] 0.66 0.00 0.05 0.56 0.67 0.74 96 1.07
lambda[2] 0.34 0.00 0.05 0.26 0.33 0.44 96 1.07
alpha_times_beta[1,10,4] 0.39 0.02 0.24 0.89 0.39 0.05 175 1.04
alpha_times_beta[1,14,2] 1.19 0.03 0.39 2.01 1.16 0.52 167 1.04
alpha[1,8] 1.15 0.01 0.18 0.81 1.15 1.50 264 1.03
alpha[2,6] 0.79 0.02 0.24 0.43 0.75 1.38 211 1.03
alpha_1_times_mu_beta[1] 0.94 0.01 0.20 1.39 0.92 0.59 274 1.03
alpha_times_beta[1,4,4] 1.26 0.01 0.21 1.71 1.26 0.87 234 1.03
alpha_times_beta[1,6,4] 0.76 0.01 0.21 1.19 0.75 0.38 191 1.03
alpha_times_beta[1,7,3] 1.42 0.02 0.36 2.22 1.39 0.79 213 1.03
and #22:
parameters {
// Person Parameters
vector[I] theta[C];
// Item Parameters
// Level 1
vector<lower=0>[K] alpha_times_sigma[C];
vector[J] alpha_times_beta_free[C, K  1];
vector[J  1] alpha_times_beta_free_K[C];
// Level 2
vector[K] alpha_times_mu_theta[C];
// Mixture Parameters
vector[2 * lambda_prior] lambda_raw[C];
}
transformed parameters {
// Person Parameters
ordered[C] mu_theta;
// Item Parameters
vector[J] alpha_times_beta[C, K];
// Mixture Parameters
simplex[C] lambda;
for (c in 1:C) {
real alpha_times_beta_sum = 0;
for (k in 1:(K  1)) {
alpha_times_beta_sum += sum(alpha_times_beta_free[c, k]);
alpha_times_beta[c, k] = alpha_times_beta_free[c, k];
}
alpha_times_beta_sum += sum(alpha_times_beta_free_K[c]);
alpha_times_beta[c, K] = append_row(alpha_times_beta_free_K[c], alpha_times_beta_sum);
mu_theta[c] = sum(alpha_times_mu_theta[c])/sum(alpha_times_sigma[c]); // Weighted average
lambda[c] = dot_self(lambda_raw[c]); // Sumofsquares
}
lambda = lambda / sum(lambda);
}
model {
// Priors ///////////////////////////////////////////
for (c in 1:C) {
// Person Parameters
target += std_normal_lpdf(theta[c]);
// Item Parameters
target += lognormal_lpdf(alpha_times_sigma[c]  alpha_prior_mu, alpha_prior_sigma);
for (k in 1:K) {
target += normal_lpdf(alpha_times_beta[c, k]  0, alpha_times_sigma[c, k]);
target += normal_lpdf(alpha_times_mu_theta[c, k]  0, alpha_times_sigma[c, k]);
}
// Mixture Parameters
target += std_normal_lpdf(lambda_raw[c]);
}
// Likelihood ///////////////////////////////////////////
for (i in 1:I) {
vector[C] ll_by_pers = rep_vector(0, C);
for (c in 1:C) {
for (k in 1:K) {
ll_by_pers[c] += categorical_logit_lpmf(
y[i, k] 
append_row(0, cumulative_sum(alpha_times_mu_theta[c, k] + alpha_times_sigma[c, k] * theta[c, i]  alpha_times_beta[c, k]))
);
}
}
target += log_sum_exp(log(lambda) + ll_by_pers);
}
}
This is basically #7, just that I reduce the dimension by 1 by introducing a sum to zero constrain on the \beta parameters and therefore shift the intercept to the ability parameters (alpha_times_mu_beta
is renamed as alpha_times_mu_theta
), now subtractively and not additively. I am well aware that I have one prior “too much”, since the very last entry of alpha_times_beta
is a transformed parameter, but frustratingly and paradoxically, also in the other parameterizations I tried, the “naive” version always works better than the mathematically correct one. In another model I parameterized \alpha_k=\alpha_1\delta_k and put priors on the transformed parameters \alpha_k, but the estimation became a lot more unstable when adding the log det Jacobians of the transform. Anyway, trace plots in fig6_par22.pdf (139.1 KB) and posterior summary here:
mean se_mean sd 2.5% 50% 97.5% n_eff Rhat
alpha_times_sigma[1,13] 2.02 0.09 0.71 0.95 1.92 3.67 59 1.08
theta[1,101] 0.93 0.13 1.18 1.66 1.46 2.33 77 1.07
alpha_times_sigma[1,3] 0.70 0.02 0.17 0.42 0.68 1.10 59 1.07
alpha_times_sigma[1,5] 0.92 0.03 0.23 0.55 0.89 1.45 63 1.07
alpha_times_sigma[1,6] 1.06 0.04 0.30 0.60 1.01 1.77 69 1.07
alpha_times_beta_free[1,10,3] 0.69 0.10 0.91 0.99 0.65 2.61 81 1.07
alpha_times_beta[1,10,3] 0.69 0.10 0.91 0.99 0.65 2.61 81 1.07
alpha_times_sigma[2,11] 1.31 0.02 0.22 0.94 1.29 1.79 104 1.06
alpha_times_beta_free[1,10,4] 0.40 0.10 0.99 1.50 0.38 2.40 93 1.06
alpha_times_beta[1,10,4] 0.40 0.10 0.99 1.50 0.38 2.40 93 1.06
So it seems like reducing the dimension is actually hurting the estimation and not facilitating it.
4. More data
To my disappointment, however, when I fit those models to the bigger dataframe (1000 subjects), the estimations become even more unstable and we have chains where a group collapses (chain 2, 4 in par. 7, fig7_par7_M.pdf (476.3 KB)). The estimations for 5 and 22 have not finished, but in fig8_par22_M.pdf (476.7 KB) we see that even though, in par. 22, no chain collapses (chain 6 is not yet finished), the chains look worse than on the smaller data set.
5. Questions
So of course my main question is:
 Can someone come up with another parameterization or prior specification (I guess the main candidate here would be \alpha) to improve convergence?
Furthermore, I would gladly like some insight on:

What happens in Fig. 13 when the stepsize is almost zero and why can the chain sometimes wind itself out of this basically constant state?

Where do the little “detours” of chain 1 in par. 5 (fig. 5) and of chain 2 in par. 22 (fig. 6) come from and how to prevent that?

Why do things behave (in my mind) unintuitively  i.e. reducing the dimension worsens the fit, more date worsens the fit etc.? I have a basic understanding of the examples in the relevant blog posts, but those mostly deal with very simple models, so it is not trivial to transpose the intuitions given there to my kind of model (where every new line of data also introduces a new parameter).
I would be very glad for any advice, input or questions, since I am very much at the end of my wits.
Thank you!