Insufficient convergence in mixture Generalized Partial Credit Model

Dear Stan community,

I have trouble fitting a mixture Generalized Partial Credit Model (belonging to the IRT class of models). This is going to be a rather long post, so I added sections.

1. The model

Let I, K, J\in\mathbb{N} be the numbers of subjects, items and thresholds per item, respectively. Furthermore, let C\in\mathbb{N} be the number of latent classes (mixture components), which partition the subjects exhaustively.

In every class c, each subject i is assigned an ability parameter \theta_{ic}, and each item k is assigned a discrimination parameter \alpha_{kc} and J threshold parameters \beta_{1kc},\dots,\beta_{Jkc}. Furthermore, each class occurs with probability \lambda_c.

Now let y_{i}\in\{0,...,J\}^K be the response vector of subject i. Marginalizing out class membership, the likelihood for each response vector is then given by
p(y_i\vert\theta_i,\alpha,\beta,\lambda)=\sum_{c=1}^{C}\lambda_c\prod_{k=1}^{K}\frac{\exp\left(\sum_{j=1}^{y_{ik}}\alpha_{kc}(\theta_{ic}-\beta_{jkc})\right)}{\sum_{l=0}^{J}\exp\left(\sum_{j=1}^{l}\alpha_{kc}(\theta_{ic}-\beta_{jkc})\right)},
with the usual convention that the empty sum (for y_{ik}=0, respectively l=0) equals zero.
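In Stan, the per-item term of this likelihood can be written as a small function. This is just an illustrative sketch (the function name and signature are mine; the models below inline the same expression instead of calling a function):

functions {
  // Log-probability of one response to one item under the (single-class) GPCM.
  // y is coded 1..J+1 for the categories 0..J, matching the data block below.
  real gpcm_lpmf(int y, real theta, real alpha, vector beta) {
    int J = num_elements(beta);
    // Category 0 gets logit 0; category l gets sum_{j <= l} alpha * (theta - beta_j)
    vector[J + 1] logits = append_row(0, cumulative_sum(alpha * (theta - beta)));
    return categorical_logit_lpmf(y | logits);
  }
}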

For identification (in the classical sense), we need to fix the location and scale of the parameters in each class (although I can neither prove nor disprove that the mixture model itself is identified). In practice this is most often done either by constraining the \beta parameters to sum to zero and setting \alpha_{1c}=1 in each class, or by assuming a standard normal distribution for the ability parameters \theta_{ic}.

2. Data

I generated two datasets, both containing 15 items with 5 categories (thus 4 thresholds) and either 500 or 1000 subjects, split into two classes of relative size 60%/40%. The ability parameters are normally distributed; with the standardization of \alpha and \beta described above they have means \pm1.25 and standard deviation 0.75, so the classes are well separated (i.e. the overall ability distribution is nicely bimodal).
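For concreteness, the generating process is roughly the following (just a sketch, not my actual simulation code; the *_true names are placeholders and the true values are passed in as data, to be run with the fixed-parameter sampler):

data {
  int<lower=1> I;                   // No. persons
  int<lower=1> K;                   // No. items
  int<lower=1> J;                   // No. thresholds per item
  int<lower=1> C;                   // No. classes
  simplex[C] lambda_true;           // Class sizes, here (0.6, 0.4)
  vector<lower=0>[K] alpha_true[C];
  vector[J] beta_true[C, K];
  vector[C] mu_theta_true;          // Class means of the abilities, here +/-1.25
  real<lower=0> sigma_theta;        // Within-class SD of the abilities, here 0.75
}
generated quantities {
  int<lower=1, upper=J+1> y_sim[I, K]; // Category 0 coded as 1, as in the data block below
  for (i in 1:I) {
    int c = categorical_rng(lambda_true);
    real theta = normal_rng(mu_theta_true[c], sigma_theta);
    for (k in 1:K)
      y_sim[i, k] = categorical_logit_rng(
        append_row(0, cumulative_sum(alpha_true[c, k] * (theta - beta_true[c, k]))));
  }
}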

If I set C=1 and fit the classes separately, the parameters are well recovered; fitting the mixture model, however, is a lot harder. I have read many of the posts and papers relevant to my problem, including Michael Betancourt’s many highly informative blog posts.

3. Parameterizations

So far I have tried nearly 30 different reparameterizations, which led to a variety of pathologies. A common phenomenon is that when the step size becomes very small, the chain just wiggles around in a tiny interval around some value of \lambda (see the attached fig1.pdf (134.0 KB), fig2.pdf (132.5 KB) and fig3.pdf (123.4 KB)).

I set adapt_delta to 0.99 and max_treedepth to 15. (The history here: I started out with Stan’s defaults, followed the advice in the sampler output to increase adapt_delta in order to reduce divergences, then learned that this is really a last resort and set adapt_delta back to 0.8, and finally, since nothing else worked, set it back to 0.99 as that last resort.)

The data block I use is always the same (the prior hyperparameters at the end are only referenced by some of the parameterizations below):

data {
  // Counts
  int<lower=1> I; // No. persons
  int<lower=1> K; // No. items
  int<lower=1> J; // No. thresholds per item
  int<lower=1> C; // No. classes
  
  // Data
  int<lower=1, upper=J+1> y[I, K]; // Responses as persons-by-items matrix (category 0 coded as 1)
  
  // Prior hyperparameters
  int<lower=1> lambda_prior;       // Half the length of each lambda_raw vector
  real alpha_prior_mu;             // Location of the lognormal prior on the discriminations
  real<lower=0> alpha_prior_sigma; // Scale of the lognormal prior on the discriminations
}

Here are the three parameterizations that have worked best so far:
#7:

parameters {
  // Person Parameters
  vector[I] theta[C];
  
  // Item Parameters
  // Level 1
  vector<lower=0>[K] alpha[C];
  vector[J] alpha_times_beta_raw[C, K];
  
  // Level 2
  vector[K] alpha_times_mu_beta[C];
  
  // Mixture Parameters
  vector[2 * 16] lambda_raw[C];
}

transformed parameters {
  // Mixture Parameters
  simplex[C] lambda;
  
  // Item Parameters
  vector[J] alpha_times_beta[C, K];
  ordered[C] mu_beta;
  
  for (c in 1:C) {
    lambda[c] = dot_self(lambda_raw[c]); // Sum-of-squares
    
    mu_beta[c] = sum(alpha_times_mu_beta[c]) / sum(alpha[c]); // Weighted average
    
    for (k in 1:K) {
      alpha_times_beta[c, k] = alpha_times_mu_beta[c, k] + alpha_times_beta_raw[c, k];
    }
  }
  
  lambda = lambda / sum(lambda);
}

model {
  // Priors ///////////////////////////////////////////
  for (c in 1:C) {
    // Person Parameters
    target += std_normal_lpdf(theta[c]);
  
    // Item Parameters
    target += lognormal_lpdf(alpha[c] | 0, 0.5); // Implies median(alpha) = 1
    
    for (k in 1:K) {
      target += normal_lpdf(alpha_times_beta_raw[c, k] | 0, alpha[c, k]); // Implies beta_raw ~ N(0, 1)
      target += normal_lpdf(alpha_times_mu_beta[c, k] | 0, alpha[c, k]); // Implies mu_beta ~ N(0, 1)
    }
    
    // Mixture Parameters
    target += std_normal_lpdf(lambda_raw[c]);
  }
  
  // Likelihood ///////////////////////////////////////////
  for (i in 1:I) {
    vector[C] ll_by_pers = rep_vector(0, C);
    
    for (c in 1:C) {
      for (k in 1:K) {
        ll_by_pers[c] += categorical_logit_lpmf(
          y[i, k] |
          append_row(0, cumulative_sum(alpha[c, k] * theta[c, i] - alpha_times_beta[c, k]))
        );
      }
    }
    
    target += log_sum_exp(log(lambda) + ll_by_pers);
  }
}

The class size parameter \lambda is reparameterized following this section (https://mc-stan.org/docs/2_18/stan-users-guide/reparameterization-section.html) of the Stan user’s guide. Furthermore, the parameterization of \alpha\beta as a single parameter follows Michael Betancourt’s post on identifiability and led to better estimation results, as I already enquired about in an earlier post (Convergence depending on parameterization of discrimination parameter in the GPCM).
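Concretely, writing z_c for lambda_raw[c] (each entry given a standard normal prior), the transformed parameters block above computes \lambda_c=\frac{z_c^\top z_c}{\sum_{c'=1}^{C} z_{c'}^\top z_{c'}}.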

Since the standard normal prior on \theta determines the scale, I set the priors on \beta and \alpha to be on that same scale. In all fairness, the narrow priors, especially on the \alpha parameters, are necessary to ensure convergence. I also found that adding any kind of hierarchy (e.g. estimating the mean and/or variance of the \alpha parameters) induces instability and hampers convergence. This is why mu_beta, the parameter that prevents label switching, is a transformed parameter and not actually estimated.

fig4_par7.pdf (272.9 KB) shows the traceplot of the \lambda parameter and the traceplot of the step size (truncated after 1000 iterations, since the step size does not change after warmup). Even though this is the best I have got so far on the small dataset (and there are no divergent transitions after warmup), the ESSs and PSRFs are still far from perfect (with 82000 warmup and 82000 sampling iterations):

parameter                      mean  se_mean    sd   2.5%    50%  97.5%  n_eff  Rhat
alpha[2,6]                     1.01     0.02  0.28   0.55   0.98   1.63    199  1.04
theta[1,432]                  -0.86     0.06  1.21  -2.38  -1.34   1.72    404  1.03
theta[2,40]                    0.93     0.07  1.32  -1.70   1.16   2.65    360  1.03
theta[2,155]                   0.96     0.06  1.02  -1.52   1.37   2.12    301  1.03
alpha[2,3]                     0.68     0.01  0.17   0.40   0.66   1.06    267  1.03
alpha[2,7]                     0.86     0.01  0.22   0.52   0.83   1.38    285  1.03
alpha[2,13]                    1.95     0.04  0.64   0.95   1.88   3.44    266  1.03
alpha_times_beta_raw[2,10,4]   0.68     0.05  1.00  -1.19   0.63   2.81    386  1.03
lambda[1]                      0.63     0.00  0.04   0.55   0.63   0.71    231  1.03
lambda[2]                      0.37     0.00  0.04   0.29   0.37   0.45    231  1.03

The other two parameterizations are:

#5:

parameters {
  // Person Parameters
  vector[I] theta[C];
  
  // Item Parameters
  // Level 1
  vector<lower=0>[K] alpha[C];
  vector[J] alpha_times_beta_raw[C, K];
  
  // Level 2
  ordered[C] alpha_1_times_mu_beta;
  
  // Mixture Parameters
  vector[2 * lambda_prior] lambda_raw[C];
}

transformed parameters {
  // Mixture Parameters
  simplex[C] lambda;
  
  // Item Parameters
  vector[J] alpha_times_beta[C, K];
  vector[K] alpha_times_mu_beta[C];
  
  for (c in 1:C) {
    lambda[c] = dot_self(lambda_raw[c]); // sum-of-squares
    
    alpha_times_mu_beta[c] = append_row(alpha_1_times_mu_beta[c], alpha_1_times_mu_beta[c] / alpha[c, 1] * alpha[c, 2:K]);
    
    for (k in 1:K) {
      alpha_times_beta[c, k] = alpha_times_mu_beta[c, k] + alpha_times_beta_raw[c, k];
    }
  }
  
  lambda = lambda / sum(lambda);
}

model {
  // Priors ///////////////////////////////////////////
  for (c in 1:C) {
    // Person Parameters
    target += std_normal_lpdf(theta[c]);
  
    // Item Parameters
    target += lognormal_lpdf(alpha[c] | alpha_prior_mu, alpha_prior_sigma);
    target += normal_lpdf(alpha_1_times_mu_beta[c] | 0, alpha[c, 1]);
    
    for (k in 1:K) {
      target += normal_lpdf(alpha_times_beta_raw[c, k] | 0, alpha[c, k]);
    }
    
    // Mixture Parameters
    target += std_normal_lpdf(lambda_raw[c]);
  }
  
  // Likelihood ///////////////////////////////////////////
  for (i in 1:I) {
    vector[C] ll_by_pers = rep_vector(0, C);
    
    for (c in 1:C) {
      for (k in 1:K) {
        ll_by_pers[c] += categorical_logit_lpmf(
          y[i, k] |
          append_row(0, cumulative_sum(alpha[c, k] * theta[c, i] - alpha_times_beta[c, k]))
        );
      }
    }
    
    target += log_sum_exp(log(lambda) + ll_by_pers);
  }
}

The traceplot is in fig5_par5.pdf (270.7 KB), with posterior summary:

parameter                  mean  se_mean    sd   2.5%    50%  97.5%  n_eff  Rhat
lambda[1]                  0.66     0.00  0.05   0.56   0.67   0.74     96  1.07
lambda[2]                  0.34     0.00  0.05   0.26   0.33   0.44     96  1.07
alpha_times_beta[1,10,4]  -0.39     0.02  0.24  -0.89  -0.39   0.05    175  1.04
alpha_times_beta[1,14,2]  -1.19     0.03  0.39  -2.01  -1.16  -0.52    167  1.04
alpha[1,8]                 1.15     0.01  0.18   0.81   1.15   1.50    264  1.03
alpha[2,6]                 0.79     0.02  0.24   0.43   0.75   1.38    211  1.03
alpha_1_times_mu_beta[1]  -0.94     0.01  0.20  -1.39  -0.92  -0.59    274  1.03
alpha_times_beta[1,4,4]   -1.26     0.01  0.21  -1.71  -1.26  -0.87    234  1.03
alpha_times_beta[1,6,4]   -0.76     0.01  0.21  -1.19  -0.75  -0.38    191  1.03
alpha_times_beta[1,7,3]   -1.42     0.02  0.36  -2.22  -1.39  -0.79    213  1.03

and #22:

parameters {
  // Person Parameters
  vector[I] theta[C];
  
  // Item Parameters
  // Level 1
  vector<lower=0>[K] alpha_times_sigma[C];
  vector[J] alpha_times_beta_free[C, K - 1];
  vector[J - 1] alpha_times_beta_free_K[C];
  
  // Level 2
  vector[K] alpha_times_mu_theta[C];
  
  // Mixture Parameters
  vector[2 * lambda_prior] lambda_raw[C];
}

transformed parameters {
  // Person Parameters
  ordered[C] mu_theta;
  
  // Item Parameters
  vector[J] alpha_times_beta[C, K];
  
  // Mixture Parameters
  simplex[C] lambda;
  
  for (c in 1:C) {
    real alpha_times_beta_sum = 0;
    for (k in 1:(K - 1)) {
      alpha_times_beta_sum += -sum(alpha_times_beta_free[c, k]);
      alpha_times_beta[c, k] = alpha_times_beta_free[c, k];
    }
    alpha_times_beta_sum += -sum(alpha_times_beta_free_K[c]);
    alpha_times_beta[c, K] = append_row(alpha_times_beta_free_K[c], alpha_times_beta_sum);
    
    mu_theta[c] = sum(alpha_times_mu_theta[c])/sum(alpha_times_sigma[c]); // Weighted average
    
    lambda[c] = dot_self(lambda_raw[c]); // Sum-of-squares
  }
  
  lambda = lambda / sum(lambda);
}

model {
  // Priors ///////////////////////////////////////////
  for (c in 1:C) {
    // Person Parameters
    target += std_normal_lpdf(theta[c]);
  
    // Item Parameters
    target += lognormal_lpdf(alpha_times_sigma[c] | alpha_prior_mu, alpha_prior_sigma);
    
    for (k in 1:K) {
      target += normal_lpdf(alpha_times_beta[c, k] | 0, alpha_times_sigma[c, k]);
      target += normal_lpdf(alpha_times_mu_theta[c, k] | 0, alpha_times_sigma[c, k]);
    }
    
    // Mixture Parameters
    target += std_normal_lpdf(lambda_raw[c]);
  }
  
  // Likelihood ///////////////////////////////////////////
  for (i in 1:I) {
    vector[C] ll_by_pers = rep_vector(0, C);
    
    for (c in 1:C) {
      for (k in 1:K) {
        ll_by_pers[c] += categorical_logit_lpmf(
          y[i, k] |
          append_row(0, cumulative_sum(alpha_times_mu_theta[c, k] + alpha_times_sigma[c, k] * theta[c, i] - alpha_times_beta[c, k]))
        );
      }
    }
    
    target += log_sum_exp(log(lambda) + ll_by_pers);
  }
}

This is basically #7, except that I reduce the dimension by one by introducing a sum-to-zero constraint on the \beta parameters and therefore shift the intercept to the ability parameters (alpha_times_mu_beta is renamed alpha_times_mu_theta and now enters the linear predictor with a plus sign instead of being folded into the subtracted alpha_times_beta). I am well aware that I have one prior “too many”, since the very last entry of alpha_times_beta is a transformed parameter, but frustratingly and paradoxically, in the other parameterizations I tried as well, the “naive” version always works better than the mathematically correct one. In another model I parameterized \alpha_k=\alpha_1\delta_k and put priors on the transformed parameters \alpha_k, but the estimation became a lot more unstable when I added the log determinant Jacobians of the transform (see the sketch below the summary table). Anyway, trace plots are in fig6_par22.pdf (139.1 KB) and the posterior summary is here:

parameter                       mean  se_mean    sd   2.5%    50%  97.5%  n_eff  Rhat
alpha_times_sigma[1,13]         2.02     0.09  0.71   0.95   1.92   3.67     59  1.08
theta[1,101]                    0.93     0.13  1.18  -1.66   1.46   2.33     77  1.07
alpha_times_sigma[1,3]          0.70     0.02  0.17   0.42   0.68   1.10     59  1.07
alpha_times_sigma[1,5]          0.92     0.03  0.23   0.55   0.89   1.45     63  1.07
alpha_times_sigma[1,6]          1.06     0.04  0.30   0.60   1.01   1.77     69  1.07
alpha_times_beta_free[1,10,3]   0.69     0.10  0.91  -0.99   0.65   2.61     81  1.07
alpha_times_beta[1,10,3]        0.69     0.10  0.91  -0.99   0.65   2.61     81  1.07
alpha_times_sigma[2,11]         1.31     0.02  0.22   0.94   1.29   1.79    104  1.06
alpha_times_beta_free[1,10,4]   0.40     0.10  0.99  -1.50   0.38   2.40     93  1.06
alpha_times_beta[1,10,4]        0.40     0.10  0.99  -1.50   0.38   2.40     93  1.06

So it seems like reducing the dimension is actually hurting the estimation and not facilitating it.
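For reference, here is a minimal single-class sketch of the \alpha_k=\alpha_1\delta_k reparameterization with the Jacobian adjustment mentioned above (illustrative only, not the exact model I ran):

data {
  int<lower=2> K; // No. items
}
parameters {
  real<lower=0> alpha_1;        // Discrimination of item 1
  vector<lower=0>[K - 1] delta; // Ratios alpha_k / alpha_1 for k = 2..K
}
transformed parameters {
  vector<lower=0>[K] alpha = append_row(alpha_1, alpha_1 * delta);
}
model {
  // The prior is stated on the transformed alpha, so the change of variables
  // delta_k -> alpha_k = alpha_1 * delta_k contributes |d alpha_k / d delta_k| = alpha_1
  // for each k >= 2:
  target += lognormal_lpdf(alpha | 0, 0.5);
  target += (K - 1) * log(alpha_1); // log |det Jacobian|
}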

4. More data

To my disappointment, however, when I fit these models to the bigger dataset (1000 subjects), the estimation becomes even more unstable and there are chains in which one class collapses (chains 2 and 4 for parameterization #7, fig7_par7_M.pdf (476.3 KB)). The runs for #5 and #22 have not finished yet, but fig8_par22_M.pdf (476.7 KB) shows that, even though no chain collapses for #22 (chain 6 is not yet finished), the chains look worse than on the smaller dataset.

5. Questions

So of course my main question is:

  • Can someone come up with another parameterization or prior specification (I guess the main candidate here would be \alpha) to improve convergence?

Furthermore, I would be glad for some insight on:

  • What happens in Figs. 1-3 when the step size is almost zero, and why can the chain sometimes work its way out of this essentially constant state?

  • Where do the little “detours” of chain 1 in par. 5 (fig. 5) and of chain 2 in par. 22 (fig. 6) come from, and how can I prevent them?

  • Why do things behave (to my mind) unintuitively - i.e. reducing the dimension worsens the fit, more data worsens the fit, etc.? I have a basic understanding of the examples in the relevant blog posts, but those mostly deal with very simple models, so it is not trivial to transfer the intuitions given there to my kind of model (where every new row of data also introduces a new parameter).

I would be very glad for any advice, input or questions, since I am very much at my wits’ end.

Thank you!

Nice

I guess you’ve read all the things and it goes almost without saying, but getting MCMC stuff to mix with mixture models is hard. Here’s a recent thing by @yuling on trying to get around this by focusing on predictive performance that might be relevant: https://arxiv.org/abs/2006.12335 .

So your complicated base model works and you add in the mixture component and it doesn’t work.

If your base model is A and your complicated model is B, can we add some models between the two and try them?

For instance, what happens if you drop the ‘c’ index from \alpha, \theta, and \beta? Then the only mixture component parameter is \lambda_c. Are you able to simulate and recover parameters from that model? If so, maybe then there’s hope of specifying stronger priors or something on them to get the full model working.

Also do a non-centered parameterization for these (it’s just usually the right decision):

target += normal_lpdf(alpha_times_beta_raw[c, k] | 0, alpha[c, k]);
target += normal_lpdf(alpha_times_mu_beta[c, k] | 0, alpha[c, k]);

More data can separate the modes more strongly and make it hard for MCMC to hop around.

Thanks a lot for your replies!

That is an interesting read. However, do you think this solution applies in my case? It seems like a method for getting inferences from chains that behave well when taken separately but explore different modes of the posterior. In my case, however, the chains all sample from roughly the same region of the posterior, just too fuzzily. For example, the split Rhats for the \lambda_1 parameter in model 7 are, per chain:

1.05 1.00 1.03 1.00 1.01 1.01 1.03 1.04

This is something I should explore again. Interestingly, when I started out I restricted the \theta parameters to be equal across groups (i.e. dropping the c). Loosening that restriction, however, vastly improved the fit; this is how I got Rhats even close to 1.

Would you mind expanding on that a little? I suppose you mean reparameterizing so that there are no model parameters in the priors? This would inevitably mean splitting the \alpha off from alpha_times_.... such that we have

target += normal_lpdf(beta_raw[c, k] | 0, 1);
target += normal_lpdf(mu_beta[c, k] | 0, 1);

and then alpha * (mu_beta + beta_raw) in the model. But combining them was done to improve the fit in the first place, as I asked about in this thread (Convergence depending on parameterization of discrimination parameter in the GPCM). Is there a way to both non-center and remove the multiplicative identifiability issue?

Thanks again for your help!

Don’t know, honestly. Are you trying to make predictions? If so, maybe. If you really care about estimating the parameters in your model, probably less so (and you’re better served by trying to reparameterize or find a new model).

That is interesting

Sure, what I meant was equivalent to what you had previously.

Did the combined parameterization improve the fit?

Also the combined parameterization you have is equivalent to the old one. What if you did something like alpha_times_beta_raw ~ normal(0, 1) and axed the alpha?
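Something roughly like this in #7’s model block (just a sketch of what I mean, untested):

// Give the product parameter a fixed unit scale instead of scaling it by alpha:
for (c in 1:C) {
  for (k in 1:K) {
    target += std_normal_lpdf(alpha_times_beta_raw[c, k]); // was normal(0, alpha[c, k])
  }
}
// and leave the likelihood, which is built from alpha_times_beta, unchanged.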