Non-convergence on 2PL IRT model

Hi all,

I would like some help diagnosing the problems with a two-parameter item-response theory (IRT) model that does not converge. This model should be identical to the example 2PL model in the Stan reference manual (section 8.11, p. 140-141), with the difference that my observations are binomially distributed–they represent the number of people in a country c, year t who answered item j “correctly”.

Eventually, I would like to estimate random walks over time for each country. For diagnostic purposes, I pared things down to estimating a global latent variable score for each year. I have about 50 items, and about 20 years, and I observe 2000 country/year/item observations. All in all, my data set contains responses from 2.2 million people. There is much more data in some years than others. Here is the stan file:

data {

  int<lower=0> TT; //number of years
  int<lower=0> J; //number of items
  int<lower=0> CTJ; //number of observed country-year-questions
  int<lower=0> y[CTJ]; //positive responses per country-year
  int<lower=0> n[CTJ]; //responses per country-year


parameters {

  real mu_beta;
  real alpha[TT]; //abilities
  real beta[J]; //difficulty
  real<lower=0> gamma[J]; //discrimination; should be positive


transformed parameters {

  real<lower=0, upper=1> p[CTJ] ;

  for(ctj in 1:CTJ){
    p[ctj] = inv_logit( gamma[ questions[ctj] ] * ( alpha [ years[ctj] ] -  beta[ questions[ctj] ] ) );


model {

  //years are independent
  for(t in 1:TT){
    alpha[t] ~ normal(0, 1);

  //weakly informative non-hierarchical priors
  //for question difficulty and discrimination
  beta ~ normal(mu_beta, 5);
  gamma ~ lognormal(0, 2);
  mu_beta ~ cauchy(0, 5);

  y ~ binomial( n, p );


This model has a hard time converging. With 4 chains of 1000 iterations, of which 100 are warm-up, I get plenty of divergent transitions, and the median effective sample size for the alpha parameters is 4.2. When I look at the trace plots, 2 out of 4 chains have an acceptance rate of almost zero. This does not bode well for my plans to estimate a more complicated model on this data.

Happy to provide the data if necessary. Am I doing something wrong?



The easiest thing to do is to just run it with 1000 warmup and 1000 draws. Want to try that first? 100 warmup may not be enough to estimate the mass matrix of the unconstrained parameters well. If that fixes it, then that’s simple enough.

If that doesn’t work, report back with r hats and what paremeters aren’t estimating properly.

Quick question: have you run the model on simulated data?

The easiest thing to do is to just run it with 1000 warmup and 1000 draws. Want to try that first? 100 warmup may not be enough to estimate the mass matrix of the unconstrained parameters well. If that fixes it, then that’s simple enough.

If that doesn’t work, report back with r hats and what paremeters aren’t estimating properly.

Quick question: have you run the model on simulated data?

Sure, thank you! I have not fit the model on simulated data yet. I did not think it necessary because it is so close to a model that’s in the manual.

With 1000 warm-up, 1000 interations, for the alpha
(ability) parameters, I get the following Rhats and effective sample sizes
for the alpha, beta and gamma parameters:

Min. 1st Qu. Median Mean 3rd Qu. Max.
1.029 1.742 2.432 9.297 3.608 326.600

Min. 1st Qu. Median Mean 3rd Qu. Max.
2.001 2.182 2.551 5.390 3.405 61.270

And here is a breakdown of the median Rhat/n_eff by parameter group:

tapply(summ_stan$Rhat, summ_stan$parameter, median)
alpha beta gamma mu_beta
2.564598 2.255598 2.631457 2.541527

tapply(summ_stan$n_eff, summ_stan$parameter, median)
alpha beta gamma mu_beta
2.299584 2.858157 2.458137 2.449047

And here are the maximum Rhat and minimum n_eff:

tapply(summ_stan$Rhat, summ_stan$parameter, max)
alpha beta gamma mu_beta
6.202956 18.365058 326.582868 2.541527
tapply(summ_stan$n_eff, summ_stan$parameter, min)
alpha beta gamma mu_beta
2.044987 2.005745 2.001132 2.449047

Is that what you meant by “which
parameters aren’t estimating properly”?



Thanks. That was useful, but not quite enough information to give you specific advice.

In general, I think your priors are pushing the parameters to values that don’t make sense for this sort of IRT model. For example, your gamma terms should be close to 1, alphas and betas should be close to 0. On the logit scale, the term really shouldn’t be greater than 3 in magnitude. Without looking at your output, my guess is that mu_beta is different for different chains and isn’t moving within the chain. This would lead to different scale alphas and gammas being estimated.

I see now that the Stan program is almost directly from the manual. This case study might be useful for you:

Hi Clara –

Identification for IRT models depends on the particular application–there is no one true answer. Probably what is happening is that your chains are converging to mirror images of each other because of “multiplicative aliasing”, i.e., your chains can be multiplied by -1 without changing the relative position of the parameters. To see that, you should either use the shinystan package or the bayesplot package to look at trace plots of your chains. You will see some chains converge to positive values and other chains converge to the same value but negative (i.e., +2 and -2) if you have that problem.

For a review on these issues, I encouraging you to read Gelman et al. and Jackman for a review of identification issues.

For example, you may not want to constrain discrimination to be positive. That comes from the use of IRT to analyze tests because the questions are arranged in increasing order of difficulty, but that may not be the case for your survey.

I’m currently working on an IRT/Stan package that will automate identification, but it’s still a month away from release.

Thank you, Daniel and Robert. I created trace plots for some of the parameters, to help with diagnosing the problem: As you can see, three chains are barely moving. Some further questions and responses:

Daniel: thanks a lot for the case study link. I am working on a run with simulated data right now. I see what you are saying about the priors. But don’t think I understand why, in my case, the priors should be different/stronger than what is suggested in the manual? Is there other info that I could show to help us diagnose?

[update: from my simulation efforts, I understand better your concern about the priors–especially the one on mu_beta seems far too weak, doesn’t it? And you are right, the cauchy prior on gamma too allows for some very unrealistic gamma magnitudes. Does the model in the manual assume that one has far more data, and that stronger priors are not needed for identification?]

Robert: thanks for the ideas and links. Great that you are working on a package to solve these kinds of problems. Looking at my trace plots, it does not seem like we are dealing with a lack identification of the polarity of the latent trait. Would you agree? My understanding was, also, that restricting gammas to be positive is supposed to solve that identification issue, correct? These items should, in fact, all be coded such that responses are positively correlated with difficulty (these are questions about climate change, and a response of “1” always corresponds with the more “concerned” answer).

I’ll get started on reading your references and also will update you on the simulation–meanwhile, do you have any other ideas?

Sure if you set up your data around the positive discrimination assumption then the estimates should be reasonable.

I agree with syclik that the issue could well be mu_beta given that it looks like there are sampler pathologies. Hierarchical parameters can cause these issues. Only one of the chains is mixing at all, the others are stuck at fixed values. Also, the one chain that is moving shows very high autocorrelation and may not be stationary.

The advice I’ve always been given is to start small and build up. Also, you could simulate the data and that would tell you whether it’s a model issue or a data issue.

So try either 1) simulating the data or 2) using a simpler 1 PL model, drop the discrimination term and use a fixed mean for beta, say 0. Also, you could 3) try a negative binomial model in case the addition of a shape parameter helps with fitting over-dispersed outcomes.

Those would be my next steps, although I am not 100% sure what is the problem.

Thanks all. Meanwhile, I ran some simulations. When I simulate data with the same number of observations (and the same missingness pattern) as my data, the model does not converge. However, using narrower priors, the model comes close to convergence at 500 warmup, 1000 iterations. The narrower priors I used are:

beta ~ normal(mu_beta, 5);
gamma ~ normal(1, 1);
mu_beta ~ normal(0, .5);

so, especially the prior for mu_beta got narrowed way down. With 500 warmup and 1000 iterations,this model also gives much better results on my actual data set. Rhats by parameter group:

tapply(summ_interest$Rhat, summ_interest$parameter, summary)

Min. 1st Qu. Median Mean 3rd Qu. Max.
1.004 1.018 1.041 1.035 1.046 1.066

Min. 1st Qu. Median Mean 3rd Qu. Max.
1.001 1.006 1.017 1.021 1.037 1.051

Min. 1st Qu. Median Mean 3rd Qu. Max.
1.000 1.003 1.006 1.009 1.013 1.033

Min. 1st Qu. Median Mean 3rd Qu. Max.
1.006 1.006 1.006 1.006 1.006 1.006

And effective sample sizes:

tapply(summ_interest$n_eff, summ_interest$parameter, summary)

Min. 1st Qu. Median Mean 3rd Qu. Max.
53.30 54.62 58.92 119.50 199.30 326.90

Min. 1st Qu. Median Mean 3rd Qu. Max.
53.78 93.82 227.30 223.60 295.30 504.30

Min. 1st Qu. Median Mean 3rd Qu. Max.
110.0 322.0 483.4 465.2 560.8 933.0

Min. 1st Qu. Median Mean 3rd Qu. Max.
717.3 717.3 717.3 717.3 717.3 717.3


Would you agree that this looks good enough to start moving on to more complex models, like one where global latent opinion follows a random walk, or one with country “fixed effects” as well as a global yearly mean?

Sure I think the sampling looks fine. The one issue is that beta[3] looks like it is sampling near a hard boundary in the parameter space at 0. Your betas aren’t constrained around zero so that could be evidence of your model sampling against some other constraint, such as the constraint on gamma. So you might want to re-run without that constraint and see if beta samples better. But in general I don’t think IRT models will tend to have high n_eff because they are estimating latent parameters and there isn’t a whole lot of information about them.

If you want to look at country intercepts, it seems like it’d be best to make your alpha parameter equal to countries and then put a random walk prior on the alphas as you mentioned. The stan manual explains very clearly how to do that.

It’s a very interesting model, please stay in touch as you develop it!

Thank you both! More complex models also look promising as long as I stick with informative priors. For example, a model with continent-level random walks as well as country-level random walks works well.

I think an important lesson learned for me is that simulating data helps, not only to see whether a model is likely to work on a given data set, but also to get more insight in what the priors mean. Once I started simulating, I understood much better which values (esp, for gamma and beta here) might be reasonable.

1 Like

See also for other recommendations.

Sampling from the prior is an extremely powerful tool for understand the consequences of your prior choices. We are working on formalizing its importance in a new paper at the moment and will hopefully have more documentation on the idea as we proceed.