IRT modeling problems

Hi all -

So I have a quick question regarding item response theory (IRT) models. I’ve found Stan (to be clear: HMC) to be impossibly slow with them. I’ve tried every reparameterization trick in the book, to no avail. My chains get stuck at “sampling 1/2000” and never make progress, even after hours (or overnight). I know I’m trying this on a large amount of data - over 5 million observations. But still.

However, when I use Stan’s ADVI vb() function, the model runs super fast and the ELBO converges quickly. The results match those from other sources, so I’m confident the model is working properly. Now, I know that variational approximations are just that - approximations - but I don’t get why vb() just iterates right along while HMC won’t budge.

I’m pasting a version of the model I’m trying to run. To be clear: I’m running this on legislative roll call data.

data {
  int<lower=1> M;                   // total observations (legislator-vote pairs)
  int<lower=1> N;                   // num legislators
  int<lower=1> J;                   // num votes
  int<lower=0, upper=1> y[M];       // votes - binary 0,1
  int<lower=1, upper=N> memID[M];   // member indicator
  int<lower=1, upper=J> voteID[M];  // vote indicator
}

parameters {
  vector[J] beta;                   // discrimination
  vector[J] alpha;                  // difficulty
  vector[N] theta;                  // ability
}

model {
  // priors from Clinton, Jackman, Rivers
  // tried hierarchical versions to no avail
  beta ~ normal(0, 5);
  alpha ~ normal(0, 5);
  theta ~ std_normal();

  // elementwise product (.*), with the stray semicolon removed
  y ~ bernoulli_logit(beta[voteID] .* theta[memID] - alpha[voteID]);
}


Let me also add: I’ve tried subsetting the data to 1/20th the full size. It still chugs pretty badly. The sampler gets to the 10th or 15th iteration and then stalls forever. If I set “warmup = FALSE” it runs without problems.

I thought I had read somewhere that exhaustive permutation testing was O(2^n), but this paper suggests that for MCMC it’s probably O(n^2), noting that “polynomial time computing of Bayesian and quasi-Bayesian estimators by MCMC is in fact implied by the CLT conditions”. It could be that in the O(2^n) result, n is the number of parameters - I’d have to check.

Actually, it was Sivia and Skilling who say that brute force and ignorance is O(X^N) for X data points and N parameters.

Thanks for the citation; very helpful indeed. Yeah, I’m just baffled because, though I love Stan, I can’t for the life of me understand why this slowness with IRT models isn’t discussed anywhere. There are vignettes on IRT in Stan from the edstan folks and countless others, but nobody seems to acknowledge that it’s painfully slow for anything beyond the most trivial examples.

I did some simulation studies (to make sure my code wasn’t to blame). If I set the number of legislators (or, in the testing context, test-takers) to 25 and the number of votes (or items) to 100, the sampler runs along nice and fast. If I up these numbers to 50/250, respectively, things get much, much slower. Once we get to 100/500 (something like the US Senate), it slows even more, and if we use the US House (435 legislators, 1000+ votes), it just dies. What do I mean? I mean it sits at something like iteration 10 or 15 for 8-12 hours. I never let it go much longer than that because I have better things to do with my laptop (haha).

On the positive side, I have found that the variational approximation in the vb() function works wonderfully. It’s pretty fast, and the model parameter estimates match Gibbs sampler results (produced through custom C++ samplers and the like).

I’m wondering if the map_rect() feature could be helpful here. Maybe dividing the data up into shards would make the sampling more efficient.
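Something like the following is what I have in mind - a rough, untested sketch assuming the data have been pre-packed into S equal-sized shards (so M = S * K); the packing layout in x_i is my own convention, not anything built into Stan, and actually getting a parallel speedup would require building with threading (STAN_THREADS plus the STAN_NUM_THREADS environment variable) or MPI:

functions {
  // Per-shard log likelihood. phi packs all model parameters:
  // theta (N entries), then beta (J), then alpha (J).
  // x_i packs: N, J, K, then y, memID, voteID for this shard's K rows.
  vector shard_ll(vector phi, vector theta_shard, real[] x_r, int[] x_i) {
    int N = x_i[1];
    int J = x_i[2];
    int K = x_i[3];
    int y[K] = x_i[4:(3 + K)];
    int memID[K] = x_i[(4 + K):(3 + 2 * K)];
    int voteID[K] = x_i[(4 + 2 * K):(3 + 3 * K)];
    vector[N] theta = phi[1:N];
    vector[J] beta = phi[(N + 1):(N + J)];
    vector[J] alpha = phi[(N + J + 1):(N + 2 * J)];
    // return the shard's total log likelihood as a length-1 vector
    return [bernoulli_logit_lpmf(y | beta[voteID] .* theta[memID] - alpha[voteID])]';
  }
}

data {
  int<lower=1> N;
  int<lower=1> J;
  int<lower=1> S;                       // number of shards
  int<lower=1> K;                       // observations per shard (M = S * K)
  int<lower=0, upper=1> y[S, K];
  int<lower=1, upper=N> memID[S, K];
  int<lower=1, upper=J> voteID[S, K];
}

transformed data {
  real x_r[S, 0];                       // no real-valued shard data
  int x_i[S, 3 + 3 * K];
  vector[0] theta_shard[S];             // no shard-local parameters
  for (s in 1:S) {
    x_i[s, 1] = N;
    x_i[s, 2] = J;
    x_i[s, 3] = K;
    x_i[s, 4:(3 + K)] = y[s];
    x_i[s, (4 + K):(3 + 2 * K)] = memID[s];
    x_i[s, (4 + 2 * K):(3 + 3 * K)] = voteID[s];
    theta_shard[s] = rep_vector(0, 0);
  }
}

parameters {
  vector[J] beta;
  vector[J] alpha;
  vector[N] theta;
}

model {
  beta ~ normal(0, 5);
  alpha ~ normal(0, 5);
  theta ~ std_normal();
  target += sum(map_rect(shard_ll, append_row(theta, append_row(beta, alpha)),
                         theta_shard, x_r, x_i));
}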


For ADVI to be reasonably accurate, and for Gibbs samplers to be fast and accurate, the posterior must be relatively Gaussian, in which case dynamic HMC will have no problem fitting in a reasonable amount of time. The contrast in performance that you are seeing strongly indicates that your posterior is much more complex, and the ADVI/Gibbs results should be taken with at least a bit of hesitation. It is very likely that these methods are currently underreporting uncertainty.

As you add more data in IRT models you add more parameters, increasing the risk of weak identifiability that leads to nasty posteriors. At the same time, more data more strongly inform the posterior - if the model is well identified this leads to a nicer geometry, but if the model misfits the true data generating process then the geometry gets worse as the posterior concentrates around a complex subspace of equally bad fits. The increased data also complicate any hierarchical structure you might have, requiring care in parameterizing the latent structure.

The slowdown in Stan isn’t an inconvenience but rather a cry for help from your model implementation: it’s not up to the task of fitting the data you’re giving it. To understand exactly what is going on, you’re going to have to dig deeper into the fit results, looking at pairs plots for signs of poor identifiability and following divergences to regions of strong curvature.


Thanks for the reply, Michael. I do get your point, and it’s not that I disagree with you, but the reality is that these problems happen irrespective of the data I use. Even when I run Monte Carlo studies and simulate the data using reasonable values, as N (individuals) and J (items) \rightarrow\infty, Stan grinds to a halt. I’ve tried multiple datasets, real and simulated, and the result is the same every time: for N > 100 and J > 500, Stan just dies. I’ve tried replications of IRT case studies in Stan and the same thing happens time and time again.

I’ve really tried everything - centering/not centering, hierarchical/non-hierarchical priors, etc. (see the sketch below for what I mean by non-centering). I don’t dispute that the posterior geometry might be weird, but why then do out-of-the-box Gibbs samplers based on data augmentation (e.g. MCMCirt1d from MCMCpack and ideal from pscl) work for these cases? Similarly, I’ve been experimenting with greta, and it’s much better able to handle the scale of data I’m working with here.
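(For reference, by “non-centering” I mean something like the following for a hierarchical prior on the difficulties - a sketch with placeholder hyperparameter names, not my exact code:)

parameters {
  real mu_alpha;                      // hypothetical hyperparameters
  real<lower=0> sigma_alpha;
  vector[J] alpha_raw;                // standardized difficulties
}

transformed parameters {
  // non-centered: alpha is a deterministic transform of alpha_raw,
  // so the sampler works on a fixed standard-normal scale
  vector[J] alpha = mu_alpha + sigma_alpha * alpha_raw;
}

model {
  mu_alpha ~ normal(0, 5);
  sigma_alpha ~ normal(0, 2.5);       // half-normal via the lower bound
  alpha_raw ~ std_normal();
}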

One of the things I’ve considered here is that IRT models with large N and J involve a lot of embarrassingly parallel calculations. However, RStan only runs on a single thread per chain. So my guess is that when N*J is large (i.e. 500k or more), the single-threading is really killing the CPU.


RStan can multithread! Look: https://www.google.com/search?q=rstan+multithreading

I also just recently came across an interesting back-of-the-envelope calculation in the vignette for the R package CFC:

“in Bayesian settings where posteriors are approximated by MCMC samples, each combination of observation and cause will have as many integrals as samples. For example, in a competing-risk analysis with 3 causes, 1000 observations, and 5000 posterior samples, 3 × 1000 × 5000 = 15 million cumulative incidence functions must be calculated.”

Hey Adam,

You may need a constraint of some kind on the discrimination.
One problem may simply be that the latent scores’ signs are ill-determined without one.

You can have logit(y) = .4 + .7*(-2) or .4 - .7*(2) and get the same value: flipping the signs of both the discrimination and the ability leaves the likelihood unchanged. This is typically resolved either by using a very strong directional prior on the discrimination, or by a positivity constraint on the discrimination. I recommend the latter (vector<lower=0>[J] beta), as sketched below.

Without some sort of constraint, you will get a symmetric, multimodal posterior, and it may simply be impossible for the sampler to adapt to it adequately.
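Concretely, the constrained version of your model would look something like this (a minimal sketch; note the normal(0, 5) prior on beta effectively becomes a half-normal once the lower bound is in place):

parameters {
  vector<lower=0>[J] beta;    // positivity constraint removes the sign-flip mode
  vector[J] alpha;
  vector[N] theta;
}

model {
  beta ~ normal(0, 5);        // half-normal, given the lower=0 bound
  alpha ~ normal(0, 5);
  theta ~ std_normal();
  y ~ bernoulli_logit(beta[voteID] .* theta[memID] - alpha[voteID]);
}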

Edit: Also, the reason Gibbs samplers may ‘work’ for this is merely that they get stuck. HMC is much better at exploring the posterior, whereas Gibbs will just find a mode and stick to it. VB will probably either give you nonsense results (a mean of zero on the discrimination, because it sits between the symmetric modes) or will randomly switch between positive and negative discriminations on different runs.


This is correct. Just because VB and Gibbs give you answers doesn’t mean you can trust those answers as faithful representations of the true posterior distribution.

Keep in mind that the cost of the HMC sampler underneath Stan scales as the number of leapfrog steps times the cost of a gradient evaluation. Both of these vary with dimension, the former in surprising ways.

Increasing N and J increases the cost of the gradient calculation, so even if the number of leapfrog steps stays constant, Stan will take longer to run, as @increasechief notes. To verify this you can run with smaller N and J, increase each a little at a time, and track the time it takes to finish. If the time scales as N * J, then you know the cost is dominated by the growing gradient cost, and you can look into MPI parallelization.
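In symbols, a rough back-of-the-envelope version of this (with L the average number of leapfrog steps per transition, and assuming every legislator votes on every item, so M = N \times J):

$$\text{time per transition} \approx L \times \text{cost}(\nabla \log p), \qquad \text{cost}(\nabla \log p) = O(M) = O(N J)$$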

If the cost grows faster than N * J, however, then you have a modeling problem, and the increased data are just magnifying the consequences of that problem. As @Stephen_Martin notes, IRT models have well-known non-identifiabilities that require care to moderate, and the precision required of that care grows with increasing data. In other words, problems that can largely be ignored for small datasets quickly become significant as the data grow.


Thanks to everybody for the help with all of this. I’ve done a lot of playing around and I think I’ve figured out what’s going on. The good news is that it seems to have nothing to do with the IRT aspect of the model and more to do with the fact that my dataset is huge.

I stripped my model down to just a random intercept for votes (items) and that’s it. In other words,

alpha ~ normal(0, 5);
y ~ bernoulli_logit(alpha[jj]);

That’s it. This model chugs along just as badly as my previous ones. I also tried running a model like this in brms and found the same painfully slow runtimes.
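(For completeness, here is the stripped-down model written out as a full program - a sketch that reuses the data layout from my original model, with jj as the vote/item index:)

data {
  int<lower=1> M;                   // total observations
  int<lower=1> J;                   // num votes (items)
  int<lower=0, upper=1> y[M];
  int<lower=1, upper=J> jj[M];      // vote (item) indicator
}

parameters {
  vector[J] alpha;                  // one intercept per vote
}

model {
  alpha ~ normal(0, 5);
  y ~ bernoulli_logit(alpha[jj]);
}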

In sum, I’ve come to the conclusion that the problem is that I’ve got 5 million or so observations and over 12,000 items. If I estimate my full model, I have something like 25,000 parameters. Even the random-intercept model alone has 12,000 parameters. This is not so much a problem with my data or my model; the sheer number of observations and parameters being evaluated is just too large for Stan to handle in a reasonable amount of time (I’ve let it sit for up to 12 hours).

The good news: on smaller datasets of the same sort, the ADVI and NUTS samples look very similar. ADVI works on the larger data and is fairly fast. The results pass the “smell test” too, so I guess I’m stuck with ADVI for data of this magnitude.
