Priors and constraints in the Stan 2-PL IRT Case Studies

There are two 2-PL IRT model case studies on the Stan site:

Two questions:

  1. In Section 3.2.1 of Case Study 1, a latent regression is added to the 2-PL IRT model. In this model, \gamma is the regression coefficient for a covariate predicting the latent ability’s mean. However, the model block does not assign a prior to \gamma. This seems problematic as the document notes, “For identifiability of the latent regression, the mean ability of females is constrained to zero, and we allow the mean for males to be different from zero – thus, the intercept is not included.” What is the implicit prior here? How does this ensure the identification constraint?

  2. In Case Study 2, isn’t the proposed model much more restrictive than one would want (or is typically used) in an IRT model? For example, the 2-PL with latent regression in Section 2.2:

  • Constrains \beta_I so that the average item difficulty is 0 by fixing it as the negative sum of the other difficulties
  • Constrains \alpha positive with lognormal(1,1) prior
  • Constrains variance of \theta to 1 with the normal prior
  • Constrains the mean AND variance of \lambda with the Student-t priors. Not sure about this one.

Perhaps I am missing something?

Pinging @danielcfurr as author.

It’s been a while since I’ve looked at this, but I’m happy to try to answer your questions.

  1. The prior on gamma is uniform. This is the “default” that happens if we don’t specify a prior in Stan. This prior is bad practice, I’ll admit. Either the abilities or the difficulties need to be anchored somehow, and setting the mean of one ability group but not the other is sufficient.
  2. I agree that the model is restrictive in requiring positive values for \alpha. I haven’t found a way to allow for negative discriminations that works well, so I just try to live with that. I don’t believe that the model is unusually restrictive in the other regards though.
    • There is a constraint on \beta, but in contrast with the other model, the latent regression has an estimated intercept. In effect, the constraint has been moved from the person side to the item side, which isn’t any more restrictive overall.
    • The prior for \theta has a mean determined from the latent regression and the usual 2PL constraint on the variance. I wouldn’t say this is restrictive, though perhaps in some cases you may want the variance to vary.
    • There is a default, weakly informative prior placed on \lambda (the regression coefficient vector) that is based on normalized covariate values. I wouldn’t describe this prior as a constraint–lambda is free to take fairly extreme values.
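
To make the location argument concrete, here is a minimal Python sketch (not taken from the case study; the function names and all numbers are made up for illustration). It checks that re-centering the difficulties to sum to zero, while shifting every ability by the same amount, leaves the 2PL probabilities unchanged. The shift simply moves into the ability means, which is where an estimated intercept can absorb it.

```python
import math

def p_correct(alpha, theta, beta):
    """2PL probability of a correct response: inv_logit(alpha * (theta - beta))."""
    return 1.0 / (1.0 + math.exp(-alpha * (theta - beta)))

def prob_table(alphas, thetas, betas):
    """Rows are persons, columns are items."""
    return [[p_correct(a, t, b) for a, b in zip(alphas, betas)] for t in thetas]

# Hypothetical parameter values, chosen only for illustration.
alphas = [0.8, 1.2, 1.5]
betas = [-0.5, 0.3, 1.1]
thetas = [-1.0, 0.0, 0.7]

original = prob_table(alphas, thetas, betas)

# Re-center difficulties to sum to zero and shift abilities by the same amount.
shift = sum(betas) / len(betas)
betas_centered = [b - shift for b in betas]
thetas_shifted = [t - shift for t in thetas]
recentered = prob_table(alphas, thetas_shifted, betas_centered)

# Largest change in any response probability (should be numerically zero).
max_diff = max(
    abs(x - y)
    for row_a, row_b in zip(original, recentered)
    for x, y in zip(row_a, row_b)
)
print("sum of centered difficulties:", sum(betas_centered))
print("max change in probabilities:", max_diff)
```

Because only the differences \theta - \beta enter the likelihood, one location constraint is needed somewhere, and it does not matter for fit whether it sits on the person side or the item side.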

I can’t say that I’ll add anything technically superior to what @danielcfurr has already clarified, but I would just point out that, in most cases, the Bayesian IRT models will be less constrained than their maximum likelihood counterparts (at least if we consider relatively default or generic models in most R packages – there are many more ways of loosening up certain constraints in specific contexts where there may be better constraints elsewhere).

Most IRT R packages that I’m familiar with will, by default, estimate the model with assumption that \theta \sim N(0, 1) in the population. The final sample estimates for \hat{\theta} may not reflect that, but behind the scenes, the model estimation is assuming that the parameter follows a standard normal distribution. Depending on your model, the Bayesian IRT methods can permit free estimation of the variance of \theta, meaning that one can estimate \theta \sim N(0, \sigma).

In packages like mirt, there are other distributions that are available for estimating \theta, but they all make their own assumptions and impose some degree of constraint because of the inherent unidentifiability of IRT models. In essence, the standard IRT models are factor analyses on tetrachoric (or polychoric) correlations of item-level responses, and just like factor analyses they need to be identified, either by standardizing the latent variable (i.e., assuming the factor follows a standard normal distribution) or by fixing the loading of the first item on that factor to 1. At some level, there needs to be some potentially undesirable assumption about what parameter can be fixed to some value and changed from unknown to known so that there is sufficient available information to identify the model – some methods just do a better job at hiding those decisions from others whereas Bayesian methods and our priors often mean we need to be more explicit about the decision we’ve made and our assumptions behind them.
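
The scale side of this can be written out in one line. The subscripts below (p for persons, i for items) and the constant c are my own notation, not the case studies':

```latex
\alpha_i(\theta_p - \beta_i)
  = \frac{\alpha_i}{c}\,\bigl(c\,\theta_p - c\,\beta_i\bigr)
  \qquad \text{for any } c > 0 .
```

Rescaling every ability and difficulty by c and dividing every discrimination by c gives the same linear predictor, so the data cannot pin down the scale of \theta. Fixing the variance of \theta to 1 removes c one way; fixing one discrimination (the "loading") to 1 removes it another way.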

As far as constraining \alpha to be positive, this was initially a limitation to me until I stopped to think about what the implication really is. If an item has negative discrimination, then it means that individuals who are higher in overall ability are more likely to get it wrong than people with lower levels of ability. This means either one of two things has happened: the item is poorly written and should be thrown out or the item should have been reverse scored. Either of those is actionable regardless of the model. Plotting the ICCs of the items can help to identify items that may have a true negative discrimination as they would theoretically be very close to flat as the model may try to estimate them as close to that negative range as possible. Similarly, one would expect that they exhibit poor item fit, so there are ways of detecting the possibility of negative discrimination.
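
As a rough illustration of what to look for in the ICCs, here is a minimal Python sketch that tabulates 2PL curves for a positive, a near-zero, and a negative discrimination. The function name `icc` and all parameter values are hypothetical, picked only to show the three shapes.

```python
import math

def icc(theta, alpha, beta):
    """2PL item characteristic curve: P(correct | theta)."""
    return 1.0 / (1.0 + math.exp(-alpha * (theta - beta)))

# Ability grid from -3 to 3 in steps of 0.5.
grid = [x / 2.0 for x in range(-6, 7)]

# Hypothetical items, all with difficulty 0.
curves = {
    "positive (alpha = 1.5)": [icc(t, 1.5, 0.0) for t in grid],
    "near zero (alpha = 0.05)": [icc(t, 0.05, 0.0) for t in grid],
    "negative (alpha = -1.5)": [icc(t, -1.5, 0.0) for t in grid],
}

for label, curve in curves.items():
    print(f"{label:26s}", " ".join(f"{p:.2f}" for p in curve))
```

With a positivity constraint the fitted curve cannot take the third shape, so an item that truly behaves that way would tend to show up looking like the second one: nearly flat across the ability range.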

Ultimately, at least as I understand it, there has to be some constraint on \alpha to prevent compensatory sign changes in \theta - \beta. For example, the parameter values in -1.5 * (0 - 1) and 1.5 * (0 + 1) differ only in sign, yet both products equal 1.5, so if you give the model a symmetric prior, then \alpha = -1.5 and \alpha = 1.5 are equally probable and there’s no way for the model to separate which is the “better” estimate when there’s also no difference (because of symmetric priors) between \beta = -1 and \beta = 1. Thinking about what a negative discrimination would imply, it seems more reasonable to constrain \alpha to be positive than to impose a constraint on \beta to identify the model.
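
A quick numerical check of that reflection, as a minimal Python sketch (the helper names and the second set of values are made up for illustration): flipping the signs of \alpha, \theta, and \beta together leaves the log-likelihood of a response untouched.

```python
import math

def inv_logit(x):
    return 1.0 / (1.0 + math.exp(-x))

def log_lik(alpha, theta, beta, y):
    """Bernoulli log-likelihood of response y (0 or 1) under the 2PL."""
    p = inv_logit(alpha * (theta - beta))
    return math.log(p) if y == 1 else math.log(1.0 - p)

# The example from the text: theta = 0, |alpha| = 1.5, |beta| = 1.
ll_negative_alpha = log_lik(-1.5, 0.0, 1.0, 1)
ll_positive_alpha = log_lik(1.5, 0.0, -1.0, 1)

# A second, arbitrary case with a nonzero ability and an incorrect response.
ll_a = log_lik(0.9, 0.4, -0.7, 0)
ll_b = log_lik(-0.9, -0.4, 0.7, 0)

print(ll_negative_alpha, ll_positive_alpha)
print(ll_a, ll_b)
```

Since the likelihood cannot tell the two solutions apart, only the prior (or a hard constraint) can, which is why a prior that is symmetric around zero leaves two mirror-image modes.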


Thanks all for the responses!

@danielcfu The source of uncertainty was that I’m not sure if there’s a need to constrain the variance of \theta to 1 (“the usual 2PL constraint”) given the positivity constraints and the constraint on \beta. Typically we would either fix a loading or the factor variance while imposing constraints on the factor means for identification. The sum-to-zero (a “criterion method”) is also used to constrain the variance of the factor. I guess the constraint doesn’t affect the variance and doesn’t affect the location since the intercept does that.

@wgoette I don’t know of software that sets \theta \sim N(0,1) by default but doesn’t allow you to change the parameter values. Fixing the \theta to those values just means you don’t need to fix other parameters unless you want to override defaults…so I don’t think Bayesian and frequentist models are inherently any more “constrained” although maybe the software requires a couple lines of code more to override defaults.

The positive constraint on \alpha is certainly quite reasonable in most cases, but not all. This is a perfect scenario for a strongly informative prior with almost no mass below 0, rather than a constraint. Certainly it can make sense to “throw out” bad measures, but now we’re conditioning the prior on the likelihood – deeply problematic even if “practical” and unlikely to even be such if priors are registered and you have what you have. Others have discussed this elsewhere on here.
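
One rough way to see what "almost no mass below 0" means in practice is to compute the prior probability of a negative discrimination under a few normal priors. This is a minimal Python sketch; the means and standard deviations are arbitrary illustrations, not recommendations. A lognormal prior, by contrast, puts exactly zero mass below 0, which is what makes it a hard constraint.

```python
from statistics import NormalDist

def mass_below_zero(mean, sd):
    """Prior probability that alpha < 0 under a Normal(mean, sd) prior."""
    return NormalDist(mu=mean, sigma=sd).cdf(0.0)

# Hypothetical priors on alpha: same mean, increasingly tight.
candidates = [(1.0, 1.0), (1.0, 0.5), (1.0, 0.25)]
masses = [mass_below_zero(mean, sd) for mean, sd in candidates]

for (mean, sd), mass in zip(candidates, masses):
    print(f"Normal({mean}, {sd}): P(alpha < 0) = {mass:.6f}")
```

The tighter the prior, the less room it leaves for a negative discrimination, but it still allows one if the data insist strongly enough.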


Ultimately, I agree with your points overall, but it seems like you are speaking more directly to specific use cases rather than a generic 2PL model, which I believe the User Guide aims to present. There are other Stan examples of IRT models such as Paul Burkner’s paper on fitting IRT models in brms and then also the edstan package (GitHub - danielcfurr/edstan). Burkner’s brms paper in particular highlights the differences between “strong” and “weak” identifiability of IRT models in frequentist versus Bayesian estimation.

As I said, I ultimately agree with this point, but the idea of specifying alternative parameters implies customizing a generic model to something more specific to your needs. In this sense, I don’t know that the Stan User Guide examples of IRT models are any different than the models that one would effectively estimate from most IRT software’s default settings, but in cases where those default settings wouldn’t be desired, I’d think that the same would then be true of the examples in Stan.

I think it’s still worth noting that there are sometimes unexpected consequences of changing default settings in some estimation software. For example, see this old discussion post about mirt where Phil Chalmers notes that the package automatically constrains an item parameter to 0 in order for the model to be identified. I can’t say that I follow the mirt R code well enough to know the situations and circumstances where this arises, but I have encountered similar “behind the scenes” issues with other software. Bayesian analyses, and hardcoding an IRT model in Stan in particular, remove any of those surprises since they have to be done more explicitly.

I agree, but this is only a solution in cases where one is comfortable with giving an informative prior. Imposing a constraint can permit more skeptical, weakly informative priors, so in cases where the constraint can be justified (e.g., where a negative discrimination would be a product of bad item writing), then the ability to use less informative priors may be better than the alternative option of highly informative priors but no constraints.

I initially was going to suggest this as an option, but I couldn’t come up with any examples wherein a strongly informative prior wouldn’t be needed. As a result, I didn’t think it was a widely applicable solution to avoiding constraints. I have empirically fit models without any constraints on any of the parameters, but this was only in situations where I had parameter estimates from previous results with the test. In that case, I feel comfortable with the strong priors as they are based on empirical findings rather than my own predictions of how the items function. I’d rather have skeptical and weakly informative priors to start with and then test whether this imposes ill-fitting results as one would typically do in an IRT model first, though.

I mention this only in the generic sense as it may not always be true per se, but since Bayesian analyses allow the use of a prior that can impose “soft” constraints on parameter estimates, it is often possible to fit more parameterized models that would not otherwise be identified in frequentist estimation. If we think of maximum likelihood estimation as Bayesian analyses with proper but flat priors, then it is generally easier to see how using priors to guide estimation to reasonable parameter values can make models more likely to converge (what Burkner called “weak” identifiability). For example, this paper (A General Bayesian Multidimensional Item Response Theory Model for Small and Large Samples - PMC) describes how Bayesian IRT methods can be used to fit fairly highly parameterized models in smaller samples than is typically possible for frequentist methods. Obviously, a lot has been written on IRT identifiability as it is an inherent issue in the model framework (e.g., Model Identification in IRT and Factor Analysis Models, and