2PL IRT Efficiency Questions

I am testing a series of item response models related to a verbal learning test. The test itself is comprised of 10 words that are read by the participant in 3 different trials. After each trial, they are asked to recall from memory as many words as they can. This particular test presents the 10 words in different orders for each of the trials as well. I can’t share the data, but I’ve defined my variables below.

Variables:

  • Resp - response variable, scored 0 (did not recall) or 1 (recalled) for each word in each trial
  • Item - identification variable, factor (levels 1-10) identifying each word
  • ID - identification variable, factor (levels 1-1478) identifying each participant
  • ItemPos - factor (levels 1-10) to identify what position (order) the item was presented in for that trial
  • TrialID - factor (levels 1-3) to identify which trial the response belongs to
  • LocDep - identification variable defining dependency between items (i.e., correct recall of a word on a later trial is coded to be dependent on recall for the same word on a previous trial) – coding and inclusion in syntax comes from the De Boeck paper on using lme4 for IRT
  • ItemPosDif - a DIF variable coded to examine whether position effects differ between trials – coding and inclusion in synatx comes from the De Boeck paper and here

Just as an example of the syntax, the following is my call for the “base” model being examined:

Priors_Rasch <- 
  prior("normal(0, 3)", class = "sd", group = "ID") +
  prior("normal(0, 3)", class = "sd", group = "Item")

Rasch_1D <- brm(Resp ~ 1 + LocDep + (1 | Item) + (1 + LocDep | ID),
                data = df_long, 
                family = bernoulli(link = "logit"),
                prior = Priors_Rasch,
                iter = 3000, warmup = 1000,
                control = list(adapt_delta = 0.99))

So far, I’ve been using the priors and general model syntax provided by Paul Burkner’s wonderfully helpful publication. My goal is to systematically examine several different models (e.g., Rasch v. 2PL, unidimensional v. multidimensional, no DIF v. DIF, etc). I expect Stan to take a while fitting the models since they are (a) complicated and (b) fairly large (n = 44,340 in long format). The Rasch model shown above takes about 2.5 hours without issues. Running the same model but the 2PL version takes ~26 hours.

I wanted to see whether anyone has suggestions for cutting down the time or improving the efficiency of the model. I have toyed with the idea of brms_multiple() on an EC2 instance from AWS where the dataset is split into 4 and simultaneously estimated, but I didn’t find a huge improvement in modelling time on smaller datasets. I am currently using WAIC to inform model selection (2PL and dimensions for each trial are better than Rasch and unidimensional models, but I haven’t been able to get a “clean” run of the multidimesional 2PL model yet since I’ve only run it with adapt_delta = 0.80).

A related question that I have is whether anyone has experience using something like a horseshoe prior for IRT models. I have seen a couple papers using a horseshoe prior for Bayesian Neural Network IRT models, but none in a general linear mixed model. All the models will have fixed and random effects modeled for all the covariates, so I suspect that being able to cut down on the estimation of the large number of coefficients being computed would be more meaningful for reducing estimation time.

Session info:

  • Operating System: Windows 10, 64 bit
  • brms Version: 2.12.0 (running Stan version 2.19.3 and R version 4.0.0)

Regarding performance, have you considering tightening up the priors some more? The default priors in brms are pretty loose and I see you only changed the group sds. That also might make it possible for you to lower adapt_delta.

Also since this is a bernoulli outcome you want to avoid the situation where your priors are so wide that they’re actually kindof informative in the prior predictives (compare hist(1 / (1 + exp(-rnorm(10000, sd = 3)))) to hist(1 / (1 + exp(-rnorm(10000, sd = 1.5))))).

A horseshoe model will give you more parameters to estimate, though if you have sparsity in your parameters then you should probably try it out.

1 Like

Thank you for your response!

Tightening up the priors was something that I had tried, though as you noted I was only specifying priors for the group sds. I had halved the standard deviations of the priors as you did in the histogram examples, but that actually resulted in divergent transitions on the Rasch model, which hadn’t been a problem under the wider priors.

My thinking is that a prior that is conservative and favors a small effect for each parameter would help the sampling, and it seems that at the very least pulling away prior probability from the tails to some extent may be useful for avoiding having to keep a high adapt_delta. In this case, I’m not too worried about the priors being overly influential because the dataset is decently large, so the conservative priors may just be useful for keeping the sampling sensible (though I may be mistaken on this point). That’s what had made me initially think about the horseshoe as a possibility. My concern is that the initial model estimation is already going to take a long time, so the sensitivity analyses may be even longer if divergences and other sampling problems keep coming up.

Twenty six hours is way too long and likely an issue of model misspecification. Try checking out the edstan package.

That’s been one of my concerns as well, but I didn’t want to just dismiss long compilation times for sign of misspecification before I’d sought out some additional eyes on the framework. My hope was that including the right covariates would eventually resolve the issue, but I also don’t have the time to try out several different ways of accounting for the dependency in the items.

I’ll check edstan out in more detail. It looks like right now the package doesn’t support item covariates, but I’ll check out the model code to see about expanding the framework

That’s surprising. My intuition is that a tighter prior on a brms glm model would just run faster (if making bad predictions).

This makes me think that the model and the data aren’t really fitting that well.

Without knowing what else to do, I’d look at posterior predictions for this model and see if there’s anything weird going on. Like maybe ppc_stat_grouped over the Item groups: Plotting for Bayesian Models • bayesplot ? And there are probably some others that make sense?

What parameters have the lowest Neff values?

So, I reran the model using 3 different priors to see what impact that was having (see below). As far as time, they all took about 1.5 hours. It doesn’t seem like the priors have that much of an influence on the estimates. The mean absolute errors for item coefficients and the 95% credible intervals were all < .001 while the person coefficients and 95% credible intervals were more impacted (MAE ranged from <.01 to 0.03). The posterior predictive checks also seem to reflect minimal influence from the prior (see below).

Priors_Rasch1 <- 
  prior("normal(0, 3)", class = "sd", group = "ID") +
  prior("normal(0, 3)", class = "sd", group = "Item") +
  prior("normal(0, 3)", class = "b")

Priors_Rasch2 <- 
  prior("normal(0, 1.5)", class = "sd", group = "ID") +
  prior("normal(0, 1.5)", class = "sd", group = "Item") +
  prior("normal(0, 1.5)", class = "b")

Priors_Rasch3 <- 
  prior("normal(0, 0.5)", class = "sd", group = "ID") +
  prior("normal(0, 0.5)", class = "sd", group = "Item") +
  prior("normal(0, 0.5)", class = "b")

Only fitting issue that resulted now that there’s a prior over the slope coefficient is in the widest prior model produced 1 divergent transition after warmup. For clarification, I ran default adapt_delta and 3,000 samples.

As far as the lowest Neff values, in all cases it’s the intercept (bulk ESS < 1200 for each model), which is probably because I forgot to change that prior as well. The default brms intercept is student_t(3, 0 ,10), so that needs to be changed as well. The next lowest Neff, though similar in magnitude to the intercept, is the correlation between the intercept and the local dependency predictor (probably because the estimate is 0.96).

1 Like

What’s the average treedepth for this model?

And I guess maybe we should be thinking about the difference in the Rasch and 2PL models? What’s the 2PL model look like here?

Average threedepth in all three are 6

1 Like

I did try to fit the 2PL last night with tighter priors and under a different method for accounting for the dependencies. I thought maybe the issue was with the local dependency parameter, so instead of identifying the items that were expected to be dependent on one another across trials, I tried a multidimensional model where each trial was its own dimension (see code with posterior predictive check as well):

Priors_2PL <- 
  prior("normal(0, 1.5)", class = "b", nlpar = "eta") +
  prior("normal(0, 1.5)", class = "b", nlpar = "logalpha") +
  prior("normal(0, 1.5)", class = "sd", group = "ID", nlpar = "eta") +
  prior("normal(0, 1.2)", class = "sd", group = "Item", nlpar = "eta") +
  prior("normal(0, 1)", class = "sd", group = "Item", nlpar = "logalpha")

TwoPL_3D <- brm(formula = bf(Resp ~ exp(logalpha)*eta,
                             eta ~ 1 + (1 |i| Item) + (0 +  TrialID | ID),
                             logalpha ~ 1 + (1 |i| Item),
                             nl = TRUE),
                data = df_long,
                family = bernoulli(link = "logit"),
                prior = Priors_2PL,
                iter = 3000, warmup = 1000)

The model took just short of 8 hours, and brms spit back warnings about 7 divergent transitions and low bulk ESS. Checking the estimates, the correlations between trial 1 with trials 2 and 3 are the problem (Neff < 500 with estimated correlations of >0.95). I’ll rerun the 2PL model with the new priors and just the local dependency instead of the multidimensional setup so that it’ll be comparable to the Rasch model.

For reference, the priors that I had used from the brms paper on IRT are shown below. These were the ones that I was using when the fitting time was too high:

Priors_2PL <-
prior("normal(0, 5)", class = "b", nlpar = "eta") +
prior("normal(0, 1)", class = "b", nlpar = "logalpha") +
prior("normal(0, 3)", class = "sd", group = "id", nlpar = "eta") +
prior("normal(0, 3)", class = "sd", group = "item", nlpar = "eta") +
prior("normal(0, 1)", class = "sd", group = "item", nlpar = "logalpha")

Let the 2PL model with the local dependency parameter run over night. It took just over 5.5 hours to run, but it spit back lots of errors. Only one parameter had an Neff > 1000 (most were <100), all the Rhat values are greater than 1.04 (largest is 1.22), and there were 1485 divergent transitions after warmup.

So that seems to support my thinking that the local dependency model may not be the best way of specifying this model since the multidimensional IRT model seemed to perform better. The only thing that I would be worried about is whether I specified the 2PL version correction. The local dependency indicator (LocDep) is an item-by-person covariate, so I only included it on the eta (difficulty) parameter and not the alpha (discrimination) parameter (see below) since the former is defined by both item and person effects while the latter is not. I also wonder whether a better prior can be used for the alpha parameter since that’s supposed to be constrained to be positive. Seems like maybe just outright doing a lognormal prior would be more sensible (again, I initially have just followed the model examples in the paper by Paul Burkner).

Priors_2PL <- 
  prior("normal(0, 1.5)", class = "b", nlpar = "eta") +
  prior("normal(0, 1.5)", class = "b", nlpar = "logalpha") +
  prior("normal(0, 1.5)", class = "sd", group = "ID", nlpar = "eta") +
  prior("normal(0, 1.5)", class = "sd", group = "Item", nlpar = "eta") +
  prior("normal(0, 1.5)", class = "sd", group = "Item", nlpar = "logalpha")

TwoPL_1D <- brm(formula = bf(Resp ~ exp(logalpha)*eta,
                             eta ~ 1 + LocDep + (1 |i| Item) + (1 + LocDep | ID),
                             logalpha ~ 1 + (1 |i| Item),
                             nl = TRUE),
                data = df_long,
                family = bernoulli(link = "logit"),
                prior = Priors_2PL,
                iter = 3000, warmup = 1000)