Posterior predictive checks for IRT models

Hi,

I’m a relative newbie, so I hope this post isn’t redundant. Beguin and Glas (2001) present some useful posterior predictive plots and test statistics for assessing Bayesian IRT model fit. The plots are not difficult to implement in R, but they seem like the kind of thing that could be packaged and shared, if that hasn’t already happened since 2001. Stan, or at least the Stan community in some capacity, seems like a good place to make that code available. Thoughts? Maybe the answer is as simple as posting an external link in a reply to this thread so that more people can find the code.

My post is also related to this one.

Cheers,

Richard

Off the top of my head, the IRT person I can think of is @saudiwin. I think he maintains an IRT package: https://github.com/saudiwin/idealstan . Maybe he has opinions, suggestions, or pointers to packages if you wanted to contribute something.


Thanks for the shoutout @bbales2 :D! My package does indeed have posterior predictive checks for the IRT models it implements. I’m also coming out with a big update in the next two months that will add a variety of time-series IRT models and also the ability to use hierarchical parameters.

@rjc10, I’m interested to hear more about what kinds of posterior predictive checks or model fit criteria you think are useful. I’ve seen a ton of model fit criteria out there but have never really understood what to use them for.

@saudiwin Thanks, nice to meet you! I’ve read through some of the literature on ideal point models, and while it’s not my area, I found it useful for understanding approaches to model identification when a researcher can bring specific domain knowledge to the table. Thank you; we’re fortunate to have open-source software for these models.

I’m more interested in the educational assessment application of IRT models. The model fit statistic that Beguin and Glas propose involves test score prediction. They plot the number of respondents obtaining each test score in the observed data, use the EAP probabilities that each respondent answers each question correctly to generate an expected test score for each respondent, and then compare the distribution of expected test scores to the distribution of observed test scores, with 95% intervals (central posterior intervals, I think) around the expected frequencies. That plot is particularly useful for me because I’m working with difficult tests, and the resulting data are sparse enough that I’m concerned about influential outliers, so I want to make sure that my models predict the data well in the tails.

Beguin and Glas also propose two \chi^2 test statistics (p. 557), to be compared to each other. For each independent draw of the model parameters from their joint posterior distribution, they compare the observed number of respondents answering each question correctly to the expected number given the model parameters, computing the \chi^2 statistic for that comparison and summing over all items, i.e. \chi^2_o = \sum_{k=1}^K{\frac{(O_k - E_k)^2}{E_k}} for a survey with K items. Then they generate a new data set using the current parameter values and count the correct responses to each item in that replicated data set. Calling that count Rep_k for item k, they compute \chi^2_{rep} = \sum_{k=1}^K{\frac{(Rep_k - E_k)^2}{E_k}}. “The posterior predictive p-value is the proportion of replications where \chi^2_{rep} > \chi^2_o, and the model is rejected when this proportion becomes very small.”
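In case it helps to make the \chi^2 check concrete, here is a minimal R sketch under some assumptions of mine: a Rasch model, and hypothetical objects `y` (an N x K binary response matrix), `theta` (an S x N matrix of posterior ability draws), and `b` (an S x K matrix of posterior difficulty draws). This isn’t from any package, just one way to wire it up:

```r
## Hypothetical inputs (not from any package):
##   y     : N x K binary response matrix (observed data)
##   theta : S x N matrix of posterior draws of abilities
##   b     : S x K matrix of posterior draws of item difficulties
chi2_ppp <- function(y, theta, b) {
  S <- nrow(theta)            # number of posterior draws
  O <- colSums(y)             # observed number correct per item
  chi2_o   <- numeric(S)
  chi2_rep <- numeric(S)
  for (s in 1:S) {
    # P(correct) for every respondent-item pair under draw s (Rasch)
    p <- plogis(outer(theta[s, ], b[s, ], "-"))
    E <- colSums(p)           # expected number correct per item
    chi2_o[s] <- sum((O - E)^2 / E)
    # replicate a data set from draw s and recompute the statistic
    y_rep <- matrix(rbinom(length(p), 1, p), nrow = nrow(y))
    chi2_rep[s] <- sum((colSums(y_rep) - E)^2 / E)
  }
  mean(chi2_rep > chi2_o)     # posterior predictive p-value
}
```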

The \chi^2 test and the plots make a lot of sense for models with sparse data, and I think they would be useful more generally wherever test scores are involved. In my case I will also need to run posterior predictive checks of the assumptions that respondents’ abilities are independent conditional on the item parameters and that item difficulties are independent conditional on respondents’ abilities, but I haven’t yet done the research to find which fit statistics are currently standard for that.
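And here is a rough sketch of the score-distribution plot, reusing the same hypothetical `y`, `theta`, and `b` objects. One simplification to flag: this version simulates replicated score distributions from the posterior draws rather than plugging in EAP point estimates as Beguin and Glas describe, and it takes the 95% band as pointwise quantiles across replications.

```r
score_dist_check <- function(y, theta, b) {
  K <- ncol(y)
  S <- nrow(theta)
  # observed counts of each possible test score 0..K
  obs <- tabulate(rowSums(y) + 1, nbins = K + 1)
  rep_counts <- matrix(0, S, K + 1)
  for (s in 1:S) {
    p <- plogis(outer(theta[s, ], b[s, ], "-"))
    y_rep <- matrix(rbinom(length(p), 1, p), nrow = nrow(y))
    rep_counts[s, ] <- tabulate(rowSums(y_rep) + 1, nbins = K + 1)
  }
  lo  <- apply(rep_counts, 2, quantile, probs = 0.025)
  hi  <- apply(rep_counts, 2, quantile, probs = 0.975)
  mid <- colMeans(rep_counts)
  # observed frequencies (points) vs. predicted (dashed) with 95% bands
  plot(0:K, obs, type = "b", pch = 16,
       xlab = "Test score", ylab = "Number of respondents")
  lines(0:K, mid, lty = 2)
  arrows(0:K, lo, 0:K, hi, angle = 90, code = 3, length = 0.03)
}
```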

It doesn’t look hard to code up these tests, but we shouldn’t all be separately reinventing the wheel, right? Let’s fight for fewer bugs in research papers!