Hi,
I’ve got a use case in which I’m fitting IRT models to a difficult test, so a large number of respondents answer only a few questions correctly. The resulting skew produces a fairly large number of influential outliers, which isn’t ideal, but that’s the data I have to work with. Here’s some typical LOO output:
```
Computed from 2000 by 29330 log-likelihood matrix

         Estimate    SE
elpd_loo  -2587.3  65.3
p_loo       526.2  18.2
looic      5174.6 130.7

Monte Carlo SE of elpd_loo is NA.

Pareto k diagnostic values:
                        Count  Pct.   Min. n_eff
(-Inf, 0.5]  (good)     27591  94.1%  202
 (0.5, 0.7]  (ok)        1649   5.6%  91
   (0.7, 1]  (bad)         88   0.3%  21
   (1, Inf)  (very bad)      2   0.0%  11

See help('pareto-k-diagnostic') for details.
```
What’s happening is that when a respondent answers only one relatively easy question correctly and gets everything else wrong, a large \hat{k} value shows up for that single correct response. I’m pretty sure the reason is that leaving that response out shifts the posterior (and hence the predicted value for that observation) substantially relative to the fit with the response included, so the full-data posterior is a poor importance-sampling approximation to the leave-one-out posterior. Most of the \hat{k} values above 0.7 in the output above belong to respondents who answered only one question correctly. This problem is already known: “Another setting where LOO (and cross-validation more generally) can fail is in models with weak priors and sparse data” (Vehtari, Gelman and Gabry 2016, p. 22).
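In case it helps to see how I’m tying the flagged observations back to respondents, here’s roughly what I’m doing. This is just a sketch: loo_fit is the loo object printed above, and obs_info is a stand-in for a data frame I build with one row per response, in the same order as the columns of the log-likelihood matrix, holding a respondent_id and a 0/1 correct column (my actual names differ).

```r
library(loo)
library(dplyr)

# Pareto k estimate for every response in the 2000 x 29330 log-likelihood matrix
obs_info$khat <- pareto_k_values(loo_fit)

by_respondent <- obs_info %>%
  group_by(respondent_id) %>%
  summarise(
    n_correct = sum(correct),   # how many items this respondent got right
    max_khat  = max(khat),      # worst Pareto k among their responses
    .groups   = "drop"
  )

# Respondents whose worst k-hat exceeds 0.7, broken down by number-correct score:
# nearly all of them have n_correct == 1
table(by_respondent$n_correct[by_respondent$max_khat > 0.7])
```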
Enough respondents answer only one question correctly that I don’t want to drop them all from the data set, but it turns out that only a couple of respondents are driving the worst diagnostics. After dropping the two respondents whose responses have \hat{k} greater than 1.0 above and refitting, I still have three \hat{k} values greater than 0.85.
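For concreteness, the drop-and-refit step looks roughly like this (again just a sketch: dat and obs_info are placeholders for my data objects, and fit_irt() stands in for my actual model-fitting call).

```r
library(loo)

# Indices of responses with k-hat > 1, mapped back to respondent IDs
bad_obs  <- pareto_k_ids(loo_fit, threshold = 1)
bad_resp <- unique(obs_info$respondent_id[bad_obs])

# Drop those respondents entirely and refit the same model
dat_trimmed <- subset(dat, !(respondent_id %in% bad_resp))
fit_trimmed <- fit_irt(dat_trimmed)   # placeholder for my actual fitting call
loo_trimmed <- loo(fit_trimmed)       # assumes a fit object with a loo() method (e.g. brms/rstanarm)
print(loo_trimmed)
```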
So I’m wondering whether it makes sense to proceed with this enterprise of iteratively removing problematic data points, and if so, when to stop (by the time everything was under 0.7, I suspect I would have qualitatively changed the nature of my data set). I recognize that \hat{k} values above 0.7 imply impractically slow convergence of the importance-sampling estimates, meaning (if I read that paper correctly) that I could need a very large number of posterior draws before the elpd estimates stabilize, but just how many draws are we talking? (I’ve put my rough back-of-envelope on that below.) I’m also expecting exact cross-validation to hit the same problem as importance sampling here, since the influential data points are a property of the data rather than of the approximation, so k-fold cross-validation seems to be out as well. So at this point, am I better off using a different measure of model fit such as the DIC? Or by switching to the DIC, am I just covering up problems that PSIS-LOO is making visible?
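For what it’s worth, the back-of-envelope I have in mind for the “how many draws” question is the following. This is my own rough reasoning from the generalized central limit theorem for heavy-tailed importance ratios, not something taken from the paper, and it ignores whatever improvement the Pareto smoothing buys, so please correct me if it’s off. If the importance ratios have a Pareto tail with shape \hat{k} between 0.5 and 1, their variance is infinite, and the error of the importance-sampling estimate should shrink at roughly the rate S^{-(1-\hat{k})} rather than the usual S^{-1/2}, which would mean

$$
\frac{S_{\text{needed}}}{S_{\text{current}}} \approx \left( \frac{\text{error}_{\text{current}}}{\text{error}_{\text{target}}} \right)^{1/(1-\hat{k})}.
$$

At \hat{k} = 0.85, for example, just halving the error would take roughly 2^{1/0.15} \approx 100 times as many draws, versus 4 times as many at the usual square-root rate, which is why I doubt that simply running more iterations is a realistic fix.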
Thanks very much for any insight you could provide! I suspect I’m not the only person out there with sparse data and this kind of loo output. (By the way, I recognize that I’m possibly putting the cart before the horse with this question, and I’m going to do some more posterior predictive checks to verify that my models are appropriately fitting this data set.)