Model selection / LOO for sparse data

Hi,

I’ve got a use case in which I’m fitting IRT models to a difficult test, such that a large number of respondents answer few questions. The skew results in a fairly large number of influential outliers which is not ideal, but that’s the data I have to work with. Here’s some typical LOO output.

Computed from 2000 by 29330 log-likelihood matrix

         Estimate     SE
elpd_loo  -2587.3   65.3
p_loo       526.2   18.2
looic      5174.6  130.7

Monte Carlo SE of elpd_loo is NA.

Pareto k diagnostic values:
                         Count  Pct.   Min. n_eff
(-Inf, 0.5]   (good)     27591  94.1%  202
 (0.5, 0.7]   (ok)        1649   5.6%   91
   (0.7, 1]   (bad)         88   0.3%   21
   (1, Inf)   (very bad)     2   0.0%   11
See help('pareto-k-diagnostic') for details.


What’s happening is that for some respondents who answer only one relatively easy question correctly and get everything else wrong, there’s a large \hat{k} value associated with that single correct response. I’m pretty sure the reason is that leaving out that response causes a significant difference in the predicted Y value compared to the model with the response included. Most of the \hat{k} values above 0.7 in the output above are associated with respondents who answer only one question correctly. This problem is already known: “Another setting where LOO (and cross-validation more generally) can fail is in models with weak priors and sparse data” (Vehtari, Gelman and Gabry 2016, p. 22).

Enough respondents answer only one question correctly that I don’t want to drop them from the data set, but it also turns out that I can get away with dropping just a few respondents. Here’s what I get after dropping the two respondents with \hat{k} values greater than 1.0 in the above LOO output:

I still have three \hat{k} values greater than 0.85, so I’m wondering whether it makes sense to proceed with this enterprise of iteratively removing problematic data points from my model, and if so, when to stop (by the time I got down to 0.7, it seems I would have qualitatively changed the nature of my data set). I recognize I could see “unrealistic convergence times” for \hat{k} values greater than 0.7, meaning (if I read that paper correctly) that I would potentially need a very large number of posterior samples to get normally-distributed parameter estimates, but just how many samples are we talking about? I’m expecting cross-validation to have the same problem as importance sampling in this setting, because of the influential data points, so k-fold cross-validation seems to be out. So at this point, am I better off using a different test of model fit such as the DIC? Or, in using the DIC, am I just covering up problems that PSIS-LOO is making visible?

Thanks very much for any insight you could provide! I suspect I’m not the only person out there with sparse data and this kind of loo output. (By the way, I recognize that I’m possibly putting the cart before the horse with this question, and I’m going to do some more posterior predictive checks to verify that my models are appropriately fitting this data set.)


k-fold cross-validation does not suffer from the importance sampling problem, and I would recommend using k-fold cross-validation in this case.

That different thing would be k-fold cross-validation. DIC doesn’t have good diagnostics, and thus you are correct that DIC would fail, but it would just not tell you that it’s failing. Furthermore, DIC is likely to fail even more, as it’s based on using the posterior mean of the parameters for prediction instead of integrating over the parameters and evaluating the performance of the posterior predictive distribution.

Note that when running k-fold-cv, you can save time by not running those folds where all left-out-observations had good Pareto-k values in PSIS-LOO.
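A minimal sketch of that bookkeeping (in Python, with hypothetical names — `pareto_k` would come from PSIS-LOO and `fold_ids` from your fold assignment; neither is a real loo-package API):

```python
import numpy as np

def folds_needing_refit(pareto_k, fold_ids, threshold=0.7):
    """Return the fold ids whose held-out observations include at least
    one Pareto-k value above `threshold`; only these folds need an
    actual K-fold-CV refit, the rest can reuse the PSIS-LOO results."""
    fold_ids = np.asarray(fold_ids)
    bad = np.asarray(pareto_k) > threshold
    return sorted(set(fold_ids[bad]))

# Toy example: 8 observations, 4 folds, bad k values in folds 1 and 3.
k_hat = [0.2, 0.9, 0.4, 0.3, 0.6, 0.1, 1.2, 0.5]
folds = [0, 1, 1, 2, 2, 3, 3, 0]
folds_needing_refit(k_hat, folds)  # -> [1, 3]
```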

Edit: fixed couple typos


Thank you for the clarification, sir. I may need to do some more research into why cross-validation would fail in models with sparse data, as you mention in the paper – I thought the reason was that removing one or more influential data points from a fold can strongly bias the posterior predicted values, even after averaging over all folds.

I do have a second concern about k-fold cross-validation, which is the high computational cost. In my example above, nearly every high \hat{k} value is associated with a unique respondent, because the influential outliers are people getting only one question right on a test. Let’s say for simplicity’s sake I have 800 respondents and 40 such outliers. In your paper, sir, you write that “if there are more than K problematic i, then we recommend checking the results using K-fold cross-validation” (p. 7). That means I would be using 40-fold cross-validation in this example. If five percent of the respondents in my sample have at least one problematic Pareto-k value, then I would expect the probability of a fold having all left-out observations with good Pareto-k values to be (.95)^{20} = .358, dropping my computational cost to still be about 26 times that of the original run. With 10% of respondents having a problematic Pareto-k value, as in the initial LOO output above, the cost rises to a factor of about 35. In general, with a fraction p of problematic respondents, I’m looking at K - K(1-p)^{\frac{N}{K}} folds (with p = \frac{K}{N} as in the scenario above, this is K - K(1-\frac{K}{N})^{\frac{N}{K}}).
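The arithmetic above can be sketched as follows (the 800 respondents, 40 folds, and problematic fractions are the hypothetical numbers from my example, assuming random fold assignment):

```python
def expected_refit_folds(K, N, p):
    """Expected number of folds containing at least one problematic
    observation: each fold holds N/K respondents, and each respondent
    is problematic independently with probability p."""
    fold_size = N / K
    return K * (1 - (1 - p) ** fold_size)

# Reproducing the numbers in the post:
expected_refit_folds(40, 800, 0.05)  # ~25.7, i.e. about 26 refits
expected_refit_folds(40, 800, 0.10)  # ~35.1, i.e. about 35 refits
```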

I’m sure that k-fold cross-validation is the most accurate method in this case, but I am hoping for a faster approach that is still “good enough,” whatever that means. Since I’m comparing multidimensional model fits, extending the computing time by a factor of 30 would extend a single analysis from a day to a month! That’s just infeasible. My goal is to run each test of model fit within a few hours, which at present is the cost of a single run of Stan with 2000 post-warmup samples and 3 chains. Maybe the approach is to remove influential outliers from the model until K gets small enough. Perhaps I could consider something like K/4-fold cross-validation, where K is the number of observations with problematic Pareto-k values (but we’re still talking about a week for the analysis). Maybe the approach involves holding my nose and accepting Pareto-k values somewhat larger than 0.7 as “good” values. Or perhaps I just hold my nose and use the DIC. (I could also try switching to variational inference, but I hate to do that before exhausting Hamiltonian Monte Carlo.) The question is, given limited resources, which is the least bad option? I realize that in the end the answer to this question may require setting up some simulations and answering it empirically. (Here’s a paper finding that LOO and WAIC outperform the DIC for unidimensional IRT models, with the DIC in third place ahead of the AIC and the BIC. But those tests don’t seem to have been extended to sparse data or multidimensional models yet.)

Thank you all very much!

Hi @rjc10,

Thanks for the detailed reply, it makes it easier for me to continue.

First, check the new cross-validation tutorial video.

I think you could do 10-fold cross-validation. K-fold-CV is completely parallelizable, so the wall-clock time can be the same as running once. If you don’t have access to several cores, you can use cloud services. Even if you ran K-fold-CV serially with 10 folds, it should take about one week, which is not much compared to how long you will probably spend writing the paper.
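As a rough sketch of that parallelization: each fold is an independent fit, so with enough workers (or cloud machines) the wall-clock time is close to a single run. The fitting function below is a hypothetical placeholder for a Stan refit, not a real API:

```python
from concurrent.futures import ProcessPoolExecutor

def fit_and_score(fold_id):
    """Hypothetical stand-in: in practice this would refit the model
    with fold `fold_id` held out and return the summed log predictive
    density of the left-out observations. Here it returns a dummy
    per-fold score so the sketch runs."""
    return -250.0 - fold_id  # placeholder elpd contribution

def kfold_elpd(n_folds=10, n_workers=10):
    # Folds run concurrently in separate processes; the total elpd
    # is the sum of the held-out log predictive densities.
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        return sum(pool.map(fit_and_score, range(n_folds)))
```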

Where did you get 40? If you have more than 10 problematic observations, then use 10-fold-CV.

This assumes random division, but you can use also deterministic divisions. See the video linked above.

Then you need to tell me more about what you would do with the cross-validation result.

1. Several high Pareto-k values already indicate potential model misspecification (see the loo vignette), so you could consider switching to a better model without needing to run cross-validation for this model.
2. When there are high k values, elpd_loo is optimistic. If this model is clearly losing the comparison to another model which has no high Pareto k values, then you probably don’t need a more accurate result.
3. If this model seems to be better than some other model based on elpd_loo, then you need to compute a more reliable estimate, e.g. with K-fold-CV.
4. If you would be using a utility or cost function which has an interpretation calibrated with respect to expert information, you need to compute a more reliable estimate, e.g. with K-fold-CV.

No. That is cheating and will bias your results.

Could be ok in cases 1 and 2 listed above, but not in cases 3 and 4.

Since you are already aware of the problems, that would be cheating and lying.

For this model with influential observations, variational inference is likely to fail, and the diagnostics are not yet available in Stan release versions, meaning more work for you to check whether it would work.

10-fold-CV run in parallel.

Note that LOO and *IC shouldn’t be used to select the best model without taking into account the uncertainty; that is, if the models are close to each other, model averaging should be used instead of selecting one model. See more in Using Stacking to Average Bayesian Predictive Distributions and blog posts 1, 2, and 3.


Dear Dr. Vehtari,

Thank you very much for your detailed reply! I agree that moving ahead with 10-fold cross-validation, preferably in parallel, is the way to go. So, I’ll certainly mark your answer as the solution. I have one rather specific follow-up question that might be helpful to other people here in the forums, and which is very much in the spirit of the original question. First let me respond to your questions, sir.

Done, and thank you for the talk! In particular, thank you for the additional links to further demos and videos.

In inferring the number of necessary folds, due to a lack of practical experience I interpreted a sentence in your paper literally (“If there are more than K problematic i, then we recommend checking the results using K-fold cross-validation”; that’s in “Practical Bayesian Model Evaluation Using Leave-One-Out Cross-Validation and WAIC”, 2016, p. 7). Certainly, 10 could be substituted for K in that sentence given I have about 40 problematic i, but then 8 or 16 would also work, which is why I thought something was wrong with my logic and inferred K = 40. But now I see that in Vehtari and Lampinen 2002, you suggest that “values of k between 8 and 16 seem to have good balance between the increased accuracy and increased computational load.” It sounds like the technical basis for your recommendation of 10-fold cross-validation in the present case is simply that I have at least 10 problematic i and 10-fold cross-validation is a common choice, based at least in part on findings from simulations. Is that right?

Got it. Based on the presentation of balanced k-fold cross-validation, I might consider placing, say, four of my 40 or so problematic values in each of the ten folds in order to avoid doing too much extrapolation.
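A minimal sketch of that balanced assignment, using the hypothetical counts from my example (40 problematic observations out of 800, 10 folds; the function name and interface are made up for illustration):

```python
import numpy as np

def balanced_folds(n_obs, problem_idx, n_folds=10, seed=1):
    """Assign observations to folds so that the problematic ones
    (high Pareto-k) are spread evenly across folds rather than
    left to a purely random division."""
    rng = np.random.default_rng(seed)
    folds = np.empty(n_obs, dtype=int)
    problem = rng.permutation(problem_idx)
    rest = rng.permutation(np.setdiff1d(np.arange(n_obs), problem_idx))
    # Deal both groups round-robin over the folds.
    folds[problem] = np.arange(len(problem)) % n_folds
    folds[rest] = np.arange(len(rest)) % n_folds
    return folds

folds = balanced_folds(800, problem_idx=np.arange(40), n_folds=10)
# Each fold now holds exactly 4 of the 40 problematic observations.
```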

Sure! I have two IRT models, one unidimensional and one multidimensional, and I would like to determine which is a better fit to the data. I expect that both models will suffer from influential outliers due to the skewness of the data, although I don’t actually know that yet and it would be quick to check. I’m considering using elpd_loo to compare the two models, so that definitely implies I need to use K-fold-CV. Here’s my follow-up question (related to the question at 33:30 in the video: https://www.youtube.com/watch?v=hpr8pxqkCH8&index=10&list=PLuwyh42iHquU4hUBQs20hkBsKSMrp6H0J): given the skewness of my data, I’m actually wondering if I need to implement a correction for the bias due to leaving some data points out of the smaller training sets compared to the full data set, as per Vehtari and Lampinen 2002, p. 9 (2447 in the published version). I recognize that it isn’t a step that is traditionally taken when comparing models, because both elpd_loo statistics would be biased and the bias tends to cancel out. But I wonder if I should consider using the correction term in the present case, as the IRT model using an inappropriate number of dimensions would have more influential outliers and thus more potential sources of bias? Perhaps it would be overkill to do so. (And maybe I could answer the question myself with a quick simulation; kind of a lower priority.)

Certainly, sir, but do you have an opinion about the use of the DIC in this setting? Seriously, though, thank you for the candid feedback. I agree it would be cheating for certain, and lying if I didn’t inform the reader of the potential problem.

Thank you! That makes sense in general, sir, except that I’m comparing a unidimensional with a multidimensional model in the present case, so I would say the conclusion here would be to use one or the other model, even if the elpd_loo values are close.

Many thanks for the extremely helpful feedback. I hope the dialogue is beneficial to others.

Yes.

Yes.

Probably not.

Yes.

Yes, that is possible, and you can test that.

My opinion is: don’t use it in any setting.

Then you can read Parsimonious principle vs integration over all uncertainties and consider based on that what you want.

I would also strongly recommend making posterior predictive checks and considering whether you can modify your models so that they can better explain skewed data.
