Suppose I am using loo on a hierarchical model where most groups are data-rich and some are data-poor. I expect unreliable performance and bad Pareto-k diagnostics due to the data-poor groups. Suppose that I do indeed have multiple high k’s, but they are confined to the data-poor groups.
I am willing to assume that models that yield better predictive performance across the data-rich groups will also yield better (or not worse) performance across the data-poor groups. I’m thinking about the case where, for example, I model a random intercept and then a single set of universal slopes, and I’m doing comparisons among alternative sets of covariates included in the universal effects. If I’m willing to assume that these effects are identical across groups, then I shouldn’t care too much about dropping a handful of data-poor groups as I explore alternative specifications for these effects. Thus, I am willing to do model comparison using only the rows pertaining to data-rich groups, which in my hypothetical example all have acceptable Pareto-k diagnostics. I have two questions for @avehtari and others:
Is this an obviously terrible course of action? Why?
In a setting like the example above, would it also be fine to just ignore the Pareto k warnings from the full log-likelihood matrix? In other words, (in the limited setting described above) is it possible for the full matrix to yield model comparison results that are radically different from those of the partial matrix? How would that happen?
I think it would be helpful to define “predictive”: are you interested in predicting new observations from existing groups, or new observations from new groups? If the latter, you could compute loo using a likelihood that is marginalized over the random intercepts, in which case the data-poor/data-rich distinction is no longer as relevant (and pareto-k issues tend to go away). Here is a paper describing related issues:
It would help to know whether you plan to make predictions for new groups or for the existing groups (or do not plan to make any predictions at all).
If you used the model to make predictions for new groups, and the data-poor groups don’t differ from the data-rich groups except for the amount of data, you can assume exchangeability among the data-poor, data-rich, and new groups, and then it would be OK to base the estimate on those groups where the predictive performance estimate is more reliable.
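As a rough illustration of basing the comparison on the more reliable groups, here is a minimal sketch in Python. It assumes you already have pointwise elpd values for two models (one value per observation, in the same row order); the function name `compare_elpd`, the toy numbers, and the group-size threshold are all made up for illustration and are not part of any package. The standard error is the usual normal-approximation one, `sqrt(n) * sd(diff)`, computed over whichever rows are kept.

```python
import numpy as np

def compare_elpd(pointwise_a, pointwise_b, keep=None):
    """Return (elpd_diff, se_diff) for model A minus model B.

    pointwise_a, pointwise_b: per-observation elpd values, same row order.
    keep: optional boolean mask selecting the rows to use (needs >= 2 rows).
    """
    a = np.asarray(pointwise_a, dtype=float)
    b = np.asarray(pointwise_b, dtype=float)
    if keep is not None:
        keep = np.asarray(keep, dtype=bool)
        a, b = a[keep], b[keep]
    diff = a - b
    n = diff.size
    elpd_diff = diff.sum()
    # Normal-approximation SE of a sum of n paired differences.
    se_diff = np.sqrt(n * diff.var(ddof=1))
    return elpd_diff, se_diff

# Toy illustration with made-up numbers (not from any real fit).
group = np.array([0, 0, 0, 0, 1, 1, 1, 1, 2])   # group 2 is data-poor
elpd_a = np.array([-1.0, -1.2, -0.9, -1.1, -1.3, -1.0, -1.1, -0.8, -3.0])
elpd_b = np.array([-1.4, -1.5, -1.2, -1.6, -1.5, -1.3, -1.6, -1.1, -2.0])

sizes = np.bincount(group)        # number of observations per group
data_rich = sizes[group] >= 4     # hypothetical threshold for "data-rich"

print("all rows:      ", compare_elpd(elpd_a, elpd_b))
print("data-rich only:", compare_elpd(elpd_a, elpd_b, keep=data_rich))
```

Printing both versions side by side is a cheap way to see whether the ranking of the models depends on the rows that were dropped. Note that the two sums are over different numbers of rows, so the magnitudes are not directly comparable; the sign and the size of the difference relative to its standard error are what matter.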
If you used the model to make predictions for the existing groups, then even if you assume exchangeability, the learning curve for the group-specific parameter(s) can be such that a different model is the best depending on the amount of data for the group (there is an example in the Bayesian hierarchical stacking paper)…
If the differences between the models’ performances are big, then the model comparison results are not sensitive to whether you drop some loo folds or include them despite bad Pareto k’s. Especially in the early part of the workflow, it may be possible to drop some inferior models. If you want to get more accurate computations in the end, you could try moment matching loo, integrated loo, or brute force.
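To make the idea behind integrated loo a bit more concrete: it works with a likelihood in which the random intercept has been integrated out, so each group contributes one marginal log-likelihood term. The sketch below is only an illustration under strong assumptions, namely a Gaussian observation model `y_ij ~ Normal(eta_ij + a_j, sigma)` with `a_j ~ Normal(0, tau)`, evaluated for a single posterior draw, with the integral over `a_j` approximated by Gauss-Hermite quadrature. The function names and arguments are hypothetical, and the exact procedure in any given paper or package may differ.

```python
import numpy as np
from numpy.polynomial.hermite import hermgauss

def normal_logpdf(y, mean, sd):
    """Elementwise log density of Normal(mean, sd) at y."""
    return -0.5 * np.log(2 * np.pi) - np.log(sd) - 0.5 * ((y - mean) / sd) ** 2

def group_log_marginal(y, eta, sigma, tau, n_nodes=30):
    """Log-likelihood of one group's data with the random intercept integrated out.

    y:     observations in the group
    eta:   fixed-effect linear predictor for each observation
    sigma: residual sd, tau: sd of the random intercept (one posterior draw)
    """
    y = np.asarray(y, dtype=float)
    eta = np.asarray(eta, dtype=float)
    x, w = hermgauss(n_nodes)          # nodes/weights for weight exp(-x^2)
    a = np.sqrt(2.0) * tau * x         # intercept value at each node
    # Joint log-likelihood of the whole group at each node, shape (n_nodes,).
    ll = normal_logpdf(y[None, :], eta[None, :] + a[:, None], sigma).sum(axis=1)
    terms = np.log(w) + ll
    m = terms.max()                    # log-sum-exp for numerical stability
    return m + np.log(np.exp(terms - m).sum()) - 0.5 * np.log(np.pi)
```

In a real workflow you would evaluate something like this for every posterior draw and every group, and pass the resulting draws-by-groups matrix to loo, so that the left-out unit is a group rather than a single row. For the Gaussian case the integral also has a closed form; quadrature is used here only because the same pattern carries over to non-Gaussian likelihoods.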
How big the error might be when we see high Pareto \hat{k} values is difficult to know, as the error distribution then has a very long tail. Most of the time you would observe over-optimistic results, but if you are unlucky you can also get very over-pessimistic results.