I have a few data points (4 out of 362) with Pareto k diagnostic values larger than 0.7. I tried to follow case studies like Cross-validation for hierarchical models and Using the loo package (version >= 2.0.0) to recompute elpd_loo with k_threshold = 0.7, but it didn’t work with the loo() function from the loo package. Does this threshold only work with rstanarm::loo()? I notice that packages such as rstanarm and brms each have their own loo() function. My log-likelihoods were computed with cmdstanr and I used the matrix method to compute loo. For instance,
adding k_threshold = 0.7 inside loo(LLmat, r_eff = rel_n_eff, cores = 2) does not give an error, although k_threshold does not seem to be defined as an input argument in the loo() documentation (Efficient approximate leave-one-out cross-validation (LOO) — loo • loo), and the output is the same as without thresholding.
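For reference, this is roughly what I ran (a sketch; the 4 chains × 1000 draws layout is illustrative):

```r
library(loo)

# extract the log-likelihood draws as a draws x observations matrix from cmdstanr
LLmat <- fit$draws("log_lik", format = "draws_matrix")

# relative effective sample sizes (assuming 4 chains of 1000 post-warmup draws)
rel_n_eff <- relative_eff(exp(LLmat), chain_id = rep(1:4, each = 1000))

# k_threshold is swallowed by ... and silently ignored by the matrix method
loo_res <- loo(LLmat, r_eff = rel_n_eff, cores = 2, k_threshold = 0.7)
```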
Do you think I can manually find the data points that have Pareto k larger than 0.7 from psis(), then take them out of LLmat and recompute loo?
I tried the following to calculate the summed elpd_loo using only points with pareto_k lower than 0.7. Do you think this is valid? But if I want to compare models after setting the pareto_k threshold, I have to hack the original loo results, which is not good.
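The hack was roughly like this (a sketch of what I mean, not my exact code):

```r
# drop the pointwise elpd contributions whose khat exceeds 0.7 and sum the rest
k <- pareto_k_values(loo_res)
elpd_filtered <- sum(loo_res$pointwise[k <= 0.7, "elpd_loo"])
```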
How much higher? If they are just a little over 0.7, you can try running more MCMC iterations and check whether the khat values get smaller. As khat values are estimated from a finite sample, there can be variation, which can be reduced with more iterations. If sampling is fast, this can be an easy approach when there are only a few khat values that are only a little over the threshold due to random variation in the posterior sample.
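For example, with cmdstanr something like this (a sketch; the model and data objects are placeholders):

```r
# refit with twice the posterior draws to reduce Monte Carlo variation in khat
fit_long <- mod$sample(data = stan_data, chains = 4,
                       iter_warmup = 1000, iter_sampling = 2000)
LLmat2 <- fit_long$draws("log_lik", format = "draws_matrix")
loo2 <- loo(LLmat2,
            r_eff = relative_eff(exp(LLmat2), chain_id = rep(1:4, each = 2000)),
            cores = 2)
plot(loo2)  # check whether the khat values got smaller
```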
It is not possible to have a generic function that would do the refits for loo or run K-fold-CV, as these require knowing something about the data and model structure. rstanarm and brms support a restricted set of models, and thus it is possible to provide a special loo() function for them.
Maybe we should add a warning to the generic loo function that k_threshold is not implemented. ping @jonah
No, this is not valid, and it’s worse than using all the data points, including those with high khat. The ones with the highest khats are the most influential.
In addition to trying more iterations (if the khats are close to 0.7), running MCMC separately for those 4 points, or using K-fold-CV, you could also
check that those data points are not erroneous (the Pareto-k diagnostic has revealed data problems many times)
consider adjusting the model, e.g. use Student-t instead of normal, or negative-binomial instead of Poisson (see the sketch after this list). If you tell us more about your model, it will be easier to help
consider what the purpose of computing elpd_loo is; e.g., if this is in the early phase of the workflow and the difference to the other model(s) is very large (and the better models don’t have high khats), then it is possible that the conclusion would not change if you computed those four elpd components with more accuracy. If you tell us more about your modeling problem, it will be easier to help
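As a sketch of the model-adjustment point, if the model can be written in brms, swapping the observation family is a one-line change (formula and data here are placeholders):

```r
library(brms)

# placeholder formula and data; the point is only the family swap
fit_normal  <- brm(y ~ x, data = d, family = gaussian())
fit_student <- brm(y ~ x, data = d, family = student())  # heavier tails

loo_compare(loo(fit_normal), loo(fit_student))
```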
They fall in (0.7, 1]. I don’t have the exact khat values from yesterday; rerunning the model and loo gives me 1 khat > 0.7, which is 0.852.
Thanks, I think my understanding of PSIS-LOO was still wrong. It seems that to use k_threshold one has to refit the model; it is not simply a matter of setting a threshold on khat.
If you also tell me the posterior sample size, I can estimate the posterior for khat. There is a PR for loo to make it easy for users to get that, too, but we need to check that we don’t break anything when merging it.
Yes. k_threshold tells which folds of the leave-one-out cross-validation to estimate by refitting the model with MCMC (instead of using fast Pareto smoothed importance sampling).
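For example, in rstanarm (a sketch with a placeholder model and data):

```r
library(rstanarm)

# placeholder model; the point is the k_threshold argument
fit <- stan_glm(y ~ x, data = d)

# folds with khat > 0.7 are re-estimated by actually refitting the model
# without that observation; the rest use PSIS
loo(fit, k_threshold = 0.7)
```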
I guess we should check how we explain this in the documentation.
I don’t think we want to add a warning to the generic, because the generic has to assume that all potential arguments are valid in order for methods to be defined that have their own specific arguments. (The loo() generic is called by the user and then dispatches to specific methods like loo.stanreg() based on the class of the first argument, so if the loo() generic threw a warning it would be thrown by all methods.) However, we could add a warning to the methods in the loo package. That is, loo.array(), loo.matrix(), and loo.function() could warn if they detect that the unsupported k_threshold argument has been specified. Does that make sense (not sure if I explained that well)?
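Something like this in each method (a hypothetical sketch, not the actual loo source):

```r
loo.matrix <- function(x, ..., r_eff = NULL, save_psis = FALSE, cores = 1) {
  # warn if the unsupported k_threshold argument was passed via ...
  if ("k_threshold" %in% names(list(...))) {
    warning("'k_threshold' is ignored by this method; it is only used by ",
            "loo() methods that can refit the model (e.g. in rstanarm).")
  }
  # ... existing PSIS-LOO computation ...
}
```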
Thanks, by linking this sentence to what I was learning from your lectures, I think I got the point. PSIS is a fast way to do LOO without refitting the model many, many times. A problematic Pareto-k diagnostic (> 0.7) tells us for which data point(s) the PSIS procedure is not trustworthy. Then we have to do classic LOO by really leaving these problematic (for PSIS) points out one by one and refitting the model. After that, I guess we can still calculate elpd_loo for these points by evaluating the density of each point using the posteriors estimated by MCMC.
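Something like this for each flagged point i, I suppose (my sketch; `log_lik_heldout` is an assumed generated quantity in the refit that evaluates log p(y[i] | theta)):

```r
# exact elpd contribution of held-out point i, given a refit `fit_i`
# done without y[i]
ll_i <- as.vector(fit_i$draws("log_lik_heldout", format = "draws_matrix"))
elpd_i <- matrixStats::logSumExp(ll_i) - log(length(ll_i))  # log mean exp(ll)
```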
There is about a 14% probability that k could be less than 0.7. Also, it is possible that with increasing sample size there is a transition from pre-asymptotic behavior to asymptotic behavior, but that is unlikely in LOO specifically.