Practical implications of many high-Pareto K observations loo

I have built a model describing how long it takes fish to digest their food (i.e., gut passage time), with temperature, diet, and a temperature-diet interaction as population effects. I included species as a group-level effect to account for repeat sampling. However, some species have been sampled many times (up to n = 8), whereas most have been sampled only once.

Here is the main model output:
model output

And the loo output:

image

Obviously that is a lot of high Pareto K values, but all of my posterior predictive checks suggest that the model fits the data well, and p_loo < p. From what I have read elsewhere, the reason I am getting so many high Pareto K values is not because the model is badly specified but because many levels of the group effect (species) only have 1 estimate.

My question is, what are the practical implications of this for interpretation of the model? Does it mean that estimates of population effects will be highly sensitive to additional data? Should I trust estimates of population effects? Or does it just mean that its estimate of the species group effect is unreliable?

Any guidance would be much appreciated; I am trying to write up this model and I don’t want to overstate my conclusions.

This might be helpful from back in 2018

Without seeing additional diagnostics (like LOO-PIT) it’s difficult to be certain, but this is the likely reason. You could compare which observations get high Pareto-k and if they all are observations that are the only observation for some species, then that would give additional support to this hypothesis.
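The comparison suggested above can be sketched in a few lines. This is only an illustration in plain Python, not code from the thread: the function name `high_k_by_group_size`, the species labels, and the Pareto-k values are all hypothetical, and the 0.7 cutoff is assumed here as the commonly used threshold for "high" Pareto-k. In practice the k values would come from the loo output of the fitted model.

```python
from collections import Counter


def high_k_by_group_size(species, pareto_k, threshold=0.7):
    """Split high-Pareto-k observations by whether their species is a singleton.

    species   : one species label per observation
    pareto_k  : Pareto-k estimate per observation, in the same order
    threshold : cutoff above which k counts as "high" (0.7 is an assumption)
    Returns the observation indices in each of the two categories.
    """
    if len(species) != len(pareto_k):
        raise ValueError("species and pareto_k must have the same length")
    counts = Counter(species)  # number of observations per species
    summary = {"singleton": [], "repeated": []}
    for i, (sp, k) in enumerate(zip(species, pareto_k)):
        if k > threshold:
            key = "singleton" if counts[sp] == 1 else "repeated"
            summary[key].append(i)
    return summary


# Hypothetical data: species A and D are sampled repeatedly, B and C once.
species = ["A", "A", "A", "B", "C", "D", "D"]
pareto_k = [0.2, 0.9, 0.3, 1.1, 0.8, 0.4, 0.6]
result = high_k_by_group_size(species, pareto_k)
print(result)
```

If nearly all high-k indices land under "singleton", that supports the single-observation explanation. A sizeable "repeated" list suggests something else is also going on.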

If the high Pareto-k values occur only for observations that are the sole observation in their group, and we assume there is no model misspecification, then the posterior is still valid. Yes, the group effects for groups with just one observation are probably unreliable, but that should also show in the width of the posterior for those effects. We can also say that the population prior is not very informative (they rarely are), and thus even with one observation in the group the prior and posterior for the group effect are quite different. Additional data may then change the posterior a lot, but not necessarily in a way that would be surprising given the current width of the posterior. Are you comparing this model with some other model?

Thank you so much for the helpful reply. I'm very sorry for the delayed response; I thought I was signed up to get an e-mail notification when somebody replied.

Without seeing additional diagnostics (like LOO-PIT) it’s difficult to be certain, but this is the likely reason. You could compare which observations get high Pareto-k and if they all are observations that are the only observation for some species, then that would give additional support to this hypothesis.

Here is the LOO-PIT, which to me looks reasonable:

The high Pareto K values are not just for species with one estimate. However, when I remove species as a group effect, all of the Pareto K values are way better:

image

Estimates of gut passage time can be highly variable within species due to things like the species of algae consumed or size of the fish (but size does not seem to have an effect across species). Is it possible that that variability is contributing to the high Pareto Ks?

Are you comparing this model with some other model?

Yes. I am comparing the model with two other models, which have 1) only temperature as a population effect and 2) temperature and diet, but no interaction, as population effects. Both have species as a group effect, and both have similarly high Pareto K values. The original model I showed had the most model weight (74%) using the model_weights() function with weights = "loo".
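For readers wondering how a weight like this relates to the LOO estimates, here is a rough sketch in plain Python. It assumes the weights are relative-likelihood (Akaike-type) weights proportional to exp(elpd_loo); whether model_weights() computes exactly this internally is an assumption on my part. The model names and elpd values are made up for illustration and are not my actual results.

```python
import math


def loo_weights(elpd_loo):
    """Turn elpd_loo estimates into normalized weights proportional to exp(elpd).

    elpd_loo : dict mapping model name -> elpd_loo estimate (higher is better)
    Subtracting the best elpd before exponentiating avoids numerical underflow
    and does not change the normalized weights.
    """
    best = max(elpd_loo.values())
    rel = {name: math.exp(e - best) for name, e in elpd_loo.items()}
    total = sum(rel.values())
    return {name: r / total for name, r in rel.items()}


# Hypothetical elpd_loo values for the three candidate models.
elpd = {"temp": -105.0, "temp_diet": -103.5, "temp_x_diet": -102.0}
weights = loo_weights(elpd)
for name, w in weights.items():
    print(f"{name}: {w:.3f}")
```

Because the weights depend only on the elpd_loo point estimates, any unreliability in those estimates (for example from many high Pareto-k observations) carries straight through to the weights.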

I should have said this before, but I don’t care much about the estimates of group effects. Really I just want to make sure that my estimates of the population effects are reasonable given the available data.

Thank you again!

Thanks for the link!