To use or not to use the assignment to a hierarchical group as a feature for exact LOO and PSIS-LOO

I want to use leave-one-out (LOO) cross-validation (CV) for a hierarchical model. My intuition is not to use the group assignment of the left-out data point when I compute the likelihood of that point given the model fitted to the rest of the data.
If I used the group assignment of the left-out data point for LOO-CV, I would be treating the group assignment as an explanatory variable (whose parameters have a special, informative prior: the hierarchical prior, which itself carries information from the data). That is not what I want. I use the group assignment only to account for the fact that, for a known reason, the data points are not independent under the non-hierarchical model (but are hopefully independent given the group assignment), so that the variance comes out closer to correct.


  1. Is this view correct?
  2. Is PSIS-LOO an approximation to LOO-CV that does not use the hierarchical group assignment of the left-out data point?
  3. When I do exact LOO-CV for a data point whose Pareto k value was too high in PSIS-LOO, should I use the hierarchical group assignment of the left-out data point as a feature or not? (I use the loo package in R.)

From my understanding of Vehtari, Gelman, and Gabry (2017), ‘Practical Bayesian Model Evaluation Using Leave-One-out Cross-Validation and WAIC’, the answer to question 2 is ‘No’. (I understand \tilde{y}_{i} in equation (7) to be the random variable for an unknown data point that shares the features of data point y_{i}.) But maybe there is an argument for why this is not a problem once I sum over all data points.
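As a point of reference for question 2, here is a minimal sketch of the raw (unsmoothed) importance-sampling LOO estimate that Pareto smoothing is meant to stabilise. It is not the loo package's implementation. It assumes the pointwise log-likelihood values were evaluated with the left-out point's own group parameter, which would make the estimate conditional on the group assignment. The function names and the numbers in log_lik_37 are made up for illustration.

```python
import math

def log_sum_exp(values):
    """Numerically stable log(sum(exp(v))) over a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def is_loo_lpd(log_lik_i):
    """Raw importance-sampling estimate of log p(y_i | y_{-i}).

    log_lik_i holds log p(y_i | theta^s) for S draws from the FULL posterior.
    The importance ratios are proportional to 1 / p(y_i | theta^s), so
    p(y_i | y_{-i}) is approximated by 1 / mean_s(1 / p(y_i | theta^s)).
    No Pareto smoothing is applied here.
    """
    n_draws = len(log_lik_i)
    return -(log_sum_exp([-v for v in log_lik_i]) - math.log(n_draws))

def full_posterior_lpd(log_lik_i):
    """log of the posterior predictive density of y_i, with y_i left in."""
    return log_sum_exp(log_lik_i) - math.log(len(log_lik_i))

# Hypothetical pointwise log-likelihood values for observation 37.
log_lik_37 = [-0.9, -1.4, -0.7, -2.3, -1.1]
loo_37 = is_loo_lpd(log_lik_37)
lpd_37 = full_posterior_lpd(log_lik_37)
```

Whatever conditioning went into computing log_lik_i (group parameter known, or not) carries over to the estimate, since the function only sees those values.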

Let me check I understand your goal. Let’s say you have 10 groups, and the observation y_{37} belongs to group 8 (i.e. g_{37}=8), and in addition there are other covariates, so that you have a model
p(y_{37} | x_{37}, g_{37}=8, \phi, \theta_8),
where \phi are the common parameters and \theta_8 is the parameter specific to group 8. The usual posterior predictive distribution would be
\int \int p(\tilde{y}_{37} | x_{37}, g_{37}=8, \phi, \theta_8) p(\phi,\theta_8 | y, x, g) d\phi d\theta_8.
The usual leave-one-out predictive distribution would be
\int \int p(\tilde{y}_{37} | x_{37}, g_{37}=8, \phi, \theta_8) p(\phi,\theta_8 | y_{-37}, x_{-37}, g_{-37}) d\phi d\theta_8.
Do you mean that you would now want to use instead the predictive distribution
\int \int p(\tilde{y}_{37} | x_{37}, g_{37}=?, \phi, \theta_?) p(\phi,\theta_? | y_{-37}, x_{-37}, g_{-37}) d\phi d\theta_?,
where ? is an unknown group among the existing groups, or a not yet seen group?

Yes, this is exactly my question, although here I only considered either ? = 8 or ? being a not yet seen group. In a closely related question (included below), I also considered ? being an unknown group among those already seen.

My understanding of the motivation for a hierarchical model with \theta_i \sim D(\phi) for i = 1, \dots, 10, where D is some distribution, is to learn about the \theta of the general population, i.e. of a random, probably not yet seen group. So, I would want to use the leave-one-out predictive distribution

\int \int p(\tilde{y}_{37} | x_{37}, g_{37}=?, \phi, \theta_?) p(\phi,\theta_? | y_{-37}, x_{-37}, g_{-37}) d\phi d\theta_?,

for ? being a random, probably not yet seen group, and I would just integrate over \theta_?. Correspondingly, when computing the exact log_lik, I would draw a random \theta_? \sim D(\phi) in each iteration of my Stan program.
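A minimal sketch of the two options for the left-out point, written in Python rather than Stan. It assumes a normal model y \sim Normal(\theta_g, \sigma) with D(\phi) = Normal(\mu, \tau); the posterior draws below are synthetic placeholders, not output from a real fit. For exact LOO they would have to come from the model refitted without y_{37}. All names are hypothetical.

```python
import math
import random

def normal_lpdf(y, mean, sd):
    """Log density of Normal(mean, sd) at y."""
    return -0.5 * math.log(2.0 * math.pi) - math.log(sd) - 0.5 * ((y - mean) / sd) ** 2

def log_mean_exp(values):
    """Numerically stable log of the mean of exp(values)."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values) / len(values))

def lpd_known_group(y, draws, group):
    """Log predictive density using the left-out point's own theta_group."""
    return log_mean_exp([normal_lpdf(y, d["theta"][group], d["sigma"]) for d in draws])

def lpd_new_group(y, draws, rng):
    """Log predictive density treating the group as not yet seen.

    One theta_new ~ Normal(mu, tau) is drawn per posterior draw, which
    integrates over theta_new by Monte Carlo.
    """
    log_liks = []
    for d in draws:
        theta_new = rng.gauss(d["mu"], d["tau"])
        log_liks.append(normal_lpdf(y, theta_new, d["sigma"]))
    return log_mean_exp(log_liks)

# Synthetic stand-in for posterior draws (assumed values, for illustration only).
rng = random.Random(1)
draws = [
    {"mu": rng.gauss(0.0, 0.1), "tau": 1.0, "sigma": 0.5,
     "theta": {8: rng.gauss(2.0, 0.1)}}
    for _ in range(4000)
]

y_37 = 2.1  # hypothetical left-out observation from group 8
lpd_known = lpd_known_group(y_37, draws, group=8)
lpd_new = lpd_new_group(y_37, draws, rng)
```

In this made-up setup group 8 sits far from \mu, so the known-group density is much higher than the new-group one. The averaging is done on the density scale (log of the mean), not by averaging the log_lik values.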

This question is related to another question I encountered:
Should one do posterior predictive checks for ? being an unknown group among the existing groups (I did this by forward simulating data for ? = 1,\dots,10 and then pooling), or for ? being a not yet seen group (I did this by forward simulating data while sampling \theta \sim D(\phi), with \phi drawn from its posterior given y, x, g)? I did both, but I think the second option is what I should do, since I want to learn something about the general population and not about the groups I have seen. In my application, the second option matched the data less well (I only did a graphical check, overlaying histograms of the simulated data and the observed data), but I think this could go in either direction, depending on how different the \theta_i are and on how well D represents them. Does this make sense? Which option is recommended for the posterior predictive check? (I don’t know the literature very well yet, sorry, but I am trying to fill in my gaps.)
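The two kinds of forward simulation can be sketched as follows, again in Python with synthetic draws rather than real posterior output. The sketch assumes y \sim Normal(\theta_g, \sigma), D(\phi) = Normal(\mu, \tau), and equal weight for each of the 10 groups when pooling; all names and numbers are hypothetical.

```python
import random
import statistics

rng = random.Random(0)
n_groups = 10

# Synthetic stand-in for posterior draws: group means spread evenly over [-1, 1].
group_means = [-1.0 + 2.0 * k / (n_groups - 1) for k in range(n_groups)]
draws = [
    {"mu": rng.gauss(0.0, 0.1), "tau": 1.0, "sigma": 0.5,
     "theta": [m + rng.gauss(0.0, 0.05) for m in group_means]}
    for _ in range(1000)
]

def simulate_existing_groups(draws, rng):
    """One replicate per existing group and posterior draw, pooled."""
    sims = []
    for d in draws:
        for theta_g in d["theta"]:
            sims.append(rng.gauss(theta_g, d["sigma"]))
    return sims

def simulate_new_group(draws, rng):
    """One replicate per posterior draw for a not yet seen group."""
    sims = []
    for d in draws:
        theta_new = rng.gauss(d["mu"], d["tau"])  # theta_new ~ D(phi)
        sims.append(rng.gauss(theta_new, d["sigma"]))
    return sims

y_rep_existing = simulate_existing_groups(draws, rng)
y_rep_new = simulate_new_group(draws, rng)

# A simple numeric summary to set beside the histograms.
sd_existing = statistics.pstdev(y_rep_existing)
sd_new = statistics.pstdev(y_rep_new)
```

Here \tau was chosen larger than the spread of the group means, so the new-group simulations come out wider than the pooled existing-group ones. With other values the comparison could look different, which is why both sets are worth plotting against the observed data.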