# Using loo for clustered data

This is more a conceptual than a practical question about how to validate a fitted model, in particular when the data are clustered.

For instance, let y_{ij}, with i = 1, \cdots, r and j = 1, \cdots, n_i, denote the observation for the j\text{-th} individual in region i. I would like to fit two models

\begin{align} \mathcal{M}_{\text{A}}: y_{ij} &= \mathbf{x}_{ij}\boldsymbol{\beta} + u_i + \epsilon_{ij} \\ \mathcal{M}_{\text{B}}: y_{ij} &= \mathbf{x}_{ij}\boldsymbol{\beta} + \epsilon_{ij} \end{align}

where \mathbf{u} = (u_1, \cdots, u_r) \sim G represents the spatial random effects and \epsilon_{ij} \overset{\text{i.i.d.}}{\sim} \text{N}(0, \sigma^2_{ij}).

Assume I have fitted both models using, for example, Stan, and want to compare them with the “leave-one-out cross-validation” procedure. To do so, I used the loo package and computed all the required quantities as in this vignette.

I can then analyze the results using the loo_compare(loo_A, loo_B) output.

Finally, my question: since I am dealing with a clustered data set (with clusters defined by the regions i), does it still make sense to use the (approximate) “leave-one-out” validation procedure? Or should I instead treat each cluster as one observation (and refit the model r times)?
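To make the two options concrete, this is how I would write the two quantities. The notation is my own: \mathbf{y}_i = (y_{i1}, \cdots, y_{i n_i}) collects the observations in region i, and a minus subscript means “all data except”.

\begin{align} \text{elpd}_{\text{loo}} &= \sum_{i=1}^{r} \sum_{j=1}^{n_i} \log p(y_{ij} \mid \mathbf{y}_{-(ij)}) \\ \text{elpd}_{\text{logo}} &= \sum_{i=1}^{r} \log p(\mathbf{y}_i \mid \mathbf{y}_{-i}) \end{align}

As far as I understand, the first is the quantity that loo approximates, while computing the second exactly would require the r refits mentioned above.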

See the CV-FAQ entry “Can cross-validation be used for hierarchical / multilevel models?”, and the references and case studies listed in the answer there. Let me know if that answer is not clear, and I can try to improve it.
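As a rough illustration of the “treat each cluster as one observation” option from the question (this is not taken from the FAQ), here is a minimal Python sketch of the bookkeeping for leave-one-region-out cross-validation. All names (`leave_one_group_out_folds`, `region_log_pred_density`, `log_lik_draws`) are hypothetical, and the refitting step itself is left out. It assumes that for each held-out region you already have posterior draws of the pointwise log-likelihood from a fit that excluded that region. For model A, that presumably means the held-out region's effect u_i has to be predicted from G rather than taken from a fit that saw the region.

```python
import math
from collections import defaultdict


def leave_one_group_out_folds(region_ids):
    """Build one fold per region: (region, train_indices, test_indices)."""
    by_region = defaultdict(list)
    for idx, region in enumerate(region_ids):
        by_region[region].append(idx)

    folds = []
    for region, test_idx in by_region.items():
        held_out = set(test_idx)
        # Everything outside the held-out region is used for refitting.
        train_idx = [k for k in range(len(region_ids)) if k not in held_out]
        folds.append((region, train_idx, test_idx))
    return folds


def log_mean_exp(values):
    """Numerically stable log of the mean of exp(values)."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values) / len(values))


def region_log_pred_density(log_lik_draws):
    """Joint log predictive density of one held-out region.

    log_lik_draws[s][j] is assumed to be log p(y_ij | theta^(s)), where the
    draws theta^(s) come from a fit that did NOT see region i.
    """
    # Sum over individuals within each draw, then average over draws.
    joint_per_draw = [sum(row) for row in log_lik_draws]
    return log_mean_exp(joint_per_draw)


# Toy usage: five individuals in three regions.
folds = leave_one_group_out_folds(["a", "a", "b", "c", "b"])
for region, train_idx, test_idx in folds:
    print(region, train_idx, test_idx)
```

Summing `region_log_pred_density` over all regions gives a leave-one-region-out estimate that can be compared between the two models in the same way as the loo results. Which of the two procedures is appropriate depends on whether you want to predict new individuals in the observed regions or individuals in new regions.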