Leave-group-out LOO question

Hi all.

I am developing [another] R package with Stan as the estimation backend. As always, I am trying to implement a loo method for it. The model is a latent multivariate mixed effects location scale model (LMMELSM). The details of the model aren’t terribly important: it’s estimating fine, recovering parameters, no divergences, good Rhats, etc. I am not going to paste the model code here, because it is complicated and not particularly relevant to the problem.

The model is hierarchical, and involves latent variables.
Let J be the number of observed variables, and Y_i is the data matrix for person i.
Let n_i be the number of rows of observations for person i.
Let F be the number of latent factors (\eta), and k be the k-th observation of person i.

\underbrace{Y_i}_{n_i \times J} = \underbrace{\mathbf{1}}_{n_i \times 1} \underbrace{\nu}_{1\times J} + \underbrace{\eta_{i}}_{n_i \times F} \underbrace{\Lambda}_{F\times J} + \underbrace{\mathbf{e}_i}_{n_i \times J}

\eta_i is hierarchically modeled; e.g., in the case where F = 1 [a unidimensional latent structure]:

\begin{align*}
\eta_{ik} &= \alpha_i^\mu + \epsilon_{ik} \\
\alpha_i^\mu &= 0 + u_i^\mu \tag{Location model}\\
\epsilon_{ik} &\sim N(0, \sigma_i^\epsilon) \\
\log(\sigma_i^\epsilon) &= 0 + u_i^\sigma \tag{Scale model}
\end{align*}

In essence, it is a location-scale model, but on a latent quantity instead of observed data; the latent quantity then projects to the observed data as per normal in a latent measurement model.

For loo, there are several things that can be ‘left out’.
The simplest is leave-row-out: compute the log_lik of each score on each row, then sum across the columns to get the row-wise log_lik.

Another variant is leave-person-out: compute the log_lik of each score on each row; sum across the columns for the row-wise log_lik; then sum these row-wise values within each person. If you have K persons, then you have K log_liks.
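To make the two aggregation schemes concrete, here is a minimal sketch in R. The `log_lik` array (S draws × N rows × J outcomes) and the `person` index are simulated stand-ins for whatever the model’s generated quantities would actually produce; the variable names are mine, not the package’s.

```r
set.seed(1)
S <- 100; N <- 6; J <- 3
person  <- c(1, 1, 2, 2, 3, 3)                         # person id per row
log_lik <- array(rnorm(S * N * J, mean = -1), dim = c(S, N, J))

# Leave-row-out: sum across outcomes within each row -> S x N matrix
log_lik_row <- apply(log_lik, c(1, 2), sum)

# Leave-person-out: additionally sum row-wise values within person -> S x K
log_lik_person <- sapply(unique(person), function(p) {
  rowSums(log_lik_row[, person == p, drop = FALSE])
})

dim(log_lik_row)     # S x N
dim(log_lik_person)  # S x K
```

Either matrix (draws × units) can then be handed to `loo::loo()`, which accepts a log-likelihood matrix directly; the only difference between the two variants is which columns get summed before PSIS is applied.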

Leave-row-out yields generally “ok” Pareto-k values; sometimes there are 1 or 2 ‘bad’ k values. Unfortunately, I don’t think this is the LOO of interest.
I would intuitively think that the LOO of interest is leave-group-out (akin to how leave-group-out can be more useful for mixed models in general).
But the leave-person-out Pareto-k values are, and I am not exaggerating, 100% bad k-values.

Is there a reason why 100% of the PSIS k-values can be bad? Am I computing LGO properly for a mixed, multivariate-outcome model (i.e., compute the log_lik for each multivariate occasion by summing the log_liks across outcomes, then sum these within person to yield a joint log_lik for that person’s observed data)? Is there a way to diagnose why these k-values are so, so bad?


I think this is due to the marginal posterior p(\theta^s|y) and LOO posterior p(\theta^s|y_{-i}) being very different, which can cause importance sampling to fail, as pointed out in “Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC” (see the end of Section 2.1 for an example).

This is common when computing leave-one-group-out; from the Cross-validation for hierarchical models case study:

In theory Pareto smoothed importance sampling could be used for leave-one-group-out. In hierarchical models each group has its own parameters, and the posterior of those parameters tends to change a lot if all observations in that group are removed, leading to failure of importance sampling. For certain models quadrature methods could be used to compute integrated (marginalized) importance sampling (Merkle, Furr and Rabe-Hesketh, 2018).


Indeed; in my re-reading of that paper, I came across the referenced citation (Merkle et al., 2018), and am working my way through that as well.

Basically, by leaving out the entire group, the posterior for some parameters is obviously going to be different; e.g., the latent intercept’s random effect (and each latent score for that person) will basically just be sampled from the prior. That’s a huge difference. So the alternative is to use “marginal” likelihoods (in the sense of marginalizing out any group-specific parameters) rather than conditional ones (which depend on group-specific parameters).

I’m working through it now, but I imagine this can get extremely complicated once, e.g., random slopes are involved.

For the time being, let’s say I only have random intercepts and I am using normal likelihoods. Does the ‘marginal’ likelihood actually have to be integrated, or can some shortcuts be used? For example, if I have:

y_{ik} \sim N(\nu + \eta_{ik}\Lambda, \sigma)

Then I think covariance-based estimators marginalize out \eta_{ik} by instead using:

y_{ik} \sim MVN(\nu, \Sigma = \Lambda' \Phi \Lambda + \delta^\sigma)

where \Phi is the latent covariance matrix and \delta^\sigma is a residual covariance matrix.
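As a sanity-check sketch in R, the implied covariance and the corresponding marginal log-likelihood for one row can be computed directly in base R. Note that \Lambda here is oriented F × J, as in the measurement equation above; all parameter values below are made up for illustration.

```r
n_fac <- 1; J <- 3
Lambda <- matrix(c(1, 0.8, 0.6), nrow = n_fac)   # F x J loadings
Phi    <- matrix(1, n_fac, n_fac)                # latent (co)variance
delta  <- diag(0.5, J)                           # residual covariance
nu     <- c(0, 0, 0)                             # intercepts

Sigma <- t(Lambda) %*% Phi %*% Lambda + delta    # J x J implied covariance

# Multivariate-normal log density via the Cholesky factor (base R only)
dmvnorm_log <- function(y, mu, Sigma) {
  cholS <- chol(Sigma)                           # Sigma = cholS' cholS
  z <- backsolve(cholS, y - mu, transpose = TRUE)
  -0.5 * length(y) * log(2 * pi) - sum(log(diag(cholS))) - 0.5 * sum(z^2)
}

y <- c(0.2, -0.1, 0.4)
dmvnorm_log(y, nu, Sigma)
```

The point of the shortcut is exactly this: the per-row density no longer involves \eta_{ik} at all, only structural parameters, so the PSIS ratios don’t depend on row-specific latent draws.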

Is this the same “marginal likelihood” that is referenced by that paper?

If so, then I may be able to work out what my model’s equivalent is; it’s a bit tougher, since the latent covariance varies by person too… I fear I may run into the same problem, since the parameters vary by person.


The normal-normal case is “easy”, as you can analytically integrate over the ik-specific parameter \eta_{ik}, as you have written.
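For a single indicator, the normal-normal identity can be checked numerically in base R: integrating \eta out with `integrate()` should match the closed-form marginal N(\nu + \lambda\mu_\eta,\; \lambda^2\phi + \sigma^2). All parameter values are illustrative.

```r
nu <- 0.5; lambda <- 0.8; sigma <- 0.6   # measurement part: y = nu + lambda*eta + e
mu_eta <- 0.3; sd_eta <- 1.2             # eta ~ N(mu_eta, sd_eta)
y <- 1.1

# Numerical: integral of p(y | eta) p(eta) d eta
num <- integrate(function(eta) {
  dnorm(y, nu + lambda * eta, sigma) * dnorm(eta, mu_eta, sd_eta)
}, -Inf, Inf)$value

# Analytic normal-normal marginal
ana <- dnorm(y, nu + lambda * mu_eta, sqrt(lambda^2 * sd_eta^2 + sigma^2))

c(numerical = num, analytic = ana)
```

The same `integrate()` (or Gauss-Hermite quadrature) fallback is what you would reach for once the analytic identity stops applying.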

Unless you have something more complex…


Indeed, the full model can be more complicated. I’m unsure how to handle the analytic integration in this particular model, though. The integration above is for a typical common factor model.

This model is effectively an MELSM, but the ‘outcome’ of the MELSM is latent.

E.g.,

MELSM:

\begin{align*}
y_{ik} &= X_{ik}\beta + Z_{ik}u_{i}^\mu + \epsilon_{ik} \tag{Location model}\\
\epsilon_{ik} &\sim N(0, \sigma_{ik}) \\
\log(\sigma_{ik}) &= W_{ik}\gamma + V_{ik}u_{i}^\sigma \tag{Scale model}\\
\begin{bmatrix} u_i^\mu \\ u_i^\sigma \end{bmatrix} &\sim MVN(0, D_iLL'D_i) \\
\log(D_{ip}) &= R_i\zeta_p \tag{Between scale model}
\end{align*}

X, W, and R are design matrices; Z and V are random-effect design matrices; the u vectors are the random effects; D_i is a diagonal scale matrix of the random-effect SDs.

That’s the ‘full’ model for a 2-level LMMELSM.

The model I am working with essentially replaces y_{ik} with \eta_{ik}, a latent variable manifested by J indicators (and adds some constraints, e.g., the fixed intercepts for the location and scale models are 0 for identification). In the simplest case, there are no predictors, no between-group scale model, and only random intercepts for the location and scale.

It’s unclear, then, how to get the expectations and covariances in this case. Technically, each row could have a different implied covariance matrix, and it’s not clear to me what the expectation would be over. Do you think it is feasible to compute row-specific expectations and covariances? Or should I just try a quadrature approach?
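To illustrate what the row-specific moments look like in the simplest case (F = 1, random intercepts only): conditional on person i’s random effects, \eta_{ik} \sim N(u_i^\mu, \exp(u_i^\sigma)), so \eta can be integrated out per row analytically. The sketch below only marginalizes \eta, not u_i^\mu or u_i^\sigma; marginalizing those person-level parameters is the part that would still need quadrature. All values are illustrative.

```r
J <- 3
Lambda <- matrix(c(1, 0.8, 0.6), nrow = 1)   # 1 x J loadings (F = 1)
delta  <- diag(0.5, J)                       # residual covariance
nu     <- rep(0, J)

u_mu    <- 0.4    # person i's location random effect
u_sigma <- -0.2   # person i's log-scale random effect

# eta_ik | u_i ~ N(u_mu, exp(u_sigma)), so marginalizing eta per row gives:
mu_ik    <- nu + u_mu * drop(Lambda)                          # J-vector mean
Sigma_ik <- t(Lambda) %*% Lambda * exp(2 * u_sigma) + delta   # J x J covariance
```

Since the latent variance here depends on u_i^\sigma, every person (and, with predictors, every row) gets its own \Sigma_{ik}; computing these is cheap, but as noted above, the densities still condition on person-specific parameters unless u_i is integrated out as well.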