Should I marginalize to the extent possible?

I am working with models where y_i \sim \mathcal{N}(Ax_i, D) and x_i \sim \mathcal{N}(\mu, \Sigma), so that, conditional on A, I can model y_i \mid A \sim \mathcal{N}(A\mu, A\Sigma A^{T}+D). In this scenario, y_i often lives in a space of much higher dimension than x_i, and D and \Sigma can generally be assumed to be diagonal.
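
For concreteness, here is a minimal NumPy sketch of this generative setup (dimensions, matrices, and the seed are made up for illustration), which also checks the marginal covariance A\Sigma A^{T}+D empirically:

```python
import numpy as np

rng = np.random.default_rng(0)

p, q, n = 50, 5, 1000          # dim(y_i), dim(x_i), number of observations (illustrative)
A = rng.normal(size=(p, q))    # loading matrix
mu = rng.normal(size=q)        # prior mean of x_i
Sigma = np.diag(rng.uniform(0.5, 2.0, size=q))  # diagonal prior covariance of x_i
D = np.diag(rng.uniform(0.1, 0.5, size=p))      # diagonal observation noise

# Un-marginalized generative process: draw x_i, then y_i | x_i.
x = rng.multivariate_normal(mu, Sigma, size=n)
y = x @ A.T + rng.multivariate_normal(np.zeros(p), D, size=n)

# Marginally, y_i ~ N(A mu, A Sigma A^T + D); the empirical covariance should be close.
marg_cov = A @ Sigma @ A.T + D
print(np.abs(np.cov(y, rowvar=False) - marg_cov).max())  # small for large n
```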

My question is whether it is recommended to implement this type of model with x_i marginalized out, or with x_i as a latent variable. So we can implement either:

  1. The complete, un-marginalized model, with the latent x_i included explicitly. The advantage seems to be that we only ever draw from diagonal Gaussians, but the disadvantage is that we introduce a large number of latent variables into the model.
  2. The marginalized model, with x_i integrated out. There are a lot fewer random variables to sample, but they now all have to be sampled from a multivariate Gaussian.

Although the multivariate Gaussians technically have the dimensionality of y_i, the covariance matrix A\Sigma A^{T}+D has low-rank-plus-diagonal structure, so (via the Woodbury identity) we effectively only have to invert matrices of the (smaller) dimensionality of x_i. That shaves off some of the cost, but it still requires extra linear algebra.
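
To make the low-rank saving concrete, here is a hedged NumPy sketch of evaluating the marginal log-density using the Woodbury identity and the matrix determinant lemma, so that only q \times q matrices (the dimensionality of x_i) are factorized; the function name and argument layout are illustrative, not from any existing package:

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def marginal_logpdf(y, A, mu, sigma_diag, d_diag):
    """Per-observation log density of y_i ~ N(A mu, A diag(sigma) A^T + diag(d)),
    using the Woodbury identity so only q x q solves are needed.
    y: (n, p) observations; A: (p, q); mu: (q,); sigma_diag: (q,); d_diag: (p,)."""
    p = y.shape[1]
    r = y - A @ mu                                                # residuals, shape (n, p)
    M = np.diag(1.0 / sigma_diag) + A.T @ (A / d_diag[:, None])   # Sigma^-1 + A^T D^-1 A
    cf = cho_factor(M)

    # Quadratic form r^T (D + A Sigma A^T)^{-1} r via Woodbury
    rDinv = r / d_diag                                            # r_i^T D^{-1}, row-wise
    t = rDinv @ A                                                 # A^T D^{-1} r_i, row-wise, (n, q)
    quad = np.sum(rDinv * r, axis=1) - np.sum(t * cho_solve(cf, t.T).T, axis=1)

    # log det(D + A Sigma A^T) via the matrix determinant lemma
    logdet = (2.0 * np.sum(np.log(np.diag(cf[0])))
              + np.sum(np.log(sigma_diag)) + np.sum(np.log(d_diag)))

    return -0.5 * (p * np.log(2.0 * np.pi) + logdet + quad)
```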

Even though HMC is efficient compared to its competitors, its cost still scales with the dimensionality of the sample space, so I would venture that marginalization helps with sampling. But it could be argued that the marginalized posterior has a more complicated geometry, which might counteract that benefit.

Given the other parameters, x_i can be sampled from its exact conditional posterior for each draw, either post hoc or in the generated quantities block, so we can recover x_i if needed even when using model 2.
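
For the post-hoc recovery, the conditional posterior p(x_i \mid y_i, A, \mu, \Sigma, D) is available in closed form by standard Gaussian conjugacy: it is Gaussian with precision \Sigma^{-1}+A^{T}D^{-1}A and mean V(\Sigma^{-1}\mu + A^{T}D^{-1}y_i), where V is that precision's inverse. A minimal sketch (the helper name is mine), to be applied once per posterior draw of the other parameters:

```python
import numpy as np

def sample_x_given_y(y, A, mu, sigma_diag, d_diag, rng):
    """One draw of x_i from p(x_i | y_i, A, mu, Sigma, D) for each row of y,
    with V = (Sigma^-1 + A^T D^-1 A)^-1 and m_i = V (Sigma^-1 mu + A^T D^-1 y_i)."""
    M = np.diag(1.0 / sigma_diag) + A.T @ (A / d_diag[:, None])  # (q, q) precision
    V = np.linalg.inv(M)                                         # conditional covariance
    m = (V @ (mu / sigma_diag + (y / d_diag) @ A).T).T           # (n, q) conditional means
    L = np.linalg.cholesky(V)
    return m + rng.standard_normal(m.shape) @ L.T                # z L^T has covariance V
```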

A further disadvantage of model 1 seems to be the application of PSIS-LOO. If I want to evaluate the predictive density p(y_i|\mathcal{D}_{obs}) with x_i marginalized out, which seems the sensible thing to evaluate, I suppose I have to use model 2, or at the very least do the computational work of model 2 on top of model 1. Or is it actually better to apply LOO conditional on samples of x_i, since the marginalized density might be needlessly diffuse? I got some really poor \hat{k} values when I tried it on the marginalized model.
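
For reference, the pointwise log-likelihood matrix that PSIS-LOO consumes under the marginalized formulation is just the per-observation marginal density evaluated at each posterior draw; a thin sketch reusing the hypothetical marginal_logpdf helper from above:

```python
import numpy as np

def pointwise_loglik(y, draws):
    """Assemble the (n_draws, n_obs) log-likelihood matrix for PSIS-LOO,
    where `draws` iterates over posterior draws (A, mu, sigma_diag, d_diag)."""
    return np.stack([marginal_logpdf(y, A, mu, s, d) for A, mu, s, d in draws])
```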


I think this is a fair description of the trade-offs. Marginalization is almost always a win, but if you use HMC to sample something multivariate normal with a million dimensions, that works fine too. I would be surprised if the marginalized model posterior for the remaining parameters were simpler than the non-marginalized version, so that's my only concern. Having a lot of spare MVN parameters around might just hide the problem in the non-marginalized version.

It will work, but it’ll take a lot longer to mix.

In almost all cases, if the posterior is complicated under marginalization, then it will also be hard to fit with those variables included explicitly rather than marginalized out. It would be nice to get more examples of these trade-offs, so please feel more than free to share results back here.