HMC for mixture models with varying scales

Dear Stan Community,

I encountered the issue of problematic posteriors described in the Problematic Posteriors chapter of the user's guide, specifically the component-collapsing problem in mixture models with varying scales. While the guide outlines the problem, it doesn’t seem to provide practical solutions.

In my case, I’m working with a mixture of two Gaussian distributions: N(0, 1/\sqrt{10}) and N(10, 1/\sqrt{3}). After running HMC, the posterior collapses as described in the guide:

One example is a binary mixture model with varying scales, \sigma_1 and \sigma_2, for locations \mu_1 and \mu_2. The density grows without bound as \sigma_1 \to 0 and \mu_1 \to y_n for some n, concentrating all mass around a single data point y_n.

Could you suggest ways to address this issue?

Thank you!

Best regards,
Dylan

I would like to rephrase the question to make it easier to understand. I am working on a generalized linear mixed model. Specifically,

Y_{ij}\sim\text{Bernoulli}(p_{ij}), \; \text{logit}(p_{ij})=X_i+Z_{ij}^\top\beta, \;X_i \sim w_{1}N(\mu_1,\lambda_1^{-1})+w_{2}N(\mu_2,\lambda_2^{-1})

where 1\leq i\leq 500, 1\leq j \leq 6, \beta, Z_{ij} \in \mathbb{R}^8. The parameters involved are \beta, \mu_1, \mu_2, \lambda_1, \lambda_2, w_1. The true values are \beta=(-1.1671, 2.4665, -0.1918, -0.1008, 0.6212, 0.6524, 1.5410, 0.2563), \mu_1=0, \mu_2=3, \lambda_1=10, \lambda_2=3, w_1=0.8.
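For concreteness, here is a rough sketch (NumPy, illustrative names; the standard-normal covariates are my own assumption, since I haven't stated a distribution for Z_{ij}) of how data from this model can be simulated:

```python
# Sketch of simulating data from the mixed logit model described above.
import numpy as np

rng = np.random.default_rng(0)
n_subj, n_obs = 500, 6
beta = np.array([-1.1671, 2.4665, -0.1918, -0.1008, 0.6212, 0.6524, 1.5410, 0.2563])
mu = np.array([0.0, 3.0])        # component means
lam = np.array([10.0, 3.0])      # component precisions
w1 = 0.8

# Latent X_i from the two-component Gaussian mixture
comp = (rng.random(n_subj) >= w1).astype(int)        # 0 = first component, 1 = second
X = rng.normal(mu[comp], 1.0 / np.sqrt(lam[comp]))

# Bernoulli responses through the logit link
Z = rng.normal(size=(n_subj, n_obs, beta.size))      # assumed covariate distribution
p = 1.0 / (1.0 + np.exp(-(X[:, None] + Z @ beta)))
Y = (rng.random((n_subj, n_obs)) < p).astype(int)
```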

When running HMC (not Stan; I wrote the HMC myself), the samples always get stuck at large values of \lambda_1 or \lambda_2, or in the extreme case even diverge to infinity, which aligns with the following scenario described in the guide:

One example is a binary mixture model with varying scales, \lambda_1 and \lambda_2, for locations \mu_1 and \mu_2. The density grows without bound as \lambda_1^{-1} \rightarrow 0 and \mu_1 \rightarrow y_n for some n, concentrating all mass around a single data point y_n.

Hi @daoyuan-lai and welcome to the Stan forums.

Presumably w_1 \in (0, 1) and w_2 = 1 - w_1?

And what are your priors?

When you’re using your own HMC, you have to diagnose potential issues with your sampler, with your model definition, and with your gradients.

I’m not sure which guide you’re talking about, but you’ve put your finger on the problem. MacKay’s book describes this problem of mixture collapse as “EM goes boom” (EM being a popular tool for fitting maximum likelihood estimates of mixtures). What he meant is that the likelihood goes to infinity.

What you need to do is either initialize so this doesn’t happen or, better, add zero-avoiding priors on the scales (or infinity-avoiding priors on the precisions).
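As a rough illustration (the gamma shape and rate below are made up, not a recommendation for your model), an infinity-avoiding prior on a precision could look like this:

```python
# Gamma(shape, rate) prior on a precision lambda: density -> 0 as lambda -> 0
# and decays exponentially as lambda -> infinity, so very large precisions
# (a collapsing mixture component) are explicitly penalized.
from scipy import stats

def log_prior_precision(lam, shape=2.0, rate=0.5):
    # SciPy parameterizes the gamma by scale = 1 / rate
    return stats.gamma.logpdf(lam, a=shape, scale=1.0 / rate)

print(log_prior_precision(10.0))   # plausible precision
print(log_prior_precision(1e4))    # collapsing component: heavily penalized
```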

The issue you’ll run into doing this is that you have to debug the sampler, the gradients, and the model. Here, the problem’s the model.

Thank you very much for your prompt reply and sorry for not being clear before. My goal for this task is to reproduce the results from the pseudo-marginal HMC paper (Alenlöv et al., 2021) and thus all my model settings strictly follow their implementations.

As described previously, Alenlöv et al. are considering a generalised linear mixed model, where Y_{ij} \sim \text{Bernoulli}(p_{ij}), \text{logit}(p_{ij}) = X_i + Z_{ij}^T\beta, and the latent variables X_{i} \sim wN(\mu_1, \lambda_1^{-1}) + (1-w)N(\mu_2, \lambda_2^{-1}). Hence the parameters involved here are (\beta^T, \mu_1, \mu_2, \lambda_1, \lambda_2, w), and to satisfy \lambda_1, \lambda_2 > 0 and 0<w<1, the target parameter vector is taken to be

\theta = (\beta^T, \mu_1, \mu_2, \log\lambda_1, \log\lambda_2, \text{logit}(w)).
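Concretely, the mapping between the constrained parameters and \theta looks roughly like this (the helper names are just for illustration):

```python
# Pack constrained parameters into the unconstrained vector theta and back.
import numpy as np

def pack(beta, mu1, mu2, lam1, lam2, w):
    extra = [mu1, mu2, np.log(lam1), np.log(lam2), np.log(w) - np.log1p(-w)]  # logit(w)
    return np.concatenate([beta, extra])

def unpack(theta):
    beta = theta[:8]
    mu1, mu2 = theta[8], theta[9]
    lam1, lam2 = np.exp(theta[10]), np.exp(theta[11])
    w = 1.0 / (1.0 + np.exp(-theta[12]))   # inverse logit
    return beta, mu1, mu2, lam1, lam2, w
```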

And yes, I’m using HMC, in particular pseudo-marginal HMC, to conduct the sampling. My Hamiltonian function is defined as

H(\theta, \rho, u, p) = -\log p(\theta) - \log p(y|\theta, u) + 0.5 \times (\rho^T\rho + u^Tu + p^Tp).

Here \theta is the aforementioned target parameter and \rho is its corresponding momentum variable. Alenlöv et al. assume the prior of \theta to be N(0, 100I) and let \rho \sim N(0, I). The u is the auxiliary variable for the latent variables X_i in the pseudo-marginal method, and p is its corresponding momentum variable. Alenlöv et al. assume both of them follow standard normal distributions.
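In code, I evaluate the Hamiltonian roughly along these lines (log_prior and log_lik_hat are placeholders for my implementations of \log p(\theta) and \log p(y|\theta, u)):

```python
# H(theta, rho, u, p) = -log p(theta) - log p(y | theta, u) + kinetic terms
import numpy as np

def hamiltonian(theta, rho, u, p, log_prior, log_lik_hat):
    potential = -log_prior(theta) - log_lik_hat(theta, u)
    kinetic = 0.5 * (rho @ rho + u.ravel() @ u.ravel() + p.ravel() @ p.ravel())
    return potential + kinetic
```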

The motivation for using the pseudo-marginal method here is that we want to use a surrogate p(y|\theta, u) as an unbiased estimate of the marginal likelihood obtained by integrating out the latent X_i's. Hence importance sampling is used:

p(y|\theta, u) = \prod_{i=1}^T \prod_{j=1}^n p(y_{ij} | \theta, u_i) = \prod_{i=1}^T \prod_{j=1}^n \{\frac{1}{N}\sum_{l=1}^N \frac{g_\theta (y_{ij} | X_{il}) f_\theta(X_{il})}{q_\theta(X_{il})} \},

where q_\theta(X_{il}) is the proposal distribution in importance sampling and here we set it to N(0, 3^2) and use X_{il} = 3U_{il}.
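As a rough sketch (not my exact code; for stability I work in the log domain and group the product over j inside the average over the N samples), the estimate is computed like this:

```python
# Importance-sampling estimate of log p(y | theta, u) with q = N(0, 3^2), X_il = 3 * U_il.
import numpy as np
from scipy import stats
from scipy.special import logsumexp

def log_lik_hat(y, Z, beta, mu, lam, w1, U):
    """y: (T, n) binary; Z: (T, n, d); U: (T, N) standard-normal auxiliaries."""
    T, N = U.shape
    X = 3.0 * U                                      # importance samples
    log_q = stats.norm.logpdf(X, 0.0, 3.0)           # proposal density
    # log f_theta(X): two-component Gaussian mixture prior on the latent X_i
    log_f = np.logaddexp(
        np.log(w1) + stats.norm.logpdf(X, mu[0], 1.0 / np.sqrt(lam[0])),
        np.log(1.0 - w1) + stats.norm.logpdf(X, mu[1], 1.0 / np.sqrt(lam[1])))
    # log g_theta(y_i | X_il): Bernoulli-logit likelihood, product over j
    eta = X[:, :, None] + (Z @ beta)[:, None, :]     # (T, N, n)
    log_g = -np.logaddexp(0.0, np.where(y[:, None, :] == 1, -eta, eta)).sum(axis=2)
    # average the weights over the N samples, then sum the log estimates over i
    return (logsumexp(log_g + log_f - log_q, axis=1) - np.log(N)).sum()
```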

My problem here is: using this pseudo-marginal HMC, \lambda_1 or \lambda_2 always drifts to a large value, no matter how small a step size I choose. I suspect that one of the Gaussian mixture components is fitting a single data point, and therefore its precision tends to go to infinity. Once I change the distribution of X_i from a Gaussian mixture to a single Gaussian, \lambda_1 and \lambda_2 no longer show this explosive behavior.

To check whether there is any error in my implementation, I have double-checked my gradient computations and the integrator. After several tests, both seem correct to me. Therefore, I’m quite stuck on these abnormal sampling results.
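The kind of gradient check I have in mind is a standard central finite-difference comparison against the analytic gradient, roughly (logpost_fn and grad_fn are placeholders for my code):

```python
# Compare the analytic gradient with central finite differences, coordinate by coordinate.
import numpy as np

def check_gradient(logpost_fn, grad_fn, theta, eps=1e-6):
    numeric = np.zeros_like(theta)
    for k in range(theta.size):
        step = np.zeros_like(theta)
        step[k] = eps
        numeric[k] = (logpost_fn(theta + step) - logpost_fn(theta - step)) / (2.0 * eps)
    # should be small (on the order of eps) if the gradient is correct
    return np.max(np.abs(numeric - grad_fn(theta)))
```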

I would greatly appreciate it if you could provide any advice or guidance on this matter, which will be invaluable to me. Thank you for your time.

Reference:

Alenlöv, J., Doucet, A., & Lindsten, F. (2021). Pseudo-Marginal Hamiltonian Monte Carlo. Journal of Machine Learning Research, 22.

I’m not sure what pseudo-marginal HMC is or what it’s going to do, so can’t offer any advice. You might want to ask the author of that paper.

Step size just affects stability and speed—it shouldn’t change the answer if you’ve chosen a stable step size.

May I ask again what your priors are on \lambda_1, \lambda_2? This is just an issue with mixtures in general, and has nothing to do with HMC.

Dear Bob,

I am really grateful for your advice.

As suggested in the paper, the priors are placed on \log\lambda_1 and \log\lambda_2 and are \mathcal{N}(0,10^2), and I can confirm I have considered the Jacobian term when changing from \log\lambda_1 to \lambda_1.
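To be explicit about what I mean by the Jacobian term, the implied log prior density of \lambda_1 itself is computed roughly like this (a sketch, not my exact code):

```python
# log(lambda) ~ N(0, 10^2), so the density of lambda itself picks up the
# log-Jacobian log|d log(lambda) / d lambda| = -log(lambda).
import numpy as np
from scipy import stats

def log_prior_lambda(lam):
    return stats.norm.logpdf(np.log(lam), 0.0, 10.0) - np.log(lam)
```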

Thanks a lot!

Now I see where I went wrong. I was reading too quickly and assuming the \lambda was the mixture proportion as that’s the Greek letter that’s usually used for what you wrote as w. What I meant to ask is what are the priors on w?

I’m confused by all the different notations for normal distributions. For Stan, we like to write \text{normal}(\mu, \sigma) for a normal distribution with location \mu and scale \sigma. It’s also common to see \mathcal{N}(\mu, \sigma^2) for the same distribution. I’m not sure which one N is supposed to be and thus I don’t know what \lambda_1^{-1} is. I’m guessing that \lambda_1 = 1 / \sigma^2 is a precision and thus \lambda_1^{-1} = \sigma^2 is a variance. In Stan, we usually just put priors directly on \sigma because the units are easier to understand than \sigma^2 and Stan doesn’t depend on conjugacy.

Also, a lognormal(0, 10) prior is probably too weak to do much for an inverse scale parameter.
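For a sense of just how diffuse that is (a quick calculation of my own):

```python
# Central 95% interval of a lognormal(0, 10) distribution.
from scipy import stats
print(stats.lognorm.ppf([0.025, 0.975], s=10.0))   # roughly [3e-09, 3e+08]
```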