Mixture of Normals as distribution for latent variable


First of all, congratulations on creating this wonderful community!

I have a quick question. Is it possible to use normal mixtures in Stan as distributions for latent variables? In all the examples I have seen so far, normal mixtures are directly assigned to data.

For example, how about something like y_i \sim \text{Poisson}(\mu+\epsilon_i) (with y_i observed), where each \epsilon_i follows a normal mixture?

Stan only very rarely privileges continuous (real, vector, etc.) data over parameters, so most code that accepts data will also accept parameters. There are a few exceptions, but log_mix and log_sum_exp are not among them. In principle, then, you can put a mixture on a latent parameter exactly as you would on observed data.
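To make concrete what is being computed, here is a minimal Python sketch (not Stan code) of the quantity that `log_mix(theta, lp1, lp2)` returns, namely log(theta * exp(lp1) + (1 - theta) * exp(lp2)), applied to a latent value rather than to data. The helper names `normal_lpdf` and `mixture_prior_lpdf` and all numeric values are made up for illustration, and the sketch assumes theta is strictly between 0 and 1.

```python
import math


def log_sum_exp(a, b):
    """Numerically stable log(exp(a) + exp(b))."""
    m = max(a, b)
    if m == -math.inf:
        return -math.inf
    return m + math.log(math.exp(a - m) + math.exp(b - m))


def log_mix(theta, lp1, lp2):
    """log(theta * exp(lp1) + (1 - theta) * exp(lp2)), for 0 < theta < 1."""
    return log_sum_exp(math.log(theta) + lp1, math.log1p(-theta) + lp2)


def normal_lpdf(x, mu, sigma):
    """Log density of Normal(mu, sigma) at x (sigma is a standard deviation)."""
    z = (x - mu) / sigma
    return -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)


def mixture_prior_lpdf(eps, theta, a1, sigma1, a2, sigma2):
    """Log density of a two-component normal mixture at a latent value eps."""
    return log_mix(theta,
                   normal_lpdf(eps, a1, sigma1),
                   normal_lpdf(eps, a2, sigma2))


# Illustrative values only: the log prior density of one latent epsilon.
lp = mixture_prior_lpdf(0.3, theta=0.4, a1=-1.0, sigma1=0.5, a2=2.0, sigma2=1.5)
```

Nothing in this calculation cares whether `eps` is observed or sampled, which is the point made above.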

In practice, however, Stan doesn’t work well with multimodal posteriors, so unless you ensure that the mixture is not multimodal (e.g. by using a mixture of two distributions that share a mean and differ only in scale), the model is unlikely to work well.
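A rough way to see the difference between the two kinds of mixture is to count local maxima of the mixture density on a grid. This is only a sketch of the prior density of a single latent value, not of a full posterior. The function names, grid bounds and parameter values are arbitrary choices for illustration.

```python
import math


def normal_pdf(x, mu, sigma):
    """Density of Normal(mu, sigma) at x (sigma is a standard deviation)."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def mixture_pdf(x, theta, a1, s1, a2, s2):
    """Density of a two-component normal mixture at x."""
    return theta * normal_pdf(x, a1, s1) + (1.0 - theta) * normal_pdf(x, a2, s2)


def count_modes(theta, a1, s1, a2, s2, lo=-10.0, hi=10.0, n=4001):
    """Count strict interior local maxima of the mixture density on a grid."""
    xs = [lo + (hi - lo) * i / (n - 1) for i in range(n)]
    d = [mixture_pdf(x, theta, a1, s1, a2, s2) for x in xs]
    return sum(1 for i in range(1, n - 1) if d[i - 1] < d[i] > d[i + 1])


# Shared mean, different scales: a scale mixture.
scale_mixture_modes = count_modes(0.5, 0.0, 0.5, 0.0, 3.0)

# Well-separated means, equal scales: a location mixture.
location_mixture_modes = count_modes(0.5, -3.0, 1.0, 3.0, 1.0)
```

With these particular values the scale mixture has a single mode and the location mixture has two. The second kind is the one that tends to cause trouble.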

But the case you’ve shown (and, I would guess, most others) can be translated directly into a mixture over the observed data, i.e. you can get an equivalent model via:

y_i \sim \mathrm{Mix}(\theta, \mathrm{Poisson}(\mu + \epsilon_{i, 1}), \mathrm{Poisson}(\mu + \epsilon_{i, 2})) \\ \epsilon_{i, 1} \sim N(a_1, \sigma_1) \\ \epsilon_{i, 2} \sim N(a_2, \sigma_2)

This should be pretty well behaved for the most part.
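For a single observation, the likelihood term of that model can be sketched in Python as follows (again not Stan code, just the arithmetic). The component indicator is summed out with a `log_mix`-style calculation. The function names and numbers are illustrative. One assumption that is mine rather than part of the model as written: a non-positive rate \mu + \epsilon is given log probability -inf here, since a Poisson rate has to be positive. The normal priors on the two \epsilon values would be added as separate log-density terms.

```python
import math


def poisson_lpmf(y, rate):
    """Log Poisson pmf; returns -inf for a non-positive rate (sketch's choice)."""
    if rate <= 0.0:
        return -math.inf
    return y * math.log(rate) - rate - math.lgamma(y + 1.0)


def log_mix(theta, lp1, lp2):
    """log(theta * exp(lp1) + (1 - theta) * exp(lp2)), for 0 < theta < 1."""
    a = math.log(theta) + lp1
    b = math.log1p(-theta) + lp2
    m = max(a, b)
    if m == -math.inf:
        return -math.inf
    return m + math.log(math.exp(a - m) + math.exp(b - m))


def obs_log_lik(y, theta, mu, eps1, eps2):
    """Log-likelihood of one count y with the component indicator summed out."""
    return log_mix(theta,
                   poisson_lpmf(y, mu + eps1),
                   poisson_lpmf(y, mu + eps2))


# Illustrative values only: one observed count and its two latent offsets.
ll = obs_log_lik(3, theta=0.3, mu=2.0, eps1=0.5, eps2=1.5)
```

Summing `exp(obs_log_lik(y, ...))` over all counts y gives 1, which is a quick check that this is a proper mixture of two Poisson pmfs.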

Best of luck with your model!

Many thanks for the equivalent model. It seems interesting and I will definitely try it!

About Stan not working well with multimodal posteriors, I am wondering whether this can be addressed by several parallel chains and post-processing of the MCMC output.

In theory: yes. In practice, IMHO, it rarely works. Most multimodal posteriors exhibit weird curvature that breaks sampling (e.g. gives you divergent transitions). For the most part this is a good thing, as it lets you notice multimodality even if you were not aware of it. Additionally, it is very hard to ensure that your chains actually visited all the modes, because the “attraction basin” of a mode (i.e. the set of initial conditions from which it is likely to be found) may be relatively small even if the mode in fact contains a substantial amount of posterior mass. Most models just tend to work much better when you remove the multimodality in some way.

I see your point, many thanks