Why would anyone ever want to use a likelihood for a mixture model in which the discrete variables are "not marginalized out"?


I am reading this article on mixture models : https://mc-stan.org/users/documentation/case-studies/identifying_mixture_models.html .

The advice there is to use likelihood functions in which the discrete variables are not sampled but are “marginalized out”.

I think I am missing something (important?) here: why would it even cross anybody’s mind to use a likelihood function in which the discrete variables are not marginalized out?

Why would that ever be useful ?

Maybe for “semi supervised learning” ? - but even then… if the classes are partially observed… then there is no need to sample them because they are already known :)

I cannot imagine what it would mean to use a likelihood function for a mixture model where the discrete variables are not marginalized out.

I think I am not getting something very basic. Could someone please enlighten me ?

I mean, it’s not the density that is discrete but the variable itself, and what we sample are the parameters, which are always continuous, right? So I don’t really see where the problem with discrete variables comes from.

Doing MC for the Ising model is a different issue, because there the energy depends on discrete variables. Here, as far as I understand, the energy depends only on continuous variables (the parameters).

Could someone please give a very simple example where it is a good idea to sample from discrete distributions ? Can they not always be marginalized out ? ( Making a for loop, or two, or three ? )
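For a finite mixture, the "for loop" marginalization asked about above can be sketched roughly as follows. This is only an illustration in plain Python, not Stan code: the normal components and the helper names (`normal_logpdf`, `log_sum_exp`, `marginal_loglik`) are my assumptions, not something from the case study.

```python
import math


def normal_logpdf(y, mu, sigma):
    """Log density of Normal(mu, sigma) evaluated at y."""
    return (-0.5 * math.log(2.0 * math.pi) - math.log(sigma)
            - 0.5 * ((y - mu) / sigma) ** 2)


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v))) over a list of values."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def marginal_loglik(ys, weights, mus, sigmas):
    """Mixture log-likelihood with the discrete label summed out.

    For each observation, sum over the K possible labels:
    log p(y) = log sum_k w_k * Normal(y | mu_k, sigma_k).
    """
    total = 0.0
    for y in ys:  # loop over observations
        terms = [math.log(w) + normal_logpdf(y, mu, s)  # loop over labels
                 for w, mu, s in zip(weights, mus, sigmas)]
        total += log_sum_exp(terms)
    return total
```

Note that relabeling the components (swapping their weights, means and scales together) leaves this value unchanged, which is exactly the non-identifiability the case study discusses.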



I think people often did this (see: myself) when estimating mixtures in BUGS/JAGS. It’s “easy” to do mixtures with discrete variables that way in those programs. It’s actually harder, I think (is it even possible?), to do mixtures in JAGS with just a simplex parameter.

But doing that (direct sampling) won’t get rid of the identifiability problem either, right? I don’t see any reason why it would. Am I correct?

In my experience, more senior Bayesian collaborators favor using unmarginalized likelihoods because the data augmentation approach of imputing a mixture component label within the MCMC is both widely accepted and easy to explain in an algorithm. You can also easily query/track the population proportion falling into each component, which is sometimes of scientific interest. I advocate for marginalized likelihoods because I’m the one coding and running the computations. It’s my problem if the convergence is bad.
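On the point about tracking component membership: as far as I understand, marginalizing does not lose this information, because the membership probabilities can be recomputed from the continuous parameters at each posterior draw. A minimal sketch in plain Python, assuming normal components (the function names here are hypothetical):

```python
import math


def normal_logpdf(y, mu, sigma):
    """Log density of Normal(mu, sigma) evaluated at y."""
    return (-0.5 * math.log(2.0 * math.pi) - math.log(sigma)
            - 0.5 * ((y - mu) / sigma) ** 2)


def membership_probs(y, weights, mus, sigmas):
    """P(label = k | y, parameters) for each component k.

    Bayes' rule on the label: proportional to w_k * Normal(y | mu_k, sigma_k),
    normalized over k. Computed on the log scale for stability.
    """
    log_terms = [math.log(w) + normal_logpdf(y, mu, s)
                 for w, mu, s in zip(weights, mus, sigmas)]
    m = max(log_terms)
    unnormalized = [math.exp(t - m) for t in log_terms]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]
```

Averaging these probabilities over posterior draws (and over observations) gives the estimated proportion in each component without ever sampling a discrete label.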

Also, many of the quantities I try to estimate are not “identified” in a classical/frequentist sense, but I do not care so long as I am appropriately quantifying the uncertainty. There may also be partial identifiability arising from assumptions (e.g., a parametric model specification) or other features of the problem, so it’s not always totally hopeless.

Thanks @lcomm, I think I need to read this a few times, memorise it, read a few books, and then one day, a few years later… when I am standing in the grocery store, in front of the cheese section, trying to decide what kind of cheese I should buy: cheddar or brie ??? AND THEN it suddenly clicks what you meant here :) … as it usually happens with these kinds of things :) thanks for the answer, I will keep coming back to it and make it my mantra.

But do you mean that if I want to calculate anything from the posterior samples, I can do that and it will be OK, no matter how many non-identifiable components there are? However often the chain jumps over “to the other ordering”, etc… when it comes to expectation values computed using the posterior, all this identification business does not matter. Do I understand this correctly?

Also an extremely obviously “the answer is no” type of question :

Would it be possible to sample a continuous variable in Stan (say uniform, call it x), and then put an IF into the log likelihood function? If x<0, the log likelihood (LLH) is given by the Y_1 distribution; if x>0, the LLH is given by the Y_2 distribution. Would this not mean that some of the variables are effectively sampled as simple coin flips, while the others are sampled conditional on the currently selected value of x?

Would this approach work with Stan ?

The answer is no. It breaks differentiability: the log density jumps at x=0, and Stan’s HMC sampler relies on gradients of the log density.
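A rough way to see the problem, sketched in plain Python rather than Stan. The two normal components standing in for Y_1 and Y_2 and all function names are assumptions for illustration only:

```python
import math


def normal_logpdf(y, mu, sigma):
    """Log density of Normal(mu, sigma) evaluated at y."""
    return (-0.5 * math.log(2.0 * math.pi) - math.log(sigma)
            - 0.5 * ((y - mu) / sigma) ** 2)


def hard_switch_logp(x, y):
    """The proposed IF: x only selects which component explains y."""
    if x < 0:
        return normal_logpdf(y, -2.0, 1.0)  # stand-in for Y_1
    return normal_logpdf(y, 2.0, 1.0)       # stand-in for Y_2


def marginal_logp(theta, y):
    """Marginalized alternative: theta is a continuous mixing weight."""
    a = math.log(theta) + normal_logpdf(y, -2.0, 1.0)
    b = math.log(1.0 - theta) + normal_logpdf(y, 2.0, 1.0)
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))


def finite_diff(f, x, h=1e-6):
    """Central finite-difference approximation of df/dx."""
    return (f(x + h) - f(x - h)) / (2.0 * h)
```

With the hard switch, the log density is flat in x on either side of zero and jumps at zero, so the gradient carries no information about which branch is better. The marginalized version is smooth in `theta`, so its gradient does point somewhere useful.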

Also, sometimes it is hard to marginalize out discrete variables. Sometimes really hard - think of the Tweedie likelihood (a compound Poisson-gamma, I think).


Many thanks @Max_Mantei !

How often the chain jumps over “to the other ordering”, etc… when it comes to expectation values computed using the posterior, all this identification business does not matter, do I understand this correctly ?

I might not be understanding you correctly, but it sounds like you are describing the label switching problem, which I tend not to encounter much in my own research. Either I am using informative priors (which fix the category interpretations) or there is some other way for me to tell the latent categories apart.
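To make the label switching point concrete, here is a small sketch in plain Python. The draws below are made-up illustrative numbers, not output from any real fit, and the function names are hypothetical. The idea is that summaries which do not depend on the labels are unaffected by switching, while label-specific summaries are not.

```python
# Each draw is (w1, mu1, mu2) for a two-component mixture; the second
# weight is 1 - w1. These values are invented purely for illustration.
draws = [(0.30, -1.0, 2.0), (0.35, -1.2, 1.9),
         (0.28, -0.9, 2.2), (0.33, -1.1, 2.1)]


def swap_labels(draw):
    """Relabel the components: same mixture, labels 1 and 2 exchanged."""
    w1, mu1, mu2 = draw
    return (1.0 - w1, mu2, mu1)


def mixture_mean(draw):
    """Overall mean of the mixture; it does not depend on the labeling."""
    w1, mu1, mu2 = draw
    return w1 * mu1 + (1.0 - w1) * mu2


def average(values):
    values = list(values)
    return sum(values) / len(values)


# Pretend the chain switched labels on every other draw.
switched = [swap_labels(d) if i % 2 else d for i, d in enumerate(draws)]

invariant_before = average(mixture_mean(d) for d in draws)
invariant_after = average(mixture_mean(d) for d in switched)
mu1_before = average(d[1] for d in draws)
mu1_after = average(d[1] for d in switched)
```

Here `invariant_before` and `invariant_after` agree up to rounding, whereas `mu1_before` and `mu1_after` differ a lot, so whether switching matters depends on whether the quantity of interest refers to a specific label.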

My favorite (non-Stan) paper that views mixture components as potentially scientifically interpretable is Schwartz, Li, and Mealli in JASA (2011). I particularly like their application with latent compliance clusters.

Hi again,

Just a quick general comment - food for thought - nothing really serious - just “dreaming”/“playing around with some strange ideas”. The only purpose of this post is to be a spark for reaching a new level of understanding of ML, maybe, perhaps, who knows, I do not promise anything :) enjoy !

This whole marginalization/hidden parameter business is a pretty big deal imho when it comes to creating fast ML solutions.

I came to this intuition when I came up with a hypothetical machine vision algorithm that would use Stan and would outperform deep networks in situations where only 0.1% of the samples are labeled (as in real life) - i.e. deep networks cannot (yet?) do semi supervised learning, Stan can, easily.

I came to the intuitive “conclusion” that this kind of marginalization over hidden parameters is where the devil is.

Ok, if this makes zero sense then that is ok. I am not stating anything, I wrote this down because these few thoughts might inspire someone to reach new aha moments.



One quick thought on this topic: this seems to be related to the symmetry breaking in the 2D Ising phase transition, below T_c (the critical temperature).

Strictly speaking, a finite 2D Ising model never has a phase transition; the transition exists only in the thermodynamic limit.

So the “jumping around” is basically the same thing. If Stan (or, more realistically, another MC sampler, since Stan does not sample discrete parameters) were to model a finite version of the 2D Ising model, then in finite systems (especially small ones) such “jumps” would be expected.

So it is nice to have a physical analogy to better understand what those mixture model degeneracy problems mean. At least for people who have a Bachelor’s in physics.

2D Ising phase transitions are not everybody’s favorite, but they are certainly mine. The 2D Ising model shows a huge amount of non-trivial phenomena, and it is (one of) the simplest non-trivial models in which phase transitions can be studied.

MC has a very hard time with phase transitions because of so-called critical slowing down around T_c (which is due to diverging correlation lengths, I guess - intuitively at least, this all makes sense if one imagines what happens at T_c in a 2D Ising model).

I remember people going to great lengths to speed up such MC simulations around T_c by flipping entire clusters of connected spins in the same state (as a single “MC-step”) - I think this goes back to the late 80’s (the Swendsen-Wang and Wolff cluster algorithms).
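For readers who have not seen a cluster move, here is a minimal sketch of a Wolff-style single-cluster update in plain Python. It is my reading of the cluster-flip idea described above, not a tuned implementation; the function name and the square lattice with periodic boundaries are assumptions.

```python
import math
import random


def wolff_step(spins, beta, rng, coupling=1.0):
    """One Wolff-style cluster flip on an L x L lattice of +1/-1 spins.

    Grows a cluster of aligned spins from a random seed site, adding each
    aligned neighbor with probability 1 - exp(-2 * beta * coupling), then
    flips the whole cluster at once. Returns the cluster size.
    """
    size = len(spins)
    p_add = 1.0 - math.exp(-2.0 * beta * coupling)
    i0, j0 = rng.randrange(size), rng.randrange(size)
    seed_spin = spins[i0][j0]
    cluster = {(i0, j0)}
    stack = [(i0, j0)]
    while stack:
        i, j = stack.pop()
        for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            ni, nj = (i + di) % size, (j + dj) % size  # periodic boundaries
            if ((ni, nj) not in cluster
                    and spins[ni][nj] == seed_spin
                    and rng.random() < p_add):
                cluster.add((ni, nj))
                stack.append((ni, nj))
    for i, j in cluster:
        spins[i][j] = -seed_spin  # flip the entire cluster in one move
    return len(cluster)
```

At low temperature (large `beta`) a single step can flip almost the whole lattice, which is the kind of global "jump" between the two symmetric states that single-spin updates struggle to make.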

Maybe something to keep in mind when one is “doing” mixture models.

Explicit marginalization over the mixture labels “simply” makes the problem symmetric. In the 2D Ising model, this would correspond to sampling spin configurations from a “symmetrized” Hamiltonian, which is, again, frighteningly reminiscent of a symmetrized, bosonic wave-function.

People with a BSc in physics might find this connection interesting (at least this was standard material in 3rd year statistical mechanics).

Bosons are particles which obey Bose–Einstein statistics: When one swaps two bosons (of the same species), the wavefunction of the system is unchanged.

Fermions, on the other hand, obey Fermi–Dirac statistics and the Pauli exclusion principle: Two fermions cannot occupy the same quantum state, accounting for the “rigidity” or “stiffness” of matter which includes fermions. Thus fermions are sometimes said to be the constituents of matter, while bosons are said to be the particles that transmit interactions (force carriers), or the constituents of radiation.