Why would anyone ever want to use a likelihood for a mixture model in which the discrete variables are "not marginalized out"?



I am reading this article on mixture models: https://mc-stan.org/users/documentation/case-studies/identifying_mixture_models.html .

The advice there is to use likelihood functions in which the discrete variables are not sampled but are “marginalized out”.

I think I am missing something (important?) here: why does it even cross anybody’s mind to use a likelihood function in which the discrete variables are not marginalized out?

Why would that ever be useful?

Maybe for “semi-supervised learning”? But even then… if the classes are partially observed, then there is no need to sample them because they are already known :)

I cannot imagine what it would mean to use a likelihood function for a mixture model where the discrete variables are not marginalized out.

I think I am missing something very basic. Could someone please enlighten me?

I mean, it’s not the distribution’s density that is discrete but the variable itself, and we sample the parameters, which are always continuous, right? So I don’t really see the problem with discreteness.

Doing MCMC for the Ising model is a different issue, because there the energy depends on discrete variables. Here, as far as I understand, the energy depends only on continuous variables (the parameters).

Could someone please give a very simple example where it is a good idea to sample from discrete distributions? Can they not always be marginalized out? (By making a for loop, or two, or three?)
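To make my own question concrete, here is (I believe) what “marginalizing out” a two-component label amounts to, sketched in Python rather than Stan. The two-component Gaussian choice and all function names here are my own illustration, not anything from the case study:

```python
import math

def normal_lpdf(x, mu, sigma):
    # log density of Normal(mu, sigma) at x
    return -0.5 * math.log(2.0 * math.pi) - math.log(sigma) \
        - 0.5 * ((x - mu) / sigma) ** 2

def log_mix2(lam, lp1, lp2):
    # log(lam * exp(lp1) + (1 - lam) * exp(lp2)),
    # computed stably via log-sum-exp
    a = math.log(lam) + lp1
    b = math.log1p(-lam) + lp2
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

def mixture_loglik(xs, lam, mu1, sigma1, mu2, sigma2):
    # The discrete label z_i of each observation is marginalized out:
    # the "for loop over z" is just the two-term sum inside log_mix2.
    return sum(log_mix2(lam,
                        normal_lpdf(x, mu1, sigma1),
                        normal_lpdf(x, mu2, sigma2))
               for x in xs)
```

So the marginalization really is just a small sum over the label’s values, done once per observation; no discrete quantity is ever sampled.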





I think people often did this (see: myself) when estimating mixtures in BUGS/JAGS. It’s “easy” to do mixtures with explicit discrete variables that way in those programs. It’s actually harder, I think (is it even possible?), to do mixtures in JAGS with just a simplex parameter.



But doing that (direct sampling) won’t get rid of the identifiability problem either, right? I see no reason why it would; am I correct?



In my experience, more senior Bayesian collaborators favor using unmarginalized likelihoods because the data augmentation approach of imputing a mixture component label within the MCMC is both widely accepted and easy to explain in an algorithm. You can also easily query/track the population proportion falling into each component, which is sometimes of scientific interest. I advocate for marginalized likelihoods because I’m the one coding and running the computations. It’s my problem if the convergence is bad.
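Concretely, the imputation step is just “draw each label from its conditional posterior given the current parameters.” A minimal sketch in Python, assuming a two-component Gaussian mixture (the function names are my own illustration):

```python
import math
import random

def normal_lpdf(x, mu, sigma):
    # log density of Normal(mu, sigma) at x
    return -0.5 * math.log(2.0 * math.pi) - math.log(sigma) \
        - 0.5 * ((x - mu) / sigma) ** 2

def impute_labels(xs, lam, mu1, sigma1, mu2, sigma2, rng=random):
    """One data-augmentation step: draw each label z_i from its
    conditional posterior given the current parameter values."""
    labels = []
    for x in xs:
        a = math.log(lam) + normal_lpdf(x, mu1, sigma1)
        b = math.log(1.0 - lam) + normal_lpdf(x, mu2, sigma2)
        # P(z_i = 1 | x_i, parameters); cap the exponent to avoid overflow
        p1 = 1.0 / (1.0 + math.exp(min(b - a, 700.0)))
        labels.append(1 if rng.random() < p1 else 2)
    return labels
```

Tracking the fraction of 1s in these draws across MCMC iterations is exactly how you get the population proportion per component that I mentioned being of scientific interest.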

Also, many of the quantities I try to estimate are not “identified” in a classical/frequentist sense, but I do not care so long as I am appropriately quantifying the uncertainty. There may also be partial identifiability arising from assumptions (e.g., a parametric model specification) or other features of the problem, so it’s not always totally hopeless.



Thanks @lcomm, I think I need to read this a few times, memorise it, read a few books, and then one day, a few years later, when I am standing in the grocery store in front of the cheese section, trying to decide what kind of cheese I should buy (cheddar or brie???), THEN it will suddenly click what you meant here :) … as usually happens with this kind of thing :) Thanks for the answer; I keep coming back to it and am making it my mantra.



But you mean that if you want to compute anything using the posterior samples, then you can do that and it will be OK, no matter how many non-identifiable components there are, how often the chain jumps over “to the other ordering”, etc.? When it comes to expectation values computed using the posterior, none of this identification business matters; do I understand this correctly?



Also, an extremely obvious “the answer is no” type of question:

Would it be possible to sample from a continuous distribution in Stan (say uniform; call the variable X), and then put an IF into the log-likelihood function? If X < 0, the log likelihood (LLH) is given by the Y_1 distribution; if X > 0, by the Y_2 distribution. Would this not mean that some of the variables are sampled as simple coin flips, while others are sampled according to the actually selected values of X?

Would this approach work with Stan?



The answer is no: it breaks differentiability. Stan’s HMC sampler needs the gradient of the log density, and an IF like that makes the target discontinuous at the switch point.

Also, sometimes it is hard to marginalize out discrete variables. Sometimes really hard: think of the Tweedie likelihood (that is, a compound Poisson-gamma, I think).



Many thanks @Max_Mantei!



How often the chain jumps over “to the other ordering”, etc… when it comes to expectation values computed using the posterior, all this identification business does not matter, do I understand this correctly ?

I might not be understanding you correctly, but it sounds like you are describing the label switching problem, which I tend not to encounter much in my own research. Either I am using informative priors (which fix the category interpretations) or there is some other way for me to tell the latent categories apart.

My favorite (non-Stan) paper that views mixture components as potentially scientifically interpretable is Schwartz, Li, and Mealli in JASA (2011). I particularly like their application with latent compliance clusters.



Hi again,

Just a quick general comment, food for thought, nothing really serious, just “dreaming”/“playing around with some strange ideas”. The only purpose of this post is to be a spark for reaching a new level of understanding of ML. Maybe, perhaps, who knows; I do not promise anything :) Enjoy!

This whole marginalization/hidden-parameter business is a pretty big deal, IMHO, when it comes to creating fast ML solutions.

I came to this intuition when I came up with a hypothetical machine-vision algorithm that would use Stan and would outperform deep networks in situations where only 0.1% of the samples are labeled (as in real life); i.e., deep networks cannot (yet?) do semi-supervised learning, while Stan can, easily.

I came to the intuitive “conclusion” that this kind of marginalization over hidden parameters is where the devil is.

OK, if this makes zero sense, that is fine. I am not stating anything; I wrote this down because these few thoughts might inspire someone to reach new aha moments.