# Gibbs post-processing to find unknown K in mixture model

Hi guys, I’m following the advice in sections 22.3 and 22.4 of BDA3 on fitting mixture models with an unspecified number of components (K). In my case, say the data come from a mixture of 3 beta distributions in proportions \lambda=(.5,.25,.25) with \alpha=(1,2,3) and \beta=(1,1,1) respectively.

All I want is to discover K. My understanding is as follows:

(1) Specify a probability model. E.g.

h=1..H, where H is an upper bound on the possible number of clusters.
y=(y_1,..,y_N)
z=(z_1,...,z_N) where z_i \in \{1,...,H\} indicates the component y_i comes from.
\alpha_h is the h-th component of \alpha, with the ordering constraint \alpha_1 \lt ... \lt \alpha_H.
\lambda_h is the h-th component of \lambda.
\phi=(\alpha,\beta,\lambda)
y_i \mid z_i=h,\phi \sim Beta(\alpha_h,\beta_h)
z_i \mid \phi \sim Categorical(\lambda)
\lambda \sim Dirichlet(\frac{1}{H},...,\frac{1}{H})
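To keep myself honest, here is how I’d simulate data from this generative model (a sketch in NumPy; the particular \alpha values here are placeholders I made up, not part of the model above):

```python
import numpy as np

rng = np.random.default_rng(1)
H, N = 10, 1000

lam = rng.dirichlet(np.full(H, 1.0 / H))   # lambda ~ Dirichlet(1/H, ..., 1/H)
alpha = np.linspace(1.0, 10.0, H)          # placeholder ordered alphas (made up)
beta = np.ones(H)

z = rng.choice(H, size=N, p=lam)           # z_i ~ Categorical(lambda)
y = rng.beta(alpha[z], beta[z])            # y_i | z_i = h ~ Beta(alpha_h, beta_h)
```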

(2) Treat H as an upper bound (e.g. H=10) and build an MCMC model that averages over H and the z_i to recover the global parameters \theta=(\alpha,\beta).

(3) Gibbs time… The hope is that we can recover the discrete components z_i, and in doing so that K \approx \sum_{h=1}^H 1_{n_h>0}, i.e. K is roughly the number of components with at least one data point allocated to them.

Following sections 11.1 and 22.3, my Gibbs sampler will take \theta to be known global parameters and iterate the following steps to convergence.

(1) Sample each z_i from its conditional posterior z_i \mid y_i,\theta,\lambda (initializing \lambda with a draw from its Dirichlet prior on the first pass).

(2) Sample \lambda from its conditional posterior \lambda \mid z.

(4) What now? Can I use the argmax of P(z_i=h \mid y_i,\lambda) to allocate points to mixture components and then estimate K \approx \sum_{h=1}^H 1_{n_h>0}?
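A runnable sketch of such a Gibbs sampler in NumPy (my own illustration, not from BDA3; I’m assuming the conditional posteriors are z_i \mid y_i,\theta,\lambda \sim Categorical with likelihood-weighted probabilities, and \lambda \mid z \sim Dirichlet(\frac{1}{H}+n_1,...,\frac{1}{H}+n_H) by conjugacy):

```python
import numpy as np
from math import lgamma

rng = np.random.default_rng(0)
gammaln = np.vectorize(lgamma)

def beta_logpdf(y, a, b):
    # log density of Beta(a, b) at y, vectorized over arrays
    return (a - 1) * np.log(y) + (b - 1) * np.log1p(-y) \
        + gammaln(a + b) - gammaln(a) - gammaln(b)

# data from the 3-component truth in the post
z_true = rng.choice(3, size=500, p=[0.5, 0.25, 0.25])
y = rng.beta(np.array([1.0, 2.0, 3.0])[z_true], 1.0)

H = 10
alpha = np.arange(1.0, H + 1)   # theta taken as known, alpha_1 < ... < alpha_H
beta = np.ones(H)

lam = rng.dirichlet(np.full(H, 1.0 / H))
for _ in range(200):
    # step 1: z_i | y_i, theta, lambda  (categorical responsibilities)
    logp = np.log(np.clip(lam, 1e-300, None)) \
        + beta_logpdf(y[:, None], alpha[None, :], beta[None, :])
    p = np.exp(logp - logp.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    u = rng.random(len(y))[:, None]
    z = np.minimum((p.cumsum(axis=1) < u).sum(axis=1), H - 1)
    # step 2: lambda | z  (conjugate Dirichlet update on the counts)
    counts = np.bincount(z, minlength=H)
    lam = rng.dirichlet(1.0 / H + counts)

K_hat = int((counts > 0).sum())  # components with at least one point allocated
```

Tracking `(counts > 0).sum()` across draws after convergence would give the posterior distribution of the occupied-component count, rather than a single argmax allocation.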

My questions are:

(1) Is my understanding of this architecture correct?

(2) Is the Gibbs sampler correct?

(3) Both steps in my Gibbs sampler require conditional posteriors. Am I correct in saying that these posteriors must be derived analytically?

(4) What do I do at point 4?

EDIT:

So after more staring at the Gibbs sampler examples, I now understand that the above won’t work. Basically, I don’t think Stan can be used here at all, since the entire model above would need to be expressed inside the Gibbs sampler. K would then be estimated as the posterior mode of \sum_{h=1}^H 1_{n_h>0} over draws after convergence. To make the sampler efficient, all my distributions would need conjugate priors so that I can write their conditional posteriors analytically. Arghhhh…

Note however that for a not-so-big H you should be able to marginalize h out. The idea is to model the probabilities p_k = P(h = k) as parameters and compute the mixture likelihood over all H possible sub-models. A sketch in code (please double check, I do make mistakes quite frequently):

parameters {
  simplex[H] p; // probabilities need to sum to 1
}

model {
  vector[H] sub_likelihood;
  for (h in 1:H) {
    sub_likelihood[h] = ...; // compute log likelihood assuming h components
  }
  // mixture (marginal) log likelihood
  target += log_sum_exp(log(p) + sub_likelihood);
}
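As a sanity check of the marginalization itself (plain NumPy rather than Stan; the simplex and log likelihood values are made up), `log_sum_exp(log(p) + sub_likelihood)` stays numerically stable even when the raw likelihoods would underflow:

```python
import numpy as np

p = np.array([0.2, 0.5, 0.3])                   # simplex over H = 3 sub-models (made up)
sub_ll = np.array([-1040.0, -1035.0, -1036.5])  # per-sub-model log likelihoods (made up)

# log_sum_exp(log(p) + sub_ll): subtract the max before exponentiating,
# since exp(-1035) would underflow to 0 in double precision
a = np.log(p) + sub_ll
m = a.max()
mix_ll = m + np.log(np.exp(a - m).sum())
```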


I have to admit, however, that I have no experience running this with mixture models, nor have I seen it done before. (But if it works, please let us know, it would be nice for the manual :-) The only mention I could find quickly was a thread by @ldeschamps (Mixture model with unknown k), but there is no conclusion on whether this approach has been successful.

Note that you would probably need completely separate model parameters for each value of h, as there is no reason to believe that components of the smaller models are related to the components of the larger models (e.g. that \alpha^{h=2}_1 = \alpha^{h=3}_1).

The methods described in the Latent discrete parameters chapter of the Stan User’s Guide are somewhat related, but not very instructive for your case.

Hope that helps!


Hey, thanks for the reply. Averaging over the discrete parameters seems like the canonical advice. In the example above, K \approx \sum_{h=1}^H 1_{n_h>0} (i.e. the number of components with at least 1 allocated data point) is ultimately determined by the z_i, which are distributed according to Categorical(\lambda), so one should be able to average over the discrete components and then use \lambda to work out a distribution for K.
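For example, given \lambda and N, a Monte Carlo distribution of K is easy to get (my own sketch; none of this is from BDA3):

```python
import numpy as np

rng = np.random.default_rng(2)
H, N, S = 10, 100, 2000

lam = rng.dirichlet(np.full(H, 1.0 / H))

# Monte Carlo: K = number of components occupied by N Categorical(lambda) draws
K = np.array([np.unique(rng.choice(H, size=N, p=lam)).size for _ in range(S)])
p_K = np.bincount(K, minlength=H + 1)[1:] / S   # estimates of P(K = 1), ..., P(K = H)
```

In the full posterior version, `lam` would be a fresh posterior draw per replicate rather than a single fixed value.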

Two problems are identifiability and priors. Betancourt’s excellent piece on identifiability was very useful. He says that one can use ordering constraints to achieve identifiability (which is partially contradicted by BDA3, which says label-switching issues can persist nonetheless). Let’s say for argument’s sake this is solvable.

The other problem is priors: in your model above, the prior on the simplex p is all-important and decides how the H components are weighted. BDA3 recommends Dirichlet(\frac{1}{H},...,\frac{1}{H}) to force concentration (allocation to fewer clusters). However, like any prior, as the amount of data increases the data overwhelm the prior, and we end up with allocations all along H, because more components will always achieve a higher log posterior probability. Meanwhile, if the amount of data is small, then “where does your prior come from?” is an entirely valid question, because the prior becomes instrumental in deciding how many clusters there will be, and a principled choice is important.
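The concentration effect of the Dirichlet(\frac{1}{H}) prior is easy to see numerically (a quick NumPy check; the 5% cutoff for “non-negligible weight” is my arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(3)
H, S = 10, 5000

sparse = rng.dirichlet(np.full(H, 1.0 / H), size=S)  # Dirichlet(1/H, ..., 1/H)
flat = rng.dirichlet(np.ones(H), size=S)             # uniform on the simplex

# average number of components with non-negligible weight (> 5%, arbitrary cutoff)
active_sparse = (sparse > 0.05).sum(axis=1).mean()
active_flat = (flat > 0.05).sum(axis=1).mean()
```

Under the sparse prior most draws put nearly all their mass on a handful of components, whereas the flat prior spreads weight much more evenly.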

I’ve not yet encountered a principled way to approach “mixture models with unknown K” generically. It seems I’ll just have to work harder and add a lot more structure.


Yes, the more I think about it, the harder it seems to do well automatically. In the end you need to justify the tradeoff your model makes between regularization and precision, which likely involves the prior and the data in a complicated way. I’ve noticed some people use Dirichlet process priors for the number of clusters, which are claimed to have some neat regularization properties, so this might be sensible (disclaimer: I’ve never used them, just read about them).

Unless you really need fully automatic identification of K, I’ve had good experience with fitting multiple different mixture models and using posterior predictive checks to choose the best mixture manually.

The problems I know of:

• ordering is not enough when more than one parameter identifies each mixture component.
• when the data are fit very well by fewer clusters than the model has, the posterior is free to add components with almost-zero p_k and arbitrary parameters, so those may appear anywhere in the ordering, causing multimodality.

Hope that helps.


Hey, thank you again! I really appreciate your thoughts. BDA3 says the Dirichlet process is approximated by the basic finite mixture model with a Dirichlet(\frac{1}{H},...,\frac{1}{H}) prior when H is large.

What’s tricky in my problem is that K is pretty much all I care about. I think I’ll have to do what you’ve advised in the end: think carefully either about very problem-specific test quantities or about posterior predictive checks that don’t automatically improve as K increases. Then I can just fit the model for multiple K and compare them on those test quantities.

Hi!

I did not find a real solution to that problem (I almost abandoned mixture models, actually!). However, I found a promising avenue (I think) when I made the parameters of the Dirichlet distribution hierarchical. Something like this:

\lambda \sim Dirichlet(\alpha_1,\dots,\alpha_K)
\alpha_k \sim Student\text{-}t(\nu,\mu,\sigma)

In some situations I managed to push some weights to (approximately) zero that way, but it did not work in every case.

Given that the Dirichlet distribution excludes 0 and 1, with steep gradients close to those extremes, one may want to use another distribution (a logit-multivariate normal, perhaps?).
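The logit-multivariate-normal idea amounts to taking a softmax of latent Gaussians, which yields simplex weights that can get arbitrarily close to zero. A quick sketch, assuming independent latent normals (the scale of 3.0 is my arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(4)
H, S, sigma = 10, 5000, 3.0

eta = rng.normal(0.0, sigma, size=(S, H))        # latent independent normals
w = np.exp(eta - eta.max(axis=1, keepdims=True)) # softmax, numerically stable
w /= w.sum(axis=1, keepdims=True)                # each row is a point on the simplex

tiny = (w < 1e-3).mean()   # fraction of weights pushed essentially to zero
```

Larger `sigma` pushes more weights toward zero, so the latent scale plays the role that the Dirichlet concentration parameter plays above.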

Have a good day!
Lucas

I believe you might get quite far with just ppc_dens_grouped for various groupings, and while this would always improve with K, a sensible choice is the K after which the improvement becomes negligible…

Good luck with the project!