Other constraints than mean ordering to identify mixture model?


I am interested in fitting a multivariate mixture model, defined as follows:

p(\boldsymbol{y}_{ip} \mid \boldsymbol{\lambda}, \boldsymbol{\mu}, \Sigma) = \sum_{k=1}^K \lambda_{kp}\,\mathrm{MN}(\boldsymbol{\mu}_k, \Sigma) \\ \boldsymbol{\lambda}_1 = \boldsymbol{0} \\ \lambda_{kp} = \alpha_k + \beta_k x_p

where i indexes individuals and p experimental units. For simplicity, the variance-covariance matrix is common to every observation.
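For concreteness, here is one way to read this likelihood as a numpy sketch. Note this is a hypothetical illustration, not the actual fitting code: I am assuming the linear predictors \alpha_k + \beta_k x_p are mapped to a simplex via softmax, with the first component as the zero reference (matching the \lambda_1 = 0 constraint); the function and variable names are mine.

```python
import numpy as np

def mvn_logpdf(y, mu, Sigma):
    """Log-density of a multivariate normal MN(mu, Sigma)."""
    d = len(mu)
    diff = y - mu
    _, logdet = np.linalg.slogdet(Sigma)
    return -0.5 * (d * np.log(2 * np.pi) + logdet
                   + diff @ np.linalg.solve(Sigma, diff))

def softmax(a):
    e = np.exp(a - np.max(a))
    return e / e.sum()

def mixture_logpdf(y, x_p, alpha, beta, mu, Sigma):
    """Mixture log-density for one observation y from unit p.

    alpha, beta: length-K arrays with alpha[0] = beta[0] = 0 (the
    reference component, i.e. the lambda_1 = 0 constraint above).
    mu: (K, D) component means; Sigma: (D, D) shared covariance.
    """
    lam = softmax(alpha + beta * x_p)  # unit-level component weights
    comp = np.array([np.log(lam[k]) + mvn_logpdf(y, mu[k], Sigma)
                     for k in range(len(lam))])
    # log sum_k lambda_kp * MN(y | mu_k, Sigma), computed stably
    m = comp.max()
    return m + np.log(np.sum(np.exp(comp - m)))
```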

The obvious way to identify the model is to put an ordering constraint on \boldsymbol{\mu}, such as \mu_{j1} < \mu_{j2} < \dots < \mu_{jK} for every variable j. This approach works wonderfully in my case and gives well-explored posteriors.
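For reference, an ordering constraint like this is typically imposed by reparameterization, which is what Stan's `ordered` type does under the hood: the first coordinate is unconstrained and each subsequent one adds a positive increment. A minimal numpy sketch of that idea (the function name is mine):

```python
import numpy as np

def ordered_transform(raw):
    """Map an unconstrained vector to a strictly increasing one,
    mirroring Stan's `ordered` type:
    mu_1 = raw_1, and mu_k = mu_{k-1} + exp(raw_k) for k > 1."""
    mu = np.empty_like(raw, dtype=float)
    mu[0] = raw[0]
    mu[1:] = raw[0] + np.cumsum(np.exp(raw[1:]))
    return mu
```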

However, some of the variables j are negatively correlated. In that case, it makes no sense to define a component with the greatest mean in every variable, because components with greater means in some variables are expected to have lower means in others.

Is there another way to identify this model without the mean-ordering constraint? I tried ordering the intercepts of the component weights, but it results in degenerate posteriors with exchangeable means, even with only two components.

Or is there any other idea I am missing?

Thank you very much!

EDIT: for information, my data generally look like this:

Ok, after having read the posts by @betanalpha, it seems that my quest is kind of vain…

“I have decided that mixtures, like tequila, are inherently evil and should be avoided at all costs.”
Larry Wasserman

Try the repulsive prior; it may help. I think just setting rho to 0 works best, but you can try it as a parameter like I have here. I made it so rho gets larger for smaller distances, but that hinged on the ordering of mu.

functions {
  real repulsive_lpdf(vector mu, vector rho) {
    int K = num_elements(mu);
    matrix[K, K] S = diag_matrix(rep_vector(1, K));
    matrix[K, K] L;
    int c;

    for (k1 in 1:(K - 1)) {
      for (k2 in (k1 + 1):K) {
        c = K - (k2 - k1);
        // squared distance between the two (scalar) component means
        S[k1, k2] = exp(-square(mu[k1] - mu[k2]) / (0.5 + rho[c]));
        S[k2, k1] = S[k1, k2];
      }
    }
    L = cholesky_decompose(S);
    // log determinant of S via its Cholesky factor
    return 2 * sum(log(diagonal(L)));
  }
}
data {
  int<lower=1> K;
  int<lower=1> N;
  real y[N];
}
parameters {
  ordered[K] mu;
  positive_ordered[K - 1] rho;
  real<lower=0, upper=1> sigma[K];
  simplex[K] lambda;
}
model {
  // Prior model
  mu ~ normal(0, 5);
  sigma ~ std_normal();
  lambda ~ dirichlet(rep_vector(3, K));
  rho ~ gamma(0.5, 1.0);
  mu ~ repulsive(rho);

  // Observational model
  for (n in 1:N) {
    real comp_lpdf[K];
    for (k in 1:K) {
      comp_lpdf[k] = log(lambda[k]) + normal_lpdf(y[n] | mu[k], sigma[k]);
    }
    target += log_sum_exp(comp_lpdf);
  }
}
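To see what this prior is doing numerically, here is a hypothetical numpy translation of `repulsive_lpdf` with rho fixed to a scalar (as suggested above, rho = 0 is a reasonable default): the log-determinant of the squared-exponential kernel matrix falls toward negative infinity as any two means approach each other, so well-separated means receive higher prior density.

```python
import numpy as np

def repulsive_lpdf(mu, rho=0.0):
    """Numpy sketch of the Stan function above, with scalar rho.
    Returns log det S, where S[i, j] = exp(-(mu_i - mu_j)^2 / (0.5 + rho))
    and S has unit diagonal."""
    mu = np.asarray(mu, dtype=float)
    sq = (mu[:, None] - mu[None, :]) ** 2   # pairwise squared distances
    S = np.exp(-sq / (0.5 + rho))
    L = np.linalg.cholesky(S)
    return 2.0 * np.sum(np.log(np.diag(L)))  # log det via Cholesky
```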

Thank you very much! Repulsive priors seem great. However, isn't it still necessary to use an ordering constraint on the means? mu is ordered in your code.

Ordering will help with label switching, and the repulsive prior will help keep modes from collapsing on top of each other. In other words, you don't need ordering for the repulsive prior to work; it's just not going to help much, or at all, with label switching.

To be clear, exchangeable mixture models, where all of the components are equivalent and hence there's a fundamental ambiguity in which component will model which part of the data generating process, are problematic. Non-exchangeable mixture models, where the form of each component or the prior assigned to each component breaks the ambiguity, can be quite powerful in practice when that breaking is based on domain expertise about the structure of the data generating process. See for example zero-inflated models and their ilk.
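As a concrete instance of the non-exchangeable mixtures mentioned above, here is a minimal zero-inflated Poisson log-PMF in Python (a generic textbook sketch, not code from this thread): the two components are structurally different, a point mass at zero and a Poisson, so there is no label-switching ambiguity to break.

```python
import math

def zip_logpmf(y, theta, lam):
    """Zero-inflated Poisson: with probability theta the count is a
    structural zero; otherwise it is Poisson(lam).

    p(0) = theta + (1 - theta) * exp(-lam)
    p(y) = (1 - theta) * Poisson(y | lam)   for y > 0
    """
    if y == 0:
        return math.log(theta + (1 - theta) * math.exp(-lam))
    pois = -lam + y * math.log(lam) - math.lgamma(y + 1)
    return math.log(1 - theta) + pois
```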

To be fair, ordering technically breaks the formal exchangeability of a mixture model. The problem is that for any finite data set there will still be many model configurations that explain the observed data well enough that the posterior will be highly degenerate. By far the most successful approach is to stop trying to cluster and instead start understanding the various overlapping contributions to your data generating process, then model each of them one at a time in preparation for a non-exchangeable mixture model.


That is the most pertinent reflection I have read on the subject! And it forces one to think deeply about the data, which one should do anyway to produce interesting and valid scientific inference!

In my case, I do suspect there should be a “generative” logic behind how my data points can be clustered, and I am not sure an exchangeable mixture model would produce anything that interpretable.

Edit: I think my attraction toward exchangeable mixture models arose because I suspect (theoretically and empirically) that there are clusters, but at sampling time we sampled for a different question and do not have the data to cluster observations genetically. So at first, exchangeable mixture models seemed to be a way to compensate, but I realized, one step at a time, that this idea was an illusion. It is hard to answer questions with data harvested to answer another one!