Multivariate Gaussian mixture model help

Hello, I’m trying to fit a two component multivariate normal mixture model to some multivariate data which is well separated into two multivariate normal looking distributions but I keep getting results that collapse the posterior distribution into one component only. I was wondering whether you can tell this is because there’s an error in my Stan code, or whether this is because it’s inherently very difficult to fit Bayesian multivariate mixture models.

Am I right in thinking the posterior likelihood doesn’t take into account the extremely low probability of your data not having any observations from the proposed posterior mean or mode - so if your data if split between either side of this mean or mode, it builds evidence for the posterior mean or mode to be in the middle (rather than giving me the two separate components I am hoping to uncover)?

data {
int<lower=1> K; // number of mixture components
int<lower=1> J; // number of individual profiles
int<lower=1> D; // dimensions of parameter vector
vector[D] beta_j[J]; // parameter vectors for each individual profile
vector[D] mu0_1; //prior on mu vector for component 1
vector[D] mu0_2; //prior on mu vector for component 2
matrix[D,D] sigma_mu; //prior on correlation matrix for all components

parameters {
simplex[K] theta; // mixture component proportions
ordered[K] beta_0; // ordering the first element of the parameter vector, between the components
vector[D-1] beta_1[K]; // remaining elements of the parameter vector
cholesky_factor_corr[D] L[K]; // mixture component Cholesky factor correlation matrices

transformed parameters {
vector[D] mu[K];
vector[K] lps[J]; 

for (k in 1:K) {
mu[k]=append_row(beta_0[k], beta_1[k]); // parameter vector

for (j in 1:J) {
vector[K] log_theta = log(theta);  // cache log calculation
for (k in 1:K) {
lps[j, k] = log_theta[k] + multi_normal_cholesky_lpdf(beta_j[j] | mu[k], L[k]);

model {

for (k in 1:K) {
L[k] ~ lkj_corr_cholesky(4);

theta ~ beta(100,100);

mu[1] ~ multi_normal(mu0_1, sigma_mu);
mu[2] ~ multi_normal(mu0_2, sigma_mu);

for (j in 1:J) {
target += log_sum_exp(lps[j,]);

generated quantities {
simplex[K] posterior_assignment[J];
for (j in 1:J) { 

A good way to to test for your question is to write some code to generate some data from two well separated multivariate Guassians, mix them to a known proportion and then see if your code does a good job of recovering the known parameters. You can then up the number of well separated mixture components to see how it behaves for your use case as they increase. This should give you some confidence that your code works. Finally, you can bring the components closer together to see what happens when they are not so well separated and I think this will answer your question RE collapse to one component.

IMHO, whether a mixture model will work or not will depend on how non-exhangeable you can make the priors in your mixtures and tricks like ordering the first element won’t do much to help when you have more than a couple components. I’ve some (but limited) experience in trying to fit mixtures, and it’s just my IMHO so take it with a pinch of salt.


^Thanks for the tips! If the data is well separated into 2 components to start with, does it still make sense if the MCMC sampling often collapses the distribution into a single component? I was wondering what would be the statistics/maths reasoning behind this happening.

My understand is “basic” compared to the Bayesian demi-gods that live here and they are a much better source for this but here’s a stab:

I think something of this form \lambda \pi(x\mid \theta) + (1-\lambda)\pi(x\mid \theta) that is well separated (i.e. means are at least 1 or 2 standard deviations away from each other) is multi-modal. Say you draw a whole bunch of data from that distribution. If it’s multi-modal there’s no way you can better approximate it with one mode than you can with two given that the model you’re fitting is of the same form.

The promise of MCMC is ultimately that it guarantees convergence to the true posterior given infinite samples. The true posterior is multi-modal, you’re fitting a model of the same form, so it can’t be the case that you converge onto a unimodal distribution.

@nhuurre you’re pretty awesome at this sort of question. Maybe you can correct me?


I tried fitting the above model to a simulated 2D dataset and got
Grey dots are the simulated data. Colored dots are the estimates of the cluster means. The lines are the first twenty steps in the Markov chain, when the sampler is warming up.
So looks like everything is fine…oh wait sometimes it does this
The random starting position places mu[1,2] higher than mu[2,2] so the sampler tries to move the blue closer to the top data and orange closer to the bottom data. This improves the fit just enough to convince the sampler that it’s definitely the right thing to do.
And then it hits the constraint mu[1,1] < mu[2,1] (blue must be to the left of orange).
Contrary to common advice, an ordering constraint does not prevent label-switching. It just makes it weird.
You could try and set inits that make the sampler start near a reasonable cluster assignment.

I don’t see a reason to fit a mixture model when the clusters are so far apart that you could assign the points by eye. There are better algorithms than HMC for clustering.


I wonder if in this case order constraining the variance instead of the mean could prevent this? It’d eliminate the effect @nhuurre points out but it’s make the distributions non-exchangeable?

1 Like

I wonder if declaring theta as a simplex is doing something too - my Stan program really wants there to be only one component when there are really two - is that plausible? If theta[1] stays around 0.25, it forces theta[2] to stay around 0.75(?)

The model works fine in JAGS, I was wondering if that has something to do with including a discrete labelling variable vs. marginalising it out.

That makes sense, thank you! How did you make those plots?

import pystan
import numpy as np
import matplotlib.pyplot as pp


def categorical(J, theta):
    t = theta.cumsum()
    x = np.empty(J,dtype=int)
    for i in range(J):
        s = 0
        u = np.random.uniform()
        while t[s] < u: s += 1
        x[i] = s
    return x
def gen(J,mu):
    mu = np.asarray(mu)
    K = mu.shape[0]
    D = mu.shape[1]
    theta = np.random.dirichlet([20]*K)
    a = categorical(J,theta)
    beta_j = np.random.normal(mu[a],1)
    return dict(J=J,K=K,D=D,mu=mu,theta=theta, a=a, beta_j=beta_j,
                mu0_1 = [0.0] * D,
                mu0_2 = [0.0] * D,
                sigma_mu = 20*np.identity(D))

dat = gen(100,[[-3.0,-5.0],[3.0,5.0]])

def plot(fit,num=25):
    m = fit.extract()['mu']
    w = fit.extract(['mu'],permuted=False,inc_warmup=True)['mu']
    pp.scatter(dat['beta_j'][:,0],dat['beta_j'][:,1],c='grey', label='data')
    pp.scatter(m[:,0,0],m[:,0,1], c='C0', label="mu[1]")
    pp.scatter(m[:,1,0],m[:,1,1], c='C1', label="mu[2]")

plot(fit, 20)

Does the JAGS model have the constraint mu[1,1] < mu[2,1]? How many chains does JAGS run? You can’t see label-switching if there’s only one MCMC chain.

^No constraint in JAGS, and just one chain. Is it valid to run one chain at a time, if you don’t want the chains to label the components differently and mess up the posterior output?

If I run one or multiple chains in Stan (with or without the ordering constraint removed) it really wants there to be just one component (with posterior means = approx the average of the true means of the two true components), while JAGS recovers the true parameters.

Effective sample size estimate is more reliable if there are multiple chains. If the priors mu0_1 and mu0_2 are not the same then the components are not exchangeable and you can’t just ignore the relabeling issue. But if they are identical then single-chain results are still correct.

Strange. How large is the posterior standard deviation?

1 Like

^The posterior standard deviation is quite high for the empty or low probability component and low for the other component.

I looked again at the first picture. It looks like the distance between the clusters ~1.5 for every beta_j and the width of the clusters is much smaller.
In the model L is a correlation matrix Cholesky factor so the marginal standard deviation is 1. So the clusters are much smaller than the model expects and close enough to fit inside a single component. Can you add varying cluster size?

parameters {
  real<lower=0> tau;
model {
  tau ~ std_normal();
  lps[j, k] = log_theta[k]
             + multi_normal_cholesky_lpdf(beta_j[j] | mu[k], tau*L[k]);
1 Like

^That seems to make things a lot better - at least the correct component proportions are recovered about half the time now (the other half of the chains still want there to be one component).

And running the code on simulated data that is actually multivariate normal distributed works, so that’s good. The real data is not well separated so I have to see how things go with less well separated data.

@dandan would you mind talking about your specific use case a bit? I’m curious. Also, how did you get on with JAGS for this model on the real data out of interest? Why are you porting it to Stan?

1 Like

I want to cluster some time series data by first fitting a piecewise linear model to the data to get the rough shape of each series, and then cluster on the slope parameter vectors extracted from the piecewise linear regressions. So modelling the slopes as multivariate Gaussian mixture distributed. At first my simulation study involved using some data generated around sin and cos curves (two different component groups).

What does ‘porting it to Stan’ mean? My supervisor suggested using Stan to start with, and I got used to it, and didn’t try JAGS until something went wrong (the problem I posted above).

Thanks for your all your help and guidance, btw - it’s very kind.

1 Like

Sorry for jumping in without reading everything in the thread but it appears that you are trying to fit a mixture of 2D Gaussians? If so, I fear it is an open research question on how to do this well - at least I remeber someone discussing it as such here, but don’t have a link right now (particularly how to avoid label-switching), As some of the examples above show - you can’t consistently enforce an ordered constraint on points in 2D space.

There might be an easy way out if you happen to have some domain-specific knowledge that would let you avoid the problem completely (e.g. say that the mixtures are linearly separable by a known line and then make the model constrain mu[1] to be the cluster “below” the line and mu[2] the one “above”).

The hard way is getting in the weed of the literature - if I remember correctly, some people ignore the multimodality and then postprocess the samples by flipping the components in some chains. There are probably also more clever ways to do this, but I am not knowledgeable enough about them to give you any specific advice.

That’s a weird statement :-D I really don’t think we should build personality cults around here ;-) While I agree there is a lot of great people around (including you), I think we are all very frequently reminded how constrained our knowledge is and how hard some models are to get right.