Identifiability of Gaussian mixture model

Dear Stan users:

I am currently using RStan to fit a two-component Gaussian mixture model over T time periods.

Observation y_it denotes the i-th observation in period t and is modelled as

y_it = lambda_t * N(mu1, sigma1) + (1 - lambda_t) * N(mu2, sigma2)

hence the mixture components' locations and scales are the same in all time periods, while only the mixing weights are time-dependent.

Further, I impose the constraint that the overall mixture mean satisfies lambda_t * mu1 + (1 - lambda_t) * mu2 = z_t, where z_t is a smooth function of time.

In order to satisfy the above equation, I let mu = inverse(lambda' * lambda) * (lambda' * z)

And then assign priors on lambda and z.

I formulate the prior on lambda via its natural parameterisation, eta ~ MVN(mu_eta, Sigma_eta), and back-transform with lambda = softmax(eta), where mu_eta and Sigma_eta have hyperpriors chosen in such a way that lambda is uniform on (0, 1).
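As a minimal sketch of this construction (purely for illustration, mu_eta and Sigma_eta are passed as fixed data here, whereas in my actual program they have hyperpriors; z is the smooth overall mean):

data {
  int<lower=1> tt;                  // number of time periods
  vector[tt] z;                     // smooth overall mixture mean per period
  vector[2] mu_eta;                 // illustrative: fixed here, hyperpriors in the real program
  cov_matrix[2] Sigma_eta;
}
parameters {
  vector[2] eta[tt];                // natural parameterisation of the mixing weights
}
transformed parameters {
  matrix[tt, 2] lambda;             // mixing weights, one 2-simplex per period
  vector[2] mu;                     // component means implied by the mean constraint
  for (t in 1:tt)
    lambda[t] = to_row_vector(softmax(eta[t]));
  // least-squares solution of lambda * mu = z across all periods
  mu = inverse(lambda' * lambda) * (lambda' * z);
}
model {
  for (t in 1:tt)
    eta[t] ~ multi_normal(mu_eta, Sigma_eta);
}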

To ensure identifiability, I would like to declare mu as ordered[2] mu; however, whenever I define this, it generates lots of errors saying that mu is not a valid ordered vector, which leads me to think that perhaps I have mis-assigned some prior constraints.

Since lambda_t is a cyclic function in my real data, I do not want to constrain lambda_t to be ordered.

Instead, I want to put an indicator function on the prior on eta such that it only draws values from my specified MVN if mu[2] > mu[1].

Reading through the manual, I came across the step function and used it in my program, but I am not sure whether this is a correct or appropriate way to impose such a condition.

Any suggestions are very welcome.

I have attached my program (this is only for the prior distributions; there is no data involved at the moment):

visualise prior distribution.stan (1.9 KB)

In the above program, I basically model the hyperparameters and then, in the generated quantities block, draw eta, lambda, and mu from the specified distributions, so that I can see the prior distributions of these parameters. I am not sure whether this is the right way to visualise my model with only prior distributions and no actual data involved. Thanks in advance for your feedback!

I think what you are doing cannot work because of

mu = inverse(lambda' * lambda) * (lambda' * z)

mu is time-invariant while lambda and z are time-varying. It's probably going to be easier to write lambda as a function of mu and z.

lambda_t = (z_t - mu2) / (mu1 - mu2) = (mu2 - z_t) / (mu2 - mu1)

To keep lambda_t between 0 and 1, and to constrain mu2 > mu1, you need

mu_1 < z_t < mu_2

You'd better double-check my math on this.

Hi stijn:

Thanks for your reply.

My understanding is that lambda_t is then no longer stochastic but deterministic, given by the equation above.

Therefore, there is no need for me to implement the natural parameterisation of lambda_t as eta_t and assign an MVN prior to eta_t; instead, I should assign priors to z_t and mu = (mu1, mu2) with appropriate constraints.

So I will declare lambda_t as a simplex in the transformed parameters block, with lambda_t determined by the equation, and then
assign priors to z_t and mu.

Thanks, I will give it a try now.

Does lambda_t need to be a simplex? My understanding was that lambda_t is a vector of probabilities, but not that the sum of the probabilities needs to be 1. If lambda_t needs to sum to 1, you will need stronger restrictions on z_t as well.

lambda_t contains the mixing weights for time period t, so lambda_t1 + lambda_t2 = 1 since there are 2 mixture components per time period.

In my current model, I declare

vector[2] lambda_raw[tt];
matrix[tt, 2] lambda;

for (t in 1:tt) {
  // log of the weight implied by the mean constraint, and of its complement
  lambda_raw[t, 1] = log((mu[2] - y_mean[t]) / (mu[2] - mu[1]));
  lambda_raw[t, 2] = log(1 - (mu[2] - y_mean[t]) / (mu[2] - mu[1]));
  // softmax of the two logs recovers (w, 1 - w) as a valid simplex
  lambda[t] = to_row_vector(softmax(lambda_raw[t]));
}
In the above, y_mean[t] is the z_t from before and is separately modelled as
y_mean = B_T * theta_overall, where B_T is a data matrix (a cyclic B-spline basis) and theta_overall is the vector of coefficients to be estimated, whose prior is just an MVN.

In this way, I obtain lambda_t as a valid simplex.
Do you think this is a sensible way to construct lambda_t?
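In case it is helpful, here is a stripped-down sketch of the y_mean construction (the normal prior below is only an illustrative stand-in for my actual MVN smoothing prior on the coefficients):

data {
  int<lower=1> tt;                           // number of time periods
  matrix[tt, 6] B_T;                         // cyclic B-spline basis, supplied as data
}
parameters {
  vector[6] theta_overall;                   // B-spline coefficients
}
transformed parameters {
  vector[tt] y_mean = B_T * theta_overall;   // z_t, the overall mixture mean per period
}
model {
  theta_overall ~ normal(0, 10);             // stand-in for the MVN smoothing prior
}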

Ah ok, I see. I think it is enough to work with

parameters{
  ordered[2] mu;                                // guarantees mu[1] < mu[2]
  vector<lower = mu[1], upper = mu[2]>[tt] z;   // bounds go before the dimension
}
transformed parameters{
  vector[tt] lambda = (mu[2] - z) / (mu[2] - mu[1]);
}

And to use (1 - lambda[t]) whenever you need it, but if you use log_mix for your mixture specification, you won’t even need the (1 - lambda[t]).
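A rough sketch of what I mean (assuming, for illustration, a rectangular matrix y of observations with n per period, and a positive_ordered[2] sigma; adapt the indexing to your actual data layout):

model {
  for (t in 1:tt)
    for (i in 1:n)
      target += log_mix(lambda[t],
                        normal_lpdf(y[t, i] | mu[1], sigma[1]),
                        normal_lpdf(y[t, i] | mu[2], sigma[2]));
  // log_mix applies the (1 - lambda[t]) weight to the second component internally
}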

Thanks, I will give it a try as well.
Thanks for your suggestions.

Hi stijn:

Given the current construction, my model has many divergent transitions, and while running it I get the error message:
Error evaluating the log probability at the initial value.
validate transformed params: y_mean[1] is -1.02569, but must be greater than or equal to 0.468656
Rejecting initial value:

Here, y_mean[t] is the z_t described above, which I constrain to lie between mu[1] and mu[2],
and y_mean = B_T (a data matrix of dimension 24 x 6) * theta_overall (a 6 x 1 vector of coefficients to be estimated),
where theta_overall has quite an uninformative MVN prior meant to encourage consecutive theta_overall values to be similar to each other.
As you suspected, I probably need a stronger prior on theta_overall to ensure that mu[1] < y_mean[t] < mu[2] for all time periods.

However, in this case, I cannot think of any sensible stronger prior to place on theta_overall.

I want to construct something like: if a draw of theta_overall from the MVN, once multiplied by B_T, does not meet the constraint, then I discard the draw; otherwise I keep it.

Is there anything I can look up to construct such a prior or to do similar conditioning?

Thanks in advance!

Sorry, I should have checked the entire model. Yes, it is more complicated now.

mu[1] < y_mean[t] < mu[2] ==> mu[1] < B_T * theta_overall < mu[2]

So the constraint on y_mean needs to be rewritten in terms of theta_overall; this is more complicated than just multiplying by inverse(B_T). I need to think about it. Maybe I directed you in the wrong direction.

Just as a summary for any other readers, the difficulty with this model is that you have a two-component mixture model where:

  1. You want to identify the mixture model by fixing the order of the means of the two components.
  2. You want to impose a further constraint on the means, namely mu[1] * lambda_t + mu[2] * (1 - lambda_t) = B_T * theta_overall.

Would it be possible to identify the model by fixing the order of the sigmas of the two components with positive_ordered[2] sigma? You can then impose lambda_t = (mu[2] - B_T * theta_overall) / (mu[2] - mu[1]) for the second constraint above.
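In code, roughly (with tt and B_T declared in the data block as before; note that lambda is only a valid weight when mu[1] < B_T * theta_overall < mu[2] in every period, which is still the unresolved part):

parameters {
  vector[2] mu;                              // component means, no ordering imposed
  positive_ordered[2] sigma;                 // identify the components by ordering the scales
  vector[6] theta_overall;                   // B-spline coefficients
}
transformed parameters {
  vector[tt] y_mean = B_T * theta_overall;
  // in (0, 1) only if mu[1] < y_mean[t] < mu[2] for every t
  vector[tt] lambda = (mu[2] - y_mean) / (mu[2] - mu[1]);
}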

thanks for your efforts.

Previously, fitting the model with my initial construction, writing mu = inverse(lambda' * lambda) * (lambda' * y_mean), I only had the label-switching identifiability issue with mu[1] and mu[2]; otherwise they were estimated accurately. However, when I implement the above construction, all the constraints appear to be satisfied, but I get lots of divergent transitions and the final estimates are far away from my simulated 'true' values.

I think there is something fundamentally wrong with my model and I need to think hard about it.

Any ideas and suggestions are greatly appreciated!

Just a word of warning: even after you remove the label-switching non-identifiability, exchangeable mixture models like this are still incredibly poorly identified when you're fitting the component variances/covariances along with the means. All kinds of weird things can happen, with components collapsing onto each other or single components expanding and trying to explain multiple true clusters, as these are all as consistent with the data as the ground truth. Stan just makes the poor identifiability easier to see!

Hi Michael :

Thank you for your reply. In my real model, with the data likelihood written in the model block, I do fit the component variances; however, so far they are fitted independently of the component means and given quite a strong normal(0, 1) prior with a positivity constraint, and that's it.

I am wondering how I can fit a structure that would impose a relationship between the component means and variances simultaneously?

Thanks again for your answer.

Hi Michael:

I also have some general questions regarding the behaviour of my chains.
I ran 4 chains with 500 iterations each, which resulted in 38 divergent transitions and 213 iterations exceeding the maximum tree depth.
However, when I examine the mixture locations in each chain, I find that:
in chain 1, mu[1] and mu[2] are estimated accurately (in all 500 iterations, mu[1] = 5 and mu[2] = 10, which are the 'true' values I set);
in chain 2, they swap positions, so now mu[1] = 10 and mu[2] = 5 for all 500 iterations;
however, in chain 3, mu[1] is badly wrong; the following are the estimated values for mu[1], while mu[2] is estimated to be around 8:

[1] -470.1721 -495.7726 -517.4052 -502.3977 -500.0183 -520.8420 -506.6726 -478.9285 -470.4257 -476.0645 -472.9713 -478.8896 -482.6464 -484.7046 -491.3626

In chain 4, once again, the positions and values of the mu's are correctly estimated.

What I do not understand is why one of the chains behaves so erratically compared to the others.
If I can fix this behaviour, I think my model is at least reasonably well constructed.

Thanks in advance.

I’m afraid I don’t quite understand the question. What relationship between component means and variances are you trying to impose? What do you mean that they’re being fit “independently” right now? Are you fixing the means and then fitting the variances or vice versa?

In general we haven’t found any way of making exchangeable mixture models sufficiently well posed that we can accurately fit them and so I am dubious of their use in a full Bayesian analysis. They can, however, be useful for finding a useful configuration for, say, exploratory data analysis, similar to Latent Dirichlet Allocation.

This is the beauty of running multiple chains! Even one chain with erratic behavior indicates that the other chains are not exploring the posterior completely!

It’s hard to look at the means in isolation – I bet something weird is also happening with the variances in the third chain.

Basically the posterior for models like this has modes around each of the label-switching solutions, connected by valleys where you get mode collapse, large variances, and other issues. The model isn't exactly multimodal, but the valleys can be sufficiently hard to explore that the chains get stuck in one of the approximate modes for long periods of time and miss the valleys entirely, as you observe. In other words, it's not that you're just one small problem away from a good model; rather, the small problem is the tip of the proverbial iceberg, indicating deeper problems.

Hi Michael:

Sorry for not being clear.

Currently, the mean of the mixture model is solved by
mu = inverse(lambda' * lambda) * (lambda' * y_mean);
so I think my mu's are determined deterministically by this equation, and thus I assign priors to the parameters lambda and y_mean.

lambda is a (24, 2) matrix of mixing weights, one row per period, so I use the natural parameterisation eta, assign an MVN prior to eta, and use softmax to back-transform to lambda.

Meanwhile, y_mean = B_T (the cyclic B-spline basis matrix, of dimension 24 x 6, supplied as data) * beta (the B-spline coefficients, of dimension 6 x 1), so to estimate the beta's I assign an MVN prior to them.

For the mixture component scale parameters sigma, right now I just constrain them with positive_ordered[2] sigma and a relatively strong normal(0, 2) prior.

That is the current construction.

I have attached my Stan code for better clarification:
mixture model.stan (2.1 KB)

I have also attached graphs of the chains for the mixture locations and scales.
In the first three chains, both parameters are estimated correctly: mu = (5, 10), sigma = (0.25, 0.5).
In the last chain, mu switches position, while sigma is estimated to be roughly the same and still obeys the sigma2 > sigma1 constraint imposed.
Rplot01.pdf (57.7 KB)
Rplot.pdf (59.1 KB)

Ah, I see, thanks.

Having structure for mu as you do here makes the problem more subtle. If the spline bases are not exchangeable, then y_mean should break the exchangeability of the mixture components that causes so much trouble, but it might not do so well enough to completely identify the model, especially if you're also fitting the variances/standard deviations at the same time.

So let’s step back. Why are you fitting a mixture model with component means defined by splines? What kind of data are you trying to model?

The real data I am going to analyse are ultra-fine particle size distributions, which are believed to be multimodal, so people commonly fit a finite mixture model at a single time period. However, people are also interested in the dynamic behaviour of the particle size distributions over time, so I impose a smooth cyclic B-spline function to model the changes in the particle size distribution from one time period to the next. Hope that makes sense.

What is known about the multimodal structure of the particle distribution and how it might vary in time?

From what I have read, an important feature of these particle size distributions (PSDs) is that, because aerosol particles are governed by formation and transformation processes, they tend to form well-distinguished modal features, so the most common practice is to treat the size distribution at any time point as a set of individual, typically normal, distributions and thus to fit a finite mixture of normals.
Since the PSD is usually collected at small time intervals, it is believed that the parameters of the mixture model at each time point are likely to be correlated with those at neighbouring time points.

So in my current model set-up, I assume there is a single mixture of 2 components underlying all time periods, while the shape of the PSD in each time period varies through the varying mixing weights.