Identification of mixture of multivariate normal distributions

fitting-issues

#1

Hello,

I am trying to implement a latent profile analysis, i.e. a mixture of multivariate normal distributions.
The first step I took to promote identification was to constrain the means of the first variable across the mixture components to be an ordered vector.

…
parameters {
  ordered[K] mu1;
  vector[D-1] mu_r[K];
 …
}
transformed parameters {
  vector[D] mu[K];
  for (k in 1:K) {
    mu[k][1] = mu1[k];
    for (d in 2:D) mu[k][d] = mu_r[k][d-1];
  }
…
}

(full model below)

As expected, this is not sufficient to fully identify the model, so I tried to deal with the label-switching problem using Stephens' algorithm (in the R package label.switching). The results are not great but, from what I can tell, OK. That is, the means of the multivariate normal distributions are no longer multimodal. Still, one can see that the posterior distributions of the means are constructed from posterior samples that suffered from label switching.
Now I am wondering if the following approach would be OK:

  1. Run the Stan model that partially ensures identification by ordering the means of the first variable of the multivariate normal distributions.
  2. Use Stephens' label-switching algorithm to remove most of the label switching and learn the ordering of the means of the other variables.
  3. Run an updated Stan model that implements (some of) the ordering constraints learned in step 2.

Do you think this is a sound approach?
I would also welcome any other idea to identify the model.

Thanks in advance!
Guido

Full model
data {
 int N;           // number of observations
 int D;           // number of dimensions for mvn
 int K;           // number of mixture components
 int sigmas;      // number of variance vectors for mvn (1 or K)
 matrix[N,D] y;   // data
 real v;          // prior for lkj_corr_cholesky
}

transformed data {
  int tau_idx[K];
  int lambda_n;
  for (k in 1:K) tau_idx[k] = K == sigmas ? k : 1;
  lambda_n = K == 1 ? 1 : N;
}

parameters {
  ordered[K] mu1;
  vector[D-1] mu_r[K];
  vector<lower = 0>[D] sigma[sigmas];  
  cholesky_factor_corr[D] L_Omega;
  simplex[K] lambda[lambda_n];
}

transformed parameters {
  vector[D] mu[K];
  cov_matrix[D] Sigma[sigmas];  

  
  for (k in 1:K) {
    mu[k][1] = mu1[k];
    for (d in 2:D) mu[k][d] = mu_r[k][d-1];
  } 
  for (g in 1:sigmas) Sigma[g] = quad_form_diag(multiply_lower_tri_self_transpose(L_Omega), sigma[g]);
}

model {
  L_Omega ~ lkj_corr_cholesky(v);
  
  mu1 ~ normal(0,3);
  for (k in 1:sigmas) {
    sigma[k] ~ normal(0, 1);
  }
  for (k in 1:K) mu_r[k] ~ normal(0,3);
  

  // guard the lambda index so the model also runs when K == 1
  // (lambda then has a single entry, but n runs up to N)
  for (n in 1:N) {
    vector[K] lps;
    for (k in 1:K)
      lps[k] = log(lambda[lambda_n == 1 ? 1 : n][k])
               + multi_normal_lpdf(y[n,] | mu[k], Sigma[tau_idx[k]]);
    target += log_sum_exp(lps);
  }
}
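To illustrate what the model block's marginalization is doing, here is a Python sketch of the per-observation mixture log likelihood (toy values, numpy/scipy assumed; all numbers are made up for illustration):

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

# Hypothetical toy values: K = 2 components, D = 2 dimensions
lam = np.array([0.3, 0.7])                          # mixture weights for one observation
mu = [np.array([-1.0, 0.0]), np.array([2.0, 1.0])]  # component means
Sigma = [np.eye(2), np.eye(2)]                      # component covariances
y_n = np.array([0.5, 0.2])                          # one data row y[n,]

# lps[k] = log(lambda[k]) + multi_normal_lpdf(y[n] | mu[k], Sigma[k])
lps = [np.log(lam[k]) + multivariate_normal.logpdf(y_n, mean=mu[k], cov=Sigma[k])
       for k in range(2)]

# target += log_sum_exp(lps): the log of the marginal mixture density,
# i.e. log( sum_k lambda[k] * N(y[n] | mu[k], Sigma[k]) )
log_lik_n = logsumexp(lps)
```

The log_sum_exp trick keeps the computation numerically stable instead of multiplying tiny densities directly.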

#2

Hi Guido,

Thanks for your post! I have a similar use case, addressing label switching for the ability and discrimination parameters in multidimensional IRT models, and am working through what to do.

If you haven’t seen Jasra, Holmes and Stephens (2005), I’d definitely give it a read, particularly page 60. They note that in the case of a random beta mixture model (defined in the paper), imposing identifiability constraints on the mu parameters (i.e. ordering) produces different estimated mean values for those parameters than using Stephens’ relabeling algorithm. That echoes cautionary points made by Michael Betancourt (@betanalpha) elsewhere on this forum and in his writing on mixture models. Problems appear to arise when the posterior distributions of the K mu parameters overlap to any considerable extent, and whether they do is an empirical question.

You said:

I’m not sure I understand, isn’t that what you want? The reason I’m confused is that Stephens’ algorithm is a relabeling algorithm. Say you have two parameters, alpha and beta, with alpha coming from distribution A and beta coming from distribution B, where relabeling alpha as beta and beta as alpha doesn’t change the joint posterior probability of the two parameters given the data. Now suppose you run two chains and draw just three samples each, so that your parameter estimates look like a list {aaabbb} for alpha and {bbbaaa} for beta (where “a” represents a sample from distribution A, and so on). Running Stephens’ algorithm would return {aaaaaa} for alpha and {bbbbbb} for beta (or vice versa). You’re saying that you’re satisfied that the output of Stephens’ algorithm is two unimodal distributions, but dissatisfied that the input to Stephens’ algorithm was multimodal, right? But that’s fine: multimodality is inherent in the structure of your problem, and I think you just want to assign one consistent set of modes to parameters across chains while ensuring that you’re still exploring the entire parameter space. That’s exactly what relabeling algorithms are designed to do. Am I misunderstanding your question?
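To make the {aaabbb}/{bbbaaa} toy example concrete, here is a minimal Python sketch. The draws are made up, and the sorting rule is a crude stand-in for a relabeling algorithm (Stephens' actual algorithm chooses permutations by minimizing a KL-type loss over the class probabilities, not by sorting):

```python
import numpy as np

# Hypothetical draws: two chains of three samples each for (alpha, beta).
# Chain 1 has alpha ~ A (around 1) and beta ~ B (around 5); in chain 2
# the labels are switched, giving the {aaabbb} / {bbbaaa} pattern.
draws = np.array([
    [1.0, 5.0], [1.1, 5.1], [0.9, 4.9],   # chain 1
    [5.0, 1.0], [5.1, 1.1], [4.9, 0.9],   # chain 2, labels switched
])

# A relabeling algorithm returns one permutation per draw; this toy rule
# just puts the smaller component first in each draw.
perms = np.array([[0, 1] if a < b else [1, 0] for a, b in draws])

# Apply the per-draw permutations to undo the switching
relabeled = np.take_along_axis(draws, perms, axis=1)
```

After relabeling, the first column contains only "a"-type samples and the second only "b"-type samples, so each marginal posterior is unimodal again.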

Regardless, it would help to see screenshots and the R code that transforms draws in a stanfit object into input for the label.switching package, calls the “STEPHENS” method, and then converts the output into a new stanfit object or otherwise examines the posterior distributions of the parameters in question. That would help us understand what’s going wrong for you. Also, selfishly, I’m about to write exactly the post-processing R code that you just wrote, and I bet I’m not the only person out there who needs to post-process stanfit objects with relabeling algorithms.

As a side note, I’m finding that working with simulated data is helping me to debug because I can distinguish label switching from other forms of multimodality (between chains) such as rotation invariance.

Hope that helps, and thank you in advance as well!

Cheers,

Richard


#3

Hi Richard,

Before responding to a few of your points, I’ll point out that I have now “given up” on using Stan for this analysis, because I remain uncertain whether the results I am getting are valid (even though the posteriors look nice …). One observation that makes me very suspicious about the approach I described is that when I follow it and use loo to identify the best model, elpd_loo keeps increasing as I increase the number of classes from 2 to 10 for a data set with N = 150. Because I do not expect so many classes, and because using loo on log likelihoods obtained after a single relabeling indicates a more reasonable number of latent classes, I currently think that the approach I described in my original post is not good.
Still, just using Stephens’ algorithm might work well enough.

I agree that the ordering constraint can be a problem. I used it as a first attempt at identification, before I tried Stephens’ algorithm. I don’t think the ordering constraint is needed when one plans to use a relabeling algorithm.

That’s not what I meant. I meant I was dissatisfied that, while the posteriors were no longer bimodal after using Stephens’ algorithm, they looked OK but not great. (Admittedly, “OK” and “great” are intuitive here; maybe this figure helps to show what I mean. In the figure the original posterior is in black, the relabeled one is in red, and the posterior obtained after the procedure described in my original post is in blue.)

N = number of participants, K = number of latent classes, D = number of dimensions of the multivariate normal. Further, sf is a stanfit object, lambda is an N * K matrix of class probabilities, and mu holds the K * D latent class means (per posterior draw). Then you get reordered class means as follows.

library(label.switching)
p = extract(sf, "lambda")$lambda   # draws x N x K class probabilities
switch = stephens(p = p)           # one permutation per posterior draw
raw_mu = extract(sf, "mu")$mu      # draws x K x D class means
reordered_mu = permute.mcmc(raw_mu, switch$permutations)[[1]]

You can create arrays that can be examined with bayesplot functions as follows:

mpc = sf@sim$iter - sf@sim$warmup           # post-warmup iterations per chain
tmp = t(apply(reordered_mu, 1, as.vector))  # flatten each draw's K x D matrix
amu = array(NA, dim = c(mpc, sf@sim$chains, dim(tmp)[2]))
for (cc in 1:sf@sim$chains) amu[, cc, ] = tmp[(1:mpc) + (cc - 1) * mpc, ]
mmu = monitor(amu, print = FALSE, warmup = 0)
rownames(mmu) = paste0("mu[", rep(1:K, D), ",", sort(rep(1:D, K)), "]")

The last line assigns parameter names and needs to be adapted to reflect how the parameters are organized in the original Stan code.
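For what it's worth, the same reshaping can be sketched in Python/NumPy (hypothetical dimensions; note that NumPy's reshape flattens row-major while R's as.vector is column-major, which only affects how the flattened columns are named):

```python
import numpy as np

# Hypothetical dimensions; mirrors the R reshaping above
n_chains, n_iter, K, D = 4, 10, 3, 2

# Flat relabeled draws: (n_chains * n_iter) x K x D, with chain 1's
# post-warmup draws first, then chain 2's, and so on
reordered_mu = np.random.default_rng(1).normal(size=(n_chains * n_iter, K, D))

# Flatten each draw's K x D matrix, then split the rows back into
# per-chain blocks: the result is iterations x chains x parameters,
# the layout that rstan's monitor() (and bayesplot) expects
tmp = reordered_mu.reshape(n_chains * n_iter, K * D)
amu = np.stack([tmp[c * n_iter:(c + 1) * n_iter] for c in range(n_chains)],
               axis=1)
```

The key point is the same as in the R code: the chain blocks must be split in the order the draws were stored, so the draws must not have been permuted on extraction.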

Lastly, even though this is probably self-evident: I used really long chains (4 chains with 5000 post-warmup samples each) to make sure that Stan has a chance to explore the full posterior (move from one mode to another). Of course, I don’t know if this is enough.

Good luck, Guido


#4

Hi Guido,

Thanks for the quick reply!

Got it, thanks for the screenshot! The posteriors that are in red indeed appear to potentially retain some small amounts of multimodality (cell G2 D4 for example). But maybe it’s just sampling error. Looking at the bayesplot equivalent of a traceplot of the relevant parameter would indicate whether your four (relabeled) chains are converging to the same mode. It might be hard to see, though, given the modes would still be close together if there are two of them. Personally I think those blue posteriors might look too good to be true because while they look more like normal distributions, they don’t exhibit any probability mass at all outside of a narrow range. Those identifiability constraints could be lopping off part of what the posteriors would otherwise be. The red posteriors, with their long tails approaching zero, look healthier in that sense.

That makes total sense. Also, if your identifiability constraints are forcing your posteriors into narrow ranges, I think that could lead to overfitting, which would also explain why loo keeps suggesting that you increase the number of classes. To be fair, though, I still need to read the loo and WAIC papers once I get to that step in my own analysis.

If you have the time, as a sanity check you might consider generating some data with N = 150 in which you control the number of latent classes yourself, know all the true parameters associated with your model, and make sure you can properly do the relabeling, get your parameters back from Stan, and select the right number of classes.

My understanding from Michael Betancourt’s write-up and my own limited experience with IRT models is that label switching between chains is more common than within chains. So for complicated models it’s the number of chains that matters. In my case, if my IRT model has K dimensions, then there are K! ways to permute the labels. If I truly wanted to explore all the posterior modes then I would want a minimum of K! chains, and probably more. But in practice I think that’s probably overkill. If several chains all indicate convergence after relabeling for each of my parameters, I’d stop there and then make sure that the results are robust to the choice of priors.
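The K! bookkeeping is easy to check with a quick sketch (illustrative only):

```python
import math
from itertools import permutations

# With K mixture components there are K! ways to assign the labels,
# so the number of label-switched posterior modes grows very fast
K = 4
labelings = list(permutations(range(K)))
assert len(labelings) == math.factorial(K)  # 4! = 24 candidate labelings

# K = 2..6 gives 2, 6, 24, 120, 720 modes: one chain per mode
# quickly becomes impractical
counts = [math.factorial(k) for k in range(2, 7)]
```

This is why, past small K, it seems more realistic to relabel after sampling than to hope each mode gets its own chain.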

Thanks very much for the R code, as a novice Stan user I certainly appreciate it. I’ll have to check this, but it looks like that reordered_mu array could also be dropped right back into a (copied) stanfit object in order to check convergence with functions such as traceplot.

Cheers!

Richard


#5

I did a brief parameter-recovery test at the outset, and it worked fine. However, I used “idealized” data for that, where it was not so hard to distinguish the classes. I think the crux lies in figuring out when a Stan program works in principle (it recovers the correct parameters when the signal is relatively clear) and when it stops working in practice because the data just get too noisy.

That also fits my intuition. Still, the longer one runs a chain, the higher the chance of exploring multiple modes within that chain. But it could still be that, to explore as many modes as possible, it is best to maximize the number of chains rather than the number of iterations per chain.

This won’t be much different from what I have done, because reordered_mu is a matrix with as many rows as there are total samples (no_chains * no_post_warmup_iterations), but to get it back into the stanfit object you need to split the chains again (which also means that you have to take care not to permute the samples when you extract them from the stanfit object in the first place). If you want to put things back into the stanfit object, you’ll have to put them into stanfit@sim$samples, which is a list with as many entries as you have chains.
The reason I am not putting things back into the stanfit object is that all functions in bayesplot work fine with arrays.


#6

Yep! I agree with all of that. :-)

Cheers,

Richard