Suppose that we have a Gaussian mixture model with N > 1 components. The parameters to infer are
vector[N] mu; // mean for each component
vector<lower=0>[N] sigma; // sd for each component
but, as Michael Betancourt discusses (http://mc-stan.org/users/documentation/case-studies/identifying_mixture_models.html), there are identifiability issues: assuming symmetric priors, permuting the component indices leaves the joint probability density unchanged, guaranteeing us at least N! posterior modes.
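For concreteness, here is a minimal sketch of the kind of model I have in mind (the mixing proportions theta, the data names y and K, and the particular symmetric priors are just for illustration):

data {
  int<lower=1> K;                 // number of observations
  vector[K] y;                    // observed data
  int<lower=2> N;                 // number of mixture components
}
parameters {
  simplex[N] theta;               // mixing proportions
  vector[N] mu;                   // mean for each component
  vector<lower=0>[N] sigma;       // sd for each component
}
model {
  mu ~ normal(0, 10);             // same (symmetric) prior for every component
  sigma ~ normal(0, 2);           // half-normal via the lower bound
  for (k in 1:K) {
    vector[N] lps = log(theta);   // log mixing proportions
    for (n in 1:N)
      lps[n] += normal_lpdf(y[k] | mu[n], sigma[n]);
    target += log_sum_exp(lps);   // marginalize over the component label
  }
}

Swapping any two component indices (along with the corresponding entries of theta) leaves this target unchanged, which is exactly where the N! modes come from.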
One approach is to impose a constraint that mu[i] < mu[i+1] for all 1 <= i < N. One way of looking at it is to define the canonical form of (mu, sigma) to be the pair of vectors
canonical(mu, sigma) = (mu', sigma')
obtained by permuting indices such that the elements of mu’ are in increasing order. The constraint is then that (mu, sigma) must already be in canonical form.
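In Stan this constraint can be imposed directly with the built-in ordered type, so that every draw is in canonical form by construction; a minimal sketch of the parameters block:

parameters {
  ordered[N] mu;               // enforces mu[1] < mu[2] < ... < mu[N]
  vector<lower=0>[N] sigma;    // sd for each component
}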
But what if we simply impose canonical form after sampling is done, that is, take the sample and map (mu, sigma) to canonical(mu, sigma)? Effectively we are saying that (mu1, sigma1) and (mu2, sigma2) are the same point if they have the same canonical form. We’ve moved from a Euclidean space to a space that is locally Euclidean but not globally Euclidean. This could make things hairy if the proposal distribution defined by NUTS were asymmetric, but luckily it is symmetric, so it seems to me that the detailed balance equations should still hold.
Am I missing something here?
Kevin:
You can do this, but it can make the sampling much slower and convergence much more difficult if the different modes are separated in the posterior distribution.
Not sure I follow you here. When you say “make the sampling much slower,” are you talking about slower iterations, or more iterations to get convergence?
If we consider ourselves to be working in a “canonical” parameter space that is a 1/N! slice of the original space, as I proposed, it seems to me that you would want to canonicalize the draws before computing N_eff and R_hat. Then you no longer have all those symmetric modes, and you avoid the problems that can occur near the boundaries of the constrained space when an ordering constraint is imposed explicitly.
In more detail: any path taken in the original space is mapped, via canonical(), to an equivalent path in the canonical space. Any symmetric proposal distribution in the original space is mapped to a symmetric proposal distribution in the canonical space. The probability density at any point in the original space is the same (up to a factor of N!) as the probability density in the canonical space.
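To make this concrete, here is a sketch of the canonicalization applied to each draw, written as a generated quantities block (assuming a recent Stan version; mu_canon and sigma_canon are just illustrative names, and the same sorting could equally be done outside Stan on the saved draws):

generated quantities {
  vector[N] mu_canon;                          // mu permuted into increasing order
  vector[N] sigma_canon;                       // sigma permuted by the same indices
  {
    array[N] int idx = sort_indices_asc(mu);   // permutation that sorts mu
    for (n in 1:N) {
      mu_canon[n] = mu[idx[n]];
      sigma_canon[n] = sigma[idx[n]];
    }
  }
}

N_eff and R_hat would then be computed on mu_canon and sigma_canon rather than on mu and sigma.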
This approach presumes that you have an exact symmetry in your posterior, and it technically works because it does the same thing as the constraint (identifying a unique orthant). That said, it will mess up adaptation and diagnostics, which don’t know about this symmetry and hence end up in weird configurations. Hence it’s always better to remove the symmetry in the model specification itself.
That said, be careful, because removing the label switching only peels away the first layer of pathologies in exchangeable mixture models. With more than two components there are myriad more subtle yet equally problematic non-identifiabilities and weak identifiabilities that are nowhere near as easy to manage.
I do not recommend using exchangeable mixture models at all! Non-exchangeable mixture models are great, but once the components are all degenerate you’re in for trouble.