My painful but fruitful Bayesian mixture model saga continues :-)
Say I have a mixture model involving K Gaussians. The problem is that some or all of those Gaussians end up overlapping, which causes one of the non-identifiability problems mentioned in Betancourt 2017.
I’ve spent some hours reading through the literature on repulsive priors, which ought to solve this problem, but as far as I can tell it’s pretty complicated and theoretical. I’m looking for something more practical.
In Stan, would it be possible to force a separation between the locations of my Gaussians with a condition in the model that rejects samples whose location values don’t meet a certain constraint, e.g. |\mu_i - \mu_j| > \max\{\sigma_i^2, \sigma_j^2\}? Is there some smart way of achieving this?
Will it have the effect I’m hoping for, i.e. the separation of the mixture model’s location parameters?
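To make that concrete, the literal version I have in mind is something like this in the model block (just a sketch, not tested):

    model {
      // zero out the density whenever two means are closer than the larger variance
      // (mu and sigma are the component locations and scales of my mixture)
      for (i in 1:(K - 1))
        for (j in (i + 1):K)
          if (fabs(mu[i] - mu[j]) <= fmax(square(sigma[i]), square(sigma[j])))
            target += negative_infinity();
      // ... usual priors and mixture likelihood
    }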
Introducing discontinuous behavior in the likelihood like this will give HMC problems when it’s near the discontinuity.
I’m guessing that since you want these cutoffs, the sampler would be near the discontinuities a lot, and that’d be no good.
There might be a way to do a soft separation, but I guess that’s what the repulsive prior papers are doing.
If this is 1D, maybe you could parameterize the locations of the different means with offsets and lower-bound those offsets by some positive number (or give them a positive, zero-avoiding prior)? That might not work well either, though.
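Something like this is what I’m thinking (totally untested; the 0.5 lower bound and the gamma prior are just placeholders you’d have to tune):

    parameters {
      real mu1;                       // location of the first component
      vector<lower=0.5>[K - 1] gap;   // positive gaps between consecutive means
      // ... sigma, mixture weights, etc. unchanged
    }
    transformed parameters {
      vector[K] mu;                   // ordered locations, at least 0.5 apart by construction
      mu[1] = mu1;
      for (k in 2:K)
        mu[k] = mu[k - 1] + gap[k - 1];
    }
    model {
      mu1 ~ normal(0, 5);
      gap ~ gamma(2, 1);              // or drop the hard bound and use a zero-avoiding prior on gap
      // ... rest of the model unchanged
    }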
@bbbales2 thanks. Do you know of a way of creating a joint prior for something like this, i.e. joint in such a way that it has natural separations? For example, what if I made a joint density \pi(\theta_1, \theta_2) \propto d(\theta_1, \theta_2), where d is some useful distance measure between the location parameters, so the density vanishes when the components coincide?
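For concreteness, with d taken as the absolute difference between the means, I imagine something like this in the model block (just a sketch, not tested):

    model {
      // multiply the joint prior by |mu[i] - mu[j]| for every pair of components,
      // so the density vanishes whenever two locations coincide
      for (i in 1:(K - 1))
        for (j in (i + 1):K)
          target += log(fabs(mu[i] - mu[j]));
      // ... existing priors on mu, sigma, lambda and the mixture likelihood
    }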
Also, do you think a Gibbs sampler would be better suited to the discontinuities in the original idea?
Here’s an example implementation that uses a determinantal point process prior to induce an exchangeable repulsion between the components. The scaling of the prior is hard to tune and, well, it only reveals more problems. Even without label switching and component collapse you get a horrendous posterior because of the degeneracies in the model for finite data. Ultimately exchangeable mixture models are not well suited for Bayesian inference. Once I finish all of my other case studies I’ll go back and update my mixture model case study with the whole progression of fixes and the deeper problems they reveal.
Exchangeable mixture models: not even once.
functions {
  // Log density of a determinantal point process (DPP) prior on the component
  // locations: the log determinant of a squared-exponential kernel matrix with
  // length scale rho. The determinant shrinks to zero as any two locations
  // approach each other, which is what induces the repulsion.
  real repulsive_lpdf(vector mu, real rho) {
    int K = num_elements(mu);
    matrix[K, K] S;
    matrix[K, K] L;
    real log_det = 0;
    for (k1 in 1:K)
      for (k2 in 1:K)
        S[k1, k2] = exp(- square(mu[k1] - mu[k2]) / square(rho));
    L = cholesky_decompose(S);
    for (k in 1:K)
      log_det = log_det + 2 * log(L[k, k]);
    return log_det;
  }
}

data {
  int<lower=0> K;           // number of mixture components
  int<lower=0> N;           // number of observations
  real y[N];                // observations
}

parameters {
  ordered[K] mu;            // component locations (ordered to tame label switching)
  real<lower=0> sigma[K];   // component scales
  simplex[K] lambda;        // mixture weights
}

model {
  mu ~ normal(0, 5);
  mu ~ repulsive(5);        // DPP repulsion with length scale rho = 5
  sigma ~ normal(0, 1);
  lambda ~ dirichlet(rep_vector(3.0, K));

  // Marginalize over the discrete component assignments
  for (n in 1:N) {
    vector[K] comp_lpdf;
    for (k in 1:K) {
      comp_lpdf[k] = log(lambda[k])
                     + normal_lpdf(y[n] | mu[k], sigma[k]);
    }
    target += log_sum_exp(comp_lpdf);
  }
}
Ultimately exchangeable mixture models are not well suited for Bayesian inference
I had hoped this wasn’t the case, and I wait with bated breath for your update to the mixture model write-up.
In my case I have a large dataset and an unknown number K of Gaussian sub-populations. I could see a way through if I knew K, so if I just had some way of determining K, I could incrementally solve the rest.
Where a mixture model is the right choice, is there typically an alternative that would also be appropriate?
I’m going to claim that even if K is known you won’t be able to fit the model due to the degeneracies. One way to think about this is in terms of experimental design – for any finite data set observations from an exchangeable mixture of Gaussians are very poorly informative of those latent components and the resulting posterior will be extremely degenerate. You either need very informative priors or additional observational processes that can break that degeneracy.
It feels like you’re implying the corollary that mixture models are only principled in situations where K is known and the priors are non-exchangeable (or otherwise have a very strong, asymmetric effect on the components of the mixture).
I’d hazard a guess that unknown K combined with non-exchangeable priors doesn’t really make sense, so it seems I’m going to have to look for alternative formulations for my particular puzzle.
Basically. The same problems arise with Dirichlet processes and neural networks – the more universal the model is asymptotically, the more degenerate it will be for finite data. It would be great if this weren’t true, but I have yet to see any exceptions.