When doing model selection, we can expand our model to span the candidate models with a mixture construction, akin to how Bayes factors are constructed and justified, using a model of the form
w\sim \text{dirichlet}(\mathbf{1})\\
m\sim \text{categorical}(w)\\
x\sim p_m
where m is a discrete index selecting the model class, so that x is generated by a mixture of the models. The prior on w is a bit unorthodox from the Bayes factor point of view, but it allows us to integrate out m, yielding
w \sim \text{dirichlet}(\mathbf{1})\\
x \sim \sum_{m=1}^M w_m p_m
which could be modelled in Stan as
model {
  vector[K] comp_lpmf;
  w ~ dirichlet(rep_vector(1, K));
  for (k in 1:K) {
    comp_lpmf[k] = w[k] * likelihood_lpmf(y | param[k]);
  }
  target += log_sum_exp(comp_lpmf);
}
When sampling from a model of this form, the samples of w become incredibly concentrated and differ across chains, indicating poor mixing. Is my reasoning here flawed, or is this to be expected?
edit: as I note below, the Stan implementation here is not at all doing what I wanted. The text still holds.
Have you looked at the examples in the manual for mixtures?
Sure, that’s basically how I got the model structure, but this is a slightly different application from a standard mixture, being one rung lower in the hierarchy.
I don’t know much about this application, so I’m only going to be of limited help. But it looks like you’re getting a highly concentrated multimodal distribution, which Stan is just not going to be able to fit well (but neither will most other things). The problem is that it’s really hard for the Markov chain to traverse the regions of very low probability mass between the modes.
It’s not hard to imagine a case where there are two modes that are far apart on the simplex and the only paths between them go through very unlikely parts of the parameter space. So it might just be that the problems with this model are unavoidable.
I also feel like the concentration parameter in your Dirichlet should decrease with K to make sure the prior is concentrated on the boundary of the simplex (which is where you want it to be concentrated).
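For concreteness, something like the sketch below keeps the total concentration fixed at 1 as K grows, so the prior mass piles up on the corners and edges of the simplex (the 1.0 / K scaling is just my suggestion, not something implied by your model):
data {
  int<lower=1> K;
}
parameters {
  simplex[K] w;
}
model {
  // total concentration stays at 1 as K grows, so the prior
  // concentrates on the boundary of the simplex
  w ~ dirichlet(rep_vector(1.0 / K, K));
}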
There’s some practical advice in this paper (which you have probably seen). It focuses on Gibbs samplers, but some of the recommendations around prior specification could be useful.
Mixture models can be useful, but I’d also like to point out that mixture models, like Bayes factors, can be unstable when the data are small. See Section 4.3 in Using Stacking to Average Bayesian Predictive Distributions (with Discussion) for a recent example with Stan.
Hm, part of the problem is that each component effectively only observes a fraction of the data; there is a credit assignment effect at play.
But couldn’t this be circumvented if the models have shared parameters?
For instance, if each component of the mixture is a factor model (e.g. probabilistic PCA), with component k being the factor model with k factors, we could improve posterior concentration by letting factor model k+1 be equal to factor model k, but with a single new vector added.
This is my actual case, but I still observe poor mixing for the random variable k.
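To spell the nesting out (in standard probabilistic-PCA notation; the symbols below are mine, not something defined above), write W_k for the loading matrix of the k-factor model; then
W_{k+1} = \begin{bmatrix} W_k & v_{k+1} \end{bmatrix}\\
p_k(x) = \mathcal{N}\!\left(x \mid \mu,\ W_k W_k^\top + \sigma^2 I\right)
so component k+1 shares every parameter of component k and adds only the single new column v_{k+1}.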
Bad mistake here: the log-probability statement I provided is equivalent to \frac{1}{Z}\sum_{k=1}^K p_k(y)^{w_k}, which I am not at all sure makes sense under any interpretation (I guess it is roughly an equally weighted mixture of tempered likelihoods…). I will try to reimplement.
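For the record, a reimplementation along the lines of the mixture examples in the manual would add the log-weight to each component log-likelihood instead of multiplying by the weight. A minimal sketch, keeping the same placeholder likelihood_lpmf, param, y, K, and simplex w as in the snippet above:
model {
  vector[K] lp;
  w ~ dirichlet(rep_vector(1, K));
  for (k in 1:K) {
    // log(w[k] * p_k(y)) = log-weight plus component log-likelihood
    lp[k] = log(w[k]) + likelihood_lpmf(y | param[k]);
  }
  // marginalise over the discrete model index m
  target += log_sum_exp(lp);
}
Here log_sum_exp recovers \log\sum_k w_k p_k(y), i.e. the marginalised mixture density given further up.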