Hi guys, I’m following the advise in section 22.3 and 22.4 of BDA3 with regards to fitting mixture models with an unspecified number of components (K). In my case, say the data comes from a mix of 3 beta distributions in proportions \lambda=(.5,.25,.25) with \alpha=(1,2,3) and \beta=(1,1,1) respectively.
All I want is to discover K. My understanding is as follows:
(1) Specify a probability model. E.g.
h=1..H, where H is an upper bound on the possible number of clusters.
y=(y_1,..,y_N)
z=(z_1,...,z_N) where z_i \in \{1,...,H\} indicates the component y_i comes from.
a_h is the h component of \alpha, with constraint \alpha_1 \lt ... \lt \alpha_H.
\lambda_h is the h component of \lambda.
\phi=(\alpha,\beta,\lambda)
y_i \mid z_i=h,\phi \sim Beta(\alpha_h,\beta)
z_i=h,\phi \mid \sim Categorical(\lambda)
\lambda \sim Dirichlet(\frac{1}{H},...,\frac{1}{H})
(2) Treat H as an upper bound (e.g. H=10), and make a MCMC model which averages over H and z_i to discover the global parameters \theta=(\alpha,\beta).
(3) Gibbs time… Hope is that we can discover the discrete components z_i, and in so doing that K \approx \sum_{h=1}^H1_{n_h>0}. I.e. K is roughly the same as the number of components with at least one data point allocated to them.
Following section 11.1 and 22.3, my Gibbs sampler will take \theta to be known global parameters and just iterate the following steps to convergence.
(1) Sample \lambda from prior Dirichlet. Update z \mid \theta,\lambda.
(2) Sample z from prior \lambda. Update \lambda \mid z.
(4) What now? Can I use argmax with P(z_i=h \mid \lambda_h) to allocate points to mixtures components and then estimate K \approx \sum_{h=1}^H1_{n_h>0}?
My questions are:
(1) Is my understanding of this architecture correct?
(2) Is the Gibbs sampler correct?
(3) The two steps in my Gibbs sampler both have updates which require posteriors. Am I correct in saying that my posteriors must be analytically deduced?
(4) What do I do at point 4?
Many thanks in advance.
EDIT:
So after more staring at the Gibbs sampler examples I now understand that the above won’t work. Basically, I don’t think Stan can be used in this regard at all since the entirety of the model above would need to be expressed by the Gibbs sampler. K would then be derived as the posterior mode of \sum_{h=1}^H1_{n_h>0} following draws after convergence. To make the sampler efficient, all my distributions would need to have conjugate priors so I can specify their posteriors analytically. Arghhhh…