Marginalized parameter recovery - circular dependencies?

I need to recover a marginalized discrete parameter, very similar to what is described here and in the rater paper by Pullin, Gurrin, and Vukcevic:

I know how to implement this for my model using standard HMM algorithms (forward-backward). However, my concern is with the following formula from the Stan docs: we condition Y on theta, yet since theta is drawn from the posterior, theta is itself conditional on Y. How is it mathematically justifiable to have Y conditioned on theta and theta conditioned on Y at the same time?

Thanks in advance for any help with this!

It turns out that the quantities you need in order to calculate the posterior distribution of the latent state Z include the model likelihood, evaluated conditional on draws from the posterior distribution for theta.

Maybe the key insight is that the quantity we are trying to compute isn’t the probability of the data Y (conditional or otherwise). We just need to collect these likelihood terms in order to compute the posterior for Z.

Thanks, Jacob. After thinking about this some more, I think the explanation that makes the most sense to me is that (1) the use of Bayes' theorem to condition on a single value of theta is perfectly valid regardless of where that value of theta came from, and moreover (2) the appearance of circularity only occurs when taking the expectation of that posterior w.r.t. theta; but that is valid as well, because we can always take the expectation of any measurable function having a finite expectation. Thus, according to my understanding, the only implication of conditioning Y on theta and then taking the expectation with respect to theta (which is conditioned on Y) is in the interpretation of that expectation; there's nothing actually invalid about doing so. Do you agree with that reasoning?
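
To put (1) and (2) in symbols, what I have in mind is just the law of total probability applied to the posterior:

\Pr[Z = k \mid Y] = \int \Pr[Z = k \mid \Theta = \theta, Y] \, p(\theta \mid Y) \, d\theta = \mathrm{E}_{\theta \mid Y}\bigl[ \Pr[Z = k \mid \Theta = \theta, Y] \bigr],

where the inner probability conditions on one fixed value of theta and the outer expectation is taken over the posterior for theta, so nothing circular is going on.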

Thanks!

That’s right. Note that this is true in general when working with likelihoods: the likelihood is the probability of Y conditional on theta, so whenever we evaluate the likelihood of a fitted model, with theta taken from the posterior, exactly this situation arises.

I’d say the intuitive way of looking at it is that the posterior is being computed for each possible value of k, so within each k we can integrate over the probability density of the continuous parameters as usual, by summing over the samples, and get each probability mass.

\Pr[Z = k \mid \Theta = \theta^{(i)}, Y] = \frac{\Pr[Y \mid Z = k, \Theta = \theta^{(i)}] \, \Pr[Z = k \mid \Theta = \theta^{(i)}]} {\sum_{k' = 1}^{K} \Pr[Y \mid Z = k', \Theta = \theta^{(i)}] \, \Pr[Z = k' \mid \Theta = \theta^{(i)}]}.

gives the formal expression, but I don’t see it as being any different from integrating over any other (continuous) parameter, which would be

\Pr[\theta_k \mid \Theta = \theta^{(i)}, Y] \propto \Pr[Y \mid \theta_k, \Theta = \theta^{(i)}] \Pr[\theta_k \mid \Theta = \theta^{(i)}]

But normalized by the sum over the existing samples, not over all of the infinitely many possible values, to obtain the marginal density for \theta_k or whatever else.
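
Concretely, averaging the first expression over the M posterior draws gives the marginal posterior mass function for the discrete variable,

\Pr[Z = k \mid Y] \approx \frac{1}{M} \sum_{i = 1}^{M} \Pr[Z = k \mid \Theta = \theta^{(i)}, Y],

which is the "summing over the samples" step in the intuitive description above.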

You could discretize the continuous variable (after all, the samples are themselves a discrete sample) and work as if its values had been sampled; you could even just compute the posteriors on a grid for all variables and work with them the same way as MCMC samples when computing expectations. Of course, looping over each possible value (with something like for (t in 1:T) { lp_e[t + 1] = lp_e[t] + poisson_lpmf(D[t] | e); lp_l[t + 1] = lp_l[t] + poisson_lpmf(D[t] | l); } in one of the examples) is expensive, so in practice it is only worth doing for the discrete variables that are problematic for HMC.
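
For reference, here is a sketch of the kind of program that loop comes from, essentially the change-point example in the Stan User's Guide, with a generated quantities block added here to show where the per-draw probabilities \Pr[s = t \mid \theta^{(i)}, D] come out; the data names (r_e, r_l, T, D) follow that example, and averaging s_prob over the draws afterwards gives the marginal posterior for the change point s.

data {
  real<lower=0> r_e;         // prior rate for the early Poisson rate
  real<lower=0> r_l;         // prior rate for the late Poisson rate
  int<lower=1> T;            // number of time points
  array[T] int<lower=0> D;   // observed counts
}
transformed data {
  real log_unif = -log(T);   // uniform prior over change points, on the log scale
}
parameters {
  real<lower=0> e;           // early rate
  real<lower=0> l;           // late rate
}
transformed parameters {
  // lp[s] = log p(s, D | e, l) for each candidate change point s
  vector[T] lp;
  {
    vector[T + 1] lp_e;
    vector[T + 1] lp_l;
    lp_e[1] = 0;
    lp_l[1] = 0;
    for (t in 1:T) {
      lp_e[t + 1] = lp_e[t] + poisson_lpmf(D[t] | e);
      lp_l[t + 1] = lp_l[t] + poisson_lpmf(D[t] | l);
    }
    lp = rep_vector(log_unif + lp_l[T + 1], T) + head(lp_e, T) - head(lp_l, T);
  }
}
model {
  e ~ exponential(r_e);
  l ~ exponential(r_l);
  target += log_sum_exp(lp);  // marginalize over the discrete change point
}
generated quantities {
  // Pr[s = t | e, l, D] for this draw; averaging these over all draws
  // gives the marginal posterior Pr[s = t | D]
  simplex[T] s_prob = softmax(lp);
}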

That’s one way I’d look at it and justify it intuitively.