Thank you very much for your prompt reply, and sorry for not being clear before. My goal is to reproduce the results from the pseudo-marginal HMC paper (Alenlöv et al., 2021), so all of my model settings strictly follow their implementation.
As described previously, Alenlöv et al. consider a generalised linear mixed model where Y_{ij} \sim \text{Bernoulli}(p_{ij}), \text{logit}(p_{ij}) = X_i + Z_{ij}^T\beta, and the latent variables X_{i} \sim wN(\mu_1, \lambda_1^{-1}) + (1-w)N(\mu_2, \lambda_2^{-1}). The parameters are therefore (\beta^T, \mu_1, \mu_2, \lambda_1, \lambda_2, w), and to enforce the constraints \lambda_1, \lambda_2 > 0 and 0 < w < 1, the target parameter is transformed to
\theta = (\beta^T, \mu_1, \mu_2, \log\lambda_1, \log\lambda_2, \text{logit}(w)).
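Concretely, the forward and inverse maps are straightforward; here is a minimal NumPy sketch (the function names and the convention that \beta has dimension d are my own):

```python
import numpy as np

def to_unconstrained(beta, mu1, mu2, lam1, lam2, w):
    """Map the constrained parameters to theta in R^{d+5}."""
    return np.concatenate([beta, [mu1, mu2, np.log(lam1), np.log(lam2),
                                  np.log(w) - np.log1p(-w)]])   # logit(w)

def to_constrained(theta, d):
    """Invert the map: exp recovers the precisions, the logistic
    function recovers the mixture weight."""
    beta = theta[:d]
    mu1, mu2 = theta[d], theta[d + 1]
    lam1, lam2 = np.exp(theta[d + 2]), np.exp(theta[d + 3])
    w = 1.0 / (1.0 + np.exp(-theta[d + 4]))
    return beta, mu1, mu2, lam1, lam2, w
```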
And yes, I am using HMC, specifically pseudo-marginal HMC, to do the sampling. My Hamiltonian is defined as
H(\theta, \rho, u, p) = -\log p(\theta) - \log p(y \mid \theta, u) + \frac{1}{2}\left(\rho^T\rho + u^Tu + p^Tp\right).
Here \theta is the target parameter above and \rho is its momentum variable; Alenlöv et al. take the prior of \theta to be N(0, 100I) and let \rho \sim N(0, I). The variable u is the auxiliary variable that stands in for the latent X_i's in the pseudo-marginal construction, and p is its momentum variable; Alenlöv et al. assume both follow a standard normal distribution.
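Evaluating H is then direct given the likelihood estimate; a minimal sketch, with log_phat standing for the estimated log-likelihood \log p(y \mid \theta, u) (sketched further below):

```python
import numpy as np

def hamiltonian(theta, rho, u, p, log_phat):
    """H = -log p(theta) - log p(y|theta,u) + 0.5 (rho'rho + u'u + p'p).

    With the prior theta ~ N(0, 100 I), -log p(theta) is theta'theta / 200
    up to an additive constant, which cancels in the accept/reject step.
    """
    neg_log_prior = theta @ theta / 200.0
    kinetic = 0.5 * (rho @ rho + np.sum(u * u) + np.sum(p * p))
    return neg_log_prior - log_phat(theta, u) + kinetic
```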
The motivation for the pseudo-marginal method is that the marginal likelihood p(y \mid \theta) = \int p(y \mid \theta, x) f_\theta(x)\,dx is intractable, so an unbiased estimate p(y \mid \theta, u) is used in its place. It is constructed by importance sampling:
p(y \mid \theta, u) = \prod_{i=1}^T p(y_i \mid \theta, u_i) = \prod_{i=1}^T \left\{ \frac{1}{N}\sum_{l=1}^N \frac{\left[\prod_{j=1}^n g_\theta (y_{ij} \mid X_{il})\right] f_\theta(X_{il})}{q_\theta(X_{il})} \right\},
where q_\theta(X_{il}) is the importance-sampling proposal, here set to N(0, 3^2) via X_{il} = 3u_{il} with u_{il} \sim N(0, 1). Note that the product over j sits inside the particle average because all n observations in group i share the same latent X_i.
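In case it is useful, this is roughly how I evaluate the estimator on the log scale (a sketch of my implementation, reusing the to_constrained helper from the sketch above; the array shapes are my own conventions):

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import norm

def log_phat(y, Z, theta, u, d):
    """Log of the importance-sampling estimate of p(y | theta, u).

    y: (T, n) binary responses; Z: (T, n, d) covariates;
    u: (T, N) auxiliary variables; the proposal draws are X_{il} = 3 u_{il}.
    """
    beta, mu1, mu2, lam1, lam2, w = to_constrained(theta, d)
    X = 3.0 * u                                      # (T, N) proposal draws
    # log f_theta(X): two-component Gaussian mixture density of the X_i's
    log_f = logsumexp(
        np.stack([np.log(w) + norm.logpdf(X, mu1, lam1 ** -0.5),
                  np.log1p(-w) + norm.logpdf(X, mu2, lam2 ** -0.5)]),
        axis=0)
    log_q = norm.logpdf(X, 0.0, 3.0)                 # proposal N(0, 3^2)
    eta = Z @ beta                                   # (T, n) fixed effects
    logits = X[:, None, :] + eta[:, :, None]         # (T, n, N)
    # Bernoulli log-likelihood summed over the n observations in each
    # group, evaluated at every particle: shape (T, N)
    log_g = np.sum(y[:, :, None] * logits - np.logaddexp(0.0, logits), axis=1)
    # average the N importance weights per group, then take the product
    # over groups (a sum on the log scale)
    return np.sum(logsumexp(log_g + log_f - log_q, axis=1) - np.log(u.shape[1]))
```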
My problem is this: with this pseudo-marginal HMC, \lambda_1 or \lambda_2 always drifts to a very large value, no matter how small a step size I choose. I suspect that one of the Gaussian mixture components is fitting a single data point, so its precision tends to infinity. Once I change the distribution of X_i from a Gaussian mixture to a single Gaussian, \lambda_1 and \lambda_2 no longer show this explosive behaviour.
To rule out implementation errors, I have double-checked my gradient computations and the integrator, and after several tests both appear correct. I am therefore struggling to explain the abnormal sampling results.
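Concretely, my gradient check compares the analytic gradient of the potential U(\theta) = -\log p(\theta) - \log p(y \mid \theta, u) (with u held fixed) against central finite differences, roughly like this:

```python
import numpy as np

def check_gradient(potential, grad_potential, theta, eps=1e-5):
    """Return the largest absolute discrepancy between the analytic
    gradient of U(theta) and a central finite-difference estimate."""
    g_analytic = grad_potential(theta)
    g_numeric = np.empty_like(theta)
    for k in range(theta.size):
        step = np.zeros_like(theta)
        step[k] = eps
        g_numeric[k] = (potential(theta + step) - potential(theta - step)) / (2 * eps)
    return np.max(np.abs(g_analytic - g_numeric))
```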
I would greatly appreciate any advice or guidance on this matter. Thank you for your time.
Reference:
Alenlöv, J., Doucet, A., & Lindsten, F. (2021). Pseudo-Marginal Hamiltonian Monte Carlo. Journal of Machine Learning Research, 22.