I am interested in fitting a state space model (a dynamic structural equation model, to be specific), where I have observed time series for a large number of subjects. I assume that the latent process is measured by multiple observations at each timepoint, and hence I need to compute the marginal likelihood, integrating over the distribution of the latent variables, in order to update the parameters in the model.
If I had a Gaussian measurement model this would be easy to do exactly using the gaussian_dlm_obs() function. However, I am in a situation where the measurements follow some other exponential family distribution, so a Kalman filter is not appropriate.
A natural approach is to replace the Kalman filter with a particle filter, yielding the particle marginal Metropolis-Hastings algorithm of Andrieu et al. (2010). Conceptually, I believe that all the consistency results of Andrieu et al. should carry over to the NUTS sampler, but I’m not sure if it’s practically feasible.
So my question is: Does it at all make sense within Stan to try to replace gaussian_dlm_obs() with a function I implement myself, that estimates the marginal likelihood by running multiple (stochastic) particle filters using the current values of the model parameters?
Using a particle filter to stochastically estimate the marginal likelihood as part of computing the target log density would be problematic when combined with Stan’s gradient-based Hamiltonian Monte Carlo (HMC) sampling, as the target density is expected to be a smooth, deterministic function of the model parameters. If each evaluation of the target log density and its gradient is subject to noise, this breaks the nice properties the leapfrog integrator usually brings (see Betancourt, 2015), namely a bounded change in the Hamiltonian, and hence a high probability of acceptance, over long simulated trajectories.
One option to overcome this issue with noisy gradients is to ‘fix’ the randomness used in the marginal likelihood estimates within a given proposal: sample one set of values for all auxiliary random variates used in the particle filter estimate of the marginal likelihood and its gradient, and use these in all subsequent evaluations during the current proposal (for different values of the model parameters). Between the updates to the parameters given fixed values of these auxiliary variates, we need to update the values of the auxiliary variates themselves to ensure ergodicity. A simple approach is a Metropolis independence sampler step that updates the auxiliary variates given fixed values of the parameters: this just corresponds to computing a second, independent marginal likelihood estimate for fixed values of the parameters, and accepting the new estimate (and auxiliary variates) based on the ratio of the estimated target density values. This basic approach (though without using HMC for the parameter updates) is what is described in the papers on pseudo-marginal slice sampling (Murray and Graham, 2016) and correlated pseudo-marginal Metropolis-Hastings (Deligiannidis, Doucet and Pitt, 2018), with the idea further adapted specifically to the HMC setting in pseudo-marginal Hamiltonian Monte Carlo (Alenlöv, Doucet and Lindsten, 2021). In practice, rather than using an independence sampler to update the auxiliary variates, it usually makes sense to use a perturbative proposal such as random-walk Metropolis or slice sampling.
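As a concrete toy illustration of this fixed-randomness scheme, the sketch below alternates (a) a random-walk Metropolis update of a scalar parameter theta with the auxiliary variates u held fixed, and (b) an independence-sampler refresh of u with theta held fixed. The model, the one-sample-per-variate importance estimator `log_marginal_est`, and the flat prior on theta are all invented for illustration; in the real setting the estimator would be a particle filter.

```python
import numpy as np

def log_norm_pdf(x, mu, sigma):
    return -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))

y_obs = 1.5  # single toy observation; model: z ~ N(theta, 1), y ~ N(z, 1)

def log_marginal_est(theta, u):
    # Unbiased importance sampling estimate of p(y_obs | theta): average
    # N(y_obs; z, 1) over z = theta + u with u standard normal, which has
    # expectation N(y_obs; theta, sqrt(2)). For fixed u this is a
    # deterministic function of theta.
    z = theta + u
    return np.log(np.mean(np.exp(log_norm_pdf(y_obs, z, 1.0))))

rng = np.random.default_rng(0)
n_aux, n_iter = 8, 2000
theta, u = 0.0, rng.standard_normal(n_aux)
log_p = log_marginal_est(theta, u)  # flat prior on theta for simplicity
thetas = []
for _ in range(n_iter):
    # (a) random-walk Metropolis update of theta, auxiliary variates u fixed
    theta_prop = theta + 0.5 * rng.standard_normal()
    log_p_prop = log_marginal_est(theta_prop, u)
    if np.log(rng.uniform()) < log_p_prop - log_p:
        theta, log_p = theta_prop, log_p_prop
    # (b) independence-sampler refresh of u given fixed theta: compute a
    # second independent marginal likelihood estimate and accept based on
    # the ratio of the estimated target densities
    u_prop = rng.standard_normal(n_aux)
    log_p_new = log_marginal_est(theta, u_prop)
    if np.log(rng.uniform()) < log_p_new - log_p:
        u, log_p = u_prop, log_p_new
    thetas.append(theta)

print(np.mean(thetas[500:]))  # posterior mean should end up near y_obs
```

Step (a) here could be replaced by an HMC update of theta (with u fixed the estimated log density is smooth in theta for this toy estimator), which is essentially the pseudo-marginal HMC construction.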
While this approach of fixing the randomness in the estimator overcomes the issue of noise in the density and gradients, for particle filter estimators of the marginal likelihood specifically there is an additional issue: the estimated density, for a fixed set of auxiliary random variates, is not a smooth function of the parameters, due to the non-differentiability of the resampling steps. This introduces discontinuities that will still severely limit the effectiveness of an HMC-based update to the parameters. An alternative is to use a simple sequential importance sampling estimate of the marginal likelihood, which avoids the resampling steps; however, this then suffers from high variance of the estimator, which manifests as pathological sticking behaviour in the resulting chains when using pseudo-marginal type updates.
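For concreteness, here is a minimal sequential importance sampling estimator of the log marginal likelihood for a made-up AR(1)-plus-Poisson state space model, using the state dynamics as the proposal; the model and function name are assumptions for the example. Because there is no resampling, for fixed auxiliary variates u the estimate is a smooth function of the parameters (a, sigma).

```python
import numpy as np
from math import lgamma

def log_marginal_sis(a, sigma, y, u):
    """Sequential importance sampling estimate of log p(y | a, sigma) for the
    toy model x_t = a * x_{t-1} + sigma * u_t, y_t ~ Poisson(exp(x_t)), with
    the state dynamics as proposal. u has shape (n_particles, T) and holds
    the standard normal auxiliary variates; with no resampling, for fixed u
    the estimate is differentiable in (a, sigma)."""
    n_particles, T = u.shape
    x = np.zeros(n_particles)
    log_w = np.zeros(n_particles)
    for t in range(T):
        x = a * x + sigma * u[:, t]  # propagate all particles one step
        # accumulate Poisson(exp(x)) log-likelihood of observation y[t]
        log_w += y[t] * x - np.exp(x) - lgamma(y[t] + 1)
    # log of the average unnormalized weight, computed stably
    m = log_w.max()
    return m + np.log(np.mean(np.exp(log_w - m)))

rng = np.random.default_rng(1)
y = rng.poisson(1.0, size=20)            # fake observed counts
u = rng.standard_normal((64, 20))        # fixed auxiliary variates
print(log_marginal_sis(0.9, 0.5, y, u))
```

The price of dropping resampling is that the weights degenerate as T grows, which is exactly the high-variance behaviour mentioned above.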
The auxiliary variates in a sequential importance sampling estimate of the marginal likelihood are just the state particles at each observation time, and in the limiting case of one particle we revert to targeting the joint distribution of the latent states and model parameters directly (modulo some reparametrization depending on the choice of proposal in the particle filter). This suggests a simple alternative to pseudo-marginal approaches: just use HMC to sample from the joint distribution of the model parameters and the sequence of latent states directly, with the samples of the model parameters then corresponding to samples from the marginal distribution of interest. As the latent states are updated jointly with the model parameters, this can be more effective than the approaches described above, which alternately update the parameters given the auxiliary variables (latent states) and the latent states given the parameters, as typically the parameters and latent variables will have complex dependencies. Depending on how informative the observations are, it can sometimes be beneficial to instead use a non-centred parametrization in terms of the state noise increments rather than the latent states themselves (with a deterministic relationship between the two given the initial state).
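A minimal sketch of the non-centred parametrization mentioned here, for a hypothetical AR(1) latent process x_t = a * x_{t-1} + sigma * v_t: the sampler treats the standard normal increments v as the unconstrained latent parameters, and the states are reconstructed deterministically.

```python
import numpy as np

def increments_to_states(a, sigma, x0, v):
    """Non-centred parametrization: given standard normal increments
    v_1..v_T (treated as parameters by the sampler), deterministically
    reconstruct the latent states via x_t = a * x_{t-1} + sigma * v_t.
    The v_t have independent standard normal priors regardless of
    (a, sigma), which often improves the posterior geometry when the
    observations are weakly informative."""
    x = np.empty(len(v))
    x_prev = x0
    for t, v_t in enumerate(v):
        x_prev = a * x_prev + sigma * v_t
        x[t] = x_prev
    return x
```

In Stan terms this corresponds to declaring the increments in the parameters block and computing the states in transformed parameters.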
If you can write the model down in mathematical notation, that’s usually a huge help in these discussions.
Usually you can turn this around and just write the density directly in Stan. It may not be efficient enough, though conceptually it is fine to do.
I’m not sure what results you are expecting to carry over? I find Andrieu et al.'s paper over my head in terms of measure theory.
This won’t be a problem if you just write the complete density down rather than trying to frame it as a filtering problem. As I mentioned above, scale is usually the problem that defeats this. The Kalman filter updates are very efficient due to their conjugacy.
Often you don’t have to calculate marginal likelihoods at all; you can just write down the full data likelihood and parameterize the unknowns. Then MCMC does the marginalization for you implicitly. That is, if I have p(y, z \mid \theta), I can either try to marginalize out z to get a marginal likelihood p(y \mid \theta), or I can introduce parameters for z, fit the joint model, and then just forget about the z later.
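To make this concrete, here is what writing down the full joint density looks like for a toy state space model with Poisson observations (model and names invented for illustration): the latent states x are treated as parameters and appear directly in the joint log density, mirroring what one would write in a Stan model block.

```python
import numpy as np
from math import lgamma

def log_joint(a, sigma, x, y):
    """Log of p(y, x | a, sigma) for the toy model
    x_t ~ N(a * x_{t-1}, sigma), y_t ~ Poisson(exp(x_t)), x_0 = 0.
    This is the 'complete data' density: the latent states x are
    parameters, and no marginalization over them is needed, as MCMC
    marginalizes implicitly."""
    lp = 0.0
    x_prev = 0.0
    for x_t, y_t in zip(x, y):
        # Gaussian state transition (dropping the constant)
        lp += -0.5 * ((x_t - a * x_prev) / sigma) ** 2 - np.log(sigma)
        # Poisson observation log-likelihood
        lp += y_t * x_t - np.exp(x_t) - lgamma(y_t + 1)
        x_prev = x_t
    return lp
```

Sampling (a, sigma, x) jointly and discarding the x draws then yields draws from the marginal posterior of (a, sigma).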
Yes, that is what I also suggested above after the snippet you quoted 😉:
For the point
As I mentioned above, scale is usually the problem that defeats this.
Do you mean that this approach becomes computationally too costly in terms of the expense of gradient evaluations (which should be roughly the same as simulating one trajectory from the model, so significantly cheaper than running a particle filter)? Or because of the resulting high dimension of the latent space and the potentially challenging geometry of the posterior? Or something else?
From previous numerical experiments I’ve done applying HMC to jointly sample latent states and parameters in state space models (using JAX to implement the model and another Python package, Mici, to sample the chains, but using the same dynamic HMC + adaptation algorithms as Stan, so results should be largely similar to using Stan directly), I found that the approach generally performed well provided the observations are not too strongly informative, though this was in cases where the dimension of the latent state was small (two or three dimensional). Depending on the specifics of the model, I think things could still scale well to larger state dimensions, beyond the increase in cost of simulating from the model (and evaluating the joint density / its gradient).
@osorensen, as Bob also requested above, would it be possible to get more details about your model, for example a mathematical description?