Particle marginal Metropolis-Hastings in Stan?

I am interested in fitting a state space model (a dynamic structural equation model, to be specific), where I have observed time series for a large number of subjects. I assume that the latent process is measured by multiple observations at each timepoint, and hence I need to compute the marginal likelihood, integrating over the distribution of the latent variables, in order to update the parameters in the model.

If I had a Gaussian measurement model this would be easy to do exactly using the gaussian_dlm_obs() function. However, I am in a situation where the measurements follow some exponential family distribution instead, so a Kalman filter is not appropriate.

A natural approach is to replace the Kalman filter with a particle filter, yielding the particle marginal Metropolis-Hastings algorithm of Andrieu et al. (2010). Conceptually, I believe that all the consistency results of Andrieu et al. should carry over to the NUTS sampler, but I'm not sure if it's practically feasible.

So my question is: Does it at all make sense within Stan to try to replace gaussian_dlm_obs() with a function I implement myself that estimates the marginal likelihood by running multiple (stochastic) particle filters using the current values of the model parameters?

Using a particle filter to stochastically estimate the marginal likelihood as part of computing the target log density would be problematic when combined with Stan's gradient-based Hamiltonian Monte Carlo (HMC) sampling, as the target density is expected to be a smooth, deterministic function of the model parameters. If each evaluation of the target log density and its gradient is subject to noise, this breaks the nice properties the leapfrog integrator usually brings (see Betancourt, 2015), namely a bounded change in the Hamiltonian, and hence a high probability of acceptance, over long simulated trajectories.

One option to overcome this issue with noisy gradients is to 'fix' the randomness used in the marginal likelihood estimates within a given proposal - that is, sample one set of values for all auxiliary random variates used in the particle filter estimate of the marginal likelihood and its gradient, and use these in all subsequent evaluations during the current proposal (for different values of the model parameters). In between the updates to the parameters given fixed values of these auxiliary variates, we need to update the values of the auxiliary variates to ensure ergodicity. A simple approach is to use a Metropolis independence sampler step to update the auxiliary variates given fixed values of the parameters - this just corresponds to computing a second independent marginal likelihood estimate for fixed values of the parameters, and accepting the new estimate (and auxiliary variates) based on the ratio of the estimated target density values.

This basic approach (though without using HMC for the parameter updates) is what is described in the papers on pseudo-marginal slice sampling (Murray and Graham, 2016) and correlated pseudo-marginal Metropolis-Hastings (Deligiannidis, Doucet and Pitt, 2018), with the idea further adapted specifically to the HMC setting in pseudo-marginal Hamiltonian Monte Carlo (Alenlöv, Doucet and Lindsten, 2021). In practice, rather than using an independence sampler to update the auxiliary variates, it usually makes sense to use a perturbative proposal such as random-walk Metropolis or slice sampling.
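As a toy illustration of this fixed-randomness scheme (not anyone's production method - a simple importance sampling estimator stands in for a particle filter), here is a minimal Python sketch for the conjugate model z | θ ~ N(θ, 1), y | z ~ N(z, 1), θ ~ N(0, 1), whose exact posterior is N(y/3, 2/3):

```python
import math, random

random.seed(0)

y_obs = 1.2   # single toy observation
K = 32        # number of auxiliary variates (importance samples)

def normal_pdf(x, mu, sd):
    return math.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def log_marginal_estimate(theta, u):
    # unbiased importance sampling estimate of p(y | theta), with the latent
    # z_k = theta + u_k generated from fixed variates u_k ~ N(0, 1); for
    # fixed u this is a deterministic (and here smooth) function of theta
    w = [normal_pdf(y_obs, theta + u_k, 1.0) for u_k in u]
    return math.log(sum(w) / len(w))

def log_prior(theta):
    return -0.5 * theta ** 2  # N(0, 1) prior, up to an additive constant

theta = 0.0
u = [random.gauss(0.0, 1.0) for _ in range(K)]
log_post = log_prior(theta) + log_marginal_estimate(theta, u)

samples = []
for _ in range(2000):
    # (a) random-walk Metropolis update of theta with u held fixed
    theta_prop = theta + random.gauss(0.0, 0.5)
    lp_prop = log_prior(theta_prop) + log_marginal_estimate(theta_prop, u)
    if math.log(random.random()) < lp_prop - log_post:
        theta, log_post = theta_prop, lp_prop
    # (b) Metropolis independence update of u with theta held fixed:
    #     a second independent marginal likelihood estimate, accepted or
    #     rejected on the ratio of the two estimated target densities
    u_prop = [random.gauss(0.0, 1.0) for _ in range(K)]
    lp_prop = log_prior(theta) + log_marginal_estimate(theta, u_prop)
    if math.log(random.random()) < lp_prop - log_post:
        u, log_post = u_prop, lp_prop
    samples.append(theta)

mean_theta = sum(samples[500:]) / len(samples[500:])  # exact posterior mean is y_obs / 3
```

Step (b) is the "second independent marginal likelihood estimate" described above; replacing the independence proposal on u with a correlated perturbation would give a correlated pseudo-marginal scheme.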

While this approach of fixing the randomness in the estimator will overcome the issue with noise in the density and gradients, specifically for particle filter estimators of the marginal likelihood there is an additional issue: the estimated density, for a fixed set of auxiliary random variates, is not a smooth function of the parameters, due to the non-differentiability of the resampling steps. This will introduce discontinuities that will still severely limit the effectiveness of an HMC-based update to the parameters. An alternative is to use a simple sequential importance sampling based estimate of the marginal likelihood, which avoids the resampling steps; however, this then suffers from high variance of the estimator, which manifests itself in pathological sticking behaviour in the resulting chains when using pseudo-marginal type updates.
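For concreteness, here is a Python sketch (toy linear-Gaussian model, made-up parameter values) of a sequential importance sampling estimator using the prior dynamics as proposal and no resampling, compared against the exact Kalman filter marginal likelihood. Because the auxiliary variates `us` are held fixed and there is no resampling, `sis_loglik` is a smooth function of the parameter `a`:

```python
import math, random

random.seed(3)

# toy linear-Gaussian state space model (assumed for illustration):
#   x_1 ~ N(0, 1),  x_t = a x_{t-1} + N(0, q),  y_t = x_t + N(0, r)
a, q, r = 0.8, 0.5, 0.5
T, K = 5, 20000

# simulate a short observation sequence
x, ys = random.gauss(0.0, 1.0), []
ys.append(x + random.gauss(0.0, math.sqrt(r)))
for _ in range(T - 1):
    x = a * x + random.gauss(0.0, math.sqrt(q))
    ys.append(x + random.gauss(0.0, math.sqrt(r)))

def kalman_loglik(a, ys):
    # exact log marginal likelihood via the Kalman filter
    ll, m, P = 0.0, 0.0, 1.0
    for t, y in enumerate(ys):
        if t > 0:
            m, P = a * m, a * a * P + q
        S = P + r
        ll += -0.5 * (math.log(2 * math.pi * S) + (y - m) ** 2 / S)
        gain = P / S
        m, P = m + gain * (y - m), (1 - gain) * P
    return ll

def sis_loglik(a, ys, us):
    # sequential importance sampling with the prior dynamics as proposal
    # and NO resampling; us[k] is the fixed sequence of standard-normal
    # auxiliary variates for particle k
    logw = []
    for u in us:
        x, lw = u[0], 0.0  # x_1 = u[0] since x_1 ~ N(0, 1)
        for t, y in enumerate(ys):
            if t > 0:
                x = a * x + math.sqrt(q) * u[t]
            lw += -0.5 * (math.log(2 * math.pi * r) + (y - x) ** 2 / r)
        logw.append(lw)
    m = max(logw)
    return m + math.log(sum(math.exp(lw - m) for lw in logw) / len(us))

us = [[random.gauss(0.0, 1.0) for _ in range(T)] for _ in range(K)]
exact = kalman_loglik(a, ys)
estimate = sis_loglik(a, ys, us)
```

With a resampling step the map from `a` to the estimate would jump whenever a perturbation of `a` changes which ancestors are selected; without it, the price is the estimator variance growing quickly with `T`.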

The auxiliary variates in a sequential importance sampling estimate of the marginal likelihood are just the state particles at each observation time, and in the limiting case of one particle we revert to targeting the joint distribution on the latent states and model parameters directly (modulo some reparametrization depending on the choice of proposal in the particle filter). This suggests a simple alternative to pseudo-marginal approaches: just use HMC to sample from the joint distribution on the model parameters and sequence of latent states directly, with the samples of the model parameters then corresponding to samples of the marginal distribution of interest. As the latent states are updated jointly with the model parameters, this can be a more effective solution than the approaches described above, which alternately update the parameters given the auxiliary variables (latent states) and the latent states given the parameters, as typically the parameters and latent variables will have complex dependencies. Depending on how informative the observations are, it can sometimes be beneficial to instead use a non-centred parametrization in terms of the state noise increments rather than the latent states themselves (there being a deterministic relationship between the two given the initial state).
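The deterministic relationship mentioned at the end can be sketched in Python for a scalar AR(1) latent process (an assumed simplification of the general state model):

```python
# deterministic map between state noise increments z_{1:T} ~ N(0, 1)
# (non-centred parametrization) and latent states eta_{1:T} (centred),
# for a scalar AR(1) process eta_t = alpha + b * eta_{t-1} + sigma * z_t
def increments_to_states(z, alpha, b, sigma, eta0=0.0):
    states, prev = [], eta0
    for z_t in z:
        prev = alpha + b * prev + sigma * z_t
        states.append(prev)
    return states

def states_to_increments(eta, alpha, b, sigma, eta0=0.0):
    incs, prev = [], eta0
    for eta_t in eta:
        incs.append((eta_t - alpha - b * prev) / sigma)
        prev = eta_t
    return incs

# round trip: the two parametrizations carry the same information,
# so sampling z jointly with the parameters and transforming is
# equivalent to sampling eta directly
z = [0.3, -1.2, 0.5, 2.0]
eta = increments_to_states(z, alpha=0.1, b=0.8, sigma=0.4)
z_back = states_to_increments(eta, alpha=0.1, b=0.8, sigma=0.4)
```

In the non-centred version the sampler works on z, whose prior is standard normal independent of the parameters, which often gives better-conditioned posterior geometry when the observations are weakly informative.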


If you can write the model down in mathematical notation, that's usually a huge help in these discussions.

Usually you can turn this around and just write the density directly in Stan. It may not be efficient enough, though conceptually it is fine to do.

I'm not sure what results you are expecting to carry over? I find Andrieu et al.'s paper over my head in terms of measure theory.

This won't be a problem if you just write the complete density down rather than trying to frame it as a filtering problem. As I mentioned above, scale is usually the problem that defeats this. The Kalman filter updates are very efficient due to their conjugacy.

Often you don't have to calculate marginal likelihoods at all; you can just write down the full data likelihood and parameterize the unknowns. Then MCMC does the marginalization for you implicitly. That is, if I have p(y, z \mid \theta), I can either try to marginalize out z to get a marginal likelihood p(y \mid \theta), or I can introduce parameters for z, fit the joint model, and then just forget about the z later.


Yes, that is what I also suggested above after the snippet you quoted 😉:

For the point

As I mentioned above, scale is usually the problem that defeats this.

Do you mean that this approach becomes computationally too costly in terms of the expense of gradient evaluations (which should be roughly the same as simulating one trajectory from the model, so significantly cheaper than running a particle filter)? Or because of the resulting high dimension of the latent space and the potentially challenging geometry of the posterior? Or something else?

From previous numerical experiments I've done with applying HMC to jointly sample latent states and parameters in state space models (using JAX to implement the model and another Python package, Mici, to sample the chains, but using the same dynamic HMC + adaptation algorithms as Stan, so it should be largely similar to using Stan directly), I found that the approach generally performed well provided the observations are not too strongly informative, though this was in cases where the dimension of the latent state was small (two or three dimensional). Depending on the specifics of the model, I think things could still scale well in larger state dimensions, beyond the increase in cost of simulating from the model (and evaluating the joint density / its gradient).

@osorensen, as Bob also requested above, would it be possible to get more details about your model, for example a mathematical description?


Thanks to both of you for your good answers @matt-graham and @Bob_Carpenter. Here are some mathematical details that I hope are sufficient to explain what I'm aiming for.

I am trying to implement an extension of dynamic structural equation models to a situation with non-Gaussian responses. Consider the following setting: we have N subjects i=1,\dots,N, for each of whom we have T observations at equally spaced timepoints t=1,\dots,T, where T is typically at least in the hundreds. For each subject-timepoint we observe a vector of responses y_{it} \in \mathbb{R}^{V}. Let us assume all elements of the response vector follow the same exponential family distribution, so for the v-th component we have

f(y_{itv}) = \exp\left\{ \frac{y_{itv}\theta(\mu_{itv}) - b(\theta(\mu_{itv}))}{\phi} + c(y_{itv},\phi) \right\}

where

\mu_{itv} = g^{-1}(\nu_{v} + \Lambda_{t} \eta_{i,t}).

In the above equation, \eta_{i,t} \in \mathbb{R}^{U}, with U < V, is a vector of latent variables, \Lambda_{t} is an appropriately sized matrix of factor loadings, and g^{-1}(\cdot) is an inverse link function, which in this case may well be the canonical link. So the point of this model is to represent the high-dimensional responses y_{it}, which might come from any exponential family distribution, by a lower-dimensional latent variable vector \eta_{i,t}, which is Gaussian.
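For concreteness, here is a minimal Python sketch of this measurement model assuming Poisson responses with the canonical log link; the dimensions and parameter values are made up:

```python
import math

# toy dimensions (assumed): V = 3 observed responses, U = 1 latent factor
nu = [0.2, -0.1, 0.4]            # intercepts nu_v
Lam = [[1.0], [0.7], [1.3]]      # V x U loading matrix Lambda_t
eta = [0.5]                      # latent state eta_{i,t}

def poisson_logpmf(y, mu):
    # exponential family log density for the Poisson case
    return y * math.log(mu) - mu - math.lgamma(y + 1)

def means(nu, Lam, eta):
    # linear predictor nu_v + (Lambda_t eta_{i,t})_v, mapped through the
    # inverse canonical link g^{-1} = exp for the Poisson case
    lin = [nu_v + sum(l * e for l, e in zip(row, eta))
           for nu_v, row in zip(nu, Lam)]
    return [math.exp(x) for x in lin]

mu = means(nu, Lam, eta)
y_it = [1, 0, 2]  # one observed response vector (made up)
loglik = sum(poisson_logpmf(y, m) for y, m in zip(y_it, mu))
```

Swapping `poisson_logpmf` and `exp` for a Bernoulli mass function and inverse logit would give the logistic case.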

Next, the latent variables have an autoregressive dynamic, and are given by

\eta_{i,t} = \alpha + \sum_{l=0}^{L} \left( B_{l} \eta_{i,t-l} + \Gamma_{l} X_{i,t-l} \right) + \xi_{it}

where \xi_{it} \sim N(0,\sigma^{2} I), \alpha \in \mathbb{R}^{U} is a vector of intercepts, X_{i,t-l} is a vector of covariates, \Gamma_{l} is a matrix of regression coefficients, and B_{l} is a properly sized matrix, where B_{0} is strictly upper-triangular to ensure that the model is recursive.
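A Python sketch of simulating one subject's latent dynamics under these assumptions (U = 2, L = 1, made-up parameter values, and with the covariate term \Gamma_l X_{i,t-l} dropped for brevity); the contemporaneous B_0 \eta_{i,t} term is resolved by back-substitution, which is possible precisely because B_0 is strictly upper triangular:

```python
import random

random.seed(1)

U, T = 2, 6
alpha = [0.1, -0.2]                  # intercepts (made up)
B0 = [[0.0, 0.5], [0.0, 0.0]]        # strictly upper-triangular contemporaneous matrix
B1 = [[0.4, 0.0], [0.1, 0.3]]        # lag-1 autoregressive matrix
sigma = 0.3                          # state noise standard deviation

def step(eta_prev, noise):
    # everything on the right-hand side except the contemporaneous term
    rhs = [alpha[u] + sum(B1[u][v] * eta_prev[v] for v in range(U)) + noise[u]
           for u in range(U)]
    # solve (I - B0) eta_t = rhs; strict upper-triangularity of B0 lets us
    # back-substitute from the last component upwards
    eta_t = [0.0] * U
    for u in reversed(range(U)):
        eta_t[u] = rhs[u] + sum(B0[u][v] * eta_t[v] for v in range(u + 1, U))
    return eta_t

eta, path = [0.0] * U, []
for _ in range(T):
    noise = [random.gauss(0.0, sigma) for _ in range(U)]
    eta = step(eta, noise)
    path.append(eta)
```

The back-substitution is exactly the recursiveness the strictly upper-triangular B_0 buys: the last latent component depends on no contemporaneous component, the one above it only on the last, and so on.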

The situation is now that we have “top-level” parameters \Psi = \{\Lambda_{t},\nu_{v},B_{l},\Gamma_{l},\alpha,\sigma^{2},\phi\}, which are quite easy to sample using Stan, given that I define decent priors. These parameters, which govern the time series dynamics, can also be made individual-specific, e.g., \Lambda_{i,t} if I want the factor loadings to vary between individuals. To avoid sampling \eta_{1:T,1:N} for all timepoints and all subjects, I would be happy to just integrate them out, to get

p(\Psi | y) = \int p(\eta_{1:T,1:N}, \Psi | y_{1:T,1:N}) d \eta_{1:T,1:N}

which I guess can also be written, up to a normalizing constant, as

p(\Psi | y) \propto p(\Psi) \int p(y_{1:T,1:N} | \eta_{1:T,1:N}, \Psi) p(\eta_{1:T,1:N} | \Psi) d\eta_{1:T,1:N}

If the inverse link function g^{-1}(\cdot) above were the identity function, and hence y_{1:T,1:N} were Gaussian variables, the integral could be computed exactly. I imagine this could be done pretty easily with the function gaussian_dlm_obs(), although I haven't actually tried this. Presumably this would also let me analytically compute the predicted values of all latent variables at all timepoints, in case I needed them.

My question, however, is for the case in which g^{-1}(\cdot) is not the identity function, e.g., with logistic or Poisson regression. Based on reading Andrieu et al. (2010), I realized that running particle filters could be a way to compute this integral inside MCMC while still targeting the correct posterior. In particle marginal Metropolis-Hastings, a particle filter estimate of the marginal likelihood is used in the Metropolis-Hastings ratio to accept or reject proposals for the parameters \Psi.
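As a sketch of what PMMH looks like outside Stan (a toy scalar model with Poisson observations, a bootstrap particle filter, and a random-walk proposal over the AR coefficient; all of it illustrative and none of it the model above):

```python
import math, random

random.seed(2)

# toy non-Gaussian state space model (an assumption for illustration):
#   x_t = a * x_{t-1} + N(0, 0.5^2),   y_t | x_t ~ Poisson(exp(x_t))
SD = 0.5

def poisson_sample(lam):
    # Knuth's inversion-by-multiplication algorithm (stdlib has no Poisson sampler)
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        k += 1
        p *= random.random()
        if p <= L:
            return k - 1

def simulate_data(a, T):
    x, ys = 0.0, []
    for _ in range(T):
        x = a * x + random.gauss(0.0, SD)
        ys.append(poisson_sample(math.exp(x)))
    return ys

def pf_loglik(a, ys, K=200):
    # bootstrap particle filter estimate of log p(y_{1:T} | a)
    xs, total = [0.0] * K, 0.0
    for y in ys:
        xs = [a * x + random.gauss(0.0, SD) for x in xs]
        logw = [y * x - math.exp(x) - math.lgamma(y + 1) for x in xs]
        m = max(logw)
        w = [math.exp(lw - m) for lw in logw]
        total += m + math.log(sum(w) / K)
        xs = random.choices(xs, weights=w, k=K)  # multinomial resampling
    return total

# particle marginal Metropolis-Hastings over the AR coefficient a,
# with a flat prior on (-1, 1) so the prior cancels in the ratio
ys = simulate_data(0.8, 30)
a, ll = 0.5, pf_loglik(0.5, ys)
samples = []
for _ in range(200):
    a_prop = a + random.gauss(0.0, 0.1)
    if abs(a_prop) < 1.0:
        ll_prop = pf_loglik(a_prop, ys)  # fresh stochastic estimate
        if math.log(random.random()) < ll_prop - ll:
            a, ll = a_prop, ll_prop
    samples.append(a)
```

Note that each proposal uses a fresh stochastic likelihood estimate, which is exactly the randomness that sits badly with Stan's deterministic, differentiable target density as discussed above.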

My question was hence whether this would make sense within Stan. Based on @matt-graham's answer, it does seem to me that this would, to say the least, not be straightforward.

Exactly. If you have multiple observations across time, and each of them is a multivariate normal update in high dimensions with covariance, it gets quite costly.