Fit Stan model without explicit density - solely based on simulations?

Hello,

I need to implement a model involving a diffusion process. More specifically, for the sake of simplicity, assume that T_1, \dots, T_n are i.i.d. response times whose common distribution is that of the first passage time of a (general) diffusion process. That is:

T_i = \inf\{t > 0 : X_t = c\}

and X_t follows a general diffusion process with some unknown underlying parameters (collected in the vector \theta).

Now suppose that

a) there is no explicit formula for the density of the response time

b) however, it is possible to simulate T_i for any given parameter value \theta.
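For concreteness, here is a minimal sketch (in Python) of the kind of simulator I have in mind. The drift-diffusion dX_t = \mu \, dt + \sigma \, dW_t with X_0 = 0 and an upper threshold c is purely illustrative, as are the function names; any discretization such as Euler–Maruyama would do:

```python
import numpy as np

def simulate_fpt(mu, sigma, c, dt=1e-3, t_max=100.0, rng=None):
    # One draw of T = inf{t > 0 : X_t = c} for the illustrative diffusion
    # dX = mu dt + sigma dW started at X_0 = 0 (upward crossing assumed),
    # simulated with Euler-Maruyama. Returns np.inf if c is not hit by t_max.
    rng = np.random.default_rng() if rng is None else rng
    x, t = 0.0, 0.0
    while t < t_max:
        x += mu * dt + sigma * np.sqrt(dt) * rng.standard_normal()
        t += dt
        if x >= c:
            return t
    return np.inf

# e.g. n = 100 i.i.d. response times for theta = (mu, sigma) = (1.0, 0.5):
rng = np.random.default_rng(1)
times = [simulate_fpt(1.0, 0.5, c=2.0, rng=rng) for _ in range(100)]
```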

Is it possible (and if so, any advice on how to proceed would be highly welcome) to fit such a model in Stan?

Thanks for reading and for any help!


This seems to be a case of a doubly intractable problem, where you can’t even compute the likelihood. I have enjoyed limited success with noisy approximations with controlled error, but in all of those cases I could express my likelihood as an infinite sum, which I could then approximate with controlled truncation error. I don’t think you can get away with a Monte Carlo-based estimate of the density, because Stan relies on being able to compute derivatives of everything with respect to the parameters/quantities of interest. Coarsely said, the RNGs are not differentiable.
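To illustrate the truncation idea with a hypothetical sketch: the pattern is simply to sum series terms until the remaining tail is provably negligible. Here `terms(k)` is a stand-in for whatever series representation your density admits (assumed positive and eventually decreasing, so that stopping at a small term controls the truncation error):

```python
import numpy as np

def truncated_series_logpdf(terms, tol=1e-12, max_terms=10_000):
    # Sum terms(0) + terms(1) + ... until a term falls below tol; for a
    # positive, eventually decreasing series this bounds the neglected tail.
    total = 0.0
    for k in range(max_terms):
        t_k = terms(k)
        total += t_k
        if abs(t_k) < tol:
            break
    return np.log(total)
```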

@betanalpha can probably tell you a dozen more reasons noisy HMC is a bad idea.


Is there a reason why you would like to do this specifically with Stan, or would other software packages be OK? There are software packages with algorithms designed specifically for inference with stochastic simulators, such as ELFI - Engine for Likelihood-Free Inference — ELFI 0.8.0 documentation


This sounds like a problem that likelihood-free approximate Bayesian computation (ABC) techniques try to address. I’m specifically thinking of ABC based on sequential Monte Carlo (Toni et al. 2009 – I’m sure there are more recent papers as well). We’ve been trying to use these techniques to calibrate complex individual-based simulation models. With mixed results :). It’s extremely computationally expensive (despite being more efficient than simple rejection sampling), especially as the number of parameters increases. But hey, who knows, maybe it’s worth exploring!

Thanks for your reply – your argument that there is a problem with the computation of the gradient is convincing. I guess I would have to look for theorems on numerical approximation of the density in order to make progress with Stan.

Thank you very much for this hint. In fact, I do not need to stick to any particular approach (likelihood-based or Bayesian) or software. Any type of feasible inference for the described scenario would be great.
The link to ELFI looks very interesting!

Thanks for your proposal. I will have to take a look at the reference first, because I am not very familiar with ABC and sequential MC.
With respect to the computation: can I fit this type of model in Stan, or do I need to use libraries like TensorFlow?

Nope, you can’t do ABC-SMC in Stan. Stan is explicitly for full Bayesian inference and requires a tractable likelihood function (or an approximation of it).

There are other packages/software for ABC-SMC, but I find it is quite a jungle out there so I’m developing and coding my own algorithms :)

First, let me mention that, mathematically, if you can explicitly simulate a process then you can write down a probability distribution for that process. A probability density function or probability mass function representation might be complicated, but one can always be written down. Really, the only practical obstruction is not whether a probability distribution can be constructed but rather whether that distribution has a fixed dimension. In the diffusion case this limitation might arise when using a numerical discretization of the driving stochastic differential equation that uses a different number of time points from one solve to another.

That said, what can we do with simulations alone?

Bayesian inference is implemented by updating a prior distribution into a posterior distribution. When we have an explicit family of probability density functions for the observational model, \pi(y \mid \theta), we can evaluate them on the observed data, \tilde{y}, to give an explicit likelihood function \pi(\tilde{y} \mid \theta). The posterior distribution is then given by scaling the prior distribution with this function: formally,

\mathbb{E}_{\text{posterior}}[ f ] = \frac{ \mathbb{E}_{\text{prior}}[ \pi(\tilde{y} \mid \theta) \cdot f ] }{ \mathbb{E}_{\text{prior}}[ \pi(\tilde{y} \mid \theta) ] }

for all functions f : \Theta \rightarrow \mathbb{R}, or more colloquially for probability density functions,

\pi(\theta \mid \tilde{y} ) \propto \pi(\tilde{y} \mid \theta) \, \pi(\theta).

Now, constructing the posterior density function, \pi(\theta \mid \tilde{y} ), isn’t actually all that useful in practice. When we want to extract information from the posterior distribution we need to evaluate posterior expectation values, \mathbb{E}_{\text{posterior}}[ f ], which requires integrating against the posterior density function. In most cases these integrals can’t be done analytically, and so we have to rely on approximation methods. For example, Monte Carlo and Markov chain Monte Carlo construct expectation value estimators from sequences of samples from the posterior distribution,

\tilde{\theta}_{i} \sim \pi(\theta \mid \tilde{y} ) \\ \mathbb{E}_{\text{posterior}}[ f ] \approx \frac{1}{I} \sum_{i = 1}^{I} f(\tilde{\theta}_{i}).
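In code this estimator is nothing more than a sample average over whatever posterior draws one has managed to generate; a trivial sketch:

```python
import numpy as np

def mc_estimate(f, posterior_draws):
    # E_posterior[f] is approximated by (1/I) * sum_i f(theta_i)
    # for draws theta_i from pi(theta | y).
    return np.mean([f(theta) for theta in posterior_draws])
```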

Stan uses the unnormalized posterior density function, \pi(\theta \mid \tilde{y} ), to generate these samples. In other words, the closed-form likelihood function is needed to construct a closed-form posterior density function, which in turn is needed to inform Stan’s Hamiltonian Monte Carlo sampler how to explore.

Are there other ways of generating posterior samples? One common approach that people consider is rejection sampling. If we know the likelihood function – and its maximum value – then we can sample from the prior and reject any samples with likelihood values that are too small. More formally, we sample from the prior

\tilde{\theta} \sim \pi(\theta),

and construct the acceptance probability,

P[\text{accept}] = \frac{ \pi(\tilde{y} \mid \tilde{\theta} ) }{ \pi(\tilde{y} \mid \theta_{\max} ) },

sample from a unit uniform,

u \sim U[0, 1],

and then reject the prior sample if

u > P[\text{accept}] = \frac{ \pi(\tilde{y} \mid \tilde{\theta} ) }{ \pi(\tilde{y} \mid \theta_{\max} ) }.

The samples accepted by this process are guaranteed to be exact samples from the posterior distribution, but there’s no guarantee that we will accept any after a finite number of proposals!
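As a concrete sketch of this procedure (with illustrative names, working on the log scale for numerical stability; `log_lik(theta)` evaluates \log \pi(\tilde{y} \mid \theta) and `log_lik_max` is its value at \theta_{\max}):

```python
import numpy as np

def rejection_sample(prior_sampler, log_lik, log_lik_max, n_accept,
                     max_proposals=10**6, rng=None):
    # Propose theta ~ pi(theta) and accept with probability
    # pi(y | theta) / pi(y | theta_max) = exp(log_lik(theta) - log_lik_max).
    rng = np.random.default_rng() if rng is None else rng
    accepted = []
    for _ in range(max_proposals):
        if len(accepted) >= n_accept:
            break
        theta = prior_sampler(rng)
        if np.log(rng.uniform()) <= log_lik(theta) - log_lik_max:
            accepted.append(theta)
    return accepted  # may contain fewer than n_accept draws!
```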

The problem is that the prior distribution and the posterior distribution will not overlap strongly unless the likelihood function is nearly trivial. As we go to higher- and higher-dimensional model configuration spaces, reasonable overlap requires exponentially flatter likelihood functions. This is a method that works in theory but fails spectacularly in practice.

A close relative of rejection sampling is importance sampling, but it suffers from the same problems. For some discussion of this circumstance see, for example, Approximating posterior expectations by likelihood expectations.

Rejection sampling isn’t practical, but in any case it requires an explicit likelihood function, so it wouldn’t be viable in the simulation-only setting anyway. But what if we could reject prior samples another way? In particular, maybe we can reject a prior sample if simulations from the observational model look too different from the observed data?

“Approximate Bayesian Computation” methods, also known colloquially as “simulation-based inference” methods, introduce a distance function on the observational space to quantify how close simulated data are to the observed data.
Formally, we propose some model configuration \tilde{\theta}, which is often a sample from the prior but could also be generated in other ways, and simulate data conditional on that model configuration,

\tilde{y}' \sim \pi(y \mid \tilde{\theta}),

evaluate the distance between the observed data and the simulated data,

d( \tilde{y}, \tilde{y}' ),

and then reject the sample if the distance is above some threshold,

d( \tilde{y}, \tilde{y}' ) > \epsilon.
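A minimal sketch of this loop, in the same illustrative style (here `simulator(theta, rng)` stands in for whatever produces \tilde{y}' \sim \pi(y \mid \tilde{\theta}), and `distance` is the chosen distance function):

```python
import numpy as np

def abc_reject(prior_sampler, simulator, distance, y_obs, eps, n_accept,
               max_proposals=10**6, rng=None):
    # Keep theta_tilde whenever d(y_obs, y_sim) <= eps; reject otherwise.
    rng = np.random.default_rng() if rng is None else rng
    accepted = []
    for _ in range(max_proposals):
        if len(accepted) >= n_accept:
            break
        theta = prior_sampler(rng)      # theta_tilde ~ pi(theta)
        y_sim = simulator(theta, rng)   # y_tilde' ~ pi(y | theta_tilde)
        if distance(y_obs, y_sim) <= eps:
            accepted.append(theta)
    return accepted

# e.g. a summary-statistic distance that compares only mean response times:
# distance = lambda y, y_sim: abs(np.mean(y) - np.mean(y_sim))
```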

The theory behind this method is…a little flimsy. What one can show is that, in certain cases, if the posterior concentrates to a point with infinite data, then in the limit \epsilon \rightarrow 0 this simulation-based rejection procedure will also return that point. It is much, much harder to say anything about what happens when \epsilon > 0, or when we have only finite data so that the posterior distribution does not collapse to a point. Many people naively assume that these methods generate approximate posterior samples which can be used to estimate posterior expectation values, but the theory doesn’t actually support that!

In practice we can’t take \epsilon = 0 or else we would just reject everything. The smaller the threshold, the better the samples will hopefully approximate exact posterior samples, but also the more likely we are to reject samples. Just as in rejection sampling, the ultimate performance of any such technique strongly depends on just how well we can propose model configurations from the posterior typical set in the first place.

The bigger problem, however, is the distance function itself. The choice of distance corresponds to a choice of which properties of the observational space are compared to inform a rejection. For example, a distance function might consider only some components of the space, ignoring others, or only some summaries, such as comparing centrality but ignoring breadth. This is why the distance function is often said to be based on a “summary statistic”. These Approximate Bayesian Computation methods reject only those model configurations that give rise to simulated data sets with the wrong summaries; configurations whose simulations match the summaries are accepted even if the simulations look very different from the observed data in other ways!

What this means is that even in the \epsilon \rightarrow 0 limit this procedure can return samples that look nothing like exact posterior samples. There is no guarantee that the output samples will be contained in the posterior typical set, or that they will expand into the entire posterior typical set. Consequently, any posterior expectation value estimators derived from these samples can be arbitrarily wrong.

To summarize, these “simulation-based” methods offer few if any guarantees of accurate posterior quantification. Typically, more accurate posterior quantification requires either very clever proposal mechanisms or simply lots of proposals, with the corresponding cost of generating many rejections for each acceptance.

Because of these issues much of the statistical analysis has moved on to treating these methods not so much as approximate computation but rather as approximate modeling. The simulation mechanism and distance function together can be interpreted as defining an implicit likelihood function, and hence an implicit posterior distribution, with the rejection step generating exact samples from this implicit model. How close this implicit likelihood function and posterior distribution are to the actual likelihood function and posterior distribution specified by your model, however, is left completely unspecified. In that sense it’s not Approximate (Bayesian Computation) but rather (Approximate Bayesian) Computation.

I have spent a lot of time trying to make probabilistic computation more feasible, and I’ve investigated these approaches quite a bit. In my informed opinion they are fundamentally flawed and should not be used in any meaningful analysis. Others, however, will have different opinions. There are many papers laying out variations of these methods and many packages implementing particular variations with which you can experiment. Just keep in mind your ultimate objective – if it’s to accurately quantify the posterior distribution, then judge these methods by that criterion.


Thank you very much for these detailed insights. I guess this will spare me a lot of trouble. I might therefore look for alternative approaches first, and only as a last resort consider the ABC approach.

It is also a great reference for someone like me, who is not really familiar with the topic but needs a concise overview.


Is it not generally true that ABC as described samples from the approximate posterior

\pi(\theta, \tilde{y}' \mid d(\tilde{y}, \tilde{y}') < \epsilon)?

I take your point that the distance may be hard to choose appropriately. There is also a paper on these limitations:

https://www.pnas.org/content/108/37/15112

And, as you mentioned, many newish papers proposing improvements:

https://rss.onlinelibrary.wiley.com/doi/10.1111/j.1467-9868.2011.01010.x

A consistent ABC method will generate values consistent with that distribution, but few if any methods come with guarantees that those values will expand into the full breadth of that distribution. In other words, if successful they will quantify a part of that distribution, but maybe not all of it.

Markov chain Monte Carlo can suffer from similar problems. The existence of a stationary distribution does not imply that Markov chains will explore that distribution in finite, or even infinite, time. Instead we need careful technical conditions to ensure first asymptotic consistency and then pre-asymptotic consistency (which is much harder, and is why I’m always yelling about central limit theorems).