Fit Stan model without explicit density - solely based on simulations?

Firstly let me mention that, mathematically, if you can explicitly simulate a process then you can write down a probability distribution for that process. A probability density function or probability mass function representation might be complicated, but one can always write one down. Really the only practical obstruction is not whether or not a probability distribution can be constructed but rather whether or not that distribution has a fixed dimension. In the diffusion case this limitation might arise when using a numerical discretization of the driving stochastic differential equation that uses a different number of time points from one solve to another.

That said, what can we do with simulations alone?

Bayesian inference is implemented by updating a prior distribution into a posterior distribution. When we have an explicit family of probability density functions for the observational model, \pi(y \mid \theta), we can evaluate them on the observed data, \tilde{y}, to give an explicit likelihood function \pi(\tilde{y} \mid \theta). The posterior distribution is then given by scaling the prior distribution with this function: formally,

\mathbb{E}_{\text{posterior}}[ f ] = \frac{ \mathbb{E}_{\text{prior}}[ \pi(\tilde{y} \mid \theta) \, f ] }{ \mathbb{E}_{\text{prior}}[ \pi(\tilde{y} \mid \theta) ] }

for all functions f : \Theta \rightarrow \mathbb{R}, or more colloquially for probability density functions,

\pi(\theta \mid \tilde{y} ) \propto \pi(\tilde{y} \mid \theta) \, \pi(\theta).
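
To make the expectation identity above concrete, here is a minimal sketch in Python that checks it numerically on a hypothetical conjugate normal model of my own choosing, where the posterior is available in closed form and the prior expectations are estimated by plain Monte Carlo over prior draws:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical conjugate model: theta ~ N(0, 1), y | theta ~ N(theta, 1).
# For a single observation y_tilde the posterior is N(y_tilde / 2, 1 / 2),
# so the exact posterior mean of f(theta) = theta is y_tilde / 2.
y_tilde = 1.3

def likelihood(theta):
    # pi(y_tilde | theta) for the normal observational model
    return np.exp(-0.5 * (y_tilde - theta) ** 2) / np.sqrt(2 * np.pi)

theta = rng.normal(0.0, 1.0, size=1_000_000)  # draws from the prior
w = likelihood(theta)

# E_post[f] = E_prior[likelihood * f] / E_prior[likelihood]
post_mean = np.mean(w * theta) / np.mean(w)
print(post_mean, "vs exact", y_tilde / 2)
```

Estimating the prior expectations with prior draws like this is exactly the importance sampling idea mentioned below, and it degrades quickly once the likelihood concentrates away from the prior.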

Now constructing the posterior density function, \pi(\theta \mid \tilde{y} ), isn’t actually all that useful in practice. When we want to extract information from the posterior distribution we need to evaluate posterior expectation values, \mathbb{E}_{\text{posterior}}[ f ], which requires integrating over the posterior density function. In most cases these integrals can’t be done analytically and so we have to rely on approximation methods. For example Monte Carlo and Markov chain Monte Carlo construct expectation value estimators from sequences of samples from the posterior distribution,

\tilde{\theta}_{i} \sim \pi(\theta \mid \tilde{y} ) \\ \mathbb{E}_{\text{posterior}}[ f ] \approx \frac{1}{I} \sum_{i = 1}^{I} f(\tilde{\theta}_{i}).
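
For instance, with exact draws from the posterior of the same hypothetical normal model, the estimator is just an empirical average:

```python
import numpy as np

rng = np.random.default_rng(2)

# Exact posterior draws for the hypothetical model above: N(y_tilde / 2, 1 / 2).
y_tilde = 1.3
theta = rng.normal(y_tilde / 2, np.sqrt(0.5), size=10_000)

# Monte Carlo estimator of a posterior expectation value, here f(theta) = theta^2,
# along with its standard error.
f = theta ** 2
print(np.mean(f), "+/-", np.std(f) / np.sqrt(len(f)))
```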

Stan uses the unnormalized posterior density function, \pi(\theta \mid \tilde{y} ), to generate these samples. In other words the closed-form likelihood function is needed to construct a closed-form posterior density function, which is what informs Stan’s Hamiltonian Monte Carlo sampler how to explore.

Are there other ways of generating posterior samples? One common approach that people consider is rejection sampling. If we know the likelihood function – and its maximum value – then we can sample from the prior and reject any samples with likelihood values that are too small. More formally we sample from the prior

\tilde{\theta} \sim \pi(\theta),

and construct the acceptance probability,

P[\text{accept}] = \frac{ \pi(\tilde{y} \mid \tilde{\theta} ) }{ \pi(\tilde{y} \mid \theta_{\max} ) },

sample from a unit uniform,

u \sim U[0, 1],

and then reject the prior sample if

u > P[\text{accept}] = \frac{ \pi(\tilde{y} \mid \tilde{\theta} ) }{ \pi(\tilde{y} \mid \theta_{\max} ) }.

The samples accepted by this process are guaranteed to be exact samples from the posterior distribution, but there’s no guarantee that we will accept any after a finite number of proposals!
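
As a sketch, assuming the same hypothetical normal model as above – where the likelihood is maximized at \theta_{\max} = \tilde{y} – the full procedure looks like this:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical model: theta ~ N(0, 1), y | theta ~ N(theta, 1), observed y_tilde.
y_tilde = 1.3

def likelihood(theta):
    return np.exp(-0.5 * (y_tilde - theta) ** 2) / np.sqrt(2 * np.pi)

lik_max = likelihood(y_tilde)  # likelihood is maximized at theta_max = y_tilde

samples = []
for _ in range(100_000):
    theta = rng.normal(0.0, 1.0)            # sample from the prior
    p_accept = likelihood(theta) / lik_max  # acceptance probability
    if rng.uniform(0.0, 1.0) <= p_accept:   # u ~ U[0, 1]; reject if u > p_accept
        samples.append(theta)

samples = np.asarray(samples)
print(len(samples), "accepted; estimated posterior mean:",
      samples.mean(), "exact:", y_tilde / 2)
```

Even in this one-dimensional example roughly half of the proposals are rejected; in higher dimensions the acceptance rate collapses.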

The problem is that the prior distribution and posterior distribution will not overlap strongly unless the likelihood function is nearly trivial. As we go to higher and higher dimensional model configuration spaces, reasonable overlap requires exponentially flatter likelihood functions. This is a method that works in theory but fails spectacularly in practice.

A close relative of rejection sampling is importance sampling, but it suffers from the same problems. For some discussion of these circumstances see, for example, Approximating posterior expectations by likelihood expectations.

Rejection sampling isn’t practical, and because it requires an explicit likelihood function it wouldn’t be viable in the simulation-only setting anyway. But what if we could reject prior samples another way? In particular, maybe we can reject a prior sample if simulations from the observational model look too different from the observed data?

“Approximate Bayesian Computation” methods, also known colloquially as “simulation-based inference” methods, introduce a distance metric on the observation space to quantify how close simulated data are to the observed data.
Formally we propose some model configuration \tilde{\theta}, which is often a sample from the prior but could also be generated in other ways, simulate data conditional on that model configuration,

\tilde{y}' \sim \pi(y \mid \tilde{\theta}),

evaluate the distance between the observed data and the simulated data,

d( \tilde{y}, \tilde{y}' ),

and then reject the sample if the distance is above some threshold,

d( \tilde{y}, \tilde{y}' ) > \epsilon.
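
Putting these steps together, here is a minimal ABC rejection sketch for the same hypothetical normal model, with an arbitrary mean-based distance and threshold of my own choosing:

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical model: theta ~ N(0, 1); y | theta is N_obs draws from N(theta, 1).
N_obs = 20
y_obs = rng.normal(0.8, 1.0, size=N_obs)  # stand-in "observed" data

def distance(y, y_prime):
    # An arbitrary summary-based distance: compare only the sample means.
    return abs(np.mean(y) - np.mean(y_prime))

epsilon = 0.1
accepted = []
for _ in range(100_000):
    theta = rng.normal(0.0, 1.0)                # propose from the prior
    y_sim = rng.normal(theta, 1.0, size=N_obs)  # simulate data given theta
    if distance(y_obs, y_sim) <= epsilon:       # reject if d > epsilon
        accepted.append(theta)

accepted = np.asarray(accepted)
print(len(accepted), "accepted; ABC posterior mean:", accepted.mean())
```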

The theory behind this method is…a little flimsy. What one can show is that in certain cases if the posterior concentrates to a point with infinite data then in the limit \epsilon \rightarrow 0 this simulation-based rejection procedure will also return that point. It is much, much harder to say anything about what happens when \epsilon > 0 or when we have only finite data so that the posterior distribution does not collapse to a point. Many people naively assume that these methods generate approximate posterior samples which can be used to estimate posterior expectation values, but the theory doesn’t actually support that!

In practice we can’t take \epsilon = 0 or else we would just reject everything. The smaller the threshold the better the accepted samples will hopefully approximate exact posterior samples, but also the more likely we are to reject proposals. Just as in rejection sampling the ultimate performance of any such technique strongly depends on just how well we can propose model configurations from the posterior typical set in the first place.

The bigger problem, however, is the distance metric itself. The choice of distance metric corresponds to a choice of which properties of the observational space are compared to inform a rejection. For example a distance metric might consider only some components of the space, ignoring others, or only some summaries, such as comparing centrality but ignoring breadth. This is why the distance function is often referred to as a “summary statistic”. These Approximate Bayesian Computation methods reject only those model configurations that give rise to simulated data sets with the wrong summaries; simulations with the right summaries are accepted even if they look very different from the observed data in every other way!
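
To see how blind a summary-based distance can be, consider the mean-only distance from the sketch above applied to two data sets with wildly different spreads (again a hypothetical construction):

```python
import numpy as np

rng = np.random.default_rng(5)

y_obs = rng.normal(0.0, 1.0, size=10_000)   # "observed" data with unit scale
y_sim = rng.normal(0.0, 10.0, size=10_000)  # simulated data ten times wider

# A mean-only distance cannot see the difference in spread.
d = abs(np.mean(y_obs) - np.mean(y_sim))
print("mean-only distance:", d)  # typically small
print("observed std:", np.std(y_obs), "simulated std:", np.std(y_sim))
```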

What this means is that even in the \epsilon \rightarrow 0 limit this procedure can return samples that look nothing like exact posterior samples. There is no guarantee that the output samples will be contained in the posterior typical set, or that they will expand into the entire posterior typical set. Consequently any posterior expectation value estimators derived from these samples can be arbitrarily wrong.

To summarize, these “simulation-based” methods offer few if any guarantees for accurate posterior quantification. Typically more accurate posterior quantification requires either very clever proposal mechanisms or simply lots of proposals, with the corresponding cost of generating many rejections for each acceptance.

Because of these issues much of the statistical analysis has moved on to treating these methods not so much as approximate computation but rather as approximate modeling. The simulation mechanism and distance function together can be interpreted as defining an implicit likelihood function, and hence an implicit posterior distribution, with the rejection sampling generating exact samples from this implicit model. How close this likelihood function and posterior distribution are to the actual likelihood function and posterior distribution specified by your model, however, is left completely unspecified. In that sense it’s not Approximate (Bayesian Computation) but rather (Approximate Bayesian) Computation.

I have spent a lot of time trying to make probabilistic computation more feasible, and I’ve investigated these approaches quite a bit. In my informed opinion they are fundamentally flawed and should not be used in any meaningful analysis. Others, however, will have different opinions. There are many papers laying out variations of these methods and many packages implementing certain variations with which you can experiment. Just keep in mind your ultimate objective – if it’s to accurately quantify the posterior distribution then judge these methods by that criterion.
