Ok, further thoughts:

Suppose we have a scheme for sampling locations from a target distribution p(x) using some sort of HMC-like process, but it has divergences, and/or it involves smoothing out regions where divergences occur, or it adds noise to the system in the process of reducing computational cost, or the like. In the end, we expect that it produces a biased sample, but we don't know exactly what happens.

Suppose, then, that it samples from a density proportional to p(x)b(x), where p(x) is our desired target density and b(x) is a bounded non-negative function biasing the distribution.

Let's further suppose that the bias is in some sense small, say in terms of a Kullback-Leibler divergence, or the expectation of b(x) under p(x), or the like.

Now, suppose we have an exact but potentially computationally expensive Markov process T(x, i) which converges to p(x) exactly and doesn't suffer from these biases (for example, diffusive Metropolis with an isotropic normal random-walk proposal), but which also doesn't give as broadly distributed a sample in the same wall time. It moves like a random walk.

Let {x_i} be a sample from the original HMC process, then:

{T(x_i, j)} is a family of samples which converges, as j increases, to a sample from p(x).
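For concreteness, here's a minimal sketch of that evolution, assuming a generic log-density function `logp` (the names `metropolis_step` and `evolve` are mine, not from any existing library):

```python
import numpy as np

def metropolis_step(x, logp, scale, rng):
    """One application of T: isotropic normal random-walk proposal with a
    Metropolis accept/reject against the target log density."""
    prop = x + rng.normal(0.0, scale, size=x.shape)
    if np.log(rng.uniform()) < logp(prop) - logp(x):
        return prop
    return x

def evolve(samples, logp, scale, j, rng):
    """Run every sample x_i forward j steps of T, i.e. compute {T(x_i, j)}."""
    out = np.array(samples, dtype=float)
    for i in range(len(out)):
        for _ in range(j):
            out[i] = metropolis_step(out[i], logp, scale, rng)
    return out
```

Note that T only ever needs lp itself, never its gradient, which is what makes the cost comparison below work out.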

Now, suppose we have a family of expectations we are interested in: for example, the mean and sd of each parameter, or the means of certain group-level hierarchical parameters, or the mean lp__, or what have you. Our concern is to eliminate biases in these inferences caused by the difficulties experienced in sampling with our original sampler.

So, we run {T(x_i, j)} forward several steps in j, calculate the expectations at each step, fit a linear regression to each sequence of expectations, and test whether the ensemble of trend lines through the expectations can be rejected as a sample from a jointly normal distribution with zero mean vector and with the sd of each entry estimated from the standard error of the corresponding estimate. If the trends in the expectations stay statistically "flat" for several iterations of this process, we assume that we have a sufficiently converged sample.
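A crude sketch of that flatness test, assuming the Monte Carlo standard error of each expectation is known (the |z| > 3 cutoff is an arbitrary stand-in for a properly calibrated joint test, and `trends_flat` is a hypothetical name):

```python
import numpy as np

def trends_flat(expectations, ses, z_cut=3.0):
    """expectations: (J, K) array of the K tracked expectations at each of J
    evolution steps; ses: length-K standard errors of those estimates.
    Fit a least-squares slope to each column and flag it if the slope is
    more than z_cut standard errors away from zero."""
    expectations = np.asarray(expectations, dtype=float)
    J = expectations.shape[0]
    jc = np.arange(J) - (J - 1) / 2.0        # centered step index
    denom = (jc ** 2).sum()
    slopes = jc @ expectations / denom       # OLS slope of each column
    slope_ses = np.asarray(ses) / np.sqrt(denom)  # slope SE if each point has sd ses[k]
    z = slopes / slope_ses
    return bool(np.all(np.abs(z) < z_cut))
```

If this returns True for several passes in a row, we would declare the ensemble converged for the expectations we care about.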

Now, how does this help us?

Suppose HMC wants small step sizes and has divergences, as in the models @Charles_Driver and I have been running.

One way to deal with this is to set adapt_delta very high, say 0.99999, take very small step sizes, use a very large treedepth, run a long adaptation to get a decent mass matrix estimate, and spend a lot of wall-clock time getting an exact sample from HMC. In my problems this could be literally days or weeks of computing to get a sample without divergences. Or perhaps I can never get a divergence-free sample at all.

Another way to deal with the issue is to set adapt_delta low, say 0.8, set treedepth low, say 7, and take a large number of iterations in less wall time. For example, 1000 iterations at treedepth 7 costs 128,000 function/gradient evaluations (1000 × 2^7), whereas treedepth 15 would spend the same number of evaluations in about 4 iterations. We then recognize that the sample is biased and, using an exact Metropolis transition, evolve the entire ensemble until the expectations of interest flatten out.

Suppose that a gradient evaluation takes 0.03 seconds, which is typically the kind of thing Stan spits out. Then 2000 iterations at treedepth 12 takes about 3 days, while 200 iterations at treedepth 7 takes about 13 minutes. If I have a budget of, say, 15 minutes of computing time, that leaves about 2 minutes for Metropolis, and I don't have to calculate any gradients there. So let's assume that skipping gradients drops the cost to 0.003 seconds per function evaluation. I can now do 200 iterations of Metropolis on *each of the 200 samples*.
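The arithmetic behind those figures, using the assumed 0.03 s per gradient evaluation and 0.003 s per gradient-free evaluation:

```python
grad_t = 0.03   # seconds per gradient evaluation (assumed)
f_t = 0.003     # seconds per gradient-free lp evaluation (assumed)

# HMC cost: iterations * 2**treedepth leapfrog steps, one gradient each
hmc_long  = 2000 * 2**12 * grad_t   # 245,760 s, about 2.8 days
hmc_short = 200  * 2**7  * grad_t   # 768 s, about 13 minutes

# Metropolis cleanup: 200 steps on each of the 200 samples, no gradients
mh = 200 * 200 * f_t                # 120 s, 2 minutes

print(hmc_long / 86400, hmc_short / 60, mh / 60)
```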

Furthermore, there's no reason I can't parallelize the Metropolis transitions: read in N samples from a file, ask different processes to run Metropolis chains on each one for K steps, output the new samples, and lather, rinse, repeat. Wall-clock time for the Metropolis stage is then close to 2/4 = 0.5 minutes on 4 cores.
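A sketch of that parallel loop, assuming a picklable, module-level log density (a stand-in standard normal here; `run_chain` and `sanitize_pass` are hypothetical names, not existing tooling):

```python
import numpy as np
from multiprocessing import Pool

def logp(x):
    # Stand-in target: unnormalized standard normal log density.
    return -0.5 * float(x @ x)

def run_chain(args):
    """Evolve one sample x0 through k Metropolis steps (seeded per chain)."""
    x0, k, seed = args
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    for _ in range(k):
        prop = x + rng.normal(0.0, 1.0, size=x.shape)
        if np.log(rng.uniform()) < logp(prop) - logp(x):
            x = prop
    return x

def sanitize_pass(samples, k, processes=4):
    """One read-evolve-write pass: k Metropolis steps on every sample, in parallel."""
    args = [(s, k, i) for i, s in enumerate(samples)]
    with Pool(processes) as pool:
        return pool.map(run_chain, args)
```

Each pass evolves every ensemble member k steps; repeat passes until the expectation trends test flat.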

Now, I'm sure there are other very good ideas about how to get fast unbiased samples, such as Michael's adiabatic sampling, etc. But at the moment, as a practical way to handle divergences, small step sizes, and so forth, does anyone see any major impediment to the correctness or implementation of this sanitization procedure?

Also, @Bob_Carpenter, can the Stan libraries etc. be made to produce a function that simply evaluates lp without evaluating the gradients? If it were possible to call a function that would read a Stan model and compile a C++ function that evaluates lp without gradients, I might actually be able to write this sanitizer thingy.