HINTS: An alternative to NUTS that doesn't require gradients

Not appealing to me. How do we even talk about efficiency of an algorithm that can’t recover parameters from simulated data?

I’m totally OK telling users that Stan doesn’t solve every problem.

I’m not quite sure what’s being proposed. Some extension to Stan’s language to allow BART to be a component of models that also have continuous components fit with HMC? Or would it be standalone? We don’t even know how to mix simple discrete sampling with HMC reliably. PyMC3 supports it, but I don’t think they’ve done the simulation-based calibration to show it reliably recovers parameters.

Edit: Oops, I see Dan said the same thing below (now above!).

1 Like

I skimmed this paper, thanks. I suspect the main distinction between their result and what you’d get with HINTS is that we use subsampling to build directed proposals, and go on to accept or reject these proposals using larger data subsets (and the whole dataset eventually), which guarantees to eliminate any bias. (In deep learning there’s been a lot of success using gradients of minibatches and I suspect the same will be true in a sampling context.)

We are working on a number of things which I think are getting conflated here. First, we are working on novel algorithms, e.g. SMC samplers and HINTS (and combinations thereof). We will get those running on a branch on GitHub such that they can be applied to problems that Stan can handle. Our proposal will then be to merge those algorithms into the main release. Second, we are working to extend Stan’s language to make it as expressive as other probabilistic programming languages (e.g. PyMC3) and so tackle problems that Stan/NUTS cannot currently handle (which I clearly consider much more important than some of the other people engaged on this thread do) but which these new algorithms can deal with. Again, we’ll get that working on a branch and then we’ll propose to merge.

We need to solve these software integration issues to deliver on projects that we are being paid to work on. What the Stan developer community decides to do when we have something working will be interesting. My hope was that by adopting a strategy of engaging early we’d have some steer on how best to configure our work to maximise the chance that current and future Stan users would gain access to and benefit from what we will soon have done. I must confess to currently being somewhat unsure as to whether that was a good strategy.

It may be helpful to write up a design doc for these extensions.

1 Like

@stevebronder: Thanks for the suggestion to articulate our plans in a design doc.
Simon

Maybe, but see:

and

+1 to that. It’s really awkward to get a completed PR we wind up not wanting.

With new services methods like the ones we have?

What are you thinking of here? PyMC3 lets you do blocked sampling with discrete sampling, but that’d be a huge change to the language and the samplers. And we’re not even sure it works in PyMC3 because they’ve never evaluated it (at least they hadn’t the last time I’d asked them). We’d want to see SBC-type calibration studies from simulations before believing mixing discrete sampling and HMC would work with HMC’s adaptation. And see an application where it works better than marginalization. The only such application I can imagine is missing count data, because nothing else I know involving discrete sampling (variable selection, etc.) is going to pass SBC or be faster than marginalizing.
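For anyone following along, here is a minimal sketch (mine, not anything in Stan’s codebase) of what marginalization means for a hypothetical two-component normal mixture: instead of sampling the discrete indicator z, it is summed out with log-sum-exp, leaving a density that is smooth in all parameters.

```python
import math

def log_sum_exp(a, b):
    """Numerically stable log(exp(a) + exp(b))."""
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

def normal_lpdf(y, mu, sigma):
    """Log density of Normal(mu, sigma) at y."""
    return (-0.5 * math.log(2 * math.pi) - math.log(sigma)
            - 0.5 * ((y - mu) / sigma) ** 2)

def mixture_lpdf(y, theta, mu1, mu2, sigma):
    """Two-component mixture with the indicator z marginalized out:
    log(theta * N(y|mu1,sigma) + (1-theta) * N(y|mu2,sigma))."""
    return log_sum_exp(
        math.log(theta) + normal_lpdf(y, mu1, sigma),
        math.log(1 - theta) + normal_lpdf(y, mu2, sigma),
    )
```

Stan’s `log_mix` performs this summation for mixtures; the point is that after marginalization there is no discrete variable left to sample, so HMC/NUTS applies directly.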

For samplers, we’re going to want to see them evaluated on our 300+ test problems to see that they get the right answer. The right place to start is the dozen or so models in the stat_comp_benchmarks repo. If a general algorithm can’t solve those to within tolerances in reasonable time, it’s very unlikely we’ll accept it.

Any language work will need to be in OCaml in the stanc3 repo, because the current C++ parsers are scheduled for demolition. The beta is already out for the new parser and we’re not accepting changes to the old one, which will be removed in Stan 2.21 or 2.22.

@Bob_Carpenter: In response:

  1. Thanks for the pointer to the OCaml development.
  2. We are already working towards evaluation of alternative inference algorithms on the corpus of Stan files that we understood @avehtari to be working up (though when we last engaged on that axis, we understood that there were about 200 such files and that NUTS was not yet applied to them and/or that the “right answer” was not yet defined, so it may be that we are thinking of different things).
  3. We recognise that adding discrete variables would be a significant change, and we understand your desire to document extensions in a PR, have tests similar to SBC, include the service methods you mention, etc. However, my sense at this point in time is that such discussion is premature and we need to make more progress before engaging further with the pre-existing Stan developer community.

Bob,

Thanks for taking the time to reply! Yes, you showed me the first paper before and your subsampling plots are convincing: we should not naively assume that biases in chains run on subsets will ‘cancel out’ in the aggregated sample. (I like the look of the new acceptance rule here https://bair.berkeley.edu/blog/2017/08/02/minibatch-metropolis-hastings/ but I haven’t tried their approach so I can’t say whether the correction they introduce is effective.)

The HINTS approach is a bit more subtle: it doesn’t trust subsets to give us unbiased samples from the true target (over the whole dataset); proposals constructed using subsets are always accepted/rejected using the whole dataset in the end, so there is no bias. All it uses the subsets for is to give us proposals that can make large directed jumps in the state space. (Gradients are optional with HINTS but my intuition is they will be useful for complex models.)
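To make the “subsets propose, the full dataset decides” idea concrete: this is not HINTS itself, but a minimal delayed-acceptance sketch in the style of Christen & Fox (2005) with the same key property. A cheap subset approximation screens proposals, and the final accept/reject uses the whole dataset with the subset ratio divided back out, so the chain targets the exact full-data posterior. The model (a normal mean with a flat prior) and all names are my own toy choices.

```python
import math, random

random.seed(1)

def full_loglik(mu, data):
    """Exact log likelihood over the whole dataset."""
    return sum(-0.5 * (y - mu) ** 2 for y in data)

def subset_loglik(mu, data, k):
    """Cheap approximation: a size-k subset scaled up to size n."""
    n = len(data)
    return (n / k) * sum(-0.5 * (y - mu) ** 2 for y in data[:k])

def delayed_acceptance_step(mu, data, k, step=0.5):
    prop = mu + random.gauss(0.0, step)
    # Stage 1: screen the proposal using only the subset.
    log_a1 = subset_loglik(prop, data, k) - subset_loglik(mu, data, k)
    if math.log(random.random()) >= min(0.0, log_a1):
        return mu  # rejected cheaply; the full data is never touched
    # Stage 2: exact correction with the whole dataset; dividing out
    # the subset ratio makes the overall kernel unbiased.
    log_a2 = (full_loglik(prop, data) - full_loglik(mu, data)) - log_a1
    if math.log(random.random()) < min(0.0, log_a2):
        return prop
    return mu
```

Run as a chain, this targets the exact full-data posterior; the subset only shapes which proposals survive the cheap first stage. As I understand it, HINTS generalises this to a hierarchy of subset sizes.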

I think a good way forward would be for the Liverpool researchers to see whether either of these approaches does a good job on that logistic regression fit!

My random 50 cents on this: even if posterior samples underlying estimates based on subsets of the data are far away from the true posterior (based on all data), couldn’t one merge the subsample-based estimates with some sort of meta-analysis?
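One concrete instance of that kind of merge is consensus Monte Carlo (Scott et al.), which combines draw s from each shard with inverse-variance weights. It is only exact when the subposteriors are roughly Gaussian, which is part of why this is hard in general; the sketch below is illustrative, for a single scalar parameter.

```python
import random, statistics

random.seed(0)

def consensus_merge(draws_per_shard):
    """Consensus-Monte-Carlo-style merge under a Gaussian
    approximation: the s-th merged draw is the inverse-variance
    weighted average of the s-th draw from each shard."""
    weights = [1.0 / statistics.variance(d) for d in draws_per_shard]
    wsum = sum(weights)
    n = min(len(d) for d in draws_per_shard)
    return [
        sum(w * d[s] for w, d in zip(weights, draws_per_shard)) / wsum
        for s in range(n)
    ]
```

For two Gaussian shards N(1, 1) and N(3, 1), the merged draws concentrate around 2 with variance about 0.5, as the product of the two subposterior densities would.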

@LucC: yes. That’s basically what HINTS does.

There are problems I’m interested in, such as inference for PDE/DDE models, where lack of gradients is the norm for existing solvers. I don’t know if HINTS is the answer, but there’s certainly a market for non-gradient-based samplers.

1 Like

Everybody has that intuition, but nobody knows how to do it effectively. The main problem is concentration of measure.

See the linked paper by @betanalpha above (Fundamental Incompatibility of Hamiltonian Monte Carlo and Data Subsampling) and the post I linked on Subsampling in Parallel and MCMC for illustrations of the concentration problem.

You can use techniques like differential evolution, which use an ensemble of particles, which at stationarity, can be used to estimate posterior covariance and thus inform Metropolis proposals. But without gradients, you still get Metropolis, and it’s slow. There are other particle techniques, but they all have the same problem with not being able to follow the flow defined by the typical set.
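For concreteness, here is a minimal sketch of the differential-evolution flavour of ensemble MCMC (ter Braak’s DE-MC), on a toy correlated 2-D Gaussian of my own choosing: each particle proposes a jump along the difference of two other particles, so proposals align with the ensemble’s covariance without any gradients, but each update is still plain Metropolis.

```python
import math, random

random.seed(2)

def log_target(x):
    """Toy target: a 2-D Gaussian with correlation 0.9."""
    a, b = x
    rho = 0.9
    return -0.5 * (a * a - 2 * rho * a * b + b * b) / (1 - rho * rho)

def de_mc_sweep(particles, gamma=2.38 / math.sqrt(2 * 2), eps=1e-4):
    """One sweep of differential-evolution MCMC: each particle i
    proposes x_i + gamma * (x_j - x_k) + noise, using two other
    particles j, k, then does a Metropolis accept/reject."""
    n = len(particles)
    for i in range(n):
        j, k = random.sample([m for m in range(n) if m != i], 2)
        prop = tuple(
            particles[i][d]
            + gamma * (particles[j][d] - particles[k][d])
            + random.gauss(0.0, eps)
            for d in range(2)
        )
        if math.log(random.random()) < log_target(prop) - log_target(particles[i]):
            particles[i] = prop
    return particles
```

At stationarity the particle cloud mirrors the posterior covariance, so the difference vectors point along the target’s correlated directions; that is the sense in which the ensemble “informs” the proposals.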

Those aren’t in Stan yet (nor is it clear they will be), so you’ll have to ask @avehtari. There are some evaluations that @yuling did of ADVI using a large set of models, but I’m not sure how he did it. Those were drawn largely from the example models repo.

The first step should be to evaluate with the stat comp benchmark models.

There is some tooling on top of that in this repo.

Yes, particle filters are also what I’m thinking about. The point I’m trying to make is that there are applications where scalability is less important than making sensitivity-free solvers work, and maybe those applications (say, geophysics) can be in the purview of Stan if we include other samplers.

There are roughly 200 example models in total and HMC has pathological results on a few of them.

1 Like

How about ADVI’s performance on them?

@yizhang: We are actively working on improving particle filters, for example to incorporate NUTS as a proposal distribution and to achieve variance below the currently perceived lower bound. I’d like Stan to be able to describe problems that involve unbounded streams of incremental data (with and without time-varying states), and also to facilitate, for example, PMCMC, where we might use NUTS in the proposal for both the particle filter and the MCMC bit (the latter requires the gradient of the particle filter’s log likelihood, which we know how to calculate).
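For readers unfamiliar with PMCMC: the ingredient the particle filter supplies to the MCMC bit is an unbiased estimate of the marginal likelihood. Here is a minimal bootstrap-filter sketch on a toy random-walk-plus-noise model (my own toy, not our implementation); the running sum of log mean weights is that estimate.

```python
import math, random

random.seed(3)

def bootstrap_pf_loglik(ys, n_particles=1000, q=0.5, r=1.0):
    """Bootstrap particle filter for the toy state-space model
        x_t = x_{t-1} + N(0, q),   y_t = x_t + N(0, r),
    with x_0 ~ N(0, 1). Returns an unbiased estimate of the
    marginal likelihood p(y_1..T), the quantity PMCMC needs."""
    xs = [random.gauss(0.0, 1.0) for _ in range(n_particles)]
    loglik = 0.0
    for y in ys:
        # Propagate every particle through the transition density.
        xs = [x + random.gauss(0.0, math.sqrt(q)) for x in xs]
        # Weight each particle by the observation density.
        ws = [math.exp(-0.5 * (y - x) ** 2 / r) / math.sqrt(2 * math.pi * r)
              for x in xs]
        loglik += math.log(sum(ws) / n_particles)
        # Multinomial resampling to fight weight degeneracy.
        xs = random.choices(xs, weights=ws, k=n_particles)
    return loglik
```

For this linear-Gaussian toy the estimate can be checked against the exact Kalman-filter likelihood; in the nonlinear models we care about, the filter estimate is the only one available.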

@Bob_Carpenter (and with more relevance to this thread’s title): as @Malcolm_S highlights, HINTS differs from the algorithms described in the “Fundamental Incompatibility of Hamiltonian Monte Carlo and Data Subsampling” note, and I’m convinced that the argument against data subsampling does not hold for HINTS. We’re working to see what the truth is.

This would be a very nice thing to have. Some things that I’m working on (stochastic differential equation models and general HMMs) could benefit hugely from online updating, as the current sampling framework in Stan is geared more toward batch mode.