HINTS: An alternative to NUTS that doesn't require gradients

Have Stan developers considered HINTS [1]? I feel that it's an unduly obscure algorithm that obeys detailed balance, is efficient at sampling the typical set, and doesn't require gradients. It's therefore a potentially useful augmentation to Stan: motivated by the potential to do efficient inference where gradients are not readily calculated (e.g. for large discrete spaces), we are working to implement HINTS and will also add it as a sibling to NUTS in Stan, in the hope that doing so will be useful.

See reference [1] below for the paper.

I'd be fascinated to hear Stan developers' thoughts and comments.

Cheers
Simon

[1] Malcolm Strens. "Efficient hierarchical MCMC for policy search." Proceedings of the Twenty-First International Conference on Machine Learning (ICML 2004). ACM, 2004.

Their baseline MCMC looks like a diffusive Metropolis.

I don’t understand their language well enough to understand what they’re doing or how their control problem of landing a vehicle on a moving ship relates to statistics.

I’d be very skeptical of a method that doesn’t use gradients, as it’s pretty much the only guide we have:

HINTS, as described, does not make use of gradient information that may be available.

I don’t know what existed in 2004 when the paper was written, but we will compare with Stan’s NUTS implementation (where a gradient exists).

For me, the fact that it is (very) informed by data and yet does not rely on gradients is precisely what makes it interesting. I doubt it outperforms NUTS where you can apply NUTS; the fact that you can use it when the gradient is not available is what's exciting.

Cheers
Simon

It seems unlikely an algorithm will scale with dimension without using gradients.

The way HMC is able to move long distances is by following the flow induced by the Hamiltonian, which is done in practice using gradients in the leapfrog algorithm.
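To make that concrete, here is a minimal sketch of a leapfrog trajectory in Python, showing where the gradient enters. The names (grad_log_prob, step_size) are illustrative assumptions, not Stan internals, and a unit mass matrix is assumed:

```python
import numpy as np

def leapfrog(q, p, grad_log_prob, step_size, n_steps):
    """Simulate Hamiltonian flow with a unit mass matrix; the gradient of
    the log density is what steers the long, informed moves."""
    q, p = np.copy(q), np.copy(p)
    p += 0.5 * step_size * grad_log_prob(q)   # initial half step for momentum
    for _ in range(n_steps - 1):
        q += step_size * p                    # full step for position
        p += step_size * grad_log_prob(q)     # full step for momentum
    q += step_size * p                        # final position step
    p += 0.5 * step_size * grad_log_prob(q)   # final half step for momentum
    return q, p
```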

Algorithms like differential evolution (ter Braak) or walkers (Goodman and Weare) use multiple particles to try to estimate curvature empirically with a lot of parallelization. Unfortunately, these methods don't scale well with dimension because interpolation among existing draws is unlikely to fall in the typical set, so you're back to small steps like Metropolis, albeit with better curvature estimation than a pure random walk.
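For contrast, here is a rough sketch of the Goodman & Weare stretch move (the kind of proposal used by emcee): the direction and scale come from interpolating between existing walkers rather than from gradients. Names are illustrative:

```python
import numpy as np

def stretch_move(x_k, x_j, log_prob, a=2.0, rng=None):
    """Propose a move for walker x_k along the line through another walker x_j.
    Gradient-free: the ensemble itself supplies the direction and scale."""
    if rng is None:
        rng = np.random.default_rng()
    d = x_k.size
    z = ((a - 1.0) * rng.random() + 1.0) ** 2 / a   # z ~ g(z) ∝ 1/sqrt(z) on [1/a, a]
    y = x_j + z * (x_k - x_j)                       # interpolate/extrapolate along the pair
    log_accept = (d - 1) * np.log(z) + log_prob(y) - log_prob(x_k)
    return y if np.log(rng.random()) < log_accept else x_k
```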


I think you are right that HINTS won't scale as well as NUTS to high dimensions unless you have lots of data: HINTS aggregates naive random-walk steps so that they appear informative, so its ability to implicitly infer gradients is a function of the amount of aggregation (i.e., the amount of data). We'll look at that, but I expect you will be shown to be correct.

However, I'm not drawn to HINTS because I think it would replace NUTS where NUTS is good, but because HINTS is applicable in situations where you can't use NUTS. More specifically, HINTS would let you do something like NUTS in situations where there is no gradient, e.g. because the space is discrete (as for BART/CART). I feel (relatively strongly) that Stan would benefit from catering for such models, and we're working to facilitate that. Integrating HINTS alongside NUTS is a step on that path.

If you're interested in the algorithm, go ahead and test it out. (Maybe a framework like this could be helpful: https://arxiv.org/abs/1804.06788.) There is currently not enough information available to suggest it would work robustly and scalably in "industrial strength" code that doesn't require hand tuning (which is the minimum standard for even considering inclusion in Stan; almost no MCMC algorithms meet this standard, and currently none do for generic discrete problems), but if you get closer to that goal please keep letting us know.


Hi, I'm the author of the HINTS paper. Simon directed me here, and I'll make a couple of remarks:

  • It’s a sampling algorithm, so it’s just unfortunate that the only publication focused on optimisation. (As you know, samplers can be useful for optimisation.)
  • The point is that it is cheap to build up directed proposals from small steps, using small subsets of the data. These directed proposals make big moves, but are much more likely to be accepted on more expensive evaluations (larger subsets) than random moves of similar scale. (A simplified sketch of this appears after the list.)
  • While the paper builds these directed proposals from small Gaussian-distributed steps, gradient based proposals could be used equally well for the primitive steps. You just have to be able to calculate the same information you’d use in the acceptance rule for HMC or Langevin. (Happy to explain further…)
  • I also have a proof (unpublished) that it provides unbiased samples of the target density.
  • Clearly, HINTS is related to the very recent work on Minibatch MCMC.
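To make the idea concrete for readers who haven't seen the paper, here is a deliberately simplified Python sketch of the hierarchical structure as described above: primitive moves are judged on small data subsets, composite moves chain them together and are then accepted or rejected on the node's larger subset. The function and parameter names are invented for illustration, and the published acceptance rule contains correction terms that are omitted here:

```python
import numpy as np

def hints_step(theta, data, log_lik, depth, rng, scale=0.1, branch=2):
    """Simplified sketch of one hierarchical step (not the exact published rule)."""
    if depth == 0:
        # Primitive move: a cheap random-walk proposal, judged on a small subset.
        prop = theta + scale * rng.standard_normal(theta.shape)
    else:
        # Composite move: chain lower-level steps, each on part of this node's
        # data, and treat the endpoint as a single directed proposal.
        prop = theta
        for subset in np.array_split(rng.permutation(data), branch):
            prop = hints_step(prop, subset, log_lik, depth - 1, rng, scale, branch)
    # Accept or reject the resulting (possibly large) move on this node's data.
    log_accept = log_lik(prop, data) - log_lik(theta, data)
    return prop if np.log(rng.random()) < log_accept else theta
```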

You might want to take the pulse of the project on this one. There are a lot of mixed feelings about combinatorially intractable models like BART or LDA and whether we should be bothering with them.

From a project management point of view, our big advantage is that we have a rich modeling language that's differentiable. Without the need for derivatives, it's a lot easier to just ask people to plug in simple functions, as the emcee package does in Python.

Has anyone gotten that to work for interesting models? @betanalpha has an arXived paper suggesting it’s hopeless:


Thanks much for joining the discussion. We really appreciate it.

Bob,
Regarding models like BART and LDA, I find it hard to simply tell people that another algorithm would be worse (or better) and would prefer to show them the relative merits of different approaches. Is the transparency that would accompany an efficient BART implementation alongside an alternative that some would say was inferior (e.g. logistic regression, perhaps) not appealing, particularly when users want it and think that they need it?
Simon

P.S. I will leave it to Malcolm to comment on Michael's arXiv paper.

I'm fairly certain nothing like this will happen in Stan in the near future. There are many reasons, but a particularly compelling one is that it is most likely several person-years' worth of work that no one is volunteering to do.

There are other packages that try to fit BART and LDA models. Maybe one will come along based on HINTS. Users who need to fit those models can use these packages. Stan doesn’t have to do everything.

@anon75146577: Sorry if this wasn't apparent, but I am not asking for this to be integrated by someone else; I am asking for advice and comments on our plans to undertake the many person-years of development you describe. So I do think it is likely to happen.


Not appealing to me. How do we even talk about efficiency of an algorithm that can’t recover parameters from simulated data?

I’m totally OK telling users that Stan doesn’t solve every problem.

I’m not quite sure what’s being proposed. Some extension to Stan’s language to allow BART to be a component of models that also have continuous components fit with HMC? Or would it be standalone? We don’t even know how to mix simple discrete sampling with HMC reliably. PyMC3 supports it, but I don’t think they’ve done the simulation-based calibration to show it reliably recovers parameters.

Edit: Oops, I see Dan said the same thing below (now above!):


I skimmed this paper, thanks. I suspect the main distinction between their result and what you'd get with HINTS is that we use subsampling to build directed proposals, and go on to accept or reject these proposals using larger data subsets (and eventually the whole dataset), which guarantees that any bias is eliminated. (In deep learning there's been a lot of success using gradients of minibatches, and I suspect the same will be true in a sampling context.)

We are working on a number of things which I think are getting conflated here. First, we are working on novel algorithms, e.g. SMC samplers, HINTS, and combinations thereof. We will get those running on a branch on GitHub such that they can be applied to problems that Stan can handle. Our proposal will then be to merge those algorithms into the main release. Second, we are working to extend Stan's language to enable it to be as expressive as other probabilistic programming languages (e.g. PyMC3) and so tackle problems that Stan/NUTS cannot currently handle (which I clearly regard as more important than some of the other people engaged in this thread do) but which these new algorithms can deal with. Again, we'll get that working on a branch and then we'll propose to merge.

We need to solve these software integration issues to deliver on projects that we are being paid to work on. What the Stan developer community decides to do when we have something working will be interesting. My hope was that by adopting a strategy of engaging early we'd get some steer on how best to configure our work to maximise the chance that current and future Stan users would gain access to, and benefit from, what we will soon have done. I must confess to currently being somewhat unsure as to whether that was a good strategy.

It may be helpful to write up a design doc for these extensions.


@stevebronder: Thanks for the suggestion to articulate our plans in a design doc.
Simon

Maybe, but see:

and

+1 to that. It’s really awkward to get a completed PR we wind up not wanting.

With new services methods like the ones we have?

What are you thinking of here? PyMC3 lets you do blocked sampling with discrete sampling, but that’d be a huge change to the language and the samplers. And we’re not even sure it works in PyMC3 because they’ve never evaluated it (at least they hadn’t the last time I’d asked them). We’d want to see SBC-type calibration studies from simulations before believing mixing discrete sampling and HMC would work with HMC’s adaptation. And see an application where it works better than marginalization. The only such application I can imagine is missing count data, because nothing else I know involving discrete sampling (variable selection, etc.) is going to pass SBC or be faster than marginalizing.
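For context on what marginalization means here: instead of sampling the discrete latent variable, you sum it out of the likelihood with a log-sum-exp, leaving a purely continuous target that HMC can handle. A minimal, generic Python sketch (a two-component location mixture, purely illustrative and not tied to any particular Stan model):

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import norm

def mixture_log_lik(y, weights, mus, sigmas):
    """Log likelihood with the discrete component indicator summed out:
    log p(y_i) = logsumexp_k( log w_k + log N(y_i | mu_k, sigma_k) )."""
    comp = norm.logpdf(y[:, None], loc=mus, scale=sigmas)   # shape (n_obs, n_components)
    return np.sum(logsumexp(np.log(weights) + comp, axis=1))

# Example: two components, no discrete sampling required.
y = np.array([0.1, 2.3, -0.5])
print(mixture_log_lik(y, weights=np.array([0.3, 0.7]),
                      mus=np.array([0.0, 2.0]), sigmas=np.array([1.0, 1.0])))
```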

For samplers, we’re going to want to see them evaluated on our 300+ test problems to see they get the right answer. The right place to start is the dozen or so models in the repo stat_comp_benchmarks. If a general algorithm can’t solve those to within tolerances in reasonable time, it’s very unlikely we’ll accept it.

Any language work will need to be in OCaml in the stanc3 repo, because the current C++ parsers are scheduled for demolition. The beta is already out for the new parser and we're not accepting changes to the old one, which will be removed in Stan 2.21 or 2.22.