HINTS: An alternative to NUTS that doesn't require gradients

Have Stan developers considered HINTS [1]? I feel that it’s an unjustly obscure algorithm that obeys detailed balance, is efficient at sampling the typical set, and doesn’t require gradients. It’s therefore a potentially useful augmentation to Stan: motivated by the potential to do efficient inference where gradients are not readily calculated (e.g. for large discrete spaces), we are working to implement HINTS and plan to add it as a sibling to NUTS in Stan, in the hope that doing so will be useful.

See reference [1] below for the paper.

I’d be fascinated to hear Stan developers’ thoughts and comments.

Cheers
Simon

[1] Malcolm Strens. “Efficient hierarchical MCMC for policy search.” Proceedings of the Twenty-First International Conference on Machine Learning. ACM, 2004.

Their baseline MCMC looks like a diffusive Metropolis.

I don’t understand their language well enough to understand what they’re doing or how their control problem of landing a vehicle on a moving ship relates to statistics.

I’d be very skeptical of a method that doesn’t use gradients, as gradients are pretty much the only guide we have:

HINTS as described, does not make use of gradient information that may be available.

I don’t know what existed in 2004 when the paper was written, but we will compare with Stan’s NUTS implementation (where a gradient exists).

For me, the fact that it is (very) informed by the data yet does not rely on gradients is precisely what makes it interesting. I doubt it outperforms NUTS when you can apply NUTS, but the fact that you can use it when the gradient is not available is exciting.

Cheers
Simon

It seems unlikely an algorithm will scale with dimension without using gradients.

The way HMC is able to move long distances is by following the flow induced by the Hamiltonian, which is done in practice using gradients in the leapfrog algorithm.
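For concreteness, here is a minimal sketch of one leapfrog trajectory (assuming a unit mass matrix and a user-supplied `grad_log_prob` function; this is not Stan’s actual implementation). The gradient appears in every momentum update, which is what steers the trajectory along the Hamiltonian flow:

```python
import numpy as np

def leapfrog(q, p, grad_log_prob, eps, n_steps):
    """One leapfrog trajectory: gradients steer every momentum update."""
    p = p + 0.5 * eps * grad_log_prob(q)          # initial half step for momentum
    for i in range(n_steps):
        q = q + eps * p                           # full step for position
        if i < n_steps - 1:
            p = p + eps * grad_log_prob(q)        # full momentum step between position updates
    p = p + 0.5 * eps * grad_log_prob(q)          # final half step for momentum
    return q, p

# e.g. for a standard normal target, grad log p(q) = -q:
q, p = leapfrog(np.array([3.0]), np.array([0.0]), lambda q: -q, eps=0.1, n_steps=20)
```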

Algorithms like differential evolution (ter Braak) or walkers (Goodman and Weare) use multiple particles to try to estimate curvature empirically with a lot of parallelization. Unfortunately, these methods don’t scale well with dimension because interpolations among existing draws are unlikely to fall in the typical set, so you’re back to small steps like Metropolis, albeit with better curvature estimation than a pure random walk.
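For reference, the Goodman and Weare “stretch move” looks roughly like this (a sketch with a placeholder `log_prob` function): the proposal is an interpolation/extrapolation along the line between two walkers, which is exactly the kind of move that tends to miss the typical set in high dimensions.

```python
import numpy as np

def stretch_move(walkers, log_prob, rng, a=2.0):
    """One sweep of stretch moves over an ensemble (walkers: n_walkers x dim array)."""
    n, d = walkers.shape
    out = walkers.copy()
    for k in range(n):
        j = rng.choice([i for i in range(n) if i != k])        # pick a different walker
        z = ((a - 1.0) * rng.uniform() + 1.0) ** 2 / a         # z ~ g(z) ∝ 1/sqrt(z) on [1/a, a]
        y = walkers[j] + z * (out[k] - walkers[j])             # move along the line between walkers
        log_accept = (d - 1) * np.log(z) + log_prob(y) - log_prob(out[k])
        if np.log(rng.uniform()) < log_accept:
            out[k] = y
    return out
```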


I think you are right that HINTS won’t scale as well as NUTS to high dimensions unless you have lots of data: HINTS aggregates naive random-walk steps into informed proposals, so its ability to implicitly infer gradient-like directions is a function of the amount of aggregation (i.e. the number of data points). We’ll look at that, but I expect you will be shown to be correct.

However, I’m not drawn to HINTS because I think it would replace NUTS where NUTS is good, but because HINTS is applicable in situations where you can’t use NUTS. More specifically, HINTS would let you do something like NUTS in situations where there is no gradient, e.g. because the space is discrete (as for BART/CART). I feel (relatively strongly) that Stan would benefit from catering for such models, and we’re working to facilitate that. Integrating HINTS alongside NUTS is a step on that path.

If you’re interested in the algorithm, go ahead and test it out (maybe a framework like this could be helpful: https://arxiv.org/abs/1804.06788). There is currently not enough information available to suggest it would work robustly and scalably in “industrial strength” code that doesn’t require hand tuning, which is the minimum standard for even considering inclusion in Stan; almost no MCMC algorithms meet this standard, and currently none do for generic discrete problems. But if you get closer to that goal, please keep letting us know.


Hi, I’m the author of the HINTS paper. Simon directed me here, and I’ll make a couple of remarks:

  • It’s a sampling algorithm, so it’s just unfortunate that the only publication focused on optimisation. (As you know, samplers can be useful for optimisation.)
  • The point is that it is cheap to build up directed proposals from small steps, using small subsets of the data. These directed proposals make big moves, but are much more likely to be accepted on more expensive evaluations (larger subsets) than random moves of similar scale. (A rough sketch of this structure follows this list.)
  • While the paper builds these directed proposals from small Gaussian-distributed steps, gradient-based proposals could be used equally well for the primitive steps. You just have to be able to calculate the same information you’d use in the acceptance rule for HMC or Langevin. (Happy to explain further…)
  • I also have a proof (unpublished) that it provides unbiased samples of the target density.
  • Clearly, HINTS is related to the very recent work on Minibatch MCMC.
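To illustrate the second bullet, here is a rough sketch of the overall structure: cheap random-walk steps are chained on a data subset to form one directed composite proposal, which is then tested against the full data. This is only an illustration of the idea discussed above, not the HINTS algorithm as published; in particular it omits the hierarchy of intermediate levels and the correction terms needed for exact reversibility, and `log_lik` and the subsetting scheme are placeholders.

```python
import numpy as np

def composite_step(theta, log_lik, data, rng, n_inner=5, step=0.1, subset_frac=0.1):
    """Chain cheap steps on a data subset, then test the composite move on all data."""
    m = max(1, int(len(data) * subset_frac))
    subset = data[rng.choice(len(data), size=m, replace=False)]
    cur = theta.copy()
    for _ in range(n_inner):                                  # cheap inner Metropolis steps on the subset
        prop = cur + step * rng.standard_normal(cur.shape)
        if np.log(rng.uniform()) < log_lik(prop, subset) - log_lik(cur, subset):
            cur = prop
    # the end point of the inner chain is treated as one large, directed proposal
    if np.log(rng.uniform()) < log_lik(cur, data) - log_lik(theta, data):
        return cur
    return theta
```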