HINTS: An alternative to NUTS that doesn't require gradients

@Bob_Carpenter: In response:

  1. Thanks for the pointer to the OCaml development.
  2. We are already working towards evaluating alternative inference algorithms on the corpus of Stan files that we understood @avehtari to be working up (though when we last engaged on that, we understood that there were about 200 such files, that NUTS had not yet been applied to them, and/or that the “right answer” was not yet defined, so it may be that we are thinking of different things).
  3. We recognise that adding discrete variables would be a significant change and understand your desire to document extensions in a PR, have tests similar to SBC, include the service methods you mention, etc. However, my sense at this point in time is that such discussion is premature and that we need to make more progress before engaging with the pre-existing Stan developer community.

Bob,

Thanks for taking the time to reply! Yes, you showed me the first paper before and your subsampling plots are convincing: we should not naively assume that biases in chains run on subsets will ‘cancel out’ in the aggregated sample. (I like the look of the new acceptance rule here https://bair.berkeley.edu/blog/2017/08/02/minibatch-metropolis-hastings/ but I haven’t tried their approach so I can’t say whether the correction they introduce is effective.)

The HINTS approach is a bit more subtle: it doesn’t trust subsets to give us unbiased samples from the true target (over the whole dataset); proposals constructed using subsets are always accepted/rejected using the whole dataset in the end, so there is no bias. All it uses the subsets for is to give us proposals that can make large directed jumps in the state space. (Gradients are optional with HINTS but my intuition is they will be useful for complex models.)
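To make the “no bias” point concrete, here’s a minimal sketch of the principle (mine, not the HINTS algorithm itself, which composes moves over a hierarchy of nested subsets, and all of the function names and tuning choices below are illustrative): a data subset is used only to shape the proposal, while acceptance always uses the whole-data posterior, so the chain still targets the true posterior.

```python
import numpy as np

def full_log_post(theta, X, y):
    """Log posterior over ALL the data (placeholder model: logistic regression
    with a standard-normal prior; swap in the model of interest)."""
    logits = X @ theta
    loglik = np.sum(y * logits - np.logaddexp(0.0, logits))
    return loglik - 0.5 * theta @ theta

def subset_informed_mh(X, y, theta0, n_iter=5000, subset_frac=0.05, seed=0):
    """Random-walk Metropolis whose proposal covariance is shaped by a small
    data subset, but whose accept/reject always uses the full-data posterior,
    so the chain targets the true posterior with no subsampling bias."""
    rng = np.random.default_rng(seed)
    n, d = X.shape

    # The subset is used ONLY to build a fixed proposal covariance (a crude
    # inverse-Fisher-style guess, rescaled to the full data size).
    idx = rng.choice(n, size=max(d, int(subset_frac * n)), replace=False)
    Xs = X[idx]
    prop_cov = np.linalg.inv(0.25 * n * (Xs.T @ Xs) / len(idx) + np.eye(d))
    chol = np.linalg.cholesky(prop_cov)

    theta = np.asarray(theta0, dtype=float)
    lp = full_log_post(theta, X, y)
    draws = np.empty((n_iter, d))
    for t in range(n_iter):
        prop = theta + chol @ rng.standard_normal(d)   # symmetric proposal
        lp_prop = full_log_post(prop, X, y)            # evaluated on ALL the data
        if np.log(rng.uniform()) < lp_prop - lp:       # exact MH acceptance
            theta, lp = prop, lp_prop
        draws[t] = theta
    return draws
```

HINTS goes further by building the proposal itself out of subset-level moves, but the exactness argument is the same: the final accept/reject sees the whole dataset.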

I think a good way forward would be for the Liverpool researchers to see whether either of these approaches does a good job on that logistic regression fit!

My random 50 cents on this: even if posterior samples underlying estimates based on subsets of the data are far away from the true posterior (based on all data), couldn’t one merge the subsample-based estimates with some sort of meta-analysis?
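Something along the lines of a precision-weighted recombination of sub-posterior draws (in the spirit of consensus Monte Carlo), say. A sketch, where the function name and interface are just for illustration, and which is only exact when each sub-posterior is roughly Gaussian, so the concentration-of-measure issues discussed below still apply:

```python
import numpy as np

def consensus_combine(shard_draws):
    """Precision-weighted recombination of draws from sub-posteriors fit to
    disjoint data shards (consensus-Monte-Carlo style). Only exact when each
    sub-posterior is Gaussian; otherwise an approximation.

    shard_draws: list of arrays, each of shape (n_draws, dim), dim >= 2."""
    precisions = [np.linalg.inv(np.cov(d, rowvar=False)) for d in shard_draws]
    pooled_cov = np.linalg.inv(sum(precisions))
    n_draws, dim = shard_draws[0].shape
    combined = np.empty((n_draws, dim))
    for t in range(n_draws):
        # Combine the t-th draw of every shard, weighting by shard precision.
        combined[t] = pooled_cov @ sum(P @ d[t] for P, d in zip(precisions, shard_draws))
    return combined
```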

@LucC: yes. That’s basically what HINTS does.

There are problems I’m interested in, such as inference for PDE/DDE models, where the lack of gradients is the norm for existing solvers. I don’t know if HINTS is the answer, but there’s certainly a market for non-gradient-based samplers.


Everybody has that intuition, but nobody knows how to do it effectively. The main problem is concentration of measure.

See the linked paper of @betanalpha above (Fundamental Incompatibility of Hamiltonian Monte Carlo and Data Subsampling) and the post I linked on Subsampling in Parallel and MCMC for an illustration of the concentration problem.

You can use techniques like differential evolution, which use an ensemble of particles that, at stationarity, can be used to estimate posterior covariance and thus inform Metropolis proposals. But without gradients, you still get Metropolis, and it’s slow. There are other particle techniques, but they all have the same problem of not being able to follow the flow defined by the typical set.
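For concreteness, here’s a rough sketch of the differential-evolution Metropolis idea (ter Braak 2006); the function and argument names are mine, not anything in Stan:

```python
import numpy as np

def de_mc(log_post, init, n_iter=2000, gamma=None, eps=1e-6, seed=0):
    """Differential-evolution Metropolis (ter Braak 2006): each chain proposes a
    jump along the difference of two other chains' states, so the ensemble
    adapts to the posterior's scale and correlations without any gradients.
    init: array of shape (n_chains, dim) of starting points."""
    rng = np.random.default_rng(seed)
    chains = np.array(init, dtype=float)
    n_chains, dim = chains.shape
    if gamma is None:
        gamma = 2.38 / np.sqrt(2 * dim)          # standard DE-MC jump scale
    lp = np.array([log_post(c) for c in chains])
    out = np.empty((n_iter, n_chains, dim))
    for t in range(n_iter):
        for i in range(n_chains):
            a, b = rng.choice([j for j in range(n_chains) if j != i], 2, replace=False)
            prop = chains[i] + gamma * (chains[a] - chains[b]) \
                   + eps * rng.standard_normal(dim)
            lp_prop = log_post(prop)
            if np.log(rng.uniform()) < lp_prop - lp[i]:   # still plain Metropolis
                chains[i], lp[i] = prop, lp_prop
        out[t] = chains
    return out
```

Each chain marginally targets the posterior at stationarity, but every move is still a Metropolis step, which is exactly where the slow mixing comes from.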

Those aren’t in Stan yet (nor is it clear they will be), so you’ll have to ask @avehtari. There are some evaluations that @yuling did of ADVI using a large set of models, but I’m not sure how he did it. Those were drawn largely from the example models repo.

The first step should be to evaluate with the stat comp benchmark models.

There is some tooling on top of that in this repo.

Yes, particle filters are also what I’m thinking about. The point I’m trying to make is that there are applications where scalability is less important than making sensitivity-free solvers work, and maybe those applications (say, geophysics) can be in the purview of Stan if we include other samplers.

There are roughly 200 example models in total and HMC has pathological results on a few of them.


How about ADVI’s performance on them?

@yizhang: We are actively working on improving particle filters to incorporate, for example, NUTS as a proposal distribution and to achieve variance below the currently perceived lower bound. I’d like Stan to be able to describe problems that involve unbounded streams of incremental data (with and without time-varying states) as well as facilitate, for example, PMCMC, where we might use NUTS in the proposal for both the particle filter and for the MCMC component (the latter requires the gradient of the particle filter’s log likelihood, which we know how to calculate).
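For anyone who hasn’t met PMCMC: the core trick is pseudo-marginal Metropolis-Hastings, where an unbiased particle-filter estimate of the likelihood stands in for the exact likelihood. A toy sketch (a linear-Gaussian state-space model with a random-walk proposal; the model, names, and tuning constants are purely illustrative, not Stan code, and a NUTS-style proposal would additionally need the gradient of the estimated log likelihood mentioned above):

```python
import numpy as np

def pf_loglik(y, theta, n_particles=200, seed=0):
    """Bootstrap particle filter log-likelihood estimate for a toy model:
       x_t = theta * x_{t-1} + N(0, 1),   y_t = x_t + N(0, 1).
    The estimate is unbiased, which is what pseudo-marginal MCMC needs."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n_particles)                          # x_0 ~ N(0, 1)
    loglik = 0.0
    for yt in y:
        x = theta * x + rng.standard_normal(n_particles)          # propagate
        logw = -0.5 * (yt - x) ** 2 - 0.5 * np.log(2 * np.pi)     # observation density
        m = logw.max()
        w = np.exp(logw - m)
        loglik += m + np.log(w.mean())            # log p(y_t | y_{1:t-1}) estimate
        x = rng.choice(x, size=n_particles, p=w / w.sum())        # resample
    return loglik

def particle_mh(y, theta0, n_iter=2000, step=0.1, seed=0):
    """Particle-marginal Metropolis-Hastings on theta with a flat prior and a
    random-walk proposal; the exact likelihood is replaced by the particle
    filter estimate, re-using the current state's estimate as required."""
    rng = np.random.default_rng(seed)
    theta, ll = theta0, pf_loglik(y, theta0, seed=seed)
    draws = np.empty(n_iter)
    for t in range(n_iter):
        prop = theta + step * rng.standard_normal()
        ll_prop = pf_loglik(y, prop, seed=int(rng.integers(1 << 31)))  # fresh PF randomness
        if np.log(rng.uniform()) < ll_prop - ll:
            theta, ll = prop, ll_prop
        draws[t] = theta
    return draws
```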

@Bob_Carpenter (and with more relevance to this thread’s title): as @Malcolm_S highlights, HINTS differs from the algorithms described in the “Fundamental Incompatibility of Hamiltonian Monte Carlo and Data Subsampling” note, and I’m not convinced that the counter-argument against data subsampling holds for HINTS. We’re working to see what the truth is.

This would be a very nice thing to have. Some things I’m working on (stochastic differential equation models and general HMMs) could benefit hugely from online updating, as the current sampling framework in Stan is geared more toward batch mode.

I didn’t check it explicitly, as we typically view NUTS as a gold standard for approximation algorithms (e.g. VI).

For exact inference, SBC (Talts et al.) is more appropriate for evaluation.
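A minimal sketch of what SBC checks, using a conjugate toy model so the exact posterior is available; in practice the exact posterior draws below would be replaced by the output of the sampler under test (the function name and constants here are just for illustration):

```python
import numpy as np

def sbc_ranks(n_reps=200, n_draws=100, seed=0):
    """Simulation-based calibration (Talts et al.) for a conjugate toy model,
    theta ~ N(0, 1) and y | theta ~ N(theta, 1), where the exact posterior is
    N(y/2, 1/2). If the sampler is correct, the ranks are uniform on 0..n_draws."""
    rng = np.random.default_rng(seed)
    ranks = np.empty(n_reps, dtype=int)
    for r in range(n_reps):
        theta_true = rng.standard_normal()              # draw from the prior
        y = theta_true + rng.standard_normal()          # simulate one observation
        post = rng.normal(y / 2, np.sqrt(0.5), size=n_draws)   # "sampler" output
        ranks[r] = np.sum(post < theta_true)            # rank statistic
    return ranks   # check for uniformity, e.g. with a histogram
```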


All,
This was a painful thread to read, and thankfully the ‘tone’ has calmed.
I want to remind everyone that forum posts are a very volatile form of communication because much gets read into the little parts.
Please be kind in your communications.

That said I’d like to offer my perspective:

  • The Stan project is very resource limited and big changes can easily be seen as threats to Stan. HINTS represents a huge change and as such has induced an ‘immune system’ response of overt skepticism, in my opinion.

  • Limited resources lead to conservative resource allocation. We want low risk/high reward on the engineering side.

  • The Stan project is just getting up to speed on our ‘model database’ of 200 or so models for inference evaluation. It is not ready. We mention it because it is the right way to evaluate HINTS, but it is very preliminary.

  • The Stan project does, however, aspire to a well-documented and supported code base for developers, because that is essential to making Stan work with limited resources: we want external contributions. We want to help on the technical side.

  • HINTS is exploring a part of sampling space that makes many of our senior contributors nervous, but no decision has been made excluding it. Some of our other senior contributors are excited by the possibility. Stan is, as you are seeing, a community, and no one speaks for it as an individual even though they may express themselves that way.

I suggest that the HINTS team start working their way through the diagnostic models while getting the help they need to interface with the parts of Stan that allow them to experiment. Until the Stan community starts seeing new modeling territory being explored successfully the skepticism will remain.

Also consider helping Stan function as it is before trying to change how we do things. Pick up some simple issues and submit some pull requests. We need to get to know each other a bit. We have coding standards that take a while to learn.

I for one am rooting for you, although I am skeptical. I think Bayesian modeling is due for another breakthrough: NUTS around 2012, Gibbs in the 1990s, Metropolis-Hastings in the ??50s-70s??, Galton boxes in the 1890s, and Laplace reading Bayes’ notes in the 1790s. Nice exponential scale; 2020 for the next big thing?

Breck


The reason we haven’t gone down this road before is that there seem to be reasonable packages for doing this in Python like emcee.
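For example, a minimal emcee run (emcee 3.x API, toy target) looks roughly like this; no gradients are needed because the ensemble’s affine-invariant stretch moves adapt to the target’s scale and correlations:

```python
import numpy as np
import emcee

def log_prob(theta):
    # Toy target: standard normal in 5 dimensions.
    return -0.5 * np.sum(theta ** 2)

ndim, nwalkers = 5, 32
p0 = np.random.randn(nwalkers, ndim)                       # initial walker positions
sampler = emcee.EnsembleSampler(nwalkers, ndim, log_prob)  # affine-invariant ensemble
sampler.run_mcmc(p0, 5000)
samples = sampler.get_chain(discard=1000, flat=True)       # (n_kept * nwalkers, ndim)
```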

The difficulty for Stan is in how we incorporate black-box functions in C, Fortran, etc., which is the baggage most of these models have, and how we control for their not supporting derivatives so people don’t use them in cases where they won’t work with Stan.

Then the question arises as to how useful the Stan modeling language is at that point.

Are you on the right thread? If so, what did you find painful? I’m trying to be constructive, not contentious.

I don’t think anyone’s seeing this as a threat. If we can add particle methods to Stan that improve sampling performance, that’d be awesome and we’d welcome it. It’s not like this is going to present any kind of interface challenge.

This is because nothing else has ever been shown to work in high dimensions other than HMC and there are a lot of reasons to believe particle-based methods won’t work. There’s nothing particular about this case—I haven’t even read the algorithm details.

The burden of proof here really has to be on the contributor, as I get a lot of people reaching out telling me I should be using their sampling idea, and I’m not even the sampling expert on Stan. To that end, we need to make evaluation easier, which we’re trying to do. We’ll talk again with people if they can pass the simple stat_comp_benchmarks for accuracy and show promise for performance beyond what Stan can already do.

If you’re talking about the one Aki’s working on, I don’t think that’s being set up for evaluation. There were all sorts of requirements being thrown around, like having a place for people to contribute models. What we need for evaluation is models where we know the right answer.

I don’t understand why you think anyone’s nervous about this. I think we’re just waiting for the proposed evaluations and trying to explain what those are going to have to look like.

I’m down with the sentiment here, even if Breck is stating it as if he’s speaking for the project :-). I hope I’m not giving anyone the impression that I’m trying to speak for the project as a whole.

P.S. On the timeline, Michael wrote a nice history of MCMC you might be interested in reading. Metropolis was 1953; the Hastings generalization was 1970. HMC was introduced in 1987, about the same time as Gibbs sampling, but the auto-tuning version, NUTS, didn’t show up until 2012. We really needed the autodiff and the language to make HMC/NUTS practical (ADMB got a long way there conceptually with autodiff and early HMC implementations).


The grant that we were discussing recently with @andrewgelman has an optional item for PDEs. So if we don’t consider those black-box solvers, we’ll have to take the route of doing that in Stan, which will essentially make the feature useless (except to tick the checkbox). Moreover, there are some black-box third-party/commercial sensitivity-enabled PDE solvers (I used to work on such a development), so it’s not completely without merit to consider supporting them.

I think Stan is incredibly useful even if it supports external functions only when they come with gradients. The idea is to increase interoperability with other packages, not to encourage people to drop into C++ to hand-code parts that Stan can’t express.

Back on topic: @s.maskell was concerned that his group’s algorithms weren’t being judged on the same benchmarks as @wds15’s parallelization of NUTS, and said that it was a judgment call whether something constitutes a new algorithm or not. Sebastian’s work actually returns the exact same numerical results, sample by sample and leapfrog by leapfrog, so I don’t think it requires a judgment call to say that it’s the same algorithm. Passing the stat_comp_benchmark tests seems like a great first goalpost for any new algorithm that doesn’t return the same results to be included in the Stan source; but of course Stan is architected in such a way that it should be relatively easy to take the Stan compiler and Math library and use them together with one’s own algorithms.

I think it’s being set up for evaluation and more (progress is a bit slow in September as the teaching and grant-deadline season has started).

extern in stan3 is supposed to do this, isn’t it?

There should be something like that, but its design hasn’t been fully fleshed out. If you have an idea and want to write a design doc for how it should work to get the conversation started, that’d be awesome!