Heuristics for the advantages of NUTS/HMC/Stan vs RWMH

I 100% agree with this! I think the best you can say is that if you miss some of the posterior mass you potentially miss some structure in the posterior predictive (which can manifest in overfitting), but Bayesian models are not more robust than non-Bayesian models. Statistics is hard no matter what toolbox you use.

Yes, which means that most ML analyses are actually utilizing models that are defined only implicitly via the specific combination of base model and algorithm.

This is bad enough for interpretability and reproducibility, but it gets worse when compounded with non-identifiable objectives. Many model configurations will look good on a single set of test data, but most of those will do poorly when generalized to other sets of test data (an immediate argument from our friend concentration of measure!), so optimizing an objective defined via a single test set can lead to fragile generalization.
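To make that concrete, here’s a toy sketch (purely synthetic scores, not tied to any particular model or dataset): a configuration chosen because it wins on one test split will often not be the configuration that holds up when the same evaluation is repeated across many splits.

```python
import numpy as np

def test_score(config, split_seed):
    # Stand-in for "fit with this config, evaluate on this held-out split":
    # an underlying quality (peaked at config = 0.5) plus split-specific noise.
    true_quality = -abs(config - 0.5)
    noise = np.random.default_rng([split_seed, int(config * 1000)]).normal(0.0, 0.3)
    return true_quality + noise

configs = np.linspace(0.0, 1.0, 21)

# "Winner" chosen on a single test split vs. the winner averaged over many splits.
single_split_winner = max(configs, key=lambda c: test_score(c, split_seed=1))
averaged_winner = max(configs, key=lambda c: np.mean([test_score(c, s) for s in range(100)]))

print(single_split_winner, averaged_winner)
```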

Let’s hold off on the self-driving car for the moment and focus on Go. AlphaGo isn’t some canned deep learning model – it is a carefully tuned reinforcement learning model with hand-crafted strategies based on expert Go knowledge and deep learning to fill in some of the gaps. Not an unimpressive feat, but at the same time not as simple as throwing deep learning at some Kaggle experiment!

Note also that these methods end up using much stronger objectives in practice that interrogate more of the predictive distribution, and hence improve the utility of the resulting fit.

Sure, but who’s arguing that? The argument is what’s better for decision making within the confines of given assumptions: using all the model configurations consistent with the data or just one/a few of them?

Definitely. But following that logic, doesn’t a point estimate miss quite a bit of posterior mass, like all of it measure-theoretically, and hence isn’t it massively susceptible to overfitting?

If I had a $ for every time I told someone that, I’d have dozens of $s. I dislike the hype about just throwing deep learning at a problem. It’s all about engineering the predictors (inputs, features) and the network topology.

Won’t that depend on the model? I’m imagining you have wildly misspecified tails, which could then cause worse inference than just fitting an MLE and plugging in the estimate.

But how would you know that the tails are misspecified but the bulk is fine? This argument is common in many fields, chiefly econometrics and machine learning, but it relies on the naive intuition that the bulk is somehow much easier to model than the tails. Both can be hard to model in practice, and the only way to evaluate the model robustly is to be able to construct accurate predictions through accurate fits!

Relying on point estimates convolves the purported model with the fitting algorithm to define some implicit model, confusing what properties of each contribute to that final model and hence impeding accurate model validation.

I’m agreeing with you here :-)

What drove me crazy in ML was not only the fact that they were convolved, but that a single operating point (let’s say 17 iterations when 1–100 were tried) was cherry-picked (under X-validation) and then used as the bold-face my-system-wins point in NIPS-like papers.

You wouldn’t. There’s no way to test the amount of uncertainty in a posterior for a real problem—it’s all “subjective” in the “subjective Bayes” sense (and by that, I mean it’s about our knowledge at any given time, not that it’s somehow touchy-feely belief stuff). You can test on held-out data, and that’s what I’d recommend. It does seem to be the gold-standard in all of this system evaluation stuff. It’s just that I’d want to measure calibration of uncertainty, not the point estimates.
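For what it’s worth, here is a minimal sketch of what I mean by measuring calibration on held-out data (toy normal setup; the `pred_draws`/`y_holdout` names and shapes are hypothetical): check how often the held-out observations land inside the central predictive intervals.

```python
import numpy as np

def interval_coverage(pred_draws, y_holdout, prob=0.90):
    """pred_draws: (n_draws, n_points) posterior predictive samples;
    y_holdout: (n_points,) held-out observations."""
    alpha = (1.0 - prob) / 2.0
    lo, hi = np.quantile(pred_draws, [alpha, 1.0 - alpha], axis=0)
    return np.mean((y_holdout >= lo) & (y_holdout <= hi))

# Sanity check on a toy, well-specified setup: coverage should be near 0.90.
rng = np.random.default_rng(0)
y = rng.normal(0.0, 1.0, size=500)              # "held-out" observations
pred = rng.normal(0.0, 1.0, size=(4000, 500))   # matching predictive draws
print(interval_coverage(pred, y))               # roughly 0.90 if calibrated
```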

I’m just saying it could happen that inferences with an MLE would be fine, but inferences with fat tails in Bayes would blow up. I don’t have an example in mind and it’s not something that keeps me up at night because the models I work with tend to be much better behaved than this.

I think what would be really helpful would be examples where Bayes gives clearly better inferences than taking point estimates. One example is in small-count binomial models—if you take point estimates, they’re going to underestimate uncertainty, which can be seen pretty clearly with real data calibrations (I go over an example in my repeated binary trials case study). Are there other clear examples like this we should be pointing people to?
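For concreteness, here’s a minimal sketch of that small-count point (invented counts and a flat Beta(1, 1) prior, not the case study itself): the plug-in MLE predictive puts less mass on extreme outcomes than the full posterior predictive, i.e., it understates the uncertainty.

```python
import numpy as np
from scipy import stats

y, n = 2, 10                        # small observed counts
theta_mle = y / n                   # point estimate of the success probability

# Probability of seeing 0 successes in 10 future trials:
plug_in = stats.binom.pmf(0, 10, theta_mle)

# Full posterior predictive: average over theta ~ Beta(1 + y, 1 + n - y).
draws = np.random.default_rng(1).beta(1 + y, 1 + n - y, size=100_000)
posterior_pred = stats.binom.pmf(0, 10, draws).mean()

print(plug_in, posterior_pred)      # the plug-in predictive is the smaller of the two
```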

The problem we’ll have is that we have to take on the ML people on their own turf—we need to make better point predictions. If we don’t do that, they won’t care.

Yes, it could happen. The question is whether or not it’s sufficiently common to assume as a default, which is what many in econometrics, machine learning, etc. do. I argue that it’s not all that common, and more importantly the only way to validate is to be able to accurately fit the full model!

This is one of the problems with machine learning – they gratuitously move goal posts. If one argues that it is indeed prediction that matters, then there are many cases where we can show that full predictive distributions are more powerful than point predictions. But if you then artificially say that you care only about point predictions, it gets trickier. Especially since all of the benefits of Bayes can then be cast away as forms of regularization (itself a very common perspective in much of machine learning!).

I completely agree here. I also don’t think it’s going to be common.

But what I’d like to see is more motivating examples of where Bayesian inference propagating uncertainty is a clear win in prediction tasks.

Maybe we could formulate this in terms of discrete decisions instead? It’s not the point estimates per se that I’m concerned about so much as predictions. As in “what’s the chance of rain today?” (probabilistic prediction) or “should my self-driving car drive itself off the road so it doesn’t kill that schoolbus full of children” (discrete decision).

+1 here! On top of that:

  • are we actually going to make a wrong decision if we don’t use Bayes?
  • how do the decisions differ between going Bayesian and using something else / a point estimate?

Ideally, this can be calculated in an objective fashion by some cost function, for example.
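For instance, a minimal sketch in the spirit of the umbrella example above (all counts and costs invented): with an explicit cost function, the point-estimate route and the route that keeps parameter uncertainty can recommend different actions.

```python
# Data: rain on 0 of the last 5 comparable days.
rained, days = 0, 5

# Costs: carrying an umbrella is a nuisance (1); getting soaked is worse (20).
COST_UMBRELLA, COST_SOAKED = 1.0, 20.0

def expected_cost(action, p_rain):
    return COST_UMBRELLA if action == "umbrella" else COST_SOAKED * p_rain

def decide(p_rain):
    return min(["umbrella", "no umbrella"], key=lambda a: expected_cost(a, p_rain))

# Point-estimate route: plug in the MLE of the rain probability.
p_mle = rained / days                  # 0.0 -> "it never rains here"
# Bayesian route: posterior predictive P(rain) under a flat Beta(1, 1) prior.
p_bayes = (rained + 1) / (days + 2)    # Laplace's rule of succession, about 0.14

print(decide(p_mle))    # "no umbrella": zero estimated risk
print(decide(p_bayes))  # "umbrella": residual uncertainty keeps the risk non-negligible
```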

How about what Turing did?

Which of Turing’s many doings are you talking about?