Heuristics for the advantages of NUTS/HMC/Stan vs RWMH


I 100% agree with this! I think the best you can say is that if you miss some of the posterior mass you potentially miss some structure in the posterior predictive (which can manifest in overfitting), but Bayesian models are not more robust than non-Bayesian models. Statistics is hard no matter what toolbox you use.


Yes, which means that most ML analyses are actually utilizing models that are defined only implicitly via the specific combination of base model and algorithm.

This is bad enough for interpretability and reproducibility, but it gets worse when compounded with non-identifiable objectives. Many model configurations will look good on a single set of test data, but most of those will do poorly when generalized to other sets of test data (an immediate argument of our friend concentration of measure!), so optimizing an objective defined via a single test set can lead to fragile generalization.
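
A toy simulation makes the single-test-set problem concrete. This is only a sketch under made-up assumptions: the "models" have no skill at all (each prediction is right with probability 0.5), and the names and sizes (N_MODELS, N_TEST, N_FRESH) are illustrative, not taken from any real benchmark.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

N_MODELS = 200    # hypothetical number of model configurations tried
N_TEST = 100      # size of the single test set used for selection
N_FRESH = 10_000  # size of a fresh test set used afterwards


def accuracy_of_guesser(n_points, rng):
    """Accuracy of a no-skill classifier: each prediction is correct w.p. 0.5."""
    correct = sum(rng.random() < 0.5 for _ in range(n_points))
    return correct / n_points


# Score every configuration on the same small test set and keep the winner.
test_scores = [accuracy_of_guesser(N_TEST, rng) for _ in range(N_MODELS)]
best_test_score = max(test_scores)

# The winner has no real skill, so on fresh data it is just another guesser.
fresh_score = accuracy_of_guesser(N_FRESH, rng)

print(f"best accuracy on the selection test set: {best_test_score:.3f}")
print(f"accuracy of the winner on fresh data:    {fresh_score:.3f}")
```

The winning score on the small test set sits well above 0.5 purely because of selection, and the advantage disappears on fresh data.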

Let’s hold off on the self-driving car for the moment and focus on Go. AlphaGo isn’t some canned deep learning model – it is a carefully tuned reinforcement model with hand-crafted strategies based on expert Go knowledge and deep learning to fill in some of the gaps. Not an unimpressive feat, but at the same time not as simple as throwing deep learning on some Kaggle experiment!

Note that these methods also end up using much stronger objectives in practice, which interrogate more of the predictive distribution and hence improve the utility of the resulting fit.

Sure, but who’s arguing that? The argument is what’s better for decision making within the confines of given assumptions: using all the model configurations consistent with the data or just one/a few of them?


Definitely. But following that logic, doesn’t a point estimate miss quite a bit of posterior mass (all of it, measure-theoretically) and hence become massively susceptible to overfitting?


If I had a $ for every time I told someone that, I’d have dozens of $s. I dislike the hype about just throwing deep learning at a problem. It’s all about engineering the predictors (inputs, features) and the network topology.

Won’t that depend on the model? I’m imagining you have wildly misspecified tails, which could then cause worse inference than just fitting an MLE and plugging in the estimate for inference.


But how would you know that the tails are misspecified but the bulk is fine? This argument is common in many fields, chiefly econometrics and machine learning, but it relies on the naive intuition that the bulk is somehow much easier to model than the tails. Both can be hard to model in practice, and the only way to evaluate the model robustly is to be able to construct accurate predictions through accurate fits!

Relying on point estimates convolves the purported model with the fitting algorithm to define some implicit model, confusing what properties of each contribute to that final model and hence impeding accurate model validation.


I’m agreeing with you here :-)

What drove me crazy in ML was not only the fact that they were convolved, but that a single operating point (let’s say 17 iterations when 1–100 were tried) was cherry-picked (under cross-validation) and then used as the bold-face my-system-wins point in NIPS-like papers.

You wouldn’t. There’s no way to test the amount of uncertainty in a posterior for a real problem—it’s all “subjective” in the “subjective Bayes” sense (and by that, I mean it’s about our knowledge at any given time, not that it’s somehow touchy-feely belief stuff). You can test on held-out data, and that’s what I’d recommend. It does seem to be the gold-standard in all of this system evaluation stuff. It’s just that I’d want to measure calibration of uncertainty, not the point estimates.

I’m just saying it could happen that inferences with an MLE would be fine, but inferences with fat tails in Bayes would blow up. I don’t have an example in mind and it’s not something that keeps me up at night because the models I work with tend to be much better behaved than this.

I think what would be really helpful would be examples where Bayes gives clearly better inferences than taking point estimates. One example is small-count binomial models: if you take point estimates, they’re going to underestimate uncertainty, which can be seen pretty clearly with real data calibrations (I go over an example in my repeated binary trials case study). Are there other clear examples like this we should be pointing people to?
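
To make the small-count binomial point concrete, here is a rough sketch (not the code from the case study). It compares nominal 90% prediction sets built from the plug-in MLE with sets built from a beta-binomial posterior predictive. The uniform Beta(1, 1) prior, the true rate of 0.1, and the sizes N and M are all assumptions chosen for illustration. Coverage is computed exactly by summing over every possible observed count, so there is no simulation noise.

```python
from math import comb, exp, lgamma


def binom_pmf(k, n, theta):
    """Binomial probability of k successes in n trials."""
    return comb(n, k) * theta**k * (1 - theta) ** (n - k)


def log_beta(a, b):
    """Log of the beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def plugin_pmf(k, m, y, n):
    """Predictive for k successes in m new trials, plugging in the MLE y / n."""
    return binom_pmf(k, m, y / n)


def bayes_pmf(k, m, y, n, a=1.0, b=1.0):
    """Beta-binomial posterior predictive under an assumed Beta(a, b) prior."""
    a_post, b_post = a + y, b + n - y
    log_ratio = log_beta(a_post + k, b_post + m - k) - log_beta(a_post, b_post)
    return comb(m, k) * exp(log_ratio)


def prediction_set(pmf, m, level=0.9):
    """Smallest set of outcomes, most probable first, with mass >= level."""
    ranked = sorted(((pmf(k), k) for k in range(m + 1)), reverse=True)
    chosen, total = set(), 0.0
    for p, k in ranked:
        chosen.add(k)
        total += p
        if total >= level:
            break
    return chosen


def coverage(predictive, theta, n, m, level=0.9):
    """Exact chance that the new count lands in the set, averaged over data y."""
    total = 0.0
    for y in range(n + 1):
        chosen = prediction_set(lambda k: predictive(k, m, y, n), m, level)
        hit = sum(binom_pmf(k, m, theta) for k in chosen)
        total += binom_pmf(y, n, theta) * hit
    return total


THETA, N, M = 0.1, 10, 10  # assumed true rate, observed trials, future trials
plugin_cov = coverage(plugin_pmf, THETA, N, M)
bayes_cov = coverage(bayes_pmf, THETA, N, M)
print(f"plug-in coverage of nominal 90% sets:   {plugin_cov:.3f}")
print(f"posterior predictive coverage of same:  {bayes_cov:.3f}")
```

The worst case for the plug-in approach is y = 0: the MLE is exactly zero, so its predictive puts all mass on zero future successes and the "90%" set misses every other outcome.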

The problem we’ll have is that we have to take on the ML people on their own turf—we need to make better point predictions. If we don’t do that, they won’t care.


Yes, it could happen. The question is whether or not it’s sufficiently common to assume as a default, which is what many in econometrics, machine learning, etc. do. I argue that it’s not all that common, and more importantly the only way to validate is to be able to accurately fit the full model!

This is one of the problems with machine learning: they gratuitously move the goal posts. If one argues that it is indeed prediction that matters, then there are many cases where we can show that full predictive distributions are more powerful than point predictions. But if you artificially say that you care only about point predictions, then it gets trickier, especially since all of the benefits of Bayes can then be cast away as forms of regularization (itself a very common perspective in much of machine learning!).


I completely agree here. I also don’t think it’s going to be common.

But what I’d like to see is more motivating examples of where Bayesian inference propagating uncertainty is a clear win in prediction tasks.

Maybe we could formulate this in terms of discrete decisions instead? It’s not the point estimates per se that I’m concerned about so much as predictions. As in “what’s the chance of rain today?” (probabilistic prediction) or “should my self-driving car drive itself off the road so it doesn’t kill that schoolbus full of children” (discrete decision).


+1 here! On top:

  • are we going to make a wrong decision after all if we don’t use Bayes?
  • how do decisions differ between going Bayes and something else, e.g. a point-estimate-based approach?

Ideally, this can be calculated in an objective fashion by some cost function, for example.
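
As a rough illustration of comparing decisions through a cost function, here is a minimal sketch. Everything in it is a made-up assumption: a Beta(1, 1) prior, 0 failures observed in 5 trials, 10 future trials, and the two costs. The decision is whether to pay a fixed cost to protect against at least one failure in the future trials.

```python
from math import exp, lgamma


def log_beta(a, b):
    """Log of the beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def prob_any_failure_plugin(theta, m):
    """P(at least one failure in m trials) with the failure rate fixed at theta."""
    return 1.0 - (1.0 - theta) ** m


def prob_any_failure_posterior(y, n, m, a=1.0, b=1.0):
    """Same probability averaged over the Beta(a + y, b + n - y) posterior.

    Uses E[(1 - theta)^m] = B(a', b' + m) / B(a', b') for theta ~ Beta(a', b').
    """
    a_post, b_post = a + y, b + n - y
    return 1.0 - exp(log_beta(a_post, b_post + m) - log_beta(a_post, b_post))


def best_action(p_failure, cost_protect, cost_failure):
    """Pick the action with the smaller expected loss."""
    expected_loss = {
        "protect": cost_protect,
        "do nothing": cost_failure * p_failure,
    }
    return min(expected_loss, key=expected_loss.get)


Y, N, M = 0, 5, 10                      # hypothetical data and horizon
COST_PROTECT, COST_FAILURE = 1.0, 5.0   # hypothetical costs

p_plugin = prob_any_failure_plugin(Y / N, M)   # the MLE is 0, so this is 0
p_bayes = prob_any_failure_posterior(Y, N, M)  # averages over the posterior

plugin_action = best_action(p_plugin, COST_PROTECT, COST_FAILURE)
bayes_action = best_action(p_bayes, COST_PROTECT, COST_FAILURE)
print(f"plug-in:   P(any failure) = {p_plugin:.3f} -> {plugin_action}")
print(f"posterior: P(any failure) = {p_bayes:.3f} -> {bayes_action}")
```

With these numbers the two approaches choose different actions, which is the kind of difference the questions above are asking about. Whether that happens in a real problem depends entirely on the model and the costs.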


How about what Turing did?


Which of Turing’s many doings are you talking about?