Why are Bayesian Neural Networks multi-modal?


Thanks for the reference. Andrew would love this quote from the slides:

You don’t want to think too hard about your prior. The bad effects of bad priors may be greater in high dimensions.

I’d think that’d make you want to think harder, not less hard!

I couldn’t quite follow the language of the rest of it, but it looks like binary problems and it looks like they did some dimensionality reduction up front, but maybe I misunderstood—there’s about two dozen techniques mentioned in there.

I perhaps overstepped trying to do all 10 digits for MNIST. Binary would be much easier.


It doesn’t matter how hard you think, the second sentence is true anyway :)

Unfortunately they were less open access at that time and it’s not easy to find the more detailed description, but it exists somewhere. Maybe we could get some student to do simple comparison even just with the examples from FBM manual.


Being a phycisist, I found the observations and analogies recently outlined in “Comparing Dynamics: Deep Neural Networks versus Glassy Systems” particular insightful and thought their observations might help in this discussion or actually might force us to revise our assumptions or intuitions regarding pathologies and inefficiencies in approximating the posterior of BNNs:

We analyze numerically the training dynamics of deep neural networks (DNN) by using methods developed in statistical physics of glassy systems. The two main issues we address are the complexity of the loss-landscape and of the dynamics within it, and to what extent DNNs share similarities with glassy systems. Our findings, obtained for different architectures and datasets, suggest that during the training process the dynamics slows down because of an increasingly large number of flat directions. At large times, when the loss is approaching zero, the system diffuses at the bottom of the landscape. Despite some similarities with the dynamics of mean-field glassy systems, in particular, the absence of barrier crossing, we find distinctive dynamical behaviors in the two cases, showing that the statistical properties of the corresponding loss and energy landscapes are different. In contrast, when the network is under-parametrized we observe a typical glassy behavior, thus suggesting the existence of different phases depending on whether the network is under-parametrized or over-parametrized.

If in the overparametrized case the problem is really due to “an increasingly large number of flat directions”, wouldnt HMC or NUTS actually be able to cope with this? I believe the barrier crossing they refer to in the underparametrized case corresponds to the previously mentioned multimodality.

I still have to read the paper in-depth…

Clearly, as mentioned before, I’m aware that such overparam. BNNs are not efficiently implementable in Stan due to AutoDiff. Also I know they don’t speak about BNNs and only about the loss function of NNs, but nevertheless the loss landscape should be closely related to the energy of the model in an Bayesian setting, when using HMC language.


I don’t know what “glassy behavior” is, so the analogy/analysis doesn’t help.

Stan has good support for the required autodiff. We don’t have it finely tuned for the logistic regression case, but we’re actually about to roll that out (that is, a bernoulli_logit_glm_lpmf with analytic derivatives w.r.t. parameters).

What we don’t have is all the specialized training and arithmetic that’s required to scale them. Nor do we have any network topology-based primitives. Nor are we primarily concerned with point estimates.

For HMC, the potential energy is just the negative log density, which should be the same as the loss in a probabilistic model.

I’m not sure what this means. You mean you have wide hidden layers so you have many ways to parameterize the same function?


Yes that’s how I understand it, but I haven’t found a precise definition of this.

This might be related in the overparamtrezied case?!


HMC is very good at following connected regions of high probability to explore a given probability distribution. If that probability is distributed diffusely with a complex geometry, however, then HMC will take forever to explore all of that probability!

Yes there are all kinds of indications that glassier-ish loss surfaces can help optimization find better neighborhoods, at least if the saddle points can be navigated sufficiently well (which is hard given that second-order information, or a proxy thereof, is often needed). But optimization just needs to follow one path and report one optimum. It can find a good optimum even while ignoring all of the paths and be reasonably computationally efficient.

Any Bayesian quantification, however, will want to quantify all of those paths and all of the optima. Again, HMC is the best tool we have for this but the problem is so ill-posed that it will struggle to do anything reasonable with finite computational resources.


Maybe an example will help.

In a regression, if I have two correlated predictors, say age and income, then I can move one predictor’s coefficient up and other one down and get roughly the same predictions.

Now imagine all those predictors are all latent. I can not only move the coefficients up and down, I can move the predictors themselves up and down. Or like in the income and education case, spread an effect over multiple predictors. Given that each of these operations is a subset-type operation (and therefore exponential), you can imagine the multimodality you run into.

Overparameterizations can be problematic if you’re using second order methods, because they defeat positive definteness. I’ve never seen them be useful, but Aki gets out more than me.


In this regard, I was wondering if optimization appears to be easier to achieve in such systems, why can one not achieve sampling from the posterior through running an ensemble of finely tuned optimization problems (e.g. through random variations of the corresponding energy function / log of posterior)?

As a result of a quick google search I found this:

“The problem of drawing samples from a discrete distribution can be converted into a discrete optimization problem [1, 2, 3, 4]. In this work, we show how sampling from a continuous distribution can be converted into an optimization problem over continuous space. Central to the method is a stochastic process recently described in mathematical statistics that we call the Gumbel process. We present a new construction of the Gumbel process and A∗ Sampling, a practical generic sampling algorithm that searches for the maximum of a Gumbel process using A∗ search. We analyze the correctness and convergence time of A∗ Sampling and demonstrate empirically that it makes more efficient use of bound and likelihood evaluations than the most closely related adaptive rejection sampling-based algorithms.”

See also https://stats.stackexchange.com/questions/112476/do-optimization-techniques-map-to-sampling-techniques


Why would an ensemble of optimizers, all tuned to the same objective and ending up in different local optima only because of different initial conditions, end up terminating in states distributed according to the target distribution?


I wasn’t referring to varying initial conditions, but varying the loss surface or energy function, so that each optimization has a randomly shifted optimum to find. Something along the lines like mentioned in the above A* paper:

This work approaches the problem from a different perspective. Specifically, it is inspired by an algorithm for sampling from a discrete distribution that is known as the Gumbel-Max trick. The algorithm works by adding independent Gumbel perturbations to each configuration of a discrete negative energy function and returning the argmax configuration of the perturbed negative energy function. The result is an exact sample from the corresponding Gibbs distribution. Previous work [1, 3] has used this property to motivate samplers based on optimizing random energy functions but has been forced to resort to approximate sampling due to the fact that in structured output spaces, exact sampling appears to require instantiating exponentially many Gumbel perturbations.
Our first key observation is that we can apply the Gumbel-Max trick without instantiating all of the (possibly exponentially many) Gumbel perturbations. The same basic idea then allows us to extend the Gumbel-Max trick to continuous spaces where there will be infinitely many independent perturbations. Intuitively, for any given random energy function, there are many perturbation values that are irrelevant to determining the argmax so long as we have an upper bound on their values. We will show how to instantiate the relevant ones and bound the irrelevant ones, allowing us to find the argmax — and thus an exact sample.


A* scales terribly with dimension due to its construction and also requires a large number of solves to get a reasonably accurate sample. It wouldn’t even be applicable to systems large enough to be relevant to this discussion.


Thanks for clarifying. What about this idea: Bayesian Learning via Stochastic Gradient Langevin Dynamics?


The first paragraph of this blog post shows that constant-rate gradient flow with Gaussian noise has a nice stationary distribution, which is the argument behind the Langevin sampling methods. My issue with this whole line of work is that there is an acute difference between asymptotically correct sampling and efficient sampling. I think any kind of mode-biased sampling is going to be a real issue for tail and mode exploration (practically, if not theoretically). But that’s just my gut feeling.

I have severely mixed feelings about this paper on converting mini-batch stochastic gradients into inference, but if you are going down this road it might be worth a look.

Given that neural network parameters are usually entirely artificial constructs, I do not think exploring around the mode leads to any kind of reasonable uncertainty quantification, although it might make the predictive model more robust. I think the real gain is if the full posterior could be explored, using the NN as a proxy for a function space, and it’s possible that you could make do with smaller networks if you got to do full Bayesian model averaging.

Prior specification is going to be hell though; seems insurmountable to go much beyond weight decay, unless we could code in particular properties like identifiability or output scale.


Stochastic gradient methods do not scale to high-dimensional problems, http://proceedings.mlr.press/v37/betancourt15.pdf.


Do those intuitions apply to e.g. MALA or other gradient-based samplers as well? Not that there’s anything else that is really competitive with HMC at the moment, but it’d be nice to get an idea about the scope of the negative result.


Because MALA applies a correction after every step is moves slowly enough that subsampling errors can be controlled, but you’re still just moving slowly. If you don’t correct for the error then you get a nasty error drift that manifests in an asymptotic bias. This was demonstrated analytically in some of the followups to the Welling and Teh paper.