I have read many times that people associate Bayesian neural networks with sampling problems for the induced posterior, due to its multimodal structure.
I understand that this poses severe problems for MCMC sampling, but I feel I do not understand the mechanisms leading to it.
Are there mechanisms in NNs, other than combinatorial ones, that might lead to a multimodal posterior? By combinatorial I mean the invariance under relabeling of the hidden neurons in fully connected NNs, see also
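To make the relabeling invariance I mean concrete, here is a small NumPy sketch (toy network, made-up sizes): permuting the hidden units, together with the matching rows and columns of the weight matrices, leaves the network function unchanged, so every posterior mode has one mirror image per permutation.

```python
import numpy as np

rng = np.random.default_rng(0)

# One-hidden-layer network: y = W2 @ tanh(W1 @ x + b1)
W1 = rng.normal(size=(4, 3))
b1 = rng.normal(size=4)
W2 = rng.normal(size=(1, 4))
x = rng.normal(size=3)

def net(W1, b1, W2, x):
    return W2 @ np.tanh(W1 @ x + b1)

# Relabel the hidden units with a random permutation
perm = rng.permutation(4)
y_orig = net(W1, b1, W2, x)
y_perm = net(W1[perm], b1[perm], W2[:, perm], x)

assert np.allclose(y_orig, y_perm)  # same function, distinct parameter point
```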
Think of the hidden layers as forming something like a mixed-membership mixture or factor model that’s sprinkled over the units. With this many degrees of freedom, you get lots of different ways to carve up the coefficients without even considering the label-switching problem.
Right, as @Bob_Carpenter and @avehtari have noted, the pathologies in the posteriors for these models run deep and you can experiment with them by considering a simple exchangeable mixture model, say a sum of Gaussians. Even ignoring the label switching you get these continuous non-identifiabilities that manifest as long, curving valleys in the posterior. If you throw dynamic Hamiltonian Monte Carlo at these posteriors it slows to a crawl as it tries to construct trajectories long enough to explore these expansive features.
Ultimately this is all a consequence of the flexibility of the models themselves – the more freedom you have in the network configuration to capture features, the more of those configurations will be consistent with any particular data set.
In the paper you linked we handled the combinatorial non-identifiability by placing ordering constraints on the biases of the network (just like in @betanalpha’s case study), but that non-identifiability only leads to discrete multimodality. What gives samplers real trouble is continuous multimodality, like the ReLU non-identifiability, which is the other case we handled in that paper.
The output of a ReLU unit is unchanged for all inputs when one simultaneously scales its input weights by r > 0 and its output weight by 1/r, since relu(r·z) = r·relu(z) for r > 0. This means you get a continuum of neural network parameters that are all equally good at explaining the data.
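You can check the rescaling invariance for a single hidden ReLU unit in a couple of lines (toy weights, any r > 0):

```python
import numpy as np

relu = lambda z: np.maximum(z, 0.0)
rng = np.random.default_rng(1)

w_in = rng.normal(size=3)   # input weights of one hidden ReLU unit
w_out = 0.7                 # its output weight
x = rng.normal(size=3)

r = 2.5  # any r > 0 gives the identical function
y1 = w_out * relu(w_in @ x)
y2 = (w_out / r) * relu((r * w_in) @ x)

assert np.isclose(y1, y2)  # a whole ray of parameters, one function
```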
When you have a lot of data this manifests as a posterior that looks like a very deep valley: the data identify the parameters in one direction, but they are completely non-identified in the other. The gradient of the log posterior is zero in the direction parallel to the valley. In the direction across the valley the gradient is zero exactly at the bottom, but changes rapidly as you move across (high curvature). Because the gradient changes so rapidly, you need a small step size for gradient descent or HMC, otherwise you’ll be thrown off in crazy directions. It’s the same problem as the neck of the funnel. Andrew Holbrook and Babak Shahbaba have a paper discussing the same problem in the context of PPCA, where you need an orthonormal matrix or you run into these same non-identifiability issues.
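The valley geometry is easy to reproduce in a toy problem of my own construction (not from the paper): suppose the data identify only the product a·b, mimicking the ReLU rescaling (a → r·a, b → b/r). At a point on the valley floor the Hessian has one near-zero eigenvalue (flat along the valley) and one huge eigenvalue (steep across it), which is exactly the step-size bind described above.

```python
import numpy as np

# Toy "scaling valley": the data only identify the product a*b
def neg_log_post(theta, sigma=0.05):
    a, b = theta
    return (a * b - 1.0) ** 2 / (2 * sigma ** 2)

def hessian(f, theta, eps=1e-4):
    # Central finite-difference Hessian
    n = len(theta)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            t = np.array(theta, float)
            t[i] += eps; t[j] += eps; fpp = f(t)
            t[j] -= 2 * eps;          fpm = f(t)
            t[i] -= 2 * eps;          fmm = f(t)
            t[j] += 2 * eps;          fmp = f(t)
            H[i, j] = (fpp - fpm - fmp + fmm) / (4 * eps ** 2)
    return H

# Evaluate at a point on the valley floor (a*b = 1)
H = hessian(neg_log_post, [2.0, 0.5])
eigs = np.sort(np.linalg.eigvalsh(H))

# One eigenvalue ~0 (flat along the valley), one large (steep across it);
# their ratio bounds the step size a gradient-based sampler can take.
assert eigs[0] < 1e-2 * eigs[1]
```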
In the paper you linked we tried to resolve this by forcing the input weights of the neural network to be unit-length vectors, so that we collapsed the valley into just a slice of it (the quotient space of the valley), and then sampled on that instead. Even then, however, there are still non-identifiabilities, like the singular-components one @betanalpha mentions in his mixture model case study. The way to understand that one: if you only need one ReLU to represent your function but you’re using two, then either one can serve as the one you need while the other is turned off. There are even more non-identifiabilities after that, and even then the posterior is still highly multimodal, so you’d need something like adiabatic Monte Carlo to explore it properly.
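A minimal sketch of the unit-length idea (my own simplification, not the paper’s exact construction): normalize the input weight vector to unit length and absorb its norm into the output weight. Every point on a given scaling valley then maps to the same representative, so the flat direction is gone.

```python
import numpy as np

rng = np.random.default_rng(2)

# Two parameter settings on the same ReLU scaling valley (r > 0)
w_in, w_out, r = rng.normal(size=3), 1.3, 4.0
pairs = [(w_in, w_out), (r * w_in, w_out / r)]

def canonical(w_in, w_out):
    # Unique representative: unit-length input weights, norm absorbed into output
    s = np.linalg.norm(w_in)
    return w_in / s, w_out * s

# Both points on the valley collapse to the same slice
c0 = canonical(*pairs[0])
c1 = canonical(*pairs[1])

assert np.allclose(c0[0], c1[0]) and np.isclose(c0[1], c1[1])
```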
An interesting side note, which @avehtari mentioned to me when I had a chance to talk to him at StanCon (glad there’s another StanCon soon!), is that for optimization the non-identifiabilities can actually be a boon. In his words, they serve as a “tunnel” between modes that lets the optimizer move rapidly through parameter space.
That’s one of the reasons we haven’t been exploring the non-identifiabilities lately. Also, for the problems I’ve worked on, I’ve found a principled Bayesian model to be far more useful than a neural network that just predicts means with no explanation of how. It’s also much more fun to iteratively build a Bayesian model that explains your data than to blindly guess and play around with neural network architectures and opaque hyperparameters. That being said, neural networks have been shown to be powerful function learners, at least in the image domain, when you have a lot of data with accurate labels. So it might be nice to explore how they can be used more robustly, perhaps as a black-box input into a Bayesian model, as @andrewgelman and @betanalpha have discussed before on the Gelman blog.
No. These features arise only when you have more than two components in a mixture model. Try generating data from a mixture model with three components (means and standard deviations both free parameters) and recovering the true values. That will give you a simple system which manifests the nasty pathologies.
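Generating such data takes only a few lines; here is a NumPy sketch (arbitrary true values) that also checks the label-switching symmetry directly: permuting the components leaves the likelihood unchanged, so every mode has permuted twins.

```python
import numpy as np

rng = np.random.default_rng(3)

# True parameters of a three-component Gaussian mixture
weights = np.array([0.5, 0.3, 0.2])
mus = np.array([-2.0, 0.0, 3.0])
sigmas = np.array([0.5, 1.0, 0.7])

# Simulate data; a sampler should try to recover mus and sigmas
n = 1000
z = rng.choice(3, size=n, p=weights)   # latent component labels
y = rng.normal(mus[z], sigmas[z])      # observations

# Mixture log-likelihood, the target a sampler explores
def log_lik(y, w, mu, sigma):
    comp = (np.log(w) - np.log(sigma) - 0.5 * np.log(2 * np.pi)
            - 0.5 * ((y[:, None] - mu) / sigma) ** 2)
    return np.logaddexp.reduce(comp, axis=1).sum()

# Any relabeling of the components yields the identical likelihood
perm = [2, 0, 1]
assert np.isclose(log_lik(y, weights, mus, sigmas),
                  log_lik(y, weights[perm], mus[perm], sigmas[perm]))
```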
I just came across the paper “Issues in Bayesian Analysis of Neural Network Models”, which, among many other interesting aspects, also discusses the problem of “node duplication” (Section 3.7). If I understand correctly, this is exactly what you referred to. Figure 5 in the paper is particularly enlightening in this regard.
In the paper they write:
We feel the conceptually clearest and straightforward approach is to explicitly include the number of hidden nodes M as a parameter in the model, i.e. use variable architecture NN models.
To this end, in Section 4 they discuss a particular construction that associates Bernoulli variables with the hidden nodes, effectively switching them on or off (which reminds me of Dropout, btw). Their conclusion, based on two examples, seems promising:
We illustrate the variable architecture model with two examples. The first one shows how model (5) may adapt to multimodalities. The second one suggests how model (5) may adapt to sharp edges. The flexibility of this model for coping with these features make it very competitive with respect to other smoothing methods, including model (4).
Here model (5) implements the variable-architecture NN model. Now my question: can we do something like this with Stan, and if so, would you expect it to be a viable approach to handling the related pathologies?
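If I understand correctly, Stan cannot sample the discrete Bernoulli indicators directly, so one would have to marginalize them out by summing over all 2^M on/off configurations (feasible only for small M). A NumPy sketch of that sum, with made-up toy sizes and weights; in Stan the same sum would go through log_sum_exp in the model block:

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(4)
M = 3                          # hidden nodes (small, so 2^M terms stay cheap)
x = rng.normal(size=(20, 2))   # toy inputs
y = rng.normal(size=20)        # toy targets
W1 = rng.normal(size=(M, 2))
b1 = rng.normal(size=M)
W2 = rng.normal(size=M)
p_on = 0.5                     # prior inclusion probability per node

def log_lik(mask):
    # Network output with switched-off nodes zeroed; unit-variance Gaussian likelihood
    f = (W2 * mask) @ np.maximum(W1 @ x.T + b1[:, None], 0.0)
    return -0.5 * np.sum((y - f) ** 2)

# log p(y | weights) = log sum over configurations s of p(s) p(y | weights, s)
log_terms = [sum(s) * np.log(p_on) + (M - sum(s)) * np.log(1 - p_on)
             + log_lik(np.array(s, dtype=float))
             for s in product([0, 1], repeat=M)]
log_marginal = np.logaddexp.reduce(log_terms)
```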
You could also try to learn more about the shape of the posterior with NUTS, without needing to implement anything yourself. Even if NUTS is likely to have problems, it may still reveal more than other current methods.
Don’t worry. I can see it, and Matt Hoffman and some others have successfully used it to learn things other methods can’t provide. Stan is useful for many, many models for which it was not optimized. Maybe I need to clarify: Stan is a useful tool for learning about the posterior of neural networks and about the benefits of integration, even if it fails to mix through the whole posterior. But if you have big data sets, it can be very slow compared to frameworks designed for neural networks and much cruder inference.
Radford Neal’s FBM did quite well with 1- and 2-hidden-layer networks (and won the NIPS challenge years ago), and it was using alternating HMC and Gibbs, which is likely to do much more random walk than NUTS. It would be an interesting comparison between FBM and Stan (even if these neural networks would be far from the neural networks used with big data).
What scale are you talking about and what kind of run times? Is there an example somewhere? When I tried to recreate MNIST, it was very slow. I think I got up to a couple thousand data points fit in several hours of compute time.
On 1990s computers I would usually run one hidden layer with 30 hidden units on data sets with 200 < n < 5000 with FBM, in minutes to maybe a couple of hours (can’t remember exactly).
Radford Neal won the NIPS 2003 challenge https://www.cs.toronto.edu/~radford/ftp/feat-sel-slides.pdf
with 25 hidden units in the first layer and 8 hidden units in the second layer, and mentions “runs of about a day for each model” (with 2003 computers). The biggest data set had n = 6000. For FBM the computation time scaled linearly with n (given enough memory, but FBM is quite memory efficient).
FBM has an HMC variant with partial momentum updates, which helps reduce random walk when alternating between sampling the weights and the prior parameters. NUTS should be better at avoiding random walk, but funnels may be a problem.