In the paper you linked we handled the combinatorial non-identifiability by placing order constraints on the biases of the network (just like in @betanalpha's case study), but that non-identifiability only leads to discrete multi-modality. What gives samplers problems is continuous multi-modality, like the ReLU non-identifiability, which is the other case we handled in that paper.
The output of a ReLU unit is unchanged for every input when you simultaneously scale its incoming weights by r > 0 and its outgoing weight by 1/r, since ReLU(r·z) = r·ReLU(z) for any r > 0. This means you get a whole continuum of neural network parameters that are all equally good at explaining the data.
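To make the scaling concrete, here's a minimal NumPy sketch of a single hidden unit (the variable names are mine, just for illustration, not from the paper):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 3))   # 5 data points, 3 features
w = rng.normal(size=3)        # incoming weights of one hidden unit
v = 1.7                       # outgoing weight

r = 4.2                       # any r > 0 gives the same function
out_original = v * relu(x @ w)
out_rescaled = (v / r) * relu(x @ (r * w))  # ReLU(r z) = r ReLU(z) for r > 0

print(np.allclose(out_original, out_rescaled))  # True
```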
When you have a lot of data this manifests as a posterior shaped like a very deep valley: the data identify the parameters in one direction but leave them completely non-identified in the other. The gradient of the log posterior is zero in the direction parallel to the valley floor. In the direction across the valley the gradient is zero exactly at the bottom, but changes rapidly as you move across (high curvature). Because the gradient changes so rapidly, you need a small step size for gradient descent or HMC, otherwise you'll go off in crazy directions. It's the same as the problem with the neck of the funnel. Andrew Holbrook and Babak Shahbaba have a paper where they discuss this same problem in the context of PPCA, where you need an orthonormal matrix or you run into these same non-identifiability issues.
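Here's a toy stand-in for that geometry, assuming the data identify only the product of two parameters (as in the rescaling above). This is illustrative, not the actual neural network posterior:

```python
import numpy as np

sigma = 0.01  # lots of data => small sigma => deep, narrow valley

def neg_log_post(w, v):
    # valley floor is the hyperbola w * v = 1
    return (w * v - 1.0) ** 2 / (2.0 * sigma ** 2)

def grad(w, v):
    g = (w * v - 1.0) / sigma ** 2
    return np.array([g * v, g * w])

def hessian(w, v):
    c = 1.0 / sigma ** 2
    # on the valley floor (w*v = 1) this is exactly c * outer([v, w], [v, w]),
    # a rank-1 matrix: zero curvature along the valley, huge curvature across it
    return np.array([
        [c * v * v,             c * (2 * w * v - 1.0)],
        [c * (2 * w * v - 1.0), c * w * w],
    ])

w, v = 2.0, 0.5                       # a point on the valley floor
print(grad(w, v))                     # [0, 0]: flat along the floor
print(np.linalg.eigvalsh(hessian(w, v)))  # one eigenvalue 0, one ~4.25e4:
                                          # the curvature that forces tiny step sizes
```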
In the paper you linked we tried to resolve this by forcing the input weight vectors of the network to be unit length, which collapses the valley into just a slice of it (the quotient space of the valley), and we sampled on that instead (a rough sketch of the collapsing is below). Even then, however, there are still non-identifiabilities, like the singular-components one @betanalpha mentions in his mixture model case study. The way to understand that one: if you only need one ReLU to represent your function but you're using two, then either one can serve as the one you need while the other is turned off. There are even more non-identifiabilities after that, and even then the posterior is still highly multi-modal, so you'd need something like adiabatic Monte Carlo to explore it properly.
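As a sketch of that collapsing idea (a hypothetical helper for intuition, not the paper's exact construction, which samples on the unit sphere directly):

```python
import numpy as np

def canonicalize(w, v):
    """Map (w, v) to the unique representative on the valley with ||w|| = 1."""
    r = np.linalg.norm(w)
    return w / r, v * r  # absorb the scale r into the outgoing weight

# Two parameter settings on the same valley map to the same representative:
w, v = np.array([3.0, 4.0]), 2.0
w2, v2 = 10.0 * w, v / 10.0
print(canonicalize(w, v))    # (array([0.6, 0.8]), 10.0)
print(canonicalize(w2, v2))  # same point: the continuum is collapsed to a slice
```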
An interesting side note that @avehtari brought up when I had a chance to talk to him at Stancon (glad there's another Stancon soon!) is that for optimization, the non-identifiabilities can actually be a boon. In his words, they serve as a "tunnel" between modes that lets the optimizer move rapidly through parameter space.
That's one of the reasons we haven't been exploring the non-identifiabilities lately. Also, for the problems I've worked on, I've found a principled Bayesian model to be far more useful than a neural network that just predicts means with no explanation of how. It's also way more fun to iteratively build a Bayesian model that explains your data than to blindly guess and play around with neural network architectures and opaque hyper-parameters. That said, neural networks have proven to be powerful function learners, at least in the image domain when you have a lot of data with accurate labels. So it might be nice to explore how they can be used more robustly, perhaps as a black-box input into a Bayesian model, as @andrewgelman and @betanalpha have talked about before on the Gelman blog.