Being a physicist, I found the observations and analogies recently outlined in “Comparing Dynamics: Deep Neural Networks versus Glassy Systems” particularly insightful, and I thought their observations might help in this discussion, or might actually force us to revise our assumptions and intuitions regarding pathologies and inefficiencies in approximating the posterior of BNNs:
> We analyze numerically the training dynamics of deep neural networks (DNN) by using methods developed in statistical physics of glassy systems. The two main issues we address are the complexity of the loss-landscape and of the dynamics within it, and to what extent DNNs share similarities with glassy systems. Our findings, obtained for different architectures and datasets, suggest that during the training process the dynamics slows down because of an increasingly large number of flat directions. At large times, when the loss is approaching zero, the system diffuses at the bottom of the landscape. Despite some similarities with the dynamics of mean-field glassy systems, in particular, the absence of barrier crossing, we find distinctive dynamical behaviors in the two cases, showing that the statistical properties of the corresponding loss and energy landscapes are different. In contrast, when the network is under-parametrized we observe a typical glassy behavior, thus suggesting the existence of different phases depending on whether the network is under-parametrized or over-parametrized.
If in the over-parametrized case the problem is really due to “an increasingly large number of flat directions”, wouldn't HMC or NUTS actually be able to cope with this? A flat direction is one along which the posterior is (nearly) constant, so the sampler can drift along it without any barrier to cross; the toy sketch below illustrates this. I believe the barrier crossing they refer to in the under-parametrized case corresponds to the previously mentioned multimodality.
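To make that concrete, here is a deliberately trivial toy sketch of my own (not from the paper): a two-parameter model whose likelihood only constrains a + b, so the direction a − b is flat up to the priors. HMC diffuses along such a ridge instead of having to cross barriers, which is the over-parametrized picture in miniature.

```stan
data {
  int<lower=1> N;
  vector[N] y;
}
parameters {
  real a;
  real b;
}
model {
  // Weak priors; without them the ridge along a - b would be exactly
  // flat and the posterior improper.
  a ~ normal(0, 10);
  b ~ normal(0, 10);
  // The likelihood only identifies a + b, so it is constant along the
  // direction a - b: a one-dimensional "flat direction" that HMC can
  // traverse without any barrier crossing.
  y ~ normal(a + b, 1);
}
```

In a real over-parametrized BNN there are of course very many such directions (permutation and scaling symmetries, redundant units) and they are typically only approximately flat, but the sampling geometry is the same in spirit.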
I still have to read the paper in depth…
Clearly, as mentioned before, I'm aware that such over-parametrized BNNs are not efficiently implementable in Stan due to AutoDiff. I also know they don't speak about BNNs, only about the loss function of NNs, but the loss landscape should nevertheless be closely related to the energy of the model in a Bayesian setting, when using HMC language; I spell this out below.
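In HMC terms the correspondence is just bookkeeping, assuming the loss L is the negative log-likelihood and, for concreteness, a Gaussian prior with scale σ₀:

```latex
% U is the HMC potential energy, L the NN training loss; the Gaussian
% prior contributes an L2 / weight-decay term.
\begin{aligned}
U(\theta) &= -\log p(\theta \mid \mathcal{D}) + \text{const} \\
          &= -\log p(\mathcal{D} \mid \theta) - \log p(\theta) + \text{const} \\
          &= \mathcal{L}(\theta) + \tfrac{1}{2\sigma_0^2}\,\lVert\theta\rVert^2 + \text{const}.
\end{aligned}
```

So up to the prior term, which acts like weight decay, the energy landscape HMC explores is essentially the loss landscape they analyze.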