Calculating layers in a Neural Net

I used to work with Bayesian neural networks, using models and inference methods similar to those of Radford Neal (see, e.g., http://becs.aalto.fi/en/research/bayes/publications/LampinenVehtari_NN2001_preprint.pdf). NUTS didn't exist yet, but I did use HMC with the stepsize adjustments proposed by Neal. In simpler cases neural networks worked fine, and they scaled better than GPs as the data grew. However, we observed problems with 1) funnel-shaped posteriors and 2) multimodality.

We knew it was likely that HMC wasn't reaching the narrow parts of the funnels, but that wasn't too problematic, as the behavior was consistent and produced additional regularization (a big part of modern deep neural network practice is still about avoiding narrow modes!). So even though we knew we weren't getting to the narrow parts of the funnels, we knew that repeating the inference would give the same result.

Multimodality was the decisive reason I switched from neural networks to Gaussian processes. Even with long chains, neural network runs were very likely to end up in different posterior modes. When these modes corresponded to different predictions, it was really problematic. I still use some models that have multimodal posteriors (e.g., the horseshoe prior produces these), but there were neural network modeling cases where the predictions changed too much from one mode to another, and we couldn't figure out how to control the prior to favor certain kinds of predictions. I know nnets can be useful, but it seems to be really, really difficult to do reliable full Bayesian inference with them.
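For anyone who hasn't seen the funnel geometry: below is a minimal numpy sketch (my illustration, not from the work above) using Neal's funnel, the standard toy version of these posteriors. It shows why a single HMC step size can't cover both the wide mouth and the narrow neck, and the non-centered reparameterization that is the usual fix nowadays.

```python
import numpy as np

rng = np.random.default_rng(0)

# Neal's funnel: v ~ N(0, 3), x | v ~ N(0, exp(v/2)).
# The conditional scale of x shrinks exponentially as v decreases,
# so no single HMC step size works in both regions of the posterior.
for v in (-6.0, 0.0, 6.0):
    print(f"v = {v:+.0f}: sd of x | v is exp(v/2) = {np.exp(v / 2):.3f}")

# Centered draws: the geometry the sampler has to explore directly.
v = rng.normal(0.0, 3.0, size=10_000)
x_centered = rng.normal(0.0, np.exp(v / 2))

# Non-centered reparameterization: sample a unit-scale x_raw and
# rescale deterministically. The sampler then sees standard-normal
# geometry with no funnel, while x keeps the same marginal.
x_raw = rng.normal(0.0, 1.0, size=10_000)
x_noncentered = np.exp(v / 2) * x_raw
```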

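On the multimodality: part of it is the harmless symmetry of the network parameterization, and part is genuinely different functions (the kind that gave us different predictions). A tiny sketch of the symmetric kind, again my illustration with made-up numbers:

```python
import numpy as np

# One-hidden-unit tanh network: f(x) = v * tanh(w * x).
# Flipping the signs of (w, v) leaves the function unchanged, so the
# posterior always has mirror-image modes; with H hidden units, sign
# flips and unit permutations give 2^H * H! equivalent modes. Those
# are harmless for prediction; the problematic modes are the
# non-equivalent ones that correspond to different predictions.
def f(x, w, v):
    return v * np.tanh(w * x)

x = np.linspace(-2.0, 2.0, 5)
print(f(x, w=1.3, v=0.7))
print(f(x, w=-1.3, v=-0.7))  # identical outputs, distinct parameters
```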
ps. I just remembered that Matt "NUTS" Hoffman presented some recent results on this at NIPS last year.
