Bayesian Neural Networks using R. Neal's code

Andrei_Berceanu · February 12, 2018, 11:06pm

Did Stan wrapp (or reimplement) R. Neal’s C software for Flexible Bayesian Modeling and Markov Chain Sampling? I want to to use his code (or an equivalent one) from Python

bgoodri · February 12, 2018, 11:26pm

No

Andrei_Berceanu · February 13, 2018, 12:33am

Sorry for not supplying more context, we used his code in a previous nuclear physics publication, and we were hoping we can switch to Stan for the next one. What do you recommend instead?

bgoodri · February 13, 2018, 12:55am

Possibly this

which was co-written by Matt Hoffman who developed the original version of NUTS for Stan. You can do neural network stuff in Stan, but the posterior tends to be multimodal and it is hard to do full Bayesian inference with the resulting draws. There is more discussion along these lines at

Andrei_Berceanu · February 13, 2018, 12:59am

Thanks for the link and info, will check it in more detail tomorrow!

By the way, Edward and PyMC3 also don’t have support for this right?

bgoodri · February 13, 2018, 1:01am

You can write down the posterior PDF of a neural network in almost any language. Sampling from the posterior distribution is the hard part.

avehtari · February 13, 2018, 11:50am

Radford Neal’s FBM is alternating sampling hyperparameters with Gibbs sampling and sampling weights with HMC, so you can’t use Stan to replicate that exactly. Matt Hoffman and some other people have used Stan to sample both hyperparameters and weights with NUTS variant of HMC. Since the sampling from Bayesian neural network posterior is really hard and it’s really hard to know whether you are sampling from the whole posterior (ignoring multiple modes to the label switching and aliasing) it is likely that FBM, Stan, Edward, PyMC3 and any other software will fail to sample correctly from the whole posterior. The sampling may produce useful predictive models, and then the question is are you happy with results where you don’t know what is the effect of model and priors and what is the effect of the algorithm (you could say then that it’s machine learning).

Bob_Carpenter · February 13, 2018, 11:50pm

Isn’t that the other way around in that they’re using neural nets to fit the curvature in the posterior rather than doing sampling to fit neural nets? I know they want to do the latter, too, so maybe it’s both.

Yes, you can use that algorithm in PyMC3. We don’t know how well it works, because HMC is so sensitive to tuning, and Gibbs changes the posterior.

Until someone comes up with a P = NP reduction, I don’t expect to see anyone fully explore the posterior of a neural network!

Yup. Not that there’s anything wrong with that. I want self-driving cars and better product recommendations, too.

linas · May 21, 2018, 1:07pm

By “forgetting” about such issues as sampling from full posterior and multiple modes I found that logistic regression code (which is referenced at the beginning of this post) provides good predictions and is much faster than FBM. Of course this may be related to the hyperpriors which are part of FBM. Annoying part is that parameters switch between modes within the same chain

while in FBM parameters tend to behave nicely.

However, FBM doesn’t handle binomial regression explicitly (one has to use some tricks) while stan has binomial_logit. I plan to post some codes in the near future which will resemble Neal’s FBM more closely (i.e. more flexible structure of hidden layers, hyperpriors etc).

avehtari · May 21, 2018, 1:34pm

So what happens if you add hyperparameters in Stan model?

Or vanilla HMC in FBM is not mixing so well that it would switch between the modes?

linas · May 21, 2018, 3:22pm

Well, by adding hyperparametes switching started to disappear but there are still multiple modes. There is a single mode in Neal’s FBM. Time almost doubled. My guess there is a lot to play with - prior choice, centered/non-centered parametrization. Neal uses gamma hyperpriors. I don’t know the exact details of stan’s HMC but my impression is that Neal’s HMC is different.

NNbtest.r (4.5 KB)

NNlogistic2.stan (2.82 KB)

avehtari · May 21, 2018, 8:06pm

FBM uses HMC with fixed number of steps (aka static HMC). Stan uses NUTS variant with dynamic number of steps (aka dynamic HMC). Stan has also automatic step size adaptation while in FBM you have to choose the stepsize.

linas · May 21, 2018, 8:11pm

I am not sure about step size. FBM has step adjustment factor to regulate rejection rate.

avehtari · May 21, 2018, 8:17pm

It’s different than the adaptation. FBM is using alternating Gibbs for hyperparameters and HMC for weights. Given the hyperparameter values step size can be adjusted based on the prior on each weight group, which is helpful, but still it’s the user who has to choose one value, which is not easy to choose.

linas · May 21, 2018, 8:26pm

Whatever are the differences the resulting predictive posteriors look similar while posteriors of individual parameters look different. I was able to get binomial regression up and running what saves a lot of time for my particular case (# of successes for fixed # of Bernoulli trials). NCP doesn’t seem to make a difference on the dual modes I observe in stan. Maybe priors will make a difference… The models are attached.
NNbinomial2ncp.stan (3.5 KB)
NNbinomial2.stan (2.8 KB)

Bob_Carpenter · May 21, 2018, 11:06pm

This seems to be a common occurrence. It’s why (some) people in machine learning have told us that they don’t care about estimating (some) parameters.

linas · May 23, 2018, 8:19pm

For realistic data when I run several chains I typically get different modes for each chain. It looks like each mode represents local optima. I wonder if it is possible to compare those chains - which one corresponds to global optima?

Bob_Carpenter · May 24, 2018, 3:59am

It’s combinatorially intractable to find the global optimum. Also, the local optima aren’t very relevant as they only give you density, not probability mass. It’s basically intractable to do full Bayesian inference on these kinds of models.

linas · May 26, 2018, 12:09pm

Yes, this is true. But what I found and Neal suggested looks like some practical solution, namely running large # of neurons. Perhaps it may be explained by the fact that with infinite # of neurons BNN becomes GP.

Below are traces and densities from a realistic example with 13 inputs, 2 hidden layers 10 neurons each, and one output. The problem is binomial regression.

Taysseer_Sharaf · September 13, 2019, 2:04pm

Try my new package on Bayesian Learning for neural network package in R (BLNN)

Topic		Replies	Views
Stan neural network models Modeling	7	3529	May 1, 2018
NUTS-within-Gibbs by stan General	3	2476	November 11, 2022
Bayesian Neural Network with Categorical Likelihood: efficient implementation Modeling	5	720	March 16, 2021
Education-oriented, functional, interpreted Stan/HMC implementation Project Proposals education	19	1933	February 14, 2022
Nutpie, now with normalizing flow adaptation Publicity	7	426	November 19, 2024

Bayesian Neural Networks using R. Neal's code

Related topics