Stan wrapp (or reimplement) R. Neal’s
C software for Flexible Bayesian Modeling and Markov Chain Sampling? I want to to use his code (or an equivalent one) from
Sorry for not supplying more context, we used his code in a previous nuclear physics publication, and we were hoping we can switch to Stan for the next one. What do you recommend instead?
which was co-written by Matt Hoffman who developed the original version of NUTS for Stan. You can do neural network stuff in Stan, but the posterior tends to be multimodal and it is hard to do full Bayesian inference with the resulting draws. There is more discussion along these lines at
Thanks for the link and info, will check it in more detail tomorrow!
By the way, Edward and PyMC3 also don’t have support for this right?
You can write down the posterior PDF of a neural network in almost any language. Sampling from the posterior distribution is the hard part.
Radford Neal’s FBM is alternating sampling hyperparameters with Gibbs sampling and sampling weights with HMC, so you can’t use Stan to replicate that exactly. Matt Hoffman and some other people have used Stan to sample both hyperparameters and weights with NUTS variant of HMC. Since the sampling from Bayesian neural network posterior is really hard and it’s really hard to know whether you are sampling from the whole posterior (ignoring multiple modes to the label switching and aliasing) it is likely that FBM, Stan, Edward, PyMC3 and any other software will fail to sample correctly from the whole posterior. The sampling may produce useful predictive models, and then the question is are you happy with results where you don’t know what is the effect of model and priors and what is the effect of the algorithm (you could say then that it’s machine learning).
Isn’t that the other way around in that they’re using neural nets to fit the curvature in the posterior rather than doing sampling to fit neural nets? I know they want to do the latter, too, so maybe it’s both.
Yes, you can use that algorithm in PyMC3. We don’t know how well it works, because HMC is so sensitive to tuning, and Gibbs changes the posterior.
Until someone comes up with a P = NP reduction, I don’t expect to see anyone fully explore the posterior of a neural network!
Yup. Not that there’s anything wrong with that. I want self-driving cars and better product recommendations, too.
By “forgetting” about such issues as sampling from full posterior and multiple modes I found that logistic regression code (which is referenced at the beginning of this post) provides good predictions and is much faster than FBM. Of course this may be related to the hyperpriors which are part of FBM. Annoying part is that parameters switch between modes within the same chain
while in FBM parameters tend to behave nicely.
However, FBM doesn’t handle binomial regression explicitly (one has to use some tricks) while stan has binomial_logit. I plan to post some codes in the near future which will resemble Neal’s FBM more closely (i.e. more flexible structure of hidden layers, hyperpriors etc).
So what happens if you add hyperparameters in Stan model?
Or vanilla HMC in FBM is not mixing so well that it would switch between the modes?
Well, by adding hyperparametes switching started to disappear but there are still multiple modes. There is a single mode in Neal’s FBM. Time almost doubled. My guess there is a lot to play with - prior choice, centered/non-centered parametrization. Neal uses gamma hyperpriors. I don’t know the exact details of stan’s HMC but my impression is that Neal’s HMC is different.
NNbtest.r (4.5 KB)
NNlogistic2.stan (2.82 KB)
FBM uses HMC with fixed number of steps (aka static HMC). Stan uses NUTS variant with dynamic number of steps (aka dynamic HMC). Stan has also automatic step size adaptation while in FBM you have to choose the stepsize.
I am not sure about step size. FBM has step adjustment factor to regulate rejection rate.
It’s different than the adaptation. FBM is using alternating Gibbs for hyperparameters and HMC for weights. Given the hyperparameter values step size can be adjusted based on the prior on each weight group, which is helpful, but still it’s the user who has to choose one value, which is not easy to choose.
Whatever are the differences the resulting predictive posteriors look similar while posteriors of individual parameters look different. I was able to get binomial regression up and running what saves a lot of time for my particular case (# of successes for fixed # of Bernoulli trials). NCP doesn’t seem to make a difference on the dual modes I observe in stan. Maybe priors will make a difference… The models are attached.
NNbinomial2ncp.stan (3.5 KB)
NNbinomial2.stan (2.8 KB)
This seems to be a common occurrence. It’s why (some) people in machine learning have told us that they don’t care about estimating (some) parameters.
For realistic data when I run several chains I typically get different modes for each chain. It looks like each mode represents local optima. I wonder if it is possible to compare those chains - which one corresponds to global optima?
It’s combinatorially intractable to find the global optimum. Also, the local optima aren’t very relevant as they only give you density, not probability mass. It’s basically intractable to do full Bayesian inference on these kinds of models.
Yes, this is true. But what I found and Neal suggested looks like some practical solution, namely running large # of neurons. Perhaps it may be explained by the fact that with infinite # of neurons BNN becomes GP.
Below are traces and densities from a realistic example with 13 inputs, 2 hidden layers 10 neurons each, and one output. The problem is binomial regression.