Hello!

My question: How to efficiently calculate the layer values in a neural net?

I was playing around with fitting (simple) neural nets in Stan. I started with the code provided here. However, I wanted a bit more flexibility in modelling the hidden layers. For that I modified the above code and took some inspiration from Neal (1996).

My first approach is the function above `calc_u`

, but I thought it could not be the most efficient thing. (My problem is basically how to add the bias term efficiently.) Therefore, I tried the alternative `calc_u_alt`

, which seemed a bit more elegant. However, when running and comparing the two, the first one is faster. Iâ€™m not a programmer, so for me itâ€™s not so obvious why that is. Is it because it basically just does the matrix multiplication once per layer?

Is there actually a way to improve the calculations?

```
functions {
// Calculate values for layers
matrix calc_u(matrix X, matrix weights, vector bias) {
matrix[rows(X), rows(weights)] first;
matrix[rows(X), rows(weights)] res;
first = X*weights;
for (j in 1:rows(weights))
res[,j] = bias[j] + first[,j];
return res;
}
// Calculate values for layers (alternative)
matrix calc_u_alt(matrix X, matrix weights, vector bias) {
matrix[rows(X), cols(weights)] res;
for(j in 1:cols(weights))
res[,j] = X*weights[,j] + bias[j];
return res;
}
}
```

Thanks in advance for any help!

Cheers,

Max

PS: Since itâ€™s my first time posting, I canâ€™t include the R file to run the model here, sorry!

simple_nn.stan (2.7 KB)

The Matrix ops are faster than vector ops for memory caching reasons. The additional thing with Stan is it has to do autodiff. For that, bigger operations are usually more efficient as well (cause there are various under the hood optimizations for them in place).

Can the bias not be added to the weights matrix? Something like: https://stats.stackexchange.com/a/124673 ? That should make this just a matrix multiply.

Thanks for the quick reply!

Actually, I didnâ€™t thought of including the bias into the weight matrix, because I wanted to put different (hierarchical) priors on the biases and weights (sort of in spirit of Neal 1996, Appendix A). However, now thinking about it again, I think this should still be doable in the case the bias is part of the weights matrixâ€¦ So, thanks for the hint!

Cool cool. Check out the `append_row`

and `append_col`

functions in the manual. You can use these to build matrices up in pieces (so you can keep the biases and weights separate for the purposes of defining parameters and specifying priors and then bring them together to do the NN forward calculation).

Good luck!

I now did this:

```
// Calculate values for layers
matrix calc_u(matrix X, matrix weights, vector bias) {
vector[rows(X)] bias_unit;
matrix[rows(X), cols(weights)] res;
for(i in 1:rows(X))
bias_unit[i] = 1;
res = append_col(X, bias_unit) * append_row(weights, bias');
return res;
}
```

â€¦which is considerably faster. Thanks again!

Using the first function in the first post I actually got a few diverging transitions sometimes. Didnâ€™t get any with this function, which is greatâ€¦ but also strange, since I didnâ€™t change anything else in the code?!

Anyways, thanks for the help!

Max

Be warned â€“ then interpreted as densities neural networks are essentially huge mixture models, and the corresponding posteriors will be hugely non-identified. Youâ€™ll be able to explore the distribution of the weights, but only in a small neighborhood around your initial conditions even with Hamiltonian Monte Carlo.

Because we can never fully quantify the posterior over these models itâ€™s somewhat dubious to call them â€śBayesianâ€ť as the computation induces an implicit model that you cannot explicitly describe. Having an incomplete characterization of uncertainty can still be better than no uncertainty in some problems, just be aware of the limitations.

Thanks for the heads-up, @betanalpha

Iâ€™ll have to let that sink in. Is it correct, that calling any neural network â€śBayesianâ€ť (in a strict sense) is thus generally a bad/misleading idea? (And the more appropriate thing would be to call it a â€śneural net estimated via HMCâ€ť.)

Are GPs the (more) â€śBayesianâ€ť approach when it comes to (quasi) non-parametric estimation?

I actually started reading the Neal book because of your introductory HMC paper, which was a great read. However, I am still learning (or trying to get an intuition for) all the technical stuff.

In my opinion calling an inaccurate fit a â€śBayesianâ€ť analysis is indeed misleading, largely because you cannot quantify an explicit posterior. Recently there has been a wave from machine learning arguing for â€śimplicit modelsâ€ť defined by the interaction of a nominal model and the fitting algorithm, but this is largely moving goal posts to accommodate their own research.

I think itâ€™s more fair to be explicit and note that you could not completely quantify the posterior and hence are utilizing in incomplete characterization of the problem. Again, this can still be much better than relying on a single point estimate in practice so it may be a useful incremental improvement.

Finally note that this issue is not unique to neural networks. Itâ€™s also a serious problem in trying to fit models with Dirichlet process priors, such as LDA.

Itâ€™s not that Gaussian processes are any more Bayesian, itâ€™s just that they can be easier to accurately fit when used correctly (in particular, marginalizing over the hyperparameters with principled hyperpriors).

Splines, GPs, neural networks, and such can all be considered methods for interpolating between noisy data and they all have their own unique benefits and advantages. Personally, of these I prefer GPs because they are more interpretable and hence easier to use in a principled fashion. At least when interpolating across 1 or 2 dimensional spacesâ€¦

Nealâ€™s papers are great and I encourage you to read through them. I particularly recommend his comprehensive review of MCMC methods, Abstract for ``Probabilistic Inference using Markov Chain Monte Carlo Methods''.

One of the best ways to come to appreciate the technical details is to experiment with pathological problems, and from this perspective neural networks are definitely worth exploring. Just make sure to always run multiple chains from *diffuse* initial conditions and keep an eye on the diagnostics to see how hard they can be to fit.

2 Likes

I used to work with Bayesian neural networks using similar models and inference methods as Radford Neal (see, e.g., http://becs.aalto.fi/en/research/bayes/publications/LampinenVehtari_NN2001_preprint.pdf). NUTS didnâ€™t exist yet, but I did use HMC and stepsize adjustments proposed by Neal. In simpler cases neural networks worked fine, and they were scaling better than GPs with more data. However, we observed the problems with 1) funnel shaped posteriors and 2) multimodality. We knew that it was likely that HMC wasnâ€™t reaching the narrow parts of the funnel, but it wasnâ€™t that problematic as the behavior was consistent and produced additional regularization (big part of modern deep neural network methods is still how to avoid narrow modes!) So even if we knew we are not getting to the narrow part of the funnels, we knew that if we repeat the inference we get the same result. Multimodality was the decisive reason I switched from neural networks to Gaussian processes. Even with long chains for nnets it was very likely to end up in different posterior modes. When these modes correspondent to different predictions it was really problematic. Iâ€™m still using some models which have multimodal posteriors (e.g. horseshoe produces these), but there were some neural network modeling cases where the predictions changed too much from mode to another and we couldnâ€™t figure out how to control the prior to favor certain kind predictions. I know nnets can be useful, but it seems to be really really difficult to do reliable full Bayesian inference with them.

ps. I just remembered, that Matt â€śNUTSâ€ť Hoffman presented some recent results last year in NIPS.

3 Likes

For a two-layer network with tanh, I used this Stan function:

```
/**
* Returns linear predictor for restricted Boltzman machine (RBM).
* Assumes one-hidden layer with logistic sigmoid activation.
*
* @param x Predictors (N x M)
* @param alpha First-layer weights (M x J)
* @param beta Second-layer weights (J x (K - 1))
* @return Linear predictor for output layer of RBM.
*/
matrix rbm(matrix x, matrix alpha, matrix beta) {
return tanh(x * alpha) * beta;
}
```

More layers look just the same.

This is still a mess of inefficient autodiff compared to actually building the back-prop algorithm statically. So if we really wanted to do these efficiently, weâ€™d write custom derivatives for functions like `rbm()`

direclty in C++. Calling autodiff requires lots of extra space and is also slower. The direct C++ implementation would require very little memory and be at least 4 times faster.

But for reasons @betanalpha mentions and because this still isnâ€™t going to scale in parallel, we havenâ€™t been very focused on neural nets (aka deep belief nets).

1 Like

And the paper was just published in ICML

Learning Deep Latent Gaussian Models with Markov Chain Monte Carlo

Aki

1 Like

Thanks @betanalpha, @avehtari, and @Bob_Carpenter for the great answers and comments! I really appreciate it!