Calculating layers in a Neural Net


#1

Hello!

My question: How to efficiently calculate the layer values in a neural net?

I was playing around with fitting (simple) neural nets in Stan. I started with the code provided here. However, I wanted a bit more flexibility in modelling the hidden layers. For that I modified the above code and took some inspiration from Neal (1996).

My first approach is the function above calc_u, but I thought it could not be the most efficient thing. (My problem is basically how to add the bias term efficiently.) Therefore, I tried the alternative calc_u_alt, which seemed a bit more elegant. However, when running and comparing the two, the first one is faster. I’m not a programmer, so for me it’s not so obvious why that is. Is it because it basically just does the matrix multiplication once per layer?

Is there actually a way to improve the calculations?

functions {
  // Calculate values for layers
  matrix calc_u(matrix X, matrix weights, vector bias) {
    
    matrix[rows(X), rows(weights)] first;
    matrix[rows(X), rows(weights)] res;
    
    first = X*weights;
    
    for (j in 1:rows(weights))
      res[,j] = bias[j] + first[,j];
    
    return res;
  }
  
  // Calculate values for layers (alternative)
  matrix calc_u_alt(matrix X, matrix weights, vector bias) {
    
    matrix[rows(X), cols(weights)] res;
    
    for(j in 1:cols(weights))
      res[,j] = X*weights[,j] + bias[j];
    
    return res;
  }
}

Thanks in advance for any help!

Cheers,
Max

PS: Since it’s my first time posting, I can’t include the R file to run the model here, sorry!

simple_nn.stan (2.7 KB)


Bayesian Neural Networks using R. Neal's code
#2

The Matrix ops are faster than vector ops for memory caching reasons. The additional thing with Stan is it has to do autodiff. For that, bigger operations are usually more efficient as well (cause there are various under the hood optimizations for them in place).

Can the bias not be added to the weights matrix? Something like: https://stats.stackexchange.com/a/124673 ? That should make this just a matrix multiply.


#3

Thanks for the quick reply!

Actually, I didn’t thought of including the bias into the weight matrix, because I wanted to put different (hierarchical) priors on the biases and weights (sort of in spirit of Neal 1996, Appendix A). However, now thinking about it again, I think this should still be doable in the case the bias is part of the weights matrix… So, thanks for the hint!


#4

Cool cool. Check out the append_row and append_col functions in the manual. You can use these to build matrices up in pieces (so you can keep the biases and weights separate for the purposes of defining parameters and specifying priors and then bring them together to do the NN forward calculation).

Good luck!


#5

I now did this:

// Calculate values for layers
  matrix calc_u(matrix X, matrix weights, vector bias) {
    
    vector[rows(X)] bias_unit;
    matrix[rows(X), cols(weights)] res;
    
    for(i in 1:rows(X))
      bias_unit[i] = 1;
    
    res = append_col(X, bias_unit) * append_row(weights, bias');

    return res;
  }

…which is considerably faster. Thanks again!

Using the first function in the first post I actually got a few diverging transitions sometimes. Didn’t get any with this function, which is great… but also strange, since I didn’t change anything else in the code?!

Anyways, thanks for the help!

Max


#6

Be warned – then interpreted as densities neural networks are essentially huge mixture models, and the corresponding posteriors will be hugely non-identified. You’ll be able to explore the distribution of the weights, but only in a small neighborhood around your initial conditions even with Hamiltonian Monte Carlo.

Because we can never fully quantify the posterior over these models it’s somewhat dubious to call them “Bayesian” as the computation induces an implicit model that you cannot explicitly describe. Having an incomplete characterization of uncertainty can still be better than no uncertainty in some problems, just be aware of the limitations.


#7

Thanks for the heads-up, @betanalpha

I’ll have to let that sink in. Is it correct, that calling any neural network “Bayesian” (in a strict sense) is thus generally a bad/misleading idea? (And the more appropriate thing would be to call it a “neural net estimated via HMC”.)

Are GPs the (more) “Bayesian” approach when it comes to (quasi) non-parametric estimation?

I actually started reading the Neal book because of your introductory HMC paper, which was a great read. However, I am still learning (or trying to get an intuition for) all the technical stuff.


#8

In my opinion calling an inaccurate fit a “Bayesian” analysis is indeed misleading, largely because you cannot quantify an explicit posterior. Recently there has been a wave from machine learning arguing for “implicit models” defined by the interaction of a nominal model and the fitting algorithm, but this is largely moving goal posts to accommodate their own research.

I think it’s more fair to be explicit and note that you could not completely quantify the posterior and hence are utilizing in incomplete characterization of the problem. Again, this can still be much better than relying on a single point estimate in practice so it may be a useful incremental improvement.

Finally note that this issue is not unique to neural networks. It’s also a serious problem in trying to fit models with Dirichlet process priors, such as LDA.

It’s not that Gaussian processes are any more Bayesian, it’s just that they can be easier to accurately fit when used correctly (in particular, marginalizing over the hyperparameters with principled hyperpriors).

Splines, GPs, neural networks, and such can all be considered methods for interpolating between noisy data and they all have their own unique benefits and advantages. Personally, of these I prefer GPs because they are more interpretable and hence easier to use in a principled fashion. At least when interpolating across 1 or 2 dimensional spaces…

Neal’s papers are great and I encourage you to read through them. I particularly recommend his comprehensive review of MCMC methods, http://www.cs.toronto.edu/~radford/review.abstract.html.

One of the best ways to come to appreciate the technical details is to experiment with pathological problems, and from this perspective neural networks are definitely worth exploring. Just make sure to always run multiple chains from diffuse initial conditions and keep an eye on the diagnostics to see how hard they can be to fit.


#9

I used to work with Bayesian neural networks using similar models and inference methods as Radford Neal (see, e.g., http://becs.aalto.fi/en/research/bayes/publications/LampinenVehtari_NN2001_preprint.pdf). NUTS didn’t exist yet, but I did use HMC and stepsize adjustments proposed by Neal. In simpler cases neural networks worked fine, and they were scaling better than GPs with more data. However, we observed the problems with 1) funnel shaped posteriors and 2) multimodality. We knew that it was likely that HMC wasn’t reaching the narrow parts of the funnel, but it wasn’t that problematic as the behavior was consistent and produced additional regularization (big part of modern deep neural network methods is still how to avoid narrow modes!) So even if we knew we are not getting to the narrow part of the funnels, we knew that if we repeat the inference we get the same result. Multimodality was the decisive reason I switched from neural networks to Gaussian processes. Even with long chains for nnets it was very likely to end up in different posterior modes. When these modes correspondent to different predictions it was really problematic. I’m still using some models which have multimodal posteriors (e.g. horseshoe produces these), but there were some neural network modeling cases where the predictions changed too much from mode to another and we couldn’t figure out how to control the prior to favor certain kind predictions. I know nnets can be useful, but it seems to be really really difficult to do reliable full Bayesian inference with them.

ps. I just remembered, that Matt “NUTS” Hoffman presented some recent results last year in NIPS.


#10

For a two-layer network with tanh, I used this Stan function:

  /**
   * Returns linear predictor for restricted Boltzman machine (RBM).
   * Assumes one-hidden layer with logistic sigmoid activation. 
   *
   * @param x Predictors (N x M)
   * @param alpha First-layer weights (M x J)
   * @param beta Second-layer weights (J x (K - 1))
   * @return Linear predictor for output layer of RBM.
   */
  matrix rbm(matrix x, matrix alpha, matrix beta) {
    return tanh(x * alpha) * beta;
  }

More layers look just the same.

This is still a mess of inefficient autodiff compared to actually building the back-prop algorithm statically. So if we really wanted to do these efficiently, we’d write custom derivatives for functions like rbm() direclty in C++. Calling autodiff requires lots of extra space and is also slower. The direct C++ implementation would require very little memory and be at least 4 times faster.

But for reasons @betanalpha mentions and because this still isn’t going to scale in parallel, we haven’t been very focused on neural nets (aka deep belief nets).


#11

And the paper was just published in ICML
Learning Deep Latent Gaussian Models with Markov Chain Monte Carlo

Aki


#12

Thanks @betanalpha, @avehtari, and @Bob_Carpenter for the great answers and comments! I really appreciate it!