For a two-layer network with tanh, I used this Stan function:
/**
* Returns linear predictor for restricted Boltzmann machine (RBM).
* Assumes a single hidden layer with tanh activation.
*
* @param x Predictors (N x M)
* @param alpha First-layer weights (M x J)
* @param beta Second-layer weights (J x (K - 1))
* @return Linear predictor for output layer of RBM.
*/
matrix rbm(matrix x, matrix alpha, matrix beta) {
return tanh(x * alpha) * beta;
}
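As a rough sketch of how this might plug into a full program for K-way classification (the data block layout, the standard normal priors, and pinning the first category's linear predictor to zero are just placeholder choices, not part of the original model):

functions {
  // two-layer predictor from above
  matrix rbm(matrix x, matrix alpha, matrix beta) {
    return tanh(x * alpha) * beta;
  }
}
data {
  int<lower=1> N;                      // observations
  int<lower=1> M;                      // predictors
  int<lower=1> J;                      // hidden units
  int<lower=2> K;                      // outcome categories
  matrix[N, M] x;                      // predictor matrix
  array[N] int<lower=1, upper=K> y;    // observed categories
}
parameters {
  matrix[M, J] alpha;                  // first-layer weights
  matrix[J, K - 1] beta;               // second-layer weights
}
model {
  matrix[N, K - 1] eta = rbm(x, alpha, beta);
  to_vector(alpha) ~ normal(0, 1);
  to_vector(beta) ~ normal(0, 1);
  for (n in 1:N)
    y[n] ~ categorical_logit(append_row(0, eta[n]'));
}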
Networks with more hidden layers look just the same.
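For example, a three-layer version (the name and extra weight matrix here are hypothetical) just chains another multiply-and-tanh:

/**
 * Same idea with two hidden layers: tanh at each hidden layer,
 * linear output layer.
 */
matrix nn3(matrix x, matrix alpha, matrix beta, matrix gamma) {
  return tanh(tanh(x * alpha) * beta) * gamma;
}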
This is still a mess of inefficient autodiff compared to building the back-prop algorithm statically. So if we really wanted to do these efficiently, we'd write custom derivatives for functions like rbm() directly in C++. Going through autodiff requires a lot of extra space and is also slower; a direct C++ implementation would require very little memory and be at least 4 times faster.
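To make that concrete: with H = tanh(x * alpha) and eta = H * beta, a hand-written reverse pass only needs to store H and propagate the standard backprop adjoints (this is just the textbook algebra, not code from the Stan math library):

$$
\bar{\beta} = H^\top \bar{\eta}, \qquad
\bar{H} = \bar{\eta}\,\beta^\top, \qquad
\bar{\alpha} = x^\top \left[\, \bar{H} \odot (1 - H \odot H) \,\right],
$$

where a bar marks the adjoint (gradient of the log density with respect to that quantity) and \odot is elementwise multiplication. Autodiff instead records every intermediate node of the matrix product, the tanh, and the final multiply, which is where the extra memory and time go.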
But for the reasons @betanalpha mentions, and because this still isn't going to scale in parallel, we haven't been very focused on neural nets (a.k.a. deep belief nets).