Stan neural network models


I am trying to use NN model and experienced some difficulties with chains switching between modes. Can anybody help to figure out what is going on?

Thanks for any suggestions,


The stan code is:

ablationNN.stan (1.7 KB)
I have borrowed the model from matlab.

trainFcn = 'trainbr';  % Bayesian Regularization backpropagation.
hiddenLayerSize = [4 3];
net = fitnet(hiddenLayerSize,trainFcn);
net.input.processFcns = {'removeconstantrows','mapminmax'};
net.output.processFcns = {'removeconstantrows','mapminmax'};
net.divideFcn = 'dividerand';  % Divide data randomly
net.divideMode = 'sample';  % Divide up every sample
net.divideParam.trainRatio = 100/100;
net.divideParam.valRatio = 0/100;
net.divideParam.testRatio = 0/100;
net.performFcn = 'mse';  % Mean Squared Error
net.plotFcns = {'plotperform','plottrainstate','ploterrhist', ...
    'plotregression', 'plotfit'};
[net,~] = train(net,x,t);

There is some discussion on this topic here:

1 Like

You might also want to use the algorithms described in the recent SBC paper, in order to check more systematically for calibration in this setting. Looking at the rank-histograms might reveal whats going on w.r.t. to multi-modality, @seantalts, @anon75146577 know more (I admit thats not a solution to your observation, but a good exercise to understand whats going on).

1 Like

I’d be interested to see SBC applied to NNs, but I can’t see how the multimodality issues won’t be a disaster;)

1 Like

Thanks to all. I see that it is a way of life. Is there any robust procedure to determine if chains converge (especially when switching between modes occurs in the same chain).

Any suggestions on priors since they really don’t have any physical meaning?

If at all, I would consider something along the lines of the hierarchical priors for BNN’s in Radford Neals Flexible Bayesian Modelling.

Here is a (admittedly rudimentary) Stan code I worked out a while ago that tries to reproduce the simplest regression example which is mentioned in the FBM manual by Neal:

// an attempt to implement Neals Flexible Bayesian Modeling applied to case study
data { 
    int<lower=0> N;
    int<lower=0> N_h;
    vector[N] x;
    vector[N] y;
transformed data {
    real<lower=0> width_ih; //0.05
    real<lower=0> alpha_ih; //0.5
    real<lower=0> width_ho; //0.05
    real<lower=0> alpha_ho; //0.5
    real<lower=0> width_hb; //0.05
    real<lower=0> alpha_hb; //0.5
    real<lower=0> width_response; //0.05
    real<lower=0> alpha_response; // 0.5
    real<lower=0> mean_ih;
    real<lower=0> mean_ho;
    real<lower=0> mean_hb;
    real<lower=0> mean_response;
    width_ih = 0.05;
    alpha_ih = 0.5;
    width_ho = 0.05;
    alpha_ho = 0.5;
    width_hb = 0.05;
    alpha_hb = 0.5;
    width_response = 0.05;
    mean_ih = 1/pow(width_ih,2);
    mean_hb = 1/pow(width_hb,2);
    mean_response = 1/pow(width_response,2);
     // This scheme is intended to give proper scaling behaviour as N_h goes to infinity
    // due to 'x' in the prior specification, c.f.
    // see also
    // see also R. M. Neal, Bayesian Learning for Neural Networks, Appendix A 1.4. (scaling of priors)
    if(alpha_ho == 2) 
        mean_ho = N_h*log(N_h)/pow(width_ho,2);
    else {
        if(alpha_ho < 2) 
            mean_ho = pow(N_h, 2./alpha_ho)/pow(width_ho,2);
            mean_ho = N_h * (alpha_ho/(alpha_ho - 2))/pow(width_ho,2);
parameters {
    vector[N_h] ih_w_raw;
    vector[N_h] h_b_raw;
    row_vector[N_h] ho_w_raw;
    vector<lower=0>[N] sigma_response_raw;
    real o_b; // output bias
    vector<lower=0>[N_h] precision_ih;
    row_vector<lower=0>[N_h] precision_ho;
    vector<lower=0>[N_h] precision_hb;
    vector<lower=0>[N] precision_response;
transformed parameters {
    vector[N] nn_mean; 
    vector[N_h] ih_w; //input-output weights; we pick a vector instead of matrix, since input is 1d
    row_vector[N_h] ho_w; //hidden-output weights; we pick a vector instead of matrix, since output is 1d
    vector[N_h] h_b; // hidden bias
    vector<lower=0>[N] sigma_response;
    for(n in 1:N_h) {
      ih_w[n] = ih_w_raw[n]/sqrt(precision_ih[n]);
      ho_w[n] = ho_w_raw[n]/sqrt(precision_ho[n]);
      h_b[n] = h_b_raw[n]/sqrt(precision_hb[n]);
    for(n in 1:N)
      sigma_response[n] = sigma_response_raw[n]/sqrt(precision_response[n]);
    for(n in 1:N)
        nn_mean[n] = ho_w*tanh(ih_w*x[n]  + h_b) + o_b;
model {
    // general comment: think whether using inverse gamma distribution might be more efficient
    // c.f. manual 55.7.
    precision_ih ~ gamma(alpha_ih/2., alpha_ih/(2.*mean_ih));
    ih_w_raw ~ normal(0,1.);
    precision_ho ~ gamma(alpha_ho/2., alpha_ho/(2.*mean_ho));
    ho_w_raw ~ normal(0,1);
    precision_hb ~ gamma(alpha_hb/2., alpha_hb/(2.*mean_hb));
    h_b_raw ~ normal(0,1);
    precision_response ~ gamma(alpha_response/2.,alpha_response/(2.*mean_response));
    sigma_response_raw~ normal(0,1);
    o_b ~ normal(0,100);
    y ~ normal(nn_mean, sigma_response);

Something I have been having some fun with lately is finding transformations of my random variables that do have a meaningful interpretation, and then putting priors on them instead (by transforming the prior back to the original domain). For neural networks, putting priors on the weights is as you note a bit difficult, but if you look at the activations I feel a very reasonable a priori assumption would be that each activation follows e.g. a standard normal, which is essentially the assumption batch normalization and similar regularizing methods assume.

For a probit/sigmoid activation function, something like this might do the trick (not tested):

data {
   matrix[N_in,N_in] X; //pseudo-data 
parameters {
   matrix[N_out,N_in] W; //weights

transformed parameters {
   matrix[N_out,N_in] activation = Phi(W * X);

model {
   to_vector(activation) ~ normal(0,1);
   target += N_out*log_determinant(X) + normal_lpdf(to_vector(W * X), 0, 1);

Here, X would be a set of artificial anchoring points; if you are using a normal distribution, a Gaussian quadrature grid might be reasonable. Alternatively, you can pick X to have e.g. triangular structure so that the log-determinant becomes trivial. Although, actually, the determinant is constant and can be disregarded when using HMC, so it only really matters if you want to go to more advanced designs with e.g. random X.

For regression, this paper helps illustrate the idea.

1 Like

The chains almost certainly won’t converge to exploring the same posterior due to multimodality.

Thanks! We prefer to work on the standar deviation scale directly rather than with precision and then inverse-sqrt it.

A lot of the code can be simplified with vectorization and elementwise operations.

Inverse gamma isn’t more efficient in that we don’t use conjugacy. We generally don’t use the gamma priors.

Did it fit?