You might also want to use the algorithms described in the recent SBC paper to check more systematically for calibration in this setting. Looking at the rank histograms might reveal what's going on with respect to the multimodality; @seantalts and @anon75146577 know more. (I admit that's not a solution to your observation, but it is a good exercise for understanding what's going on.)
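For concreteness, the rank statistic at the heart of SBC can be sketched in a few lines. This is my own toy illustration (a conjugate normal-normal model where the posterior is exact, not the examples from the paper), just to show what goes into the rank histogram:

```python
import numpy as np

rng = np.random.default_rng(0)
n_sims, n_draws = 1000, 99

ranks = []
for _ in range(n_sims):
    # draw a "true" parameter from the prior
    theta = rng.normal(0.0, 1.0)
    # simulate one observation from the model
    y = rng.normal(theta, 1.0)
    # exact posterior for this conjugate normal-normal toy model
    post_mean, post_sd = y / 2.0, np.sqrt(0.5)
    draws = rng.normal(post_mean, post_sd, size=n_draws)
    # rank of the true value among the posterior draws
    ranks.append(int(np.sum(draws < theta)))

# with a correct sampler the ranks are uniform on {0, ..., n_draws};
# problems such as a sampler stuck in one mode show up as a
# non-uniform rank histogram
hist, _ = np.histogram(ranks, bins=10, range=(0, n_draws + 1))
```

When the sampler only explores one mode of a multimodal posterior, the true values drawn from the prior land disproportionately in the tails of the posterior draws, and the histogram piles up at the extreme ranks.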
Thanks to all. I see that it is a way of life. Is there any robust procedure to determine whether chains have converged (especially when switching between modes occurs within the same chain)?
Any suggestions on priors, since they really don't have any physical meaning?
If anything, I would consider something along the lines of the hierarchical priors for BNNs in Radford Neal's Flexible Bayesian Modeling (FBM).
Here is an (admittedly rudimentary) Stan program I worked out a while ago that tries to reproduce the simplest regression example from Neal's FBM manual:
// an attempt to implement Neal's Flexible Bayesian Modeling applied to case study https://www.cs.toronto.edu/~radford/fbm.2004-11-10.doc/Ex-netgp-r.html
data {
  int<lower=0> N;
  int<lower=0> N_h;
  vector[N] x;
  vector[N] y;
}
transformed data {
  real<lower=0> width_ih;       // 0.05
  real<lower=0> alpha_ih;       // 0.5
  real<lower=0> width_ho;       // 0.05
  real<lower=0> alpha_ho;       // 0.5
  real<lower=0> width_hb;       // 0.05
  real<lower=0> alpha_hb;       // 0.5
  real<lower=0> width_response; // 0.05
  real<lower=0> alpha_response; // 0.5
  real<lower=0> mean_ih;
  real<lower=0> mean_ho;
  real<lower=0> mean_hb;
  real<lower=0> mean_response;
  width_ih = 0.05;
  alpha_ih = 0.5;
  width_ho = 0.05;
  alpha_ho = 0.5;
  width_hb = 0.05;
  alpha_hb = 0.5;
  width_response = 0.05;
  alpha_response = 0.5;
  mean_ih = 1 / pow(width_ih, 2);
  mean_hb = 1 / pow(width_hb, 2);
  mean_response = 1 / pow(width_response, 2);
  // This scheme is intended to give proper scaling behaviour as N_h goes to infinity,
  // due to the 'x' flag in the prior specification, c.f. https://www.cs.toronto.edu/~radford/fbm.2004-11-10.doc/Ex-netgp-r.html
  // see also https://www.cs.toronto.edu/~radford/fbm.2004-11-10.doc/prior.html
  // see also R. M. Neal, Bayesian Learning for Neural Networks, Appendix A 1.4. (scaling of priors)
  if (alpha_ho == 2)
    mean_ho = N_h * log(N_h) / pow(width_ho, 2);
  else if (alpha_ho < 2)
    mean_ho = pow(N_h, 2. / alpha_ho) / pow(width_ho, 2);
  else
    mean_ho = N_h * (alpha_ho / (alpha_ho - 2)) / pow(width_ho, 2);
}
parameters {
  vector[N_h] ih_w_raw;
  vector[N_h] h_b_raw;
  row_vector[N_h] ho_w_raw;
  vector<lower=0>[N] sigma_response_raw;
  real o_b; // output bias
  vector<lower=0>[N_h] precision_ih;
  row_vector<lower=0>[N_h] precision_ho;
  vector<lower=0>[N_h] precision_hb;
  vector<lower=0>[N] precision_response;
}
transformed parameters {
  vector[N] nn_mean;
  vector[N_h] ih_w;     // input-hidden weights; a vector instead of a matrix, since the input is 1-d
  row_vector[N_h] ho_w; // hidden-output weights; a row_vector instead of a matrix, since the output is 1-d
  vector[N_h] h_b;      // hidden biases
  vector<lower=0>[N] sigma_response;
  for (n in 1:N_h) {
    ih_w[n] = ih_w_raw[n] / sqrt(precision_ih[n]);
    ho_w[n] = ho_w_raw[n] / sqrt(precision_ho[n]);
    h_b[n]  = h_b_raw[n] / sqrt(precision_hb[n]);
  }
  for (n in 1:N)
    sigma_response[n] = sigma_response_raw[n] / sqrt(precision_response[n]);
  for (n in 1:N)
    nn_mean[n] = ho_w * tanh(ih_w * x[n] + h_b) + o_b;
}
model {
  // general comment: think about whether using an inverse gamma distribution might be more efficient,
  // c.f. manual section 55.7
  precision_ih ~ gamma(alpha_ih / 2., alpha_ih / (2. * mean_ih));
  ih_w_raw ~ normal(0, 1);
  precision_ho ~ gamma(alpha_ho / 2., alpha_ho / (2. * mean_ho));
  ho_w_raw ~ normal(0, 1);
  precision_hb ~ gamma(alpha_hb / 2., alpha_hb / (2. * mean_hb));
  h_b_raw ~ normal(0, 1);
  precision_response ~ gamma(alpha_response / 2., alpha_response / (2. * mean_response));
  sigma_response_raw ~ normal(0, 1);
  o_b ~ normal(0, 100);
  y ~ normal(nn_mean, sigma_response);
}
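To make the N_h scaling in the transformed data block concrete, here is a small Python transcription (my own, for illustration only) of the mean-precision computation for the hidden-to-output weights, with the hard-coded constants from the program as defaults:

```python
import math

def mean_precision_ho(n_h, width_ho=0.05, alpha_ho=0.5):
    """Scaled mean precision for hidden-to-output weights,
    mirroring the if/else in the transformed data block."""
    base = 1.0 / width_ho**2
    if alpha_ho == 2:
        return n_h * math.log(n_h) * base
    elif alpha_ho < 2:
        return n_h ** (2.0 / alpha_ho) * base
    else:
        return n_h * (alpha_ho / (alpha_ho - 2.0)) * base
```

With alpha_ho = 0.5 the mean precision grows like N_h^(2/alpha_ho) = N_h^4, so the hidden-to-output weights shrink as hidden units are added, which is what keeps the network output finite in the Gaussian-process limit.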
Something I have been having some fun with lately is finding transformations of my random variables that do have a meaningful interpretation, and then putting priors on those instead (transforming the prior back to the original domain). For neural networks, putting priors on the weights is, as you note, a bit difficult; but if you look at the activations, a very reasonable a priori assumption would be that each activation follows e.g. a standard normal, which is essentially the assumption behind batch normalization and similar regularizing methods.
For a probit/sigmoid activation function, something like this might do the trick (not tested):
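Roughly, the idea is the following (a sketch of my own in Python rather than Stan; all names are illustrative): put a standard normal prior on the pre-activations z = X @ w evaluated at a set of anchoring points X, which induces a prior on the weights w via a linear change of variables with log-Jacobian term log|det X|.

```python
import numpy as np

def log_prior_weights(w, X):
    """Log prior density on weights w induced by a standard normal
    prior on the pre-activations z = X @ w at anchor points X.
    Change of variables: log p(w) = log N(X w; 0, I) + log|det X|."""
    z = X @ w
    log_std_normal = -0.5 * np.sum(z**2) - 0.5 * len(z) * np.log(2.0 * np.pi)
    sign, logdet = np.linalg.slogdet(X)  # X must be invertible
    return log_std_normal + logdet

# illustrative anchor design: lower-triangular with a dominant diagonal
rng = np.random.default_rng(1)
X = np.tril(rng.normal(size=(4, 4))) + 4.0 * np.eye(4)
w = np.zeros(4)
lp = log_prior_weights(w, X)
```

For a probit activation this standard normal prior on the pre-activations makes each activation uniform on (0, 1) a priori, which is one way to read the "standard-behaving activations" assumption.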
Here, X would be a set of artificial anchoring points; if you are using a normal distribution, a Gaussian quadrature grid might be reasonable. Alternatively, you can give X e.g. a triangular structure so that the log-determinant becomes trivial. Actually, since the determinant is constant with respect to the parameters, it can be disregarded when using HMC, so it only really matters if you want to go to more advanced designs with e.g. random X.
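The "trivial log-determinant" remark can be checked in two lines (again my own numpy illustration): for a triangular X, log|det X| is just the sum of the log absolute diagonal entries.

```python
import numpy as np

rng = np.random.default_rng(0)
# a lower-triangular anchor design; the Jacobian determinant of
# z = X @ w reduces to the product of the diagonal entries
X = np.tril(rng.normal(size=(5, 5)))
np.fill_diagonal(X, np.abs(np.diag(X)) + 1.0)  # ensure invertibility

sign, logdet = np.linalg.slogdet(X)
trivial = np.sum(np.log(np.abs(np.diag(X))))
```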
For regression, this paper helps illustrate the idea.