Strange convergence behaviour

Has anyone come across a situation where the same model and data converge (at least all Rhat == 1), then you run exactly the same model and data again and it doesn’t converge, and again it doesn’t converge, and maybe about 1 time in 5 it converges? When it converges everything looks OK and the answers are very sensible. When it doesn’t converge, the posteriors are almost the same as the priors. The ‘variational’ method always gives close to the same results as the sampling method does when it converges.

Yes, but I’m not sure from your description. You might have priors that are quite different from the posterior, and a lot of data, so the posterior is sharp and never gets found. When the model does converge, what are the parameter values like? If they’re dramatically different from zero, it’s worth rescaling the model.

Hi, thanks very much for looking at this. When it converges everything looks fine, and gives results that I would expect. The parameters are ok, the variances are within tight ranges, etc. I’m trying to make a latent variable, and when it converges the estimates I get for the latent variable are completely sensible. Here is the code, sorry it is a bit long:

data {
  int<lower=0> n;                 /* number of observations */
  int<lower=2> k;                 /* Likert 1,2,…,k */
  vector[n] w;                    /* binary factor */
  int<lower=1,upper=k> mir[n];    /* ordered response variables */
  int<lower=1,upper=k> age[n];
  int<lower=1,upper=k> myb[n];
  int<lower=1,upper=k> mya[n];
  vector[n] gui;                  /* response */
  vector[n] sha;                  /* response */
  int<lower=0,upper=1> apo[n];    /* binary response */
}

parameters {
  vector[2] b_mir;
  ordered[k-1] c_mir;
  vector[2] b_age;
  ordered[k-1] c_age;
  vector[2] b_myb;
  ordered[k-1] c_myb;
  vector[2] b_mya;
  ordered[k-1] c_mya;
  vector[n] Ship;                 /* latent variable */
  vector[3] b_sha;
  vector[3] b_gui;
  vector[4] b_apol;
  real<lower=0> sigmasha;
  real<lower=0> sigmagui;
}

model {
  for (i in 1:2) {
    b_mir[i] ~ normal(0,5);
    b_age[i] ~ normal(0,5);
    b_myb[i] ~ normal(0,5);
    b_mya[i] ~ normal(0,5);
    b_sha[i] ~ normal(0,5);
    b_gui[i] ~ normal(0,5);
    b_apol[i] ~ normal(0,5);
  }
  b_sha[3] ~ normal(0,5);
  b_gui[3] ~ normal(0,5);
  b_apol[3] ~ normal(0,5);
  b_apol[4] ~ normal(0,5);
  sigmasha ~ cauchy(0,2.5);
  sigmagui ~ cauchy(0,2.5);

  Ship ~ normal(0,5);             /* latent variable */

  for (i in 1:n) {
    mir[i] ~ ordered_logistic(b_mir[1] + b_mir[2]*Ship[i], c_mir);
    age[i] ~ ordered_logistic(b_age[1] + b_age[2]*Ship[i], c_age);
    myb[i] ~ ordered_logistic(b_myb[1] + b_myb[2]*Ship[i], c_myb);
    mya[i] ~ ordered_logistic(b_mya[1] + b_mya[2]*Ship[i], c_mya);

    sha[i] ~ normal(b_sha[1] + b_sha[2]*Ship[i] + b_sha[3]*w[i], sigmasha);
    gui[i] ~ normal(b_gui[1] + b_gui[2]*Ship[i] + b_gui[3]*w[i], sigmagui);
    apo[i] ~ bernoulli_logit(b_apol[1] + b_apol[2]*Ship[i] + b_apol[3]*gui[i] + b_apol[4]*sha[i]);
  }
}


One funny thing about this model is that the model for mir is completely separate from the model for apo/gui/sha, completely separate from the model for age, etc… so you might want to break them out and run them separately to find the problem. If you’d be willing to split the models up and check which one(s) have the problem I’d be happy to look further.

I frequently have similar issues. As @sakrejda says, one way this can happen is when you are stuck in the fat tails of a spike-and-tail distribution. If the process never finds the spike, it just drifts around aimlessly. Awkwardly, posteriors seem to become harder to fit when they are very well identified, particularly in high dimensions. You basically have to find, in an enormous space, the one tiny region where all the probability mass is.

One thing that might help is to first do an optimization step, and then take the output and use it as the initial point for your regular sampling statement.
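For instance, a minimal sketch in rstan (model.stan and the data list stan_data are placeholder names here; the init list has to name every parameter in the parameters block):

    library(rstan)

    sm  <- stan_model("model.stan")           # compile the model once
    opt <- optimizing(sm, data = stan_data)   # find the posterior mode

    # opt$par is a flat named vector (e.g. "b_mir[1]", "Ship[3]"),
    # so rebuild parameter-shaped entries for the init list
    init_mode <- list(
      b_mir = opt$par[grep("^b_mir\\[", names(opt$par))],
      Ship  = opt$par[grep("^Ship\\[",  names(opt$par))]
      # ...and so on for the remaining parameters
    )

    fit <- sampling(sm, data = stan_data, chains = 4,
                    init = rep(list(init_mode), 4))

Passing a list of lists (one per chain) to init starts every chain at the mode.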

I should add that if you have some idea of what the effect sizes will be, scaling everything (esp. the spikes) to be about N(0,1) works great.
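Concretely, assuming the data are assembled in a list called stan_data before fitting, the continuous responses can be standardized so the coefficients land on roughly unit scale:

    # Center and scale the continuous responses to mean 0, sd 1,
    # so b_sha and b_gui end up roughly on N(0,1) scale
    stan_data$sha <- as.vector(scale(stan_data$sha))
    stan_data$gui <- as.vector(scale(stan_data$gui))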

I often have an idea how big the effect size will be, but no idea whatsoever how big the posterior standard deviation will be (and it will depend on how big a sub-sample of the data I decide to use!). So it’s typically quite possible to make, say, the median value O(1), but very tricky to do better than that.

Also, I am working with simplexes these days and of course they are not really rescalable.

Also, I have had some luck on some models using the variational Bayes approximation, and then grabbing an initial vector from that. Typically these initial vectors are better starting places than whatever random junk you get otherwise, but still often not that great.

vbsamps <- vb(...)


for example

Hi, thanks again to everyone.
This worked as suggested: first fitting using the ‘variational’ method, and then using the means of the posterior distributions of the parameters as initial values in the sampling method.
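In rstan that workflow looks roughly like this (a sketch only; sm is the compiled model and stan_data the data list, both placeholder names, and the init list has to cover every parameter):

    vbfit <- vb(sm, data = stan_data)   # variational approximation
    draws <- extract(vbfit)             # list of draws, one element per parameter

    # Posterior means of the VB approximation, shaped like the parameters
    init_vb <- list(
      Ship     = colMeans(draws$Ship),
      b_mir    = colMeans(draws$b_mir),
      sigmasha = mean(draws$sigmasha)
      # ...and so on for the remaining parameters
    )

    fit <- sampling(sm, data = stan_data, chains = 4,
                    init = rep(list(init_vb), 4))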

My first reply was garbled.

We recommend always starting from diffuse initializations to try to probe for multimodality and lack of convergence. The problem with starting every chain in the same spot is that they can trick R-hat into thinking everything is OK when you really haven’t explored the posterior. Sometimes you can do this on purpose if you know you have symmetric multimodality, but it’s usually better not to.

My guess is that you have fundamental problems in the model that would be solved with a reparameterization or a move to identify multiple modes (like ordering). Check out Michael Betancourt’s case study on divergences on the web site under users >> documentation >> case studies. And also his case study on mixtures if that’s the kind of problem you’re dealing with.

Bob, those are definite possibilities, but another issue can be just really long tails, out of which the sampler can only sometimes find the mode in a computationally feasible amount of time. At least, that’s been my experience. Often the way I wind up with something like this happening is that some aspect of my model is t-distributed, or Cauchy, or has some other long tail or strong skewness.

Using VB to get a sample and then starting from that sample works well in some cases. But usually I don’t start all the chains at the same place (say the VB mean); I start each chain at its own randomly selected draw from the VB sample. I think so long as you don’t have multi-modality / identifiability problems, this should work well to avoid getting stuck off in the tail of a long-tailed distribution. But you’re right that with identifiability issues, you do have the concern that Rhat might be fooled.
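A sketch of that per-chain randomization in rstan (sm is the compiled model and stan_data the data list, both placeholder names; again, the init list has to name every parameter):

    vbfit <- vb(sm, data = stan_data)
    draws <- extract(vbfit)

    # Give each chain its own randomly chosen VB draw as a starting point
    n_chains  <- 4
    init_list <- lapply(seq_len(n_chains), function(chain) {
      i <- sample(nrow(draws$Ship), 1)
      list(Ship     = draws$Ship[i, ],
           b_mir    = draws$b_mir[i, ],
           sigmasha = draws$sigmasha[i]
           # ...and so on for the remaining parameters
      )
    })

    fit <- sampling(sm, data = stan_data, chains = n_chains, init = init_list)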

Great—that’s exactly what we’ve been recommending (not the VB part, but the randomization around pre-specified inits). I hope we can build this in soon—we just have to unconstrain the inits, jitter, and reconstrain.