Divergences in a non-centered computational model

Hi all.
I’m trying to fit a computational model - a Bayesian state-space model with three parameters per participant: tau (corresponding to 1 - process noise), gamma (corresponding to 1 - observation noise) and w (corresponding to the probability of random responding). The model is quite complex, so I attach only the parameters and model blocks (the model itself is wrapped in a function).
Here is the thing - I’m using a non-centered parameterization, but I’m still getting some divergent transitions (~40, using 3 chains of 1000 iterations, 400 of which are warmup; note that this already takes a few days to finish sampling, so larger numbers of iterations would take even more time).
However, posterior predictive checks show that the model is working. And when I run the computational model for individual participants (in a non-hierarchical model), no divergences occur.
Even stranger, when examining the parallel coordinates plots (attached herewith), there is no evidence that the divergences are concentrated in a specific range of any parameter. I was also unable to find any evidence of bad geometry in the pair plots (I can’t attach all of them, but a sample including the immediate suspects can be found in the file). Finally, I also attach the plot comparing lp and acceptance for divergent vs. non-divergent transitions (I’m not sure exactly how to interpret it, but I can’t see any clear distinction there either).
Divergences plots.pdf (806.1 KB)
Another thing I can say is that chains with a lower step size (step sizes range from 0.137 to 0.183) also have fewer divergences - is this evidence that the problem can be solved by lowering the initial step size and increasing adapt_delta? More generally, my question is - how can I proceed? Are these real divergences, or are they a sort of false alarm?

One last thing I think I should note - while trying to find init values, I often get the warning “Error evaluating the log probability at the initial value - Exception: bernoulli_lpmf: Probability parameter is 1, but must be in the interval [0, 1]”. This happens because for some combinations of parameters pRight can be equal to 1 or 0. I’m not sure whether this might be causing the divergences - but I would assume that if this were the case, we would also see them in the non-hierarchical models run for individual participants.

Also - a more technical question - is there any way to let the sampler record the parameter values of the divergent transitions themselves (rather than only their starting points)? I thought that this might help.

Any advice would be much appreciated!
Itzik.

parameters{
    // subject-level offsets (non-centered parameterization), given standard-normal priors
    vector [nSubj] tau0;
    vector [nSubj] gamma0;
    vector [nSubj] w0;
    
    // group-level means on the unconstrained scale
    real TAU0;
    real GAMMA0;
    real W0;
    
    // group-level SDs
    real<lower=0> TAU0_SD;
    real<lower=0> GAMMA0_SD;
    real<lower=0> W0_SD;
  }
  
  transformed parameters{
    vector[nSubj] tau;
    vector[nSubj] gamma;
    vector[nSubj] w;
    real TAU;
    real GAMMA;
    real W;
    
    tau=Phi_approx(tau0*TAU0_SD+TAU0);
    gamma=Phi_approx(gamma0*GAMMA0_SD+GAMMA0)/2+0.5; //the additional transformation is used to keep gamma from 0.5 to 1
    w=Phi_approx(w0*W0_SD+W0);
    
    GAMMA=Phi_approx(GAMMA0)/2+0.5; //the additional transformation is used to keep gamma from 0.5 to 1
    TAU=Phi_approx(TAU0);
    W=Phi_approx(W0);
  }
  
  model{
    vector [nTrials] pRight[nSubj];
    
    tau0~normal(0,1);
    gamma0~normal(0,1);
    w0~normal(0,1);
    
    TAU0~normal(0,1);
    GAMMA0~normal(0,1);
    W0~normal(0,1);
    
    TAU0_SD~normal(0,5);
    GAMMA0_SD~normal(0,5);
    W0_SD~normal(0,5);
    
    
    for (s in 1:nSubj){
      pRight[s]=Bmodel1(cues[s],targets[s],nTrials,gamma[s],tau[s],w[s])[,1];
      for (t in 1:nTrials){
        if(response[s,t]!=999){ //This is just used to ignore missing responses marked by 999
          response[s,t]~bernoulli(pRight[s,t]);
        }
      }
    }
  }

Welcome back!

I am not a divergences specialist, but from the plots I think there are too many to dismiss as false alarms. The relation with the step size seems like a good indicator that increasing adapt_delta might help.
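
If you’re using rstan, that would look something like the sketch below (untested; mod and stan_data are placeholders for your compiled model and data list):

library(rstan)

# Placeholders: `mod` is the compiled model, `stan_data` the data list.
# A higher adapt_delta makes warmup target a higher acceptance rate, which
# usually ends in a smaller step size; `stepsize` only sets the initial value.
fit <- sampling(
  mod, data = stan_data,
  chains = 3, iter = 1000, warmup = 400,
  control = list(adapt_delta = 0.95, stepsize = 0.05, max_treedepth = 12)
)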

I think it would be helpful if you could solve this problem, or at least diminish its occurrence. There are a number of steps you can take. The most hacky, hail-mary way would be to force pRight between 0 and 1 with something like pRight[s] = max(min(Bmodel1, 0.999), 0.001). This might break the sampler, though. The better way would be to use narrower priors on the hierarchical parameters. Without knowing what Bmodel1 is doing, it’s hard to tell you exactly which priors are too wide. Given that Bmodel1 is quite complicated, it’s probably hard to assess the impact of the priors analytically anyway, so simulating from the prior might be the way to go. If you get a lot of these exceptions with the hierarchical model but not with the individual models, I would see this as an indication that the priors are too wide and allow too many unlikely combinations of tau, gamma, and w that lead to pRight = 0 or pRight = 1.

@Isaac_Fradkin, do you have the data? To me, it seems your priors are very “liberal”. How many subjects and trials?

Thank you both!

  • I will try to increase adapt_delta. The thing is that it will likely prolong sampling to more than a week, which is quite impractical. Is there any way I can speed this up? I read somewhere about changing the metric to dense_e instead of diag_e. I have no idea what that means - but could it help?
  • I have 60 participants with 90 trials each. What do you mean by too-liberal priors? The way it currently works is that all priors on the transformed parameters are uniform on [0,1]. The posterior means of tau are around .9, gamma around .7 and w around .99. So what kind of priors would you use (we are talking about the group-level priors, right)?
  • If the problem indeed has to do with samples where pRight = 1 or 0 - are the divergences actually a problem? Doesn’t it just mean that the sampler doesn’t try to get into regions where too many pRight values become 0 or 1? This is just like rejecting inits with such values, no? In a way that sounds like good behavior…

Thanks!
Itzik.

Isaac, setting the correct priors makes sampling faster also :) Have you plotted the combination of all your priors, i.e., prior predictive simulations? To me that is the only sane way to at least start to understand the priors in my models :)

You could also replace those normal(0, 1) with std_normal(), as according to https://mc-stan.org/docs/2_21/functions-reference/normal-distribution.html it should be much more efficient.

Not really. We don’t know that this is what’s causing the divergences. That is why it would be good to make sure that pRight = 0 or 1 doesn’t happen - at least, if that is consistent with your domain knowledge. Also, if the sampler is exploring these unlikely regions of parameter space and having trouble there, it might be slowing things down. Again, that’s why I would try to get rid of them, and I agree with @torkar that the priors are suspect number one.

I tried to simulate from your priors for the parameters. I did it in R so I apologise if you cannot follow this. I am happy to explain more.

require(ggplot2)
require(cowplot)
theme_set(theme_cowplot())

N = 10         # number of prior simulations
subjects = 90  # individuals per simulation
sds = abs(rnorm(N, 0, 5))    # hierarchical SDs ~ half-normal(0, 5)
intercepts = rnorm(N, 0, 1)  # hierarchical means ~ normal(0, 1)
parameters = matrix(NA, nrow = N, ncol = subjects)
plots = vector("list", N)
for (i in 1:N){
  # individual-level parameters mapped to [0, 1], as Phi_approx does in the model
  parameters[i, ] = pnorm(rnorm(subjects, intercepts[i], sds[i]))
  plots[[i]] = qplot(parameters[i, ])
}
plot_grid(plotlist = plots)

The results of 10 simulations for 90 subjects show that most of the simulated parameters are close to 0 or 1. The community here has only started to realise in the last couple of years how quickly that can happen with seemingly innocuous priors. That’s why we advise simulating from the prior.

As an aside: I think you are not using GAMMA, TAU and W in the model block, so you could move them to the generated quantities block for a small speed-up. There might also be a small gain from moving the tau, gamma and w calculations to the model block - for instance, you could calculate them within the s loop.


Good summary @stijn!


Thanks! I’ve actually got the same result when simulating from the priors. I think the hierarchical SD is to blame - that makes perfect sense, because the hierarchical mean parameters are close to 0.9-1, so if the SD is too high, the pnorm (or Phi_approx) transformation produces a bimodal prior for individuals.
The question is how to handle this. One way to go is to tighten the priors on the hierarchical SDs (e.g. normal(0,1) or cauchy(0,1)). I will test to see if this works. Is there any other way to avoid this bimodality?
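
(Roughly the kind of check I mean - just a variation on @stijn’s simulation, comparing a wide and a tight SD prior; the 0.2 is only an example value:)

sim_prior = function(sd_scale, n = 5000){
  mu = rnorm(n, 0, 1)                 # hierarchical means
  sigma = abs(rnorm(n, 0, sd_scale))  # hierarchical SDs ~ half-normal(0, sd_scale)
  pnorm(rnorm(n, mu, sigma))          # implied individual parameters on [0, 1]
}
par(mfrow = c(1, 2))
hist(sim_prior(5), breaks = 40, main = "SD scale 5")     # horns at 0 and 1
hist(sim_prior(0.2), breaks = 40, main = "SD scale 0.2") # no horns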

Also- thanks for the advice re: speeding things up!

Thanks!
Itzik.

You might want to look into the beta distribution for the restricted parameters. I find the beta distribution quite flexible and relatively intuitive once you play around with some simulations.

Well, I thought about it, but is there a way to do a non-centered beta parameterization?


I am not aware of an easy way to do that. However, you might not need a non-centred parametrisation because you have quite a bit of data (90 trials) to estimate 3 parameters per participant. Admittedly, this is a cop-out answer.

Hi all.
I’ve run the model while setting the scale of the SD parameters’ priors to 0.2 (which gives a roughly uniform prior predictive for the individual-level parameters). It works much better. However, one of the chains still produces some divergent transitions. This time, however, they have a more distinctive signature. First, the divergent transitions are associated with a much lower Metropolis acceptance rate - I’m not sure exactly how to interpret this. Second, as I said, they occur in only one chain - what could be the reason for this? Third, pair plots show that they seem to occur when GAMMA0_SD is high and GAMMA0 is either low or high. I’m not sure why this would happen; I thought it might be related to the fact that in the original model gamma was limited (by an additional linear transformation) to the range 0.5 to 1 instead of 0 to 1.
I attach some new diagnostics.
divergences.pdf (367.0 KB)
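
(For completeness, pair plots with the divergent iterations highlighted can be made with bayesplot along these lines - fit here is just a placeholder for the stanfit object:)

library(bayesplot)
np = nuts_params(fit)   # per-iteration NUTS diagnostics, including divergent__
mcmc_pairs(as.array(fit),
           pars = c("GAMMA0", "GAMMA0_SD", "TAU0", "TAU0_SD"),
           np = np)     # divergent transitions are marked in the panels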


Do you happen to have any insight about what is going on here?

Thanks!
Itzik.

Four chains? How long do you run each chain and what is your warm-up?

Hi Richard.

Three chains. Each chain for 1000 iterations with 400 warmups.

PS: I don’t get the divergent transitions when using random, smaller subsamples (10 of 60 participants).

Well if the runtime isn’t a problem, what happens if you run iter=1e4 and chains = 4?

Run-time can be a problem (it takes a bit more than a day currently), but I can surely try.
Is there anything in the diagnostics that makes you think in this direction?

BW,
Itzik.

Stop. If it takes a day then it feels like the model could be misspecified… @betanalpha could perhaps provide some insight here - when I have divergences that don’t like me I go to the man who loves to explain them :)

Well, it’s a pretty complicated computational model, so every evaluation of the likelihood takes some time - it’s not surprising that sampling takes considerable time overall. However, it does create a problem for solutions such as increasing adapt_delta to very large values or running very long chains.

I have the intuition that there might be something inherent in this common parameterization of [0,1]-constrained variables that creates strange lp surfaces. That is, even though specifying a small scale for the hierarchical SD parameters creates a uniform individual-level prior when the hierarchical mean is centered around zero, when that mean is centered around a relatively high/low value (and in our case all the constrained parameters are around either 0.75 or 0.9), it creates an individual-level prior that is very much ‘pushed’ towards the boundary. This will inevitably require smaller step sizes in the direction of that boundary and larger step sizes in the other direction, and it will be especially prominent when the hierarchical SD parameter is large. I think this might be the source of the problem, but I can’t think of a good reparameterization trick to solve it.
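
A quick simulation of what I mean (hypothetical numbers; the population mean is fixed so that it maps to ~0.9 on the probability scale):

mu = qnorm(0.9)  # population mean on the unconstrained scale
z = rnorm(1e5)   # standard-normal individual offsets
par(mfrow = c(1, 2))
hist(pnorm(mu + 0.2*z), breaks = 40, main = "SD = 0.2")  # mild skew towards 1
hist(pnorm(mu + 1.0*z), breaks = 40, main = "SD = 1")    # mass piled against 1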

However, I might be completely wrong, and it would be great to see what @betanalpha has to say.

Thanks Richard!

Sorry but I won’t be able to comment on the specific model. Instead let me offer some general advice.

Sampling in Stan is done on an unconstrained scale, so the apparent geometry on the constrained scale can sometimes be misleading. You can run with the diagnostic_file option to save the unconstrained values and look for problems there to avoid any confusion.
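
A minimal sketch of doing that from rstan (assuming a compiled model mod and data list stan_data; the exact contents and per-chain file naming of the diagnostic output vary a bit between interfaces and versions):

library(rstan)

# Write per-iteration sampler diagnostics, including the parameter values on
# the unconstrained scale, to CSV (one file per chain).
fit <- sampling(mod, data = stan_data,
                chains = 3, iter = 1000, warmup = 400,
                diagnostic_file = "diagnostics.csv")

# The divergence flags are also available directly from the fitted object:
sp <- get_sampler_params(fit, inc_warmup = FALSE)
sapply(sp, function(x) sum(x[, "divergent__"]))   # divergences per chain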

Trying to debug a model that takes days to run is going to be a pain no matter the circumstance. The best thing that you can do is back up and restart with a simpler model, adding structure incrementally until something breaks. This may include removing hierarchical structure, heterogeneity in general, or subjects. Small models will be faster to iterate on and easier to explore for debugging purposes.

Hierarchical priors are rife with degeneracies that can cause divergences. The standard centered/non-centered choice concerns how the individual parameters interact with the population parameters (Gaussian location and scale). But the population parameters themselves can also be degenerate when there are few individuals or little data per individual. Make sure to look at those pairs plots, too.

Centering/non-centering is done for each individual one at a time, not for all the individuals altogether. If you have strong imbalances in your data then you might have to center some individuals and non-center others.

As @stijn notes, if you see the “horns” then your priors are definitely way too wide. This doesn’t necessarily mean that “boundary” effects are a concern, but it will require much more from your data to identify the model as a whole. And because binary data are so weakly informative, that might require way more data than you expect.
