High number of divergent transitions in ODE model with explicit constraints

Hi everyone,
This is my first post here on the forum so please let me know if I am missing something or you want me to be more explicit about my problem.

I have been using RStan for a couple of months now and really enjoy it, but I have run into some problems with my ODE model. I am modelling the labelled fraction of two populations and how that fraction changes over time.
Now, when I try to infer the posteriors of the parameters I run into one of the two following problems: either around 1/5 to 1/4 of the post-warmup transitions are divergent, or one of the chains gets “trapped” in an area of low probability and all of its transitions are divergent.

Due to biological considerations I have imposed two constraints, 0 <= pn - tr_delta <= 1 and 0 <= 2^k * tr_delta*R + ps <= 1. From reading the old mailing list and this forum I guess that these constraints might be part of the problem.

Since I am quite new to modelling in general and Stan in particular, I was wondering if you could help me and point me in the right direction to make my model work.

I am open to any kind of criticism and suggestions.

Thanks in advance for your help,

Jonas

Stan_model.stan (3.2 KB)

From reading the old mailing list and this forum I guess that these constraints might be part of the problem.

If you’re bumping up against the edges of these constraints, then they could be a problem. But they might not be.

Have you made pairplots for this? Since you have 8 parameters, this is probably the easiest way to go looking for things. Are the divergent transitions consistent with each other in some way?

Step one is to make pairplots, highlight the divergent points (like the pairplots in Diagnosing Biased Inference with Divergences), and look for patterns.

Something could be going south with the ODE solve that is making the gradient of the log probability very hard to compute.

Looking at the sampled values for the divergent transitions and calculating the respective transformed parameters, it seems like they are not bumping against these edges.

divergent_dn_values.pdf (6.5 KB)
divergent_ds_values.pdf (6.5 KB)

I have created some pairplots but I can’t really see a strong pattern for the divergent transitions. The only thing which stands out to me is that low values of k seem to coincide with divergence.

pairs.pdf (72.6 KB)

BTW, inference on this model is quite slow, which makes debugging slow too. Is it possible that all the for loops I use to account for missing data points are slowing it down?

I have already set adapt_delta to 0.99; what would you suggest next?

Agreed.

For loops can slow things down, but it’s probably not the biggest problem here. And since upping adapt_delta didn’t help, just leave it at its default value for now.

I just looked at the ODE function, specifically this bit: label = t > tau. Tau is a parameter here, right?

Stan requires its likelihoods to have first derivatives everywhere. Conditionals like this can create likelihoods that are continuous but not differentiable, and the sampler can get stuck in the resulting little kink.

So the question is, is the function label(tau) continuous and first differentiable?

If not, is there a way to smooth this out? Sometimes people replace hard switches like if statements with smooth switches (sigmoids and such). Or is there an easy way to get rid of this switching part of the model and see how it behaves?

Tau is data here. It is the time point after which the treatment stops and the concentration of label starts to go down over time.

Conceptually, what would a smooth switch look like?

Oh, whoops, I saw it come in via the ODE theta array and assumed it was a parameter. Well, scratch that.

Next thing I’d do is plot ODE trajectories of the divergent vs. non-divergent things and look for any weirdnesses.

I’d be curious about making the tolerances tighter. Also, 1e8 seems like a large number of steps for an ODE solver. How long does this ODE run? Are you actually using all 1e8 of them?

An if statement looks like a step function; a smooth one would look like a sigmoid. I just called it a smooth switch – it’s not a technical term.
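Just to make that concrete, here’s a minimal sketch of the idea in Stan. The width constant is made up and would need tuning to your time scale (and since tau turned out to be data, this is only for illustration):

  // Instead of the hard switch (0/1 jump at t == tau):
  //   label = t > tau;
  // a smooth switch ramps from 0 to 1 around tau:
  real width = 0.1;                            // made-up constant, tune to your time scale
  real label = inv_logit((t - tau) / width);   // smaller width = sharper switch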

Generate the ODE trajectories with generated quantities or something, probably.

Since you’re running ODEs, you might benefit from the plotting stuff here: https://github.com/mjskay/tidybayes

Can cut down pretty significantly on the R gymnastics to get things looking nice.

Do you mean smaller when you say tighter?

I will do that and let you know how it goes.

I just checked and setting it back to 0.8 increases the number of divergent transitions.

Yeah, so 1e-8 instead of 1e-6. It’s worth wiggling them a little if you suspect ODE shenanigans (which we should be suspicious of at this point).
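In case it’s useful, those knobs are the optional trailing arguments of the solver call. Roughly like this, where rhs, y0, t0, ts, theta, x_r, and x_i are just placeholders for whatever your model already passes in:

  // Only the last three arguments change: rel_tol, abs_tol, max_num_steps.
  y_hat = integrate_ode_rk45(rhs, y0, t0, ts, theta, x_r, x_i,
                             1e-8,   // relative tolerance (default is 1e-6)
                             1e-8,   // absolute tolerance (default is 1e-6)
                             1e6);   // max_num_steps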

If it doesn’t slow you down too much, leave it high. I like to think of changing the treedepth/adapt settings as last ditch things.

I have played around a bit with the tolerances and I get “weird” results. Using the rk45 solver, the divergent transitions are reduced by tightening the tolerance to 1e-10, but beyond that divergence increases again. When I change to the stiff solver at 1e-10, all post-warmup transitions become divergent.

I also looked at the trajectories of the divergent transitions for the two species overlaid with my simulated data. There is nothing obvious going on here either.

Trajectories_divergent.pdf (10.5 KB)

Looking at the pairs plot again, I have a feeling that, although it looks more or less random, low k values increase divergence. Do you think that could be some sort of numerical issue, since it is 2^k in the model?

pairs_high_sens_e10_adapt_08_rk45_step001_tree30.pdf (127.2 KB)

Do you have any idea what is going on here? I am a bit confused.

I am confused as well.

Looking at the trajectories you plotted though, this sure doesn’t seem like the kind of ODE that needs 1e8 steps to resolve.

If you make the max_num_steps lower does anything change?

The next thing I’d try is rescaling the parameters. It’s tough on the sampler if R is about 100 and pn is 4 orders of magnitude smaller. Any chance you can reparameterize that and try that real quick?

Like define:

parameters {
  real pn1000;                    // sampled on a scale where typical values are ~1
}

transformed parameters {
  real pn = pn1000 / 1000.0;      // back on the natural scale for the rest of the model
}

It’s easiest on the warmup if everything is on the same scale around zero. Part of what warmup does is figure out this transformation, but since you only have 8 parameters, maybe do this manually and see if it helps.

Also as a side comment, it’d be slightly more efficient if you moved:

  theta[9]=delta;
  theta[10]=beta;
  theta[11]=tau;

into the x_r data array. The ODE solver will do the forward sensitivity analysis for everything in theta, even if it doesn’t need those sensitivities for the log density gradient. Things in x_r are not autodiffed.
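As a rough sketch, assuming delta, beta, and tau are all plain data (the x_r packing is the only new bit here):

transformed data {
  // Constants the ODE needs but that are not parameters.
  // The solver computes no sensitivities for anything in x_r.
  real x_r[3] = {delta, beta, tau};
}

Then inside the ODE function you’d read x_r[1], x_r[2], x_r[3] instead of theta[9], theta[10], theta[11], and drop those three entries from theta.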

I had actually reduced it to 1e6 for the previous runs, and when I reduced it further to 1e4 the divergent transitions increased again.

I am not sure I fully understand your transformation. I have attempted some rescaling which looks like this:

parameters{
  real<lower=0, upper=1> k_scaled;
  real<lower=0.05, upper=1> R_scaled;
}

transformed parameters{
  real<lower=6,upper=120> R;
  real<lower=0, upper=20> k;

  k = k_scaled * 20;
  R = R_scaled * 120;
}

model{
  k_scaled ~ uniform(0,1);
  R_scaled ~ uniform(0.05,1);
}

That reduced the divergent transitions from 49 to 41 out of the 250 post-warmup transitions. Does this scaling make sense in terms of making things easier for the sampler?

Yeah, that’s it. It’s kind of annoying but it can make life a little easier on the sampler. Also do it for the really tiny things.

Not a difference I’d come to any conclusions from.

How many warmup iterations are you using?

Could you put the code in a code block? It’s hard to read outside of one, and the markdown gobbles up the syntax.

Just put ``` on a new line before and after what you want to make code.

If you do ```stan
[your code]

then three more ticks on a new line at the end, you’ll get pretty colors as well.

I was thinking about this again. It might be best to pop out the ODE solve and just look at the timestepping of a solver outside Stan. I’ve used deSolve before and liked it. It’s not too bad.

It just seems like there has to be something unusual happening here.

I am using 250 warmup iterations and 250 sampling iterations while testing. I increased these once, which did not seem to change the number of divergent transitions.

I have played around with deSolve and its ode45 method, which is the most similar to the solver integrated in Stan. Changing the max steps between 1e5 and 1e10 seems to have no marked effect on the results.

stepsize_traj.pdf (8.2 KB)

I also varied the absolute and relative tolerances between 1e-2 and 1e-14 in steps of 1e-2. What you can see is that the results are markedly different between 1e-2/1e-4 (blue and red; it appears red because of the overlay) and the smaller-tolerance results.

tolerance_traj.pdf (13.7 KB)

The Stan model parameters look like this now. Are there any further changes I could make? All parameter priors are now defined between 0 and 1.

parameters{
  real<lower=0,upper=1> pn;
  real<lower=0,upper=1> ps;
  real<lower=0,upper=1> tr_delta;
  real<lower=0, upper=1> k_scaled;
  real<lower=0.05, upper=1> R_scaled;
  real<lower=0> sigma1;
  real<lower=0> sigma2;
  real<lower=0,upper=1> ds_label;
}

transformed parameters{
  real<lower=0,upper=1> dn;
  real<lower=0,upper=1> ds;
  real<lower=6,upper=120> R;
  real<lower=0, upper=20> k; 
  
  k = k_scaled*20;
  R = R_scaled*120;
  dn = pn - tr_delta;
  ds = pow(2, k)* tr_delta*R + ps;
  
}

model{
  pn ~ uniform(0,1);
  ps ~ uniform(0,1);
  tr_delta ~ uniform(0,1);
  k_scaled ~ uniform(0,1);
  ds_label ~ uniform(0,1);
  R_scaled ~ uniform(0.05,1);
  sigma1 ~ uniform(0, 1);
  sigma2 ~ uniform(0, 1);

Out of curiosity, have you checked the parallel coordinates (parcoord) plot? It would be interesting to see it.

I’m in no way an expert on ODEs, but instead of uniform priors, what if you used a distribution defined on the interval (0,1), e.g. beta(1.5,1.5) or beta(2,2)?
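i.e. something like this in the model block, just swapping the flat priors shown earlier in the thread:

model{
  // Weakly informative priors on (0, 1) instead of flat uniform(0, 1)
  pn ~ beta(1.5, 1.5);
  ps ~ beta(1.5, 1.5);
  tr_delta ~ beta(1.5, 1.5);
}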


Replacing the uniform(0,1) with beta(1.5,1.5) led to a significant increase in divergences.

I will try that.

I have plotted the parcoord plots. I have separated the parameters k and R since they are quite a bit bigger than the rest.

all.pdf (8.0 KB)
k_R.pdf (6.2 KB)
pn_ps_tr_ds.pdf (7.6 KB)


Can you transform the data as (x - mu)/sd? There is an example in the bayesplot docs.

I can’t seem to find the example you are referring to. Could you point me in the right direction?

Is x the data here, mu the ODE solution at the given time point, and sd the sample of either sigma1 or sigma2?