High number of divergent transitions in ODE model with explicit constraints



Last example


Thanks. I thought we are talking about the actual stan model in terms of transformation.

I have done the transformation and plotted the results:

all_transformed.pdf (10.3 KB)


Somehow the k is diverging when it is low. Could this be due the hard boundary (lower=0)

Do you need to limit k (maybe try without limits or at least without the lower limit)? It’s going into 2^k. What is the description for k?

P.s. recently @Bob_Carpenter wrote a blogpost concerning uniform priors. http://andrewgelman.com/2017/11/28/computational-statistical-issues-uniform-interval-priors/


I just tried that and divergence increased from 41 to 199 transitions.

It describes clonal expansion and thus k should not go below 0 .

I’ll have a look at that.

Do you have any intuition what else could be going wrong here?


Some of the posteriors seem to be “cut” at 0 when looking at the density plots. Using uninformative uniform prior such as U(-1,1) lead to all post warm-up transitions to be divergent and the posterior look not as expected(I know the true value since I use simulated data).


Oh yeah, like @ahartikainen said, this is definitely not a great thing.

Is it possible to use more simulated data and maybe that’d tighten up the posteriors and get you away from the boundaries? If the divergent transitions go away then that’ll tell you something.

Is it possible to post the R code to play with this?


Increasing datapoints reduced divergent transitions form 41 to 10. Which reduced further to 9 when I increased adapt_delta to 0.9 but when I increased warmup and sampling to 1000 steps each and also increased adapt_delta to 0.99 I ended up with all post-warmup transtions being divergent.

Lastly, adding further datapoints somehow increased divergence again but I am currently looking why that could be.


You could also try playing with adding tight priors around the right answer instead of the uniform ones to see if that changes anything.

But my advice is getting a little vague and drifty at this point :D. I’ll be back at my comfy desk in a few days and wouldn’t mind running the code a few times then if it’s easy to post the R code here.


Apologies for the huge delay in my reply. I was doing a few different things in the meanwhile.

I have now expanded the model to a hierarchical model and include an additional dataset.

However, I still have them same issue around parameter k. If used in the form of 2^k in the model I get divergence of all transitions at when debugging with a low number of iterations(60 warm up / 60 sample). While if implement k as result of 2^k I don’t get divergence during debugging but the problem becomes apparent when running longer chains (500/500) as I get 374 divergent transitions.

I have attached the two different stan models and would appreciate any suggestions to resolve this.

Furthermore, I was wondering if I have properly implemented a hierachical model here.

In particular, I get the following warning:

Warning (non-fatal):
Left-hand side of sampling statement (~) may contain a non-linear transform of a parameter or local variable.
If it does, you need to include a target += statement with the log absolute determinant of the Jacobian of the transform.
Left-hand-side of sampling statement:
    mu_diff_comb[i] ~ normal(...)

I think my transformation is linear, right? Thus, I can ignore this message, right?

Let me know if you want me clarify any of the issues.

Model1.stan (5.1 KB)
Model2.stan (5.1 KB)


It’s a little squirrely cause the transformation there isn’t invertible, but I think that’d achieve your intended effect.

To figure out if these transforms are what you’re looking for, you can always do integrals over chucks of space in each of them that should correspond to the same probability and check if they’re equal. You put a prior over something that looks like x - y, but it leaves x + y lookin’ like a uniform, which baffles my math skills cause integrals in that direction don’t seem to normalize.

You could just make mu_diff_combo and mu_tscm parameters, and then make mu_tn a transformed parameter to get rid of the warning if you like.

That was a slick idea, for what it’s worth haha.

If you’re still having trouble with your model, that probly isn’t gonna make things easier.

Have you tried fitting to data that you generated from your exact model yet? Just to see if you can recover parameters? That’s where I’d go next.


Sometimes adding a hierarchical prior helps if the fixed prior was poorly specified.

Sometimes adding more data helps if it helps to identify the parameters.

But usually, it’ll just take longer and lead to more divergences if the hierarchical parameters aren’t well identified themselves.

Parameter transforms have to be square and invertible and have positive definite Jacobians. Even if you only want to put a distribution on some function of the parameters, you need to have an invertible transform. For example, if you have u and v and put a distribution on f(u, v), then what you’re probably really doing is working with transform (u, v) \rightarrow (f(u,v), u). That can be improper without a distribution on u if u is unconstrained.


Sorry, I am not sligthly new to all of this. So does it mean that my transformations is improper the way I have written it. Conceptually, I am trying to fit two normal distributions to data and then fit a model to the difference of the distribution means, which shares parameters with the ODE part of the model.

By doing I hope at least theoretically to attenuate some of the parameter correlations we observe between the parameters in the ODEs.

Do you think the reason for divergences here comes from the fact that k is too correlated and thus non-identifiable? If that is the case I am wondering why adding R as free parameter to the model, which is similarly correlated with trdelta, does not cause issues and ultimately divergent transitions.

Lastly, I forgot to mention that the BFMI seems always to be low. Could that explain any of the behaviour.

I will try out a few more things to get to the bottom of the problem!!


Yeah. Best thing to do is just get out paper and write things down. You have a transformed parameter, u = x - y. The sampler wants to know about p(x, y) (since those are your parameters). You’ve specified a prior on p(u). The question du jour is how to relate those two objects.

Right now you’re defining the difference. It’s just a function of the other random variables in your system. If you want to fit the difference, then you’d make it a parameter and stick some sort of error model around it. Like:

parameters {
  real mu1;
  real mu2;
  real diff;
  // etc. 

model {
  y1 ~ normal(mu1, sigma1);
  y2 ~ normal(mu2, sigma2);
  diff ~ normal(mu1 - mu2, sigma);

Dunno if that’d do anything for you though. I’d just not put the prior on the difference, or reparameterize it in terms of the difference if you did want to put the prior there.

But! With regards to the divergences, can you run this model with a fixed k and see if problems happen? I had a look at the ODE again and that switching thing (t >= tau) scares me. If we fix k at a few values and still see divergences, then maybe we can eliminate it as the culprit.


Thanks, makes sense. I will have look if that would work for my problem at hand.

What I observed when I was working with the exact model is that incorporating k as a free parameter always caused divergence, while keeping all the others free did not cause divergence of the model.

I have observed the same pattern for the hierarchical model, which suggest to me that k is the culprit.


Why does it scare you? In which way would it cause a problem?


I have tried to implement it as you suggested and I am currently running chains to test if its working as I would hope.

Generally speaking I have implemented as follows. Does that make sense than the way I had written it previously?

  int<lower=1> N2;
  real y2[N2];

  int<lower=1> N3;
  real y3[N3];


  real<lower=0> mu_2;
  real<lower=0> mu_3;
  real<lower=0> sigma_2;
  real<lower=0> sigma_3;

  real<lower=0> mu_diff;


  mu_2 ~ uniform(6800,7200);
  mu_3 ~ uniform(6800,7200);

  sigma_2 ~ normal(0,2);
  sigma_3 ~ normal(0,2);

  y2 ~ normal(mu_2, sigma_2);
  y3 ~ normal(mu_3, sigma_3);

  mu_diff ~ normal(mu_2 - mu_3, sqrt(square(sigma_1)/N2 + square(sigma_3)/N3));

  y_hat = "combination of parameters which estimate difference of the two population means but were also fit to ODE solution";

  mu_diff ~ normal(y_hat, sqrt(square(sigma_1)/N2 + square(sigma_3)/N3));


Eek, I’m scared I’ve misled you. My suggestions were based entirely on how to make the warnings go away. Making warnings go away isn’t a good end goal in and of itself.

The issue with how your priors were specified is that they were technically incorrect. This hurts model interpretability, because it isn’t super clear what’s happening, and that isn’t good. It’s easy to write models in Stan that have dubious statistical meaning. A ~ normal(B, C) is the conditional distribution of A given B and C, p(A | B, C). So the model having two mu_diff ~ statements is also almost definitely wrong as well.

There’s a solid chance that getting priors right on this thing isn’t going to matter. The thing to do is check (get rid of all the priors, put loose priors only on the parameters, etc.). Ofc, to check and trust the results we gotta get rid of the divergences. That’s the more important thing. And hopefully the divergences aren’t a result of these priors…

I hope I haven’t confused the situation more :P.

My biggest question is are we fitting simulated data here? Have you generated data from your exact model and then tried to recover the parameters? That’s what I’d do next if you haven’t already. There’s a chance your real data and model don’t match well, and that could be causing problems. By fitting your model to data generated from itself, we can verify that there aren’t any problems intrinsic to the model.


I am slightly more confused :D

Which priors should I get rid of and which should to be loosely defined?

Yes have simulated datasets for 7 individuals with some parameters fixed and some free, depending which parameters are free in the model I am fitting to the data.


I started off with calculating the mean of the two distributions from the two samples outside of stan and feed in the calculated difference as data and than fit the “shared” parameters to it, which looked like something like this:

  real mu_diff;

  real param1;
  real param2;
  real param3;
 real sigma;


  real diff_hat;

  diff_hat = param1 + param2*param3;
  mu_diff ~ normal(diff_hat, sigma);

The idea I had was to incorporate the estimation of the mean and sd in the stan model and use the estimates to fit the other parameters (e.g. param1, param2, param3).

Is this somehow a stupid idea and actually not possible the way I thought I could implement it?


There’s no way param1, param2, and param3 are identifiable there, but the statistical meaning of that model is clear.

It’s not a dumb idea in general. You can use summary statistics (that has technical meaning: https://en.wikipedia.org/wiki/Summary_statistics edit: blech, the Wikipedia definition really sucks, but I failed to Google a better one) to speed up inference. Or you might just summarize other bits of data ad-hoc, but the problem with doing these things is it is pretty easy to accidentally end up solving a different problem than the one you intended.

Aaah, bummer, there must be a real problem with the model then. Not just a weird fit to data or anything.

Do you have an updated non-hierarchical version of this model? The original one looks a little different than the latest ones. There’s some missing data stuff in the original that isn’t in the new ones. Also I’m having trouble parsing out what the telo/tn/tscm etc. stuff is doing.