High number of divergent transitions in ODE model with explicit constraints


I’m not sure what you’re trying to do with the model here, but if you have this uniform distribution on mu_2, you need to declare matching constraints in its declaration (e.g. real<lower=6800,upper=7200> mu_2;).

Why two distributions on mu_diff?


I will write an updated non-hierarchical version and start again from there. Shouldn’t take too long and I’ll get back to you.

It’s an additional piece of data that relates to the two species in the ODE system, and I hope to use the difference of the population means as additional data to inform the parameters of the model (telo_hat=… shares its parameters with the ODEs). These are represented as param1, param2, … in the above example code.

I hope that was not too confusing.

Sorry, I would include matching boundaries. That was only a mock-up to make sure I understood what @bbbales2 was suggesting.

I did that due to the fact that I hadn’t fully understood the way Stan works, but I can see now that this doesn’t make much sense.

As I have written above, I hoped to estimate the means of the two samples within my model and then use their difference to inform the model parameters, rather than just feeding in a single data point with the difference calculated outside of the model.

Concept of model with difference between means as datapoint:

  real[] lineage(real t,
                  real[] y,
                  real[] theta,
                  real[] x_r,
                  int[] x_i) { ... }

  real mu_diff;
  real y[20];

  real param1;
  real param2;
  real param3;
  real sigma;

  real diff_hat;
  real y_hat[2,10];
  real theta[3];

  theta[1] = param1;
  theta[2] = param2;
  theta[3] = param3;  
  y_hat = integrate_ode_rk45(lineage, y0, t0, ts, theta, x_r, x_i, 1.0E-6, 1.0E-6, 1.0E6);
  diff_hat = param1 + param2*param3;

  //likelihood calculation of trajectory obtained from ODE solver
  for (n in 1:N)
    y[n] ~ normal(y_hat[yy[n],kk[n]], sigma[kk[n]]);

  //likelihood calculation difference of means
  mu_diff ~ normal(diff_hat, sigma);

I wanted to incorporate the full dataset of the two samples rather than just summary statistics, which I thought would look something like the model below (I know having mu_diff in there twice doesn’t make sense):

  mu_2 ~ uniform(6800,7200);
  mu_3 ~ uniform(6800,7200);

  sigma_2 ~ normal(0,2);
  sigma_3 ~ normal(0,2);

  y2 ~ normal(mu_2, sigma_2);
  y3 ~ normal(mu_3, sigma_3);

  mu_diff ~ normal(mu_2 - mu_3, sqrt(square(sigma_2)/N2 + square(sigma_3)/N3));

  y_hat = "combination of parameters which estimate difference of the two population means but were also fit to ODE solution";

  mu_diff ~ normal(y_hat, sqrt(square(sigma_2)/N2 + square(sigma_3)/N3));

However, would something like this work (swapping y_hat and mu_diff)?

  mu_2 ~ uniform(6800,7200);
  mu_3 ~ uniform(6800,7200);

  sigma_2 ~ normal(0,2);
  sigma_3 ~ normal(0,2);

  y2 ~ normal(mu_2, sigma_2);
  y3 ~ normal(mu_3, sigma_3);

  mu_diff ~ normal(mu_2 - mu_3, sqrt(square(sigma_2)/N2 + square(sigma_3)/N3));

  y_hat = "combination of parameters which estimate difference of the two population means but were also fit to ODE solution";

  y_hat ~ normal(mu_diff, sqrt(square(sigma_2)/N2 + square(sigma_3)/N3));

I really hope I have not made it even less clear what I am trying to achieve. Let me know if you want me to elaborate.



The less error-prone way to build models is to think generatively. In your experiments, how is data generated from the parameters? Build your models like that.

The simplest assumption for ODEs is that parameters generate states deterministically and then we make noisy measurements of those states.

That’s where terms like y ~ normal(y_hat, sigma) (y_hat being the output of an ODE solve) come from. The Lotka-Volterra case study is the latest case study for that sorta thing: http://mc-stan.org/users/documentation/case-studies/lotka-volterra-predator-prey.html
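In skeleton form that pattern looks like the following (everything here, the my_ode right-hand side included, is a placeholder rather than your model):

  functions {
    // placeholder right-hand side; swap in your actual system
    real[] my_ode(real t, real[] y, real[] theta,
                  real[] x_r, int[] x_i) {
      real dydt[2];
      dydt[1] = theta[1]*y[1] - theta[2]*y[1]*y[2];
      dydt[2] = theta[3]*y[1]*y[2] - theta[4]*y[2];
      return dydt;
    }
  }
  data {
    int<lower=1> T;
    real y0[2];
    real t0;
    real ts[T];
    real y[T,2];   // noisy measurements of the two states
  }
  parameters {
    real<lower=0> theta[4];
    real<lower=0> sigma[2];
  }
  model {
    real y_hat[T,2];
    y_hat = integrate_ode_rk45(my_ode, y0, t0, ts, theta,
                               rep_array(0.0, 0), rep_array(0, 0));
    theta ~ normal(1, 0.5);
    sigma ~ normal(0, 1);
    for (t in 1:T)
      y[t] ~ normal(y_hat[t], sigma);
  }

The parameters generate the states deterministically through the solver; the only stochastic part is the measurement.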

If differences come up in how the system works and the measurements are made, then they’re worth including. If not, leave them out.

What you’ve written makes me think that these differences are not part of the generative process, but rather some assumptions that might regularize the problem a little and maybe make it easier to fit. If so, let’s start without them. Adding in bunches of parameters mostly doesn’t make problems easier anyway :P.


I have written up the non-hierarchical version of my model, and I was actually able to keep all parameters free while having no divergent transitions. Admittedly, I have defined an informative normal prior over parameter par5.

The above non-hierarchical model now includes the difference of population means between the two species of the ODE as a single data point (which comes from different data than the one used for the ODEs). This looks something like this in the model block, where telo and telo_sd are passed as data (described below):

  real y_hat[T,2];
  real telo_hat;
  real theta[6];

  par1 ~ uniform(0,1);
  par2 ~ uniform(0,1);
  par3 ~ uniform(0,1);
  par4 ~ uniform(0,1);
  par5 ~ normal(12,3);
  par6 ~ uniform(0,20);
  par7 ~ uniform(0,k);

  sigma ~ normal(0,1);


  y_hat = integrate_ode_rk45(lineage, y0, t0, ts, theta, x_r, x_i, 1.0E-6, 1.0E-6, 1.0E6);

  telo_hat = par7*50 + (((par2-par1)*50)/(pow(2,par6-1)*par3*par5));

  for (n in 1:N){
    y[n] ~ normal(y_hat[yy[n],kk[n]], sigma[kk[n]]);
  }

  target += normal_lpdf(telo | telo_hat, telo_sd);


I just want to double-check that this makes any sense. The mean and sigma are calculated as follows:

  1. Calculate the mean and sigma for each of the samples of the two species respectively
  2. Calculate the population mean and sigma for both species
  3. Calculate the difference of the means and sum the variances of the two populations
  4. I then pass the difference of the means and the combined sigma into my Stan model, where I fit a combination of parameters (shared with the ODE) to the mean and use the combined sigma as the SD of the Normal error model.
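
For what it’s worth, steps 1–3 could also live inside the Stan program itself, e.g. in transformed data (a sketch; y2/y3 and the sizes are hypothetical names for the raw samples):

  data {
    int<lower=1> N2;
    int<lower=1> N3;
    vector[N2] y2;   // raw measurements, species 2
    vector[N3] y3;   // raw measurements, species 3
  }
  transformed data {
    // difference of sample means and its standard error
    real telo = mean(y2) - mean(y3);
    real telo_sd = sqrt(variance(y2)/N2 + variance(y3)/N3);
  }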

Is this a reasonable way to go about it, or am I absolutely off target with this approach?



So you have two ways of independently measuring trajectories of this ODE, and you do it for two species.

You model one like:

y ~ normal(ode(parameters), sigma);

That makes sense.

But if your second measurement also directly gives you the trajectories of that ODE, I don’t think there is a compelling reason to reduce that to a difference before passing it into the model.

Is there a technical reason I’m missing that you couldn’t just do:

yhat = ode(parameters);
y1 ~ normal(yhat, sigma1);
y2 ~ normal(yhat, sigma2);

where y1 is the data from the first measurement process and y2 is the data from the second measurement technique (and then sigma1/sigma2 represent the measurement noise of each technique)?

If that isn’t possible, then it’s presumably because measurement 2 is some sort of reduced resolution indirect measurement, in which case what you want to write is:

yhat = ode(parameters);
y1 ~ normal(yhat, sigma1);
y2 ~ normal(second_measurement_process(yhat), sigma2);

where second_measurement_process mimics whatever the second measurement technique is measuring from yhat. If the second measurement process gives you an estimate of the mean of the difference of the two trajectories, then code that up in second_measurement_process.
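As a sketch of what I mean (the body of second_measurement_process here is a stand-in; only you know what the second instrument actually reduces the trajectories to):

  functions {
    // hypothetical reduction: the second technique only sees the
    // time-averaged difference between the two state trajectories
    real second_measurement_process(real[,] y_hat, int T) {
      real total = 0;
      for (t in 1:T)
        total = total + y_hat[t,1] - y_hat[t,2];
      return total / T;
    }
  }

and then in the model block something like y2 ~ normal(second_measurement_process(y_hat, T), sigma2);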

I don’t see a compelling reason to do all this preprocessing (computation of means and such) outside the model. I could be wrong ofc, and there are reasons to do this sorta thing, but I’m not convinced yet.


The second measurement is not related to the trajectories at all. However, the parameters are shared between the two models/types of measurement, so I am trying to infer them simultaneously to improve identifiability of some of the parameters.

I think there was a slight typo in the above example code. So I solve the ODE for the sampled parameters and calculate the probability of the trajectories of the two species as follows:

  y_hat = integrate_ode_rk45(lineage, y0, t0, ts, theta, x_r, x_i, 1.0E-6, 1.0E-6, 1.0E6);

  for (n in 1:N){
    y[n] ~ normal(y_hat[yy[n],kk[n]], sigma[kk[n]]);
  }
At the same time I also use a combination of the above parameters to fit against the difference of the means obtained for the two species. This second dataset (a different type of data) is obtained independently from the first dataset. As the combination of parameters describes the difference of the population means, I first calculate this difference from the sampled parameters (z_hat) and then calculate the likelihood of the observed difference z, with the pre-calculated z_sd as the SD of the Normal likelihood model, as follows:

  z_hat = par7*50 + (((par2-par1)*50)/(pow(2,par6-1)*par3*par5));

  target += normal_lpdf(z | z_hat, z_sd);

I hope I have explained it a bit more concisely now. As I am still a novice, I just wanted to make sure that e.g. using the combined SD as the SD of the likelihood function is a reasonable choice to make.


Oh okay, that makes sense. There’s no need to estimate means and standard deviations of data outside Stan though.

I think from what you’re saying you precompute z and z_sd using:

z = mean(measured_dataset_2);
z_sd = sd(measured_dataset_2);

And then pass that in as data.

Instead pass in measured_dataset_2 as data, make z_sd a parameter, and do:

target += normal_lpdf(measured_dataset_2 | z_hat, z_sd);

The only thing to keep in mind is when you’re fitting two models at the same time like this, you gotta watch out that they aren’t disagreeing with each other. Check your posterior predictives. It’s possible one model explains the data but the other doesn’t, and that probably has some application specific implications.
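
One way to do that check is to replicate both datasets in generated quantities and compare them to the observations (a sketch; it assumes y_hat and z_hat are recomputed there or moved to transformed parameters, since model-block locals aren’t visible in generated quantities):

  generated quantities {
    real y_rep[N];
    real z_rep;
    // y_hat and z_hat recomputed here exactly as in the model block
    for (n in 1:N)
      y_rep[n] = normal_rng(y_hat[yy[n],kk[n]], sigma[kk[n]]);
    z_rep = normal_rng(z_hat, z_sd);
  }

If the y_rep draws cover the ODE data but z_rep systematically misses z (or vice versa), the two sub-models are fighting each other.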


It is unfortunately not that straightforward. I tried to incorporate the actual measurements before (see earlier post). For this second dataset I have numerous measurements for both species, and I am interested in the difference of the two population means, which I want to fit the parameters against. So I am actually doing something like this outside of Stan:

z1 = mean(measured_dataset_2_1)
z2 = mean(measured_dataset_2_2)

z1_sd = sd(measured_dataset_2_1)
z2_sd = sd(measured_dataset_2_2)

z = z1 - z2

z_sd = sqrt( (z1_sd^2 / n_z1) + (z2_sd^2 / n_z2) )

And then pass the difference of the population means z and the square root of the sum of the population variances into my Stan model.

Is there any way to do these calculations in Stan and combine the means and SDs in the model?


The thing you’re looking for then is:

target += normal_lpdf(measured_dataset_2_1 - measured_dataset_2_2 | z_hat, z_sd);

(where z_sd is a parameter and z_hat is a function of your other parameters)

All this is saying is the difference in the measurements of the two processes is normally distributed around some mean that is a function of your parameters with some noise level that you’re going to infer. Make sure and check the posterior predictives that this is working.


Thanks. The measurements are not “paired” or anything; they are independent samples from each of the two populations respectively. Will this be problematic?

Furthermore, how do I go about subtracting measured_dataset_2_2 from measured_dataset_2_1 if they are not of equal size? I have already googled this question but couldn’t find anything yet.


Welp, I’m wrong again. And it’s probably gonna keep happening since I don’t really know your model, but try this on for size:

parameters {
  // ... other parameters
  real mu1;
  real mu2;
  real<lower=0.0> sigma1;
  real<lower=0.0> sigma2;
  real<lower=0.0> z_sd;
}

model {
  // ... compute z_hat somehow
  sigma1 ~ normal(0, 1);
  sigma2 ~ normal(0, 1);
  z_sd ~ normal(0, 1);
  mu2 ~ normal(0, 1);
  mu1 ~ normal(z_hat + mu2, z_sd);

  measured_dataset_2_1 ~ normal(mu1, sigma1);
  measured_dataset_2_2 ~ normal(mu2, sigma2);
}

You should verify the likelihood makes any sense by writing out the math. Term by term:
p(\text{measure_data_2_1} | \mu_1, \sigma_1) p(\text{measure_data_2_2} | \mu_2, \sigma_2) p(\mu_1 | \hat{z}, \mu_2, z_{sd}) p(\mu_2) p(z_{sd}) p(\sigma_1) p(\sigma_2)

Looks like a likelihood to meeee.

The generating process is you draw a \mu_2, whatever that means. You then draw a \mu_1 which is some function of \mu_2 plus some offset that is a function of a bunch of other parameters in your model.

From \mu_1, \sigma_1, \mu_2, \sigma_2 you can generate your data. If you’re scared about the prior on \mu_2 cause it’s super arbitrary, make it bigger. Only reason I put it there was so it’s not uniform [-inf, inf].
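
To see this generative story end to end, you can replicate both datasets from the fitted parameters (a sketch; N1 and N2 are hypothetical names for the two sample sizes passed as data):

  generated quantities {
    real rep_2_1[N1];
    real rep_2_2[N2];
    for (n in 1:N1)
      rep_2_1[n] = normal_rng(mu1, sigma1);
    for (n in 1:N2)
      rep_2_2[n] = normal_rng(mu2, sigma2);
  }

Comparing rep_2_1/rep_2_2 against measured_dataset_2_1/measured_dataset_2_2 is the posterior predictive check mentioned earlier.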

If that makes sense for your model, rock n’ roll. If not, point out what’s wrong and we can adjust it.


I think that makes sense. Thank you so much for all the suggestions.

I will give it a try and verify the likelihood makes sense and get back to you.


Thanks for going at these hard posts.

Your response sounds like about half my posts in the first two years on the bboards! It’s a great way to learn applied stats if you’re not afraid to be wrong in public and have an audience patient enough to wait for corrections.


I appreciate that he keeps going.

I have implemented it now and it makes sense. I actually realised that I had the solution in front of me the whole time: using the parameter combination as the “offset” between the population means was exactly the way I simulated the data I was testing against.

My results with an informative prior on R now look like this:

As I am still a novice, what would be the next steps to take? I can see that I still have strong correlations between parameters; would a hierarchical model potentially be advisable? I have previously read that a hierarchical model can help to attenuate these correlations, and my full dataset consists of 7 individuals with the above sub-datasets.

To this end I was also wondering if I can speed up my model, as in its current form it runs for 2.5 hours for 200 warm-up/200 sampling iterations. I have seen that there is an MPI implementation being developed which seemed pretty promising, but I guess it is not yet stable enough to be used by the “masses”. Anyhow, I have access to an HPC if I can make use of parallelisation somewhere in my code.

model.stan (2.5 KB)

Or shall I open a separate thread regarding the speed issue?


Hmm, looks like the posterior samples of ps and C are super duper correlated.

If you knew either ps or C, you’d have a pretty good idea of what the other one was.

Looking at the model, the only place ps and C come together is:

telo_hat = C*50 + (((ps-pn)*50)/(pow(2,k-1)*tr_delta*R));

So the posterior is saying it can explain the data by either making C small and ps big, or C big and ps small. Since ps shows up in other places, and C only shows up here, I guess the first natural question is: what does C mean, and what is it doing? If we could reparameterize it, or get rid of it entirely, that would be nice.

Also is the constraint on ps [0, 1] or [0, 0.05]?

Tightly correlated posteriors like this are hard on the sampler. Once we fix this, it should go faster. Hopefully much much faster and your computational problems are all gone, but maybe not.


The funnel at boundary zero in tr_delta is also worrisome. Is that a hierarchical variance?

Also, there appears to be some multiplicative nonidentifiability in ds_label and k.