ODE parameter estimation - diagnostics

specification

#1

Hi!

First of all, I would like to apologize if my question is too naive. I started using RStan for estimating 24 parameters of a system of 8 ODEs. From these notes, I produced the following plots of the densities of my parameters .

My question is: In the case of the densities that do not shown an unimodal behavior, it is caused by a miss-specification in my set of ODEs?, the priors that I used?, it is necessary to end up with unimodal distributions of the parameters?

I will really appreciate your comments and suggestions. My Stan model is below!

functions {
	real[] 	cbf(real t,
				real[] x,
				real[] theta,
				real[] y_r,
				int[] y_i) {
		real dxdt[8];
		
		dxdt[1] = - theta[1]*theta[12]*x[1]*x[6]/(theta[17]+x[1]) - theta[2]*theta[14]*x[1]*x[7]/(theta[19]+x[1]);
		dxdt[2] = - theta[3]*theta[13]*x[2]*x[6]/(theta[18]+x[2]);
		dxdt[3] = theta[4]*theta[12]*x[1]*x[6]/(theta[17]+x[1]) + theta[5]*theta[13]*x[2]*x[6]/(theta[18]+x[2]) - theta[6]*theta[15]*x[3]*x[8]/(theta[20]+x[3]);
		dxdt[4] = theta[7]*theta[14]*x[1]*x[7]/(theta[19]+x[1]) - theta[8]*theta[16]*x[4]*x[8]/(theta[21]*x[8]+x[4]);
		dxdt[5] = theta[9]*theta[14]*x[1]*x[7]/(theta[19]+x[1]) + theta[10]*theta[15]*x[3]*x[8]/(theta[20]+x[3]) +theta[11]*theta[16]*x[4]*x[8]/(theta[21]*x[8]+x[4]);
		dxdt[6] = theta[12]*x[1]*x[6]/(theta[17]+x[1]) + theta[13]*x[2]*x[6]/(theta[18]+x[2]) - theta[22]*x[6]*x[3];
		dxdt[7] = theta[14]*x[1]*x[7]/(theta[19]+x[1]) - theta[23]*x[7]*x[4];
		dxdt[8] = theta[15]*x[3]*x[8]/(theta[20]+x[3]) + theta[16]*x[4]*x[8]/(theta[21]*x[8]+x[4]) - theta[24]*x[8]*x[5]^2;
		
		return dxdt;
	}
}
data {
	int<lower=1> T;
	real<lower=0> x[T,8];
	real t0;
	real ts[T];
	real x0[8];
}
transformed data {
	real y_r[0];
	int y_i[0];
}
parameters {
	vector<lower=0>[8] sigma;
	real<lower=0.005999985,upper=1.998029> mu1;
	real<lower=0.00399848,upper=0.404> mu2;
	real<lower=0.4559996,upper=1.998033> mu3;
	real<lower=0.7739987,upper=1.999988> mu4;
	real<lower=0,upper=0.01000002> mu5;
	real<lower=44.7998,upper=799.9959> ks1;
	real<lower=3.678926E-7,upper=0.8000041> ks2;
	real<lower=53.59992,upper=380.8> ks3;
	real<lower=38.39991,upper=132.8> ks4;
	real<lower=0.3999882,upper=799.6> ks5;
	real<lower=0.03391768,upper=0.05500034> k1;
	real<lower=0.005047708,upper=0.006317451> k2;
	real<lower=0.006030796,upper=0.007347895> k3;
	real<lower=13.99994,upper=1999.843> yc1;
	real<lower=19.99999,upper=34.00002> yc2;
	real<lower=13.99998,upper=1998.618> yc3;
	real<lower=1.997585,upper=932.0009> yc4;
	real<lower=1.999979,upper=552.0002> yc5;
	real<lower=675.9997,upper=1150> yc6;
	real<lower=9.70224,upper=10.00001> yc7;
	real<lower=0.9998504,upper=1999> yc8;
	real<lower=5.926618,upper=6.00001> yc9;
	real<lower=1.998582,upper=70.0001> yc10;
	real<lower=0.9999872,upper=1999> yc11;
}
transformed parameters {
	real x_hat[T,8];
	{
		real theta[24];
		theta[1] = yc1;
		theta[2] = yc2;
		theta[3] = yc3;
		theta[4] = yc4;
		theta[5] = yc5;
		theta[6] = yc6;
		theta[7] = yc7;
		theta[8] = yc8;
		theta[9] = yc9;
		theta[10] = yc10;
		theta[11] = yc11;
		theta[12] = mu1;
		theta[13] = mu2;
		theta[14] = mu3;
		theta[15] = mu4;
		theta[16] = mu5;
		theta[17] = ks1;
		theta[18] = ks2;
		theta[19] = ks3;
		theta[20] = ks4;
		theta[21] = ks5;
		theta[22] = k1;
		theta[23] = k2;
		theta[24] = k3;
		
		x_hat = integrate_ode_bdf(cbf, x0, t0, ts, theta, y_r, y_i,1.0E-6, 1.0E-6, 1.0E8);
	}
}
model{
	
	sigma~cauchy(0,2.5);
	for (t in 1:T)
		x[t]~normal(x_hat[t],sigma);
}

#2

Real quick, the three different colors in the plots are three different chains?

The thing that stands out to me in this model is not that the posteriors wiggle a little bit, but that the sampler seems to be getting near all the bounds of each of the parameters. That’s no good, cause it hints that there’s valid parameter space that the sampler isn’t exploring cause of constraints.

I’d get rid of all the constraints you don’t absolutely need and just replace them with weakly informative priors. Check https://github.com/stan-dev/stan/wiki/Prior-Choice-Recommendations .

Also you’ve got parameters that cover a really large range. Looks like the kx parameters can get down to 0.005~ ish, and the ycx parameters get up in the 1000s. It’s easier on Stan if you can manually scale these so the parameters are as near as possible to unit normals as you can make them, if that makes sense. So maybe for the kxs:

parameters {
  real k3big;
}

transformed parameters {
  k3 = k3big / 100.0;
}

model {
  k3 ~ normal(0.0, 0.1);
}

It can look a little ugly but it’s worth it when you have parameters that are widely varying. This is covered in that priors link as well.


#3

Dear Ben,

Thanks for your answer, I have been thinking about it for some days (I am a rookie in Bayesian Statistics). Answering your question, yes, the three different colors correspond to the posterior distribution of my parameters given by different chains.

About the constraints, I came up with them from an exploration of the parameter space using a deterministic approach (optimization with OLS cost functions. I am trying to model a microorganism’ driven fermentation process, so these ranges are those in which my OLS fitting could find optimal solutions within biological plausible ranges. So values beyond these limits are not allowed.

Because of the non identifiability that I faced using this approach, I hope that Stan would help me to figure out if it is a matter of structural or practical non identifiability.

About your answer, I did not get properly the idea of dividing the parameters that are extremely large by 100, would that mean that I need to re-scale the model too? (multiplying by 100 each term as it appears on the model block?).

Besides, I tried another approach, where I defined my big parameters as the inverses:

transformed parameters {
	real x_hat[T,8];
	{
		real theta[24];
		theta[1] = 1/yc1;
		theta[2] = 1/yc2;
		theta[3] = 1/yc3;
		theta[4] = 1/yc4;
		theta[5] = 1/yc5;
		theta[6] = 1/yc6;
		theta[7] = 1/yc7;
		theta[8] = 1/yc8;
		theta[9] = 1/yc9;
		theta[10] = 1/yc10;
		theta[11] = 1/yc11;
		theta[12] = mu1*2;
		theta[13] = mu2*2;
		theta[14] = mu3*2;
		theta[15] = mu4*2;
		theta[16] = mu5*2;
		theta[17] = 1/ks1;
		theta[18] = ks2;
		theta[19] = 1/ks3;
		theta[20] = 1/ks4;
		theta[21] = 1/ks5;
		theta[22] = k1;
		theta[23] = k2;
		theta[24] = k3;
		
		x_hat = integrate_ode_bdf(cbf, x0, t0, ts, theta, y_r, y_i,1.0E-6, 1.0E-6, 1.0E8);
	}
}

However, this just works somehow ok when I constrain the parameters between 0 and 1, and use normal priors N~(0.5,0.3). (for some the parameters the sampler stills gets near to the bounds)

I read in the link that you sent me, that when one has some idea that a parameter lies within certain bounds, it could be better to drop out constraints, and instead use a distribution around those boundaries as a prior.

However, when trying such an approach, I had problems in some aspects. First, I am not able anymore to sample more than one chain. One chain could progress really quick, and the others simply do not make progress at all, and depending on the seed used in the stan command (is my guess) sometimes simply do not run any chain.

I hope that you might advice me in this matter.

Thanking you in advance,
Mauricio


#4

if some chains get stuck, I would advise you in this setting to decrease the max_num_steps argument from 1E8 to 1E3 or so. Those chains which are stuck make the ODE integrator work very hard and its better to reject these parameter sets and move on. This is ok as long as this only happens during warmup.


#5

I have been thinking about it for some days

It’s hard stuff! Welcome to the club!

OLS … So values beyond these limits are not allowed.
Because of the non identifiability that I faced using this approach

Eh, let’s just repeat the exercise and verify this with Stan. Stan’s diagnostics are usually easier to interpret than stuff you get out of OLS, though it takes some time to get used to them.

I did not get properly the idea of dividing the parameters that are extremely large by 100

Parameters are the little degrees of freedom with which the sampler explores your space. Transformed parameters are not – they just transformations of that. k3 is the parameter you expect to be on a scale of 0.02~ or something, which means that k3big (which is 100x larger than k3) will be around 2. You can do a similar transform for your very large parameters to bring them from being on a scale of like 1000 to a scale of like 1-10.

The sampler moves around on the parameter space, not the transformed space. The idea is you want to use scaling to make your parameters look as much like iid normals as possible, and you mostly don’t care what the transformed parameters look like. This sorta space is really easy for the sampler to explore cause steps in all directions are basically the same size. This is in comparison to a narrow valley, where an efficient sampler would want to take large steps along the narrow valley (of a negative log likelihood), but it would need to be careful to only move up and down the wall so much to avoid instabilities (which conceptually isn’t that bad but numerically is hard).

Stan tries to figure out this transformation for you, but it’s not a bad idea to help it, especially if you’re working with a hard model (ODEs are hard).

I read in the link that you sent me, that when one has some idea that a parameter lies within certain bounds, it could be better to drop out constraints, and instead use a distribution around those boundaries as a prior.

However, when trying such an approach, I had problems in some aspects. First, I am not able anymore to sample more than one chain.

Keep going down this route. First off, if there’s any way to shrink your model, start with that.

Also, with these sorts of dynamics models, start making posterior predictive plots and compare them with your data (this is the generated quantities stuff – look in the Stan manual – looking at the generated signals with and without noise is usually really informative). Make sure what you think is happening is actually happening. You gotta make sure your model is fitting your data in some reasonable way (it might not be!).

One chain could progress really quick, and the others simply do not make progress at all, and depending on the seed used in the stan command (is my guess) sometimes simply do not run any chain.

When Stan goes slow, it’s time to investigate the model and figure out the exact problem. To do this, usually I just starting by running single chains for short numbers of samples. You can’t believe the results of an inference until you run a bunch of chains a long time, but if you’re just searching for problems, start with the easiest thing. Once small numbers of samples seem to work, go to large numbers. Once single chains with lots of samples work, go to multiple chains (checking and fixing stuff along the way).

Your friends here are the Stan pairplots and ShinyStan. ShinyStan is really easy (basically boils down to launch_shinystan(fit)) and it plots tons of diagnostics that might give you clues as to what is going on. Look for divergences and non-identifiabilities. If those aren’t obvious, look at the traceplots and make sure the sampler is exploring rapidly, and that the sampler isn’t reaching its max treedepth (this is associated with how many timesteps the sampler goes in HMC looking for a new sample – search around for more details, higher is worse, 10 is the default max and if you’re stuck there it’s bad).

Sorry if some of that seems confusing. May have typed too much haha. Hope it helps!


#6

Thank you very much for all your advice! I think that now it worked:

What I did is to re-scale the data dividing by its maximum in each time series, and then de-scaling parameters according to that transformation. I just checked that the transformation fits the original data very well.

Any comments if what I did is ok?


#7

The problem is that Stan only uses a diagonal mass matrix by default, so it only adapts the scales of parameters, not the scale plus rotation. And it’s a global adaptation, whereas a lot of hierarchical models have curvature that varies by position in the posterior.

The reason unit scaling helps is that Stan initializes the adaptation for each parameter’s scale at the unit scale. So if it’s already there, adaptation doesn’t have to figure it out.

RStanArm does what you suggest on the inside—scaling data and priors and then rescaling back for the answers.