Hi everyone,
This is my first post here on the forum so please let me know if I am missing something or you want me to be more explicit about my problem.
I have used Rstan now for a couple of months now and really enjoy it but I have run into some problems with my ODE model. I am modelling the labelled fraction of two population and how that fraction changes over time.
Now, when I try to infer the posteriors of the parameter I run into either one of the two following problems. Around 1/5 to 1/4 of the post burn-in transitions are divergent or one of the chains gets “trapped” in an area of low probability and all transitions are divergent.
Due to biological constraints I have imposed two contraints such that 0 <= pn-tr_delta <= 1 and 0 <= 2^k * tr_delta*R + ps <= 1 . I have been reading on your old mailinglist and here and I guess that these constraints might be part of the problem.
Since I am quite new to modelling in gerneral and Stan in particular I was wondering if you could help me and point me in the right direction to make my model work.
I have been reading on your old mailinglist and here and I guess that these constraints might be part of the problem.
If you’re bumping up against the edges of these constraints, then they could be a problem. But they might not be.
Have you made pairplots for this? Since you have 8 parameters, this is probly the easiest way to go looking for things. Are the divergent transitions consistent with each other in some way?
Looking at the sampled values for the divergent transitions and calculating the respective the transformed parameters it seems like that they are not bumping against these edges.
I have created some pairplots but I can’t really see a strong pattern for the divergent transitions. The only thing which stands out to me is that low values of k seem to coincide with divergence.
BTW the inference on this model is quite slow and thus debugging is a bit slow. Is it possible that all the for loops to account for missing data points slow it down?
I have have set adapt_delta already to 0.99, what would you suggest next?
For loops can slow things down, but it’s probly not the biggest problem here. And since upping adapt_delta didn’t help, just leave it at its default value for now.
I just looked at the ODE function. This bit label = t > tau. Tau is a parameter here, right?
Stan requires its likelihoods to have first derivatives everywhere. Conditionals like this can create likelihoods that are continuous but not differentiable. The sampler can get stuck in the little kink if not.
So the question is, is the function label(tau) continuous and first differentiable?
If not, is there a way to smooth this out? Sometimes people replace hard switches like if statements with smooth switches (sigmoids and such). Or is there an easy way to get rid of this switching part of the model and see how it behaves?
Oh, whoops, I saw it come in in the ode theta array and assumed it was a parameter. Well scratch that.
Next thing I’d do is plot ODE trajectories of the divergent vs. non-divergent things and look for any weirdnesses.
I’d be curious about making the tolerances tighter. Also, 1e8 seems like a large number of steps for an ODE solver. How long does this ODE run? Are you actually using all 1e8 of them?
An if statement looks like a step function, a smooth one would look like a sigmoid. I just called it a smooth switch – not a technical thing.
I have played around a bit the tolerances a I get “weird” results. Using the rk45 solver the divergent transition reduce by increasing the tolerance to e-10 but beyond that divergence increases again.When I change to the stiff solver at a e-10 all post-warmup transitions are become divergent.
I also looked at the trajectories of the divergent transitions for the two species overlayed with my simulated data. There nothing obvious going here either.
Looking at the pairs plot again I have a feeling that although more or less random low k values increases divergence. Do you think that could be some sort of numerical issue since it is 2^k in the model.
Looking at the trajectories you plotted though, this sure doesn’t seem like the kind of ODE that needs 1e8 steps to resolve.
If you make the max_num_steps lower does anything change?
The next thing I’d try is rescaling the parameters. It’s tough on the sampler if R is about 100 and pn is 4 orders of magnitude smaller. Any chance you can reparameterize that and try that real quick?
Like define:
parameters {
real pn1000;
}
transformed parameters {
real pn = pn / 1000.0;
}
It’s easiest on the warmup if everything on the same scale around zero. Part of what warmup does is figure out this transformation, but since you have 8 parameters maybe do this manually and see if it helps.
Also as a side comment, it’d be slightly more efficient if you moved:
theta[9]=delta;
theta[10]=beta;
theta[11]=tau;
Into the x_r data parameters. The ode solver will do the forward sensitivity analysis for everything in theta, even if it doesn’t need the sensitivities for the log density gradients. Things in x_r are not autodiffed.
That reduced the divergent transition from 49 to 41 for the 250 post-warmup transition. Does this scaling make sense in terms of making things easier for the sampler?
I was thinking about this again. It might be best to pop out the ODE solve and just look at the timestepping of a solver outside Stan. I’ve used deSolve before and liked it. It’s not too bad.
It just seems like there has to be something unusual happening here.
I am using 250 warmup iteration and 250 sampling iteration while testing. I have increased it once which did not seem to change the amount of divergent transition.
I have palyed around with desolve and their ode45 method which is the most similiar to the solver integrated with stan. Changing max steps between 1e5 and 1e10 seems to have no marked effect on the results.
I also varied the absolute and relative tolerance between 1e-2 and 1e-14 in steps of 1e-2. What you can see it that the results a markedly different between 1e-2/1e-4(blue and red(appears red because of the overlay)) and the smaller tolerance results.
Out of the curiosity, have you checked the parallel coordinates (parcoord) plot? It would be interesting to see it.
I’m in no way expert on odes, but instead of uniform priors what of you use distribution defined on the interval 0,1 e.g. beta(1.5,1.5) or beta(2,2)?