Forced ODEs, a start for a case study?

Hi!

Attached is a small intro to forced ODEs in Stan. I am not sure if this falls into the category of a case study (no data is fit at all)… one would probably have to add the standard example of the Theophylline data set or something similar, not sure.

In any case, this is the type of info which I see users on the mailinglist ask from time to time. Therefore, we should consider to put it up somewhere if people agree that this document is helpful. Suggestions and improvements are certainly very welcome.

Best,
Sebastian

@charlesm93: This code demonstrates essentially what I do when I solve what you called a “mixed” situation.

ode_forcings.stan (2.1 KB)
ode_forcings_simple.stan (1.2 KB)
ode_forcings.R (5.5 KB)
ode_forcings.pdf (208.4 KB)

3 Likes

This would make a fine case study with or without fitting a real data set. Real data’s always nice to see, of course.

What we’re doing is moving some of these more complicated domain-specific calculations into case studies rather than trying to pack everything into the manual.

I’ve been thinking a lot about where we want to fall on the arXiv to program documentation to journal scale with these. We could open it up and let people manage their own submissions arXiv-style, or we could start refereeing them closely like a journal, or we could just vet them like any other doc pull request as we’re doing now.

1 Like

@wds15: This is great, and very much in line with what I had in mind for the “mixed solver”.

I put together a prototype, based on a simplified Friberg-Karlsson model (PK: one compartment with absorption from the gut, PD: feedback mechanism), and observed a twofold speedup. See attached files: these aren’t minimal working examples yet. The forcing function is the solution to the one compartment model, which I had in Torsten’s C++ library and I exposed to Stan for this test.

I want to try it out with a two compartment forcing function (i.e a more sophisticated forcing function).

Why do you pass the parameters for the forcing function in x_r? Shouldn’t you use parameters, if the forcing function depends on parameters?

fOneCpt_mixed.stan (5.3 KB)
fOneCpt.stan (4.6 KB)

Hi!

Forcing functions are usually understood as externally driving the system. I did interpret this here such that this is fixed by data, but you are right in that it does not have to be like that.

For PK/PD with an analytic PK part this means that PK parameters are passed in via x_r if you do a two-step approach while if you do a joint PK/PD fit, then, of course, the parameters for the PK model have to be passed into the ODE RHS via the theta argument. It really depends on what you do.

For PK forcing functions one can take advantage of many tricks if the PK parameters are fixed.

Sebastian

Hi,

Following up on the idea of writing a case study, I drafted a performance comparison between the mixed and the numerical solver. I wrote this as a sequel to Sebastian’s article. To recommend the implementation of a specialized mixed solver function in Torsten, I had to use a “difficult” problem, so I picked a full Friberg-Karlsson model.

MixedSolver.pdf (471.1 KB)

I am definitely looking for feedback. The footnotes are used for drafting comments. The code is on GitHub, under charlesm93/example-models, branch mixed-solver.

For a more formal article, I was thinking of starting with Sebastian’s two states example and running a comparison, before demonstrating applications to more sophisticated problems.

I also want a discussion of the theory. That this would work was not trivial to me, but I think I underestimate how expensive solving ODEs numerically can be – especially when we compute Jacobians on top of that.

@charlesm93, I haven’t dug into the paper deeply (yet), but a quick comment is that I believe the performance evaluation you’re using (time to certain number of effective samples) isn’t the right one. It’s an indirect way of measuring the difference between time for the numerical integrator vs the mixed integrator.

First thing I’d check is that both the numerical and mixed integrator results in the same output. (For testing, I’d check the gradients also.) Then I’d measure the direct difference in times between the two integrators. On first thought, I was going to suggest running with the same seed, but differences down to 1e-12 will have the draws differ even with the same seed.

I’ll post more thoughts after reading through in more detail.

@syclik Thank you for taking a look.

First thing I’d check is that both the numerical and mixed integrator results in the same output

I have tests to make sure the two models produce the same output (at multiple levels: results from the integrators, the evolution operators, the event handlers, and the full model). I also made sure they both simulated the real data when fixing the parameters at the true value. I can include these results in the article.

For testing, I’d check the gradients also.

I haven’t tested the Jacobians, though I ran a diagnostic test on which both models seemed to perform well. I’d be curious to see if we observe a difference when looking directly at the Jacobian.

Then I’d measure the direct difference in times between the two integrators.

Makes sense. I’ll work on it.

1 Like

Some comments:

Abstract: “We observe… an average speed up of 44% in these experiments.”

Introduction: You definitely should go into these scalings here, both in terms of solving the original system and the automatic differentiation. The quadratic scaling of the the autodiff (really, the quadratic growth of the elements in the Jacobian) is what makes expanding the system so costly.

Figure 1: Comics sans is never the answer.

Figure 2: Can you identify graphically the effect of the drug? Perhaps even a before/after with two figures, where you show the nominal feedback mechanism and then the effect of the drug.

Motivation Problem: “Circ” and “drug” should be \mathrm’d. Additionally, the equations could benefit from some judicious whitespace with “,” between some of the multiplicative terms.

Performance Analysis: Why not simulate the data directly in Stan? That would avoid any potential conflicts between the R package implementations and the model coded up in Stan.

Figure 3: The figure is fine, but readers will be tempted to look at the deviation and argue that neither model is getting a good fit not realizing that the box plots are for the variation in the mean and not representing the individual posteriors. Might be worth emphasizing this. Also, given the nontrivial parameter name lengths you might want to flip the graph 90 degrees.

Figure 4: The relatively large variation in performance is slightly worrying. Any idea why the runs are so variable?

I haven’t even looked, but until I get the word that Comic Sans is removed, I’m not going to!

I sort of like the idea of coding in two different languages for added security when they agree. It’s too easy to make the same mistake in both stages if you use the same language.

I find starting with a lower initial stepsize helps tame chain runtime variation.

1 Like

Introduction: You definitely should go into these scalings here, both in terms of solving the original system and the automatic differentiation. The quadratic scaling of the the autodiff (really, the quadratic growth of the elements in the Jacobian) is what makes expanding the system so costly.

Yes, yes, I now see what you mean! I definitely want to discuss the algorithm more thouroughly. Per Daniel’s recommendation, I’ve written some bare bones code with measurements of the CPU times required for each solver. I’m comparing the “dd” case, in which the init and the parameters are double, to the “vv” case, where init and parms are var and we compute the Jacobian. This should give us an “empirical” sense of where the gain in efficiency comes from.

Figure 2: Can you identify graphically the effect of the drug? Perhaps even a before/after with two figures, where you show the nominal feedback mechanism and then the effect of the drug.

Certainly. I can show how the drug causes a drop in neutrophil count and how the feedback mechanism brings it back up, once the body clears the drug. You even see a “spring” effect, where the feedback mechanism overcompensates before moving back to baseline.

Performance Analysis: Why not simulate the data directly in Stan? That would avoid any potential conflicts between the R package implementations and the model coded up in Stan.

To echo Bob, this avoids replicating mistakes. For testing purposes, it is easier to use a third party. If the two solvers disagree, but one agrees with a third program, this gives a clue as to where an error may lie (and it did actually help me debug a mistake).

Figure 3: The figure is fine, but readers will be tempted…

Good point. I also think having a plot with some summary statistics would be helpful. Bill G suggested plotting ratios. For the parameter names, I could probably use greek letters (instead of sigma, etc.)

Figure 4: The relatively large variation in performance is slightly worrying. Any idea why the runs are so variable?

I’m not sure, but I’d guess it is because I’ve only used 200 iterations. Even though the initial estimates are good and the priors informative, I think the chain can still wander off and land in pathological regions of the posterior. Because there are so few iterations, one such “expedition” would be enough to make a run relatively slow. To know for sure, I’d need to check the output of slower and faster runs. It may make sense to run the test with more iterations.

I haven’t even looked, but until I get the word that Comic Sans is removed, I’m not going to!

Ok, I’ll change to luminari.

A couple of questions:
(1) Which is a better performance metrics: CPU or Wall-clock time?

(2) When testing the jacobian, I want to use test_ode_finite_diff_vv in util.hpp. The function only allows me to pass an absolute error (diff2), but a relative error would make more sense, especially because the drug amount and the neutrophil count have different orders of magnitude. I can create a function just for this case study, but I think we should change the function in the core language too.

That’s all useful, but I meant literally go through the math of counting the size of the ODE system induced by the autodiff. You start with N states then go to N * K states – for every parameter you have to add N states to the ODE system. Solving for some of the original N states can reduce this burden significantly if N >> K. In other words, the efficacy of a mixed solver will depend on the size, N, and the number of parameters, K.

No, I meant literally in the figure. Highlight where the drug comes in so the reader can immediately see it. Because the figure is so busy it’s not trivial for the reader to immediately identify the effect of the drug, but if you had figure for the feedback mechanism without the drug (2a) and then with the drug (2b) then the reader could identify the difference by comparing the figures. Having two figures would also allow you to have more informative captions, the first describing the model and the second discussing the drug intervention and its intended effect on the model.

1 Like

Hi Bob!

How low do you set the initial step size to get the variation between runs away?

From what I observed it is super important to get the step size to be as large as possible (yet avoiding divergence trouble). One way to do that is to decrease adapt_delta to low values such as 0.7 … which comes at the risk of unstable HMC in some cases.

However, getting the tuning parameters right made a difference in terms of speed in the examples I ran across.

@charlesm93: I can only concur with Mikes comment about an inclusion of a “O-discussion”. I would expect that your 44% increase in speed is well in line with the reduction of the ODE system size. From what I found it is about the reduction in number of ODE equations - and not the cost of calculating the Jacobian due to AD. AD is amazingly fast which is possibly due to the fact that the ODE RHS expressions are usually very small and algebraically simple.

Sebastian

I’m talking about the initial stepsize here. It smooths out differences in warmup in my experience (while slowing them all downa bit).

Divergences are extreme failures to follow the Hamiltonian. Sampling problems can come up even if you don’t diverge all the way to our divergence warning, but just get a bit out of whack on the Hamiltonian.

That’s all useful, but I meant literally go through the math of counting the size of the ODE system induced by the autodiff.

In the case study, we need sensitivities for both the N initial states and the K parameters. The size of the coupled system is N * (N + K + 1).

However N + K + 1 is the same for the mix solver and the numerical integrator. In the mix solver, we need to pass the initial state for the forcing function in the theta array. The function which computes the forcing function requires those initial estimates. They will be parameters, because the initial states at an event are the solution computed at the previous event. As a result even though N is reduced, K increases by exactly the same number number. This means the gain in speed only depends on N, and not on K.

In the case study, they are N = 8 states in the full system, N = 5 states in the reduced system, and K = 9 parameters that get passed to the ODE (12 for the mix solver).

So the ratio between the two should be 5 / 8 = 0.625, which is not in full agreement with the ~44% observed speed-up (I need uncertainties around that number).

I would have actually expected something below 0.55. Even though the mix solver gains in speed by reducing the size of the system, it has to do extra work by (1) computing the analytical solution on top of calling the integrator – ok, not that much work – and (2) computing the analytical solution at each step of the integrator (i.e. dy_dt is harder to compute).

Also, isn’t calculating the sensitivities Stan specific? The scaling of the coupled system to calculate the ODE sensitivity suggests the mix solver is particularly well-suited for Stan, but less so for other softwares.

For speed, sometimes it’s worthwhile to spend more effort to make smarter steps—I don’t know if that’s what’s going on now or not.

Calculating the sensitivities for external use is not just Stan specific. Stan uses it internally in order to do HMC, but there are many applications such as estimating standard errors for max (maringal) likelihood or using them for engineering design. There’s a huge autodiff literature on calculating sensitivities of ODE solutions.

1 Like

If there is a single dose, we can optimize the model, by passing the initial states as data. This scheme would not work with multiple doses. Maybe we can use Sebastian’s trick of computing doses as tight gaussians?

Still, the ratio of the coupled states between the mix solver and the full integrator will always be 5 / 8. This in fact holds whether we compute the sensitivities or not, because the total number of parameters (initial states + parameters) is the same for both methods. The speedup does not depend quadratically on N, but linearly.

Edit: This doesn’t hold if the forcing function is only data dependent. In that case, I do not need to pass the initial states of the forcing function in theta. I would have K = 5 (PD) parameters. For the mix solver, the number of ODEs would be 5 * (5 + 5 + 1) = 55, and for the full integrator 8 * (8 + 5 + 1) = 112, i.e. a much greater gain in performance.

Also, I realize I shouldn’t compare (1 - 5/8 = 37.5%) to ~44%, because as pointed by Daniel, this is an indirect metric. I’ll finish the test measuring the CPU time of the solvers. I don’t expect a perfect match – I think more than the number of coupled ODEs comes into play.

Hi!

These performance O metrics are always black magic; at least to me who has not studied informatics to properly calculate them. However, the order of speed gain should scale with the total ODE system size. But as you noted it is no exact science here which can have many causes in this particular case. For example, when you go to the mixed case, the AD gets more costly as it is more deep and needs to work through more terms.

The trick to use Gaussians as dosing events is quite cool in my view and from what I have seen it works like a charm. All you need to do is to force the ODE integrator to “see” the dose-peak by forcing the integrator to output a value at that point. The gain from the reduced sensitivity system should be substantial, but we will pay with a bit more stress on the ODE integrator and additional terms in the AD stack due to the extra time-points where we need to stop.

For a serious multiple dosing situation I would probably code things differently. Often we have an early full PK profile over 12h, then only a handful of Ctrough values (PK concentration just before a dosing event such that this is steady-state) and then maybe one more full profile. In such a situation as of now I would suggest to use such a Gaussian peak before the individual PK profiles and in between I would just put a continuous infusion which mimics the rate of dosing.

Cheers,
Sebastian

Cormen, Leisersohn and Rivest’s algorithms book is fantastic if you want to get the basic idea and core data structures.

The big-O notation ignores additive and multiplicative constants, so it’s not usually what we care about for practical algorithms. We actually care about both of those, especially if they’re big.

No kidding. I finished writing the “CPU” tests and the results are in direct disagreement with with the “full model” tests. The num solver performs faster than the mix solver. And by an order of magnitude in the vv case (inits and parms are stan::math::var)!

I tried this for the FK model and a simpler two ODE system, and got the same result. I’m, admittedly, very surprised. Maybe I did not construct the CPU tests properly? See files:
https://github.com/charlesm93/math/blob/issue-31-mixed_solver/test/unit/math/torsten/mixSolver/mixSolver_test.cpp
https://github.com/charlesm93/math/blob/issue-31-mixed_solver/test/unit/math/torsten/mixSolver/mixSolver_simple_test.cpp

The files output a csv file which can be analyzed with the attached R script.mixSolver.R (1.7 KB)

Cormen, Leisersohn and Rivest’s algorithms book is fantastic if you want to get the basic idea and core data structures.

Sounds like a much needed read. I’ll get my hands on it!