Fitting ODE models: best/efficient practices?

I’m late to the party but I’ve enjoyed reading the discussion so far. Thanks for igniting the conversation, @Funko_Unko.

A few weeks ago, @bbbales2 told me about adding recommendations for tuning ODE integrators in the Stan manual. It feels awkward to write recommendations when there are still many unanswered questions – and in certain cases, we still need to work out what the right questions are. Still, there are some useful heuristics and proven strategies, as this thread demonstrates.

Understanding Stan’s underlying mechanics can help avoid pitfalls (does that count as best practices?). One example can be found in the recent case study on disease transmission; Section 5 resonates with some of the discussion happening here.

There’s also more work on switching ODEs during HMC phases, as teased during the presentation I gave last week and motivated by @wds15’s idea. And there are other things happening on other fronts.

2 Likes

Might be worth starting a new thread about the prototype adjoint solver. The linked-mass-flow example in the adjoint branch returns divergent transitions unless I crank adapt_delta up to 0.99 (which takes about twice as long to run). Also, it’s worth running things in parallel, so I use a command like

fit <- mod$sample(base_data, chains = 4, parallel_chains = 4,
                  adapt_delta = 0.99)

But then the chains are not mixing: $\hat{R} \gg 1$.

Maybe this conversation is already taking place somewhere else?

That’s integrators, isn’t it?

I hadn’t actually looked at the (most important) question of whether “the model” converges. If our first scalable example doesn’t, then that’s another reason for a principled test suite.

What’s the emerging “consensus” here? I recall reading (on here) something to the effect that data-driven rankings are difficult, but I can’t recall where exactly or what the argument was.

You should start the example with init=0; then it worked, at least for me.

I have already run SBC over this thing and it came out just fine.

I plan to announce the prototype in a separate post, which will hopefully catch even more users’ attention. We should aim to close the testing phase this month. That leaves enough time before the mid-May release of the next Stan version. This sounds like plenty of time to get this thing straight.

1 Like

Agreed on a separate post.

If any of you have more examples, feel free to push the model & data to the ode_adjoint cmdstan branch.

I have found the following things reasonable:

  • Using a dense metric appears to significantly improve every metric of interest. I guess that is because correlations between variables frequently arise in the posterior?
  • Playing with the tolerances may lead to catastrophe or to salvation.
  • The adjoint solver (from the “Adjoint ODE Prototype - RFC - Please Test” thread) appears to become competitive very quickly.

Furthermore, for the way I set up my toy models, the following worked out exceedingly well, significantly reducing wall time and between-chain runtime differences and increasing ESS:

  • Instead of a single run using all of the data, warm the model up in chunks. That is, take 1, 2, 4, 8, 16, … time steps, and for each run perform a short warmup phase with adapt_init_phase=25, adapt_metric_window=100, adapt_step_size=0, iter_warmup=125, initialized with a metric computed from the (relevant) warmup samples of all chains from the previous run. For the final run, first do the above, compute the metric, and then do the sampling, preceded by the standard window to adapt the step size. (A rough sketch of the driver loop follows below.)
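
For concreteness, here is a rough sketch of how such a driver loop could look with CmdStanPy. The Stan file name, the data layout, the subset_data helper and the way the warmup draws are pooled are placeholder assumptions for illustration, not the exact code used above:

# Hypothetical sketch of the incremental warmup loop described above (CmdStanPy).
# File names, data layout and the metric-pooling details are assumptions.
import numpy as np
from cmdstanpy import CmdStanModel

model = CmdStanModel(stan_file="ode_model.stan")   # assumed model file
chains = 4

# assumed data layout: observations y at T time points ts (placeholder values)
full_data = {"T": 16, "ts": list(range(1, 17)), "y": [0.0] * 16}

def subset_data(data, n):
    # restrict the data to the first n observation times
    return {"T": n, "ts": data["ts"][:n], "y": data["y"][:n]}

inv_metric = None
schedule = [1, 2, 4, 8, 16]   # observations per chunk, doubling each run

for n in schedule:
    final = (n == schedule[-1])
    fit = model.sample(
        data=subset_data(full_data, n),
        chains=chains,
        inits=0,                              # init=0, as suggested earlier in the thread
        iter_warmup=125,
        iter_sampling=1000 if final else 0,   # only the last run actually samples
        save_warmup=True,
        adapt_init_phase=25,
        adapt_metric_window=100,
        adapt_step_size=50 if final else 0,   # no terminal step-size window until the final run
        # dense metric; carried over from the previous run once available
        metric=({"inv_metric": inv_metric} if inv_metric is not None else "dense_e"),
    )
    # Pool the warmup draws across all chains and estimate a dense inverse metric.
    # Caveat: this uses the constrained draws and all non-diagnostic output columns;
    # for models with constrained parameters, transformed parameters or generated
    # quantities, the draws would first have to be mapped to the unconstrained space.
    draws = fit.draws_pd(inc_warmup=True)
    cols = [c for c in draws.columns if not c.endswith("__")]
    inv_metric = np.cov(draws[cols].to_numpy(), rowvar=False).tolist()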

Is there anything subtly or not so subtly wrong with the above?

2 Likes

The need for a dense metric is to me a hint of a bad parametrization, as it suggests that the posterior is quite correlated, which can be avoided by reparametrizing (not always possible, nor do I have the time to do that for every problem).

Can you explain your warmup strategy a bit more? What do you mean by time steps here? Can things run in parallel in what you describe?

Learning from short warmup runs and then expanding on that can make sense, but this is actually also what the warmup procedure itself does (which is not necessarily the best in the world, but certainly a very robust procedure… though we all wish for it to run faster, which is hard).

1 Like

Yes, inspecting the posterior in pair plots does reveal some pretty tight correlations, but the dense metric should be able to take care of that, shouldn’t it? For the predator-prey model I used the standard parametrization, but maybe there is a better one? Then again, all of the parameters have a distinct meaning (how good the prey is at reproducing, how quickly it gets eaten by predators, how quickly the predators die and how good they are at hunting). I guess with the model being a Poisson system, you could transform it to a canonical symplectic system and see what that tells you about the parameters.
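
For reference, the standard Lotka-Volterra parametrization referred to here, in the usual notation (prey $u$, predators $v$):

$$\frac{du}{dt} = \alpha u - \beta u v, \qquad \frac{dv}{dt} = -\gamma v + \delta u v$$

with $\alpha$ the prey reproduction rate, $\beta$ the rate at which prey get eaten, $\gamma$ the predator death rate, and $\delta$ the predators’ hunting efficiency.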

Say you have observations at 2^N linearly spaced time points. Then you construct your ith metric by performing a single fast-then-slow warmup iteration using the posterior given the first 2^i observations, where you initialize your metric with the covariance matrix computed from all samples from the previous iteration and use the final points of the respective chains as your initial points. It looks like this converges more quickly towards the “true” metric, as can be checked (I believe) by computing the covariance matrix based on (potential) actual samples from the ith posterior. This seems to lead to higher step sizes, fewer leapfrog iterations and higher ESS than relying on the metrics from single chains. Not sure whether this is clear?
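
In code, the metric construction is essentially just a pooled covariance estimate. A minimal sketch, with a random placeholder array standing in for the relevant warmup window of the previous run:

import numpy as np

# placeholder for the warmup draws of the previous run, shape (iterations, chains, parameters)
warmup_draws = np.random.randn(100, 4, 6)

pooled = warmup_draws.reshape(-1, warmup_draws.shape[-1])    # stack all chains together
inv_metric = np.cov(pooled, rowvar=False)                    # dense metric for the next run

# versus what single-chain adaptation would effectively use:
per_chain = [np.cov(warmup_draws[:, k, :], rowvar=False) for k in range(warmup_draws.shape[1])]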

It is very similar to the standard warmup procedure, in that the work is expected to grow geometrically, and I would expect the metrics to converge in a similar fashion towards the final metric. The typical sets of the posteriors should similarly converge towards the final one. However, it appears like this gets rid of much of the variation in the runtime of the different chains.

I have a sample run here, might be of interest to @wds15 et al.

The problem is a 2x2 orbital mechanics problem with unknown measurement error, masses and initial conditions, with only the position being measured. In the figure, green is the incremental warmup and black is the regular one up to the relevant time. For the wall time, x represents total time (warmup + sampling) and _ just warmup, everything per chain. Any run that took more than 100 seconds was canceled. You can see the true and predicted trajectories and measurements at the bottom.

Seems like a good strategy. I would like it if things like this were easier to program. What is the adapt_step_size = 0 for?

Yeah, I did everything via CmdStanPy, which was kind of awkward (but great in the end, thank you dear developers, @mitzimorris and @ahartikainen I guess?), but certainly 100% better than if I had tried to abuse the C++ code.

That’s the window for the step-size adaptation before we start to sample, but we do not want to start sampling until the very last run, which includes all measurements. Once we decide to sample, we do use the regular adapt_step_size window beforehand.

Edit: I’m actually not 100% sure whether it may be better to include the last window, and then use that step size for the next iteration. Preliminary tests looked like this was worse, but they were really not exhaustive.

Edit 2: I believe the point of the last warmup window is to adapt the step size such that we get the aimed-for acceptance rate? But this doesn’t appear to be relevant for the warmup phase? It looks like for the second warmup phase, the step size is always just one?

Oooh, makes sense.

All through warmup the step size is being adapted, so I’d expect the refined step size would be unnecessary if more warmup were to follow.

It’s surprising that it worked worse in this case.

There are a few interesting posts on Stan’s step-size adaptation over here from @monnahc.

What @Funko_Unko does here is to add more and more data to the problem as warmup proceeds, as I understand it. That is rather inconvenient for a practitioner to do… but how about we raise the likelihood of the model to a power (i.e. scale the log-likelihood), which we move from close to 0 (essentially no data contribution, just the prior) up to 1 throughout the warmup phase? Has this been tried?

1 Like

That does sound like a convenient way to code this. In most models we can probably set up the prior to sample pretty efficiently and then just slowly turn on the likelihood. That avoids awkward things like models where more data leads to more parameters.

This is the last place I saw the raise-the-distribution-to-a-power trick in use, which was also for hard-to-sample problems in Stan: [2009.00471] Adaptive Path Sampling in Metastable Posterior Distributions

True that. This is simple. In case we want to use this in Stan at some point we could introduce two target statements.

target += for the data log-likelihood and
prior_target += for the prior part

or something like that. Before this kicks off a flame war on me… this is not meant to be implemented anytime soon… but I am making the point that this may actually be possible to implement in a simple way in a Stan program with full generality.

The raise-to-some-power business is the idea behind the power prior, a technique to discount historical data. I am not an expert in this technique, as I have trouble understanding the meaning of the scaling parameter, but in this warmup setting it may actually make sense, I find.
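
For reference, the power prior is roughly the following construction, with historical data $y_0$, initial prior $\pi_0$ and scaling parameter $a_0 \in [0, 1]$:

$$\pi(\theta \mid y_0, a_0) \propto L(\theta \mid y_0)^{a_0}\, \pi_0(\theta)$$

so $a_0 = 0$ discards the historical data entirely and $a_0 = 1$ gives it full weight; in the warmup schedule discussed here, the current data plays the role of $y_0$ and the exponent is ramped up towards 1.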

@Funko_Unko you see what’s happening here? Care to try implementing it for your model?

That’s what I thought at first, but whenever I extract the step size from a run consisting only of the first two warmup phases, the step size is always one, which confused me a lot. Wonder why that is.

Hm, if the main goal is to adapt the metric and approach the typical set, I guess the time dedicated to finding an appropriate step size for a posterior which won’t be sampled anyway is wasted. I believe it also led to rather small step sizes, which is counter to what we appear to want, right? Start high and go low if necessary.

Yes, exactly. This, and pooling the warmup samples across chains, which actually appears to be a necessary step to improve performance.

You want to scale the part of the log-posterior density due to the data to avoid the chains being trapped in metastable energy states? Should be as simple as adding another parameter which we increment as we restart our chains. BTW, is there a way to get info on which warmup phase we are currently in from within the Stan program?

A potential drawback would be that we do not save any computational work, as we still have to integrate over the whole time span. With the rescaling, we should have even less pressure to leave the bad neighbourhoods, shouldn’t we? Or not, I have to get the picture right in my head first…

Hm, wait, so what do we actually want? At sampling time, we want to have the “true” posterior? During warmup we want to interpolate between the prior and the posterior, with a fixed factor per window of the second warmup phase?

Correct. You write your model such that you accumulate the prior directly with

target += ...prior stuff goes here...

but you collect the data-driven log-likelihood terms into a variable:

real lp = 0;

lp += data-log-lik things;

then you do at the end

target += alpha * lp;

and alpha is varied from 0.1 all the way up to 1 during successive warmup runs. The final sampling phase is with alpha=1 (though I would put an if there: if alpha==1, just add lp directly and drop the multiplication).
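
For instance, with alpha passed in as data, the successive runs could be driven much like the incremental-warmup sketch earlier in the thread, just ramping alpha instead of the number of observations. A minimal CmdStanPy sketch (file name, schedule and warmup lengths are assumptions):

# Hypothetical driver: ramp alpha over successive short warmup runs.
# Assumes the Stan program declares `real<lower=0, upper=1> alpha;` in its data block
# and scales the data log-likelihood by alpha as described above.
from cmdstanpy import CmdStanModel

model = CmdStanModel(stan_file="tempered_model.stan")   # assumed file name
base_data = {}                                          # placeholder for the model's usual data

for alpha in [0.1, 0.25, 0.5, 0.75, 1.0]:
    final = (alpha == 1.0)
    fit = model.sample(
        data={**base_data, "alpha": alpha},
        chains=4,
        iter_warmup=200,
        iter_sampling=1000 if final else 0,   # sample only once alpha has reached 1
        save_warmup=True,
        # the adapted metric from the previous run could be carried over here,
        # exactly as in the incremental-warmup sketch above (omitted for brevity)
    )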

Makes sense?

1 Like

Sure. But during which phase?

The first phase may be the more difficult part, so this is where this approach might be most needed.

Anything is quickly implemented, but I’m not sure on the best strategy.

What do you think?