Fitting ODE models: best/efficient practices?

What @Funko_Unko does here is to add more and more data to the problem as warmup proceeds, as I understand it. That is rather inconvenient for a practitioner to set up… but how about we raise the likelihood of the model to a power (i.e., scale the log-lik by a factor) which we move from close to 0 (essentially no data contribution, only the prior) up to 1 throughout the warmup phase? Has this been tried?


That does sound like a convenient way to code this. In most models we can probably set up the prior to sample pretty efficiently and then just slowly turn on the likelihood. That avoids awkward things like models where more data leads to more parameters.

This is the last place I saw the raise-the-distribution-to-a-power trick in use, also for hard-to-sample problems in Stan: [2009.00471] Adaptive Path Sampling in Metastable Posterior Distributions

True that. This is simple. In case we want to use this in Stan at some point, we could introduce two target statements:

target += for the data log-lik and
prior_target += for the prior part

or something like that. Before this kicks off a flame war on me… this is not meant to be implemented anytime soon… but I am making the point that it may actually be possible to implement this in a simple way in a Stan program with full generality.
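Purely as an illustration of what I mean (to be clear: prior_target is not existing Stan syntax, just the hypothetical form of this proposal), a model block could then read

model {
  // hypothetical: prior contributions tracked separately
  prior_target += normal_lpdf(theta | 0, 1);
  // data log-lik goes into the regular target, which warmup could then temper
  target += normal_lpdf(y | theta, 1);
}

with the sampler free to down-weight the target part relative to prior_target during warmup.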

The raise-to-some-power business is the idea behind the power prior, a technique to discount historical data. I am not an expert in this technique, as I have trouble understanding the meaning of the scaling parameter, but in this warmup setting it may actually make sense, I find.

@Funko_Unko you see what’s happening here? Care to try implementing it for your model?

That’s what I thought at first, but whenever I extract the stepsize from a run consisting of only the first two warmup phases, the stepsize is always one, which confused me a lot. I wonder why that is.

Hm, if the main goal is to adapt the metric and approach the typical set, I guess the time dedicated to finding an appropriate stepsize for a posterior which won’t be sampled anyway is wasted. I believe it also led to rather small stepsizes, which is counter to what we appear to want, right? Start high and go low if necessary.

Yes, exactly. This, and pooling the warmup samples across chains, which actually appears to be a necessary step to improve performance.

You want to scale the part of the log-posterior density due to the data, to avoid the chains being trapped in metastable energy states? Should be as simple as adding another data variable which we increment as we restart our chains. BTW, is there a way to get info on which warmup phase we are currently in from within the Stan program?

A potential drawback would be that we do not save any computational work, as we still have to integrate over the whole time span. With the rescaling, we should have even less pressure to leave the bad neighbourhoods, shouldn’t we? Or not, I have to get the picture right in my head first…

Hm, wait, so what do we actually want? At sampling time, we want to have the “true” posterior? During warmup we want to interpolate between the prior and the posterior, with a fixed factor per window of the second warmup phase?

Correct. You write your model such that you accumulate the prior directly with

target += ...prior stuff goes here...

but you collect the data-driven log-lik terms into a variable:

real lp = 0;

lp += ...data log-lik terms go here...;

and then at the end you do

target += alpha * lp;

and alpha is varied from 0.1 all the way up to 1 during successive warmup runs; on the log scale, multiplying by alpha is the same as raising the likelihood to the power alpha. The final sampling phase is with alpha=1 (though I would put an if there: if alpha==1, just add lp directly and skip the multiplication).
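To make it concrete, here is a minimal sketch of the full pattern with a placeholder normal likelihood (alpha is passed in as data, so you can change it between warmup runs without recompiling):

data {
  int<lower=1> N;
  vector[N] y;
  real<lower=0, upper=1> alpha;   // tempering factor for this run
}
parameters {
  real mu;
  real<lower=0> sigma;
}
model {
  real lp;
  // prior goes straight into target
  target += normal_lpdf(mu | 0, 1);
  target += normal_lpdf(sigma | 0, 1);
  // collect the data-driven log-lik terms separately
  lp = normal_lpdf(y | mu, sigma);
  // temper the likelihood; alpha == 1 recovers the full posterior
  target += alpha * lp;
}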

Makes sense?


Sure. But during which phase?

The first phase may be the more difficult part, so this is where this approach might be most needed.

Anything is quickly implemented, but I’m not sure about the best strategy.

What do you think?

This technique is a warmup for the warmup, if you will. It should be done during the early parts of warmup.

I really don’t know what is best here - the topic is mega complex!

The approach should let the sampler first adapt to the prior (alpha=0) and then ramp up alpha quickly to get to where you actually want to be. The crux is that this technique will only really help you if the posterior geometry changes smoothly as alpha varies… don’t ask me what smooth means here or how I would define it.

(And I was actually just reading this and threw in the idea, as it made sense.) Maybe @avehtari or @Lu.Zhang can say more.

I like the idea of separating the prior target and the likelihood target. And also power posteriors; they could possibly help avoid getting stuck in a bad local mode during warmup.

I’m still not convinced that this will help much, but we can just compare regular/incremental/power warmup.

What are the best metrics for this? Besides visual inspection, which seems rather unreliable, and final Neff/s, which will probably take forever.

For the first phase you want to find the typical set, starting from some point in the tails of the posterior. So I guess something like a z-value? Distance from the posterior mean divided by a measure of the std? Edit: I guess something like the distance from the mean in the norm induced by the covariance matrix (a Mahalanobis distance) could work.

For the second phase you want to find the (co)variance matrix, so just the Frobenius norm of the difference from some reference matrix?
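Concretely, with $\hat\mu$ and $\hat\Sigma$ taken from a long reference run, I am thinking of something like

$$ d(\theta) = \sqrt{(\theta - \hat\mu)^\top \hat\Sigma^{-1} (\theta - \hat\mu)} $$

for the first phase, and

$$ \lVert \Sigma_{\text{warmup}} - \hat\Sigma \rVert_F $$

for the second phase, where $\Sigma_{\text{warmup}}$ is the (co)variance estimate produced by the warmup run being evaluated.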

Oh that was all I was curious about.

The warmup strategy you talked about was to fit an increasing subset of timepoints: fit 10%, then 20%, etc., and use each of those fits as an approximate posterior for the next in line.

And this seems similar. We can use pow(likelihood, alpha) as an approximation to the posterior, where alpha = 0.0 is a really bad approximation and alpha = 1.0 is the right thing.

So in the same way that the subsets of timepoints strategy works, does this work? Nothing more formal than that.

Well, what we are comparing right now in the eye-norm are wall times and Neff/s, which is fine.

I can generate more of these plots with the incremental warmup replaced by (or compared against) the power warmup, but if the power warmup is only intended for the first phase of the warmup, which should take up a rather short amount of time, then waiting for complete wall times or Neff/s seems rather inefficient.

Yeah that’s good.

I don’t really have any intentions for it now :D. Just curious if it’s able to solve the same problem.

~This is indeed so, is this intended behavior?~
I mean: Stan reports the step size after adaptation as 1, even though the last step size used is different.
@bbbales2

Ooo, that sounds like a bug. Can you make an issue over here: Issues · stan-dev/stan · GitHub ?

The way to dig into something like this is to figure out where the stepsize is being printed (via lots of grepping) and work back from there, if you wanna dig into C++-land.

Hm, I did my worst: "term_buffer = 0" leads to "# Step size = 1" in output csv · Issue #3023 · stan-dev/stan · GitHub


Hmmm, so I’ve been trying out my method on the Pleiades problem (via this answer by @jtimonen).

Thanks to the adjoint method, everything worked perfectly with tight priors (thanks @wds15).

Even with wide priors, everything worked fine until the interesting part of the dynamics came along, at which point everything ground to a halt. Meaning a warmup that took longer than a few minutes. Who has got time for that?

Anyhow, I’ve got an even better idea. Or maybe not, we’ll see. Stay tuned, or not.

Sample output for the Pleiades problem, up until I got impatient:

And (incremental) marginal plots, up until the same time. The marginals for the initial positions, velocities and measurement error tighten quite nicely, while the marginals for the stellar masses stay wide until (I think) the respective star interacts with another one, which is the only time we actually get the information we need. So everything works as expected, but slowly.

FYI @bbbales2, I’ve tried this “power warmup” and did not observe any benefit. Chains still got stuck. It might be possible to fiddle with the settings, but my heart’s not quite in it :)

The problem was again a simple 2x2 orbital mechanics problem. Admittedly with absurdly wide priors, but “my” warmup handled it with ease.

@robertgrant: FYI, my idealistic use case (inspiring the question here Bayes factors for a series of experiments using posteriors as priors - #10 by robertgrant) would be to

  • first fit an ODE to part of the data (say up to time t1),
  • save the posterior distribution of the (dynamic) states and (fixed) parameters at t1,
  • continue fitting the data from t1 to t2,
  • and integrate the states backwards to get the posterior at the original initial time t0.

However, simply saving the state as a multivariate normal distribution did not work (unsurprisingly, to be honest, but one can dream). It might be possible to save a better representation using your methods, but so far I have not looked into it.
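Just to make explicit what I tried (a sketch with purely illustrative names): the posterior from the fit up to t1 gets summarized by a mean vector and covariance matrix, which then enter the fit of the next window as a multivariate normal prior,

data {
  int<lower=1> K;               // number of states plus fixed parameters
  vector[K] mu_prev;            // posterior mean from the fit up to t1
  cov_matrix[K] Sigma_prev;     // posterior covariance from the fit up to t1
}
parameters {
  vector[K] theta;              // states at t1 and the fixed parameters
}
model {
  // multivariate normal approximation of the previous posterior as prior
  theta ~ multi_normal(mu_prev, Sigma_prev);
  // ...plus the likelihood of the data in (t1, t2], with the ODE started
  //    from the states contained in theta...
}

which apparently throws away too much of the posterior’s structure.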


I don’t have a method that will work reliably with 36 parameters yet, but maybe soon. However… the approximation to the density will cause drift over successive iterations and needs to be replaced with all-data analyses, and if you are actually dealing with some kind of ghostly data that’s only half there via alpha, then I don’t know how to do that waypointing.

Thanks for responding!

Oh right, I forgot to mention this:

The goal is actually mostly to warm up the all-data sampling more efficiently, so a slightly drifted approximation would be entirely sufficient, since a final correction will have to be performed with all of the data anyway.

I’ll watch this place (the forums); I presume you will advertise any solution with which you are satisfied? It’s not in any sense urgent, so I won’t deep-dive now, but I am curious.