The need for a dense metric is, to me, a sign of bad parametrization, as it suggests that the posterior is quite correlated, which could be avoided by reparametrizing (not always possible, nor do I have the time to do that for every problem).
Can you explain your warmup strategy a bit more? What do you mean by timesteps here? And can things run in parallel, from what you describe?
Learning with short runs during warmup and then expanding on that can make sense; but this is actually also done by the warmup procedure itself (which is not necessarily the best in the world, but certainly a very robust procedure… though we all wish for it to run faster, which is hard).
Yes, inspecting the posterior in pair plots does reveal some pretty tight correlations, but the dense metric should be able to take care of it, shouldn't it? For the predator-prey model I used the standard parametrization, but maybe there is a better one? Then again, all of the parameters have some distinct significance (how good is the prey at reproducing, how quickly does it get eaten by predators, how quickly do the predators die and how good are they at hunting). I guess with the model being a Poisson system, you could transform it to a canonical symplectic system and see what that tells you about the parameters.
Say you have observations at 2^N linearly spaced time points. Then you construct your i-th metric by performing a single fast-then-slow warmup iteration on the posterior due to the first 2^i observations, where you initialize the metric with the covariance matrix computed from all samples of the previous iteration and use the final points of the respective chains as your initial points. It looks like this converges more quickly towards the "true" metric, as can be checked (I believe) by computing the covariance matrix based on (potential) actual samples from the i-th posterior. This seems to lead to larger step sizes, fewer leapfrog iterations and higher ESS than relying on the metrics from single chains. Not sure whether this is clear?
It is very similar to the standard warmup procedure, in that the work is expected to grow geometrically, and I would expect the metrics to converge in a similar fashion towards the final metric. The typical sets of the posteriors should similarly converge towards the final one. However, it appears like this gets rid of much of the variation in the runtime of the different chains.
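Roughly, the driver loop looks like the following sketch (CmdStanPy assumed; `orbital.stan`, `make_data()` and `param_cols` are placeholders for the problem-specific bits, and strictly the covariance should be computed on the unconstrained scale):

```python
# Sketch of the incremental warmup loop.  orbital.stan, make_data() and
# param_cols are hypothetical placeholders for the concrete problem.
import numpy as np
from cmdstanpy import CmdStanModel

model = CmdStanModel(stan_file="orbital.stan")
param_cols = ["m[1]", "m[2]", "sigma"]        # parameter columns (placeholder)
chains = 4

inv_metric, inits = None, None
n_levels = 7                                  # observations at 2^7 time points
for i in range(1, n_levels + 1):
    data = make_data(n_obs=2 ** i)            # first 2^i observations only
    fit = model.sample(
        data=data,
        chains=chains,
        iter_warmup=200,                      # one short fast-then-slow pass
        iter_sampling=50,                     # a few draws to pool for the next metric
        metric={"inv_metric": inv_metric} if inv_metric is not None else "dense_e",
        inits=inits,
        show_progress=False,
    )
    # Pool the draws of all chains; their covariance becomes the next dense
    # inverse metric (CmdStanPy infers dense vs. diagonal from its shape).
    draws = fit.draws_pd()[param_cols].to_numpy()
    inv_metric = np.cov(draws, rowvar=False).tolist()
    # Start the next run from draws of this one: here simply the last few
    # pooled draws, one per chain (ideally the final point of each chain).
    sv = fit.stan_variables()                 # name -> array over pooled draws
    n = next(iter(sv.values())).shape[0]
    inits = [{k: np.asarray(v[j]).tolist() for k, v in sv.items()}
             for j in range(n - chains, n)]
# The last fit (all observations) is the one whose warmup actually matters;
# a full-length sampling run would follow with this metric and these inits.
```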
I have a sample run here, might be of interest to @wds15 et al.
The problem is a 2x2 orbital mechanics problem with unknown measurement error, masses and initial conditions, with only the positions being measured. In the figure, green is the incremental warmup, black the regular one up to the relevant time. For the wall time, x represents total time (warmup + sampling) and _ just warmup, everything per chain. Any run that took more than 100 seconds was canceled. You can see the true and predicted trajectories and measurements at the bottom.
Yeah, I did everything via CmdStanPy; it was kind of awkward (but great in the end, thank you dear developers, @mitzimorris and @ahartikainen I guess?), but certainly 100% better than if I had tried to abuse the C++ code.
That's the window for the step size adaptation before we start to sample, but we do not want to start sampling until the very last run, which includes all measurements. Once we decide to sample, we do use the regular adapt_step_size window beforehand.
Edit: I'm actually not 100% sure whether it may be better to include the last window, and then use that step size for the next iteration. Preliminary tests looked like this was worse, but they were really not exhaustive.
Edit 2: I believe the point of the last warmup window is to adapt the step size such that we hit the target acceptance rate? But this doesn't appear to be relevant for the warmup phase? It looks like for the second warmup phase, the step size is always just one?
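For reference, something like this is what I mean: if I read CmdStanPy's adaptation arguments right, the terminal step-size window (CmdStan's term_buffer) can be dropped for the intermediate runs and only switched back on for the final, full-data run (`model.stan` and `data.json` are just placeholders):

```python
# Sketch: drop the terminal step-size window for intermediate runs and keep
# it only for the final run we actually sample from.  Assumes CmdStanPy's
# adapt_* arguments; model and data files are placeholders.
from cmdstanpy import CmdStanModel

model = CmdStanModel(stan_file="model.stan")
data = "data.json"

fit_intermediate = model.sample(
    data=data,
    iter_warmup=200,
    iter_sampling=50,
    adapt_init_phase=25,      # fast phase (init_buffer)
    adapt_metric_window=50,   # slow metric-adaptation windows
    adapt_step_size=0,        # no terminal step-size window (term_buffer = 0)
)

fit_final = model.sample(
    data=data,
    iter_warmup=200,
    iter_sampling=1000,
    adapt_step_size=50,       # regular terminal window before actual sampling
)
```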
What @Funko_Unko does here is to add more and more data to the problem as warmup proceeds, as I understand it. That is rather inconvenient for the practitioner to do… but how about we raise the likelihood of the model to a power (i.e. scale the log-lik) which we move from close to 0 (essentially no data contribution, just the prior) up to 1 throughout the warmup phase? Has this been tried?
That does sound like a convenient way to code this. In most models we can probably set up the prior to sample pretty efficiently and then just slowly turn on the likelihood. That avoids awkward things like models where more data leads to more parameters.
True that. This is simple. In case we want to use this in Stan at some point we could introduce two target statements.
target += for the data log-lik and prior_target += for the prior part
or something like that. Before this kicks off a flame war on me… this is not meant to be implemented anytime soon… but I am making the point that this may actually be possible to implement in a simple way in a Stan program with full generality.
The raise-to-some-power business is the idea behind the power prior, a technique to discount historical data. I am not an expert in this technique, as I have trouble understanding the meaning of the scaling parameter, but in this warmup setting it may actually make sense, I find.
That's what I thought at first, but whenever I extract the step size from a run consisting only of the first two warmup phases, the step size is always one, which confused me a lot. Wonder why that is.
Hm, if the main goal is to adapt the metric and approach the typical set, I guess the time dedicated to finding an appropriate step size for a posterior which won't be sampled anyway is wasted. I believe it also led to rather small step sizes, which is counter to what we appear to want, right? Start high and go low if necessary.
Yes, exactly. This, and pooling the warmup samples across chains, which actually appears to be a necessary step to improve performance.
You want to scale the part of the log-posterior density due to the data to avoid the chains being trapped in metastable energy states? That should be as simple as adding another data variable which we increment as we restart our chains. BTW, is there a way to get info on which warmup phase we are currently in from within the Stan program?
A potential drawback would be that we do not save any computational work, as we still have to integrate over the whole time span. With the rescaling, we should have even less pressure to leave the bad neighbourhoods, shouldn't we? Or not, I have to get the picture right in my head first…
Hm, wait, so what do we actually want? At sampling time, we want to have the "true" posterior? During warmup we want to interpolate between the prior and the posterior, with a fixed factor per window of the second warmup phase?
Correct. You write your model such that you accumulate the prior straight with
target += ...prior stuff goes here...
but you collect the data-driven log-lik terms into a variable.
real lp = 0;
lp += data-log-lik things;
then you do at the end
target += alpha * lp;
and alpha is varied from 0.1 all the way up to 1 during successive warmup runs. The final sampling phase is with alpha = 1 (though I would make an if there, so that if alpha == 1, the scaling is dropped entirely).
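Driving that from CmdStanPy could look roughly like the following sketch (just to illustrate the schedule; `model_power.stan` and `load_data()` are hypothetical placeholders, and the alpha ladder is arbitrary):

```python
# Sketch of a power-scaling warmup schedule.  The Stan model is assumed to
# take alpha as data and add alpha * lp to the target as above;
# model_power.stan and load_data() are hypothetical placeholders.
import numpy as np
from cmdstanpy import CmdStanModel

model = CmdStanModel(stan_file="model_power.stan")
base_data = load_data()                     # dict with the model's actual data
chains = 4

inits = None
for alpha in (0.1, 0.25, 0.5, 1.0):         # ramp the likelihood back in
    final = alpha == 1.0
    fit = model.sample(
        data={**base_data, "alpha": alpha},
        chains=chains,
        iter_warmup=200,                    # short warmup per stage
        iter_sampling=1000 if final else 50,
        inits=inits,
        show_progress=False,
    )
    if not final:
        # Start the next stage from draws of this one (the adapted metric
        # could be carried over as well, as in the incremental sketch above).
        sv = fit.stan_variables()
        n = next(iter(sv.values())).shape[0]
        inits = [{k: np.asarray(v[j]).tolist() for k, v in sv.items()}
                 for j in range(n - chains, n)]
# fit now holds the alpha = 1 run, i.e. draws from the actual posterior.
```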
This technique is a warmup for the warmup, if you will. It should be done during the early parts of warmup.
I really donāt know what is best here - the topic is mega complex!
The approach should let the sampler first adapt to the prior (alpha close to 0) and then you ramp up alpha quickly to get to where you actually want to be. The crux is that this technique will only really help if the posterior geometry is a smooth function of alpha… don't ask me what smooth means here or how I would define it.
(and I was actually just reading this and threw in this idea - as it made sense). Maybe @avehtari or @Lu.Zhang can say more.
I like the idea of separating the prior target and the likelihood target. And also power posteriors; they could possibly help avoid getting stuck in a bad local mode during warmup.
I'm still not convinced that this will help much, but we can just compare regular/incremental/power warmup.
What are the best metrics for this? Besides visual inspection, which seems rather unreliable, and the final Neff/s, which will probably take forever.
For the first phase you want to find the typical set, starting from some point in the tails of the posterior. So I guess something like a z-value? Distance from the posterior mean divided by a measure of the standard deviation? Edit: I guess something like the distance from the mean in the norm induced by the inverse covariance matrix (a Mahalanobis distance) could work.
For the second phase you want to find the (co)variance matrix, so just the Frobenius norm of the difference to some reference matrix?
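Concretely, I mean something like the following (a plain NumPy sketch; the reference mean/covariance would have to come from a long reference run):

```python
# Sketch of the two checks: a Mahalanobis-style distance to the reference
# mean for phase 1, and the Frobenius norm of the covariance error for
# phase 2.  ref_mean / ref_cov come from a long reference run.
import numpy as np

def distance_to_typical_set(x, ref_mean, ref_cov):
    """Phase 1: distance of the current point x from the reference posterior
    mean, in the norm induced by the inverse reference covariance."""
    d = x - ref_mean
    return float(np.sqrt(d @ np.linalg.solve(ref_cov, d)))

def metric_error(est_cov, ref_cov):
    """Phase 2: Frobenius norm of the difference between the estimated
    (co)variance matrix and the reference one."""
    return float(np.linalg.norm(est_cov - ref_cov, ord="fro"))

# Usage, with `draws` an (n_draws, n_params) array from the run under test:
# est_cov = np.cov(draws, rowvar=False)
# print(distance_to_typical_set(draws[-1], ref_mean, ref_cov),
#       metric_error(est_cov, ref_cov))
```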
The warmup strategy you talked about was to fit an increasing subset of timepoints, like fit 10%, then 20%, etc., and use each of those as an approximate posterior for the next in line.
And this seems similar: we can use pow(likelihood, alpha) as an approximation to the posterior, where 0.0 is a really bad approximation and 1.0 is the right thing.
So, in the same way that the subsets-of-timepoints strategy works, does this work? Nothing more formal than that.
Well, what we are comparing right now in the eye-norm are wall times and Neff/s, which is fine.
I can generate more of these plots with the incremental warmup replaced by or compared against the power warmup, but if the power warmup is only intended for the first phase of the warmup, which should take up a rather short amount of time, then waiting for complete wall times or Neff/s seems rather inefficient.