Indeed, we are. But it’s a trivial amount of extra work in the scheme of a larger model, so I don’t think that should discourage use. If we had thought that was a dealbreaker, we wouldn’t have included it in the first place.
But the numerical behavior is really a problem, and that’s enough to make me not want to recommend using the offset/multiplier in its current form. Oh well.
I always thought the offset/multiplier was backwards: why can't I keep the same offset and multiplier in the parameters block and just declare y ~ std_normal()? Then I'd only have to write mu and sigma once, and it would fix the autodiff precision issue.
Doesn't the offset and multiplier already do what I want to this standard normal?
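For concreteness, here is a minimal sketch of the two patterns being contrasted (the names mu, sigma, alpha and the toy priors are mine, just for illustration). With offset/multiplier as it currently works, the declared parameter stays on the natural scale, so the sampling statement still has to spell out normal(mu, sigma); the declaration only changes the unconstrained coordinates the sampler moves in:

```stan
parameters {
  real mu;
  real<lower=0> sigma;
  // alpha stays on the natural scale; the sampler works with
  // (alpha - mu) / sigma and the log-Jacobian log(sigma) is added automatically
  real<offset=mu, multiplier=sigma> alpha;
}
model {
  mu ~ normal(0, 1);
  sigma ~ normal(0, 1);
  alpha ~ normal(mu, sigma);  // still written in terms of mu and sigma
}
```

The effect the comment above is after is what the explicit non-centered parameterization gives today, where mu and sigma appear only once and the sampled parameter really is standard normal:

```stan
parameters {
  real mu;
  real<lower=0> sigma;
  real alpha_raw;  // unconstrained, standard normal a priori
}
transformed parameters {
  real alpha = mu + sigma * alpha_raw;  // mu and sigma written only once
}
model {
  mu ~ normal(0, 1);
  sigma ~ normal(0, 1);
  alpha_raw ~ std_normal();  // no Jacobian needed: alpha_raw is the parameter
}
```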
This is over-generalizing from one example. It depends on the initial values and on where most of the posterior mass is. For illustration, here is the same example, but now with inits near 0 while most of the posterior mass is far from 0, so we are not starting near the mode.
Both @jsocolar's and my warmup traceplots show steady progress at the beginning. I also realised that the traceplots did not show the initial values. The following plots include the initial values and the NUTS diagnostics. The first iteration is fine, but then there are several iterations which all have a big step size, a divergence, minimal treedepth, and no progress. I repeated this with init=2 (bad inits), init=0.1 (good inits), and initialization from posterior draws (perfect inits), and in all cases the sampler is stuck for several iterations, which looks like a failure in the adaptation.
This is the expected outcome of the dual averaging, right? When the adaptation sees a very high acceptance stat very early in the adaptation, it’ll aggressively explore a large step size that takes a few iterations to come back down. The computational cost is minimal because they all diverge immediately.
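For reference, and assuming I have the Hoffman & Gelman (2014) dual-averaging formulation and Stan's defaults right: after adaptation iteration $t$ with acceptance statistic $\alpha_t$ and target $\delta$,

$$
\bar{H}_t = \left(1 - \frac{1}{t + t_0}\right)\bar{H}_{t-1} + \frac{1}{t + t_0}\,(\delta - \alpha_t),
\qquad
\log \epsilon_t = \mu - \frac{\sqrt{t}}{\gamma}\,\bar{H}_t,
$$

with defaults $\gamma = 0.05$, $t_0 = 10$, $\delta = 0.8$, and shrinkage target $\mu = \log(10\,\epsilon_0)$, where $\epsilon_0$ is the initial step size (after the startup heuristic). Early on, the $\sqrt{t}/\gamma$ factor makes $\log\epsilon_t$ very sensitive to the first few $(\delta - \alpha_t)$ terms, and $\mu$ by itself already pulls the step size toward $10\,\epsilon_0$, so the first iterations overshoot and the subsequent divergences ($\alpha_t \approx 0$) bring the step size back down.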
Even if the cost is small, it still seems silly to increase the step size by a factor of 1000 based on the results of one iteration, even when

- initializing with posterior draws,
- the step size in the first iteration is in the middle of the later step size distribution, and
- the acceptance stat of the first iteration is between 0.25 and 1 and not much different from the later iterations, so your high-acceptance-stat claim does not seem to hold.
But maybe this silly behavior doesn’t matter, if Bob’s new sampling algorithm is better with step sizes anyway.
I think it's fair to say that it's a tendency under the default initialization of Uniform(-2, 2), no? Certainly it is not guaranteed to happen, and it depends on the data and the model, but it is a common enough occurrence that users encountering this problem in the wild may find this thread in their searching. That's why this discussion is valuable. My comment was just an attempt to summarize the conversation for the non-developers who might stumble upon this thread and want a tl;dr.
For me, the other takeaway from this thread is that I should also be paying more attention to my initial values when I use this kind of structure, but that is something that applies both to offset/multiplier and to the explicit transformation (without change of variables) approach.
I'm writing up the C++ version as quickly as I can. Then I plan to release it Nutpie-style at first, with an interface through BridgeStan. Integrating it directly into Stan is a multiple person-month job that touches half a dozen core and interface libraries (stan, CmdStan, cmdstanpy, cmdstanr, rstan, pystan, stan.jl).
Ooof. Is the step size accidentally getting boosted at the beginning of the init buffer in the same way that it does after each of the metric adaptation windows?