Offset multiplier initialization

@jsocolar pointed out that if you initialize an offset-multiplier style non-centered parameters to a really small value you get problems sampling. @jsocolar’s example showed the same problems for me, but when I tried to duplicate it with a manual offset-multiplier implementation, I did not get the problem.

It seems like something not-so-good is going on here.

Here is the offset-multiplier model

parameters{
  real<lower = 0> sigma;
  real<multiplier=sigma> x;
}
model{
  sigma ~ std_normal();
  x ~ normal(0, sigma);
}

Here is the manual offset-multiplier model:

parameters{
  real<lower = 0> sigma;
  real x_raw;
}
transformed parameters {
  real x = x_raw * sigma;
}
model{
  sigma ~ std_normal();
  x ~ normal(0, sigma);
  target += log(sigma);
}

Code to run them is

library(tidyverse)
library(cmdstanr)

mod1 = cmdstan_model("mod1.stan")
inits_chain_1 = list(sigma = 1e-20)
fit1 = mod1$sample(chains = 1, init = list(inits_chain_1), iter_sampling = 1000)
fit1$summary()

mod2 = cmdstan_model("mod2.stan")
fit2 = mod2$sample(chains = 1, init = list(inits_chain_1), iter_sampling = 1000)
fit2$summary()

You get output like this for the build-in offset-multiplier:

> fit1$summary()
# A tibble: 3 x 10
  variable      mean    median    sd   mad        q5       q95  rhat ess_bulk
  <chr>        <dbl>     <dbl> <dbl> <dbl>     <dbl>     <dbl> <dbl>    <dbl>
1 lp__     -9.01e+38 -9.01e+38     0     0 -9.01e+38 -9.01e+38    NA       NA
2 sigma     3.12e-20  3.12e-20     0     0  3.12e-20  3.12e-20    NA       NA
3 x        -1.33e+ 0 -1.33e+ 0     0     0 -1.33e+ 0 -1.33e+ 0    NA       NA
# … with 1 more variable: ess_tail <dbl>

And the output with a custom offset-multiplier looks like:

> fit2$summary()
# A tibble: 4 x 10
  variable     mean  median    sd   mad      q5    q95  rhat ess_bulk ess_tail
  <chr>       <dbl>   <dbl> <dbl> <dbl>   <dbl>  <dbl> <dbl>    <dbl>    <dbl>
1 lp__     -1.57    -1.24   0.987 0.761 -3.64   -0.563  1.00     292.     430.
2 sigma     0.821    0.703  0.594 0.591  0.0667  2.02   1.00     437.     366.
3 x_raw     0.0115   0.0396 0.983 1.01  -1.64    1.59   1.00     523.     660.
4 x         0.00698  0.0104 0.967 0.547 -1.66    1.51   1.00     492.     503.

This is pretty repeatable that the custom code doesn’t have a problem with inits but the built in does.

I’m not quite sure where the problem is or what the difference is – posting here if someone knows. @jsocolar said this was happening in a realistic setting (and so figuring out it was a problem probably wasn’t fun or easy and so it’d be nice to fix this if it is fixable or at least know what is the difference)

4 Likes

Not sure it helps, but the generated C++ for the first case:

local_scalar_t__ sigma;
sigma = DUMMY_VAR__;
      
sigma = in__.scalar();
if (jacobian__) {
    sigma = stan::math::lb_constrain(sigma, 0, lp__);
} else {
    sigma = stan::math::lb_constrain(sigma, 0);
}
local_scalar_t__ x;
x = DUMMY_VAR__;

x = in__.scalar();
if (jacobian__) {
    x = stan::math::offset_multiplier_constrain(x, 0, sigma, lp__);
} else {
    x = stan::math::offset_multiplier_constrain(x, 0, sigma);
}
{
    lp_accum__.add(std_normal_lpdf<propto__>(sigma));
    lp_accum__.add(normal_lpdf<propto__>(x, 0, sigma));
}

and for the second case:

local_scalar_t__ sigma;
sigma = DUMMY_VAR__;

sigma = in__.scalar();
if (jacobian__) {
  sigma = stan::math::lb_constrain(sigma, 0, lp__);
} else {
  sigma = stan::math::lb_constrain(sigma, 0);
}
local_scalar_t__ x_raw;
x_raw = DUMMY_VAR__;

x_raw = in__.scalar();
local_scalar_t__ x;
x = DUMMY_VAR__;

x = (x_raw * sigma);
{
  lp_accum__.add(std_normal_lpdf<propto__>(sigma));
  lp_accum__.add(normal_lpdf<propto__>(x, 0, sigma));
  lp_accum__.add(stan::math::log(sigma));
}
2 Likes

Thanks and this is the offset multiplier code: math/offset_multiplier_constrain.hpp at develop · stan-dev/math · GitHub

Not sure where something is going off the rails. Probably the way to debug this is looking at differences in gradients of those two blocks of C++ tho.

(Edit: to be clear not asking you to check the C++ or anything unless you really want to – the point of the post is to know if anyone else has seen this/knows what is happening before digging in)

2 Likes

Just to mention that I think the settings where I’ve seen this happen with “real models” involve initializations far from the typical set where gradients are really weak, and so (with nontrivial probability) chains manage to wander down to absurdly low standard deviations during the initial exploration… and (sinister voice) once they go in, they never come out.

I’ve got a model on simulated data where this was happening as frequently as approximately 1 in 4 chains. Even when chains worked, during the first few iterations this model threw a surprising number of warnings about “multiplier is zero but must be positive” and “current metropolis proposal is about to be rejected” (sorry I don’t have the actual warning text in front of me). Switching to the “old” noncentered parameterization squashed all these warnings as well as the sticky-boundary issue.

I agree with @jsocolar that initialization being far from the high probability region could be the problem. The following plot shows the density plot of the offset-multiplier model. The blue and black region has log-density greater than -15. We can see from the plot that the initials will be far from the high probability region when initializing sigma at a very small value.
p1

On the other hand, the high probability region of the manual offset-multiplier has a much better shape. The following plot shows the log-density of the manual offset-multiplier, where the blue and black region has log-density greater than -10.
p2

The weird shape of the high probability region is probably caused by the singularity of the log-density \log p(x, \sigma) of the offset-multiplier model at \sigma = 0 , where \log p(x, \sigma) \propto -0.5 \sigma^2 - log(\sigma) - 0.5 x^2/\sigma^2. While the manual offset-multiplier model whose log-density follows \log p(x_{raw}, \sigma) \propto -0.5 \sigma^2 - 0.5 x_{raw}^2 does not have this problem.

1 Like

Oh wait, when I read this issue a couple days ago I thought these were two different models, but now that I look closer I really think these should be exactly the same two log-densities.

This code from the built-in offset-multiplier (jacobian should be true when we’re doing HMC):

x = in__.scalar();
if (jacobian__) {
    x = stan::math::offset_multiplier_constrain(x, 0, sigma, lp__);
} else {
    x = stan::math::offset_multiplier_constrain(x, 0, sigma);
}
...
lp_accum__.add(lp__) // this happens later in the code

Should be equivalent to:

x = (x_raw * sigma);
...
lp_accum__.add(stan::math::log(sigma));

So I really think these should be the same log densities? I’m confused. Also a lot of the offset_multiplier code and whatnot changed in the last few months so maybe some autodiff got improved in develop. I am curious if this would happen there.

1 Like

I think that density is the one you’d get from a regular non-centered parameterization. Neither of the models in the first post are the regular non-centered parameterization (this is what I missed a couple days ago when I read it).

One of these is offset multiplier and the other should be equivalent to offset multiplier :/.

1 Like

Sorry I am a little bit confused. Isn’t it correct that Stan samples in the unconstrained space of (x, sigma) for the offset-multiplier model, and Stan samples in the unconstrained space of (x_raw, sigma) for the manual offset-multiplier model? The x-axis of the two density plots I gave are different. By the way, I just found that I missed the lower bound of sigma, so I updated the log-density plots in my last post.

1 Like

Ooooo, I think you’re right it is initialization. Underlying they are both exposed to the sampler as x_raw. However, when things are initialized they are initialized on the scale you see, which in the first model is x (not the x_raw scale). In the second model things are indeed initialized on x_raw. Not sure that is actually the problem, but could test that by supplying manual inits.

For more on why both models have an x_raw, what the offset multiplier thing does behind the scenes in sample on an x_raw to avoid that by doing a change of variables transformation from an x_raw to x that should mimic what the second model does.

That’s mostly baked into:

x = stan::math::offset_multiplier_constrain(x, 0, sigma, lp__);

There is the x on the input but that is different than the x on the output (the input should be equivalent to x_raw).

I see, that is interesting! Thank you so much!

Wait, I say that with such confidence but it isn’t true lol. Nevermind whoops.

(Randomly initialized things are initialized on the unconstrained scale)

Wow, that looks weird.

If that helps anybody, I tried building a Stan model that should match all of the operations and their order as in the offset-multiplier (mod1) and it still does not match the (bad) results in mod1:

mod3.stan:

parameters{
  real<lower = 0> sigma;
  real x_raw;
}
transformed parameters {
}
model{
  real x;
  target += log(sigma);
  x = fma(sigma, x_raw, 0);
  sigma ~ std_normal();
  x ~ normal(0, sigma);
}

So the only remaining difference IMHO is that in mod1 the constraining transform overwrites the x local variable while in mod3 we create a new local variable to hold the result… Which should not matter much?

BTW, I’ve always found the offset-multiplier way of doing stuff confusing as a) it requires me to use sigma at two potentially distant places in the code, increasing risk I will change one but not the other in the future and b) I always have to think hard whether the multiplier is from unconstrained-to-constrained or the other way around. In fact, it seems our own manual has it the wrong way: 10.5 Affinely transformed scalar | Stan Reference Manual (e.g. note that the math there would imply that the log-jacobian is -log(sigma) but Stan’s internal code and comparing with logic of the lower bound both give +log(sigma)) EDIT: I filed a docs issue for this: Docs for affinely transformed scalar do not match implementation · Issue #356 · stan-dev/docs · GitHub

In addition the offset-multiplier logic (both manual and controled by Stan) adds unnecessary operations to the model: log(sigma) is added in the Jacobian correction, only to be subtracted in x ~ normal(0, sigma), with a penalty for both performance and numerical precision, so I think the best way to code non-centering is the IMHO simpler:

parameters{
  real<lower = 0> sigma;
  real x_raw;
}
transformed parameters {
  real x = x_raw * sigma;
}
model{
  sigma ~ std_normal();
  x_raw ~ std_normal();
}
1 Like

Hmm~ I still suspect that the way of generating initials of the two models are different.
If you check the first several samples of x and \sigma for the two models

mod1 = cmdstan_model("mod1.stan")
inits_chain_1 = list(sigma = 1e-7)
fit1 = mod1$sample(chains = 1, init = list(inits_chain_1), 
               iter_sampling = 1000, save_warmup = TRUE)
draws <- fit1$draws(inc_warmup = TRUE)
rbind(draws[1:6,,"x"], draws[1:6,,"sigma"])

mod2 = cmdstan_model("mod2.stan")
fit2 = mod2$sample(chains = 1, init = list(inits_chain_1), 
                   iter_sampling = 1000, save_warmup = TRUE)
draws2 <- fit2$draws(inc_warmup = TRUE)
rbind(draws2[1:6,,"x"], draws2[1:6,,"sigma"])

you will get output like this:

> rbind(draws[1:6,,"x"], draws[1:6,,"sigma"])
          [,1]      [,2]         [,3]         [,4]         [,5]         [,6]
[1,] 0.2110230 0.2110230 -1.30120e-02 -7.56263e-03 -1.14654e-02 -1.61252e-02
[2,] 0.0000001 0.0000001  1.38811e-07  8.30654e-08  1.32924e-07  2.14542e-07
> rbind(draws2[1:6,,"x"], draws2[1:6,,"sigma"])
            [,1]        [,2]        [,3]         [,4]        [,5]        [,6]
[1,] 2.27659e-08 -8.2056e-08 6.52913e-06 -6.68192e-06 0.000232000 0.000181418
[2,] 1.00000e-07  4.5713e-07 5.38727e-06  5.36030e-06 0.000327161 0.003488220

If both models initialize on the unconstrained space of x_{raw} and \sigma, why the scale of x are so different for the two models.

In case anyone wants to get confused a bit more, here’s a traceplot of the first 200 warmup iterations for a single chain and mod1 and mod2 with a fixed seed but unspecified inits:

Edit with further info:

  • Any common seed without inits yields a picture as above
  • Specifying only sigma or only x and x_raw as inits yields different initial points and initial traces (as for @Lu.Zhang)
  • Specifying sigma and x==x_raw/sigma again yields the same inits and slowly diverging traces.
2 Likes

Also, @martinmodrak wins a prize:

The other formulations start to do very weird things around sigma==1e-10:

Another edit with further info:

unit corresponds to

Final edit:

left vs right: initialize only sigma vs initialize sigma and x/x_raw
top to bottom: sigma from 1e-20 to 1e-4
colors: models
lines: traceplots in the sigma,|x| plane

black lines: common initial values (if applicable)

Summary:

The manual version only appeared better because the initial points were not the same if only sigma was specified. Actually, it’s not really better than the offset-multiplier version. The reason appears to be numerical, as @martinmodrak’s version works fine for all (tested) values of sigma.

4 Likes