Throw exception when <multiplier = 0>?

Is the doc here written backwards:

Just comparing it to this it seems so:

By backwards I mean we want to put a sampling statement on Y, but the unconstrained space is written in terms of X and so the last equation with p_X and p_Y isn’t what we want.

I guess both spaces are unconstrained there but I mean unconstrained space as in the sampler’s coordinates.

The thing that bothers me is the log(0)/log(1/0) that’ll come with the Jacobian.
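To make the log(0) concrete: for the offset-multiplier transform y = mu + sigma * x, the log absolute Jacobian is log(sigma), which is -inf exactly when sigma is 0. A minimal Python sketch (the `log_jacobian` helper is just for illustration, not Stan's implementation):

```python
import math

# Offset-multiplier transform: y = mu + sigma * x, so dy/dx = sigma
# and the log absolute Jacobian is log(sigma).
def log_jacobian(sigma):
    # math.log(0.0) raises in Python; IEEE semantics give -inf,
    # which is exactly the log(0) the Jacobian introduces at sigma = 0
    return math.log(sigma) if sigma > 0.0 else float("-inf")

print(log_jacobian(1.0))   # 0.0
print(log_jacobian(0.0))   # -inf: the target log density is poisoned
```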

I just don’t expect that error/warning to mean anything unless someone actually just types <multiplier = 0> (or otherwise uses a parameter that is actually constrained to zero).

If we wanted to avoid this error, we could:

  1. Work on the log scale (provide a log_multiplier option? But then it’s still easier to just provide log(sigma) than sigma on a log scale)

  2. Change the offset multiplier transformation to y = \mu + (\sigma + 10^{-300}) x (or whatever the smallest positive double value is).

1 would take compiler work to be effective and 2 is just playing numeric games to make sure the error never happens.
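For a sense of scale on option 2, a quick Python check of the smallest positive doubles (assuming IEEE 754 arithmetic) shows why an additive epsilon is only a numeric game: it is absorbed at any ordinary magnitude of sigma.

```python
import sys

print(sys.float_info.min)               # 2.2250738585072014e-308, smallest normal double
print(5e-324 > 0.0)                     # True: subnormals go smaller still
print(1.0 + sys.float_info.min == 1.0)  # True: the offset is absorbed at ordinary scales
```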

I think it makes sense to just axe the error, or I’m pretty sure 2 + a non-negative check on sigma would be fine as well.

Yes, definitely a bug in the doc. I filed an issue in the docs repo with an explanation. The code in stan-dev/math is correct, so it’s just a problem with the doc.

That’d be OK with me. I’m curious if @betanalpha has a geometric point to make about this. The thing that worries me is the reduction in dimension.

We definitely don’t want to start adding epsilon offsets—that just hides the problem and we’ve learned from long experience that’s a bad idea. We just did a round of getting rid of these in the upper and lower-bound transforms (where we used to try to prevent underflow to 0 or rounding to 1).

I am entirely in favor of keeping the warning.

One fear is that people will try to use multiplier=0 to turn off certain parameters – when the multiplier is a constant zero the unconstrained parameter will decouple from the expression graph and not influence the target log density or its gradients (aside from adding a constant infinity due to the log(0) introduced with the Jacobian). This will cause the sampler to explore the entire real line along that unconstrained parameter (although the infinity added to the target log density will cause havoc to the state sampling procedure). Ultimately in my opinion trying to set multiplier to a constant zero is a bad enough idea that a warning is necessary.

At the same time I also think that multiplier being set to a dynamic zero due to underflow also warrants a warning. Underflow is not a nuisance but an important indication that floating point precision is not sufficient for the model being specified. Turning the warning off or trying to code around the warnings doesn’t improve floating point but rather just hides the problems. Users have the prerogative to ignore the warnings and push on with numerically fragile models, but that should be no reason to reduce the information communicated by the sampler.

Yup. That’s exactly why I thought it should be an error.

The issue here is that these are transient errors during very early warmup iterations. Reducing the initial stepsize has always gotten rid of these warnings in my experience, at the cost of making warmup slower.

I think that they’re informative regardless of their transience exactly because they tell you about how the sampler is struggling during warmup!

The warnings (I’m including all stability warnings: these, matrix inversions, implicit solvers, etc.) are interwoven into the iteration updates, so it’s clear whether they’re coming from the warmup phase or the main sampling phase. When they show up during the main sampling phase, they inform the user that the target density is numerically unstable somewhere within the typical set, which is very bad. But even when they only show up during the warmup phase, they inform the user that the target density is numerically unstable on the path towards the typical set, which can reduce efficiency.

It’s only with these warnings that we can react by improving the numerical stability of the target density or tweaking the initial sampler configuration (smaller stepsize means more accurate numerical Hamiltonian trajectories that will deviate into numerically unstable regions less, especially when the large initial potential energy is converted to kinetic energy during equilibration).

Should this be a helpful tip to add in the manuals?

I would just add that I’ve seen a ton of numerical errors that appeared only rarely in warmup, e.g. the sd of a normal distribution being zero, location parameters being inf, etc. Those have causes similar to the case discussed here (numerical underflow/overflow). In my experience they can be informative that my priors are too wide or that something else is a bit fishy (though I also ignore them sometimes when I lack patience, and it has never really bitten me). In this context, it seems weird to me to have one type of underflow error that gets hidden while other very similar errors do not.

I don’t have strong opinion on this, just wanted to share this viewpoint.

In our particular example, the sampler was not struggling. I think it may have been happening on the very first iteration of warmup. I’m concerned that a naive user might freak out over these warnings, or, alternatively, about the boy-who-cried-wolf issue: non-errors like this can numb users to more serious problems.

Also, from a statistical perspective, there is no error happening here. Internally, the scale parameter is on the log scale and it’s working just fine, and the exp(-300) going to 0 doesn’t actually cause any problems.

One way to see this is that I rewrote the model without the implicit scaling and it had no problems at all.

So in this case there’s nothing fishy etc.; it’s just an artifact of how the scaling is implemented.

One alternative would be to implement the scaling in some other way so that it mimics the error-free way that the computation goes without the implicit scaling.
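One small caveat on the numbers above: in IEEE double precision, exp(-300) is still representable (about 5e-131); exp only underflows to exactly 0 around -745. A quick Python check:

```python
import math

print(math.exp(-300.0))  # ~5.148e-131: still positive in double precision
print(math.exp(-746.0))  # 0.0: true underflow happens around exp(-745)
```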

I agree.

I’m not sure what “a statistical perspective” means. If we’re talking pure math, then computation is irrelevant. Inverse logit will never round down to zero from a finite value in unlimited precision mathematics.

If we’re talking computation, we literally run into an underflow error. The precise nature of the error is that a number that should be positive (the inverse logit of a finite number) underflows to zero. That’ll cause a problem for the normal_lpdf, which you finessed in the example by defining it yourself. That blocks the error message, but doesn’t fix the divergence caused by the underflow to zero.
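The underflow described here is easy to reproduce: mathematically inv_logit(x) > 0 for all finite x, but in double precision it rounds to exactly 0 for large negative x. A Python sketch (the `inv_logit` helper is hand-rolled for illustration, not Stan’s implementation):

```python
import math

def inv_logit(x):
    # numerically stable piecewise form of 1 / (1 + exp(-x))
    if x >= 0.0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

print(inv_logit(-40.0))   # ~4.25e-18: tiny but still positive
print(inv_logit(-800.0))  # 0.0: underflow, though mathematically positive
```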

It’s just the folk theorem for warmup. The initial stepsize is too large for the initial parameter values. We could maybe fix the algorithm and find an initial stepsize that works without errors or just stop cold there. We could also provide an option to suppress warnings during warmup, but they can be super helpful when there is a problem that doesn’t go away after a few iterations.

Only because you rewrote normal_lpdf in order to suppress the error message that would otherwise arise from a scale parameter with value 0 being passed to normal_lpdf.

It’s not “error free” without the implicit scaling!

This is not a false positive. There really was a numerical error that affected sampling! But I take your point that it’s not a serious issue during warmup if it goes away. At least in most cases where it’s not caused by multimodality or slow mixing or inability to visit a region of the posterior. In other words, it’s OK if it passes SBC.

Hi, just to clarify, here’s a simpler example. I’ll simulate fake data in R and then fit the hierarchical model. With the implicit scaling (program in file scaling_1.stan), we get tons of warnings: “offset_multiplier_constrain: multiplier is 0, but must be > 0!”. With the old-fashioned code (program in file scaling_2.stan), we only get one or two warnings: “normal_lpdf: Scale parameter is 0, but must be > 0!”. But the two models give essentially the same inferences; they both actually converge just fine.

I guess it’s open to debate whether the “normal_lpdf” warning should be suppressed: I have mixed feelings about this one, but I can accept the argument that it’s good for the user to know about these things. But I don’t think that the “offset_multiplier_constrain” warning makes sense, as it seems entirely unnecessary.

Code is below.


data {
  int N;
  int J;
  int<lower=1, upper=J> group[N];
  vector[N] y;
}
parameters {
  real<lower=0> sigma_y;
  real<lower=0> sigma_a;
  vector<multiplier=sigma_a>[J] a;
}
model {
  y ~ normal(a[group], sigma_y);
  a ~ normal(0, sigma_a);
}


data {
  int N;
  int J;
  int<lower=1, upper=J> group[N];
  vector[N] y;
}
parameters {
  real<lower=0> sigma_y;
  real<lower=0> sigma_a;
  vector[J] e_a;
}
transformed parameters {
  vector[J] a = sigma_a * e_a;
}
model {
  y ~ normal(a[group], sigma_y);
  e_a ~ normal(0, 1);
}

R script:


N <- 1000
J <- 50
group <- rep(1:J, rep(N/J, J))
sigma_y <- 2
sigma_a <- 0.5
a <- rnorm(J, 0, sigma_a)
y <- rnorm(N, a[group], sigma_y)

data1 <- list(N=N, J=J, group=group, y=y)

M1 <- cmdstan_model("scaling_1.stan")
fit_1 <- M1$sample(data=data1, refresh=0, parallel_chains=4, iter_warmup=1000, iter_sampling=1000)

M2 <- cmdstan_model("scaling_2.stan")
fit_2 <- M2$sample(data=data1, refresh=0, parallel_chains=4, iter_warmup=1000, iter_sampling=1000)


The reason there’s no error here is that the improper uniform prior on sigma_a is consistent with sigma_a = 0. If you use an inverse gamma or lognormal prior on sigma_a, you’ll get a divergence warning. A half normal shouldn’t throw an error.

The real problem here is the reduction in dimension. a is declared with J degrees of freedom, but with sigma_a = 0, the values of e_a are ignored.

Now what’ll happen during warmup is that we’ll fall back and lower step size and try again and make progress. But the warning is for a real underflow error.

I will ask again: should we just get rid of this warning from having a zero multiplier? I honestly don’t care one way or the other, but I suspect @andrewgelman’s going to keep asking about it until a solid decision is made.

Just to check, I re-did the above simulations adding the following line to the model block of both Stan programs:

sigma_a ~ lognormal(1, 1);

And the same thing happened as described in my earlier post.

Then I redid with the following line instead:

sigma_a ~ normal(1, 1);

(This is actually half-normal because sigma_a is constrained to be positive.)

Again, pretty much the same thing happened.

So I don’t think the prior for sigma_a is the issue.

Also, just to be clear, it’s not that I “keep asking about it”! I asked about it and then there was some discussion on the thread, so I updated with a cleaner version of the model. It’s interesting that the error occurs with the implicit scaling but not with the hand-coded scaling.

Thanks. That’s helpful. I would think that if sigma_a = 0 then sigma_a ~ lognormal(1, 1) would throw an exception. I’ll look into why that’s not happening. You may not like the result, though, because it should be throwing an exception.

Not yet. That’s why I wrote

My suspicion is due to your having explicitly told me in the past that your strategy for Stan feature requests was to put them on auto-repeat until they’re addressed.

  • The old way of doing non-centering and this are different. There’s a Jacobian adjustment thing going on here. You could just as well do:
parameters {
  real<lower=0> sigma_y;
  real<lower=0> sigma_a;
  vector<multiplier=sigma_y>[J] a;
}
model {
  y ~ normal(a[group], sigma_y);
  a ~ normal(0, sigma_a);
}


parameters {
  real<lower=0> sigma_y;
  real<lower=0> sigma_a;
  vector<multiplier=sigma_a>[J] a;
}
model {
  y ~ normal(a[group], sigma_y);
  a ~ normal(0, sigma_a);
}

But only the second thing is non-centering. The first is still fine statistically – I think it just probably wouldn’t work super well lol.

  • I’m all for reporting error messages aggressively, but has anyone ever fixed a problem because of this error? I still use the old non-centering approach so I don’t have experience with this. Setting this to an actual zero should fail to initialize.

  • Maybe we should both print errors and provide summaries of them. The bad part about printing errors is that maybe an unimportant one prints 100 times and then an important one prints once. The good part is printing errors gives you more info to find problems.

Maybe something like:

> fit
47 warnings in 4000 iterations, check fit$warnings() for more info
variable  mean median   sd  mad    q5   q95 rhat ess_bulk ess_tail
    lp__  -7.25  -6.98 0.69 0.31 -8.70 -6.75 1.00     1818     2210
    theta  0.25   0.24 0.12 0.12  0.08  0.46 1.00     1447     1563
> fit$warnings()
23 normal_id_glm sigma == 0
17 max number of iterations exceeded
7 log of inf

That helped me find the bug, which is on line 27 of (math)/stan/math/prim/prob/lognormal_lpdf.hpp, which is:

check_nonnegative(function, "Random variable", y);

It should be check_positive because 0 is not in the support of a lognormal. Here’s the issue:

If we get rid of the multiplier error (I still don’t know what the consensus is there), then both will throw when they hit the lognormal prior, and neither would throw with a half normal prior (which is consistent with zero). This whole thing’s a mess when the math says we should be in (0, 1) but the floating point underflows to 0 or rounds to 1.

The solution I suspect @andrewgelman is going to want is to just suppress all warnings during warmup.

I’m not sure where that first model came from, but I agree it’s not something we’d want to do.

For the second one, the Jacobian adjustment is

target += J * log(sigma_a);

which should just make things uniform on a > 0 as a result, so that the result’s the same as doing all this manually (though a bit cleaner and a bit less efficient).

I see three things going on.

  1. Maybe it makes sense to have a “quiet” setting that switches off warnings during warmup. I’m not sure about this, though, as sometimes a warmup warning can be useful. We could consider turning off warnings just in the first stage of warmup. As usual, this is a naive user vs. super user tradeoff. For the naive user, early warnings can be distracting and can lead to a boy-who-cried-wolf syndrome. For the super user, warnings at any stage can give us information, and more information should be better.

  2. This particular set of warnings still seems unnecessary to me, because sigma comes into the target function in two places, neither of which I think should cause a problem. First, there’s a log(sigma) in the target, but this should be fine, as Stan is actually working internally with log(sigma), which never itself blows up (it’s just at -300 or -1000 or whatever). Second, there’s a term of -0.5*((y - mu)/sigma)^2, but this could just become -Inf (thus always rejecting the jump) when log(sigma) is less than the threshold.

  3. You (Bob) were saying that the reason this is going on is that the gradient causes the HMC algorithm to overshoot and send log(sigma) down to -1000. If so, another way to go would just be to not let parameters on the log scale go below -300 or whatever. For programming reasons this might be a nonstarter, but this would solve this particular problem.

My point is that in some settings the warnings during warmup are telling us real information about the geometry of the problem or whatever. But in this case the warnings are not telling us about any such problem. We can see this because they occur even when I put a strong prior on sigma_a which knocks out that ugly tip of the funnel.

So this does not seem to me to be a legit warning. I’m bothered by this warning; this doesn’t mean that I think all warnings should be shut off.

That’s what I was suggesting—kill the messenger.

I get what you mean, but I don’t like the analogy. This is not a false positive. The system’s crying “underflow” because there’s really underflow.

It absolutely does cause a problem in that a zero multiplier cuts the gradient to the parameters. The only reason the sampler gets over the problem is that it’ll reject and retry with a new random momentum.

That’s only true if you stick to the unconstrained scale in the Stan program. As soon as this shows up in the parameters block

real<lower = 0> sigma;

then the unconstrained form of sigma (i.e., log(sigma)) is exponentiated to produce sigma, which is then the variable available in the rest of the program.

The problem with standard deviation going to zero in a normal distribution is that there are log(sigma) and (y - mu) / sigma terms involved.

The problem with the multiplier going to zero here

vector<multiplier = sigma_a>[N] a;

is that all of the a go to zero and we degenerate from a vector of different variables to a vector of zeros.
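To make the degeneracy concrete, here’s a toy Python version of the constrain step (mu, sigma, and x are arbitrary illustrative values):

```python
# Offset-multiplier constrain: a_j = mu + sigma * x_j, so da_j/dx_j = sigma.
# At sigma = 0 every a_j collapses to mu: J free parameters become none,
# and the gradient back to each x_j is exactly zero.
mu = 1.5
x = [0.3, -2.0, 7.1]

a_ok  = [mu + 0.5 * xj for xj in x]  # distinct values, gradient 0.5 each
a_bad = [mu + 0.0 * xj for xj in x]  # all equal to mu, gradient 0 each
print(a_ok)
print(a_bad)   # [1.5, 1.5, 1.5]
```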

How do you propose doing that? We can’t just say

if (x < -300) x = -300;

as that makes the gradient discontinuous (and identically zero past the clamp), and we can’t autodiff through a constant like that.

Oh, I didn’t realize. I thought that when we had log(sigma) in the Stan program, it worked with the internal log_sigma variable (i.e., the parameter on the unconstrained scale).

You have to be more specific. If we write

parameters {
  real<lower = 0> sigma;
}

then the underlying parameter over which Stan samples is log(sigma). But, every iteration, the constrained value sigma is calculated using exp(). Then sigma is the only variable exposed. To get back to the unconstrained variable, you’d have to do log(sigma), but it’s already too late.
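The “already too late” point can be seen numerically: once exp() underflows on the constrain step, the unconstrained value is unrecoverable. A Python sketch (using -inf for log(0), as IEEE semantics would):

```python
import math

log_sigma = -800.0            # value the sampler holds on the unconstrained scale
sigma = math.exp(log_sigma)   # constrain step: underflows to exactly 0.0
print(sigma)                  # 0.0

# Trying to get back to the unconstrained scale from sigma is already too late:
recovered = math.log(sigma) if sigma > 0.0 else float("-inf")
print(recovered)              # -inf, not -800.0
```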

Now if you want to do this yourself, it’s easy:

parameters {
  real log_sigma;
}
transformed parameters {
  real<lower = 0> sigma = exp(log_sigma);
}

If you want to put a distribution on sigma, you also need the Jacobian adjustment

model {
  target += log_sigma;  // Jacobian adjustment: log |d exp(log_sigma) / d log_sigma| = log_sigma
}

But if you just want to put a prior on log_sigma, such as a normal to imply sigma is lognormal, that you can do directly. But you still can’t put the resulting sigma into a normal distribution if it underflows to zero.
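As a sanity check on the change of variables, here’s a Python sketch with a hypothetical Exponential(lam) prior on sigma: adding log_sigma, the log Jacobian of sigma = exp(log_sigma), to the log prior makes the density on the unconstrained scale integrate to 1.

```python
import math

# Hypothetical prior: sigma ~ Exponential(lam); sampled variable is log_sigma.
lam = 2.0

def lp_unconstrained(log_sigma):
    sigma = math.exp(log_sigma)
    # log prior on sigma plus the log Jacobian log|d sigma / d log_sigma| = log_sigma
    return math.log(lam) - lam * sigma + log_sigma

# Crude Riemann sum over the real line: the adjusted density integrates to ~1.
h = 0.01
total = sum(math.exp(lp_unconstrained(-20.0 + i * h)) for i in range(4001)) * h
print(round(total, 3))   # 1.0
```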

Interesting. So in theory Stan could be made more efficient by keeping log(sigma) around and using it when necessary, but I’m sure this would come at much too great a cost in language complexity, compared to whatever tiny amount of computation time it would save.

For now, I can just accept the warning, and in our documentation we can just make clear that sometimes Stan gives warnings in very early iterations which do not represent real problems.