Throw exception when <multiplier = 0>?

QUESTION: Should we continue to throw exceptions for multiplier constraints when the mulitplier is equal to zero? This can happen during warmup for models that are actually fine due to overflow to 0 on the constrained scale.

@andrewgelman would like to remove them. I bring it here for broader discussion.

@andrewgelman sent me the following example models. The first model throws exceptions during warmup when sigma_a underflows to 0, whereas the second model doesn’t throw warnings.

// model 1
data {
  int N;
  int J;
  int<lower=1, upper=J> group[N];
  vector[N] y;
  real prior_mean;
  real<lower=0> prior_sd;
  real prior_factor;
}
parameters {
  real<lower=0> sigma_y;
  real<lower=0> sigma_a;
  vector<multiplier=sigma_a>[J] a;
}
model {
  y ~ normal(a[group], sigma_y);
  a ~ normal(0, sigma_a);
  sigma_a ~ lognormal(prior_mean, prior_sd);
  target += prior_factor*0.5*log(sigma_a^2/(sigma_y^2*J/N + sigma_a^2));
}

The warnings are due to sigma_a underflowing to zero during warmup:

Sampling hier_1:

Chain 1 Informational Message: The current Metropolis proposal is about to be rejected because of the following issue:
Chain 1 Exception: offset_multiplier_constrain: multiplier is 0, but must be > 0! (in '/var/folders/j8/kg_4ryhs4d78nts87zqw73wh0000gn/T/RtmplZdpzz/model-2b943e635ccc.stan', line 13, column 2 to column 34)

The normal distribution would throw a warning, but Andrew’s recoded it directly and the basic C++ math functions don’t throw exceptions because the standard library versions don’t (we have been considering throwing in Stan even where the standard library doesn’t).

The following model defines the same log density but does not throw warnings.

// model 2
data {
  int N;
  int J;
  int<lower=1, upper=J> group[N];
  vector[N] y;
  real prior_mean;
  real<lower=0> prior_sd;
  real prior_factor;
}
parameters {
  real<lower=0> sigma_y;
  real<lower=0> sigma_a;
  vector[J] e_a;
}
transformed parameters {
  vector[J] a = sigma_a*e_a;
}
model {
  y ~ normal(a[group], sigma_y);
  e_a ~ normal(0, 1);
  sigma_a ~ lognormal(prior_mean, prior_sd);
  target += prior_factor*0.5*log(sigma_a^2/(sigma_y^2*J/N + sigma_a^2));
}
1 Like

Does multiplier need to be positive?

I think I like the warning just for being careful, but if it really provides no diagnostic value then I guess we could get rid of it.

I assume this is underflowing with divergent transitions in warmup where the leapfrog has gone unstable.

Yes, the multiplier is positive constrained.

Is that warning problematic? If this occurred post-warmup, wouldn’t it be useful? Perhaps we should start squashing some warnings in warmup?

I’m not sure what the exact zero check tells us.

Ofc, I don’t have a lot of experience with that syntax. I assume what will happen is we’ll get a 1/0, everything will turn into NaNs, and whatever piece of Stan was computing the likelihood will know something went wrong.

I think checking non-negative is gonna catch the bulk of the errors here (a multiplier that is not properly constrained).

Again, I haven’t used the syntax, but I’d also expect this error could occur quite a lot. So the false negatives or whatever would be a problem. If we’re telling people to pay attention to errors, then we should expect them to mean something a substantial part of the time. These might happen a lot but just not be that meaningful.

Anecdotally, I regularly get them before the first few samples of warmup and am not sure how to address them, but since the models settle down after initializing and have no apparent pathologies, I just ignore them and a few times they drown out another warning that I should pay attention to.

2 Likes

If a variable is declared as

vector<offset = m, multiplier = s>[N] a;

then

a = m + s * a_unconstrained;

So if the multiplier s = 0, then you get the reduction a[n] = m, so all the values are the same and you’ve reduced N parameters a to one degree of freedom.

1 Like

Is the doc here written backwards: https://mc-stan.org/docs/2_23/reference-manual/affinely-transformed-scalar.html?

Just comparing it to this it seems so: https://mc-stan.org/docs/2_23/reference-manual/lower-bound-transform-section.html

By backwards I mean we want to put a sampling statement on Y, but the unconstrained space is written in terms of X and so the last equation with p_X and p_Y isn’t what we want.

I guess both spaces are unconstrained there but I mean unconstrained space as in the sampler’s coordinates.

The thing that bothers me is the log(0)/log(1/0) that’ll come with the Jacobian.

I just don’t expect that error/warning to mean anything unless someone actually just types <multiplier = 0> (or otherwise uses a parameter that is actually constrained to zero).

If we wanted to avoid this error, we could:

  1. Work on the log scale (provide a log_multiplier option? But then it’s still easier to just provide log(sigma) than sigma on a log scale)

  2. Change the offset multiplier transformation to y = \mu + (\sigma + 10^{-300}) x (or whatever the smallest positive double value is).

1 would take compiler work to be effective and 2 is just playing numeric games to make sure the error never happens.

I think it makes sense to just axe the error, or I’m pretty sure 2 + a non-negative check on sigma would be fine as well.

1 Like

Yes, definitely a bug in the doc. I filed an issue in the docs repo with an explanation. The code in stan-dev/math is correct, so it’s just a prolem with the doc.

That’d be OK with me. I’m curious if @betanalpha has a geometric point to make about this. The thing that worries me is the reduction in dimension.

We definitely don’t want to start adding epsilon offsets—that just hides the problem and we’ve learned from long experience that’s a bad idea. We just did a round of getting rid of these in the upper and lower-bound transforms (where we used to try to prevent underflow to 0 or rounding to 1).

I am entirely in favor of keeping the warning.

One fear is that people will try to use multiplier=0 to turn off certain parameters – when the multiplier is a constant zero the unconstrained parameter will decouple from the expression graph and not influence the target log density or its gradients (aside from adding a constant infinity due to the log(0) introduced with the Jacobian). This will cause the sampler to explore the entire real line along that unconstrained parameter (although the infinity added to the target log density will cause havoc to the state sampling procedure). Ultimately in my opinion trying to set multiplier to a constant zero is a bad enough idea that a warning is necessary.

At the same time I also think that multiplier being set to a dynamic zero due to underflow also warrants a warning. Underflow is not a nuisance but an important indication that floating point precision is not sufficient for the model being specified. Turning the warning off or trying to code around the warnings doesn’t improve floating point but rather just hides the problems. Users have the prerogative to ignore the warnings and push on with numerically fragile models, but that should be no reason to reduce the information communicated by the sampler.

1 Like

Yup. That’s exactly why I thought it should be an error.

The issue here is that these are transient errors during very early warmup iterations. Reducing initial stepsize has always gotten rid of these warnings in my experience. at the cost of making warmup slower.

I think that they’re informative regardless of their transience exactly because they tell you about how the sampler is struggling during warmup!

The warnings (I’m including all stability warnings including these, matrix inversions, implicit solvers, etc) are interwoven into the iteration updates so it’s clear when they’re coming from the warmup phase or the main sampling phase. When they show up during the main sampling phase then they inform the user that the target density is numerically unstable somewhere within the typical set, which is very bad. But even when they only show up during the warmup phase they inform the user that the target density is numerically unstable on the path towards the typical set which can reduce efficiency.

It’s only with these warnings that we can react by improving the numerical stability of the target density or tweaking the initial sampler configuration (smaller stepsize means more accurate numerical Hamiltonian trajectories that will deviate into numerically unstable regions less, especially when the large initial potential energy is converted to kinetic energy during equilibration).

1 Like

Should this be a helpful tip to add in the manuals?

I would just add that I’ve seen a ton of numerical errors that appeared only rarely in warmup - e.g. the sd of normal distribution being zero, location parameters being inf etc. Those have similar causes as the case discussed here (numerical underflow/overflow). In my experience they can be informative that my priors are too wide or something else is a bit fishy (though I also ignore them sometimes when I lack patience and it has never really bitten me). In this context, it seems weird to me to have one type of underflow error that gets hidden and other very similar errors that do not.

I don’t have strong opinion on this, just wanted to share this viewpoint.

In our particular example, the sampler was not struggling. I think it may have been happening on the very first iteration of warmup. I’m concerned that a naive user might freak out over these warnings, or alternatively the boy-who-cried-wolf issue that non-errors such as like this will numb users to more serious problems.

Also, from a statistical perspective, there is no error happening here. Internally, the scale parameter is on the log scale and it’s working just fine, and the exp(-300) going to 0 doesn’t actually cause any problems.

One way to see this is that I rewrote the model without the implicit scaling and it had no problems at all.

So in this case there’s nothing fishy etc.; it’s just an artifact of how the scaling is implemented.

One alternative would be to implement the scaling in some other way so that it mimics the error-free way that the computation goes without the implicit scaling.

I agree.

I’m not sure what “a statistical perspective” means. If we’re talking pure math, then computation is irrelevant. Inverse logit will never round down to zero from a finite value in unlimited precision mathematics.

If we’re talking computation, we literally run into an underflow error. The precise nature of the error is that a number that should be positive (the inverse logit of a finite number) underflows to zero. That’ll cause a problem for the normal_lpdf, which you finessed in the example by defining it yourself. That blocks the error message, but doesn’t fix the divergence caused by the underflow to zero.

It’s just the folk theorem for warmup. The initial stepsize is too large for the initial parameter values. We could maybe fix the algorithm and find an initial stepsize that works without errors or just stop cold there. We could also provide an option to suppress warnings during warmup, but they can be super helpful when there is a problem that doesn’t go away after a few iterations.

Only because you rewrote normal_lpdf in order to suppress the error message that would otherwise arise from a scale parameter with value 0 being passed to normal_lpdf.

It’s not “error free” without the implicit scaling!

This is not a false positive. There really was a numerical error that affected sampling! But I take your point that it’s not a serious issue during warmup if it goes away. At least in most cases where it’s not caused by multimodality or slow mixing or inability to visit a region of the posterior. In other words, it’s OK if it passes SBC.

Hi, just to clarify, here’s a simpler example. I’ll simulate fake data in R and then fit the hierarchical model. With the implicit scaling (program in file scaling_1.stan), we get tons of warnings: “offset_multiplier_constrain: multiplier is 0, but must be > 0!”. With the old-fashioned code (program in file scaling_2.stan), we only get one or two warnings: “normal_lpdf: Scale parameter is 0, but must be > 0!”. But the two models give essentially the same inferences; they both actually converge just fine.

I guess it’s open to debate whether the “normal_lpdf” warning should be suppressed: I have mixed feelings about this one, but I can accept the argument that it’s good for the user to know about these things. But I don’t think that the “offset_multiplier_constrain” warning makes sense, as it seems entirely unnecessary.

Code is below.

scaling_1.stan:

data {
  int N;
  int J;
  int<lower=1, upper=J> group[N];
  vector[N] y;
}
parameters {
  real<lower=0> sigma_y;
  real<lower=0> sigma_a;
  vector<multiplier=sigma_a>[J] a;
}
model {
  y ~ normal(a[group], sigma_y);
  a ~ normal(0, sigma_a);
}

scaling_2.stan:

data {
  int N;
  int J;
  int<lower=1, upper=J> group[N];
  vector[N] y;
}
parameters {
  real<lower=0> sigma_y;
  real<lower=0> sigma_a;
  vector[J] e_a;
}
transformed parameters {
  vector[J] a = sigma_a*e_a;
}
model {
  y ~ normal(a[group], sigma_y);
  e_a ~ normal(0, 1);
}

R script:

library("cmdstanr")

N <- 1000
J <- 50
group <- rep(1:J, rep(N/J, J))
sigma_y <- 2
sigma_a <- 0.5
a <- rnorm(J, 0, sigma_a)
y <- rnorm(N, a[group], sigma_y)

data1 <- list(N=N, J=J, group=group, y=y)

M1 <- cmdstan_model("scaling_1.stan")
fit_1 <- M1$sample(data=data1, refresh=0, parallel_chains=4, iter_warmup=1000, iter_sampling=1000)

M2 <- cmdstan_model("scaling_2.stan")
fit_2 <- M2$sample(data=data1, refresh=0, parallel_chains=4, iter_warmup=1000, iter_sampling=1000)

fit_1$summary()
fit_2$summary()

The reason there’s no error here is that the improper uniform prior on sigma_a is consisent with sigma_a = 0. If you use an inverse gamma or lognormal on sigma_a, you’ll get a divergence warning. Half normal shouldn’t throw an error.

The real problem here is the reduction in dimension. a is declared with J degrees of freedom, but with sigma_a = 0, the values of e_a are ignored.

Now what’ll happen during warmup is that we’ll fall back and lower step size and try again and make progress. But the warning is for a real underflow error.

I will ask again: should we just get rid of this warning from having a zero multiplier? I honestly don’t care one way or the other, but I suspect @andrewgelman’s going to keep asking about it until a solid decision is made.

Just to check, I re-did the above simulations adding the following line to the model blog of both Stan programs:

sigma_a ~ lognormal(1, 1);

And the same thing happened as described in my earlier post.

Then I redid with the following line instead:

sigma_a ~ normal(1, 1);

(This is actually half-normal because sigma_a is constrained to be positive.)

Again, pretty much the same thing happened.

So I don’t think the prior for sigma_a is the issue.

Also, just to be clear, it’s not that I “keep asking about it”! I asked about it and then there was some discussion on the thread, so I updated with a cleaner version of the model. It’s interesting that the error occurs with the implicit scaling but not with the hand-coded scaling.