Debugging overflows in values/gradients

What if in the same way we’re talking about wrapping functions with performance counters we provide a utility to wrap functions with nan/inf checks?

So like say we have some Stan code that looks like:

y ~ normal(X * b + intercept, sigma);

That vaguely translates in C++ to something like:

target += normal_lpdf(y, X * b + intercept, sigma, msgs);

Couldn’t we write a C++ function that wraps this like:

target += search_for_nans_and_infs(normal_lpdf, y, X * b + intercept, sigma, msgs);

that will effectively just call normal_lpdf(y, X * b + intercept, sigma, msgs), but it can search all the inputs for nans and infs on call, and then it can throw a vari on the stack so that it can search for nans and infs in the gradients on the way back.

If it finds a NaN in the values or gradients, then it can print all the argument values and stuff out so that someone can figure out what’s going on where in their problem.

This thread had me thinking about this: Initial value rejected, but the likelihood exists for that initial value . In this example the gradients blew up and it took a long time to track down (“I spent weeks with this :/”).

The only error that Stan printed was:

Chain 1: Rejecting initial value:
Chain 1:   Gradient evaluated at the initial value is not finite.
Chain 1:   Stan can't start sampling from this initial value.

which is like correct, but it doesn’t give info on how to dig down further. There’s some unresolved difficulty in the thread of how to resolve the problem (which this wouldn’t address), but I think we could make it easier to find the problems.

Maybe in the future we could do something like:

Chain 1:   Gradient evaluated at the initial value is not finite.
Chain 1:     Rerun model with --debug_infs to get more information

And then we could automatically instrument all the function calls with nan/inf check varis and print debugging information.

It could be like:

Adjoint of alpha is -inf in y = beta_lpdf(x = 0.7, alpha = 100, beta = 200) when
adjoint of y is 1.7.

Maybe we’d have a default argument that says stop after printing the first 5 errors to avoid blowing up everyone’s terminals with the output.

How useful would something like this be? Numeric problems are hard to track down because checks aren’t comprehensive – sometimes an overflow is just fine. So maybe if we have a way to turn on/off some pedantic checking we could at least make it easy to find the problems.

4 Likes

A straightforward stopgap measure that would have caught the problem in the other thread is to check the partials in operands_and_partials.build() and throw an exception if there are any non-finite values. That change would automatically “instrument” all distributions, both PDFs and CDFs.
Some functions can return a finite value when given an infinite input but gradients always break when infinities are involved because infinity times zero is still NaN.

2 Likes

But in certain cases it makes sense for it to be zero, right? Is there a general reason, computationally, that Inf times 0.0 should be NaN? Please forgive my ignorance.

Yeah that does sound easy.

We already have inf checks in some really low-level places (that I don’t think are doing so much): https://github.com/stan-dev/math/blob/develop/stan/math/rev/core/operator_multiplication.hpp#L22

The difference here would be we throw an error if we find a nan or an inf.

That sounds like it’d be useful in that it won’t be too much of a performance burden. It won’t check other conditions like positivity, but it will check some.

Yes, because it’s how floating point arithmetic works in CPUs according to the IEEE 754 spec. That means it’s also how floating point works in any efficient implementation.

What did you think Inf * 0 should be? Inf * 1 is still Inf, but Inf / Inf is NaN.

1 Like

Measure-theoretically, zero.

I guess that what @maxbiostat was pointing at is that in many practical contexts Inf*0 results from evaluating a function at a boundary and for many of those cases a one sided limit exists and is finite and it is reasonable to return the limit. Obviously this can’t be done automatically, but it IMHO allows for well defined models to have inf as an intermediate value, so we might want the checks to stay optional.

What’s an example here where our autodiff would be able to recover?

You are right that all the examples I can think of are somewhat contrived and would require branching - and when you branch, you may as well branch in a way to avoid computing the 0 * Inf in the first place.