Initial value rejected, but the likelihood exists for that initial value

bnicenboim · July 11, 2020, 5:52am

Hi,

Before I start and to prevent people from trying to avoid my long and not straight-forward model, I’m not expecting anyone to understand and debug the model for me, I think I have some experience with Stan, but I’m completely lost with what’s happening here.

I have a model of sentence processing which is struggling to start on different subsets of my data (reaction times of words in ~200 sentences).The likelihood seems to always exist for the initial value, but depending on the subset of data, the model will fail to start spitting “Gradient evaluated at the initial value is not finite.”

I’m attaching the model here ez_hmm_1.stan (7.6 KB) for reference. The thing is that I simulated the data, I fixed all the parameters except for one, and I identified some sentences that the model can’t deal with. That is, for some of the sentences, the model just run fine, but the following, for example, is problematic. As far as I can tell, there is nothing special here.

list(lnfreq = c(-3.21959716340167, -10.6723729927226, -3.96878841691886, 
-3.78636690441316, -5.15911824237188, -3.52508933292279, -11.7190576704579, 
-9.32291639942472, -3.21959716340167, -6.41942791246034, -6.75261349382341, 
-5.55571242934684, -4.62808966562588, -8.76201613003949, -3.78636690441316, 
-8.02507079950312, -3.52508933292279, -9.40283110709808, -8.24746545540302, 
-8.94316117193497), wrapup_f = c(FALSE, FALSE, FALSE, FALSE, 
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, 
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE), RT = c(166.490950141974, 
300.90493554587, 181.674439595697, 221.628223694201, 152.920867070158, 
308.119471128247, 125.048953278296, 295.45315025367, 172.506950010066, 
125.644470180417, 317.8413699003, 163.524005409949, 180.510772105012, 
131.258886202743, 381.342819843197, 150.222320717369, 206.504485197978, 
162.884834661625, 384.054060772294, 164.271852972529), word_pos = c(1, 
2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 
20), N_obs = 20L, verbose = 0, onlyprior = 0)

I’m using it to trying to debug the model, here I use one iteration, and an initial value of 0, and a print statement of the likelihood.

fit_ez_hmm <- stan("ez_hmm_1.stan",
                   data = ls_pred, init = 0, iter=1, chain = 1)
fit_ez_hmm

And I get :

The NEXT version of Stan will not be able to parse your Stan program.
Please open an issue at
 https://github.com/stan-dev/stanc3/issues 
if you can share or at least describe your Stan program. This will help ensure that Stan
continues to work on your Stan programs in the future. Thank you!
This message can be avoided by wrapping your function call inside suppressMessages().
c("0", "\nSyntax error in 'string', line 117, column 16 to column 17, parsing error:\n\nExpected a \";\" after \"print(...)\".\n\n\n")
hash mismatch so recompiling; make sure Stan code ends with a blank line

SAMPLING FOR MODEL 'ez_hmm_1' NOW (CHAIN 1).
Chain 1: [-4.28026,-5.06026,-4.54302,-6.03513,-4.54653,-5.75279,-5.25075,-5.90587,-4.54448,-5.43742,-5.7158,-4.53386,-4.67623,-5.08949,-6.39984,-4.67007,-5.29535,-4.64996,-6.53748,-4.59851]

Chain 1: [-4.28026,-5.06026,-4.54302,-6.03513,-4.54653,-5.75279,-5.25075,-5.90587,-4.54448,-5.43742,-5.7158,-4.53386,-4.67623,-5.08949,-6.39984,-4.67007,-5.29535,-4.64996,-6.53748,-4.59851]

Chain 1: Rejecting initial value:
Chain 1:   Gradient evaluated at the initial value is not finite.
Chain 1:   Stan can't start sampling from this initial value.
[1] "Error in sampler$call_sampler(args_list[[i]]) : Initialization failed."
error occurred during calling the sampler; sampling not done

Notice that the printed values in [] are the likelihood of every word in the sentence. So it’s not NaN or infinity, why can’t the model start?
I exposed the likelihood function and I tried with different values, and it seems fine. In fact, here I’m plotting it against possible values of alpha (the true value is 2, and it’s constrained to be positive and it’s the only parameter in my model in these tests), and it seems nice and smooth.

What can be happening here?

I’m using rstan_2.21.1, and I’ve also tried with cmdstanr_0.0.0.9008 using cmdstan-2.23.0.

Thanks!
Bruno

nhuurre · July 11, 2020, 6:57am

“Gradient is not finite” means the autodiff is failing somehow. It’s difficult to debug since you can’t print the gradients. I’d start looking at those log_sum_exps.
If x=0 the derivative of log(x) is infinite. log_sum_exp(log(x),log(y)) is then equal to log(y) and therefore finite when y>0 but the autodiff calculates the derivative with respect to x as zero times infinity which is indeterminate. This is a problem if x depends on parameters.

bnicenboim · July 11, 2020, 8:05am

Thanks for the fast answer.
I thought that having -infinity in a log_sum_exp wasn’t a problem. But in any case, I printed all the insides of my log_sum_exp functions and there is no infinity there.
Do you spot any other place where the derivative might be infinite?

nhuurre · July 11, 2020, 8:21am

Maybe s_transform? check if unbounded_sx is finite.

bnicenboim · July 11, 2020, 8:26am

unbounded_sx and s are finite… :/
What else can it be?

bnicenboim · July 11, 2020, 8:27am

or what else should I be looking for?
any trick to identify places where the gradient might break?

nhuurre · July 11, 2020, 8:31am

The gradient usually breaks when something becomes infinite. I was looking for places where you have something like a + exp(b) because then the result can be finite even if b is infinite. I don’t know.

bnicenboim · July 11, 2020, 8:37am

ok,
so this
lfailp_prev[1] = log1m_exp(lfailp[n - 1]);
is always zero in n=2, because lfailp[1] is -Inf.
But this isn’t a problem for some subsets of the data. So it’s not any infinity right, just when it’s summed to a parameter of the model, right?

bnicenboim · July 12, 2020, 6:31pm

Ok, thanks for the tips, I commented lines and parts of lines until I found the “problem”.

If I remove the beta_lcdf in n =18, i =16 for the following line there the model starts.
lfailp_terms[i + 1] = lfailp_prev[i + 1] + beta_lcdf(.5| a[i + 1]/scale, b[i + 1]/scale);
But there is no infinity there, and I can’t spot anything weird, the two parameters are 556.025 and
557.204 and the value of the beta_lcdf there is -0.665332

So how come the gradient is infinite there? What should I do there?

Bests,
Bruno

nhuurre · July 12, 2020, 7:11pm

Yup, can confirm, beta_lcdf does not work correctly for values above 540 or so. I bet the problem is here

github.com

stan-dev/math/blob/0ee81d676d684d6e4448079d59f1a44ab942ce5b/stan/math/prim/fun/grad_reg_inc_beta.hpp#L40


* @param[in] betaAB the value of <code>beta(a, b)</code>
*/
template <typename T>
void grad_reg_inc_beta(T& g1, T& g2, const T& a, const T& b, const T& z,
                      const T& digammaA, const T& digammaB,
                      const T& digammaSum, const T& betaAB) {
 using std::exp;
 T dBda = 0;
 T dBdb = 0;
 grad_inc_beta(dBda, dBdb, a, b, z);
 T b1 = beta(a, b) * inc_beta(a, b, z);
 g1 = (dBda - b1 * (digammaA - digammaSum)) / betaAB;
 g2 = (dBdb - b1 * (digammaB - digammaSum)) / betaAB;
}

}  // namespace math
}  // namespace stan
#endif

That beta(a, b) is going to underflow and then it’s zero times infinity.

So the Beta CDF is broken when the distribution is too narrow. You’ll have to find some workaround. Maybe approximating the distribution with a gaussian is good enough?

bnicenboim · July 12, 2020, 7:15pm

But I get a value out of the beta_cdf, namely -0.665332. Why is this a problem?

Can you give me some pointers? How do I approximate a beta_cdf with a gaussian?

nhuurre · July 12, 2020, 7:27pm

Beta distribution has mean \frac{a}{a+b} and variance \frac{ab}{\left(a+b\right)^2 \left(a+b+1\right)} so it’s approximately

normal_lcdf(0.5 | a/(a+b), sqrt(a*b/(a+b+1))/(a+b))

and the bigger a and b are the more accurate the approximation.

bnicenboim · July 12, 2020, 8:02pm

Thanks, it actually works with the normal_lcdf, but I have many questions now:

First set of questions, can you explain me what exactly was the problem if the beta_lcdf was not really overflowing? Was only the gradient overflowing? (It would be great if there was a better error, I guess Stan could check if the parameters of the cdf are too large right? I spent weeks with this :/)
When exactly does the beta_lcdf breaks then? Is there a way to know?

Another set of questions is the following. a and b are only in extreme cases large, when they are smallish, the approximation won’t work too well, right? As far as I understand how Stan works, I can’t do and ifelse and check if a+b>1000 then use the normal approximation, right? Is there some way around this?

nhuurre · July 12, 2020, 8:14pm

Yes. The implementation computes the value of the CDF and it’s gradient separately. The function grad_reg_inc_beta quoted above is used only for the gradient.

In this case you probably can do if-else check because both sides are still approximating the same function. A small discontinuity does not invalidate the fit, it just slows convergence.

bnicenboim · July 12, 2020, 8:31pm

thank you so much!
This was very informative, now is there a way to know when does the gradient of the cdf break?

nhuurre · July 12, 2020, 8:37pm

Unfortunately I don’t think so. If there is then maybe @bbbales2 knows more.

bbbales2 · July 12, 2020, 9:15pm

Hmm, I don’t think there’s any good ways to know our CDF is broken. They all break in places, and especially with log scale parameters those can be surprisingly small numbers. But that doesn’t seem like a great response considering this:

Did you ever have error messages more informative than this:

Chain 1: Rejecting initial value:
Chain 1:   Gradient evaluated at the initial value is not finite.
Chain 1:   Stan can't start sampling from this initial value.

I don’t see explicit checks that the computed gradients of the integrals are not infinite. Those could be added. I guess the ideal message here is that the gradients of the beta are becoming infinite?

bnicenboim · July 13, 2020, 5:14am

No, that’s what I got.

Yes, sure, but it’s still hard to fix it. I guess that information about when a cdf will over/underflow (right now, this is only explicit for the normal_lcdf), and when the gradients of cdf will over/underflow would be very useful. I’m still unsure on when to replace the exact cdf for an approximation. Another thing is that when a model fails you get an empty object, the gradients for the initial values (and initial values) could be added, right?

I guess that, ideally, Stan could deal with the infinite gradient for me (at least in some cases), right? Stan could check if the gradient of the beta is going to break because of parameters that are too large, use the normal_lcdf and throw a warning, instead of an error.
Alternatively, another solution is a way to check the gradient in the stan program and do an approximation if it’s infinite (and replace it).

bbbales2 · July 13, 2020, 6:11pm

Hmm, I don’t even think we try to track this stuff now. We usually find out about overflows by accident, and we haven’t tried to put if conditions around all the things that might blow up. There’s always wishful thinking that functions will work everywhere they should mathematically (and then they don’t). I assume we haven’t tried to build guardrails everywhere because it would take a ton of work and just make models that do run without errors really slow from all the checks.

I think for the most part if there’s a more stable way to compute something, we go for that though. We’d need some sort of more stable reference to know if something went off the rails, and if we had that reference we’d use that.

Yeah this is another thing altogether. Especially if you aren’t sure if the approximation is going to affect anything.

Yeah, I don’t think we could deal with the automatically, but it seems like it might be possible to make it much easier to find these things. Doesn’t really help you short term, but I made a topic to discuss this a little (Debugging overflows in values/gradients).

Topic		Replies	Views
Gradient evaluated at the initial value is not finite although the lp is finite Modeling cognitive-science	12	3900	April 13, 2020
Yet another case of (Rejecting initial value: Gradient evaluated at the initial value is not finite.) Modeling	3	536	July 18, 2022
Troubleshooting ODE model Modeling specification , ode	10	950	August 21, 2020
Failure to start because of initial values Modeling	16	3526	July 31, 2017
Problem with using rstan General	40	1996	September 11, 2018

Initial value rejected, but the likelihood exists for that initial value

Related topics