Initial value rejected, but the likelihood exists for that initial value

Hi,

Before I start, and so that people don’t feel they have to wade through my long and not-straightforward model: I’m not expecting anyone to understand and debug the model for me. I have some experience with Stan, but I’m completely lost with what’s happening here.

I have a model of sentence processing which struggles to start on different subsets of my data (reaction times of words in ~200 sentences). The likelihood always seems to exist at the initial value, but depending on the subset of data, the model fails to start, spitting out “Gradient evaluated at the initial value is not finite.”

I’m attaching the model here, ez_hmm_1.stan (7.6 KB), for reference. The thing is that I simulated the data, fixed all the parameters except one, and identified some sentences that the model can’t deal with. That is, for some of the sentences the model runs just fine, but the following one, for example, is problematic. As far as I can tell, there is nothing special here.

list(lnfreq = c(-3.21959716340167, -10.6723729927226, -3.96878841691886, 
-3.78636690441316, -5.15911824237188, -3.52508933292279, -11.7190576704579, 
-9.32291639942472, -3.21959716340167, -6.41942791246034, -6.75261349382341, 
-5.55571242934684, -4.62808966562588, -8.76201613003949, -3.78636690441316, 
-8.02507079950312, -3.52508933292279, -9.40283110709808, -8.24746545540302, 
-8.94316117193497), wrapup_f = c(FALSE, FALSE, FALSE, FALSE, 
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, 
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE), RT = c(166.490950141974, 
300.90493554587, 181.674439595697, 221.628223694201, 152.920867070158, 
308.119471128247, 125.048953278296, 295.45315025367, 172.506950010066, 
125.644470180417, 317.8413699003, 163.524005409949, 180.510772105012, 
131.258886202743, 381.342819843197, 150.222320717369, 206.504485197978, 
162.884834661625, 384.054060772294, 164.271852972529), word_pos = c(1, 
2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 
20), N_obs = 20L, verbose = 0, onlyprior = 0)

I’m using it to try to debug the model: here I use one iteration, an initial value of 0, and a print statement for the likelihood.

fit_ez_hmm <- stan("ez_hmm_1.stan",
                   data = ls_pred, init = 0, iter = 1, chains = 1)
fit_ez_hmm 

And I get :

The NEXT version of Stan will not be able to parse your Stan program.
Please open an issue at
 https://github.com/stan-dev/stanc3/issues 
if you can share or at least describe your Stan program. This will help ensure that Stan
continues to work on your Stan programs in the future. Thank you!
This message can be avoided by wrapping your function call inside suppressMessages().
c("0", "\nSyntax error in 'string', line 117, column 16 to column 17, parsing error:\n\nExpected a \";\" after \"print(...)\".\n\n\n")
hash mismatch so recompiling; make sure Stan code ends with a blank line

SAMPLING FOR MODEL 'ez_hmm_1' NOW (CHAIN 1).
Chain 1: [-4.28026,-5.06026,-4.54302,-6.03513,-4.54653,-5.75279,-5.25075,-5.90587,-4.54448,-5.43742,-5.7158,-4.53386,-4.67623,-5.08949,-6.39984,-4.67007,-5.29535,-4.64996,-6.53748,-4.59851]

Chain 1: [-4.28026,-5.06026,-4.54302,-6.03513,-4.54653,-5.75279,-5.25075,-5.90587,-4.54448,-5.43742,-5.7158,-4.53386,-4.67623,-5.08949,-6.39984,-4.67007,-5.29535,-4.64996,-6.53748,-4.59851]

Chain 1: Rejecting initial value:
Chain 1:   Gradient evaluated at the initial value is not finite.
Chain 1:   Stan can't start sampling from this initial value.
[1] "Error in sampler$call_sampler(args_list[[i]]) : Initialization failed."
error occurred during calling the sampler; sampling not done

Notice that the printed values in [] are the likelihoods of every word in the sentence. So none of them is NaN or infinite; why can’t the model start?
I exposed the likelihood function and tried it with different values, and it seems fine. In fact, here I’m plotting it against possible values of alpha (the true value is 2; it’s constrained to be positive, and it’s the only parameter in my model in these tests), and it looks nice and smooth.

What could be happening here?

I’m using rstan_2.21.1, and I’ve also tried with cmdstanr_0.0.0.9008 using cmdstan-2.23.0.

Thanks!
Bruno

“Gradient is not finite” means the autodiff is failing somehow. It’s difficult to debug since you can’t print the gradients. I’d start looking at those log_sum_exps.
If x = 0, the derivative of log(x) is infinite. log_sum_exp(log(x), log(y)) is then equal to log(y) and therefore finite when y > 0, but the autodiff calculates the derivative with respect to x as zero times infinity, which is indeterminate. This is a problem if x depends on parameters.
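Spelling the chain rule out makes the indeterminate form explicit: with u = \log x, autodiff computes

\frac{d}{dx} \text{log\_sum\_exp}(\log x, \log y) = \frac{x}{x+y} \cdot \frac{1}{x}

On paper this simplifies to the finite \frac{1}{x+y}, but autodiff evaluates the two factors separately, so at x = 0 it multiplies zero by infinity and gets NaN.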

2 Likes

Thanks for the fast answer.
I thought that having -infinity inside a log_sum_exp wasn’t a problem. But in any case, I printed everything inside my log_sum_exp calls and there is no infinity there.
Do you spot any other place where the derivative might be infinite?

Maybe s_transform? Check whether unbounded_sx is finite.

unbounded_sx and s are finite… :/
What else can it be?

Or what else should I be looking for?
Any trick to identify places where the gradient might break?

The gradient usually breaks when something becomes infinite. I was looking for places where you have something like a + exp(b), because then the result can be finite even if b is infinite. I don’t know.

OK, so this
lfailp_prev[1] = log1m_exp(lfailp[n - 1]);
is always zero at n = 2, because lfailp[1] is -Inf (and log1m_exp(-Inf) = log(1 - 0) = 0).
But this isn’t a problem for some subsets of the data. So not just any infinity is a problem, right? Only when it’s combined with a parameter of the model?

1 Like

OK, thanks for the tips. I commented out lines and parts of lines until I found the “problem”.

If I remove the beta_lcdf at n = 18, i = 16 in the following line, the model starts.
lfailp_terms[i + 1] = lfailp_prev[i + 1] + beta_lcdf(.5 | a[i + 1]/scale, b[i + 1]/scale);
But there is no infinity there, and I can’t spot anything weird: the two parameters are 556.025 and 557.204, and the value of the beta_lcdf there is -0.665332.

So how come the gradient is infinite there? What should I do there?

Best,
Bruno

Yup, I can confirm: beta_lcdf does not work correctly for parameter values above 540 or so. I bet the problem is here [in grad_reg_inc_beta, quoted from the Stan Math library]:

That beta(a, b) is going to underflow, and then it’s zero times infinity.
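That is plausible for the values above: lbeta(556.025, 557.204) is roughly -773, and exp underflows to exactly zero below about -745 in double precision. A minimal sketch to check it (run with the fixed_param algorithm; the numbers in the comments are approximate):

// Sketch: confirm that beta(a, b) underflows for a, b near 556.
generated quantities {
  real lb = lbeta(556.025, 557.204); // roughly -773
  real beta_ab = exp(lb);            // exactly 0: double-precision underflow
}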

So the Beta CDF is broken when the distribution is too narrow. You’ll have to find some workaround. Maybe approximating the distribution with a Gaussian is good enough?

But I get a value out of the beta_lcdf, namely -0.665332. Why is this a problem?

Can you give me some pointers? How do I approximate the Beta CDF with a Gaussian?

The Beta distribution has mean \frac{a}{a+b} and variance \frac{ab}{\left(a+b\right)^2 \left(a+b+1\right)}, so beta_lcdf(0.5 | a, b) is approximately

normal_lcdf(0.5 | a/(a+b), sqrt(a*b/(a+b+1))/(a+b))

and the bigger a and b are, the more accurate the approximation.
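As a sketch, this could be wrapped in a user-defined function (the name beta_lcdf_approx is mine, not a built-in):

functions {
  // Moment-matching normal approximation to beta_lcdf(x | a, b);
  // accurate when a and b are both large, i.e. when the Beta is narrow.
  real beta_lcdf_approx(real x, real a, real b) {
    real mu = a / (a + b);                            // Beta mean
    real sigma = sqrt(a * b / (a + b + 1)) / (a + b); // Beta sd
    return normal_lcdf(x | mu, sigma);
  }
}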

1 Like

Thanks, it actually works with the normal_lcdf, but I have many questions now:

First set of questions: can you explain what exactly the problem was if the beta_lcdf was not really overflowing? Was only the gradient overflowing? (It would be great if there were a better error; I guess Stan could check whether the parameters of the CDF are too large, right? I spent weeks on this :/)
When exactly does the beta_lcdf break, then? Is there a way to know?

Another set of questions is the following: a and b are large only in extreme cases; when they are smallish, the approximation won’t work too well, right? As far as I understand how Stan works, I can’t do an if-else that checks whether a + b > 1000 and then uses the normal approximation, right? Is there some way around this?

Yes. The implementation computes the value of the CDF and its gradient separately. The function grad_reg_inc_beta quoted above is used only for the gradient.

In this case you probably can do an if-else check, because both sides are still approximating the same function. A small discontinuity does not invalidate the fit; it just slows convergence.
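A sketch of what that check could look like as a user-defined function (the name beta_lcdf_safe and the threshold of 1000 are assumptions, not tested values):

functions {
  // Exact beta_lcdf when it is safe; moment-matching normal
  // approximation when the Beta is too narrow for its gradient.
  real beta_lcdf_safe(real x, real a, real b) {
    if (a + b > 1000) {
      return normal_lcdf(x | a / (a + b),
                         sqrt(a * b / (a + b + 1)) / (a + b));
    }
    return beta_lcdf(x | a, b);
  }
}

The problematic line would then become lfailp_terms[i + 1] = lfailp_prev[i + 1] + beta_lcdf_safe(0.5, a[i + 1] / scale, b[i + 1] / scale);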

1 Like

Thank you so much!
This was very informative. Now, is there a way to know when the gradient of a CDF breaks?

Unfortunately I don’t think so. If there is then maybe @bbbales2 knows more.

Hmm, I don’t think there’s any good way to know when our CDFs are broken. They all break in places, and especially with log-scale parameters the breaking points can be at surprisingly small numbers. But that doesn’t seem like a great response considering this:

Did you ever have error messages more informative than this:

Chain 1: Rejecting initial value:
Chain 1:   Gradient evaluated at the initial value is not finite.
Chain 1:   Stan can't start sampling from this initial value.

I don’t see explicit checks that the computed gradients of the integrals are finite. Those could be added. I guess the ideal message here would say that the gradients of the beta are becoming infinite?

No, that’s what I got.

Yes, sure, but it’s still hard to fix. I guess that information about when a CDF will over/underflow (right now this is only explicit for the normal_lcdf), and when the gradients of a CDF will over/underflow, would be very useful. I’m still unsure about when to replace the exact CDF with an approximation. Another thing: when a model fails you get an empty object; the gradients at the initial values (and the initial values themselves) could be added to it, right?

I guess that, ideally, Stan could deal with the infinite gradient for me (at least in some cases), right? Stan could check whether the gradient of the beta is going to break because the parameters are too large, use the normal_lcdf, and throw a warning instead of an error.
Alternatively, another solution would be a way to check the gradient inside the Stan program and substitute an approximation when it’s infinite.

Hmm, I don’t even think we try to track this stuff now. We usually find out about overflows by accident, and we haven’t tried to put if conditions around all the things that might blow up. There’s always wishful thinking that functions will work everywhere they should mathematically (and then they don’t). I assume we haven’t tried to build guardrails everywhere because it would take a ton of work and just make models that do run without errors really slow from all the checks.

I think for the most part if there’s a more stable way to compute something, we go for that though. We’d need some sort of more stable reference to know if something went off the rails, and if we had that reference we’d use that.

Yeah this is another thing altogether. Especially if you aren’t sure if the approximation is going to affect anything.

Yeah, I don’t think we could deal with this automatically, but it seems like it might be possible to make it much easier to find these things. It doesn’t really help you short term, but I made a topic to discuss this a little (Debugging overflows in values/gradients).

1 Like