Model for multinomial data fails because its gradient returns NaN, possible autodiff issue

I have a model for a multinomial distribution that fails because one component of the gradient returns NaN. The interesting thing is that the likelihood function itself works; it is only the gradient that fails, which suggests a problem with auto-differentiation to me. We have also tried various re-parameterisations, but this did not help.

functions {
  real pinternal1(real a, real b, real g1) {
    return exp(-exp(-g1 + a) + b);
  }
  real pinternal2(real a, real b, real g1, real g2) {
    return exp(-exp(-g1 + a) - exp(-g2 + a) + b);
  }
  real pfun(real l, real u, real g1, real g2) {
    real denominator = exp(g1) + exp(g2);
    return (pinternal1(l, g1, g1) - pinternal1(u, g1, g1)
            + pinternal1(l, g2, g1) - pinternal1(u, g2, g1)
            + pinternal2(u, g2, g1, g2) - pinternal2(l, g2, g1, g2))
           / denominator;
  }
  real gumbelmult(array[] int y, real mu,
                  real crc, real crl, real crh) {
    vector[8] p_tpres;
    vector[3] thres;

    // calculate thresholds
    thres[1] = crc - exp(crl);
    thres[2] = crc;
    thres[3] = crc + exp(crh);

    p_tpres[1] = pfun(negative_infinity(), thres[1], 0, mu);
    p_tpres[2] = pfun(thres[1], thres[2], 0, mu);
    p_tpres[3] = pfun(thres[2], thres[3], 0, mu);
    p_tpres[4] = pfun(thres[3], positive_infinity(), 0, mu);

    p_tpres[5] = pfun(negative_infinity(), thres[1], mu, 0);
    p_tpres[6] = pfun(thres[1], thres[2], mu, 0);
    p_tpres[7] = pfun(thres[2], thres[3], mu, 0);
    p_tpres[8] = pfun(thres[3], positive_infinity(), mu, 0);

    return multinomial_lpmf(y | p_tpres);
  }
}
data {
  array[8] int dvec;
}
parameters {
  real<lower=0> g;  
  real crc;  
  real crl;  
  real crh; 
}
transformed parameters {
  real lprior = 0;  
  lprior += student_t_lpdf(g | 3, 1, 2);
  lprior += normal_lpdf(crc | 0, 0.5);
  lprior += normal_lpdf(crl | -0.5, 0.5);
  lprior += normal_lpdf(crh | -0.5, 0.5);
}
model {
  target += gumbelmult(dvec, g, crc, crl, crh);
  target += lprior;
}

When trying this in rstan, it fails:

library("rstan")
dat <- c(347L, 622L, 529L, 186L, 409L, 774L, 1048L, 885L)
fit1 <- stan(model_code = multinom_mod, iter = 10, verbose = FALSE, 
             data = list(dvec = dat), init = 0)

Error:

SAMPLING FOR MODEL 'anon_model' NOW (CHAIN 1).
Chain 1: Rejecting initial value:
Chain 1:   Gradient evaluated at the initial value is not finite.
Chain 1:   Stan can't start sampling from this initial value.
[1] "Error : Initialization failed."
error occurred during calling the sampler; sampling not done

The interesting thing is that the likelihood function itself works outside of the sampler and can be used to fit the data:

expose_stan_functions(fit1)

newll <- function(x) -log_prob(fit1, x)
(res <- optim(c(0, -0.5, 0.5, -0.5), newll))
# $par
# [1] -0.4753573  0.4441750 -0.2198889 -0.3659075
# 
# $value
# [1] 32.23814

When looking at the gradient directly, it seems to be NaN for the first parameter:

grad_log_prob(fit1, res$par)
# [1]         NaN 0.526258994 0.008817364 0.211254299
# attr(,"log_prob")
# [1] -32.23814

It's quite possible for derivatives to be badly behaved numerically even when the values themselves are fine. It can definitely be a problem for autodiff, but it's rarely a bug that we can fix, since we're stuck with floating-point arithmetic applied to the program exactly as the user wrote it.

As soon as you get stacked exp() calls like the ones in your code, you should be thinking about how to get rid of them. The exp() function is ridiculously prone to overflow and underflow: exp(-exp(x)) will underflow to zero for x > 7, for example.
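To see where the cliff sits in double precision, here's a tiny throwaway Stan program (generated quantities only, nothing from the model above) that evaluates both sides of it:

generated quantities {
  real ok = exp(-exp(6.5));   // about 2e-289, still representable
  real under = exp(-exp(7));  // exp(-1096.6) underflows to exactly 0
  real log_ok = -exp(6.5);    // the same quantity on the log scale: about -665
  real log_under = -exp(7);   // about -1097, perfectly well-behaved
}

Once a value has underflowed to zero, any gradient that multiplies it by a large factor is at the mercy of 0 times infinity.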

The typical solution is to work on the log scale. Everywhere you define a value, take the log value instead. Sometimes you can push this out so that you never need to exponentiate at all; built-ins like log_sum_exp and log_diff_exp will do it for you, as in the sketch below.
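Here is a minimal sketch of what that might look like for the functions above. The names lpinternal1, lpinternal2, and lpfun are mine, not anything built in, and it assumes the bracketed difference in pfun is strictly positive, which it has to be anyway for pfun to return a probability:

functions {
  // log of pinternal1: log(exp(-exp(a - g1) + b)) = b - exp(a - g1)
  real lpinternal1(real a, real b, real g1) {
    return b - exp(a - g1);
  }
  // log of pinternal2, by the same cancellation
  real lpinternal2(real a, real b, real g1, real g2) {
    return b - exp(a - g1) - exp(a - g2);
  }
  // log of pfun: group the positive and negative terms, then combine
  // them with log_sum_exp / log_diff_exp instead of exponentiating
  real lpfun(real l, real u, real g1, real g2) {
    real pos = log_sum_exp({lpinternal1(l, g1, g1),
                            lpinternal1(l, g2, g1),
                            lpinternal2(u, g2, g1, g2)});
    real neg = log_sum_exp({lpinternal1(u, g1, g1),
                            lpinternal1(u, g2, g1),
                            lpinternal2(l, g2, g1, g2)});
    return log_diff_exp(pos, neg) - log_sum_exp(g1, g2);
  }
}

Note that this alone does not get rid of the infinities in the bounds; that's the next point.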

Derivatives also don't like values like negative_infinity(). The derivatives are usually not defined there, or they're infinite, and as soon as something becomes infinite, products and sums often devolve to NaN.
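In this model the likely culprit is pinternal1 evaluated at u = positive_infinity(): the inner exp(-g1 + u) is +inf, the outer exp(...) underflows to 0, and the chain rule then multiplies a zero by an infinity, which is NaN, even though the true limit of the derivative is 0. One defensive option is to clamp the bounds to something finite before anything gets exponentiated. A sketch (is_inf is built in; the cutoff of 100 is an arbitrary choice you'd want to check against your thresholds):

  real pfun_clamped(real l, real u, real g1, real g2) {
    // finite stand-ins for +/- infinity so no inf enters the autodiff tree
    real lb = is_inf(l) ? -100 : l;
    real ub = is_inf(u) ? 100 : u;
    return pfun(lb, ub, g1, g2);
  }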

For instance, your pfun function is returning elements of a simplex, but you can instead return those on the log scale and use multinomial_logit_lpmf instead of multinomial_lpmf.
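Putting those pieces together, gumbelmult might look something like this. It's a sketch using the hypothetical lpfun from above, with the infinite bounds already replaced by finite ones; multinomial_logit_lpmf applies softmax to its argument, so it matches multinomial_lpmf as long as the eight probabilities sum to one, which multinomial_lpmf requires anyway:

  real gumbelmult(array[] int y, real mu,
                  real crc, real crl, real crh) {
    vector[8] log_p;
    vector[3] thres;

    thres[1] = crc - exp(crl);
    thres[2] = crc;
    thres[3] = crc + exp(crh);

    // -100 and 100 are finite stand-ins for the infinite bounds
    log_p[1] = lpfun(-100, thres[1], 0, mu);
    log_p[2] = lpfun(thres[1], thres[2], 0, mu);
    log_p[3] = lpfun(thres[2], thres[3], 0, mu);
    log_p[4] = lpfun(thres[3], 100, 0, mu);

    log_p[5] = lpfun(-100, thres[1], mu, 0);
    log_p[6] = lpfun(thres[1], thres[2], mu, 0);
    log_p[7] = lpfun(thres[2], thres[3], mu, 0);
    log_p[8] = lpfun(thres[3], 100, mu, 0);

    // softmax is invariant to constant shifts, so the log_sum_exp(g1, g2)
    // normalisation inside lpfun could even be dropped
    return multinomial_logit_lpmf(y | log_p);
  }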

Thanks for the insight. It is indeed enough in this case to change the limits at which we evaluate pfun from +/- infinity to +/- 100:
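For reference, only the four calls in gumbelmult with infinite bounds change:

  p_tpres[1] = pfun(-100, thres[1], 0, mu);
  p_tpres[4] = pfun(thres[3], 100, 0, mu);
  p_tpres[5] = pfun(-100, thres[1], mu, 0);
  p_tpres[8] = pfun(thres[3], 100, mu, 0);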
