Differing behaviour when using explicit vs implicit ints?

While using a custom probability density function, I’m seeing different behaviour from Stan (cmdstanr interface on an M1 Mac) depending on whether I specify an integer term explicitly as a local variable declaration or simply use the value implicitly without storing it as a local variable. Specifically, since I always need the function in vectorised form, common normalisation terms (e.g., the scale parameter) need to be multiplied by the size of the input vector.

I’m starting here because I’m not sure whether this is a bug I should report on Github or whether this is actually desired behaviour. In general, of course, Stan sampling doesn’t work well with integers.

For quick background, I’m implementing a gamma-exponential distribution (a.k.a. a generalised Gumbel distribution) with a fixed mean of zero (see Gavin Crooks’s amazing Field Guide to Continuous Probability Distributions, Chapter 8):

f(x; \lambda, \alpha) = \frac{1}{\lambda \Gamma(\alpha)} \exp\left\{- \alpha \left(\frac{x}{\lambda} - \psi(\alpha) \right) - \exp\left(- \left(\frac{x}{\lambda} - \psi(\alpha)\right)\right) \right\}\ ,

where \psi represents the digamma function.

The implementation in Stan that does work is the following (taking advantage of the relationship to the gamma distribution to get – I hope! – smarter autodiff):

  real gamma_exp_lpdf(vector x, real lambda, real alpha) {
    int K = size(x);
    return
      -K * log(lambda)
      - sum(x) / lambda
      + gamma_lpdf(exp(-x / lambda) | alpha, exp(digamma(alpha)));
  }

There is nothing wrong with that, but because I only need K once, I originally had the function written like this:

  real gamma_exp_lpdf(vector x, real lambda, real alpha) {
    return
      -size(x) * log(lambda)
      - sum(x) / lambda
      + gamma_lpdf(exp(-x / lambda) | alpha, exp(digamma(alpha)));
  }

In principle, these functions should be identical, but in practice, the sampler goes bananas when using the second form, throwing all sorts of divergences and landing nowhere near the correct distribution even when there is nothing else in the model.

Should I have expected this? If so, is there a bit of the documentation that should have made that clear? If not, what would be the most helpful tag or keyword for submitting a bug?

As a side note, if anybody sees a cleverer way to implement this density, please feel free to add that to the thread. I tried several variations before settling on this one. The current version looks strange, but overall, sampling seems to behaves better when I use the built-in gamma_lpdf() than when implementing the log likelihood more directly from the mathematical specification. I presume that is because gamma_lpdf() takes advantage of some mathematical properties of the gamma distribution to optimise the gradient calculations, but it’s hard to be certain.

Hmm there’s definitely a bug here. If I check the gradients, there’s a clear issue:

modcode <- "
functions {
  real gamma_exp_lpdf(vector x, real lambda, real alpha, int which_lpdf) {
    if (which_lpdf == 1) {
      int K = size(x);
      return
        -K * log(lambda)
        - sum(x) / lambda
        + gamma_lpdf(exp(-x / lambda) | alpha, exp(digamma(alpha)));
    } else {
      return
        -size(x) * log(lambda)
        - sum(x) / lambda
        + gamma_lpdf(exp(-x / lambda) | alpha, exp(digamma(alpha)));
    }
  }
}

data {
  int which_lpdf;
}
parameters {
  real<lower=0> lambda;
  real<lower=0> alpha;
  vector[1] x;
}
model {
    target += gamma_exp_lpdf(x | lambda, alpha, which_lpdf);
}
"
library(cmdstanr)
mod <- cmdstan_model(write_stan_file(modcode), compile_model_methods = TRUE)
fit1 <- mod$sample(data = list(which_lpdf = 1), chains = 1, iter_warmup = 1, iter_sampling = 1)
fit2 <- mod$sample(data = list(which_lpdf = 2), chains = 1, iter_warmup = 1, iter_sampling = 1)

> fit1$grad_log_prob(c(0,0,0))
[1]  0.0000000  1.7213702 -0.4385405
attr(,"log_prob")
[1] -1.138675
> fit2$grad_log_prob(c(0,0,0))
[1]  1.844674e+19  1.721370e+00 -4.385405e-01
attr(,"log_prob")
[1] -1.138675

It looks like it’s specifically caused by the negation (i.e., -size(x) is problematic, but not size(x)). Absolutely no idea why though yet, heads up @WardBrian

The generated C++ only has the expected differences:

 |  try {
-|    int K = std::numeric_limits<int>::min();
 |    current_statement__ = 1;
-|    K = stan::math::size(x);
-|    current_statement__ = 2;
-|    return (((-(K) * stan::math::log(lambda)) - (stan::math::sum(x) /
-|           lambda)) +
+|    return (((-(stan::math::size(x)) * stan::math::log(lambda)) -
+|           (stan::math::sum(x) / lambda)) +
 |           stan::math::gamma_lpdf<false>(
 |             stan::math::exp(stan::math::divide(stan::math::minus(x), lambda)),
 |             alpha, stan::math::exp(stan::math::digamma(alpha))));

My first guess is this is a weird bug in the math library where, for some reason, derivatives are propagating through size if it gets inlined, or something?

Ah, got it @andrjohns. stan::math::size returns a size_t, which is a unsigned type, which is not appropriate for doing math like negations with. By assigning it to a variable, it is cast to int (a signed type).

Not sure how we want to handle this @andrjohns – one ‘easy’ solution is to return ssize_t from the size functions, but we could also just return int. In practice you create/index into anything bigger than an int, since that’s all Stan gives you.

We’re pretty inconsistent in the math library:

  • size() returns size_t (equivalent to unsigned long int)
  • rows() returns int
  • cols() returns an Eigen::Index which at least on GCC is equivalent to long int
  • num_elements() returns int
  • dims() returns a vector of int

I opened an issue to discuss here: Inconsistent type used for size of containers · Issue #3085 · stan-dev/math · GitHub

Thanks, @andrjohns and @WardBrian, for the super-fast response and bug fix!

2 Likes

The real problem here is going to be the instability of exp(digamma(alpha)) and exp(-x/lambda), which can overflow or underflow to zero very easily. If you look at the definition of the gamma distribution (with shape k and scale \theta), it’s

\textrm{gamma}(x \mid k, \theta) = \frac{1}{\displaystyle \Gamma(k) \cdot \theta^k} \cdot x^{k-1} \cdot \exp(-x / \theta),

so that on the log scale we need for Stan (and numerical stability), we have

\log \textrm{gamma}(x \mid k, \theta) = -\log(\Gamma(k)) - k \cdot \log(\theta) + (k - 1) \cdot \log(x) - \frac{\displaystyle x}{\displaystyle \theta}.

Substituting into your last line, you get

\log \textrm{gamma}(\exp(-x / \lambda) \mid \alpha, \exp(\psi(\alpha)) = -\textrm{lgamma}(\alpha) - \alpha \cdot \psi(\alpha) - (\alpha - 1) \cdot x / \lambda - \exp(-x / \lambda - \psi(\alpha)),

where \psi() is the digamma function. Hopefully here -x / \lambda - \psi(\alpha) will not be so small it underflows or so large it overflows (i.e., it should be in the range (-400, 400) at the most). This is better than exponentiating individually and dividing for stability. If the values are out of range, then you’ll probably be in the situation where you won’t be able to represent the log density in double-precision floating point.

Then you’ll have to do the vectorization manually. The trick there is to store intermediate calculations and reuse them where possible.

On the other hand, if the current formulation isn’t giving you numerical issues, it should be fine with the range of values for which you are using it.

1 Like

Thank you, @Bob_Carpenter !

I’ve been using this PDF within a large and complex model that is having issues with infinite gradients when optimising and tiny step sizes when sampling. Values of x / \lambda - \psi(\alpha) outside of (-400, 400) are not at all reasonable in my context, and anything that improves numerical stability is worth considering.

For anybody searching the forums later, it’s important to note that Stan implements the gamma distribution using the rate parameterisation rather than the scale parameterisation in Bob’s post, which reverses some of the signs. The substitution in the last line becomes:

\log \text{gamma}(\exp(-x / \lambda) \mid \alpha, \exp(\psi(\alpha)) = -\text{lgamma}(\alpha) + \alpha \cdot \psi(\alpha) - (\alpha - 1) \cdot x / \lambda - \exp(-(x / \lambda - \psi(\alpha)))

Putting that into a Stan function with some simplification and intermediate calculations to support vectorisation yields:

  real gamma_exp_lpdf(vector x, real lambda, real alpha) {
    int K = size(x);
    vector[K] z = x / lambda - digamma(alpha);
    return -K * (log(lambda) + lgamma(alpha)) - alpha * sum(z) - sum(exp(-z));
  }

Although this form has not been a silver bullet to solve my particular model’s problems, I still really appreciate the support in getting this piece of it as stable as possible.

Can you try an orthogonal parameterization of the gamma distribution and see if that helps?

See Add orthogonal parameterization of gamma distribution · Issue #3040 · stan-dev/math · GitHub for Stan code.