Generalized Gamma Implementation Leads To Divergences

PhDemetri · November 30, 2023, 9:19pm

The generalized gamma is an attractive likelihood for modelling survival data because other common survival likelihoods (i.e. the lognormal, the exponential, the gamma, and the weibull) can be considered as special cases of the generalized gamma.

As such, I’ve implemented the _lpdf and _lcdf functions based on the {flexsurv} implementation found here. Those functions are shown below.

real generalized_gamma_lpdf(real x, real mu, real sigma, real Q) {
  
  real lpdf;
  
  if (Q!=0) {
    
  real y = log(x);
  real w = (y-mu) / sigma;
  real abs_q = abs(Q);
  real qw = Q*w;
  real qi = 1.0 / (Q*Q);
  lpdf = -log(sigma * x) + 
         log(abs_q) * (1 - 2*qi) + 
         qi * (qw - exp(qw)) - 
         lgamma(qi);
         
  } else {
    
   lpdf = lognormal_lpdf(x | mu, sigma);
   
  }
  
  return lpdf;
  
}

real generalized_gamma_lcdf(real x, real mu, real sigma, real Q) {
  
  real y = log(x);
  real w = (y-mu) / sigma;
  real lcdf = 0;
  
  if (Q != 0) {
    
  real qq = 1.0 / (Q*Q);
  real expnu = exp(Q*w)*qq;
  
    if (Q > 0) {
      
      lcdf = gamma_lcdf(expnu | qq, 1);
      
    } else {
      
      lcdf = gamma_lccdf(expnu | qq, 1);
      
    }
    
  } else {
    
    lcdf = lognormal_lcdf(x | mu, sigma);
    
  } 
  
  return lcdf;
  
}

My implementation in agrees with the flexsurv well I have a small model, shown below, for a single set of parameters \mu, \sigma, and Q. I can recover these parameters fairly well with good sampler diagnostics. But every once in a while, the model will experience ~2-4 divergences.

functions {
  #include generalized_gamma_functions.stan
}
data {
  int n;
  array[n] real x;
}
parameters {
  real mu;
  real<lower=0> sigma;
  real Q;
}
model {
  mu ~ student_t(30, 0, 1);
  sigma ~ gamma(3, 3);
  Q ~ student_t(30, 0, 1);
  
  for(i in 1:n){
    target += generalized_gamma_lpdf(x[i]|mu, sigma, Q);
  }
}
generated quantities {
  array[n] real x_ppc;
  for(i in 1:n) {
    x_ppc[i] = generalized_gamma_rng(mu, sigma, Q);
  }
}

Increasing adapt_delta to 0.99 reduces the frequency of divergent transitions, but I still get divergences from time to time. I simulated the data above a 250 times and plotted what parameter combinations lead to divergences. Here is that plot.

There seems to be an issue near Q=0. Looking at the density(below), I imagine the \vert Q \vert is the culpriate.

$$ f(x \mid \mu, \sigma, Q)=\frac{Q Q\left(Q^{-2}\right)^{Q^{-2}}}{\sigma x \Gamma\left(Q^{-2}\right)} \exp \left(Q^{-2}(Q w-\exp (Q w))\right) $$

Question

What are some strategies to prevent the divergences?

One pragmatic solution would be to just use the log normal if I suspect Q\approx 0 and bound Q above/below in the parameter block otherwise.

Are there any other approaches? Ideally I could use one density, but model stacking is something my team regularly does, so we could always fit a log normal survival model and a generalized gamma with bounded Q.

Open to suggestions.

Bob_Carpenter · November 30, 2023, 10:16pm

That’s not unexpected in that HMC technically requires continuously differentiable inputs.

This is a not uncommon approach to numerical issues. For instance, if you evaluate log(1 - x), then you wind up breaking it down into a regular log, then a second order Taylor expansion for values of x near 1 and then a first order Taylor expansion for values really close to 1.

Is there a reason you’re not just replacing a Student-t with 30 degrees of freedom with a normal?

You can add print statements to the Stan program to show the target as you go. For example you can have

for (i in 1:n) {
  x[i] ~ generalized_gamma(mu, sigma, Q);
  print("i = ", i, ";  target = ", target, ";  Q = ", Q);
}

If you vectorize your definition of generalized_gamma, you can save a lot of compute with vectorization (see below).

The place to watch out for divergences numerically is with subtractions (if the values are close, you can get catastrophic cancellation) and in exp, which can easily overflow or underflow. log can also be problematic if values are near zero or near one.

Once you’ve defined an lpdf you can use sampling syntax, but it’s optional.

I’d recommend coding this way:

real generalized_gamma_lpdf(real x, real mu, real sigma, real Q) {
  if (Q == 0) {
    return lognormal_lpdf(x | mu, sigma);
  }    
  real y = log(x);
  real w = (y - mu) / sigma;
  real abs_Q = abs(Q);
  real qw = Q * w;
  real Q_inv_sq = Q^-2;
  return -log(sigma * x)
    + log(abs_Q) * (1 - 2 * Q_inv_sq)
    + Q_inv_sq * (qw - exp(qw))
    - lgamma(qi);
}

or for a vector x (I’m using linear algebra operations, so it can’t be an array):

real generalized_gamma_lpdf(vector x, real mu, real sigma, real Q) {
  if (Q == 0) {
    return lognormal_lpdf(x | mu, sigma);
  }    
  real log_abs_Q = log(abs(Q));
  real Q_inv_sq = Q^-2;
  int N = size(x);
  vector[N] qw = Q * (log(x) - mu) / sigma;
  return sum(-log(sigma * x)
             + log_abs_Q * (1 - 2 * Q_inv_sq)
             + Q_inv_sq * (qw - exp(qw))
             - lgamma(qi));
}

PhDemetri · November 30, 2023, 11:01pm

Thanks for the recomendations.

Is there a reason you’re not just replacing a Student-t with 30 degrees of freedom with a normal?

For sensitivity to the prior, no other good reason.

Could you help me understand some of the decisions you made when recommending re-writing the lpdf? Could you make similar recommendations for the lcdf (as I’ll be using this for survival anlaysis eventually)?

sonicking · December 2, 2023, 11:13pm

Hello, I don’t have an answer to your specific question. But I would suggest you to consider the Generalized F distribution, which in theory nests the Generalized Gamma, but actually easier to work with. I also learned about it from the flexsurv package. I shared my stan code: Generalized F distribution can reduce to Generalized Gamma distribution - but what prior to use? - #2 by Bob_Carpenter

I hope this is useful in some way.

Bob_Carpenter · December 8, 2023, 8:49pm

I think the best thing to do might be to read the efficiency chapter of the User’s Guide, which goes over Stan optimizations and their motivations. There is also consideration for numerical issues, like not applying exp() unless absolutely necessary (it’s prone to underflow, overflow and rounding issues).

The lcdf code is much less complex here:

real generalized_gamma_lcdf(real x, real mu, real sigma, real Q) {
  real y = log(x);
  real w = (y-mu) / sigma;
  real lcdf = 0;
  if (Q != 0) {
    real qq = 1.0 / (Q*Q);
    real expnu = exp(Q*w)*qq;
    if (Q > 0) {
      lcdf = gamma_lcdf(expnu | qq, 1);
    } else {
      lcdf = gamma_lccdf(expnu | qq, 1);
    }
  } else {
    lcdf = lognormal_lcdf(x | mu, sigma);
  } 
  return lcdf;
}

There’s not much to do to make this more efficient unless Q is data. If Q is data, you can separate the zero, less than zero, and greater than zero cases ahead of time and avoid all the conditionals here. Generally I would suggest rewriting this to just return from exceptional cases rather than trying to have a single end-return style. It makes the code much easier to read locally and removes a local variable. Always start with the exceptional cases to get them out of the way before going on to the main body of execution. So this isn’t going to be any more efficient, but here’s how I’d write it:

real generalized_gamma_lcdf(real x, real mu, real sigma, real Q) {
  if (Q == 0) {
    return lognormal_lcdf(x | mu, sigma);
  }
  real inv_Q_sq = Q^-2;
  real expnu = exp(Q * (log(x) - mu) / sigma) * inv_Q_sq;
  if (Q > 0) {
    return gamma_lcdf(expnu | inv_Q_sq, 1);
  }
  return gamma_lccdf(expnu | inv_Q_sq, 1);
}

The problem for numerical stability here is going to be the exp(...). So the next level would be to go into the gamma_lcdf function and see I you can unfold it to avoid computing expnu explicitly. For example, we know that if c * a ~ gamma(alpha, beta), then a ~ gamma(alpha, beta / c). So you can bring in the multiplier to get gamma_lcdf(exp(Q * (log(x) - mu) / sigma), inv_Q_sq, Q^2). And then if you have that, we know that Q * (log(x) - mu) / sigma follows a log-sigma distribution, but we don’t have that one built-in.

There may be algebraic tricks you can play with he cdf and ccdf, but you need to get into the gamma functions and incomplete gamma functions.

Mainly, all of this cdf code is going to get unstable with values near 0 or 1.

Topic		Replies	Views
Issue with Survival Models Modeling fitting-issues , survival	3	1184	November 26, 2021
Optimization of a custom likelihood --when is it worthwile to code it in C++ with its derivative? Modeling techniques , fitting-issues , performance	9	589	July 16, 2020
Gamma(.., ..)^2 -> generalized_gamma(?, ?, ?) Modeling fitting-issues	23	4485	February 7, 2018
Vectorized Implementations of Custom Distributions Modeling	2	359	May 12, 2021
Generalized gamma distributions Developers specification	11	1727	December 7, 2023

Generalized Gamma Implementation Leads To Divergences

Question

Related topics