Possible Issue with gamma_lccdf

We’re having problems with fitting a Gamma distribution to right censored data. Similar Stan files appear to work if we use a log-Normal in place of the Gamma distribution and/or we only use left censored data. However, if we try to estimate the parameters of a Gamma from right censored data, we get peculiar results. Using BridgeStan with our own Metropolis-Hastings sampler also appears to work, implying to us that the issue is something to do with the combination of gamma_lccdf and using gradients.

There is a related issue (here: Numerically Stable gamma_lcdf Gradients by andrjohns · Pull Request #2767 · stan-dev/math) which fixed something related to gamma_lcdf (as needed for left censored data) but does not appear to have considered gamma_lccdf (as needed for right censored data).

I’m keen to know how others would seek to pin down what is happening and/or fix it. Any suggestions would be welcome.

5 Likes

Could you manually compute the complementary distribution from the cdf, or write a custom log gamma ccdf to test whether is the specific implementation of gamma_lccdf or an intrinsic mathematical hurdle of the distribution and HMC?

Thanks. I think implementing our own log ccdf using Stan’s primitives is a good idea. It seems the same idea was useful for the cdf (here: Minimal censored Gamma with VERY bad performance). My guess is that we need a mycdf(.) that calls gamma_q(.).

I never used it myself, but I’m assuming target += 1 - gamma_lcdf(y | alpha, beta) will work straightforwardly for reals, and a simple wrapper function can vectorize it as needed. That plus the fully customized lccdf written (nearly) from scratch will give you two different options to find the potential issue.

You probably mean to use the log1m_exp() function, which first exponentiates the log-cumulative probability (from your_lcdf function), takes the complement, and then returns the log of that in a numerically stable fashion:

target += log1m_exp(gamma_lcdf(y | alpha, beta))

You’re right. I intended it to be the log of one minus the complement of the CDF, not one minus its log [log(1-F(y)) as opposed to 1-log(F(y))]. But since we want the log-probability, I guess it would be log1m(gamma_cdf(y | alpha, beta)), no need to exponentiate it back. As I said, I haven’t used this myself, so it’s good practice to that the quantity is correct, but the general idea should be correct.

1 Like

I created a PR that fixes this is at fix-gamma-lccdf by spinkney · Pull Request #3241 · stan-dev/math · GitHub

5 Likes

I compiled cmdStan with the fix-gamma-lccdf branch of Stan math but am still running into issues when sampling from a Gamma distribution with left-censored data. I consider the following models:

censored_stan_lccdf.stan

data {
  int<lower=0> N_obs;
  int<lower=0> N_cens;
  vector<lower=0>[N_obs] obs;
  vector<lower=0>[N_cens] cens;
}
parameters {
  real log_alpha_raw;
  real log_beta_raw;
}
transformed parameters {
  real<lower=0> alpha = exp(log(30) + log_alpha_raw);
  real<lower=0> beta  = exp(log(1.0 / 3.0) + log_beta_raw);
}
model {
  target += normal_lpdf(log_alpha_raw | 0, 1);
  target += normal_lpdf(log_beta_raw  | 0, 1);

  target += gamma_lpdf(obs | alpha, beta);
  target += gamma_lccdf(cens | alpha, beta);
}

censored_manual_lccdf.stan

functions {
  real gamma_lccdf_manual(real y, real alpha, real beta) {
    if (y <= 0) return 0;
    return log1m_exp(gamma_lcdf(y | alpha, beta));
  }
  real gamma_lccdf_manual(vector y, real alpha, real beta) {
    real s = 0;
    for (n in 1:num_elements(y)) {
      s += gamma_lccdf_manual(y[n], alpha, beta);
    }
    return s;
  }
}
data {
  int<lower=0> N_obs;
  int<lower=0> N_cens;
  vector<lower=0>[N_obs] obs;
  vector<lower=0>[N_cens] cens;
}
parameters {
  real log_alpha_raw;
  real log_beta_raw;
}
transformed parameters {
  real<lower=0> alpha = exp(log(30) + log_alpha_raw);
  real<lower=0> beta  = exp(log(1.0 / 3.0) + log_beta_raw);
}
model {
  target += normal_lpdf(log_alpha_raw | 0, 1);
  target += normal_lpdf(log_beta_raw  | 0, 1);

  target += gamma_lpdf(obs | alpha, beta);
  target += gamma_lccdf_manual(cens, alpha, beta);
}

and simulate data from a Gamma distribution with $$\alpha=30.0$$ and $$\beta=1/3$$ with a censoring threshold of $$T_{censor} = 100.0$$

rng = np.random.default_rng(42)
samples = rng.gamma(shape=30.0, scale=3.0, size=2_000)
is_obs  = samples <= 100.0
obs     = samples[is_obs].astype(float)
cens    = np.full(np.sum(~is_obs), 100.0)
data = {
    "N_obs":  int(obs.size),
    "N_cens": int(cens.size),
    "obs":    obs,
    "cens":   cens,
}

I then sample from each model using the default NUTS hyperparameters and the same initial seed. As you can see below, only censored_manual_lccdf.stan generates samples close to the true values of alpha and beta. The model that uses gamma_lccdf from Stan reaches max tree depth 90% of the time.

I have run this over multiple seeds (for Stan and data generation) and repeatedly see the same behaviour. Implementing a basic MH sampler in Python and using BridgeStan to call log_probyields decent results for both models.

To try and isolate the issue, I created two separate models that are parameterised as:

data {
  real<lower=0> alpha;
  real<lower=0> beta;
}
parameters {
  real<lower=0> y;
}
model {
 ...
}

And use BridgeStan to evaluate log_prob and log_prob_grad for each model across y=0 to y=200, but the values agree with one another, so I’m a bit stumped where the discrepancy is coming from. Any suggestions on how to isolate this?

That is weird! Can you test the following three things?

  1. Can you confirm that cmdstan is picking up the correct branch of Stan math when using cmdstanpy?
  2. Can you expose the functions and test in Python to find specific values that fail? I think in Python you can use GitHub - WardBrian/pybind_expose_stan_fns: Exposing Stan functions in Python. You can then test with the current release vs the branch fix to see if the same values fail. Or if some range is fixed but there are still values in the fix thatl fail.
  3. Test the fix-gamma-lccdf branch again but loop over gamma_lccdf elementwise instead of the vectorized version. Let’s see if the vectorization is the issue.
1 Like

Some responses below:

  1. I believe that Stan math is pointing to the correct branch:
(base) mjc@dacdt-condor-gpu2:~/Documents/cmdstan/stan/lib/stan_math$ git status
On branch develop
Your branch is up-to-date with 'origin/develop'.

nothing to commit, working tree clean
(base) mjc@dacdt-condor-gpu2:~/Documents/cmdstan/stan/lib/stan_math$ git branch -v
* develop b82d68ced2 Merge pull request #3251 from stan-dev/dependabot/github_actions/actions/upload-artifact-5
(base) mjc@dacdt-condor-gpu2:~/Documents/cmdstan/stan/lib/stan_math$ git log --oneline
b82d68ced2 (HEAD -> develop, origin/develop, origin/HEAD) Merge pull request #3251 from stan-dev/dependabot/github_actions/actions/upload-artifact-5
15c59c085f Bump actions/upload-artifact from 4 to 5
5e698970d5 Merge pull request #3248 from stan-dev/tweak-positive-index-wording
53973441c7 Update wording of validate_positive_index
b9167895db Merge pull request #3246 from stan-dev/fix/cleanup-rev-tests
c89ca16f78 Merge pull request #3245 from stan-dev/fix/inline
28c7048a5e [Jenkins] auto-formatting by clang-format version 10.0.0-4ubuntu1
1ca4162bfa fix cpplint
40920f5427 [Jenkins] auto-formatting by clang-format version 10.0.0-4ubuntu1
29ea240604 Merge commit '407dcfeb7bcdc5cbea1ff427e4945161fc47a276' into HEAD
bd4af2fe0f use AGradRev so that all tests clean themselves up in the rev testing folder
407dcfeb7b Merge pull request #3236 from lingium/feature/issue-3235-yule-simon-lccdf
021bac4cc0 fix cpplint
a8a0014e32 [Jenkins] auto-formatting by clang-format version 10.0.0-4ubuntu1
d4c03d88b6 fix cpplint
fd75aaf64a fix double inline keywords
340e343d8c Remove test of deprecated forward_as
a099940a59 Merge pull request #3244 from stan-dev/more-metaprograms
94dda8f755 Update stan/math/prim/prob/yule_simon_lccdf.hpp
2f4fe90141 Merge pull request #3242 from stan-dev/cleanup/remove-forward_as
bf46263fc2 [Jenkins] auto-formatting by clang-format version 10.0.0-4ubuntu1
9d1072913e Merge commit '1e1dfc91e8264116565d8f944c016c68ecb4cfb7' into HEAD
c5c4eb1d0d add inline to all functions so that the definition of functions is used only in on translation unit
1e1dfc91e8 Merge pull request #3241 from stan-dev/fix-gamma-lccdf
(base) mjc@dacdt-condor-gpu2:~/Documents/cmdstan/stan/lib/stan_math$ 

I point cmdStanPy to this version of cmdStan:

set_cmdstan_path("/home/mjc/Documents/cmdstan")
print(f"CmdStan version: {cmdstan_version()}")
print("Using CmdStan at:", cmdstan_path())

in terminal I see:

CmdStan version: (2, 37)
Using CmdStan at: /home/mjc/Documents/cmdstan
  1. I’ll have a go at this and follow up.

  2. I revised the lccdf function to use:

for (n in 1:num_elements(y)) {
            acc += gamma_lccdf(y[n] | alpha, beta);
}

but still see the same issues

I’ve confirmed that it does result in numerical instability for alpha approximately greater than 30. It’s due to gamma_q or grad_reg_inc_gamma. I can rewrite this to use gamma_p and grad_reg_lower_inc_gamma, which avoids potential instability in a tgamma(alpha) and digamma(alpha) call. It seems to sample a bit faster than just wrapping gamma_lcdf with log1m_exp. I’ll put a pr in the next week or so with Thanksgiving.

2 Likes

You can follow the PR at fix gamma lccdf for alpha > 30 and overflow by spinkney · Pull Request #3264 · stan-dev/math · GitHub

2 Likes

@mjcarter @s.maskell test this pr super stable gamma_lccdf by spinkney · Pull Request #3266 · stan-dev/math · GitHub

I tested in various regimes and found it to be faster and handle extreme values quite well. I cannot guarantee all values with finite precision computation but I believe it should work over an extended range relative to the log1m_exp version.

1 Like