Cumulative probability within interval [t, t+1]

For a time-to-event (survival) analysis, I want to calculate the probability of an event occurring within an interval, e.g. between times t and t+1, where the time to event follows a continuous distribution.

The time-to-event distribution could be Exponential, Gamma or Weibull. For this example I will use the simpler exponential distribution:

The probability of the event occurring within the interval is the integral of the density over that interval, \int_t^{t+1} \lambda e^{- \lambda x} \, dx, which can be written as the difference of CDFs: (1-e^{- \lambda (t+1)}) - (1-e^{- \lambda t}).
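Expanding that difference gives the equivalent closed form e^{- \lambda t} - e^{- \lambda (t+1)} = e^{- \lambda t}(1 - e^{- \lambda}), so on the log scale \log p = - \lambda t + \log(1 - e^{- \lambda}); this form is useful for numerical stability, as discussed below.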

Or in Stan syntax: exponential_cdf(t + 1 | lambda) - exponential_cdf(t | lambda)


The key part of model code on the log scale is therefore:

transformed parameters {
  vector[N] lp;  //log probability for an interval

  for (i in 1:N)
    lp[i] = log_diff_exp(exponential_lcdf(t[i] + 1 | lambda),
                         exponential_lcdf(t[i]     | lambda));
}

However, fitting this model results in an inability to initialise model parameters:

Chain 1 Rejecting initial value:
Chain 1 Gradient evaluated at the initial value is not finite.
Chain 1 Stan can’t start sampling from this initial value.
Chain 1 Rejecting initial value:
Chain 1 Log probability evaluates to log(0), i.e. negative infinity.
Chain 1 Stan can’t start sampling from this initial value.
Chain 1 Initialization between (-2, 2) failed after 100 attempts.

I have fitted many variations of this model and it seems that subtracting the cumulative probabilities is causing the difficulties. A simplified version of the full code is given below.

Can anyone explain this behaviour?


data {
  int <lower=1> N;                  //number of age classes
  array[N] int<lower=1> tested;     //number tested per age class
  array[N] int<lower=0> positive;   //number positive per age class
  array[N] int<lower=1> age_lower;  //lower interval of age class
  array[N] int<lower=age_lower> age_upper;  //upper interval of age class
}

parameters {
  real<lower=0> lambda;  //rate of exponential distribution
}

transformed parameters {
  vector[N] lp;  //log probability within an interval

  for (i in 1:N)
    lp[i] = log_diff_exp(exponential_lcdf(age_upper[i] | lambda),
                         exponential_lcdf(age_lower[i] | lambda));
}

model {
  for (i in 1:N)
    target += binomial_lpmf(positive[i] | tested[i], exp(lp[i]));
}

Part of the problem may be that our log CDFs are, for the most part, implemented naively by taking the log of the CDF, which is unstable in the tails.
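For the exponential specifically, you can sidestep the log CDF entirely, since P(l < X < u) = e^{- \lambda l}(1 - e^{- \lambda (u - l)}). A minimal sketch of the transformed parameters loop rewritten this way (my reformulation, not tested against the original data):

for (i in 1:N)
  // log P = -lambda * age_lower + log(1 - exp(-lambda * (age_upper - age_lower)))
  lp[i] = -lambda * age_lower[i]
          + log1m_exp(-lambda * (age_upper[i] - age_lower[i]));

Here log1m_exp(x) evaluates log(1 - exp(x)) stably, so lp stays finite even when lambda * age_lower is large enough that the CDF rounds to 1.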

You can use print(target()) to track when the target becomes -infinity. Did it not tell you where there was an error in the Stan code?
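For example, a sketch of that kind of instrumentation in the model block (the print labels are made up):

model {
  for (i in 1:N) {
    target += binomial_lpmf(positive[i] | tested[i], exp(lp[i]));
    // report the running log density to see which term drives it to -inf
    print("i = ", i, " lp = ", lp[i], " target = ", target());
  }
}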

Also, it’s much more efficient to compute each exponential lcdf only once. As is, you calculate most of them twice. That’d be something like this:

vector[N] exp_lcdf;
for (n in 1:N) {
  exp_lcdf[n] = exponential_lcdf(t[n] | lambda);
}

Did you really intend to write t[i] + 1 rather than t[i + 1]? Usually, you measure between time points in your time series, not between discrete indexes. So that’d be this:

vector[N - 1] lp;
for (n in 1:N - 1) {
  lp[n] = log_diff_exp(exp_lcdf[n + 1], exp_lcdf[n]);
}

with only N - 1 entries. Or maybe there’s a log(0) log CDF on the boundary?
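To make that concrete (my own illustration, assuming the naive log-of-CDF implementation described above): once lambda * t is moderately large, the CDF rounds to exactly 1 in double precision, so both lcdfs return 0 and the log-difference is log(0):

real lcdf_lo = exponential_lcdf(20 | 2);   // 1 - exp(-40) rounds to 1.0, so lcdf is 0
real lcdf_hi = exponential_lcdf(21 | 2);   // likewise 0
real lp = log_diff_exp(lcdf_hi, lcdf_lo);  // log(exp(0) - exp(0)) = log(0) = -inf

That matches the "Log probability evaluates to log(0)" message in the initialization output.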