Zero-length vector for ragged data structures

I have recurrent event data to which I want to fit a non-homogeneous Poisson process (NHPP). I followed the ragged data structures section of the Stan manual. Here is a reproducible example:


functions{
  // Log likelihood of a power-law NHPP with intensity
  // lambda(t) = (beta / theta) * (t / theta)^(beta - 1),
  // observed on (0, tau] with event times t
  real nhpp_log(vector t, real beta, real theta, real tau){
    vector[num_elements(t)] loglik_part;
    real loglikelihood;
    for (i in 1:num_elements(t)){
      loglik_part[i] = log(beta) - beta*log(theta) + (beta - 1)*log(t[i]);
    }
    // sum of log intensities minus the cumulative intensity Lambda(tau)
    loglikelihood = sum(loglik_part) - (tau/theta)^beta;
    return loglikelihood;
  }
}
data {
  int<lower=0> N; // total number of events across all groups
  int<lower=0> K; // total number of groups
  vector<lower=0>[K] tau; // end of observation window (truncation time) per group
  vector<lower=0>[N] event_time; // event (failure) times, grouped contiguously
  int s[K]; // number of events in each group
}
parameters{
  real<lower=0> beta;
  real<lower=0> theta;
}
model{
  int position;
  position = 1;
  for (k in 1:K){
    segment(event_time, position, s[k]) ~ nhpp(beta, theta, tau[k]);
    position = position + s[k];
  }
//PRIORS
  beta ~ gamma(1, 1);
  theta ~ gamma(1, 0.01);
}

When there is at least one event in each time interval, this code works perfectly fine. For example:

df0 = list(
  N = 38L, K = 10L,
  tau = c(21.269, 18.109, 19.468, 19.89, 18.247, 19.048, 19.957, 21.006,
          17.524, 19.475),
  event_time = c(5.045, 14.921, 18.566, 7.265, 10.51, 12.155, 16.262, 17.738,
                 17.763, 16.059, 18.371, 10.393, 11.787, 5.088, 10.144, 11.646,
                 13.274, 15.233, 16.345, 17.583, 15.266, 15.391, 16.355, 17.79,
                 7.729, 13.906, 14.287, 12.012, 18.662, 5.654, 5.727, 8.144, 11.608,
                 14.756, 14.933, 16.088, 16.45, 18.876),
  s = c(3L, 6L, 2L, 2L, 7L, 4L, 3L, 2L, 6L, 3L)
)

library(rstan)
# plptau1 holds the Stan program above as a character string
fitplp <- stan(
  model_code = plptau1, model_name = "NHPP", data = df0,
  iter = 1000, warmup = 500, chains = 1, seed = 123, refresh = 0
)
> fitplp
Inference for Stan model: NHPP2.
1 chains, each with iter=1000; warmup=500; thin=1; 
post-warmup draws per chain=500, total post-warmup draws=500.

        mean se_mean   sd   2.5%    25%    50%    75%  97.5% n_eff Rhat
beta    2.16    0.03 0.37   1.53   1.91   2.14   2.37   2.98   115 1.00
theta  10.53    0.12 1.40   7.72   9.59  10.58  11.51  13.16   130 1.00
lp__  -90.13    0.09 1.15 -93.08 -90.57 -89.78 -89.30 -88.98   171 1.03

Samples were drawn using NUTS(diag_e) at Mon Jul  8 10:40:19 2019.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

The true parameters for beta and theta are 2 and 10, which are very close to the estimates.


However, when some of the time intervals contain no events, which is very common in practice, the Stan code fails. For example:

df1 = list(
  N = 8L, K = 10L, 
  tau = c(9.9785, 7.3146, 10.0518, 10.2853, 10.2621, 8.4175, 10.7142, 12.0679, 10.6844, 8.2966), 
  event_time = c(6.7346, 8.1608, 4.4621, 6.5118, 7.9538, 11.2332, 11.6206, 11.9121), 
  s = c(0L, 0L, 2L, 1L, 2L, 0L, 0L, 3L, 0L, 0L)
)
fitplp <- stan(
  model_code=plptau1, model_name="NHPP", data=df1, 
  iter=1000, warmup = 500, chains=1, seed = 123, refresh = 0
)

[1] "Error in sampler$call_sampler(args_list[[i]]) : Initialization failed."
error occurred during calling the sampler; sampling not done

I believe the problem is the zero-length segments produced by the zero entries in s. Is there any way to get around this issue? Thank you.

Why not just do this?:

  for (k in 1:K){
    if (s[k] == 0) continue; // skip empty groups; position needs no update since s[k] is 0

    segment(event_time, position, s[k]) ~ nhpp(beta, theta, tau[k]);
    position = position + s[k];
  }

ETA: Of course, that would mean not using some elements of tau, but that seems to be what you want to do, anyway, judging from your question.

Thank you @jjramsey! But that seems to ignore the log-likelihood contribution of the groups with no events. In my self-defined nhpp_log() function, the term -(tau/theta)^beta applies to every group, even one with no events.

In the case of zero events, if I assume loglikelihood = log(beta) - beta*log(theta) - (tau/theta)^beta, would this solution cause biased estimation, since it simply ignores these non-event intervals?

I don’t know. It’s hard for me, as an outsider, to tell what the intended logic of your model is.

However, if the model is supposed to do something in the case where there are no events, then the coding of your model does not seem to make a lot of sense. While the “~” operator in Stan technically does not involve taking actual samples, the statement

foo ~ bar(...)

is still supposed to indicate that foo is a random variable drawn from distribution bar(...). If foo were a zero-length vector, it would not be clear what is being drawn from distribution bar(...).
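One way to sidestep the zero-length sampling statement entirely is to increment the log density with target += instead of ~, so every group contributes its term whether or not it contains events. A sketch (untested, assuming the same power-law intensity as in the nhpp_log function above):

```stan
model {
  int position = 1;
  for (k in 1:K) {
    // every group, empty or not, contributes the cumulative-intensity term
    target += -(tau[k] / theta)^beta;
    // each observed event adds its log-intensity term
    for (i in position:(position + s[k] - 1)) {
      target += log(beta) - beta * log(theta) + (beta - 1) * log(event_time[i]);
    }
    position += s[k];
  }
  // PRIORS
  beta ~ gamma(1, 1);
  theta ~ gamma(1, 0.01);
}
```

When s[k] is 0 the inner loop never executes (Stan skips loops whose upper bound is below the lower bound), so that group contributes only -(tau[k]/theta)^beta, and tau[k] is still used.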

A good solution to this problem is to code the data as at-risk intervals rather than as events. That is how Kaplan-Meier plots are constructed, and it generalizes nicely to event-history models.
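Concretely, one possible encoding (a sketch, not from the thread; the names are illustrative, and the parameters block is as in the original model) gives every group one row per at-risk interval with an event indicator, so a group with zero events still appears as a single censored interval:

```stan
data {
  int<lower=1> R;                  // total number of at-risk intervals
  vector<lower=0>[R] start;        // interval start (0 or the previous event time)
  vector<lower=0>[R] stop;         // interval end (an event or the truncation time)
  int<lower=0, upper=1> event[R];  // 1 = ends in an event, 0 = censored
}
model {
  for (r in 1:R) {
    // each interval contributes minus the increment in cumulative intensity,
    // which telescopes to -(tau/theta)^beta within a group
    target += -((stop[r] / theta)^beta - (start[r] / theta)^beta);
    // intervals ending in an event also add the log intensity at the event time
    if (event[r] == 1) {
      target += log(beta) - beta * log(theta) + (beta - 1) * log(stop[r]);
    }
  }
  beta ~ gamma(1, 1);
  theta ~ gamma(1, 0.01);
}
```

Summing the interval terms within a group reproduces the original per-group likelihood, while empty groups reduce to one row with event = 0.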