Zero-length vector for ragged data structures

I have recurrent event data to which I want to fit a non-homogeneous Poisson process (NHPP). I followed the ragged data structures section of the Stan manual. Here is a reproducible example:


functions{
  // Log likelihood of a power-law NHPP with intensity
  // lambda(t) = (beta / theta) * (t / theta)^(beta - 1),
  // observed on (0, tau] with event times t
  real nhpp_log(vector t, real beta, real theta, real tau){
    vector[num_elements(t)] loglik_part;
    real loglikelihood;
    for (i in 1:num_elements(t)){
      loglik_part[i] = log(beta) - beta*log(theta) + (beta - 1)*log(t[i]);
    }
    // sum of log intensities minus the cumulative intensity Lambda(tau)
    loglikelihood = sum(loglik_part) - (tau/theta)^beta;
    return loglikelihood;
  }
}
data {
  int<lower=0> N; // total number of events across all groups
  int<lower=0> K; // total number of groups
  vector<lower=0>[K] tau; // end of observation window (truncation time) per group
  vector<lower=0>[N] event_time; // event (failure) times, grouped contiguously
  int s[K]; // number of events in each group
}
parameters{
  real<lower=0> beta;
  real<lower=0> theta;
}
model{
  int position;
  position = 1;
  for (k in 1:K){
    segment(event_time, position, s[k]) ~ nhpp(beta, theta, tau[k]);
    position = position + s[k];
  }
//PRIORS
  beta ~ gamma(1, 1);
  theta ~ gamma(1, 0.01);
}

When there is at least one event in each time interval, this code works perfectly fine. For example:

df0 = list(
  N = 38L, K = 10L,
  tau = c(21.269, 18.109, 19.468, 19.89, 18.247, 19.048, 19.957, 21.006,
          17.524, 19.475),
  event_time = c(5.045, 14.921, 18.566, 7.265, 10.51, 12.155, 16.262, 17.738,
                 17.763, 16.059, 18.371, 10.393, 11.787, 5.088, 10.144, 11.646,
                 13.274, 15.233, 16.345, 17.583, 15.266, 15.391, 16.355, 17.79,
                 7.729, 13.906, 14.287, 12.012, 18.662, 5.654, 5.727, 8.144, 11.608,
                 14.756, 14.933, 16.088, 16.45, 18.876),
  s = c(3L, 6L, 2L, 2L, 7L, 4L, 3L, 2L, 6L, 3L)
)

library(rstan)
# plptau1 holds the Stan program above as a character string
fitplp <- stan(
  model_code = plptau1, model_name = "NHPP", data = df0,
  iter = 1000, warmup = 500, chains = 1, seed = 123, refresh = 0
)
> fitplp
Inference for Stan model: NHPP2.
1 chains, each with iter=1000; warmup=500; thin=1; 
post-warmup draws per chain=500, total post-warmup draws=500.

        mean se_mean   sd   2.5%    25%    50%    75%  97.5% n_eff Rhat
beta    2.16    0.03 0.37   1.53   1.91   2.14   2.37   2.98   115 1.00
theta  10.53    0.12 1.40   7.72   9.59  10.58  11.51  13.16   130 1.00
lp__  -90.13    0.09 1.15 -93.08 -90.57 -89.78 -89.30 -88.98   171 1.03

Samples were drawn using NUTS(diag_e) at Mon Jul  8 10:40:19 2019.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

The true parameters for beta and theta are 2 and 10, which are very close to the estimates.


However, when some of the time intervals contain no events, which is very common in practice, the Stan code fails. For example:

df1 = list(
  N = 8L, K = 10L, 
  tau = c(9.9785, 7.3146, 10.0518, 10.2853, 10.2621, 8.4175, 10.7142, 12.0679, 10.6844, 8.2966), 
  event_time = c(6.7346, 8.1608, 4.4621, 6.5118, 7.9538, 11.2332, 11.6206, 11.9121), 
  s = c(0L, 0L, 2L, 1L, 2L, 0L, 0L, 3L, 0L, 0L)
)
fitplp <- stan(
  model_code=plptau1, model_name="NHPP", data=df1, 
  iter=1000, warmup = 500, chains=1, seed = 123, refresh = 0
)

[1] "Error in sampler$call_sampler(args_list[[i]]) : Initialization failed."
error occurred during calling the sampler; sampling not done

I believe the problem is the zero-length segments produced by the zero entries in s. Is there any way to get around this issue? Thank you.

Why not just do this?:

  for (k in 1:K){
    if (s[k] == 0) continue; // skip empty groups; position needs no update since s[k] is 0

    segment(event_time, position, s[k]) ~ nhpp(beta, theta, tau[k]);
    position = position + s[k];
  }

ETA: Of course, that would mean not using some elements of tau, but that seems to be what you want to do, anyway, judging from your question.

Thank you @jjramsey! But that seems to ignore the log-likelihood contribution of the groups with no events. In my self-defined nhpp_log() function, the term -(tau/theta)^beta applies to every group, even one with no events.

In the case of zero events, if I assume loglikelihood = log(beta) - beta*log(theta) - (tau/theta)^beta, would this solution cause biased estimation, since it simply ignores these non-event intervals?

I don’t know. It’s hard for me, as an outsider, to tell what the intended logic of your model is.

However, if the model is supposed to do something in the case where there are no events, then the coding of your model does not seem to make a lot of sense. While the “~” operator in Stan technically does not involve taking actual samples, the statement

foo ~ bar(...)

is still supposed to indicate that foo is a random variable drawn from distribution bar(...). If foo were a zero-length vector, it would not be clear what is being drawn from distribution bar(...).
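One way to sidestep the zero-length sampling statement entirely is to increment the log density with target += instead of ~, so every group contributes its term whether or not it contains events. A sketch (untested, assuming the same power-law intensity as in the nhpp_log function above):

```stan
model {
  int position = 1;
  for (k in 1:K) {
    // every group, empty or not, contributes the cumulative-intensity term
    target += -(tau[k] / theta)^beta;
    // each observed event adds its log-intensity term
    for (i in position:(position + s[k] - 1)) {
      target += log(beta) - beta * log(theta) + (beta - 1) * log(event_time[i]);
    }
    position += s[k];
  }
  // PRIORS
  beta ~ gamma(1, 1);
  theta ~ gamma(1, 0.01);
}
```

When s[k] is 0 the inner loop never executes (Stan skips loops whose upper bound is below the lower bound), so that group contributes only -(tau[k]/theta)^beta, and tau[k] is still used.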

A good solution to this problem is to code the data as at-risk intervals rather than as events. That is how Kaplan-Meier plots are constructed, and it generalizes nicely to event-history models.
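Concretely, one possible encoding (a sketch, not from the thread; the names are illustrative, and the parameters block is as in the original model) gives every group one row per at-risk interval with an event indicator, so a group with zero events still appears as a single censored interval:

```stan
data {
  int<lower=1> R;                  // total number of at-risk intervals
  vector<lower=0>[R] start;        // interval start (0 or the previous event time)
  vector<lower=0>[R] stop;         // interval end (an event or the truncation time)
  int<lower=0, upper=1> event[R];  // 1 = ends in an event, 0 = censored
}
model {
  for (r in 1:R) {
    // each interval contributes minus the increment in cumulative intensity,
    // which telescopes to -(tau/theta)^beta within a group
    target += -((stop[r] / theta)^beta - (start[r] / theta)^beta);
    // intervals ending in an event also add the log intensity at the event time
    if (event[r] == 1) {
      target += log(beta) - beta * log(theta) + (beta - 1) * log(stop[r]);
    }
  }
  beta ~ gamma(1, 1);
  theta ~ gamma(1, 0.01);
}
```

Summing the interval terms within a group reproduces the original per-group likelihood, while empty groups reduce to one row with event = 0.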