Survival model in Rstan


#1

Hi everyone, I am trying to implement survival model in rstan. The code below is bit modified version of the leuk example. The problem is when I run the program, it takes more than 8 hours with 4 chains each 1000 iterations. The data is relatively big, 4000 observations and 7 predictors. Is there any possible way to optimize the code and make it faster?!

data {
  int<lower=0> N;//number of observations
  int<lower=0> NT;//number of death times
  int<lower=0> obs_t[N]; // survival months of each patient
  int<lower=0> S[NT+1]; //unique death times
  int<lower=0> fail[N]; // dead or alive
  int K;
  matrix[N,K] z;
  real mu[N];// Lnr
}
transformed data {
  int Y[N, NT];
  int dN[N, NT]; 
  real c;
  real r; 
  for(i in 1:N) {
    for(j in 1:NT) {
      Y[i, j] = int_step(obs_t[i] - S[j] + .000000001);
     dN[i, j] = Y[i, j] * fail[i] * int_step(S[j + 1] - obs_t[i] - .000000001);
    }
  }
  c = 1.5; 
  r = 0.1; 
}
parameters {
  real betaLnr; 
  vector [K] beta;
  real<lower=0> dL0[NT]; 
} 

model {
  betaLnr ~ normal(0, 1000);
  beta ~ normal(0, 1000);
  for(j in 1:NT) {
    dL0[j] ~ gamma(r * (S[j+1] - S[j]) * c, c);
    for(i in 1:N) {
      if (Y[i, j] != 0)  
       dN[i,j]~poisson(Y[i, j] * exp(mu[i]*betaLnr+dot_product(z[i],beta)))); 
    }
   }
}

#2

I give it s shot. Some thoughts follow:

Performance also depends on how large NT is.

You could precualculate in the transformed data a difference vector for S and then vectorize this statement so it works without a loop.

Equally well you could precalculate the expression that is the argument of the exp as a local variable in the model block using vectorized operations:

vector[N] loc_exp = exp(mu*betaLnr+z*beta);

Also I think mu should better be a vector.

About other optimizations I have to think more…


#3

Ok thank you, I’m gonna try your suggestions.
Looking forward for updates


#5

See also https://github.com/stan-dev/example-models/blob/master/bugs_examples/vol1/leuk/leuk.stan


#6

Thank you very much for keepng me updated.

Yes, I have seen this example before!

I reduced the time by adding “cores” in stan() function. Before I did not know about it,
unless I have seen in your code. cores=6 helped me to run faster, it took around 1,5 hours.


#7

Don’t worry about efficiency in transformed data. But the rest of the model is tricky to make more efficient.

You can precompute a vector of S_diff_times_c[j] = (S[j + 1] - S[j]) * c to use in the dL0 distribution in the model block. Then vectorize to

dL0 ~ gamma(r * S_diff_times_c[j], c);

Then you need to vectorize the dN, but there’s the nasty Y[i, j] > 0 condition,w hich seems to imply the data with Y[i, j] == 0 is not being modeled. Is that the intention?

To vectorize, you need replace Y with a precomputed array of values where Y[i,j] > 0, but that will turn out to be ragged, which is a headache with our current rectangular data structures. Then vectorize by row. And use the poisson_log distribution, which takes a parameter on the log scale, e.g.,

dN[i] ~ poisson_log(log_Y[i,j] + ...);

#8

I’ve tried to vectorize the time difference S_diff_times_c[j] in the Stan dev leukemia survival model and it returned divergent transitions:

transformed data {
  vector[NT] S_diff_times_c;
...
  for(j in 1:NT) {
  S_diff_times_c[j] = (t[j + 1] - t[j]);
}
}

Edit: the error comes from the fact I should be using dL0 ~ gamma(r * S_diff_times_c , c) without the index in the model block for it to work.