Hi all, I’m having trouble building up an interval censored model. I’ve boiled down the current problem I’m having to a manageable (reproducible) example. I’ve attached the related R and stan files. The R file simple_sim.R contains the R code to generate the data described, along with the calls to run the 3 Stan models described:
simple_model_1.stan (1.1 KB)
simple_sim.R (3.4 KB)
simple_model_3.stan (1.5 KB)
simple_model_2.stan (1.2 KB)
We have a population of N subjects, where each subject n has a 0/1 covariate x_n. A time is generated for each subject from an exponential glm with a log link, i.e. for subject n, t_n\sim\textsf{Exponential}(\lambda_n), where \lambda_n=\exp(\beta_0+\beta_1*x_n). However, we do not observe t_n. Instead we’re given fixed screening times \{S_1,\cdots, S_m\}, and observe whether t_n fell between S_1 and S_m (if it occurs before S_1 we say is observed in this interval for right now). If it did not fall between those times we say it fell after S_m.
Thus, if t_n is observed in the interval between S_1 and S_m, the likelihood contribution for subject i is \int_{S_1}^{S_{m}}\lambda_ne^{-\lambda_n*t}dt, and if it’s not the likelihood contribution is \int_{S_{m}}^{\infty}\lambda_ne^{-\lambda_n*t}dt. Rather than just add that to the log probability target in Stan, I’m trying to write this as a categorical likelihood, where letting g_n be 1 if the time is before S_1 (can’t happen right now), 2 if it’s between S_1 and S_m, and 3 if it’s after S_m, g_n\sim\textsf{Categorical}(\theta_n) with \theta_n=[\int_{0}^{S_{1}}\lambda_ne^{-\lambda_n*t}dt, \int_{S_1}^{S_{m}}\lambda_ie^{-\lambda_n*t}dt, \int_{S_{m}}^{\infty}\lambda_ne^{-\lambda_n*t}dt] (for my actual data you don’t actually observe the interval, just other data from each screening time, so we need to marginalize over all possible intervals, thus necessitating the use of \theta_n).
If I treat the screening times as the same for everyone as above, I can fit this in Stan without much hastle (it takes a few tries to initialize since some values of the \beta s cause the last integral to be numerically 0 for my data). Here’s the model chunk (from simple_model_1.stan):
model {
beta ~ normal(0, priorsd);
for (n in 1:N){
vector[3] log_theta;
log_theta[1] = exponential_lcdf(times[2]|lambda[n]);
log_theta[2] = log_diff_exp(
exponential_lcdf(times[n_times[1]]|lambda[n]),
exponential_lcdf(times[2]|lambda[n]));
log_theta[3] = log_diff_exp(0.0,
exponential_lcdf(times[n_times[1]]|lambda[n]));
group[n] ~ categorical(exp(log_theta));
}
}
Since the notation is slightly different than what I write above, for clarity, group[n]=g_n, times[2]=S_1, and times[n_times[1]]=S_m. times contains the screening times, with times[1]=0 and n_times[1]=m.
My problem lies in that I’d like to generalize the model to allow for different subjects to have different screening times and different numbers of screening times. This leads to a ragged data structure. I tried as a first step storing every subject’s screening times in one long vector, and then accessing them as described in the ragged data structure section of the stan manual. The model chunk then becomes (from simple_model_2.stan):
model {
int pos;
pos=1;
beta ~ normal(0, priorsd);
for (n in 1:N){
vector[3] log_theta;
log_theta[1] = exponential_lcdf(times[pos+1]|lambda[n]);
log_theta[2] = log_diff_exp(
exponential_lcdf(times[pos+n_times[n]-1]|lambda[n]),
exponential_lcdf(times[pos+1]|lambda[n]));
log_theta[3] = log_diff_exp(0.0,
exponential_lcdf(times[pos+n_times[n]-1]|lambda[n]));
group[n] ~ categorical(exp(log_theta));
pos = pos + n_times[n];
}
}
Here times is the vector containing every subjects times one after the other, n_times[n] is the number of screening times for subject n, and pos keeps track of where in times we are for the current subject. All I’ve changed from the previous model are how we access the times that we’re plugging into the cdfs. This model also works.
Now where the model breaks down is when I make the length of log_theta n_times[n] (i.e. the number of screens for subject n, which in the current example is still 3 for every subject) and I set the value of log_theta in a loop (both of which will be necessary for the generalization). The model chunk now looks like (I include 0 in the times vector so this should be building up the exact same log_theta as the previous model) (from simple_model_3.stan):
model {
int pos;
pos=1;
beta ~ normal(0, priorsd);
for (n in 1:N){
vector[n_times[n]] log_theta;
for(i in 1:(n_times[n]-1)){
log_theta[i] = log_diff_exp(
exponential_lcdf(times[pos+i]|lambda[n]),
exponential_lcdf(times[pos+i-1]|lambda[n]));
}
log_theta[n_times[n]] = log_diff_exp(0.0,
exponential_lcdf(times[pos+n_times[n]-1]|lambda[n]));
group[n] ~ categorical(exp(log_theta));
pos = pos + n_times[n];
}
}
This model however doesn’t initialize. It says it’s rejecting initial values because “Gradient evaluated at the initial value is not finite.” (there are also some initialization problems as in the first problem with the integrals being numerically 0, but I’m not worried about these). I don’t think the setting of the length of log_theta is causing issues, as I tried just setting it to 3 instead of n_times[n] and it didn’t initialize as well. I’ve tried setting the initial values for the \beta s to the true values in the simulation and that didn’t work. I’ve also looked at the values of log_theta as it’s trying to initialize, and they look to be calculated correctly. Therefore I think there’s a problem with setting the values of log_theta in a loop, but I’m not sure exactly what’s wrong or how I would fix this! Please let me know if anything isn’t clear!