Ragged data structure causing initialization problems for interval censored model

fitting-issues

#1

Hi all, I’m having trouble building up an interval-censored model. I’ve boiled the current problem down to a manageable (reproducible) example. I’ve attached the related R and Stan files. The R file simple_sim.R contains the R code to generate the data described below, along with the calls to run the three Stan models described:
simple_model_1.stan (1.1 KB)
simple_sim.R (3.4 KB)
simple_model_3.stan (1.5 KB)
simple_model_2.stan (1.2 KB)

We have a population of N subjects, where each subject n has a 0/1 covariate x_n. A time is generated for each subject from an exponential GLM with a log link, i.e. for subject n, t_n\sim\textsf{Exponential}(\lambda_n), where \lambda_n=\exp(\beta_0+\beta_1*x_n). However, we do not observe t_n. Instead we’re given fixed screening times \{S_1,\cdots, S_m\}, and observe whether t_n fell between S_1 and S_m (if it occurs before S_1 we say, for now, that it is observed in this interval). If it did not fall between those times we say it fell after S_m.

Thus, if t_n is observed in the interval between S_1 and S_m, the likelihood contribution for subject n is \int_{S_1}^{S_{m}}\lambda_ne^{-\lambda_n*t}dt, and if not, the likelihood contribution is \int_{S_{m}}^{\infty}\lambda_ne^{-\lambda_n*t}dt. Rather than just adding that to the log probability target in Stan, I’m trying to write this as a categorical likelihood: letting g_n be 1 if the time is before S_1 (which can’t happen right now), 2 if it’s between S_1 and S_m, and 3 if it’s after S_m, we have g_n\sim\textsf{Categorical}(\theta_n) with \theta_n=[\int_{0}^{S_{1}}\lambda_ne^{-\lambda_n*t}dt, \int_{S_1}^{S_{m}}\lambda_ne^{-\lambda_n*t}dt, \int_{S_{m}}^{\infty}\lambda_ne^{-\lambda_n*t}dt]. (For my actual data you don’t actually observe the interval, just other data from each screening time, so we need to marginalize over all possible intervals, which necessitates the use of \theta_n.)
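For a concrete check, the three entries of \theta_n have closed forms as differences of the exponential CDF. A quick plain-Python sketch (the parameter values here are made up for illustration, not taken from the simulation):

```python
import math

# Hypothetical values: intercept, slope, covariate, and two screening times.
beta0, beta1, x_n = -1.0, 0.5, 1
lam = math.exp(beta0 + beta1 * x_n)  # lambda_n = exp(beta_0 + beta_1 * x_n)
S1, Sm = 1.0, 5.0

# The three integrals, written as exponential CDF differences:
theta = [
    1.0 - math.exp(-lam * S1),                  # P(t_n < S_1)
    math.exp(-lam * S1) - math.exp(-lam * Sm),  # P(S_1 <= t_n < S_m)
    math.exp(-lam * Sm),                        # P(t_n >= S_m)
]
print(theta)
print(sum(theta))  # the three probabilities sum to 1
```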

If I treat the screening times as the same for everyone, as above, I can fit this in Stan without much hassle (it takes a few tries to initialize, since some values of the \beta s make the last integral numerically 0 for my data). Here’s the model chunk (from simple_model_1.stan):

model {
    beta ~ normal(0, priorsd);
    for (n in 1:N){
        vector[3] log_theta; 
        log_theta[1] = exponential_lcdf(times[2]|lambda[n]);
        log_theta[2] = log_diff_exp(
                exponential_lcdf(times[n_times[1]]|lambda[n]),
                exponential_lcdf(times[2]|lambda[n]));
        log_theta[3] = log_diff_exp(0.0,
            exponential_lcdf(times[n_times[1]]|lambda[n]));
        group[n] ~ categorical(exp(log_theta));
    }
}

Since the notation is slightly different from what I wrote above, for clarity: group[n]=g_n, times[2]=S_1, and times[n_times[1]]=S_m. times contains the screening times with a leading zero, i.e. times[1]=0, so n_times[1]=m+1.

My problem lies in the fact that I’d like to generalize the model to allow different subjects to have different screening times, and different numbers of them. This leads to a ragged data structure. As a first step I tried storing every subject’s screening times in one long vector, and then accessing them as described in the ragged data structures section of the Stan manual. The model chunk then becomes (from simple_model_2.stan):

model {
    int pos;
    pos=1;
    beta ~ normal(0, priorsd);
    for (n in 1:N){
        vector[3] log_theta;
        log_theta[1] = exponential_lcdf(times[pos+1]|lambda[n]);
        log_theta[2] = log_diff_exp(
                exponential_lcdf(times[pos+n_times[n]-1]|lambda[n]),
                exponential_lcdf(times[pos+1]|lambda[n]));
        log_theta[3] = log_diff_exp(0.0,
            exponential_lcdf(times[pos+n_times[n]-1]|lambda[n]));
        group[n] ~ categorical(exp(log_theta));
        pos = pos + n_times[n];
    }
}

Here times is the vector containing every subject’s times one after the other (each subject’s leading zero included), n_times[n] is the number of entries for subject n, and pos keeps track of where in times the current subject starts. All I’ve changed from the previous model is how we access the times that we plug into the CDFs. This model also works.
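For anyone unfamiliar with the ragged-array trick, the bookkeeping can be sketched in plain Python (the subjects’ times here are invented; Python indexing is 0-based, so pos starts at 0 rather than 1):

```python
# Names mirror the Stan data: `times` is every subject's entries
# concatenated, `n_times[n]` counts subject n's entries.
times = [0.0, 1.0, 5.0,           # subject 1: 0, S_1, S_m
         0.0, 2.0, 4.0, 6.0]      # subject 2: 0, S_1, S_2, S_m
n_times = [3, 4]

pos = 0
segments = []
for n in range(len(n_times)):
    subject_times = times[pos:pos + n_times[n]]  # this subject's slice
    segments.append(subject_times)
    print("subject", n + 1, subject_times)
    pos += n_times[n]                            # advance to next subject
```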

Now where the model breaks down is when I make the length of log_theta equal to n_times[n] (i.e. the number of entries for subject n, which in the current example is still 3 for every subject) and set the values of log_theta in a loop (both of which will be necessary for the generalization). Since I include 0 in the times vector, this should build up exactly the same log_theta as the previous model. The model chunk now looks like this (from simple_model_3.stan):

model {
    int pos;
    pos=1;
    beta ~ normal(0, priorsd);
    for (n in 1:N){
        vector[n_times[n]] log_theta; 
        for(i in 1:(n_times[n]-1)){
            log_theta[i] = log_diff_exp(
                exponential_lcdf(times[pos+i]|lambda[n]),
                exponential_lcdf(times[pos+i-1]|lambda[n]));
        }
        log_theta[n_times[n]] = log_diff_exp(0.0,
            exponential_lcdf(times[pos+n_times[n]-1]|lambda[n]));

        group[n] ~ categorical(exp(log_theta));
        pos = pos + n_times[n];
    }
}

This model, however, doesn’t initialize. It rejects initial values with “Gradient evaluated at the initial value is not finite.” (There are also some initialization problems, as in the first model, with the integrals being numerically 0, but I’m not worried about those.) I don’t think setting the length of log_theta to n_times[n] is causing the issue, since setting it to 3 instead also fails to initialize. I’ve tried setting the initial values of the \beta s to the true values from the simulation, and that didn’t work. I’ve also looked at the values of log_theta while it’s trying to initialize, and they look to be calculated correctly. So I think there’s a problem with setting the values of log_theta in a loop, but I’m not sure exactly what’s wrong or how to fix it! Please let me know if anything isn’t clear!


#2

Hmm, I didn’t read this closely, but I’d probably go at it with some print debugging.

You can do things like:

for(i in 1:n_times[n]) {
  if(is_inf(log_theta[i])) {
    print("log theta is inf");
  }

  if(is_nan(log_theta[i])) {
    print("log theta is nan");
  }
}

Just start near the categorical variable and march your way back until you find what’s blowing up.


#3

Thanks for the reply! I’ve already tried print debugging like that, and the problem doesn’t seem to be log_theta being built up incorrectly with infs or nans (as a sanity check I just added your prints, and they never get triggered, for example when I initialize the \beta s to the true simulated values). log_theta looks to be built up correctly, i.e. the same way as in the first two models (when I print the values of log_theta[i] generated in the loop alongside the corresponding values from the first two models, they are exactly the same), which is why I can’t figure out what I’m doing wrong. Stan gives the “Rejecting initial value: Gradient evaluated at the initial value is not finite. Stan can’t start sampling from this initial value.” error, which is harder to diagnose from what I’ve seen on these forums so far (I already know how to deal with the nan/inf errors). Could it be that initializing values in a loop like this is causing numerical issues or making the model non-differentiable?


#4

Next thing I’d check, then, is whether exp(log_theta) is going to zero anywhere, because to get the log density we have to take the log of that number again.

Internally this is all the categorical lpmf is doing: https://github.com/stan-dev/math/blob/develop/stan/math/prim/mat/prob/categorical_lpmf.hpp#L35

You could replace

group[n] ~ categorical(exp(log_theta));

with

target += log_theta[group[n]];

to avoid that issue (it’s also more efficient, since we’re not calling exp on a whole vector every time).
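To see why the round trip can bite: for a sufficiently negative log_theta entry, exp underflows to exactly 0 in double precision, and the log taken inside the categorical lpmf then returns -inf. A minimal Python illustration (the value -800 is just an example of a legitimately tiny log probability):

```python
import math

log_theta_k = -800.0               # a tiny but valid log probability

p = math.exp(log_theta_k)          # underflows to exactly 0.0
# The exp-then-log round trip inside categorical(exp(log_theta)):
roundtrip = math.log(p) if p > 0.0 else -math.inf   # not finite
# Indexing the log directly keeps the finite value:
direct = log_theta_k
print(roundtrip, direct)
```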


#5

Nope, no zeros; log_theta looks to be built up correctly from my prints (I just double-checked, and none of the exp(log_theta[i]) are 0). Plus, I would be getting the error directly from the categorical lpmf in that case, right? I’ve already tried adding it directly to the target like that, and it doesn’t change anything: I still get the “Rejecting initial value: Gradient evaluated at the initial value is not finite. Stan can’t start sampling from this initial value.” error. (I also went through and printed errors whenever the values calculated in the loop weren’t equal to the values from the other models, and got none, so log_theta is definitely being built up the same as in the other models, which don’t have these gradient initialization errors.)

Update: I just thought about it, and the one thing that differs in the loops between the two models is the first element of log_theta: instead of calling exponential_lcdf(times[pos+1]|lambda[n]) (the integral from 0 to the first time), I call log_theta[i] = log_diff_exp(exponential_lcdf(times[pos+i]|lambda[n]), exponential_lcdf(times[pos+i-1]|lambda[n])), which is the integral from 0 to the first time minus the integral from 0 to 0, on the log scale. I changed the loop to evaluate just the CDF (instead of the difference) for the first element, and it worked! So I think the log_diff_exp statement was not differentiable when the second argument was the exponential lcdf evaluated at 0 (even though it produced the correct value of log_theta!). Does this make sense? Is there a reason for this? It’s not at all clear to me that it should happen.
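For what it’s worth, the failure mode is reproducible outside Stan: the value of the exponential lcdf at t = 0 is log(0) = -inf, which log_diff_exp absorbs correctly (log(exp(a) - exp(-inf)) = a), but the analytic gradient with respect to \lambda is a 0/0 at that point. A plain-Python sketch, using the textbook derivative formula rather than anything copied from Stan’s internals:

```python
import math

# exponential_lcdf(y | lam) = log(1 - exp(-lam * y)).
# Its partial derivative wrt lam is  y * exp(-lam*y) / (1 - exp(-lam*y)).
lam, y = 0.5, 0.0   # the "integral from 0 to 0" term

cdf = 1.0 - math.exp(-lam * y)                     # = 0.0
value = math.log(cdf) if cdf > 0.0 else -math.inf  # -inf, yet log_diff_exp(a, -inf) = a is fine
num = y * math.exp(-lam * y)                       # = 0.0
grad = num / cdf if cdf != 0.0 else float("nan")   # 0/0: gradient is not finite
print(value, grad)
```

The value comes out right while the gradient does not, which matches seeing correct log_theta prints alongside the "gradient is not finite" rejection.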


#6

Niiice! Good stuff!

Haha, I don’t have a good explanation for you. I checked a couple of edge cases but wasn’t able to track the issue down. This sort of thing happens not infrequently, unfortunately. Most of the time there’s not much to do other than be aware of it for next time :D.


#7

I’m not good at going through Stan’s C++ code (so I can’t check this myself), but would you happen to know whether the derivative of exponential_lcdf is defined as nan or inf at 0? That would make sense as an explanation: log_theta was being built correctly (the function returns the correct value), but I was getting the gradient-not-finite error (because the derivative is nan or inf).


#8

If y_dbl is zero here, there’d be problems with the next line (log of zero): https://github.com/stan-dev/math/blob/develop/stan/math/prim/scal/prob/exponential_lcdf.hpp#L54

But that’s the output value haha. It’s possible a -inf got juggled around and never blew up into a NaN, but you were printing your output pretty aggressively?

I dunno, that’s probably the line. Does that make any sense to you?


#9

Yeah, I’m pretty sure y_dbl would be 0 there, but that’s strange, because I was checking the output and wasn’t getting any problems with log_theta itself, just with the derivative, so I’m not sure why I wasn’t getting a log of 0 there! But a few lines down, at https://github.com/stan-dev/math/blob/develop/stan/math/prim/scal/prob/exponential_lcdf.hpp#L57, it looks like the derivative is defined as something divided by the quantity that’s 0 (because y_dbl is 0), so it makes sense that the derivative would be infinite. I guess I’ll just have to watch out for this in the future, like you said. Thanks for your help!


#10

Nice find! I agree with your assessment. Good luck with the rest of your model!