How to validate a survival model with both right censoring and interval censoring?

Hi,

I’m working with a plant survival dataset that contains right censored and interval censored data. I managed to successfully fit a Weibull model to the data (see below), but I don’t know how I should perform posterior predictive checks.

I did not find any case studies, forum threads or any other resources for this type of models, so I’m not sure how to proceed.

Can anyone suggest a way to validate the model? If this were any other type of data I would compare observed data against predicted data, but in this case I don’t know how to do that since I don’t have the exact time of death of any plant.

Thanks

 //Survival stuff
  vector[N] y;
  vector[N] y1;
  vector[N] y2;
  int<lower=1,upper=2> cens[N];

  //Explanatory variables
  vector[N] OM;
  vector[N] pH;
  vector[N] SM;
  vector[N] ST;
  int<lower=1,upper=5> cover_id[N];
  
  //group variables
  int<lower=1> site_id[N];
  int<lower=1> plot_id[N];

  //Missing data stuff
  int<lower=0> N_LAI_miss;
  int LAI_missidx[N_LAI_miss];

}
parameters {
  //intercept
  real a;
//varying intercepts
  vector[N_site] a_site;
  vector[N_plot] a_plot;
  vector[N_cover] a_cover;

   //slopes
   real b_om;
   real b_ph;
   real b_sm;
   real b_st;
   real<lower=0> k;
   
}
transformed parameters {
   vector[N] mu=exp(a + a_site[site_id]+a_plot[plot_id]+ a_cover[cover_id]+b_sm*SM+b_st*ST+
    b_om*OM+b_ph*pH+);
   vector[N] lambda= mu / tgamma(1 + 1 /k);
}
model {
  //Likelihood
  for (i in 1:N)
  if ( cens[i] == 1 ) target += weibull_lccdf(y[i] | k, lambda[i]);
  else target += log_diff_exp(weibull_lcdf(y2[i] | k, lambda[i]),weibull_lcdf(y1[i] | k, lambda[i]));
 
 //Priors
  a ~ normal(0, 1);
  k ~ gamma(0.5, 0.5);
  a_site ~ normal(0,1);
  a_plot ~ normal(0,1);
  a_cover  ~ normal(0,1);
  b_om ~ normal(0,1);
  b_ph ~ normal(0,1);
  b_sm ~ normal(0,1);
  b_st ~ normal(0,1);
  
}
generated quantities{
  vector[N] log_lik;
  vector[N] surv;
  
  for(i in 1:N){
  log_lik[i] = weibull_lpdf(y[i] | k, lambda[i]);
  surv[i] = weibull_rng(k, lambda[i]);
}
}
3 Likes

I was wondering if someone has any hints about this. I’m still stuck!

Right now, the only possible solution I can think of is to simulate data from a Weibull distribution, manually applying right and interval censoring and checking if this model can recover the initial parameters.

If you want to perform posterior predictive checks (and are using R) I suggest this vignette: Extracting and visualizing tidy residuals from Bayesian models • tidybayes (mjskay.github.io)

2 Likes

Thanks. I’m familiar with the software options. Unfortunately, it’s the conceptual part that’s the issue.

The conceptual part is outlined under the section Randomized quantile residuals.

2 Likes

@StaffanBetner 's suggestion is perfectly reasonable and exploring residuals will provide you with good means to explore the fit of your model. I’ll add that if you really want to do posterior predictive checks, you can do that with censored models as long as you can reconstruct the censoring process. E.g. if the censoring is due to only sporadic inspection of your experimental units and you know the time of all inspections (not just the two surrounding the event), you can take your posterior predictions and transform them into predictions of alive/dead status at each inspection. This can then be directly compared to your data.

Does that make sense?

1 Like

Thanks @martinmodrak. The right censoring is caused by plants that outlived the experiment. The interval censoring is caused by us running out of money halfway through the experiment and not being able to visit the experiment site as frequently as we would have liked.

I think I understand the idea, but I’m not sure I know how to implement it. Can you share a code snippet? I usually find it easier to understand when I see it written in code.

Unfortunately, I don’t have a working code snippet ready, but the idea is pretty straightforward. The simplest case is right censoring - if the experiment end is the same for all plants, you would have something like:

  surv[i] = weibull_rng(k, lambda[i]);
  if(surv[i] > experiment_end) {
   surv[i] = experiment_end;
  }

I am not sure I understand fully how the interval censoring arised from your description, but the idea is the same - make the full prediction and then replicate what you believe happened in the experiment.

Does that make sense?

1 Like

We don’t know exactly when the plant died. At t=y1 the plant was alive and at t=y2 the plant was dead. We know it died between y1 and y2 but we don’t know when exactly.

Unfortunately no. I guess the idea is to use weibull_rng() to generate right censored data and interval censored data that I can compare with the observed censored data, but I don’t know how to translate that into Stan code.

So let’s assume you have the observation times sorted from earliest to latest in an array called obs_times with size N_observation_times. This array should include all the times that the plant would be observed if it stayed alive throughout the whole experiment, not only the last time alive and first time dead. Let us further assume that your main data column containts the “last time plant was seen alive” (i.e, if the plant was alive at t = y1 and dead at t=y2 that your main output has y1, if the observation is right-censored, you record the last observation time). We also assume all plants were alive at t = 0

Now you can generate a data that would be directly comparable to the data you actually observed by something like (code not tested):

surv_raw[i] = weibull_rng(k, lambda[i]);
surv_censored[i] = 0;
//Increase the simulated censored time as long as it is larger than the observation times
for(t_i in 1:N_observation_times) {
  if(surv_raw[i] > obs_times[t_i]) {
    surv_censored[i] = obs_times[t_i];
  }
}

The code can hopefully be easily adapted if your censoring worked slightly differently (e.g. if you recorded first time dead instead of last time alive).

Does that make sense?

I’ll also repeat that the residual analysis suggested above is also a good option and if you have trouble implementing the censoring PP check, the residuals might be easier to work with.

1 Like

I don’t think the code you suggested works in my case. Also, the code from tidybayes requires functions that are not compatible with rstan and cmdstanr objects and I don’t know how to do the calculations by myself.

I think I need to provide additional details and reproducible code to help you help me.

Starting from the beginning: we planted a bunch of seeds in a plot. These seeds germinated at different points in time. We visited the plot periodically and recorded whether each plant was alive or dead. We know the exact date each acorn germinated. A few plants outlived the experiment but most had died by the end of it.

Here’s example code for 500 plants with a followup period of 150 days/weeks. That is, after 150 days/weeks we stopped visiting the experiment.

sim<-rweibull(500,shape=1.78,scale=70)
y_rcens=if_else(sim>100,150,sim)
cens=if_else(sim>150,1,2)
y2=sim+2
y1=sim-2
y1=if_else(y1<0,1,y1)

y_rcens = time between germination and the end of the experiment
y1 = time between germination and the last time the plant was found alive.
y2 = time between germination and the date it was found dead.

I know the code for creating interval censoring is not perfect, but that’s the best I can come up with at the moment (r - Simulate survival data from a Weibull distribution with both right and interval censoring - Stack Overflow).

Here’s an example model:

data {
    //Dimensions
  int<lower=1> N;

  //Survival stuff
  vector<lower=0>[N] y_rcens;
  vector<lower=0>[N] y1;
  vector<lower=0>[N] y2;
  int<lower=1,upper=2> cens[N];
}
parameters {

   real<lower=0> k;
   real<lower=0> mu;

}
transformed parameters {
   real lambda= mu / tgamma(1 + 1 /k);
}
model {
  //Likelihood
  for (i in 1:N)
  if ( cens[i] == 1 ) target += weibull_lccdf(y_rcens[i] | k, lambda);
  else target += log_diff_exp(weibull_lcdf(y2[i] | k, lambda),weibull_lcdf(y1[i] | k, lambda));
 
 //Priors
  k ~ normal(1,0.5);
  mu ~ gamma(5.8,0.1);
}

which runs like this:

library(cmdstanr)
mod <- cmdstan_model("model_surv_sim.stan")
fit <- mod$sample(
  data = sim_list,
  chains = 4,
  parallel_chains = 4
)

Thanks a lot for the time you spent so far.

I think I made some progress. I calculated the average of y2 and y1 and used the resulting values as observed times of death. Then I plotted a Kaplan Meier curve and compared it against model predictions.

Data preparation:

library(tidyverse)
sim<-rweibull(500,shape=1.78,scale=70)
y_rcens=if_else(sim>100,150,sim)
cens=if_else(sim>150,1,2)
y2=sim+2
y1=sim-2
y1=if_else(y1<0,1,y1)


sim_list<-list(N=length(y1),
               y_rcens=y_rcens,
               y1=y1,
               y2=y2,
               cens=cens)

Model

data {
    //Dimensions
  int<lower=1> N;

  //Survival stuff
  vector<lower=0>[N] y_rcens;
  vector<lower=0>[N] y1;
  vector<lower=0>[N] y2;
  int<lower=1,upper=2> cens[N];
}
parameters {

   real<lower=0> k;
   real<lower=0> mu;

}
transformed parameters {
   real lambda= mu / tgamma(1 + 1 /k);
}
model {
  //Likelihood
  for (i in 1:N)
  if ( cens[i] == 1 ) target += weibull_lccdf(y_rcens[i] | k, lambda);
  else target += log_diff_exp(weibull_lcdf(y2[i] | k, lambda),weibull_lcdf(y1[i] | k, lambda));
 
 //Priors
  k ~ normal(1,0.5);
  mu ~ gamma(5.8,0.1);
}
generated quantities{
  vector[N] surv;
  for(i in 1:N){
   surv[i] = weibull_rng(k, lambda);

}
}

Run the model:

library(cmdstanr)
mod <- cmdstan_model("model_surv_sim.stan")
fit <- mod$sample(
  data = sim_list,
  chains = 4,
  parallel_chains = 4
)

Plot KM curve and compare it against posterior predictions

library(rstan)
library(survival)
fit1 <- rstan::read_stan_csv(fit$output_files())
post_draws<-rstan::extract(fit1)
df_survfit<-data.frame(y1=sim_list$y1,y2=sim_list$y2,
                       y_rcens=sim_list$y_rcens,
                       cens=if_else(sim_list$cens==2,1,0),
                       y_obs=(y1+y2)/2)

surv<-survfit(Surv(y_obs,cens)~1,data=df_survfit)
plot(surv,xlab='Time',ylab="Survival probability",conf.int=F,col=c("gray"),xlim=c(0,170))
for(i in 1:100){
  trt_ecdf <- ecdf(post_draws$surv[i,])
  curve(1 - trt_ecdf(x), from = 0, to=170, add=T, col='gray')
}
lines(survfit(Surv(y_obs,cens)~1,data=df_survfit),add=T)

The resulting plot looks good:

I just wanted to make two comments, then first on interval censoring in survival modeling and the second on posterior retrodictive checks.

Interval censoring is quite natural to incorporate into survival models. Recall that a survival model is defined by some probability density function \rho(x; \theta) for an event taking place at x \in [0, \infty). A typical survival observation is defined by not only by an event occurring at x but also having not occurred at any previous x; mathematically the corresponding observational model is given by

\begin{align*} \pi(x; \theta) &= \; \text{Probability of not occurring before }x \\ &\quad \cdot \text {Probability density of occurring at }x \\ &= \; (1 - \mathbb{P}_{\rho}[I(0, x)]) \\ &\quad \cdot \rho(x; \theta) \\ &= \left[ \int_{x}^{\infty} \mathrm{d} x' \, \rho(x'; \theta) \right] \cdot \rho(x; \theta), \end{align*}

where I(0, x) is the interval from 0 to x.

A right censored observation occurs when x is only known to be larger than some threshold, x_{R}. Mathematically the observational model for these observations is given by the probability

\begin{align*} P_{R}(\theta) &= \text{Probability of occurring after }x_{R} \\ &= \mathbb{P}_{\rho}[I(x_{R}, \infty)]) \\ &=\int_{x_{R}}^{\infty} \mathrm{d} x' \, \rho(x'; \theta). \end{align*}

Superficially we just drop the \rho(x; \theta) term compared to the non-censored observation and then x = x_{R}.

Mathematically an interval censored observation is defined similarly. Here observations are known to be above x_{R} (right-censored) but also below x_{L} (left censored) so that the observational model is given by the probability

\begin{align*} P_{I}(\theta) &= \text{Probability of occurring after }x_{R}\text{ and before }x_{L} \\ &= \mathbb{P}_{\rho}[I(x_{R}, x_{L})]) \\ &= \int_{x_{R}}^{x_{L}} \mathrm{d} x' \, \rho(x'; \theta). \end{align*}

By the way most of Stan’s log CDF implementations are just naive logarithms of the CDF implementations so

rho_cdf(x_{L} | theta) - rho_cdf(x_{R} | theta)

will be more stable and efficient than

log_diff_exp(rho_lcdf(x_{L} | theta), rho_lcdf(x_{R} | theta),

as is done in the current Stan program.

One way to simulate all of these observations is to draw directly from the occurrence model

\tilde{x} \sim \rho(x; \theta),

and then return a censored observation if \tilde{x} falls into any of the censoring intervals, as is done in Felipe’s code. Another possibility is to partition the observational space into disjoint possibilities, compute the probability of each, sample one of those possibilities, and then sample any additional structure within that possibility.

For example let’s say that we observe survival events fully up until x_{1} at which point observations are censored between x_{1} and x_{2}, x_{2} and $x_{3}, and the right censored above $x_{3}. We now have four basic possibilities:

\begin{align*} 1: p_{1} = P[ 0 < x < x_{1} ] &= \int_{0}^{x_{1}} \mathrm{d} x' \, \rho(x'; \theta) \\ 2: p_{2} = P[ x_{1} < x < x_{2} ] &= \int_{x_{1}}^{x_{2}} \mathrm{d} x' \, \rho(x'; \theta) \\ 3: p_{3} = P[ x_{2} < x < x_{3} ] &= \int_{x_{2}}^{x_{3}} \mathrm{d} x' \, \rho(x'; \theta) \\ 4: p_{4} = P[ x_{3} < x ] &= \int_{x_{3}}^{\infty} \mathrm{d} x' \, \rho(x'; \theta) \end{align*}

Once we’ve evaluated (p_{1}, p_{2}, p_{3}, p_{4}) we can draw from a categorical RNG using those weights to determine which kind of observation we make. If we draw 2, 3, or 4 then we’re done and if we draw 1 then we further have to simulate

\tilde{x} \sim \rho(x; \theta)

subject to \tilde{x} < x_{1}.

While this might seem like a mess compared to the more straightforward simulation approach it provides motivation for summary statistics that make for very useful retrodictive checks. In particular given the number of observations of each type we can construct empirical probabilities

\begin{align*} \tilde{p}_{1} &= \frac{ n_{1} }{ N } \approx p_{1} \\ \tilde{p}_{2} &= \frac{ n_{2} }{ N } \approx p_{2} \\ \tilde{p}_{3} &= \frac{ n_{3} }{ N } \approx p_{3} \\ \tilde{p}_{4} &= \frac{ n_{4} }{ N } \approx p_{4}. \end{align*}

both from the observed data and posterior predictive simulations. This then defines four scalar posterior retrodictive checks between the observed empirical probabilities and the distribution of simulated posterior predictive empirical probabilities. Given enough observations the uncensored interval [0, x_{1}) can also be discretized into smaller intervals; the corresponding empirical probability retrodictive checks better probe the shape of the survival function there.

One can also combine probabilities, for example looking at p_{1}, p_{1} + p_{2}, p_{1} + p_{2} + p_{3}, and p_{1} + p_{2} + p_{3} + p_{4} which recreates a empirical CDF over the censoring intervals. This begins to approach the quantile residual summary statistics mentioned above only using the natural structure of the problem to define intervals that doesn’t require auxiliary simulations to work around the censoring. Also by directly comparing the empirical cumulative probabilities one doesn’t rely on assumed calibration behavior that isn’t guaranteed to hold.

6 Likes