Model diagnosis / comparison for survival models

Not sure if others have experience with this, but I’ve been using the time-dependent Brier score, specifically a variation of it using a log loss. It’s summarized well in Graf et al, 1999. Has the limitation that it uses a point estimate of survival probability at a particular survival time, but perhaps you could summarize this over the posterior distribution to get a sense of calibration.

In this formulation, the censored survival times contribute to the score directly up until the time they are censored, and contribute indirectly after that by impacting the weights assigned to observed & non-observed events. It’s not ideal but unless you model the censoring distribution directly, it seems like the best one can do.

One concern I have when using predicted survival times is that the posterior distribution of this predicted value can be very wide (often too wide to be clinically useful!), so this measure may not be sensitive to problems with model fit.


Thanks for the Brier score reference. For reference, I also list “Consistent Estimation of the Expected Brier Score in General Survival Models with Right-Censored Event Times” by Gerds et al. with the accompanied R package pec.

In survival analysis with censored data the mean squared error of prediction can be estimated by weighted averages of time-dependent residuals. Graf et al. (1999) suggested a robust weighting scheme based on the assumption that the censoring mechanism is independent of the covariates. We show consistency of the estimator. Furthermore, we show that a modified version of this estimator is consistent even when censoring and event times are only conditionally independent given the covariates. The modified estimators are derived on the basis of regression models for the censoring distribution. A simulation study and a real data example illustrate the results.


We emphasise that […] all these estimators are robust against misspecification of the survival model.

This is provides an estimator with weaker assumptions for consistency, however requires us to model the censoring time distribution.

If I understand correctly, the potential bias of the estimator now does not depend anymore on the survival model but on the model used for censoring, which is much better if you use the brier score to compare multiple survival models and when use the same underlying model for censoring times.

Thanks Jacki. If I wanted to include this aspect into my model comparison, might a potential way to check for this aspect to work with posterior predictive checks (for the uncensored cases only) and (graphically) check for sharpness?

For two well calibrated models, the one with the narrower posterior intervals is preferable because its predictions are more tighter. The term introduced for this by Gneiting et al. (2007) is “sharpness.” In the limit, a perfect model would provide a delta function at the true answer with a vanishing posterior interval.

Thanks @ermeel - this sharpness attribute sounds like a useful metric, among well-calibrated models. Particularly when the intended use case involves the predicted survival times.

1 Like

Hi Aki, @avehtari,

Model for censored times is equivalent to binary classification and you can use the usual calibration plot for the probabilities.

I know it’s a while ago (we even spoke briefly at StanCon), but could you elaborate more on this?

Or did you have the above brier score in mind?

Can you post Stan code?

Here is some Stan code (simplest version I could think of):

data {
    int<lower=0> N_uncensored;                                      
    int<lower=0> N_censored;                                        
    int<lower=1> NC;                                                
    matrix[N_censored,NC] X_censored;                               
    matrix[N_uncensored,NC] X_uncensored;                           
    vector<lower=0>[N_censored] times_censored;                          
    vector<lower=0>[N_uncensored] times_uncensored;
parameters {
    vector[NC] betas;                                     
    real intercept;                                 
model {
    betas ~ normal(0,2);                                                            
    intercept   ~ normal(-5,2);                                                     
    target += exponential_lpdf(times_uncensored | exp(intercept+X_uncensored*betas)); 
    target += exponential_lccdf(times_censored  | exp(intercept+X_censored*betas));
generated quantities {
    vector[N_uncensored] times_uncensored_sampled;
        real tmp;
        real max_time;
        real max_time_censored;
        max_time = max(times_uncensored);
        max_time_censored = max(times_censored);
        if(max_time_censored > max_time) max_time = max_time_censored;
        for(i in 1:N_uncensored) {
            tmp= max_time + 1; 
            while(tmp > max_time) {
                tmp = exponential_rng(exp(intercept+X_uncensored[i,]*betas));
            times_uncensored_sampled[i] = tmp;

I think what I had in my mind would work only for censored at random. Then the distribution of CCDF’s should be uniform. But if the censoring is not random then the censoring process should be modelled, too.