Model diagnosis / comparison for survival models


Not sure if others have experience with this, but I’ve been using the time-dependent Brier score, specifically a variation of it using a log loss. It’s summarized well in Graf et al, 1999. Has the limitation that it uses a point estimate of survival probability at a particular survival time, but perhaps you could summarize this over the posterior distribution to get a sense of calibration.

In this formulation, the censored survival times contribute to the score directly up until the time they are censored, and contribute indirectly after that by impacting the weights assigned to observed & non-observed events. It’s not ideal but unless you model the censoring distribution directly, it seems like the best one can do.

One concern I have when using predicted survival times is that the posterior distribution of this predicted value can be very wide (often too wide to be clinically useful!), so this measure may not be sensitive to problems with model fit.


Thanks for the Brier score reference. For reference, I also list “Consistent Estimation of the Expected Brier Score in General Survival Models with Right-Censored Event Times” by Gerds et al. with the accompanied R package pec.

In survival analysis with censored data the mean squared error of prediction can be estimated by weighted averages of time-dependent residuals. Graf et al. (1999) suggested a robust weighting scheme based on the assumption that the censoring mechanism is independent of the covariates. We show consistency of the estimator. Furthermore, we show that a modified version of this estimator is consistent even when censoring and event times are only conditionally independent given the covariates. The modified estimators are derived on the basis of regression models for the censoring distribution. A simulation study and a real data example illustrate the results.


We emphasise that […] all these estimators are robust against misspecification of the survival model.

This is provides an estimator with weaker assumptions for consistency, however requires us to model the censoring time distribution.

If I understand correctly, the potential bias of the estimator now does not depend anymore on the survival model but on the model used for censoring, which is much better if you use the brier score to compare multiple survival models and when use the same underlying model for censoring times.


Thanks Jacki. If I wanted to include this aspect into my model comparison, might a potential way to check for this aspect to work with posterior predictive checks (for the uncensored cases only) and (graphically) check for sharpness?

For two well calibrated models, the one with the narrower posterior intervals is preferable because its predictions are more tighter. The term introduced for this by Gneiting et al. (2007) is “sharpness.” In the limit, a perfect model would provide a delta function at the true answer with a vanishing posterior interval.


Thanks @ermeel - this sharpness attribute sounds like a useful metric, among well-calibrated models. Particularly when the intended use case involves the predicted survival times.