K-fold predictive validation of survival (Matlab Interface)

Dear Stan team,

I’m a non-statistician newbie and have used a k-fold model comparison scheme based on Vehtari et al (2016). I just wanted to check that what follows is correct (especially what I do with the posteriors once Stan has estimated them).

I am trying to predict survival from brain cancer (n=70; of which n=40 censored) using Peltola’s/Vehtari’s freely available weibull survival model (single-group model with horseshoe priors):

http://becs.aalto.fi/en/research/bayes/diabcvd/ - thanks so much for this!

The Stan model is:

yobs ~ weibull(alpha, exp(-(mu + Xobs_bg * beta_bg + Xobs_biom * beta_biom)/alpha));
increment_log_prob(weibull_ccdf_log(ycen, alpha, exp(-(mu + Xcen_bg * beta_bg + Xcen_biom * beta_biom)/alpha)));

I have a number of models with different numbers of imaging biomarker covariates (Xobs_biom) and wanted to see which model was the “best one”. I have repeated each model k-times with each model trained on 8/9th of the n=30 uncensored survival data (and all of the n = 40 censored data) and used to predict the remaining 1/9th of the deaths.

The out-of-sample predictive score used is the computed log pointwise predictive density which I have calculated outside of Stan in Matlab as follows:

For each out-of-sample subject:

  1. Evaluate weibullpdf(alpha, exp(-(mu + Xobs_bg * beta_bg + Xobs_biom * beta_biom)/alpha))
    with the covariate values (Xobs_bg and Xobs_biom) for that subject, at time t, where t = actual time of death.
  2. Repeat this for all posterior samples of mu and alpha. Take the mean and then log this.

Following k-fold validation, each model, gives n=30 numbers (one for each death) which I sum to produce the ‘out-of-sample marginal log-likelihood’ aka ‘log evidence’ of the model (a negative value). The model with the highest value wins (assuming all models have equal prior probabilities, difference of 3 is good evidence). I have attached a sample result.

Ps. I have done the las bit outside of Stan (instead of using generated_quantities) both for my own education and also because all the models will take ages to re-run.

Thanks so much!


Great that you have found our code useful!

That seems like quite a challenging data.

Note that this code dosn’t have our new recommendation for setting the horseshoe hyperparameters [1707.01694] Sparsity information and regularization in the horseshoe and other shrinkage priors

Can you tell how many? I’m asking to get a better feeling whether horseshoe is useful and whether you might need regularized horseshoe ([1707.01694] Sparsity information and regularization in the horseshoe and other shrinkage priors) in this case. As you have quite many censored observations, and censored observations are like binary classification observations, you may need regularized horseshoe if the number of covariates is larger than the number of observations.

I think you should include censored observations in your cross-validation.

Correct. You could also compute for the censored cases
weibulldccdf=1-weibullcdf(…) with same parameters and with the censoring time.

It’s not ¨log evidence’. It could be called ‘log pseudo-evidence’ following Geisser and Eddy (1979).

It’s not evidence and you should not combine it with model prior probabilities. Although you could call it pseudo-evidence and use the same scale as for the evidence, it seems that it’s not so good approach and you should instead take into account also the uncertainty in the difference (http://link.springer.com/article/10.1007/s11222-016-9696-4, [1507.04544] Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC). Further evidence on the importance of taking into account the uncertainty can be found in our paper on combining models ([1704.02030] Using stacking to average Bayesian predictive distributions), where we compare also Pseudo-BMA (based on pseudo-evidence differences) and Pseudo-BMA+ (based on using both the difference and the uncertainty), and the latter gives clearly better results.

Unfortunately there is also some challenges in interpreting the difference and the related uncertainty as discussed in my other answer today Interpreting elpd_diff - loo package - #4 by avehtari

Note also that if you are using cross-validation to compare many models, you may overfit in the selection process as shown, e.g. in http://link.springer.com/article/10.1007/s11222-016-9649-y

For a survival models you might also want to compute something like Harrell’s C (http://onlinelibrary.wiley.com/doi/10.1002/sim.4026/abstract), which might have more easily interpretable scale.

Dear Prof Vehtari,

Many thanks for your quick and thorough reply. After reading a couple of those papers it is much more complicated than I thought! Currently, given this smallish dataset, I’m more interested in determining whether different datasets (imaging/ pathology) can all predict survival to a similar degree, rather than saying which dataset is definitively more predictive.

I have 3 model families all predicting survival. All have 2 background covariates (age and gender) + p biomarker covariates (mostly dummy-coded binary variables, all standardised):

  1. ‘Pathology’ with p=2 covariates
  2. ‘Clinical imaging’ with p=40 covariates
  3. Automated imaging with p=108 covariates

Currently, the models with highest pseudo-evidence (not taking model uncertainty into account) are between 3-10 variables in families 2 and 3, which seems not too sparse/ too dense?

OK - I can do that thanks.

For each family, I estimate the full model first and rank the covariates in terms of effect size (no cross-validation). Then I incrementally add each covariate in rank order to the base model (background variables only) and use cross-validation to get the optimal number of covariates, using the pseudo-evidence as a scoring rule. If I interpret your papers correctly, because I don’t use cross-validation in the covariate ranking phase, this will reduce overfitting relatively well. Better, of course, would be to use your BMA solution as reference and covariate projection, but I don’t know how to do this in Matlab.

Many thanks for highlighting this. I could not find a Matlab implementation for these estimates unfortunately. However I have add standard errors (+/- 1 se) to the k-fold estimated elpd (just for the first 20 covariates of the automated imaging model family) - there is a large variation!


However I’m not sure how I can take it further in Matlab myself easily. I could find model weights using the lognormal approximation of pseudo-BMA (from your paper):


but not sure how to proceed from there. I can see stacking performs better but am not sure how to implement this myself. My questions are really:

  1. What number of covariates are needed in each family to provide a useful prediction of survival?
  2. Are the best models from different model families equally informative roughly?

Thanks again!

Then it seems you would not need to do covariate selection within ‘Clinical imaging’ and ‘Automated imaging’.

You could just compare how adding group 1, 2, or 3 affects the performance and you use suitable priors for covariates in groups 2 and 3 to improve the identifiability. I don’t know what your imaging covariates are, but if covariates in group 2 and 3 are all relevant but highly correlating within a group, horseshoe prior is not so good prior. Horseshoe prior is good only if you can assume that only a small number of covariates are relevant. Also remember that in Peltola et al work the hyperparameters for the horseprior were not selected as well as we recommend in the later paper [1707.01694] Sparsity information and regularization in the horseshoe and other shrinkage priors

By comparing how adding group 1, 2, or 3 affects the performance you have less models to compare, which is good. Would this make sense in your problem? And only if this works, then you would check which covariates in the group are most important (and for that I can’t recommend anything else than the projection).

The effect sizes are also inferred from the data, so this approach is also using data twice, and thus the cross-validation is giving too optimistic estimates and you are likely to choose irrelevant covariates which just by chance had biggish effect size, and thus the selection process will overfit in this case, too.

This a good figure illustrating the problem . The fact that smaller models have slightly larger elpd, hints that your prior on covariates is not so good (see comments above).

You could use these weights to drop models with weight 0, but if all models have non-zero weight then you should not select single one, but combine the predictions from all of the.

Dear Prof Vehtari,

Many thanks for your comments - Sorry for my delayed response but I’ve been working through them step-by-step.

Yes this would work.

I have worked out wether each model is better than the baseline demographics model by calculating elpd (estimated using 9-fold CV of censored and uncensored data as you suggested) for each of the four models (I have added a new ‘null’ model of demographics only).

The models are:

  1. Demographic Null (2 covariates (age and gender) only)
  2. ‘Path’ (Model 1 + additional p=2 covariates)
  3. ‘Clinical imaging’ (Model 1 + additional with p=40 covariates)
  4. ‘Automated imaging’ (Model 1 + with p=108 covariates)

Correlations between predictors in model 3 are:


Correlations between predictors in model 4 are:


After reading https://arxiv.org/abs/1707.01694, I have adapted the survival models for 2 and 3 to use the regularized horseshoe prior with the recommended prior for tau (currently number of non-zero parameters = 50%). The results are:

I know this isn’t exactly your suggested design, but follwing this procedure, I though it would be reasonable to get a BMA of these 4 models - the weights reflecting their importance. The lognormal approximation of pseudo-BMA weights for each model to 4 decimal places was [0, 0.9998, 0, 0.0001]. Therefore a reasonable conclusion may be that the pathological data is paramount in predicting survival, right?

However, I’m worried that the models 3 and 4 are not optimally estimated. This is because:

a) model 3 performs worse than model 1 (and model 1 covariates are included in model 3),
b) when I previously calculated k-fold elpd with step-by-step entry of regressors (using higher effect-size variables first), the elpd gradually declined after 5-10 covariates. I understand that this was using data twice (and an overfit as you explained), but is it possible that, for example, the elpd of model 4 with all covariates (presumably many irrelevant), is lower than model 4 with a subset of covariates? If so, I’d need to perform covariate selection prior to this BMA that I’ve done right?
c) models 3 and 4 sometimes had high rhats, but this was not a problem after randomly initialising the chains closer to ‘0’. I hope there isn’t a big problem with this.
d) Your point that the horseshoe may not be the best prior for the regressors.

Proposed solutions:
The only other solutions I can think of are to:

  1. change the prior on tau for more or less shrinkage
  2. not use horseshoe, change all beta priors to gaussian (although this may cause problems with identifiability)
  3. use some other method of sub-selecting models within models 3 and 4.

Many thanks again for your invaluable help.

How about 4? Or did you mean to write 3 and 4?

In model 3 and 4 are you using a joint prior for all covariates? I would recommend using separate prior for covariate group especially for demographic and pathological covariates. This would allow pathological covariates to have strong effect in models 3 and 4, while imaging covariates could get a small effect.

Based on your covariate correlations figures, there are lot of correlations (several really close to -1) and horseshoe is not probably so good. QR transformation for imaging covariates could improve convergence (and then no HS).

How many chains and iteratons are you running? Given good priors and good convergence of MCMC, I don’t usually see decreasing performance when adding more covariates, so it hints that there is something going wrong here. It is also possible (likely) that linear model is not good for imaging covariates. I would test GPs, but unless you are using GP specific software (e.g. GPstuff or GPy), it requires a bit more work.

This is a big problem. Intialising all chains closer to 0 will make rhat overoptimistic (the initial points need to be overdispersed). Try QR transformation for imaging covariates (with no HS).

Thanks Prof Vehtari,

Whilst I implement what you suggested can I double-check a few things please:

The equations to calculate se(elpd_loo) from page 9 of https://arxiv.org/abs/1704.02030:

Shouldn’t se(elpd_loo)=1/n(sqrt(…)) ? Otherwise se will simply get bigger se for larger datasets?

Sorry - yes I meant models 3 and 4.

Models 1 and 2 have a single prior structure for all the covariates ‘background_beta’. The pseudocode is:

background_beta(i)_prior = beta_raw(i) * lamda(i) * tau
beta_raw ~ Normal(0,1) // vector
lamda ~ inv_chi_square(1) // vector
tau ~Normal(0,1) // real; a global parameter

Models 3 and 4 have the same prior for the above ‘background_beta’ covariates (2 of them) but a separate prior structure for the ‘biomarker_beta’ covariates:

background_biom(i)_prior = beta_biom_raw(i) * lamda_tilde(i) * tau
beta_biom_raw ~ Normal(0,1) // vector
lamda(i) ~ Normal(0,1) * sqrt(inv_gamma(0.5, 0.5)) // half Cauchy vector
tau ~ Normal(0,1) * sqrt(inv_gamma(0.5, 0.5)) // real; a global parameter
c_aux = inv_gamma(0.5slab scale, 0.5slab scale) // real; slab scale is pre-specified
lamda_tilde(i) = f(lamda(i), tau, slab scale) // vector; this function from your paper to regularise large values of lamda

Optimising model 3 and 4 slab scale improves model stability (haven’t checked performance yet). I’ve tried replacing the second ‘biom’ group prior with a duplicate of the first, but this hasn’t worked yet. I’m still working on it and haven’t done the QR factorisation yet (see below).

Thanks for this idea. I had a look at the Stan tutorial for this (http://mc-stan.org/users/documentation/case-studies/qr_regression.html); but couldn’t work out how to apply it to the survival model which has two X’s (censored/ uncensored) parameterised by the same betas. Would it be correct to perform QR on each X individually and model two beta_tildes (censored/ uncensored). I had a go at this but got stuck in the transformed parameters block (see below);

// X1 is [m1,n1] covariates for uncensored data;
// X2 is [m2,n2] covariates for censored data;

transformed data {

// Compute, thin, and then scale QR decomposition for each data block separately
matrix[N1, M1] Q1 = qr_Q(X1)[, 1:M1] * N1;
matrix[M1, M1] R1 = qr_R(X1)[1:M1, ] / N1;
matrix[M1, M1] R_inv1 = inverse(R1);

matrix[N2, M2] Q2 = qr_Q(X2)[, 1:M2] * N2;
matrix[M2, M2] R2 = qr_R(X2)[1:M2, ] / N2;
matrix[M2, M2] R_inv2 = inverse(R2);

parameters {
vector[M_biom] beta_biom_raw

transformed parameters {
beta_biom = beta_biom_raw .* regularisation_priors; // priors with structure as above
beta_biom = R_inv1 * beta_tilde1; // this doesn’t look correct?
beta_biom = R_inv2 * beta_tilde2; // this doesn’t look correct?

yobs ~ weibull(…Q1 *beta_tilde1…)
increment_log_prob(weibull_ccdf_log(…Q2 *beta_tilde2…)

Once again - many thanks for your help!

Also I’m using NUTS with 5000 iterations plus 100 warmup.

Remember that elpd_loo was defined as sum of elpd_loo,i’s and thus se of that sum should get bigger.

No Do the QR transformation for X which has covariates for both, and then divide the transformed covariates to two parts.

That seems quite short and may affect the the performance of the adaptation of NUTS algorithm.

Thanks for the comments - I’m still having trouble with QR transformation. Can one do it where number of predictors>> number of observations? I couldn’t work out how to do this (and to rescale back to the original coefficients).

Thanks again

I think it would be better to ask this question again in a new topic with a title referring to QR transformation. Instead of QR, some other transformation might be useful.

Dear Prof Vehtari,

Many thanks for your suggestions and help. I’d hope to acknowledge you in the final manuscript if that’s OK.

I have had a good look at the data now and found that:

  1. As you suggested, gaussian heirarchical priors are much better than horseshoe for this dataset.

  2. Using gaussian priors, all chains converge very well and Rhat is always 1; and number of effective samples is between 10-15000 (total 20000 iterations using 4 chains, after 1000 warm-up) in the betas I am interested in.

  3. Given the above priors, it doesn’t make a huge difference if I transform the data using, for example, PCA.

  4. The predictive k-fold validation is difficult/unstable and I have abandoned it. The dataset is so small (27 deaths in 70 subjects) that there are a number of influential observations that make the model highly variable on repeated runs (and poor at predicting). I think it would be more sensible - for this dataset - to acknowledge this as a significant restriction and wait for a larger dataset to look at this properly. Indeed there may be a non-linear element as you suggested but I think a larger dataset is needed to probe this effectively. Instead I have calculated the marginal log likelihood of the models (or should this be called pseudo-marginal likelihood?) - simply as a goodness of fit measure (appreciating that this will over-fit the data).

  5. I have calculated AUC as you suggested too.

Bottom line is that both imaging and pathology covariates can be used to model survival reasonably well (better than null - age and gender only), but that a larger sample would be needed to properly model this predictively. You can’t really determine any differences between the models of interest with this dataset. I hope you agree that this is reasonable given the limitations of the data?


I hope it’s 10000-15000?

Given the small noisy data, this is likely.

Yes that is small, and survival observations are often weakly informative. On the other hand a number of highly influential observations hint that maybe Weibull is not the sufficient for describing the variation.

Yes. Even if you would change the Weibull and would get the model to be lass variable on repeated runs, it’s likely that you couldn’t infer more about the relevance of the predictors.

If you refer to cross-validation log predictive densities, I would prefer “expected log predictive density” (elpd) or “cross-validation log predictive densities”. Pseudo-marginal likelihood is not commonly used term and might cause confusion.



Actually these are not cross-validated (It was difficult to obtain a consistent cross-validated model). It is the within sample log-likelihood of the model. I generate a model on the full dataset, and then calculate the log-likelihood of the data for each subject. Then sum across subjects. I see this more of a ‘goodness-of-fit term’, if that’s reasonable? I have not tried to be predictive, just to say that the models have reasonable fit.

With this approach you can’t compare models, and you can’t say that e.g. pathology covariates would be helpful for predicting future survival.

When you wrote that you had abandoned k-fold validation, I assumed you switched to PSIS-LOO.

I don’t understand this. You reported very good mixing of the chains. You shouldn’t have problems with cross-validation then.

repeated runs of MCMC or runs with different subset of the data? And of course it’s likely to be poor at predicting, as most of survival models are and you have a very small data set.

I recommend

  • use PSIS-LOO, show khats and pointwise elpd_i’s (this would also help to understand your problem)
  • if many khats>0.7, run k-fold-cv and with default total of 4000 iterations (it should be enough with the 50-75% relatibe efficiency you reported), and show pointwise elpd_i’s
  • you can find code for LOO-AUC in GitHub & BitBucket HTML Preview

My suggestion was to hold off saying ‘x can be used to predict future survival’ but instead say ‘x can be used to model survival, but larger datasets are needed to determine if this can be used predictively’ i.e. to restrain the inference on this small dataset. But if we can’t roughly compare models based on the log likelihood (of the observed data), then this is a problem anyway.

I mean’t with repeated runs of k-fold validation; different folds would have different values for the betas. This is anecdotal, but I assumed this was due to influential observations/ a low number of observations.

Thanks - I have done this with your code below:

As you can see, the elpd_loo of each model is lower than the marginal log likelihood of the observed data (as expected), but much lower for the last two models - so much so that they are indistinguishable from the null model (demographic data, which is nested within all other models). If you agree that the k-hats are reasonable (a few values over 0.7), then does this mean that the improvement in marginal likelihood of the last two models is not generalisable and all due to overfit? Even with the noisey data this would be very surprising. What do you think?

Thanks - I’ve used the AUC count estimate (taking censoring and time into account) from the ref you suggested. It is not a cross-validated AUC however. I was unable to access your link for some reason (error " we can’t show files that are this big right now") but I’m not making any inferences with the AUC so less concerned about it at the moment.

Just to add that combined models (pathology + imaging markers) perform somewhere in the middle, so is there an inherent ‘model complexity’ penalty that the elpd_loo is adding (out of proportion to the effect of increased prior volume for a greater number of covariates)?

Yes, that’s completely natural.

Excellent figures to clarify what’s going on.

Yes, that’s why you need to use cross-validation!

With high khats the PSIS-LOO estimates tend to be biased upwards, and thus in this case even with some khats>0.7 we can be quite confident that models 3 and 4 are not better than the null model. Also it’s interesting that null model has all khats<0.5 so it seems there is no outliers in y values. Model 2 has one khat>0.7, but models 3 and 4 have several khats>0.7, so it seems that covariates have some outlier values making those observations highly influential although it might be explained also just by the increased dimensionality of covariate space.

You are using the term marginal likelihood in a confusing way. If likelihood is p(y|x,theta,M), then marginal likelihood is p(y|x,M). I assume you computed log posterior predictive density sum_{n=1}^N log p(y_n|x_n,D) and log loo predictive density sum_{n=1}^N log p(y_n|x_n,D_{-n}).

Improvement in log posterior predictive density of the last two models is not generalisable and is all due to overfit. This is possible.

Let’s try with a different service

No. I think your imaging covariates have very strange distribution which is messing up your model.