My suggestion was to hold off on saying ‘x can be used to predict future survival’ and instead say ‘x can be used to model survival, but larger datasets are needed to determine whether it can be used predictively’, i.e. to restrain the inference given this small dataset. But if we can’t even roughly compare models based on the log likelihood of the observed data, then this is a problem anyway.
I meant with repeated runs of k-fold validation: different folds would give different values for the betas. This is anecdotal, but I assumed it was due to influential observations and/or the low number of observations.
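To make that concrete, the kind of check I mean is sketched below. It is only an illustration on the built-in `lung` data with a plain `coxph` fit (not the actual Bayesian models in question), showing how per-fold coefficient estimates can bounce around when n is small:

```r
# Illustration only: repeated k-fold splits on survival::lung with a plain Cox
# model, to show how per-fold coefficient estimates vary across splits.
library(survival)

dat <- na.omit(lung[, c("time", "status", "age", "sex", "ph.ecog")])
K <- 5

set.seed(1)
for (run in 1:3) {                                  # three repeated k-fold runs
  folds <- sample(rep(1:K, length.out = nrow(dat))) # fresh random fold assignment
  betas <- sapply(1:K, function(k) {
    fit <- coxph(Surv(time, status) ~ age + sex + ph.ecog,
                 data = dat[folds != k, ])          # fit on training folds only
    coef(fit)
  })
  cat("Run", run, "- range of beta for ph.ecog across folds:",
      round(range(betas["ph.ecog", ]), 3), "\n")
}
```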
Thanks - I have done this with your code below:
As you can see, the elpd_loo of each model is lower than the marginal log likelihood of the observed data (as expected), but it is much lower for the last two models - so much so that they are indistinguishable from the null model (demographic data only, which is nested within all the other models). If you agree that the Pareto k-hats are reasonable (a few values over 0.7), does this mean that the improvement in marginal likelihood for the last two models is not generalisable and is entirely due to overfitting? Even with the noisy data this would be very surprising. What do you think?
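For anyone following along, the comparison I ran was along these lines (a sketch only; `fit_demo`, `fit_m3` and `fit_m4` are placeholder names for the fitted models, and I’m assuming brms/loo here):

```r
library(brms)  # or rstanarm; anything that exposes pointwise log-likelihoods
library(loo)

# fit_demo, fit_m3, fit_m4 are placeholders for the fitted survival models
# (fit_demo = demographic-only / null model, nested within the others).
loo_demo <- loo(fit_demo)
loo_m3   <- loo(fit_m3)
loo_m4   <- loo(fit_m4)

# Pareto k diagnostics: values above 0.7 flag observations where the
# importance-sampling approximation (and possibly the model) is strained.
print(pareto_k_table(loo_demo))

# elpd differences with standard errors; differences within roughly 2 SE
# are usually treated as indistinguishable.
loo_compare(loo_demo, loo_m3, loo_m4)
```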
Thanks - I’ve used the AUC count estimate (taking censoring and time into account) from the reference you suggested. It is not a cross-validated AUC, however. I was unable to access your link for some reason (error: “we can’t show files that are this big right now”), but since I’m not making any inferences from the AUC I’m less concerned about it at the moment.
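For completeness, one way to get a time-dependent, censoring-aware AUC in R is via the timeROC package. This is just a sketch on the `lung` data with an in-sample (not cross-validated) risk score, and it may not be the exact count estimator from the reference:

```r
# Sketch: time-dependent AUC accounting for censoring, via timeROC.
library(survival)
library(timeROC)

dat <- na.omit(lung[, c("time", "status", "age", "sex", "ph.ecog")])
fit <- coxph(Surv(time, status) ~ age + sex + ph.ecog, data = dat)
lp  <- predict(fit, type = "lp")                   # in-sample risk score

roc <- timeROC(T      = dat$time,
               delta  = as.numeric(dat$status == 2),  # 1 = event, 0 = censored
               marker = lp,
               cause  = 1,
               times  = c(180, 365),                  # evaluation times in days
               iid    = TRUE)                         # enables standard errors
roc$AUC
```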