Not sure if others have experience with this, but I’ve been using the time-dependent Brier score, specifically a variant of it that uses a log loss. It’s summarized well in Graf et al. (1999). It has the limitation that it uses a point estimate of the survival probability at a particular time, but perhaps you could summarize it over the posterior distribution to get a sense of calibration.
In this formulation, censored observations contribute to the score directly up until the time they are censored, and indirectly after that, through the weights assigned to observed and non-observed events. It’s not ideal, but unless you model the censoring distribution directly, it seems like the best one can do.
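To make the weighting concrete, here is a rough sketch of how I understand the inverse-probability-of-censoring-weighted score at a single time `t`. It is not taken from any particular package: the function names (`km_censoring`, `ipcw_score`, `score_over_draws`) and the argument layout are my own, and ties between event and censoring times are handled in the simplest possible way. The censoring distribution is estimated with a plain Kaplan-Meier curve, and `log_loss=True` swaps the squared error for a log loss.

```python
import numpy as np

def km_censoring(times, events):
    """Kaplan-Meier estimate of the censoring survival function G(t) = P(C > t).

    Censored observations (events == False) are treated as the "events" here.
    Returns a function G(t, left=False); left=True evaluates G just before t.
    """
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=bool)
    uniq = np.unique(times[~events])                      # distinct censoring times
    at_risk = np.array([(times >= u).sum() for u in uniq], dtype=float)
    n_cens = np.array([((times == u) & ~events).sum() for u in uniq], dtype=float)
    steps = np.concatenate(([1.0], np.cumprod(1.0 - n_cens / at_risk)))

    def G(t, left=False):
        idx = np.searchsorted(uniq, t, side="left" if left else "right")
        return steps[idx]

    return G

def ipcw_score(times, events, surv_prob, t, log_loss=False, eps=1e-12):
    """Censoring-weighted prediction error at time t.

    surv_prob[i] is the predicted probability that subject i survives past t.
    """
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=bool)
    s = np.clip(np.asarray(surv_prob, dtype=float), eps, 1.0 - eps)
    G = km_censoring(times, events)

    died = (times <= t) & events      # event observed by t: true status is 0
    alive = times > t                 # still at risk at t: true status is 1
    w = np.zeros_like(times)          # censored before t: weight stays 0
    w[died] = 1.0 / G(times[died], left=True)
    w[alive] = 1.0 / G(t)

    if log_loss:
        err = np.where(died, -np.log(1.0 - s), -np.log(s))
    else:
        err = np.where(died, s ** 2, (1.0 - s) ** 2)
    return float(np.mean(w * err))

def score_over_draws(times, events, surv_draws, t, **kwargs):
    """Apply ipcw_score to each posterior draw (one row of surv_draws per draw)."""
    return np.array([ipcw_score(times, events, s, t, **kwargs) for s in surv_draws])
```

Observations censored before `t` get weight zero in the sum, but they still shape the Kaplan-Meier estimate `G`, which is where their indirect contribution comes from. Passing a draws-by-subjects array of predicted survival probabilities to `score_over_draws` gives one score per posterior draw, which could then be summarized however you like.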
One concern I have when using predicted survival times is that the posterior distribution of this predicted value can be very wide (often too wide to be clinically useful!), so this measure may not be sensitive to problems with model fit.