Median survival probabilities from stan_surv fit

I have fit a Cox model (with a cluster-specific random intercept) using stan_surv. Is there an easy way to predict the median survival time for each observation? Note that this is not what is returned by posterior_survfit (at least not by default), which returns the median of the posterior of the survival probabilities at a number of time points. The “median survival time” is the (first) time at which the survival probability drops below 50%.

Ideally I would be able to get the posterior for the median survival time, for each observation. Is there an easy way to do this?

As an aside, I came across a potential bug. When I fit the model I included a random effect (1|id), where id is a factor. When I call posterior_survfit, I get the following warning:

In model.frame.default(TermsF, data, xlev = xlevs, drop.unused.levels = drop.unused.levels,  ... :
  variable 'id' is not a factor

And then in the return object, the id column just has the numbers 1 through the number of levels, as opposed to the actual factor labels. I assume these correspond to as.numeric(id), but it would be more convenient if the function supported factors without the warning.

Thanks in advance.

Hi, sorry for not getting to you earlier; your question is relevant and well written.
Tagging @jonah, as I am not very familiar with stan_surv. My guess is that this is not directly possible, because you need the baseline hazard to calculate the survival probabilities at a time point. However, the model can only estimate the baseline hazard over the range observed in the data, and there is no guarantee that the median survival time is within this range.

So my guess is that your best bet is to predict with control = list(epoints = something_large) and then search for the median there, but this will not give you actual samples of the median survival time, so it is unclear how to combine the uncertainties… Maybe @jonah knows if there is a way to get the actual samples of the trajectories?

Best of luck with your model!

I’m probably only slightly more familiar with stan_surv than Martin is so I’m not entirely sure either, but I think @sambrilleman or @ermeel would know.

Hi all - sorry for not replying earlier.

So I think the general intuition discussed here is right. Essentially posterior_survfit() doesn’t have an option to return the median survival time. And as @martinmodrak mentioned, there is actually no guarantee that the median survival time exists in the time range of the observed data.

The slightly hacky approach would be similar to what @martinmodrak suggested – essentially predict for lots of values of t and then search for where the survival probability is 0.5 (see the sketch after the list below). Alternatively, you could write a little function that iterates (e.g. a for loop, or something smarter like bisection) to plug in values of t and check the survival probability, continuing until it reaches 0.5 within some tolerance.

I believe that:

  • the median estimate of the median survival time will be the value of t at which the median estimate of the survival probability is equal to 0.5
  • the upper (lower) bound of the X% posterior uncertainty interval for the median survival time will be the value of t at which the upper (lower) bound of the X% posterior uncertainty interval for the survival probability is equal to 0.5

Some (or all) of these three quantities (i.e. median, lower bound, upper bound) for the median survival time might be undefined for any one dataset.
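
To make that concrete, here is a rough, untested sketch of the grid-search approach. It assumes mod is a fitted stan_surv object, that the data frame returned by posterior_survfit() has columns named id, time, median, ci_lb and ci_ub, and that control = list(epoints = ...) sets the grid resolution – all worth double-checking against the docs for your rstanarm version:

ps <- posterior_survfit(
   mod,                              # a fitted stan_surv object (hypothetical here)
   times = 0,                        # predict from time zero onwards
   extrapolate = TRUE,
   control = list(epoints = 1000))   # a fine grid of time points

# First time a summary of the survival curve drops to 0.5 or below,
# assuming rows are ordered by increasing time within each id;
# returns NA if the curve never crosses 0.5 in the predicted range
first_cross <- function(surv, time) {
   i <- which(surv <= 0.5)
   if (length(i) == 0) NA_real_ else time[min(i)]
}

median_surv <- do.call(rbind, lapply(split(ps, ps$id), function(d) {
   data.frame(
      id       = d$id[1],
      estimate = first_cross(d$median, d$time),
      lower    = first_cross(d$ci_lb, d$time),   # lower bound of S(t) crosses first
      upper    = first_cross(d$ci_ub, d$time))   # upper bound of S(t) crosses last
}))

Any id whose curve (or interval bound) never reaches 0.5 on the grid comes back as NA, which is the undefined case mentioned above.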

The less hacky, correct way to do this would be to explicitly solve for t. This can be done in closed form for some simpler baseline hazards, as long as the linear predictor doesn’t have time-varying covariates or coefficients. However, for more complex baseline hazards, or when there are time-varying effects, you often can’t invert the survival function and solve for t, and so I think you would have to use a numerical root-finding algorithm. I’m not sure whether the default baseline hazard for stan_surv (i.e. M-splines on the baseline hazard / I-splines on the cumulative baseline hazard) is the former or the latter – it would depend on being able to invert the basis functions for the I-splines and solve for t, I guess. So I’m not sure how easy/hard this extra feature would be to implement. But the short answer here is that it isn’t implemented at the moment, and so the aforementioned hackier version is really the only option atm…
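
To illustrate what “solving for t” looks like, take a Weibull proportional-hazards model, where the survival function is S(t | x) = exp(-lambda * t^gamma * exp(x'beta)). The numbers below are made up, and the parameter names are just for illustration, not stan_surv internals:

lambda <- 0.05   # baseline scale (illustrative value)
gamma  <- 1.3    # baseline shape (illustrative value)
lp     <- 0.4    # linear predictor x'beta for one observation

# Closed form: solve exp(-lambda * t^gamma * exp(lp)) = 0.5 for t
t_closed <- (log(2) / (lambda * exp(lp)))^(1 / gamma)

# Numerical root finding gives the same answer, and also works when the
# baseline hazard can't be inverted analytically
f <- function(t) exp(-lambda * t^gamma * exp(lp)) - 0.5
t_numeric <- uniroot(f, interval = c(1e-8, 1e4))$root

Doing either calculation draw-by-draw across the posterior would, in principle, give you actual posterior samples of the median survival time.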

Disclaimer: I’ve not worked with the median survival time before really, so my understanding of the process mentioned above could be a bit off.

Hope that helps!

Yeah this sounds quite buggy, sorry!! The handling of the data types for the grouping factors is pretty poor I think, and probably needs some tidying up.

The fact that the factor labels aren’t even being retained is a bit scary, and hopefully it is actually using the correct group-specific coefficients for the predictions.

This is something that should get fixed (although I really don’t know when). I know similar-ish things have been discussed elsewhere, but I don’t know if we have an open GitHub issue on it. If you have a simple reproducible example you could submit as a GitHub issue, that would be super handy! That way, at least we have it noted down and won’t forget about it, so that hopefully once someone has the time we can look at fixing it…

Sorry to send another message, but I was just using posterior_survfit (it had been a while!) and now I see what you mean about the id column in the data frame returned by posterior_survfit. Sorry, yeah, that is a bit confusing. The id column is actually an identifier linking to the row number of the prediction data. So the first row of the covariate data used in the predictions is considered id = 1, the second row of the covariate data is considered id = 2, etc. The reason is that you can have multiple rows returned (e.g. extrapolating predictions at different times) for a single input row of covariate data. I believe this is true regardless of whether you are predicting for the estimation/training data or for a newdata dataset.
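
For example, something along these lines (a minimal sketch, where mod stands for the fitted model and nd for whatever data frame was passed as newdata) should attach the covariate rows back onto the predictions:

ps <- posterior_survfit(mod, newdata = nd, times = 1, extrapolate = FALSE)

# Each value of ps$id indexes a row of nd, so repeating those rows
# lines them up with the (possibly multiple) prediction rows
merged <- cbind(nd[ps$id, , drop = FALSE], ps)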

But I now see this would get confusing when one of your own covariates or variables is called id. The safer thing would have been for me to return the entire covariate data with the predictions. :-(

I’ve opened a GitHub issue, so that hopefully we can get this fixed up one day!

I think the warnings mentioned here might not necessarily be related to the use of a variable called id. As far as I can tell, this warning is produced by posterior_survfit for any model containing a categorical predictor and a random intercept. For example, running:

library(survival)   # for Surv()
library(rstanarm)   # needs a version of rstanarm that includes stan_surv
set.seed(123)       # so the simulated data are reproducible

data <- data.frame(
   trt = rep(c("A", "B"), each = 15),
   site = rep(5:10, each = 5),
   status = sample(0:1, 30, replace = TRUE),
   days = sample(10, 30, replace = TRUE))

mod <- stan_surv(
   formula = Surv(days, status) ~ trt + (1|site),
   data = data,
   basehaz = "weibull",
   chains = 1,
   iter = 500,
   cores = 1)

posterior_survfit(
   mod, 
   type = "surv",
   newdata = data[1:10,],
   times = 1,
   extrapolate = FALSE)

Returns the warning:

  Warning messages:
  1: In model.frame.default(Terms, data, xlev = xlevs, drop.unused.levels = drop.unused.levels, :
    variable 'trt' is not a factor

Alternatively, removing the random intercept makes the warning go away:

data <- data.frame(
   trt = rep(c("A", "B"), each = 15),
   site = rep(5:10, each = 5),
   status = sample(0:1, 30, replace = TRUE),
   days = sample(10, 30, replace = TRUE))

mod <- stan_surv(
   formula = Surv(days, status) ~ trt,
   data = data,
   basehaz = "weibull",
   chains = 1,
   iter = 500,
   cores = 1)

posterior_survfit(
   mod, 
   type = "surv",
   newdata = data[1:10,],
   times = 1,
   extrapolate = FALSE)

If both trt and site are set up as factors (which they probably should be?), then it spits a warning for both variables. @sambrilleman do you only receive warnings when using a variable called id?
