Survival models in rstanarm

I have one problem with this approach: in the time-varying hazard ratios framework proposed by Royston & Parmar (see graphic below), how can you guarantee that the derivative w.r.t. x of the log cumulative hazard ratio, which you need for the likelihood, is non-negative?

and

Here we need to replace s in the second figure by the term \gamma_0 + \boldsymbol{\gamma}^T \mathbf{v}(x) from the first graphic. For instance, if I choose a spline basis \mathbf{v}(x) that is guaranteed to be monotone in x (e.g. I-splines or integrated B-splines), then we need to ensure that all the elements of \boldsymbol{\gamma} are non-negative. Importantly, this vector is defined in the first figure in terms of the covariate vector \mathbf{z}, which in general is real-valued and not constrained to be non-negative. Additionally, since the covariates depend on the subject, \boldsymbol{\gamma} is actually subject-specific! Thus we need to pick the matrix \{\gamma_{jl}\} so that all the subject-specific vectors \boldsymbol{\gamma} are non-negative. Depending on the heterogeneity of the covariates, this seems like a hopeless task…
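To write down how I read the figures (they are from the R&P paper and not reproduced here, so this is only my paraphrase, ignoring any purely proportional-hazards terms):

\log H(t \mid \mathbf{z}) = \gamma_0 + \boldsymbol{\gamma}^T \mathbf{v}(x), \qquad x = \log t, \qquad \gamma_j = \gamma_{j0} + \sum_l \gamma_{jl} z_l .

For the likelihood we need a non-negative hazard, i.e. \partial_x \log H = \boldsymbol{\gamma}^T \mathbf{v}'(x) \ge 0 for all x; with a monotone basis (v_j'(x) \ge 0 for all j) a sufficient condition is \gamma_j \ge 0 for every j, and hence for every subject’s \mathbf{z}.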

I think this is the reasonable thing to do.

I was also thinking that, depending on the problem, cubic growth in the hazard after the last knot might make more sense physiologically than linear growth outside the boundary. If we get a new patient whose survival time is beyond the last knot, then that patient has survived an amount of time that is unprecedented. Thus, for that patient to be consistent with the rest of the patients, we would expect their hazard to be growing rapidly. I’m not sure if that makes sense?

Also, how is the Stan model you wrote doing? I like the idea of evaluating the I-spline basis functions outside of Stan, since the survival/censoring times are fixed. That should save some computation. I was thinking that in some cases it might be useful to do it from within Stan, though. One case I was thinking of is if we have a censored patient whom we observed up to time t_i: we may want to model and infer their actual time of death as an unknown parameter in Stan, instead of treating them as censored, e.g.

real<lower=t_i> D_i;

The Bayesian framework and Stan easily allow us to do this, which is nice, but it would require evaluating the I-splines from within the Stan code. Another example I can think of where we’d want to do that is if we want to sample possible times the patient could have died in the generated quantities block. I think that would be pretty useful for model diagnostics.
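To make the first idea concrete, here is a minimal, self-contained sketch for a single censored subject. It uses a Weibull baseline rather than the I-spline basis, purely so it fits in a few lines, and the names t_cens, D, alpha, sigma are just for illustration:

data {
    real<lower=0> t_cens;           // observed censoring time for this subject
}
parameters {
    real<lower=0> alpha;            // Weibull shape
    real<lower=0> sigma;            // Weibull scale
    real<lower=t_cens> D;           // latent (unobserved) death time, known to exceed t_cens
}
model {
    alpha ~ normal(0, 2);           // half-normal prior (alpha is constrained positive)
    sigma ~ normal(0, 10);          // half-normal prior
    // the censored observation enters as a latent event time with support D > t_cens;
    // marginalising D out recovers the usual S(t_cens) contribution to the likelihood
    D ~ weibull(alpha, sigma);
}

With the I-spline cumulative hazard the same construction would need the basis evaluated at D inside Stan, which is exactly my point above.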

@ermeel Yeah, fair point, I hadn’t really thought about the positivity constraints, or the general issues for the approach in a Bayesian setting, until this discussion began.

I noticed in your Stan program you were putting the I-splines on the log cumulative hazard rather than just the cumulative hazard, and you mentioned here that that is your preference – I was just wondering why?

Did you get your I-spline Stan file working, i.e. did the examples in your test.R file recover the correct hazard ratios?

A couple of weeks ago I started on a package just like @jonah had suggested, only missing the ‘r’ in the name! See here. I haven’t got too far on it, and the further I get, the more functions I end up porting across from rstanarm, and the more I begin to think these modelling functions could just go in there so as not to create unnecessary work with priors, predictions, etc. In any case, I had tried a model using the R&P spline basis – I hadn’t got it running, and then you started this convo, so I tried using the I-spline basis instead, as you have done – but I haven’t got the Stan model working yet (at least not the splines model – a Weibull model seems to work ok). Anyway, take a look if you’re interested.

Off the top of my head, this sounds not unreasonable; however, why should cubic growth be favoured here? The nice thing with the I-splines is that we can pick the order, and hence the boundary asymptotics.

I haven’t done very extensive tests, but here are my preliminary results (I would really appreciate it if you could verify or comment here):

  • For the breast cancer dataset (from R&P) I could recover the estimated treatment effect from the literature, i.e. \beta_1, \beta_2 in R&P’s paper.

  • For the pediatric cancer dataset, I could recover the \beta vector corresponding to the covariates, as compared to what Hastie & Efron list in their book.

  • For the leuk dataset version I found here, I could recover the author’s values of the treatment effect coefficient, which he obtained from coxph with different options to resolve ties.

All the intervals in the figures contain 90\% probability mass of the posterior for that parameter.

I presume the above would imply that I can recover the hazard ratios, if I am not mistaken (since the ratios are \exp(\beta), right?)

I agree. In this regard, it would be super helpful if we could convince ourselves that the recursive expression in the Ramsay paper, which allows one to write the I-splines in terms of the M-splines without integrating, is applicable.

This is an interesting example application that I haven’t considered yet.

Yes, that is a very useful diagnostic that I should consider.

Maybe I was a bit vague here. What I meant is that in a Bayesian setting I prefer a spline basis that enforces monotonicity explicitly, rather than one that does it “implicitly” by conditioning on the data. I was not referring to any preference for using the I-splines to model the log cumulative hazard over using them to model the cumulative hazard directly. But when I think about it, I see that the two are fundamentally different, right? In the former the splines are polynomial in \log(t), so the cumulative hazard is an exponential of polynomials in \log(t), whereas in the latter case the cumulative hazard is itself polynomial in \log(t). Comparing with the Weibull case, for instance, this latter approach leads to a much weaker decay of the survival function in \log(t). So I guess it really depends on our physiological model, right?
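To spell it out against the Weibull reference, where \log H(t) = \alpha \log t - \alpha \log \sigma is linear in \log t:

  • I-splines on the log cumulative hazard: \log H(t) = p(\log t) with p a piecewise polynomial, so S(t) = \exp\{-e^{p(\log t)}\}, a double exponential in \log t, of the same flavour as Weibull.

  • I-splines on the cumulative hazard itself: H(t) = p(\log t), so S(t) = \exp\{-p(\log t)\}, only a single exponential of a polynomial in \log t, hence the much weaker decay mentioned above.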

In the meantime I also added a version that supports random-walk priors for the spline-basis coefficients (\gamma in the Royston & Parmar terminology). Since I currently calculate the spline basis (and its derivative) outside of Stan, we can now experiment with a variety of different splines and see whether the random-walk prior would allow us to work with a much larger number of knots (see Splines In Stan). For instance, if we use integrated B-splines, a random-walk prior would favour a log cumulative hazard that is linearly increasing (since the prior favours flat derivatives, which are the B-splines).

One nice thing about the case where we use a random-walk prior (constrained to be non-negative) and integrated B-splines, \hat{B}_i(x), is that (since \sum_{i} B_i(x) \equiv 1) we have \sum_{i} \hat{B}_i(x) = x, so when all weights are equal we can obtain (with an appropriate intercept term) the Weibull distribution. My understanding of the random-walk prior is that it is consistent with prior knowledge stating that the spline-basis coefficients change only slightly from one to the next, with a tendency for neighbouring coefficients to be equal (on average). This would therefore be a nice way to start from Weibull and explore continuous deviations from it, while allowing for much more flexibility since many more knots become possible.
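To make the Weibull connection explicit (writing x = \log t, L for the lower boundary knot, and taking the integrated basis to be \hat{B}_i(x) = \int_L^x B_i(u)\,du):

\sum_i B_i(x) \equiv 1 \quad \Rightarrow \quad \sum_i \hat{B}_i(x) = x - L,

so with equal weights \gamma_i = c \ge 0 and an intercept \gamma_0,

\log H(t) = \gamma_0 + c \sum_i \hat{B}_i(\log t) = (\gamma_0 - cL) + c \log t \quad \Leftrightarrow \quad H(t) = e^{\gamma_0 - cL}\, t^c,

which is exactly the Weibull cumulative hazard with shape c.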

Hi all,
Just to re-iterate that:

  1. I’m absolutely happy to collaborate with people who are willing to contribute to survHE. My main objective there is to aid in the task of performing survival analysis to then feed a wider health economic model, but the survival analysis can also be used as a standalone part…
  2. I did implement (a few versions of) Royston-Parmar models in the current version of survHE (BTW, in addition to the version on CRAN, see the development one at https://github.com/giabaio/survHE). That’s not to say that more efficient or cleverer implementations can’t be made, and I’d be happy to discuss!

HTH
Gianluca

I’ve thought a lot about this hazard extrapolation problem, and I think it’s worth treading very carefully. The usual “predictions” that one gets from a frequentist Cox model (using an estimator of the baseline cumulative hazard) are not survival times but \hat{S}(t) for various t. The traditional approach to extrapolation is that \hat{S}(t) past the last event time is said to be constant (i.e., hazard is exactly zero). This assumption is both (1) extremely stupid and (2) obvious to the end user. Any other extrapolation method is certainly going to win out on point (1) but lose on point (2). At least for the default behaviors, I favor stupidity in an obvious way over cleverness in an opaque way. For example, I can think of applications where the hazard would not increase rapidly beyond the last observation time (e.g., time to surgery-related infection, where very few infections occur after the end of study), making cubic extrapolation a poor choice.

With respect to treating censored event times as parameters, I see it like any other missing data problem (except maybe more difficult because we’re only observing part of the event time distribution and we may not want to make strong assumptions about the rest of its shape). Putting the unknown event times in the generated quantities block reduces the dimension of the parameter space explored by the sampler, right? Why have more parameters than you need if imputation in the generated quantities block works just as well? You’ll need to assume the form of the hazard function for all t>0 either way. Am I missing something?

Yeah, it’s almost like you need a model appropriate to the situation… :) +1 to not sampling event times with HMC; in some ecological models the distribution is so strongly seasonal that it periodically hits zero.

Yeah, for this reason I do think adding all this functionality to rstanarm is still a good option, although I’m not opposed to a separate package.

Yes, that’s a good point. I guess, as @ermeel pointed out, we can play with the boundary extrapolation by changing the order of the I-splines. Also, for that type of problem, where a certain number of patients will never experience the event of interest within the considered time frame, a lot of people have considered cure models.

Yes, you’re right. It would be better to just compute the death times in the generated quantities block. I’m a little rusty on survival analysis, but I guess the whole point of Cox models is that you can deal with censored data without having to make assumptions about the distribution of when the patient “would have” died. I was just making the point that there are reasons to do the spline computation in Stan instead of precomputing outside of Stan.
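For concreteness, here is a minimal sketch of what that could look like, using a Weibull model and only the censored subjects to keep it short (variable names such as t_cens and t_imputed are just for illustration); the imputation draws from the event-time distribution conditional on survival past the censoring time, via the inverse CDF:

data {
    int<lower=0> N_censored;                    // number of right-censored subjects
    vector<lower=0>[N_censored] t_cens;         // their censoring times
}
parameters {
    real<lower=0> alpha;                        // Weibull shape
    real<lower=0> sigma;                        // Weibull scale
}
model {
    alpha ~ normal(0, 2);                       // half-normal priors (both parameters are positive)
    sigma ~ normal(0, 10);
    // usual right-censoring contribution: sum of log S(t_cens)
    target += weibull_lccdf(t_cens | alpha, sigma);
}
generated quantities {
    vector[N_censored] t_imputed;               // imputed death times for the censored subjects
    for (n in 1:N_censored) {
        // CDF at the censoring time: F(t) = 1 - exp(-(t / sigma)^alpha)
        real F = 1 - exp(-pow(t_cens[n] / sigma, alpha));
        // draw uniformly from (F, 1) and invert the Weibull CDF
        real u = uniform_rng(F, 1);
        t_imputed[n] = sigma * pow(-log1m(u), inv(alpha));
    }
}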

Just want to note that my former students published papers in Statistics in Medicine developing restricted cubic splines in the hazard space (not log hazard), along with step-function time-dependent covariates, deriving the analytic form of the cumulative hazard with these time-dependent covariates. Their models worked well, with interruption of MLE iterations if a hazard point was starting to go negative, and step-halving at that point. Royston did not reference these papers in theirs.

Thanks for mentioning. Could you list the actual references?

An alternative for step-function time-dependent covariates would be the extended Cox model (a Poisson model equivalent to a piecewise-constant proportional hazards Cox model). The advantage I see here is that it does not have the problem of negative hazards. The disadvantage is the somewhat more involved data preparation. Of course, piecewise-constant hazards could be made more flexible by increasing the number of knots and putting a random-walk prior over the hazard values. This does lose, to some extent, the continuity one gets with (restricted cubic) splines, but I would question whether that is actually such a big problem when working on the hazard scale…
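A minimal sketch of what I mean, with the data already expanded into the usual long person-interval format (one row per subject per interval at risk; all variable names are just for illustration):

data {
    int<lower=1> N;                             // person-interval rows (long format)
    int<lower=2> J;                             // number of time intervals
    int<lower=1> NC;                            // number of covariates
    int<lower=0,upper=1> event[N];              // event indicator per row
    int<lower=1,upper=J> interval[N];           // interval index per row
    vector[N] log_exposure;                     // log of the time at risk within the interval
    matrix[N,NC] X;                             // covariate matrix
}
parameters {
    vector[J] log_lambda;                       // log piecewise-constant baseline hazard
    vector[NC] betas;                           // regression coefficients
    real<lower=0> sigma_rw;                     // random-walk scale
}
model {
    betas ~ normal(0, 1);
    sigma_rw ~ normal(0, 1);
    log_lambda[1] ~ normal(0, 10);
    log_lambda[2:J] ~ normal(log_lambda[1:(J-1)], sigma_rw);   // random-walk prior over the log hazard levels
    // Poisson "trick": equivalent to the piecewise-constant proportional hazards likelihood
    event ~ poisson_log(log_lambda[interval] + X * betas + log_exposure);
}

Here exp(log_lambda[j]) is the baseline hazard in interval j, and the log_exposure offset accounts for the time each row actually spends at risk.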

Here are two papers I saved related to time-dependent covariates - see especially the Herndon paper: http://data.vanderbilt.edu/fh/Papers/survival/CoxTDC

I think the problem with piecewise flat hazards is the extreme sensitivity to the choice of cutpoints in time, which, if properly reflected in the posterior, makes the posterior less sharp. You might also look at adaptive linear spline hazard models: http://data.vanderbilt.edu/fh/Papers/survival/koo95haz.pdf

Just an update for anyone on this thread who is interested… I’ve just opened a pull request to merge a stan_surv modelling function into rstanarm. The development branch I’m keen to merge in is located here.

Quoting from the stan_surv help file:

“Currently, the command fits standard parametric (exponential, Weibull and Gompertz) and flexible parametric (cubic spline-based) survival models on the hazard scale, with covariates included under assumptions of either proportional or non-proportional hazards. Where relevant, non-proportional hazards are modelled using a flexible cubic spline-based function for the time-dependent effect (i.e. the time-dependent hazard ratio).”

Left censored, right censored, and interval censored data are also allowed, as well as delayed entry (i.e. left truncation).

The defaults are basically…

  • cubic M-splines on the baseline hazard with independent normal prior for each spline coefficient, and
  • when the user wants to estimate non-proportional hazards, the time-dependent hazard ratio is modelled using cubic B-splines with a random walk prior for the spline coefficients

There are a bunch of post-estimation tools working, e.g. log_lik, loo, etc., and a posterior_survfit method for predicting the survival function / hazard function / cumulative hazard, etc. I’ve also added a vignette giving some background on the methods / usage.

It would be cool if anyone was keen to test it out – before or after the pull request is merged – but ideally before release. I’ve tested it on a few different example and simulated datasets, but the more testing the better (for either stan_surv or the post-estimation / prediction functions).

So please do your best to break it, and be sure to let me know if you do!

Also open to suggestions for improvements.

This is awesome news! Thanks @sambrilleman. I will test it on some of my datasets and report.

@sambrilleman,

you say in the help for stan_surv under prior_intercept:

Note however that the default scale for prior_intercept is 20 for stan_surv models (rather than 10, which is the default scale used for prior_intercept by most rstanarm modelling functions).

Could you explain why you decided to pick a wider prior as the default for \log(\lambda) here?

Another question I have is related to the following statement under basehaz_ops:

An intercept is included in the spline basis and included in the count of the degrees of freedom, such that two boundary knots and df - 4 internal knots are used to generate the cubic spline basis. The default is df = 6; that is, two boundary knots and two internal knots.

As far as I can see, you use the M-splines implemented in splines2, right? Are you referring to the intercept = TRUE option there? I am unsure what this precisely does (or whether it actually does anything at all). Could you clarify what you mean by “an intercept is included in the spline basis”?

An addendum to the above: for the case of M-splines to model the baseline hazard, I have been using the following simplified version, which uses intercept = FALSE in the splines2 package:

data {
    int<lower=0> N_uncensored;                               // number of uncensored subjects
    int<lower=0> N_censored;                                 // number of right-censored subjects
    int<lower=1> m;                                          // number of spline basis functions
    int<lower=1> NC;                                         // number of covariates
    matrix[N_censored,NC] X_censored;                        // covariates, censored subjects
    matrix[N_uncensored,NC] X_uncensored;                    // covariates, uncensored subjects
    matrix[N_uncensored,m] m_spline_basis_evals_uncensored;  // M-spline basis at the event times
    matrix[N_uncensored,m] i_spline_basis_evals_uncensored;  // I-spline basis at the event times
    matrix[N_censored,m] i_spline_basis_evals_censored;      // I-spline basis at the censoring times
}
parameters {
    simplex[m] gammas;                                       // spline coefficients (non-negative, sum to one)
    vector[NC] betas;                                        // regression coefficients
    real intercept;                                          // intercept (log baseline hazard scale)
}
model {
    betas ~ normal(0,1);
    intercept ~ normal(0,10);
    // censored subjects contribute -H(t), i.e. -(I-spline cumulative baseline hazard) * exp(linear predictor)
    target += -(i_spline_basis_evals_censored*gammas) .* exp(X_censored*betas + intercept);
    // uncensored subjects contribute -H(t) + log h(t)
    target += -(i_spline_basis_evals_uncensored*gammas) .* exp(X_uncensored*betas + intercept);
    target +=  log(m_spline_basis_evals_uncensored*gammas) + X_uncensored*betas + intercept;
}

This version uses a uniform prior over the simplex, which we could of course change to a Dirichlet prior if required. An advantage of this is that we can also use the prior_intercept option of stan_surv here, just as for the cases when basehaz is equal to exp, weibull or gompertz.
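For example, switching to a Dirichlet prior would just be one extra line in the model block (conc being an additional data vector of concentration parameters; the name is only for illustration):

    gammas ~ dirichlet(conc);    // a vector of ones for conc reproduces the uniform prior over the simplex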

Cheers @ermeel. Yeah, you are correct, I was using intercept = TRUE in the splines2 basis and that is what I was referring to in the basehaz_ops help documentation. I can’t remember why I opted for including the intercept in the basis. But I agree, it would probably be cleaner to use intercept = FALSE and then explicitly include the intercept as a separate parameter with its own prior. As you mention, that way prior_intercept will be relevant for all types of baseline hazard. I’ll push those changes when I get a spare minute.

I chose a default scale of 20 for prior_intercept since I’ve seen examples where the log baseline hazard rate can be smaller than -10. It was a somewhat arbitrary choice, but I didn’t think a scale of 10 was large enough. Of course, the appropriate prior totally depends on the dataset, and whether covariates have been centred or not, etc. Maybe there is a way to autoscale prior_intercept based on the data, but I wasn’t sure how.

At the moment the covariates aren’t being centred in estimation using stan_surv. But they probably could be. I just wasn’t sure whether centring might cause some undesirable behaviour with the spline basis in the time-dependent effects. Centring the covariates would probably be fine, even for the time-dependent effects(?), but I was just playing it safe for the moment until I had looked into it further.

I am used to the idea of time scale invariance so I want any procedure I use for time-to-event analyses to give substantively similar answers if my time scale is in days or years. That means I want some kind of data rescaling and/or prior specification to take care of time rescaling up to a multiplicative factor.

For an application I am working on right now, I mean-center all covariates and rescale the continuous covariates to have a standard deviation of 1. It is in the context of Weibull models, but you could adapt the reasoning:

With mean-centered covariates and Weibull shape parameter \alpha = 1, the baseline hazard \kappa corresponds to the hazard experienced by those at the sample mean covariate values. Thus, reasonable priors for the log-baseline hazards are \mathcal{N}(\log(E/PT), \left(\log(100)/2 \right)^2), where E is the number of observed events and PT is the total at-risk person-time. For exponential hazards, this prior asserts that the true hazard experienced at the sample mean covariate value has only \approx 5% probability of being more than two orders of magnitude away from the crude (pooled) event rate. The data-driven prior specification for the log-baseline hazards makes the model invariant to the time scale of the data (i.e., days vs. years). An alternative approach would be to rescale all times so that the mean event times were \approx 1.
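To spell out the “two orders of magnitude” statement: the prior standard deviation is \log(100)/2, so two standard deviations correspond to a factor of 100, and under the normal prior

\Pr\big(\,|\log \kappa - \log(E/PT)| > \log(100)\,\big) \approx 0.046,

which is where the \approx 5% figure comes from.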

Thanks. Just wanted to make sure: The priors for the spline coefficients in the baseline hazard will be uniform on the simplex, as I mentioned above?

I checked against some of the analyses I did in the past, and this seems like a good default, I agree.