Survival models in rstanarm


#82

Ok, great. Yeah, the last point from @lcomm would explain the behaviour I was seeing. It was increasing the dimension of the simplex that seemed to enforce the smoothing, rather than changing the concentration parameter. I had \alpha_i = 1 for all i = 1,\dots,k in my example.

I’m not sure whether it’s desirable. Maybe it is? But it definitely wasn’t intuitively expected, at least not by me. I expected that increasing the degrees of freedom for the spline basis would increase flexibility, but that prior distribution had other ideas.
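
In hindsight, the arithmetic behind this seems simple, assuming the effective-sample-size reading of the Dirichlet that comes up further down the thread is roughly right: with \alpha_i = 1 for all i, the total concentration is

\sum_{i=1}^k \alpha_i = k,

so a basis with k = 5 carries a prior weight of roughly 5, while one with k = 20 carries roughly 20. Increasing the spline degrees of freedom therefore also strengthens the pull towards the prior mean of equal weights, which would explain the extra smoothing I was seeing.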


#83

Hello!

I am not very familiar with survival models, but I have found that estimating the \alpha vector as hierarchical parameters might solve this dimensionality problem while providing an interesting amount of shrinkage. But I’m afraid it would increase the sampling time and model complexity!


#84

Do you have example code?


#85

Sorry, I was not clear. The idea is simply to estimate the \alpha vector instead of fixing it.
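
As a rough illustration of what I mean, here is a small prior predictive simulation in R. It is only a sketch: I use a single concentration value a for all components (rather than a full \alpha vector), the rgamma-based Dirichlet sampler and the gamma hyperprior are just placeholders, and a real implementation would estimate \alpha inside the Stan model.

```r
# Sketch of the idea: give the Dirichlet concentration a hyperprior instead of
# fixing it, and let the data decide how much smoothing to apply.
# (Prior predictive simulation only; the hyperprior here is a placeholder.)
rdirichlet1 <- function(alpha) { g <- rgamma(length(alpha), alpha, 1); g / sum(g) }

k <- 10                                   # dimension of the simplex
draws <- replicate(1000, {
  a <- rgamma(1, shape = 2, rate = 1)     # placeholder hyperprior on the concentration
  rdirichlet1(rep(a, k))                  # spline weights given that concentration
})

# Small a allows very uneven weights (a wiggly baseline hazard); large a pulls
# the weights towards 1/k (a smooth one). Summarise the spread of the largest weight:
quantile(apply(draws, 2, max), c(0.1, 0.5, 0.9))
```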


#86

@sambrilleman, I am creeping on your rstanarm pull request again (sorry!) and I can’t tell if your regression models ever use the QR decomposition of the design matrix. Have you explored that at all and/or decided it was not necessary?

(Also, happy holidays! Please take your time with any replies.)


#87

@lcomm No worries, Happy New Year!

Short answer is no, I have not explored it, nor ever used it in my own research, which is why I hadn’t implemented it in any of the models (stan_jm, stan_mvmer, stan_surv). Basically, I’m just not familiar with it, and didn’t feel comfortable implementing it until I was.

But assuming there aren’t any issues with using the QR decomposition for survival models, I’m perfectly happy for someone to add it. (Or I can try to look at doing it when I get time).


#88

I have been using an M-spline-based model with QR decomposition on the design matrix for various Pharma applications (with, but not only, genetic data). It significantly improved the statistical efficiency. I’d vote for this to be added at some point, but I appreciate it’s not the highest priority.


#89

I have not used the QR decomposition in my own code before, but I will be adding it over the next few days. So far as I can tell, there is no particular reason it would not work for survival regression models. It seems like predictors in time-to-event models will suffer from the same problematic correlation as in other models. My main concern is how it interacts with prior specification. Based on this page, you can put the priors directly on the original coefficients \beta instead of translating to priors for the QR version of the coefficients, \tilde\beta (NB: the notation for the design matrix X on that page is the transpose of the rstanarm definition of X). You will get ignorable warnings about a Jacobian, though. I imagine spitting out warnings like that would be a no-go for rstanarm so I am not sure how that would work.

Digging through the rest of rstanarm’s approach, it looks like @bgoodri did all of the \beta = (R^*)^{-1}\tilde\beta in R post-processing by extracting anything named beta and premultiplying by (R^*)^{-1}. If you do it this way, you will need to put priors on \tilde{\beta} directly. There will be no transformed parameter \beta within Stan, so obviously you cannot specify the prior in terms of \beta.
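
In case it helps, here is a rough R sketch of that post-processing step. This is not rstanarm’s actual code, just the algebra, with lm() standing in for the Stan fit and the sqrt(n - 1) scaling following the convention described in the Stan manual:

```r
# Minimal sketch of the QR reparameterisation described above, using lm()
# in place of Stan just to show the algebra.
set.seed(1)
n <- 200
X <- cbind(rnorm(n), rnorm(n), rnorm(n))
y <- drop(X %*% c(0.5, -1, 2)) + rnorm(n)

qrX    <- qr(X)
Q_star <- qr.Q(qrX) * sqrt(n - 1)
R_star <- qr.R(qrX) / sqrt(n - 1)   # Q_star %*% R_star reproduces X

# Fit on the orthogonalised design to get the "tilde beta" coefficients ...
theta <- coef(lm(y ~ Q_star - 1))

# ... then post-process back to the original scale: beta = (R*)^{-1} tilde_beta
beta <- drop(solve(R_star) %*% theta)

# Agrees with the fit on the original design matrix
all.equal(beta, coef(lm(y ~ X - 1)), check.attributes = FALSE)
```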

I am not an expert on this by any means. Please correct me if I am wrong!


#90

I would like to follow up on this: In the paper you reference, the authors state

An essential point is that the ESS is defined as a property of a prior and likelihood pair, so that, for example, a given prior might have two different ESS values in the context of two different likelihoods.

The ESS you mentioned assumes a multinomial likelihood, which is not the case in our particular survival model. Here the parameters that are given the Dirichlet prior influence the likelihood through the hazard function as

h(t) = e^{\beta_0} \sum_{k=1}^m \gamma_k M_k(t)

(and similarly for the cumulative hazard function, see above), while \beta_0 is given a Normal prior. Thus our prior is a product of two parametric distributions, which falls into case 3 of the paper, and an ESS vector with one entry for each of the two distributions is applicable.

Therefore I am not sure whether ESS=\sum_{k}\alpha_k is still applicable here; however, it does appear to explain qualitatively what Sam was observing.
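
For concreteness, here is roughly how that hazard can be evaluated in R. This is only a sketch: it assumes the splines2 package for the M-spline basis, and the degree and boundary knots are arbitrary choices that need not match what stan_surv does internally.

```r
# Sketch of h(t) = exp(beta0) * sum_k gamma_k * M_k(t), with the gamma_k
# constrained to a simplex. Basis choices here (degree, boundary knots)
# are arbitrary and not necessarily those used by stan_surv.
library(splines2)

m     <- 6                                  # number of M-spline basis terms
times <- seq(0.01, 5, length.out = 200)
M     <- mSpline(times, df = m, degree = 3,
                 intercept = TRUE, Boundary.knots = c(0, 5))

beta0 <- -0.5                               # intercept (Normal prior)
gamma <- rep(1 / m, m)                      # example simplex of weights

h <- exp(beta0) * drop(M %*% gamma)         # baseline hazard at each time
plot(times, h, type = "l", xlab = "t", ylab = "h(t)")
```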

Out of curiosity, has anyone studied cases where one would pick \alpha_1=\alpha_2=\cdots=\alpha_k = a_k, with a_k a decreasing sequence in k that has the property that 1/a_k = o(k) as k\rightarrow \infty? In other words, a_k goes to zero as k\rightarrow \infty, but not as fast as 1/k. Maybe something like a_k = C/\log(k)? I admit this will probably only be interesting in asymptotic regimes, but it might nevertheless lead to some insights…


#91

I agree that you should not directly think of this as a clean-cut prior sample size for the weights the way you could with a multinomial likelihood. Actually, I’m not sure I have ever found myself referencing this paper in a context where I can directly use any of the results like that – my likelihoods never seem to be that standard. I do think you can take some of the intuition about how to think about the \mathrm{Dirichlet}(a_1, a_2) and \mathrm{Dirichlet}(b_1, b_2, b_3) distributions in general, and the strength of information conveyed by \sum_i a_i and \sum_k b_k, and port it to contexts other than the multinomial likelihood. (Note: I could totally be wrong about this and it just hasn’t backfired in an obvious way yet.)

If we wanted to do this more rigorously, how could we go about it? The point is that for increasing k, we want the resulting function to look “more wiggly” but not “too much more wiggly.” We’ve established that getting smoother with greater k is definitely not “more wiggly” enough, because it’s actually less wiggly. Fair enough. But are we really trying to have identical effective sample sizes on the weights for all possible specifications? (Actually, it would be really cool to know how to do that, but it’s not worth holding up the rest of the functionality @sambrilleman has added, in part because I am worried that “not too much more wiggly” is too much of a moving target. I think your decreasing sequence idea has merit to it, though.)


#93

\mbox{Dirichlet}(1) is uniform, so when we have \mbox{Dirichlet}(\alpha), it’s equivalent to starting with a uniform and making \sum_{n=1}^N (\alpha_n - 1) observations.


#94

Maybe this is a notational difference, but I think of the relationships as

\mathrm{Uniform}(0,1) = \mathrm{Beta}(1,1) = \mathrm{Dirichlet}(1,1)

If the \mathrm{Beta}(1,1) is placed on an unknown proportion p, that information is equivalent to seeing 2 prior observations (1 “yes” and 1 “no”). A \mathrm{Dirichlet}(1,1) is just a prior on (p, 1-p), no? Am I remembering incorrectly?


#95

Same notation. The kernel of the density is

\mbox{beta}(\theta \mid \alpha, \beta) \ \propto \ \theta^{\alpha - 1} \times (1 - \theta)^{\beta - 1}.

It seems natural to start with the uniform distribution as representing no prior information. I think of the prior count as reflected in the exponents, \alpha - 1 and \beta - 1.

If you look at the binomial (a conjugate likelihood), it takes the same form, with kernel

\mbox{binomial}(y \mid N, \theta) \propto \theta^y \times (1 - \theta)^{N - y}.

That is, it’s \theta to the power of the number of successes and 1 - \theta to the power of the number of failures, just like in the prior. Then you just add the exponents to get the posterior,

p(\theta \mid y, N, \alpha, \beta) \ \propto \ \mbox{beta}(\alpha + y, \beta + N - y).

Where could you start such that one prior success and one prior failure leads to a posterior of \mbox{beta}(1,1)? You would have to start from \mbox{beta}(0,0), but both \alpha and \beta must be strictly greater than 0 for the beta density to properly integrate to one.
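
For what it is worth, the conjugate update above is easy to check numerically in R with a quick grid approximation:

```r
# Numerical check of the conjugate update: beta(a, b) prior + y successes
# in N trials should give a beta(a + y, b + N - y) posterior.
a <- 1; b <- 1                     # "uniform" beta(1, 1) prior
y <- 7; N <- 10                    # observed successes and trials

theta <- seq(0.0005, 0.9995, length.out = 2000)
delta <- theta[2] - theta[1]

unnorm    <- dbeta(theta, a, b) * dbinom(y, N, theta)
posterior <- unnorm / (sum(unnorm) * delta)   # normalise on the grid

# Agrees with the closed-form beta(a + y, b + N - y) density (up to grid error)
max(abs(posterior - dbeta(theta, a + y, b + N - y)))
```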


#96

Just to go back to this. I’ve been working on survival models with group-specific parameters (i.e. frailty models) on this branch. For the M-spline model, using constrained coefficients for the M-splines is the only way to get a reliable estimate of the SD of the group-specific intercepts. Otherwise, with a small number of clusters, the SD of the group-specific intercepts is massively biased (overestimated). So I think constrained M-spline coefficients, defined as a simplex, seem to be the way to go.

However, I am still unsure about what the most appropriate default for prior_aux is. As I see it, there are two possible approaches:

  1. Have the default for prior_aux set to missing, and describe the default behaviour in the documentation.

  2. Have the default prior_aux = dirichlet(1); then, if the user chooses any survival model apart from the default M-spline model (e.g. they specify basehaz = "weibull"), they will encounter an error saying that the Dirichlet prior is not valid and will have to manually specify a different value for prior_aux.

Option 2 is more explicit, but much more annoying for the user. I think I prefer Option 1.
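
To make the two options concrete, here is roughly how they would look from the user’s side. This is purely illustrative: stan_surv is still on the development branch, the argument names may change, and the prior in Option 2 is just an example of something the user might switch to.

```r
# Illustrative only: how the two default behaviours would look to a user.
library(rstanarm)    # development branch with stan_surv
library(survival)

# Option 1: prior_aux left unset; the default is resolved internally
# based on the chosen baseline hazard and described in the docs.
m1 <- stan_surv(Surv(futimeYears, death) ~ sex + trt, data = pbcSurv)

# Option 2: default prior_aux = dirichlet(1); any non-M-spline baseline
# hazard then errors until the user overrides prior_aux explicitly, e.g.
m2 <- stan_surv(Surv(futimeYears, death) ~ sex + trt, data = pbcSurv,
                basehaz   = "weibull",
                prior_aux = exponential(rate = 1))
```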

I suggest a straw poll. @bgoodri, @jonah, @lcomm, @ermeel, and whoever else wants to, please place a vote on this.


#97

Hello stan_surv developers! Thank you for your work on bringing more survival models to rstanarm!

I am conducting a review of readily available functions in R for Bayesian proportional hazards survival models (for application to clinical trial analysis). I definitely would like to mention Stan and the extension to rstanarm you are working on.
I am not familiar with the process for releasing updates/new packages on CRAN, but has an approximate timeline been set for the release of stan_surv?

Many thanks!


#98

We are having some issues with the time it takes to compile rstanarm on Windows, which is making it difficult to incorporate new Stan programs.


#99

Hi Sam, sorry for the delay and thanks for sharing this observation.

I would also prefer Option 1, or alternatively we could devote a separate function to it?