Ok, great. Yeah, the last point from @lcomm would explain the behaviour I was seeing. It was increasing the dimension of the simplex that seemed to enforce the smoothing, rather than changing the concentration parameter. I had \alpha_i = 1 for all i = 1, \dots, k in my example.
I’m not sure whether it’s desirable. Maybe it is? But it definitely wasn’t intuitively expected, at least not by me. I expected that increasing the degrees of freedom for the spline basis would increase flexibility, but that prior distribution had other ideas.
I am not very familiar with survival models, but I suspect that estimating the \alpha vector as hierarchical parameters might solve this dimensionality problem while providing a useful amount of shrinkage. But I’m afraid it would increase the sampling time and model complexity!
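For what it’s worth, here is a minimal Stan-style sketch of one simple variant of that idea, with a single shared concentration parameter given its own hyperprior rather than a full \alpha vector (the names and the gamma hyperprior are illustrative assumptions only, not anything from the branch):

data {
  int<lower=1> m;              // number of M-spline coefficients
}
parameters {
  real<lower=0> conc;          // shared Dirichlet concentration, estimated rather than fixed
  simplex[m] gamma;            // M-spline coefficients on the simplex
}
model {
  conc ~ gamma(2, 2);                      // hypothetical hyperprior with mean 1
  gamma ~ dirichlet(rep_vector(conc, m));  // hierarchical Dirichlet prior
}

This is prior-only (no likelihood), just to show the structure; a full \alpha vector would need some other constraint or pooling to be identifiable.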
@sambrilleman, I am creeping on your rstanarm pull request again (sorry!) and I can’t tell if your regression models ever use the QR decomposition of the design matrix. Have you explored that at all and/or decided it was not necessary?
(Also, happy holidays! Please take your time with any replies.)
Short answer is no, I have not explored it, nor ever used it in my own research. Hence why I hadn’t implemented it in any of the models (stan_jm, stan_mvmer, stan_surv). Basically, I’m just not familiar with it, and didn’t feel comfortable implementing it until I was.
But assuming there aren’t any issues with using the QR decomposition for survival models, then I’m perfectly happy for someone to add it. (Or I can try to look at doing it when I get time).
I have been using an M-spline based model with QR decomposition on the design matrix for various Pharma applications (with but not only genetic data). It significantly improved the statistical efficiency. I’d vote for this to be added at some point, but I appreciate it’s not the highest priority.
I have not used the QR decomposition in my own code before, but I will be adding it over the next few days. So far as I can tell, there is no particular reason it would not work for survival regression models. It seems like predictors in time-to-event models will suffer from the same problematic correlation as in other models. My main concern is how it interacts with prior specification. Based on this page, you can put the priors directly on the original coefficients \beta instead of translating to priors for the QR version of the coefficients, \tilde\beta (NB: the notation for the design matrix X on that page is the transpose of the rstanarm definition of X). You will get ignorable warnings about a Jacobian, though. I imagine spitting out warnings like that would be a no-go for rstanarm so I am not sure how that would work.
Digging through the rest of rstanarm’s approach, it looks like @bgoodri did all of the \beta = (R^*)^{-1}\tilde\beta in R post-processing by extracting anything named beta and premultiplying by (R^*)^{-1}. If you do it this way, you will need to put priors on \tilde{\beta} directly. There will be no transformed parameter \beta within Stan, so obviously you cannot specify the prior in terms of \beta.
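For reference, here is a minimal sketch of the in-Stan version of this, roughly the pattern from the Stan User's Guide, with the prior stated on the QR-scale coefficients (theta, i.e. \tilde\beta) and \beta recovered in generated quantities. The Gaussian outcome and the prior scale are just placeholders standing in for the survival likelihood, not anything from the branch:

data {
  int<lower=0> N;
  int<lower=1> K;
  matrix[N, K] X;                  // design matrix, rows = observations
  vector[N] y;
}
transformed data {
  // thin, scaled QR decomposition of X
  matrix[N, K] Q_ast = qr_thin_Q(X) * sqrt(N - 1);
  matrix[K, K] R_ast = qr_thin_R(X) / sqrt(N - 1);
  matrix[K, K] R_ast_inverse = inverse(R_ast);
}
parameters {
  real alpha;
  vector[K] theta;                 // coefficients on Q_ast (the "beta tilde")
  real<lower=0> sigma;
}
model {
  theta ~ normal(0, 5);            // prior placed directly on the QR-scale coefficients
  y ~ normal(alpha + Q_ast * theta, sigma);
}
generated quantities {
  vector[K] beta = R_ast_inverse * theta;   // back to the original scale
}

The rstanarm-style alternative is to skip the generated quantities block and do the R_ast_inverse multiplication in R post-processing, as described above.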
I am not an expert on this by any means. Please correct me if I am wrong!
I would like to follow up on this: In the paper you reference, the authors state
An essential point is that the ESS is defined as a property of a prior and likelihood pair, so that, for example, a given prior might have two different ESS values in the context of two different likelihoods.
The ESS you mentioned assumes a multinomial likelihood, which is not the case in our particular survival model. Here the parameters that are given a Dirichlet prior influence the likelihood through the hazard function as
h(t) = e^{\beta_0} \sum_{k=1}^m \gamma_k M_k(t)
(similar for the cumulative hazard function, see above) and \beta_0 is given a Normal prior. Thus our prior is a product of 2 parametric distributions, hence falling into case 3 of the paper, where an ESS vector for each of the 2 distributions is applicable.
Therefore I am not sure whether ESS = \sum_{k}\alpha_k is still applicable here; however, it appears to explain qualitatively what Sam was observing.
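To make the structure concrete, here is a minimal Stan-style sketch of that hazard and prior setup for right-censored data. The basis matrices M_basis (the M-spline basis evaluated at the observed times) and I_basis (the corresponding integrated I-spline basis used for the cumulative hazard), the event indicator status, and the prior scales are illustrative assumptions on my part, not the actual stan_surv code:

data {
  int<lower=1> n;                        // observations
  int<lower=1> m;                        // number of spline basis terms
  matrix[n, m] M_basis;                  // M-spline basis at t_i
  matrix[n, m] I_basis;                  // integrated (I-spline) basis at t_i
  vector<lower=0, upper=1>[n] status;    // 1 = event, 0 = right-censored
}
parameters {
  real beta0;                            // log baseline scale, given a Normal prior
  simplex[m] gamma;                      // M-spline coefficients constrained to a simplex
}
model {
  vector[n] haz = exp(beta0) * (M_basis * gamma);        // h(t_i)
  vector[n] cum_haz = exp(beta0) * (I_basis * gamma);    // H(t_i)

  beta0 ~ normal(0, 20);                                 // illustrative scale
  gamma ~ dirichlet(rep_vector(1, m));                   // alpha_k = 1, as discussed above

  // right-censored log likelihood: sum_i d_i * log h(t_i) - H(t_i)
  target += dot_product(status, log(haz)) - sum(cum_haz);
}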
Out of curiosity, has anyone studied cases where one would pick \alpha_1 = \alpha_2 = \cdots = \alpha_k = a_k, with a_k a decreasing sequence in k that has the property that 1/a_k = o(k) as k \rightarrow \infty? In other words, a_k goes to zero as k \rightarrow \infty but not as fast as 1/k. Maybe something like a_k = C/\log(k)? I admit this will probably only be interesting in asymptotic regimes, but nevertheless might lead to some insights…
I agree that you should not directly think of this as a clean-cut prior sample size for the weights the way you could with a multinomial likelihood. Actually, I’m not sure I have ever found myself referencing this paper in a context where I can directly use any of the results like that – my likelihoods never seem to be that standard. I do think you can take some of the intuition about how to think about the \mathrm{Dirichlet}(a_1, a_2) and \mathrm{Dirichlet}(b_1, b_2, b_3) distributions in general, and the strength of information conveyed by \sum_i a_i and \sum_k b_k, and port it to contexts other than the multinomial likelihood. (Note: I could totally be wrong about this and it just hasn’t backfired in an obvious way yet.)
If we wanted to do this more rigorously, how could we go about it? The point is that for increasing k, we want the resulting function to look “more wiggly” but not “too much more wiggly.” We’ve decided that getting smoother with greater k is decidedly not “more wiggly” enough because it’s actually less wiggly. Fair enough. But are we really trying to have identical effective sample sizes on the weights for all possible specifications? (Actually, it would be really cool to know how to do that, but not worth holding up the rest of the functionality @sambrilleman has added, in part because I am worried that “not too much more wiggly” is too much of a moving target. I think your decreasing sequence idea has merit to it, though.)
\mbox{Dirichlet}(1) is uniform, so when we have \mbox{Dirichlet}(\alpha), it’s equivalent to starting with a uniform and making \sum_{n=1}^N (\alpha_n - 1) observations.
If the \mathrm{Beta}(1,1) is placed on an unknown proportion p, that information is equivalent to seeing 2 prior observations (1 “yes” and 1 “no”). A \mathrm{Dirichlet}(1,1) is just a prior on (p, 1-p), no? Am I remembering incorrectly?
It seems natural to start with the uniform distribution as representing no prior information. I think of the prior count as reflected in the exponents, \alpha - 1 and \beta - 1.
If you look at the binomial (a conjugate likelihood), it takes the same form, with kernel
\theta^y (1 - \theta)^{N - y}.
That is, it’s \theta to the power of the number of successes and 1 - \theta to the power of the number of failures, just like in the prior. Then you just add them up to get the posterior,
p(\theta \mid y, N, \alpha, \beta) \ \propto \ \mbox{beta}(\alpha + y, \beta + N - y).
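Written out, the conjugate update just adds the exponents of the likelihood and prior kernels:

p(\theta \mid y, N, \alpha, \beta) \ \propto \ \theta^y (1 - \theta)^{N - y} \cdot \theta^{\alpha - 1} (1 - \theta)^{\beta - 1} = \theta^{(\alpha + y) - 1} (1 - \theta)^{(\beta + N - y) - 1},

which is the kernel of \mbox{beta}(\alpha + y, \beta + N - y).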
Where could you start such that one prior success and one prior failure leads to a posterior of \mbox{beta}(1,1)? It would have to be the improper \mbox{beta}(0, 0), but both \alpha and \beta must be strictly greater than 0 for the beta density to properly integrate to one.
Just to go back to this. I’ve been working on survival models with group-specific parameters (i.e. frailty models) on this branch. For the M-spline model, using constrained coefficients for the M-splines is the only way to get a reliable estimate of the SD of the group-specific intercepts. Otherwise, with a small number of clusters, the SD of the group-specific intercepts is massively biased (overestimated). So I think constrained M-spline coefficients defined as a simplex seem to be the way to go.
However, I am still unsure what the most appropriate default argument for prior_aux is. As I see it, there are two possible approaches:
1. Have the default for prior_aux set to missing, and describe the default behaviour in the documentation.
2. Have the default prior_aux = dirichlet(1), and if the user chooses any survival model apart from the default M-spline model (e.g. they specify basehaz = "weibull") then they will encounter an error saying that the Dirichlet prior is not valid and they will have to manually specify a different value for prior_aux.
Option 2 is more explicit, but much more annoying for the user. I think I prefer Option 1.
Hello stan_surv developers! Thank you for your work on bringing more survival models to rstanarm!
I am conducting a review of readily-available functions in R for Bayesian proportional hazards survival models (for application to clinical trials analysis). I definitely would like to mention Stan and the extension to rstanarm you are working on.
I am not familiar with the process for the release of updates/new packages on CRAN. But has an approximate timeline been set for the release of stan_surv?
This thread seems to be the most recent about Cox model estimation with Stan and Stan-based software (rstanarm, brms). Here follow a couple of thoughts about this, on which I’d like your feedback.
Obtaining a fit for a “classical” Cox model (i.e. time to a single, possibly censored, non-repeating event) is easy. The “Poisson trick” quoted in a lot of the literature allows one to do this almost mechanically:
Use Terry Therneau’s survSplit from the survival package to format the data in the “long” format [PtId, IntervalId, StartOfInterval, EndOfInterval, StatusAtEOI, covariates]. This format has one observation per patient and per defined time interval.
Use these data to fit a Poisson model, treating the event indicator as the count of events occurring in the interval. For example, with brms, one could fit the model:
brm(StatusAtEOI ~ 0 + IntervalId + offset(log(EndOfInterval - StartOfInterval)) + <Model for the covariates' effects>, family = poisson(link = "log"), ...)
Provided that your priors are reasonable, you will recover coefficient estimates and estimated SDs close to the ones given by (e.g.) coxph.
With “reasonably uninformative” priors (such as student_t(3, 0, 10) in brms parlance), you’ll get estimates a bit shrunk towards 0 and a bit less variable than the coxph estimates; using “unreasonable” priors such as student_t(3, 0, 1000) will recover coxph's estimates almost exactly.
This should (I haven’t checked yet) also work for including time-dependent covariates after pre-fit data massaging, or time-dependent effects, as explained in T. Therneau’s vignette.
This approach has the (large) advantage of existing. It has, however, a couple of problems:
Whereas the “natural unit” of observation is the subject (for example, a patient), and the observed value is a (variable-length) vector (i.e. the individual’s history), both brms and rstanarm treat the data as structured by elementary (scalar) observations, totally ignoring the individual. All the infrastructure (log_likelihood, LOO, WAIC) is observation-based, i.e. meaningless for individuals. As a consequence, model comparisons, for example, can currently only be done by bayes_factor, whose limits should be clear to all by now.
The “Poisson trick” allows one to treat the case of aggregated data elegantly. But it cannot be adapted to multi-state/competing risks models, where a categorical variable (to denote the current state) and a matrix of state transition risks/probabilities would be natural (again, see Therneau’s vignette on the subject). It would be preferable to model the elementary event as a Bernoulli RV, which could be generalized to a categorical RV.
I don’t think any of this is difficult to program in Stan itself. On the other hand, I just can’t see how to (ab)use the current notation available in either brms or rstanarm to pass the necessary information. Examples:
Computing the log likelihood per unit: both brms and rstanarm do something along the lines of:
mu = X * Beta + offset;
target += poisson_log_lpmf(Y | mu);
which could be easily (but perhaps not as efficiently) done by a transformed parameters section along the lines of:
vector[nPtId] log_lik;
for (i in 1:nPtId) {
  log_lik[i] = 0;
}
for (i in 1:nObs) {
  log_lik[PtId[i]] += poisson_log_lpmf(Y[i] | X[i, ] * Beta + offset[i]);
}
Can some kind soul suggest a “good” way to replace the Poisson modeling by a Bernoulli modeling? I have an idea, but my algebra is a bit shaky, and I have doubts about the numerical stability…
A third difficulty is rather more embarrassing: a good part of the infrastructure entails the fitted or predicted values of the observations. I have trouble defining what such fitted or predicted values should be, given that they should have a meaning at the individual (vector) level, not at the scalar level.
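Regarding the Bernoulli question above, one concrete option (a sketch on my part, not something established in this thread): keep the same linear predictor and offset, and model the per-interval event indicator with a complementary log-log inverse link, inv_cloglog(eta) = 1 - exp(-exp(eta)). For event-free intervals this contribution is identical to the Poisson one, and for short intervals containing at most one event the two are very close:

// assumes the same long-format data as above: one row per patient-interval,
// Y[i] in {0, 1}, and lambda[i] = X[i, ] * Beta + offset[i]
for (i in 1:nObs) {
  // inv_cloglog(lambda[i]) = 1 - exp(-exp(lambda[i])) is the probability of
  // at least one event in the interval under a piecewise-constant hazard
  target += bernoulli_lpmf(Y[i] | inv_cloglog(lambda[i]));
}

In principle the same row-per-interval layout could carry a categorical outcome for multi-state transitions, though I have not worked that through.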
You can use a binomial, as both it and the Poisson are approximately normal when expectations are large and, for the binomial, not near the boundary. But the normal’s even easier to compute.
This is better off being defined as:
vector[nObs] lambda = X * Beta + offset;
and then you can just use
for (n in 1:nObs)
  log_lik[PtId[n]] += poisson_log_lpmf(Y[n] | lambda[n]);
It’d be even more efficient to make things contiguous in the data so you can slice out all the n where PtId[n] = k.
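A hedged sketch of that contiguity idea, assuming the rows have been sorted by PtId and an integer array nObsPerPt (an illustrative name, not in the code above) holds the number of rows for each patient:

vector[nPtId] log_lik;
{
  int pos = 1;
  for (k in 1:nPtId) {
    int n_k = nObsPerPt[k];
    // one vectorized lpmf call over this patient's contiguous slice,
    // which returns the sum of the per-row log probabilities
    log_lik[k] = poisson_log_lpmf(segment(Y, pos, n_k) | segment(lambda, pos, n_k));
    pos += n_k;
  }
}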
On the R side, I think @sambrilleman’s feature/survival branch for rstanarm is basically ready for use and just waiting on me and @bgoodri to finish reviewing it. I think there will also be a StanCon presentation about it.