Ok, great. Yeah, the last point from @lcomm would explain the behaviour I was seeing. It was increasing the dimension of the simplex that seemed to enforce the smoothing, rather than changing the concentration parameter. I had \alpha_i = 1 for all i = 1, \ldots, k in my example.

I’m not sure whether it’s desirable. Maybe it is? But it definitely wasn’t intuitively expected, at least not by me. I expected that increasing the degrees of freedom for the spline basis would increase flexibility, but that prior distribution had other ideas.
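A quick simulation (plain numpy, purely illustrative and nothing to do with the actual rstanarm code) shows the effect being described: with a symmetric Dirichlet(1, …, 1) prior, draws concentrate ever closer to the uniform point of the simplex as the dimension k grows, so individual basis weights have less and less room to be large, and the implied spline gets smoother rather than more flexible.

```python
# Sketch: how symmetric Dirichlet(1, ..., 1) draws flatten as the
# simplex dimension k grows. The "max weight" per draw is a rough
# measure of how spiky the spline coefficient vector can be a priori.
import numpy as np

rng = np.random.default_rng(0)

for k in (5, 20, 80):
    draws = rng.dirichlet(np.ones(k), size=20_000)
    mean_max_weight = draws.max(axis=1).mean()
    print(k, mean_max_weight)
```

The mean maximum weight shrinks steadily with k (for uniform Dirichlet it is H_k / k, with H_k the harmonic number), which matches the observation that raising the spline degrees of freedom tightened the prior around the flat hazard.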

I am not used to survival models, but I have found that estimating the \alpha vector as hierarchical parameters might solve this dimensionality problem while providing a useful amount of shrinkage. But I’m afraid it would increase the sampling time and model complexity!

@sambrilleman, I am creeping on your rstanarm pull request again (sorry!) and I can’t tell if your regression models ever use the QR decomposition of the design matrix. Have you explored that at all and/or decided it was not necessary?

(Also, happy holidays! Please take your time with any replies.)

Short answer is no, I have not explored it, nor ever used it in my own research. Hence why I hadn’t implemented it in any of the models (stan_jm, stan_mvmer, stan_surv). Basically, I’m just not familiar with it, and didn’t feel comfortable implementing it until I was.

But assuming there aren’t any issues with using QR decomposition for survival models, then I’m perfectly happy for someone to add it. (Or I can try to look at doing it when I get time.)

I have been using an M-spline based model with QR decomposition on the design matrix for various pharma applications (with, but not only, genetic data). It significantly improved the statistical efficiency. I’d vote for this to be added at some point, but I appreciate it’s not the highest priority.

I have not used the QR decomposition in my own code before, but I will be adding it over the next few days. So far as I can tell, there is no particular reason it would not work for survival regression models. It seems like predictors in time-to-event models will suffer from the same problematic correlation as in other models. My main concern is how it interacts with prior specification. Based on this page, you can put the priors directly on the original coefficients \beta instead of translating to priors for the QR version of the coefficients, \tilde\beta (NB: the notation for the design matrix X on that page is the transpose of the rstanarm definition of X). You will get ignorable warnings about a Jacobian, though. I imagine spitting out warnings like that would be a no-go for rstanarm so I am not sure how that would work.

Digging through the rest of rstanarm’s approach, it looks like @bgoodri did all of the \beta = (R^*)^{-1}\tilde\beta in R post-processing by extracting anything named beta and premultiplying by (R^*)^{-1}. If you do it this way, you will need to put priors on \tilde{\beta} directly. There will be no transformed parameter \beta within Stan, so obviously you cannot specify the prior in terms of \beta.
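For anyone following along, the reparameterisation and post-processing step can be sketched in plain numpy. This is illustrative only, not the rstanarm code; the sqrt(n - 1) scaling of the thin QR factors is an assumed convention here, and I use ordinary least squares in place of the MCMC fit:

```python
# Sketch of the QR reparameterisation: fit on the orthogonalised design
# Q*, then recover the original-scale coefficients via beta = (R*)^{-1} beta_tilde.
import numpy as np

rng = np.random.default_rng(1)
n, p = 200, 3
X = rng.normal(size=(n, p))
beta_true = np.array([0.5, -1.0, 2.0])
y = X @ beta_true + rng.normal(scale=0.1, size=n)

# Thin QR of the design matrix: X = Q* R*,
# with Q* rescaled by sqrt(n - 1) (assumed convention).
Q, R = np.linalg.qr(X)
Q_star = Q * np.sqrt(n - 1)
R_star = R / np.sqrt(n - 1)

# "Fit" on the orthogonalised design; these are the beta_tilde coefficients.
beta_tilde, *_ = np.linalg.lstsq(Q_star, y, rcond=None)

# Post-processing: beta = (R*)^{-1} beta_tilde recovers the original scale.
beta = np.linalg.solve(R_star, beta_tilde)
print(beta)  # close to beta_true
```

As noted above, if the inversion happens only in post-processing, the prior necessarily applies to \tilde\beta, because \beta never exists as a quantity inside the sampler.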

I am not an expert on this by any means. Please correct me if I am wrong!

I would like to follow up on this: In the paper you reference, the authors state

An essential point is that the ESS is defined as a property of a prior and likelihood pair, so that, for example, a given prior might have two different ESS values in the context of two different likelihoods.

The ESS you mentioned assumes a multinomial likelihood, which is not the case in our particular survival model. Here the parameters that are given a Dirichlet prior influence the likelihood through the hazard function as

h(t) = e^{\beta_0} \sum_{k=1}^m \gamma_k M_k(t)

(similarly for the cumulative hazard function; see above), and \beta_0 is given a normal prior. Thus our prior is a product of two parametric distributions, so it falls into case 3 of the paper, and an ESS vector for each of the two distributions is applicable.
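To make the hazard formula concrete, here is a rough numeric sketch of h(t) = e^{\beta_0} \sum_k \gamma_k M_k(t). It assumes the common construction of M-splines as B-splines rescaled to integrate to one; the knot placement, degree, and parameter values are illustrative, not rstanarm’s defaults:

```python
# Numeric sketch of the M-spline hazard, building each M_k as a
# B-spline basis term divided by its integral over the time range
# (assumed construction; not the rstanarm implementation).
import numpy as np
from scipy.interpolate import BSpline

degree = 3
knots = np.concatenate([[0.0] * (degree + 1),
                        [2.5],                  # one interior knot
                        [5.0] * (degree + 1)])
m = len(knots) - degree - 1                     # number of basis terms

def m_spline_basis(t):
    """Evaluate each M-spline basis term M_k at the time points t."""
    out = np.empty((len(t), m))
    for k in range(m):
        coefs = np.zeros(m)
        coefs[k] = 1.0
        b = BSpline(knots, coefs, degree)
        area = b.integrate(knots[0], knots[-1])  # normalising constant
        out[:, k] = b(t) / area                  # each column integrates to 1
    return out

beta0 = -0.5                                     # normal-prior intercept
gamma = np.full(m, 1.0 / m)                      # Dirichlet-prior simplex weights
t = np.linspace(0.0, 5.0, 6)
h = np.exp(beta0) * m_spline_basis(t) @ gamma
print(h)
```

Because each M_k integrates to one and \gamma lives on the simplex, the \gamma_k act as weights on a fixed "budget" of hazard shape, which is exactly why the Dirichlet concentration interacts with the basis dimension as discussed above.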

Therefore I am not sure whether ESS = \sum_{k}\alpha_k is still applicable here; however, it does appear to explain qualitatively what Sam was observing.

Out of curiosity, has anyone studied cases where one would pick \alpha_1=\alpha_2=\cdots=\alpha_k = a_k, with a_k a decreasing sequence in k that has the property that 1/a_k = o(k) as k\rightarrow \infty? In other words, a_k goes to zero as k\rightarrow \infty, but not as fast as 1/k. Maybe something like a_k = C/\log(k)? I admit this will probably only be interesting in asymptotic regimes, but it might nevertheless lead to some insights…
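A quick numeric look at the proposed schedule a_k = C/\log(k) (with C = 1, purely illustrative) shows the intended behaviour: the per-component concentration shrinks toward zero, while the total \sum_k a_k = k/\log(k) still grows without bound, so the prior neither over-concentrates on the uniform simplex point nor stays fixed:

```python
# Illustrative check of the schedule a_k = 1 / log(k): the individual
# concentrations decrease, but the total k * a_k still diverges.
import numpy as np

for k in (10, 100, 1_000, 10_000):
    a_k = 1.0 / np.log(k)
    total = k * a_k            # "total concentration" of Dirichlet(a_k, ..., a_k)
    print(k, round(a_k, 4), round(total, 1))
```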

I agree that you should not directly think of this as a clean-cut prior sample size for the weights the way you could with a multinomial likelihood. Actually, I’m not sure I have ever found myself referencing this paper in a context where I could directly use any of the results like that – my likelihoods never seem to be that standard. I do think you can take some of the intuition about how to think about the \mathrm{Dirichlet}(a_1, a_2) and \mathrm{Dirichlet}(b_1, b_2, b_3) distributions in general, and the strength of information conveyed by \sum_i a_i and \sum_k b_k, and port it to contexts other than the multinomial likelihood. (Note: I could totally be wrong about this and it just hasn’t backfired in an obvious way yet.)

If we wanted to do this more rigorously, how could we go about it? The point is that for increasing k, we want the resulting function to look “more wiggly” but not “too much more wiggly.” We’ve decided that getting smoother with greater k is decidedly not “more wiggly” enough because it’s actually less wiggly. Fair enough. But are we really trying to have identical effective sample sizes on the weights for all possible specifications? (Actually, it would be really cool to know how to do that, but not worth holding up the rest of the functionality @sambrilleman has added, in part because I am worried that “not too much more wiggly” is too much of a moving target. I think your decreasing sequence idea has merit to it, though.)