Huge R-hat and low effective size in piecewise exponential model when using more than one predictor and interval

riko · September 13, 2019, 1:16pm

Hello together,

this is my first post here since I found nothing else regarding this problem. Maybe some of you know how to fix my problem.

I wrote a piecewise constant exponential (piecewise constant hazards) survival model for STAN given in the model code below. Basically I recycled the likelihood function given in Bayesian Survival Analysis (Ibrahim, 2001), which is

L(\beta,\lambda|D)=\prod_{i=1}^n \prod_{j=1}^J \left ( \lambda_j \exp(x_i' \beta) \right )^{\delta_{ij} \nu_i} \exp \left ( -\delta_{ij}[\lambda_j (y_i-s_{j-1})+\sum_{g=1}^{j-1} \lambda_g (s_g-s_{g-1})] \exp(x_i' \beta) \right )

where \delta_{ij}=1 if the i^{th} subject failed or was censored in the j^{th} interval and 0 otherwise. The derivation of the likelihood can be found in Ibrahim (2001). If J=1 the model reduces to a parametric exponential model with \lambda \equiv \lambda_1.

My model basically iterates over the observations i=1,...,n and intervals j=1,...,J and given a status vector \nu=(\nu_1,...,\nu_n), a matrix delta which contains the \delta_{ij} for i=1,...,n, j=1,...,J, the vector s=(s_1,...,s_J) creating the intervals, the design matrix X where x_i=(x_1,...,x_p) is a row belonging to an observation, the vector of baseline hazards \lambda=(\lambda_1,...,\lambda_J) and the vector of regression coefficients \beta=(\beta_1,...,\beta_p) calculates the likelihood. More specifically, my likelihood function pch_log in the functions block calculates the log-likelihood.

I put a normal prior \mathcal{N}(0,1) on \beta and a Gamma prior \mathcal{G}(1,1) on the piecewise constant baseline hazards \lambda.

This works quite well if

The number or predictors (x_1,...,x_p) stays small
The number of intervals J-1 stays small

I have included a reproducible example using the retinopathy dataset from the survival package which includes survival data from a trial of laser coagulation as a treatment to delay diabetic retinopathy. The events are loss of sight here and the appended code 1 uses just the treatment (laser vs no treatment) as a predictor for 2 intervals (0,50] and (50,80] with different constant baseline hazards \lambda_1,\lambda_2.

The appended Code 2 then uses 4 intervals and 4 predictors, and STAN produces posteriors including huge R-hat values and effective sample sizes below 10 for all parameters.

My question now is: Has anyone an idea how to improve the model so that STAN does not encounter such difficulties in exploring the posterior distribution? I already centered the non-categorial predictors and tried initial prior values, but I achieved no improvement.

I have no clue why STAN has such difficulties to explore the posterior distribution. Changing the priors on \lambda to uniform priors has not helped much.

Thanks for any help in advance,

Riko

functions{
  real pch_log(vector y, vector nu, matrix delta, matrix X, vector lambda, vector s, vector beta){
    vector[num_elements(y)] log_prob;
    real logprob;
    real hazSum;

    for(i in 1:num_elements(y)){
      for(j in 1:(num_elements(s)-1)){
        if(delta[i,j]==1){ // failure or censored event in interval j for observation i
          if(nu[i]==1){ // failure occured in interval j for observation i
            hazSum=0;
            for(g in 2:j){ // is zero when j=1
              hazSum += lambda[g-1]*(s[g]-s[g-1]);
            }
            log_prob[i]=log(lambda[j])+(X[i,]*beta)+(-(lambda[j]*(y[i]-s[j])+hazSum)*exp(X[i,]*beta));
            
            if(log_prob[i]<-100000){ // if likelihood is very close to zero (log-likelihood would become -infty) we set it to -100000
              log_prob[i]=-100000;
            }
          }
          if(nu[i]==0){ // censoring occured in interval j for observation i
            hazSum=0;
            for(g in 2:j){ // is zero when j=1
              hazSum += lambda[g-1]*(s[g]-s[g-1]);
            }
            log_prob[i]=-(lambda[j]*(y[i]-s[j])+hazSum)*exp(X[i,]*beta);
            
            if(log_prob[i]<-100000){ // if likelihood is very close to zero (log-likelihood would become -infty) we set it to -100000
              log_prob[i]=-100000;
            }
          }
        }
      }
    }
  logprob = sum(log_prob); // compute sum of log-likelihoods and return
  return logprob;
  }
}
data{
  int<lower=1> N; // number of patients
  vector<lower=0>[N] y; // observed times
  vector<lower=0,upper=1>[N] nu; // censoring status 1 or 0
  int<lower=0> numCovariates; // number of covariates (predictors)
  matrix[N,numCovariates] X; // design matrix for observations
  
  int<lower=1> J; // number of timepoints building the intervals
  vector<lower=0>[J] s; // vector with interval endpoints
  matrix[N,J-1] delta; // matrix with indicators for event or failure in interval j=1,...,J for patients
}
parameters{
  vector[numCovariates] beta_raw; // regression coefficients                                   
  vector<lower=0>[J-1] lambda; // piecewise constant baseline hazards
}
transformed parameters{
  vector[numCovariates] beta;
  beta = 1 * beta_raw;
}
model{
  lambda ~ gamma(1,1); // prior for piecewise constant baseline hazards
  beta_raw ~ std_normal(); // raw prior for regression coefficients

  y ~ pch(nu, delta, X, lambda, s, beta);
}

Code 1: Survival regression for predictor treatment using two intervals

require(survival)
require(tidyverse)
data("retinopathy")
head(retinopathy)
retinopathy$age=scale(retinopathy$age)

# Data preparation
y = retinopathy$futime # censoring or failure time
N=length(y)
nu = retinopathy$status # (0 = censored, 1 = visual loss)

X =as.matrix(retinopathy %>% select(trt)) # predictor matrix
# trt: 0 = no treatment, 1 = laser treatment
s = c(0,50,80) # split points for intervals
J=length(s);
interval1=retinopathy %>% filter(retinopathy$futime <= 80) %>% mutate(interval=1)
interval2=retinopathy %>% filter(retinopathy$futime > 40)  %>% mutate(interval=2)

retinopathyNew=rbind(interval1,interval2)
retinopathyNew
j=retinopathyNew$interval

# Create matrix with delta_ij's
delta=matrix(nrow=length(y),ncol=J-1,0)
head(delta)
for(i in 1:length(y)){
  for(k in 1:J-1){
    if(j[i]==k){
      delta[i,k]=1
    }
  }
}

# Put data in list for STAN
stan_data <- list(N=N,
                  y=y,
                  nu=nu,
                  numCovariates=1,
                  X=X,
                  J=J,
                  s=s,
                  delta=delta
)

# Run STAN
set.seed(12);
require(rstan)
options(mc.cores = parallel::detectCores()) # use multiple cores if available, speeds up simulations
rstan_options(auto_write = TRUE) # avoids recompiling unchanged models, speed up simulations

pch_surv_model_fit__rethinopathy_Treatment <- stan(model_code = pch_model, data=stan_data, control=list(adapt_delta=0.999, max_treedepth=10), chains=2, iter=5000, init=list(list(beta=c(0,0),lambda=c(1,1)),list(beta=c(0,0),lambda=c(1,1))))

I end up with a posterior mean of -0.8 for \beta_1 and means of 0.02 and 0.98 for \lambda_1,\lambda_2. The resulting R-hat is 1 for all parameters.

Code 2: Survival regression for four predictors treatment, age, type and laser using four intervals

require(survival)
require(tidyverse)
data("retinopathy")
head(retinopathy)
retinopathy$age=scale(retinopathy$age)

# Data preparation
y = retinopathy$futime # censoring or failure time
N=length(y)
nu = retinopathy$status # (0 = censored, 1 = visual loss)

X =as.matrix(retinopathy %>% select(trt,age,laser,type) %>% mutate(laser=as.numeric(retinopathy$laser)-1,type=as.numeric(retinopathy$type)-1)) # predictor matrix
# trt: 0 = no treatment, 1 = laser treatment
# age: age scaled to mean zero and sd one (centered)
max(retinopathy$futime)
hist(retinopathy$futime)
s = c(0,20,40,60,80)
J=length(s);
interval1=retinopathy %>% filter(retinopathy$futime <= 20) %>% mutate(interval=1)
interval2=retinopathy %>% filter(retinopathy$futime > 20) %>% filter(futime <= 40) %>% mutate(interval=2)
interval3=retinopathy %>% filter(retinopathy$futime > 40) %>% filter(futime <= 60) %>% mutate(interval=3)
interval4=retinopathy %>% filter(retinopathy$futime > 60) %>% filter(futime <= 80) %>% mutate(interval=4)

retinopathyNew=rbind(interval1,interval2,interval3,interval4)
retinopathyNew
j=retinopathyNew$interval

# Create matrix with delta_{ij}'s
delta=matrix(nrow=length(y),ncol=J-1,0)
head(delta)
for(i in 1:length(y)){
  for(k in 1:J-1){
    if(j[i]==k){
      delta[i,k]=1
    }
  }
}

# Put data in list for STAN
stan_data <- list(N=N,
                  y=y,
                  nu=nu,
                  numCovariates=4,
                  X=X,
                  J=J,
                  s=s,
                  delta=delta
)

# Run STAN
set.seed(12);
require(rstan)
options(mc.cores = parallel::detectCores()) # use multiple cores if available, speeds up simulations
rstan_options(auto_write = TRUE) # avoids recompiling unchanged models, speed up simulations

pch_surv_model_fit_rethinopathy_TreatmentAgeLaserType <- stan(model_code = pch_model, data=stan_data, control=list(adapt_delta=0.8, max_treedepth=15), chains=2, iter=50000, init=list(list(beta=c(0,0,0,0),lambda=c(1,1,1,1)),list(beta=c(0,0,0,0),lambda=c(1,1,1,1)),list(beta=c(0,0,0,0),lambda=c(1,1,1,1)),list(beta=c(0,0,0,0),lambda=c(1,1,1,1))))

In this case, I end up with huge R-hat values, 1-digit effective sample size, and tweaking the hyper parameters of the priors has not helped at all.

bbbales2 · September 14, 2019, 2:34pm

Do you have a plot of what the is piecewise constant here? It is difficult to reverse engineer the eqs. though thanks for typing them out.

Piecewise constant things have jumps, right? So as a plot, dy/dx wouldn’t exist. If these differentials are necessary for Stan to work (if a parameter goes into computing x), then this lack of differentiability could easily break the sampler.

Have you seen this thread: Survival models in rstanarm ? I think there are some survival models in rstanarm (that came from that thread), and there are probably other things as well. Would adapting one of those models be appropriate for what you’re trying to do?

bbbales2 · September 14, 2019, 2:37pm

Oh neat, they say piecewise constant in that thread: Stan for survival models

Have a look and see if there’s anything interesting.

bbbales2 · September 14, 2019, 3:11pm

There were a couple talks at Stancon Cambridge about survival models.

From the schedule:

Bayesian analyses of time-to-event data using the rstanarm R package . Eren M. Elçi, Sam Brilleman. Public Health and Preventive Medicine, Monash University. Abstract
A Decision-Theoretic Journey From Early Clinical Data to Late Stage Efficacy using Hierarchical Joint Models. . Krzysztof Sakrejda, Eric Novik. Generable Abstract

They’re back to back and start here: https://youtu.be/d5avHkTHvy4?t=9256

ermeel · September 14, 2019, 4:44pm

I put the the case study (admittedly almost a paper) here which accompanies the StanCon talk. This is joint work with @sambrilleman. Feedback is highly welcome!

The case study discusses all the options for baseline hazards in the proportional hazard framework, currently available in the upcoming stan_surv extension to rstanarm. It also shows how to use the PW-constant baseline hazard, which are supported as a marginal case of M-splines or B-splines of degree 0.

We do not put a Gamma-process prior on the piecewise constant baseline but rather a Dirichlet prior (see case study).

I’ve seen a Stan implementation of piecewise constant baseline hazard with Gamma (incremental) priors using the Poisson process formulation, you can find it here. It is a port from Bugs.

PS: We will soon publish a paper describing in more detail the functionality, including how we implemented support for time-dependent effects (non-proportional hazards) and time-dependent covariates.

riko · September 15, 2019, 2:58pm

@ermeel I watched the StanCon Cambridge talk now, read your case study and tried to replicate it on my machine.

First off, nice talk and very promising work. I have been doing survival analysis a lot recently and did not know there is an extension to rstanarm in the making.

To the case study itself:

I did not find any information about the default priors you are using for the piecewise constant or any other model. On page 5, you say that the default Spline degree is \delta=3 but I did not found information about the default priors. Also, some more information about the model specification of the Splines approach would be helpful. I skimmed the Royston & Parmar paper about the RMST but I could not find anything about the splines idea there.
I tried to replicate my analysis with your function, but after installing the development version of rstanarm, the stan_surv function could not be found. I run Mac OS 10.14.6 and the latest R version so this should not be an issue.
After installing the development version of rstanarm I end up with the following affirmation that rstanarm has been installed:

** R
** data
*** moving datasets to lazyload DB
** demo
** inst
** byte-compile and prepare package for lazy loading
** help
No man pages found in package  ‘rstanarm’ 
*** installing help indices
*** copying figures
** building package indices
** installing vignettes
** testing if installed package can be loaded from temporary location
** checking absolute paths in shared objects and dynamic libraries
** testing if installed package can be loaded from final location
** testing if installed package keeps a record of temporary installation path
* DONE (rstanarm)

A few lines before this confirmation I get 18 warnings from clang which all look like

In file included from stan_files/polr.cc:3:
In file included from stan_files/polr.hpp:18:
In file included from /Library/Frameworks/R.framework/Versions/3.6/Resources/library/rstan/include/rstan/rstaninc.hpp:3:
In file included from /Library/Frameworks/R.framework/Versions/3.6/Resources/library/rstan/include/rstan/stan_fit.hpp:22:
In file included from /Library/Frameworks/R.framework/Versions/3.6/Resources/library/RcppEigen/include/RcppEigen.h:25:
In file included from /Library/Frameworks/R.framework/Versions/3.6/Resources/library/RcppEigen/include/RcppEigenForward.h:40:
In file included from /Library/Frameworks/R.framework/Versions/3.6/Resources/library/RcppEigen/include/unsupported/Eigen/SparseExtra:51:
/Library/Frameworks/R.framework/Versions/3.6/Resources/library/RcppEigen/include/unsupported/Eigen/../../Eigen/src/Core/util/ReenableStupidWarnings.h:10:30: warning: pragma diagnostic pop could not pop, no matching push [-Wunknown-pragmas]
    #pragma clang diagnostic pop
                             ^
18 warnings generated.

Do you know what this means or why I cannot access thestan_surv function? I can access all other functions of rstanarm correctly.

Is there the possibility to inspect the underlying STAN models? I still do not know why my model approach has divergent transitions and low effective sample size, which I am really curious about. I don’t know if your function will lead to better results as I cannot install it but I am keen to try.

Thanks in advance

riko · September 15, 2019, 3:07pm

@bbbales2 Thanks for the reply and the links to the resources. This could be a potential problem but I don’t think that’s the reason why STAN encounters these problems. For example, if I use two intervals and only one predictor, the treatment used, I end up with survival functions like this for the retinopathy dataset:

Of course, the fit is quite weird here, but the intervals are just used for testing purposes. The fainted blue lines are some credible survival functions.

ermeel · September 15, 2019, 6:17pm

Hey @riko , thanks for the feedback.

For the M-splines we as a default a symmetric Dirichlet prior with \alpha=1. This information (and more) is available in the stan_surv help (and vignette). The vignette also contains the precise information on the splines used. Here is a screenshot:

For example setting basehaz='ms' and basehaz_ops = list(degree = 0, knots=c(1,2,3,4,5))) will be equivalent to a piecewise constant baseline hazard with interior knots placed at times 1,2,3,4,5 and boundary knots determined my the minimum and maximum of times available in data.

In general I would recommend to install (this) development version of rstanarm in a separate directory (“environment”). I do devtools::dev_mode(on=T), install/use the development version of rstanarm and after I am done I do dev tools::dev_mode(on=F). This allows you to have two versions of rstanarm which do not “interact” and no manual adjustments of library paths is needed.

Just to be sure here is a potential step-by-step instruction how to install the development version of rstanarm (making sure help and vignette are available). More experienced Stan-Dev members, please propose an improved way to do the same, if possible:

First ensure you really cloned the feature/frailty-models branch (and not the master branch) by running

git clone --single-branch --branch feature/frailty-models https://github.com/stan-dev/rstanarm.git

Then do the following in R:

path_to_rstanarm_dev = "~/rstanarm_frailty_models_dev" # this is the directory of your local rstanarm clone of the feature/frailty-models branch
devtools::dev_mode(on=T)
install.packages("StanHeaders")
pkgbuild::compile_dll(path=path_to_rstanarm_dev)
devtools::document(pkg=path_to_rstanarm_dev)
install.packages(path_to_rstanarm_dev,  repos = NULL, type="source")
devtools::dev_mode(on=F)

Next time you want to use the development version of rstanarm and in particular stan_surv, just do

devtools::dev_mode(on=T)
library(rstanarm)
# use stan_surv
devtools::devmode(on=F)

Regarding Stan code, yes it is available but comes with a lot of extra code in order to support the full flexibility of stan_surv.

To understand the m-Spline baseline hazard model I can recommend to check this (simplified) Stan model and this RMarkdown (which contains the R code required to precalculate the splines). Both in stan_surv and in this code we rely on the splines2 R-package, so you can alter the R and Stan code in the example (and set the degree to 0) to get a feeling for the prior effect on the piecewise constant hazard and inspect what’s going on. Note that in the simple Stan model above, I use (implicitly) a symmetric Dirichlet prior with \alpha=1, but you an easily add a custom Dirichlet prior there. Also, I recommend the thread Ben just recommended. It heavily inspired part of the implementation of stan_surv.

riko · September 18, 2019, 4:20pm

@ermeel I managed to install the development version of rstanarm now and I have replicated your RMarkdown case study. The installation was a bit of a hassle but the code worked flawless.

I still need to check out your link to the simplified STAN model and RMarkdown for understanding more of the splines approach.

Regarding the paper, I would recommend adding a section or some information on the priors used for each model (although one can find a lot in the corresponding prior information sites of rstanarm, but this was something I was missing when reading the case study for the first time).
As far as I am concerned, a little bit more about the theory of the spline approach taken would also help. Maybe it is just me, but I think using splines to model baseline hazards has not been studied widely before due to the computational difficulties so a large part of people interested in this should be new to this idea.

I also spent some time fitting some models to microarray data with a large number of predictors. I did not encounter a lot of divergent transitions, Rhat problems or small effective sample size there, only the running time becomes a few hours quickly.

There was only one situation in which the stan_surv function (or better, the update function after defining the model via stan_surv) produced convergence issues with larger Rhat and small effective sample sizes was when I tried to put a LASSO prior on the regression coefficients via

prior.stan.constLASSO <- update(prior.stan.const,
                           prior_intercept = normal(0, 0.01),
                           prior           = lasso(df=1, location=0, autoscale = TRUE))

which produced convergence issues in form of a large Rhat, low ESS and BFMI.
When I instead used a LASSO prior with scale=0.5, that is

prior.stan.constLASSO <- update(prior.stan.const,
                           prior_intercept = normal(0, 0.01),
                           prior           = lasso(df=1, location=0, scale=0.5, autoscale = TRUE))

I encountered no convergence issues in STAN.

I do not exactly know if this is due to the rstan_surv function or because of my use of the LASSO prior. The rstanarm vignette for prior distributions and options states regarding the LASSO prior:

The lasso approach to supervised learning can be expressed as finding the posterior mode when the likelihood is Gaussian and the priors on the coefficients have independent Laplace distributions. It is commonplace in supervised learning to choose the tuning parameter by cross-validation, whereas a more Bayesian approach would be to place a prior on “it”, or rather its reciprocal in our case (i.e. smaller values correspond to more shrinkage toward the prior location vector). We use a chi-square prior with degrees of freedom equal to that specified in the call to lasso or, by default, 1. The expectation of a chi-square random variable is equal to this degrees of freedom and the mode is equal to the degrees of freedom minus 2, if this difference is positive.

Therefore I wondered why STAN did not produce a posterior of the chi-square prior placed on the regression coefficients when printing the model fit. Is this the expected behaviour or maybe a bug?

Also, when using the full dataset with 1570 genes, the loo function told be that I should better use k-fold CV instead. I also don’t know if this is to be expected.

martinmodrak · September 27, 2019, 12:51pm

Hi,
not sure if your problems ended up resolved or not, so just a few minor points. If you are still struggling, it might be worth starting a new topic focused on the outstanding issues - this thread is a bit too long to digest quickly.

This means that the estimates of out-of-sample prediction obtained by loo are unrealiable. This does not (usually) indicate a problem with your model, only that you cannot trust the output of loo. .

AFAIK, the LASSO prior may be problematic in many cases, so unless you completely need to, I would definitely consider a different prior. Dan Simpson has a rather fun/weird post on LASSO.

Best of luck with your model!

riko · September 28, 2019, 4:57pm

@martinmodrak Thanks for the reply, the convergence issues of the LASSO prior are not resolved by now, but I will experiment a little further. Your link to the Bayesian LASSO post was helpful, maybe the explanations there are the reason STAN has trouble fitting the LASSO Model.

betanalpha · September 30, 2019, 12:56am

The LASSO is not appropriate for regularizing entire distributions, which is necessary for a Bayesian approach to sparsity. See the discussion in Sections 1 and 2 of https://betanalpha.github.io/assets/case_studies/bayes_sparse_regression.html.

riko · September 30, 2019, 7:34am

@betanalpha Thanks for the reply, I just read your case study, which really clarified some problems of the Bayesian LASSO for me. I think the key problem with my LASSO fit boils down to exactly the problems you addressed in your case study, namely that the relevant slopes exhibit signs of overregularization while the irrelvant slopes aren’t as strongly regularized as I would like, pushing STAN into troubles when exploring the posterior.

Topic		Replies	Views
Stan for survival models Modeling	26	12469	July 18, 2018
Survival models in rstanarm Developers	120	13885	October 21, 2020
Using stan_jm for parametric proportional hazards regression only rstanarm	25	4586	May 5, 2018
Huge RHATs, Extreme Model Performance Variability Depending on Priors Modeling	10	981	September 23, 2020
Issues with running Bayesian hierarchical reinforcement learning model in Stan Modeling rstan , fitting-issues , specification , performance	12	909	June 28, 2021

Huge R-hat and low effective size in piecewise exponential model when using more than one predictor and interval

Related topics