Censored data loglikelihood trouble


When including Stan code in your post, it really helps to make it as readable as possible by using Stan code chunks (```stan) with clear spacing and indentation. For example:

```stan
data {
  int<lower=1> N;              // number of subjects
  matrix[N, 3] X;
  int censor[N];
  vector[N] BP;
  real<lower=0> hyp_diag;
}

parameters {
  real<lower=1> intercept;
  vector<lower=0>[3] beta;
  real<lower=0> sigma;         // regression variance
  real intercept_b;
  vector[3] beta_p;            // binomial parameter coeffs
  real<lower=0> trunc_sigmau;
}

model {
  intercept ~ normal(0, 100);
  intercept_b ~ normal(0, 20);
  trunc_sigmau ~ student_t(3, 0, 2.5);
  sigma ~ student_t(3, 0, 2.5);
  to_vector(beta) ~ multi_normal([0, 0, 0], diag_matrix(square([5, 5, 5]')));
  to_vector(beta_p) ~ multi_normal([0, 0, 0], diag_matrix(square([5, 5, 5]')));
  for (n in 1:N) {
    if (censor[n] == 1) {
      BP[n] ~ normal(intercept + X[n, 1] * beta[1] + X[n, 2] * beta[2] + X[n, 3] * beta[3], sigma);
      if (BP[n] > hyp_diag) {
        target += bernoulli_lpmf(1 - censor[n]
                                 | inv_logit(intercept_b + X[n, 1] * beta_p[1]
                                             + X[n, 2] * beta_p[2] + X[n, 3] * beta_p[3]));
      }
    } else {
      target += normal_lccdf(BP[n]
                             | intercept + X[n, 1] * beta[1] + X[n, 2] * beta[2] + X[n, 3] * beta[3],
                             trunc_sigmau);
      target += bernoulli_lpmf(1 - censor[n]
                               | inv_logit(intercept_b + X[n, 1] * beta_p[1]
                                           + X[n, 2] * beta_p[2] + X[n, 3] * beta_p[3]));
    }
  }
}
```
```r
library(rstan)
library(EnvStats)   # assuming rnormTrunc() comes from EnvStats

model = stan_model(file = 'H:\\Test Bayes\\BP censored.stan')

nsim = 800
age  = runif(nsim, min = 25, max = 74)
sex  = rbinom(nsim, size = 1, prob = 0.5)
gene = rbinom(nsim, size = 1, prob = 0.51)
sig_err = 10
err  = rnorm(nsim, mean = 0, sd = sig_err)
beta0 = 123; beta1 = 0.5; beta2 = 5; beta3 = 3
Z = beta0 + beta1 * age + beta2 * sex + beta3 * gene + err
betas = c(beta0, beta1, beta2, beta3)

# subjects with true BP above 130; a random 60% of them are treated,
# which lowers the observed BP and censors the true value
list_hyp = which(Z > 130)
num_hyp  = length(list_hyp)
treat    = rbinom(num_hyp, 1, prob = 0.6)
old_Z    = Z
Z[list_hyp] = Z[list_hyp] + treat * rnormTrunc(num_hyp, mean = -30, sd = 20, max = 0)

hyp = data.frame(id = 1:nsim, age = age, sex = sex, gene = gene, BP = Z, true_BP = old_Z)
hyp$age_g = cut(age, breaks = c(min(age) - 1,
                                quantile(age, c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.9)),
                                max(age) + 1))
hyp$censor = 1
hyp$censor[list_hyp[as.logical(treat)]] = 0

data = list(N = dim(hyp)[1],
            X = as.matrix(hyp[, c('age', 'sex', 'gene')], ncol = 3),
            censor = hyp$censor,
            BP = hyp$BP,
            hyp_diag = 130)

fit_mod = sampling(model, data = data, chains = 1, iter = 2000, warmup = 1000)
```


I am trying to model censored data with a normal distribution whose mean depends on covariates. In reality, I will never have information to justify constraining all the betas to be positive in the parameters block. If I drop that constraint, Stan gives me a log(0) posterior density coming from the following line of the Stan code inside the loop:

```stan
target += normal_lccdf(BP[n] | intercept + X[n, 1] * beta[1] + X[n, 2] * beta[2] + X[n, 3] * beta[3], trunc_sigmau);
```

Q1: Is there a way to drop the all-positive constraint on the betas?
Q2: From the trace plots, the chains do not appear to have converged. Could this be caused by the normal_lccdf line above?

Thanks.

First, you don’t want to use multi-normal where you don’t have to. Your beta and beta_p priors should be coded as:

```stan
beta ~ normal(0, 5);
beta_p ~ normal(0, 5);
```

It works out to the same thing up to a constant.

I didn’t understand why there is a conditional (BP[n] > hyp_diag) or why there is a Bernoulli. There’s a section of the Stan user’s guide that explains how to handle censored data: 4.3 Censored data | Stan User’s Guide. Censoring usually just means that if measurements are above (or below) a given threshold, they are only reported as being beyond it. Here there seems to be some kind of Bernoulli choice going on, but I’m not sure what for.
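
For reference, the coding in that chapter looks roughly like this (a minimal sketch assuming a single known right-censoring threshold U and no covariates; the variable names here are made up for illustration):

```stan
data {
  int<lower=0> N_obs;        // uncensored observations
  int<lower=0> N_cens;       // observations censored at U
  vector[N_obs] y_obs;
  real U;                    // known censoring threshold
}
parameters {
  real mu;
  real<lower=0> sigma;
}
model {
  y_obs ~ normal(mu, sigma);                        // observed values
  target += N_cens * normal_lccdf(U | mu, sigma);   // each censored value contributes Pr[Y > U]
}
```

The censored measurements never appear as data; each one just adds the upper-tail probability to the log density.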

If you do need the Bernoulli for some reason, rather than bernoulli_lpmf(x | inv_logit(y)), you can write bernoulli_logit_lpmf(x | y)—the _logit suffix says the argument is on the log odds scale. It’s a bit more efficient and more numerically stable to use the bernoulli_logit version.
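
Applied to the posted model, that change would look something like this (same variable names as above):

```stan
// instead of
target += bernoulli_lpmf(1 - censor[n]
                         | inv_logit(intercept_b + X[n, 1] * beta_p[1]
                                     + X[n, 2] * beta_p[2] + X[n, 3] * beta_p[3]));
// write
target += bernoulli_logit_lpmf(1 - censor[n]
                               | intercept_b + X[n, 1] * beta_p[1]
                                 + X[n, 2] * beta_p[2] + X[n, 3] * beta_p[3]);
```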

Why are the betas constrained to be positive? That’s not necessary in a normal regression, and if the data want them to be less than zero for a better fit, the sampler is going to struggle with parameters pushed up against the constraint boundary (the boundaries themselves map to +/- infinity on the unconstrained scale).
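
Concretely, dropping that constraint just means declaring the coefficients without a lower bound:

```stan
parameters {
  vector[3] beta;   // previously vector<lower=0>[3] beta
  // ... other declarations unchanged
}
```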

It’s also much more efficient to use transformed data to sort the N entries into the ones that are censored and the ones that are not; then everything can be vectorized. But that’s a step to do after everything’s working in the simpler coding.
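
As a rough sketch of that later step (assuming censor is coded 0/1 with 1 meaning an uncensored measurement, as in the simulated data above):

```stan
transformed data {
  int N_obs = sum(censor);    // number of uncensored subjects
  int N_cens = N - N_obs;     // number of censored subjects
  int obs_idx[N_obs];
  int cens_idx[N_cens];
  {
    int i = 1;
    int j = 1;
    for (n in 1:N) {
      if (censor[n] == 1) {
        obs_idx[i] = n;
        i += 1;
      } else {
        cens_idx[j] = n;
        j += 1;
      }
    }
  }
}
model {
  // ... priors as before ...
  BP[obs_idx] ~ normal(intercept + X[obs_idx] * beta, sigma);
  // the lccdf and Bernoulli terms can be vectorized over cens_idx the same way
}
```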