Censored data loglikelihood trouble


When including Stan code in your post, it really helps to make it as readable as possible by using Stan code chunks (```stan) with clear spacing and indentation. For example:

```stan
data {
  int<lower=1> N;              // number of subjects
  matrix[N, 3] X;
  int censor[N];
  vector[N] BP;
  real<lower=0> hyp_diag;
}

parameters {
  real<lower=1> intercept;
  vector<lower=0>[3] beta;
  real<lower=0> sigma;         // regression variance
  real intercept_b;
  vector[3] beta_p;            // binomial parameter coeffs
  real<lower=0> trunc_sigmau;
}

model {
  intercept ~ normal(0, 100);
  intercept_b ~ normal(0, 20);
  trunc_sigmau ~ student_t(3, 0, 2.5);
  sigma ~ student_t(3, 0, 2.5);
  to_vector(beta) ~ multi_normal([0, 0, 0], diag_matrix(square([5, 5, 5]')));
  to_vector(beta_p) ~ multi_normal([0, 0, 0], diag_matrix(square([5, 5, 5]')));
  for (n in 1:N) {
    if (censor[n] == 1) {
      BP[n] ~ normal(intercept + X[n, 1] * beta[1] + X[n, 2] * beta[2] + X[n, 3] * beta[3], sigma);
      if (BP[n] > hyp_diag) {
        target += bernoulli_lpmf(1 - censor[n]
                                 | inv_logit(intercept_b + X[n, 1] * beta_p[1]
                                             + X[n, 2] * beta_p[2] + X[n, 3] * beta_p[3]));
      }
    } else {
      target += normal_lccdf(BP[n]
                             | intercept + X[n, 1] * beta[1] + X[n, 2] * beta[2] + X[n, 3] * beta[3],
                             trunc_sigmau);
      target += bernoulli_lpmf(1 - censor[n]
                               | inv_logit(intercept_b + X[n, 1] * beta_p[1]
                                           + X[n, 2] * beta_p[2] + X[n, 3] * beta_p[3]));
    }
  }
}
```
```r
library(rstan)
library(EnvStats)   # assuming rnormTrunc() comes from EnvStats

model = stan_model(file = 'H:\\Test Bayes\\BP censored.stan')

nsim = 800
age  = runif(nsim, min = 25, max = 74)
sex  = rbinom(nsim, size = 1, prob = 0.5)
gene = rbinom(nsim, size = 1, prob = 0.51)
sig_err = 10
err  = rnorm(nsim, mean = 0, sd = sig_err)
beta0 = 123; beta1 = 0.5; beta2 = 5; beta3 = 3
Z = beta0 + beta1 * age + beta2 * sex + beta3 * gene + err
betas = c(beta0, beta1, beta2, beta3)

# subjects with true BP above 130; a random 60% of them are treated,
# which lowers the observed BP and censors the true value
list_hyp = which(Z > 130)
num_hyp  = length(list_hyp)
treat    = rbinom(num_hyp, 1, prob = 0.6)
old_Z    = Z
Z[list_hyp] = Z[list_hyp] + treat * rnormTrunc(num_hyp, mean = -30, sd = 20, max = 0)

hyp = data.frame(id = 1:nsim, age = age, sex = sex, gene = gene, BP = Z, true_BP = old_Z)
hyp$age_g = cut(age, breaks = c(min(age) - 1,
                                quantile(age, c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.9)),
                                max(age) + 1))
hyp$censor = 1
hyp$censor[list_hyp[as.logical(treat)]] = 0

data = list(N = dim(hyp)[1],
            X = as.matrix(hyp[, c('age', 'sex', 'gene')], ncol = 3),
            censor = hyp$censor,
            BP = hyp$BP,
            hyp_diag = 130)

fit_mod = sampling(model, data = data, chains = 1, iter = 2000, warmup = 1000)
```


I am trying to model censored data with a normal distribution whose mean depends on covariates. In reality, I will never have information to justify constraining all the betas to be positive in the parameters block. If I drop that constraint, Stan gives me a log(0) posterior density coming from the following line of the Stan code inside the loop:

```stan
target += normal_lccdf(BP[n] | intercept + X[n, 1] * beta[1] + X[n, 2] * beta[2] + X[n, 3] * beta[3], trunc_sigmau);
```

Q1: Is there a way to drop the all-positive constraint on the betas?
Q2: From the trace plots, the chains do not appear to have converged. Could this be caused by the normal_lccdf line above?

Thanks.

First, you don’t want to use multi-normal where you don’t have to. Your beta and beta_p priors should be coded as:

```stan
beta ~ normal(0, 5);
beta_p ~ normal(0, 5);
```

It works out to the same thing up to a constant.

I didn’t understand why there is a conditional (BP[n] > hyp_diag) or why there is a Bernoulli. There’s a section of the Stan user’s guide that explains how to handle censored data: 4.3 Censored data | Stan User’s Guide. Censoring usually just means that if measurements are above (or below) a given threshold, they are only reported as being beyond it. Here there seems to be some kind of Bernoulli choice going on, but I’m not sure what for.
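
For reference, the coding in that chapter looks roughly like this (a minimal sketch assuming a single known right-censoring threshold U and no covariates; the variable names here are made up for illustration):

```stan
data {
  int<lower=0> N_obs;        // uncensored observations
  int<lower=0> N_cens;       // observations censored at U
  vector[N_obs] y_obs;
  real U;                    // known censoring threshold
}
parameters {
  real mu;
  real<lower=0> sigma;
}
model {
  y_obs ~ normal(mu, sigma);                        // observed values
  target += N_cens * normal_lccdf(U | mu, sigma);   // each censored value contributes Pr[Y > U]
}
```

The censored measurements never appear as data; each one just adds the upper-tail probability to the log density.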

If you do need the Bernoulli for some reason, rather than bernoulli_lpmf(x | inv_logit(y)), you can write bernoulli_logit_lpmf(x | y)—the _logit suffix says the argument is on the log odds scale. It’s a bit more efficient and more numerically stable to use the bernoulli_logit version.
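
Applied to the posted model, that change would look something like this (same variable names as above):

```stan
// instead of
target += bernoulli_lpmf(1 - censor[n]
                         | inv_logit(intercept_b + X[n, 1] * beta_p[1]
                                     + X[n, 2] * beta_p[2] + X[n, 3] * beta_p[3]));
// write
target += bernoulli_logit_lpmf(1 - censor[n]
                               | intercept_b + X[n, 1] * beta_p[1]
                                 + X[n, 2] * beta_p[2] + X[n, 3] * beta_p[3]);
```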

Why are the betas constrained to be positive? That’s not necessary in a normal regression, and if the data want them to be less than zero for a better fit, the sampler is going to struggle with parameters pushed up against the constraint boundary (the boundaries themselves map to +/- infinity on the unconstrained scale).
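
Concretely, dropping that constraint just means declaring the coefficients without a lower bound:

```stan
parameters {
  vector[3] beta;   // previously vector<lower=0>[3] beta
  // ... other declarations unchanged
}
```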

It’s also much more efficient to use transformed data to sort the N entries into the ones that are censored and the ones that are not; then everything can be vectorized. But that’s a step to do after everything’s working in the simpler coding.
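
As a rough sketch of that later step (assuming censor is coded 0/1 with 1 meaning an uncensored measurement, as in the simulated data above):

```stan
transformed data {
  int N_obs = sum(censor);    // number of uncensored subjects
  int N_cens = N - N_obs;     // number of censored subjects
  int obs_idx[N_obs];
  int cens_idx[N_cens];
  {
    int i = 1;
    int j = 1;
    for (n in 1:N) {
      if (censor[n] == 1) {
        obs_idx[i] = n;
        i += 1;
      } else {
        cens_idx[j] = n;
        j += 1;
      }
    }
  }
}
model {
  // ... priors as before ...
  BP[obs_idx] ~ normal(intercept + X[obs_idx] * beta, sigma);
  // the lccdf and Bernoulli terms can be vectorized over cens_idx the same way
}
```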