Censored regression

[edit: escaped code]

I ran a censored regression with the following stan code:

data {
   int<lower=0> K;
   int<lower=0> N;
   int<lower=0> cens[N];  // censoring flag=1 or 0
   matrix[N,K] X ;
   vector[N] y;
   vector[N] C;    //vector of left censoring pt=0
  }
  
  parameters {
    vector[K] beta;  
    real alpha;   
    real lsigma;   //log(sigma)
  }
 
  transformed parameters {
  real<lower=0> sigma;
  sigma = exp(lsigma);
}
  model {
    vector[N] mu= X*beta + alpha;  
    
    //prior model
 
     beta ~ normal(0,10) ;
     alpha ~ normal(0,10);  
     lsigma ~ normal(0,10);  

     //observational model
      for (i in 1:N)
        target += cens[i]*normal_lcdf(C[i] | mu[i],sigma)  + (1-cens[i])*normal_lpdf(y[i] |  mu[i],sigma);
   }

The model ran successfully with the following results:

           mean se_mean   sd    2.5%     25%     50%     75%   97.5% n_eff
beta[1]   -0.07    0.00 0.05   -0.16   -0.10   -0.07   -0.04    0.02  2450
beta[2]    0.12    0.00 0.04    0.04    0.09    0.12    0.15    0.20  2530
beta[3]    0.52    0.00 0.05    0.43    0.49    0.52    0.55    0.61  1979
beta[4]   -0.37    0.00 0.05   -0.47   -0.40   -0.37   -0.34   -0.28  2012
beta[5]   -0.38    0.00 0.05   -0.48   -0.42   -0.38   -0.35   -0.29  2501
alpha     -0.06    0.00 0.04   -0.14   -0.09   -0.06   -0.03    0.02  2120
lsigma    -0.04    0.00 0.03   -0.11   -0.07   -0.04   -0.02    0.03  2281

Now, I reran the same model with the vectorized version of the stan code given below where I only changed a) cesoring flag (cens) from “integer array” to a “vector” and b) used vectorized log likelihood specification instead of loop. I expected same results, but there were some non-trivial differences between the two and I am wondering why? Anybody has any insights?

data {
   int<lower=0> K;
   int<lower=0> N;
   vector[N] cens;  // int<lower=0> cens[N];
   matrix[N,K] X ;
   vector[N] y;
   vector[N] C;    
  }
  
  parameters {
    vector[K] beta; 
    real alpha;     
    real lsigma;   
  }
 
  transformed parameters {
  real<lower=0> sigma;
  sigma = exp(lsigma);
}
  model {
    vector[N] mu= X*beta + alpha;  
    
    //prior model
 
     beta ~ normal(0,10) ;
     alpha ~ normal(0,10);  
     lsigma ~ normal(0,10);   //mean=1 and variance=10

     //observational model
       target += cens*normal_lcdf(C | mu,sigma) + (1-cens)*normal_lpdf(y | mu,sigma);
    }
              mean se_mean   sd       2.5%        25%        50%        75%
beta[1]      -0.03    0.00 0.00      -0.04      -0.04      -0.03      -0.03
beta[2]       0.06    0.00 0.00       0.06       0.06       0.06       0.06
beta[3]       0.31    0.00 0.00       0.30       0.31       0.31       0.31
beta[4]      -0.20    0.00 0.00      -0.21      -0.20      -0.20      -0.20
beta[5]      -0.20    0.00 0.00      -0.20      -0.20      -0.20      -0.19
alpha        -0.39    0.00 0.00      -0.39      -0.39      -0.39      -0.39

The problem you’re running into is that normal_lcdf(...) returns a single real value that is the sum of the vectorization. So the these two are very different:

target += cens*normal_lcdf(C | mu,sigma) + (1-cens)*normal_lpdf(y | mu,sigma);
for (i in 1:N)
  target += cens[i]*normal_lcdf(C[i] | mu[i],sigma)  + (1-cens[i])*normal_lpdf(y[i] |  mu[i],sigma);

The latter one using a loop is the mathematically correct form assuming you are censoring observation i below at C[I].

A much more efficient way to do this is to create two integer indices, one for censored values and one for uncensored values and then use slicing to vectorize the uncensored observations.

1 Like

Ah!. Thank you.