# Censored regression

[edit: escaped code]

I ran a censored regression with the following stan code:

``````data {
int<lower=0> K;
int<lower=0> N;
int<lower=0> cens[N];  // censoring flag=1 or 0
matrix[N,K] X ;
vector[N] y;
vector[N] C;    //vector of left censoring pt=0
}

parameters {
vector[K] beta;
real alpha;
real lsigma;   //log(sigma)
}

transformed parameters {
real<lower=0> sigma;
sigma = exp(lsigma);
}
model {
vector[N] mu= X*beta + alpha;

//prior model

beta ~ normal(0,10) ;
alpha ~ normal(0,10);
lsigma ~ normal(0,10);

//observational model
for (i in 1:N)
target += cens[i]*normal_lcdf(C[i] | mu[i],sigma)  + (1-cens[i])*normal_lpdf(y[i] |  mu[i],sigma);
}
``````

The model ran successfully with the following results:

``````           mean se_mean   sd    2.5%     25%     50%     75%   97.5% n_eff
beta[1]   -0.07    0.00 0.05   -0.16   -0.10   -0.07   -0.04    0.02  2450
beta[2]    0.12    0.00 0.04    0.04    0.09    0.12    0.15    0.20  2530
beta[3]    0.52    0.00 0.05    0.43    0.49    0.52    0.55    0.61  1979
beta[4]   -0.37    0.00 0.05   -0.47   -0.40   -0.37   -0.34   -0.28  2012
beta[5]   -0.38    0.00 0.05   -0.48   -0.42   -0.38   -0.35   -0.29  2501
alpha     -0.06    0.00 0.04   -0.14   -0.09   -0.06   -0.03    0.02  2120
lsigma    -0.04    0.00 0.03   -0.11   -0.07   -0.04   -0.02    0.03  2281
``````

Now, I reran the same model with the vectorized version of the stan code given below where I only changed a) cesoring flag (cens) from “integer array” to a “vector” and b) used vectorized log likelihood specification instead of loop. I expected same results, but there were some non-trivial differences between the two and I am wondering why? Anybody has any insights?

``````data {
int<lower=0> K;
int<lower=0> N;
vector[N] cens;  // int<lower=0> cens[N];
matrix[N,K] X ;
vector[N] y;
vector[N] C;
}

parameters {
vector[K] beta;
real alpha;
real lsigma;
}

transformed parameters {
real<lower=0> sigma;
sigma = exp(lsigma);
}
model {
vector[N] mu= X*beta + alpha;

//prior model

beta ~ normal(0,10) ;
alpha ~ normal(0,10);
lsigma ~ normal(0,10);   //mean=1 and variance=10

//observational model
target += cens*normal_lcdf(C | mu,sigma) + (1-cens)*normal_lpdf(y | mu,sigma);
}
``````
``````              mean se_mean   sd       2.5%        25%        50%        75%
beta[1]      -0.03    0.00 0.00      -0.04      -0.04      -0.03      -0.03
beta[2]       0.06    0.00 0.00       0.06       0.06       0.06       0.06
beta[3]       0.31    0.00 0.00       0.30       0.31       0.31       0.31
beta[4]      -0.20    0.00 0.00      -0.21      -0.20      -0.20      -0.20
beta[5]      -0.20    0.00 0.00      -0.20      -0.20      -0.20      -0.19
alpha        -0.39    0.00 0.00      -0.39      -0.39      -0.39      -0.39
``````

The problem you’re running into is that `normal_lcdf(...)` returns a single real value that is the sum of the vectorization. So the these two are very different:

``````target += cens*normal_lcdf(C | mu,sigma) + (1-cens)*normal_lpdf(y | mu,sigma);
``````
``````for (i in 1:N)
target += cens[i]*normal_lcdf(C[i] | mu[i],sigma)  + (1-cens[i])*normal_lpdf(y[i] |  mu[i],sigma);
``````

The latter one using a loop is the mathematically correct form assuming you are censoring observation `i` below at `C[I]`.

A much more efficient way to do this is to create two integer indices, one for censored values and one for uncensored values and then use slicing to vectorize the uncensored observations.

1 Like

Ah!. Thank you.