Clarification on Censored Data Models

I would like to create a partially synthetic dataset (along the lines of Stephen P. Jenkins’ “Measuring inequality using censored data: a multiple-imputation approach to estimation and inference”, but Bayesian). As a first step, I’m sampling one random value from the posterior of each censored value and attaching it to the observed data.

The code I’m running is just a basic example using censored lognormal data:

stan.mix.cen.test="
data {
  int<lower=0> N_obs;
  int<lower=0> N_cens;
  real y_obs[N_obs];
  real U;
}
parameters {
  real<lower=U> y_cens[N_cens];
  real mu;              // log-scale location; it can be any real number
  real<lower=0> sigma;  
} 
model {
  y_obs ~ lognormal(mu,sigma);
  y_cens ~ lognormal(mu, sigma);
}"

library(rstan)
set.seed(123)
y <- rlnorm(5000,10.5,.65)
U<-100000
N.cens <- length(y[y>U])
ycens <- y[y<U]
dataList = list(y_obs = ycens , N_obs = length(ycens), U=U, N_cens = N.cens)
# In rstan, iter includes warmup, so iter must exceed warmup to keep any draws.
cenfit = stan(model_code=stan.mix.cen.test, data=dataList,
               chains=4 , iter=400 , warmup=200)

stan.out = data.frame(extract(cenfit))

 ys = stan.out[,1:N.cens]
 y.cens.samp <- apply(ys,2,sample,size=1)

post.data = data.frame("y"=c(as.numeric(ycens),as.numeric(y.cens.samp)))



I realize that taking the maximum posterior value of y_cens is not the right way of creating the synthetic data.
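
One alternative in the multiple-imputation spirit is to build each completed dataset from a single joint posterior draw, that is, one row of the draws matrix rather than a per-column summary. The sketch below is only an illustration under assumptions: `draws` is a simulated stand-in for what `extract(cenfit)$y_cens` would return, and `y_obs`, `M`, `rows` and `completed` are hypothetical names and toy values, not part of the original code.

```r
set.seed(1)
U <- 100000
y_obs <- c(20000, 45000, 80000)   # toy observed values, all below U
n_cens <- 4                       # number of censored observations
n_draws <- 50                     # number of posterior draws

# Stand-in for extract(cenfit)$y_cens (assumption, not real model output):
# rows are posterior draws, columns are censored observations, all above U.
draws <- matrix(U + rexp(n_draws * n_cens, rate = 1 / 30000),
                nrow = n_draws, ncol = n_cens)

# Pick M distinct posterior draws and build one completed dataset per draw.
M <- 5
rows <- sample(n_draws, M)
completed <- lapply(rows, function(r) c(y_obs, draws[r, ]))
```

Each element of `completed` then holds the observed values followed by one joint imputation of the censored ones, so an inequality measure can be computed on each dataset and the results combined across the `M` datasets.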

There’s a chapter in the user’s guide section of the manual that explains how to code up censoring and truncation in Stan.
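
To illustrate the idea behind that chapter, here is a minimal base-R sketch of the "integrate out the censored values" formulation: each censored observation contributes the log probability of exceeding `U` instead of being a parameter. In Stan the analogous statement would be along the lines of `target += N_cens * lognormal_lccdf(U | mu, sigma);`. The sketch uses maximum likelihood via `optim` purely to keep it self-contained; `negloglik`, `fit`, `est` and `y_imp` are hypothetical names, and the imputation step plugs in a point estimate where a Bayesian fit would use a posterior draw of `mu` and `sigma`.

```r
set.seed(123)
y <- rlnorm(5000, 10.5, 0.65)
U <- 100000
y_obs <- y[y < U]          # values observed below the censoring point
n_cens <- sum(y >= U)      # count of right-censored values

# Negative log-likelihood for right-censored lognormal data.
# par = c(mu, log(sigma)) so that sigma stays positive during optimisation.
negloglik <- function(par) {
  mu <- par[1]
  sigma <- exp(par[2])
  ll_obs <- sum(dlnorm(y_obs, mu, sigma, log = TRUE))
  # Each censored value contributes log P(Y > U).
  ll_cens <- n_cens * plnorm(U, mu, sigma, lower.tail = FALSE, log.p = TRUE)
  -(ll_obs + ll_cens)
}

start <- c(mean(log(y_obs)), log(sd(log(y_obs))))
fit <- optim(start, negloglik)
est <- c(mu = fit$par[1], sigma = exp(fit$par[2]))

# Impute censored values from the lognormal truncated below at U (inverse CDF).
u <- runif(n_cens, plnorm(U, est["mu"], est["sigma"]), 1)
y_imp <- qlnorm(u, est["mu"], est["sigma"])
```

With this formulation the sampler no longer has to explore one parameter per censored observation, and imputed values can be generated afterwards from the truncated distribution.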