Clarification on Censored Data Models

I would like to create a partially synthetic dataset (along the lines of Stephen P. Jenkins’ “Measuring inequality using censored data: a multiple-imputation approach to estimation and inference”, but Bayesian). As a first step I’m sampling one random value from the posterior of each censored value and attaching it to the observed data.

The code I’m running is just a basic example using censored lognormal data:

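```
stan.mix.cen.test = "
data {
  int<lower=0> N_obs;    // number of uncensored observations
  int<lower=0> N_cens;   // number of right-censored observations
  real y_obs[N_obs];     // uncensored values
  real U;                // censoring point
}
parameters {
  real<lower=U> y_cens[N_cens];  // censored values, known only to exceed U
  real mu;                       // log-scale location, so it may be negative
  real<lower=0> sigma;           // log-scale standard deviation
}
model {
  y_obs ~ lognormal(mu, sigma);
  y_cens ~ lognormal(mu, sigma);
}"
```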
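```
library(rstan)

set.seed(123)
y <- rlnorm(5000, 10.5, .65)
U <- 100000
N.cens <- sum(y > U)   # number of right-censored values
ycens <- y[y <= U]     # despite the name, these are the uncensored values
dataList = list(y_obs = ycens, N_obs = length(ycens), U = U, N_cens = N.cens)
# iter includes warmup in rstan, so it must exceed warmup to keep any draws
cenfit = stan(model_code = stan.mix.cen.test, data = dataList,
              chains = 4, iter = 400, warmup = 200)

stan.out = data.frame(extract(cenfit))

# the first N.cens columns hold the posterior draws of y_cens
ys = stan.out[, 1:N.cens]
# one random posterior draw for each censored value
y.cens.samp <- apply(ys, 2, sample, size = 1)

post.data = data.frame("y" = c(as.numeric(ycens), as.numeric(y.cens.samp)))
```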
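Since the goal is a multiple-imputation workflow, one option is to build several completed datasets, each from a single joint posterior draw (one row of the draws matrix) rather than mixing iterations across censored values. The following is only a sketch under that assumption: `make_imputations` is a hypothetical helper, and the demo inputs are stand-ins, not output from the fit above.

```r
# Hypothetical helper: build M completed datasets, one per joint posterior draw.
# `draws` is an (iterations x N_cens) matrix of posterior draws of y_cens.
make_imputations <- function(y_obs, draws, M = 5) {
  rows <- sample(nrow(draws), size = M)           # M distinct iterations
  lapply(rows, function(r) c(y_obs, draws[r, ]))  # each row is one joint draw
}

# Stand-in inputs for illustration only (not real posterior draws)
set.seed(1)
U <- 100000
y_obs_demo <- c(20000, 45000, 80000)
draws_demo <- matrix(U + rexp(40 * 2, rate = 1 / 30000), nrow = 40, ncol = 2)

imps <- make_imputations(y_obs_demo, draws_demo, M = 5)
length(imps)    # number of completed datasets
lengths(imps)   # each holds length(y_obs_demo) + ncol(draws_demo) values
```

With a fitted model, the draws matrix could come from `extract(cenfit, pars = "y_cens")$y_cens`. The inequality statistic would then be computed on each completed dataset and the results combined across the M datasets.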

I realize that taking the maximum posterior value of y_cens is not the right way of creating the synthetic data.

The Stan User’s Guide has a chapter on truncated and censored data that explains how to code up both censoring and truncation in Stan.
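As a rough sketch of the other approach covered there, the censored values can be integrated out: each censored observation contributes the log complementary CDF at `U` to the target, and imputations are drawn afterwards from a lognormal truncated below at `U`. The names `stan.cens.marginal` and `r_trunc_lnorm` are hypothetical, and the parameter values in the demo are illustrative rather than fitted.

```r
# Stan program with the censored values integrated out (sketch).
# Newer Stan versions declare arrays as: array[N_obs] real y_obs;
stan.cens.marginal <- "
data {
  int<lower=0> N_obs;
  int<lower=0> N_cens;
  real y_obs[N_obs];
  real U;
}
parameters {
  real mu;
  real<lower=0> sigma;
}
model {
  y_obs ~ lognormal(mu, sigma);
  // each censored observation contributes log P(Y > U)
  target += N_cens * lognormal_lccdf(U | mu, sigma);
}"

# Hypothetical helper: inverse-CDF draws from lognormal(mu, sigma)
# truncated below at U, for a single posterior draw of (mu, sigma).
r_trunc_lnorm <- function(n, mu, sigma, U) {
  p_low <- plnorm(U, meanlog = mu, sdlog = sigma)  # P(Y <= U)
  u <- runif(n, min = p_low, max = 1)              # uniform on the upper tail
  qlnorm(u, meanlog = mu, sdlog = sigma)
}

set.seed(42)
# Illustrative parameter values only, not posterior draws
y_imp <- r_trunc_lnorm(300, mu = 10.5, sigma = 0.65, U = 100000)
range(y_imp)
```

This keeps the sampler's parameter space at two dimensions regardless of how many values are censored. For imputation, `r_trunc_lnorm` would be called once per retained posterior draw of `mu` and `sigma`.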