I am considering the following situation. Let Y = 123 + 0.5*age + 5*sex + 3*gene + \epsilon, where \epsilon \sim Normal(0, 10) (sd = 10). A status variable censor records whether Y has been modified by the following coin-flipping procedure.
If Y \in (130, 160], then with probability 0.6 a draw from a Normal(-10, 3) truncated above at 0 is added to Y. In this case censor=2 if the shift is applied and censor=1 if Y is untouched.
If Y \in (160, \infty), then with probability 0.5 a draw from a Normal(-20, 5) truncated above at 0 is added to Y. In this case censor=3 if the shift is applied and censor=1 if Y is untouched.
Since in either case the added term comes from a truncated normal whose |mean| is at least twice its sd, I approximate the truncated density by the corresponding untruncated normal density. This gives three cases according to the censoring status.
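As a quick numerical check of how little the truncation matters here (a sketch of mine, assuming rnormTrunc comes from the EnvStats package, as in the code below):

library(EnvStats)
## upper-tail mass removed by truncating Normal(-10, 3) above at 0
pnorm(0, mean = -10, sd = 3, lower.tail = FALSE)    # roughly 4e-4
## moments of the truncated draws: essentially -10 and 3
draws <- rnormTrunc(1e5, mean = -10, sd = 3, max = 0)
c(mean(draws), sd(draws))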
If censor=1, then Y\sim Normal(\beta^Tx,\sigma).
If censor=2, then Y\sim Normal(\mu+\beta^Tx,\sigma_{\mu}).
If censor=3, then Y\sim Normal(\mu_2+\beta^Tx,\sigma_{\mu_2}).
I would expect \sigma=10, \sigma_\mu=\sqrt{10^2+3^2}, \sigma_{\mu_2}=\sqrt{10^2+5^2}, \mu=-10, \mu_2=-20. \beta here includes the intercept term.
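For reference, the scales I expect under this approximation work out to:

sqrt(10^2 + 3^2)   # sig_mu  should be about 10.44
sqrt(10^2 + 5^2)   # sig_mu2 should be about 11.18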
Here is the R code to generate the data.
library(EnvStats)   # provides rnormTrunc()
library(rstan)

set.seed(1234)
nsim=800
## covariates
age=runif(nsim,min=25,max=74)
sex=rbinom(nsim,size=1,prob=0.5)
gene=rbinom(nsim,size=1,prob=0.51)
## untreated outcome
sig_err=10
err=rnorm(nsim,mean=0,sd=sig_err)
beta0=123;beta1=0.5;beta2=5;beta3=3
Z=beta0+beta1*age+beta2*sex+beta3*gene+err
## indices eligible for modification
list_hyp=which(Z>130 & Z<=160)
list_su_hyp=which(Z>160)
num_hyp=length(list_hyp)
num_su_hyp=length(list_su_hyp)
## coin flips deciding whether the shift is actually applied
treat_hyp=rbinom(num_hyp,1,prob=0.6)
treat_su_hyp=rbinom(num_su_hyp,1,prob=0.5)
## keep the untouched values, then add the truncated-normal shifts
old_Z=Z
Z[list_hyp]=Z[list_hyp]+treat_hyp*rnormTrunc(num_hyp,mean=-10,sd=3,max=0)
Z[list_su_hyp]=Z[list_su_hyp]+treat_su_hyp*rnormTrunc(num_su_hyp,mean=-20,sd=5,max=0)
hyp=data.frame(id=1:nsim,age=age,sex=sex,gene=gene,BP=Z,true_BP=old_Z)
## censoring status: 1 = untouched, 2 = shifted from (130,160], 3 = shifted from (160,Inf)
hyp$censor=1
hyp$censor[list_hyp[as.logical(treat_hyp)]]=2
hyp$censor[list_su_hyp[as.logical(treat_su_hyp)]]=3
data=list(N=nrow(hyp),
          X=as.matrix(hyp[,c('age','sex','gene')]),
          censor=hyp$censor,
          BP=hyp$BP,
          hyp_diag=160,
          num_censor=sum(hyp$censor>1)  # number of modified rows; not declared in the Stan data block
)
fit=sampling(model,data=data)  # 'model' is the compiled Stan program below (see the sketch after it)
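A quick sanity check on the generated data (my addition, not part of the model):

table(hyp$censor)                              # observations per censoring class
summary(hyp$BP - hyp$true_BP)                  # applied shifts: 0 when censor=1, negative otherwise
tapply(hyp$BP - hyp$true_BP, hyp$censor, sd)   # within-class sd of the shifts (about 0, 3 and 5)

The Stan program that model refers to is: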
data{
int<lower=1> N;
matrix[N,3] X;
int censor[N];
vector[N] BP;
real<lower=0> hyp_diag;
}
parameters{
real intercept;
vector[3] beta;
real<lower=0> sigma;
real<upper=0> mu;          // shift for censor=2 observations
real<upper=mu> mu2;        // shift for censor=3, constrained below mu
real<lower=sigma> sig_mu;  // residual sd for censor=2, at least sigma
real<lower=sigma> sig_mu2; // residual sd for censor=3, at least sigma
}
model{
intercept ~ normal(0,100);
sigma ~ student_t(3,0,2.5);
sig_mu ~ student_t(3,0,2.5);
sig_mu2 ~ student_t(3,0,2.5);
mu ~ normal(0,10);
mu2 ~ normal(0,10);
beta ~ normal(0,10);  // independent N(0,10) priors, equivalent to the diagonal multi_normal
for(n in 1:N){
  real eta = intercept + X[n]*beta;  // linear predictor
  if (censor[n]==1){
    BP[n] ~ normal(eta, sigma);
  } else if (censor[n]==2){
    BP[n] ~ normal(mu + eta, sig_mu);
  } else if (censor[n]==3){
    BP[n] ~ normal(mu2 + eta, sig_mu2);
  }
}
}
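For reference, a minimal sketch of how the program above can be compiled and run (censor_model.stan is just my placeholder for wherever the Stan code is saved; chains and iterations are the rstan defaults):

library(rstan)
model <- stan_model("censor_model.stan")   # compile the Stan program above
fit <- sampling(model, data = data)        # the call shown earlier
print(fit, pars = c("intercept","beta","sigma","mu","mu2","sig_mu","sig_mu2"),
      probs = c(0.025, 0.975))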
Here is the output:
                  mean       se_mean          sd         2.5%        97.5%     n_eff       Rhat
intercept  127.0305705  0.0282909377  1.37467000  124.3051316  129.6931821  2361.034  0.9997761
beta[1]      0.4233285  0.0004757727  0.02391632    0.3766278    0.4703922  2526.909  1.0000864
beta[2]      4.5895118  0.0115347598  0.66522478    3.2544294    5.8699729  3325.981  0.9994506
beta[3]      2.4931633  0.0120725312  0.65245388    1.2161767    3.7494475  2920.808  1.0000846
sigma        9.2381318  0.0039795457  0.23835612    8.7711815    9.7197286  3587.448  0.9998949
mu         -11.6151320  0.0106144503  0.67463635  -12.9403125  -10.2880964  4039.657  0.9997122
mu2        -11.8625166  0.0106154207  0.70014741  -13.2387657  -10.5160568  4350.153  0.9999956
sig_mu       9.3278164  0.0041279792  0.24834598    8.8595974    9.8258083  3619.422  0.9995576
sig_mu2      9.6664530  0.0059720354  0.40824948    9.0047848   10.5756451  4673.116  1.0004869
I intentionally set \mu_2's upper bound to \mu to enforce \mu_2<\mu. According to Stan, \mu_2 ends up very close to \mu, which is not what the simulation specifies (\mu=-10, \mu_2=-20). Similarly, the scale estimates \sigma_\mu and \sigma_{\mu_2} are far from the true values; the truncated normal's sd seems never to be picked up. I have checked the trace plots for convergence and the ACF plots as well, and found nothing unusual.
Q1: Since I am approximating the truncated normal by a normal, is this approximation the cause of the issue?
Q2: How do I recover \sigma_\mu and \sigma_{\mu_2}?