Why can't I recover the variance and mean parameters?

I am considering the following situation. Let Y=123+0.5*age+5*sex+3*gene+\epsilon, where \epsilon\sim Normal(0,10) (sd 10). A status variable censor records whether Y has been modified by the following coin-flipping procedure.

If Y\in (130,160], then with probability 0.6 a draw from a Normal(-10, 3) truncated above at 0 is added to Y. In this case censor=2 if Y was modified and censor=1 if Y was left untouched.

If Y\in (160,\infty), then with probability 0.5 a draw from a Normal(-20, 5) truncated above at 0 is added to Y. In this case censor=3 if Y was modified and censor=1 if Y was left untouched.

Since in both cases the added term is a truncated normal whose mean lies more than three standard deviations below the truncation point, I can approximate the truncated density by an ordinary normal density. This leaves three cases according to the censoring status.

If censor=1, then Y\sim Normal(\beta^Tx,\sigma).
If censor=2, then Y\sim Normal(\mu+\beta^Tx,\sigma_{\mu}).
If censor=3, then Y\sim Normal(\mu_2+\beta^Tx,\sigma_{\mu_2}).

I would expect \sigma=10, \sigma_{\mu}=\sqrt{10^2+3^2}, \sigma_{\mu_2}=\sqrt{10^2+5^2}, \mu=-10, \mu_2=-20. \beta here includes the intercept term.
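As a quick sanity check on this approximation, here is a small R sketch (trunc_moments is a hypothetical helper, not part of my code below) that computes the exact mean and sd of the two truncated normals from the standard truncated-normal moment formulas; both come out very close to (-10, 3) and (-20, 5), so the approximation itself should be fine.

  # Hypothetical helper: exact mean and sd of Normal(m, s) truncated above at `upper`.
  trunc_moments=function(m,s,upper=0){
    b=(upper-m)/s                # standardized truncation point
    lam=dnorm(b)/pnorm(b)        # phi(b)/Phi(b) for upper truncation
    c(mean=m-s*lam, sd=s*sqrt(1-b*lam-lam^2))
  }
  trunc_moments(-10,3)   # roughly (-10.0, 3.0)
  trunc_moments(-20,5)   # roughly (-20.0, 5.0)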


Here is R code to generate the data.

  library(EnvStats) # for rnormTrunc
  set.seed(1234)
  nsim=800
  age=runif(nsim,min=25,max=74)
  sex=rbinom(nsim,size=1,prob=0.5)
  gene=rbinom(nsim,size=1,prob=0.51)
  sig_err=10
  err=rnorm(nsim,mean=0,sd=sig_err)
  beta0=123;beta1=0.5;beta2=5;beta3=3
  Z=beta0+beta1*age+beta2*sex+beta3*gene+err
  list_hyp=which(Z>130 & Z<=160)
  list_su_hyp=which(Z>160)
  num_hyp=length(list_hyp)
  num_su_hyp=length(list_su_hyp)
  treat_hyp=rbinom(num_hyp,1,prob=0.6)
  treat_su_hyp=rbinom(num_su_hyp,1,prob=0.5)
  old_Z=Z # keep the pre-treatment values before modifying Z
  Z[list_hyp]=Z[list_hyp]+treat_hyp*rnormTrunc(num_hyp,mean=-10,sd=3,max=0)
  Z[list_su_hyp]=Z[list_su_hyp]+treat_su_hyp*rnormTrunc(num_su_hyp,mean=-20,sd=5,max=0)
  hyp=data.frame(id=1:nsim,age=age,sex=sex,gene=gene,BP=Z,true_BP=old_Z)
  hyp$censor=1
  hyp$censor[list_hyp[as.logical(treat_hyp)]]=2
  hyp$censor[list_su_hyp[as.logical(treat_su_hyp)]]=3
  
  data=list(N=dim(hyp)[1],
            X=as.matrix(hyp[,c('age','sex','gene')],ncol=3),
            censor=hyp$censor,
            BP=hyp$BP,
            hyp_diag=160,
            num_censor=sum(hyp$censor)
  )
  fit=sampling(model,data=data) # model is compiled from the Stan program below, e.g. via rstan::stan_model()

And here is the Stan model:

data{
  int<lower=1> N; 
  matrix[N,3] X;
  int censor[N];
  vector[N] BP;
  real<lower=0> hyp_diag;
}

parameters{
  real intercept;
  vector[3] beta;
  real<lower=0> sigma;       // residual sd for untreated observations
  real<upper=0> mu;          // shift for censor==2, constrained negative
  real<upper=mu> mu2;        // shift for censor==3, constrained below mu
  real<lower=sigma> sig_mu;  // sd for censor==2, constrained above sigma
  real<lower=sigma> sig_mu2; // sd for censor==3, constrained above sigma
}

model{
  intercept ~ normal(0,100);
  sigma ~ student_t(3,0,2.5);
  sig_mu ~ student_t(3,0,2.5);
  sig_mu2 ~ student_t(3,0,2.5);
  mu ~ normal(0,10);
  mu2 ~ normal(0,10);
  beta ~ normal(0,10); // independent normal(0,10) priors, equivalent to the diagonal multi_normal
  for(n in 1:N){
    if (censor[n]==1) {
      BP[n] ~ normal(intercept+X[n,1]*beta[1]+X[n,2]*beta[2]+X[n,3]*beta[3],sigma);
    }else if (censor[n]==2){
      BP[n] ~ normal(mu+intercept+X[n,1]*beta[1]+X[n,2]*beta[2]+X[n,3]*beta[3],sig_mu);
    }else if (censor[n]==3){
      BP[n] ~ normal(mu2+intercept+X[n,1]*beta[1]+X[n,2]*beta[2]+X[n,3]*beta[3],sig_mu2);
    }
  }
}

Here is the output (posterior summary):

                  mean      se_mean         sd        2.5%       97.5%    n_eff      Rhat
intercept  127.0305705 0.0282909377 1.37467000 124.3051316 129.6931821 2361.034 0.9997761
beta[1]      0.4233285 0.0004757727 0.02391632   0.3766278   0.4703922 2526.909 1.0000864
beta[2]      4.5895118 0.0115347598 0.66522478   3.2544294   5.8699729 3325.981 0.9994506
beta[3]      2.4931633 0.0120725312 0.65245388   1.2161767   3.7494475 2920.808 1.0000846
sigma        9.2381318 0.0039795457 0.23835612   8.7711815   9.7197286 3587.448 0.9998949
mu         -11.6151320 0.0106144503 0.67463635 -12.9403125 -10.2880964 4039.657 0.9997122
mu2        -11.8625166 0.0106154207 0.70014741 -13.2387657 -10.5160568 4350.153 0.9999956
sig_mu       9.3278164 0.0041279792 0.24834598   8.8595974   9.8258083 3619.422 0.9995576
sig_mu2      9.6664530 0.0059720354 0.40824948   9.0047848  10.5756451 4673.116 1.0004869

I intentionally set \mu_2's upper bound to \mu to make sure that \mu_2<\mu. Yet according to Stan, \mu_2 is estimated very close to \mu, which is not what was set in the simulation. Similarly, the standard deviation estimates \sigma_\mu and \sigma_{\mu_2} are far from the true values; the truncated normal's sd seems never to enter. I have checked the trace plots for convergence and the ACF plots as well, and I did not find anything odd.

Q1: Since I am approximating the truncated normal by a normal, is this the cause of the issue?

Q2: How do I recover \sigma_\mu and \sigma_{\mu_2}?

There’s a notable difference between the simulation and the Stan model: in the simulation, individuals are treated only if their pre-treatment BP was high, but the model ignores pre-treatment BP and effectively assumes the treatment is assigned at random among all participants.
Even after treatment, high-BP individuals have higher-than-average BP, so the naive inference is that treatment increases BP. Of course, if (age, sex, gene) are good predictors, the model may be able to work out that the treated individuals would have had higher-than-average BP and use that information to infer that the treatment does decrease BP, but I think the simulated data set has too much noise for that.
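For example, with the simulated hyp data frame from the question and the true coefficients, a rough way to see the selection effect is to average the post-treatment residuals (relative to the true regression line) by censor group; this is just an illustrative sketch, not part of the model:

  # Post-treatment BP minus the *true* linear predictor, averaged by censor group.
  # Groups 2 and 3 were selected for having high pre-treatment BP, so these
  # averages mix the selection effect with the treatment effect and will not
  # come out at the true -10 and -20.
  lin_pred=123+0.5*hyp$age+5*hyp$sex+3*hyp$gene
  tapply(hyp$BP-lin_pred, hyp$censor, mean)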
I would at least add statements saying that the pre-treatment BP was above the threshold:

for(n in 1:N){
  if (censor[n]==1) {
    BP[n] ~ normal(intercept+X[n,1]*beta[1]+X[n,2]*beta[2]+X[n,3]*beta[3],sigma);
  }else if (censor[n]==2){
    target += normal_lccdf(130 | BP[n] - mu, sig_mu); // pre-treatment BP > 130
    BP[n] ~ normal(mu+intercept+X[n,1]*beta[1]+X[n,2]*beta[2]+X[n,3]*beta[3],sig_mu);
  }else if (censor[n]==3){
    target += normal_lccdf(160 | BP[n] - mu2, sig_mu2); // pre-treatment BP > 160
    BP[n] ~ normal(mu2+intercept+X[n,1]*beta[1]+X[n,2]*beta[2]+X[n,3]*beta[3],sig_mu2);
  }
}

I think the most correct thing to do is to model the pre-treatment BP as a latent parameter.

data{
  int<lower=1> N; 
  matrix[N,3] X;
  int censor[N];
  vector[N] BP;
  real<lower=0> hyp_diag;
}
transformed data {
  int N_preBP2 = 0;
  int N_preBP3 = 0;
  for (i in 1:N) {
    if (censor[i] == 2) {
        N_preBP2 += 1;
    } else if (censor[i] == 3) {
        N_preBP3 += 1;
    }
  }
}

parameters{
  real intercept;
  vector[3] beta;
  real<lower=0> sigma; 
  real<upper=0> mu;
  real<upper=mu> mu2;
  real<lower=sigma> sig_mu;
  real<lower=sigma> sig_mu2;
  vector<lower=130,upper=160>[N_preBP2] preBP2; // latent pre-treatment BP for censor==2
  vector<lower=160>[N_preBP3] preBP3;           // latent pre-treatment BP for censor==3
}

model{
  int n_preBP2 = 1; // running index into preBP2
  int n_preBP3 = 1; // running index into preBP3
  intercept ~ normal(0,100);
  sigma ~ student_t(3,0,2.5);
  sig_mu ~ student_t(3,0,2.5);
  sig_mu2 ~ student_t(3,0,2.5);
  mu ~ normal(0,10);
  mu2 ~ normal(0,10);
  beta ~ normal(0,10); // independent normal(0,10) priors, equivalent to the diagonal multi_normal
  for(n in 1:N){
    real z = intercept+X[n,1]*beta[1]+X[n,2]*beta[2]+X[n,3]*beta[3];
    if (censor[n]==1) {
      BP[n] ~ normal(z,sigma);
    } else if (censor[n]==2){
      preBP2[n_preBP2] ~ normal(z, sigma);
      BP[n] ~ normal(mu + preBP2[n_preBP2],sig_mu);
      n_preBP2 += 1;
    } else if (censor[n]==3){
      preBP3[n_preBP3] ~ normal(z, sigma);
      BP[n] ~ normal(mu2 + preBP3[n_preBP3],sig_mu2);
      n_preBP3 += 1;
    }
  }
}

I think the main issue with using normal_lccdf is that it treats observations a few \sigma away as having zero density. This happened to me frequently in simulations when I used an explicit truncated normal distribution. Did you encounter this issue in your simulations?

If normal_lccdf(x | y, z) underflows to zero, you may still be able to write normal_lcdf(-x | -y, z). This is theoretically identical (the normal distribution is symmetric), but I think the Stan math library takes extra care to implement the lcdf in a numerically stable fashion.
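A quick R check of the symmetry being relied on here (this only verifies the mathematical identity on the log scale, not Stan's internal implementation; the numbers are arbitrary):

  # log P(X > x) for X ~ N(m, s) equals log P(Y < -x) for Y ~ N(-m, s).
  x=130; m=115; s=3
  pnorm(x, m, s, lower.tail=FALSE, log.p=TRUE)  # analogue of normal_lccdf(x | m, s)
  pnorm(-x, -m, s, log.p=TRUE)                  # analogue of normal_lcdf(-x | -m, s)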

In any case, the last model in my post (with extra latent parameters) does not use explicitly truncated distributions.

Just out of curiosity, why does this code work? Is there any inherent reason to believe explicit latent modeling is better than not using a latent representation? I think the models are completely equivalent; in computation, however, there is a difference. When should I expect a latent representation to help computationally in general? Is there a guideline for that? Thanks.

I’m not so sure the models are equivalent. I think the model with cdfs may be missing some correlation between pre-treatment BP and post-treatment BP.
I think explicit modeling of the latents makes it easier to see whether the model accurately recapitulates the simulated data-generating process. I don't know if there are any guidelines for the computational issues, though.

The models are equivalent in the following sense: marginalizing out the latent variables, with the constraints dropped, recovers the non-latent model. If the constraints are applied, the models are not equivalent.
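To spell out the marginalization: dropping the constraint on preBP2 and integrating it out of Normal(preBP2 | \beta^Tx, \sigma)\times Normal(BP | \mu+preBP2, s) gives BP\sim Normal(\beta^Tx+\mu, \sqrt{\sigma^2+s^2}), where s is the treatment-effect sd in the latent model. So the marginalized latent model has the same form as the first model, with the first model's \sigma_\mu corresponding to \sqrt{\sigma^2+s^2} (and similarly for censor=3).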

I have checked the calculation with and without the extra constraints on preBP2 and preBP3, and there is almost no difference. Had I implemented the non-latent model by integrating out preBP2 and preBP3, would the results agree? It seems that the latent representation is a form of data augmentation to recover the pre-intervention BP and hence the treatment effect. Is that correct?