Hi,
I encounter a problem where I want to fit a distribution for each of two groups with missing group membership. Suppose there are 2 groups, group 0 and group 1, and a sample size of N. Some observations have missing group membership, that is, we don’t know if they come from group 0 or group 1. I impute the missing group membership from the observed group membership, and try to fit a distribution for each group. The issue is that the model can be fitted but the max tree depth is reached. So the estimates are not reliable. I try increasing the tree depth, but this doesn’t help, and I find some similar situations also have this max tree depth reached issue.
The stan and R code is as below.
### load some packages
library(tidyverse)
library(rstan)
### generate data
set.seed(1)
# N is sample size
N <- 100
# z = 0 means group 0 and z = 1 means group 1
z <- rbernoulli(N, 0.8) %>% as.numeric()
# x is 1 in group 1 and 0 in group 0
x = ifelse(z == 1, 1, 0)
# y = 2 + x + N(0,1) in group 1, y = 1 + N(0, 1) in group 0
y = ifelse(z == 1, 2 + x + rnorm(N), 1 + rnorm(N)) # y
# 10% of z is set to be missing
z[sample(N, 0.1*N, replace = FALSE)] <- NA
### stan model code
# parameters are real values, so I use a truncated normal on [0, 1] with logit mean to model the group membership, instead of using bernoulli distribution. my idea is that the missing group membership is group 1 if the sampled value from this distribution is >= 0.5, otherwise is group 0.
stancode <- "
data {
int N;
int Nobs;
int Nmis;
vector[N] y;
vector[N] x;
int id0[Nobs];
int id1[Nmis];
vector[Nobs] zobs;
}
parameters {
vector[2] a;
vector[2] b;
real<lower=0> sigma0;
real<lower=0> sigma1;
vector<lower=0, upper=1>[Nmis] zmis;
real muz;
real<lower=0> sigmaz;
}
transformed parameters {
vector[N] ztrans;
{int m = 1;
for(n in id0) {
ztrans[n] = zobs[m];
m += 1;
}
}
{int m = 1;
for(n in id1) {
ztrans[n] = zmis[m];
m += 1;
}
}
}
model {
a ~ normal(0, 1);
b ~ normal(0, 1);
sigma0 ~ normal(1, 1) T[0, ];
sigma1 ~ normal(1, 1) T[0, ];
for(n in 1:Nmis) {
zmis[n] ~ normal(0.5, 0.5) T[0, 1];
}
muz ~ normal(0.5, 0.5);
sigmaz ~ normal(0, 0.5) T[0, ];
for(n in 1:N) {
ztrans[n] ~ normal(inv_logit(muz), sigmaz) T[0, 1];
}
for(n in 1:N) {
if(ztrans[n] < 0.5) {
y[n] ~ normal(a[1] + a[2]*x[n], sigma0);
} else {
y[n] ~ normal(b[1] + b[2]*x[n], sigma1);
}
}
}
"
### fit a stan model
stanfit <- stan(model_code = stancode, data = list(N = N, Nobs = sum(!is.na(z)), Nmis = sum(is.na(z)), id0 = which(!is.na(z)), id1 = which(is.na(z)), y = y, x = x, zobs = na.omit(z)), seed = 1, chain_id = 1, chains = 1, control = list(max_treedepth = 15))
### check some results
# mean and sd estimates look good
summary(stanfit)$summary
# shinystan shows max tree depth reached
shinystan::launch_shinystan(stanfit)
I am thinking if the problem comes from a result that the missing group membership can be imputed as group 0 or group 1 for the same person during sampling, that is, for the same person under 2000 iterations, some iterations have an imputed value of group 0 and other iterations have an imputed value of group 1. I am wondering if the changing group membership over iterations can affect sampling in stan.
# extract the samples for the missing group membership
samples <- rstan::extract(stanfit)
zmisfit <- samples$ztrans[, is.na(z)]
# convert to group membership, 0 or 1
zmisfit <- ifelse(zmisfit >= 0.5, 1, 0)
# calculate the mean of group membership for each person
colMeans(zmisfit) %>% range()
# 0.239 0.958
# each person can be in group 0 or 1 over different iterations
Can anyone give me some idea?