Modelling two distributions for two groups with missing group membership

Hi,

I encounter a problem where I want to fit a distribution for each of two groups with missing group membership. Suppose there are 2 groups, group 0 and group 1, and a sample size of N. Some observations have missing group membership, that is, we don’t know if they come from group 0 or group 1. I impute the missing group membership from the observed group membership, and try to fit a distribution for each group. The issue is that the model can be fitted but the max tree depth is reached. So the estimates are not reliable. I try increasing the tree depth, but this doesn’t help, and I find some similar situations also have this max tree depth reached issue.

The stan and R code is as below.

### load some packages
library(tidyverse)
library(rstan)

### generate data
set.seed(1)
# N is sample size
N <- 100
# z = 0 means group 0 and z = 1 means group 1
z <- rbernoulli(N, 0.8) %>% as.numeric()
# x is 1 in group 1 and 0 in group 0
x = ifelse(z == 1, 1, 0) 
# y = 2 + x + N(0,1) in group 1, y = 1 + N(0, 1) in group 0
y = ifelse(z == 1, 2 + x + rnorm(N), 1 + rnorm(N)) # y 
# 10% of z is set to be missing
z[sample(N, 0.1*N, replace = FALSE)] <- NA

### stan model code
# parameters are real values, so I use a truncated normal on [0, 1] with logit mean to model the group membership, instead of using bernoulli distribution. my idea is that the missing group membership is group 1 if the sampled value from this distribution is >= 0.5, otherwise is group 0.

stancode <- "
data {
int N;
int Nobs;
int Nmis;
vector[N] y;
vector[N] x;
int id0[Nobs];
int id1[Nmis];
vector[Nobs] zobs;
}

parameters {
vector[2] a;
vector[2] b;
real<lower=0> sigma0;
real<lower=0> sigma1;
vector<lower=0, upper=1>[Nmis] zmis;
real muz;
real<lower=0> sigmaz;
}

transformed parameters {
vector[N] ztrans;

{int m = 1;
for(n in id0) {
ztrans[n] = zobs[m];
m += 1;
}
}

{int m = 1;
for(n in id1) {
ztrans[n] = zmis[m];
m += 1;
}
}

}

model {
a ~ normal(0, 1);
b ~ normal(0, 1);
sigma0 ~ normal(1, 1) T[0, ];
sigma1 ~ normal(1, 1) T[0, ];

for(n in 1:Nmis) {
zmis[n] ~ normal(0.5, 0.5) T[0, 1];
}

muz ~ normal(0.5, 0.5);
sigmaz ~ normal(0, 0.5) T[0, ];
for(n in 1:N) {
ztrans[n] ~ normal(inv_logit(muz), sigmaz) T[0, 1];
}


for(n in 1:N) {
if(ztrans[n] < 0.5) {
y[n] ~ normal(a[1] + a[2]*x[n], sigma0);
} else {
y[n] ~ normal(b[1] + b[2]*x[n], sigma1);
}
}

}
"

### fit a stan model
stanfit <- stan(model_code = stancode, data = list(N = N, Nobs = sum(!is.na(z)), Nmis = sum(is.na(z)), id0 = which(!is.na(z)), id1 = which(is.na(z)),  y = y, x = x, zobs = na.omit(z)), seed = 1, chain_id = 1, chains = 1, control = list(max_treedepth = 15))

### check some results
# mean and sd estimates look good
summary(stanfit)$summary
# shinystan shows max tree depth reached
shinystan::launch_shinystan(stanfit)

I am thinking if the problem comes from a result that the missing group membership can be imputed as group 0 or group 1 for the same person during sampling, that is, for the same person under 2000 iterations, some iterations have an imputed value of group 0 and other iterations have an imputed value of group 1. I am wondering if the changing group membership over iterations can affect sampling in stan.

# extract the samples for the missing group membership
samples <- rstan::extract(stanfit)
zmisfit <- samples$ztrans[, is.na(z)]
# convert to group membership, 0 or 1
zmisfit <- ifelse(zmisfit >= 0.5, 1, 0)
# calculate the mean of group membership for each person
colMeans(zmisfit) %>% range()
# 0.239 0.958
# each person can be in group 0 or 1 over different iterations

Can anyone give me some idea?

As you mention in your post, Stan doesn’t support discrete unknown parameters as they are incompatible with Hamiltonian Monte Carlo sampling. Unfortunately, declaring continuous parameters and then conditioning the likelihood on an if statement is not a useful solution to this problem; it induces a likelihood that is discontinuous in the parameter values.

The solution is to marginalize over the unknown discrete states, yielding what is called a mixture model for the observations with missing membership. Some useful resources about marginalization and mixture modeling can be found here 7 Latent Discrete Parameters | Stan User’s Guide, here 5 Finite Mixtures | Stan User’s Guide, and here Mixture models in Stan: you can use log_mix() | Statistical Modeling, Causal Inference, and Social Science.