Assigning bernoulli prior to missing entries in covariate

I am fitting a logistic model in STAN (rstan library). My response variable does not have any missing values, however one of my covariates “HB” is binary and has missing entries.

Thus the goal is to impute at each iteration the missing entries in the binary vector, using a Bernoulli prior (with parameter, say, 0.5).

However, I am running into issues:

  • The missing data needs to be declared either as real or vector in the parameters or transformed parameters block;
  • Realizations from a Bernoulli distribution in the model block need to be integers;
  • As far as I am aware there is no function in STAN to convert a real or vector to integer.

I used the guidelines provided in section 3.3 of the STAN user guide. With the model below, the parser gives me an error at the bernoulli assignment line (penultimate line in the model block), saying it needs integers. Note: I also tried defining HB_miss as a real in the parameters block and getting the same error.

m2 <- '
data {                          
int<lower=0> N;                // total number of observations
int<lower=0,upper=1> y[N];     // setting the dependent variable y as binary
vector[N] X;                   // independent variable 1

int<lower=0> N_obs; 
int<lower=0> N_miss; 
int<lower=1, upper=N> ii_obs[N_obs]; 
int<lower=1, upper=N> ii_miss[N_miss]; 

vector[N_obs] HB_obs;         // independent variable 2 (observed) 

}
parameters {
real b_0;                      // intercept
real b_X;                      // beta 1,2, ...
real b_HB;
vector[N_miss] HB_miss;
}
transformed parameters {
vector[N] HB;
HB[ii_obs] = HB_obs;
HB[ii_miss] = HB_miss;
}
model {
b_0 ~ normal(0,100);           
b_X ~ normal(0,100);           
b_HB ~ normal(0,100); 
HB_miss ~ bernoulli(0.5); // This is where the parser gives me an error
y ~ bernoulli_logit(b_0 + b_X * X + b_HB * HB); // model
}

Any ideas how I can assign a bernoulli prior to HB_miss effectively in STAN?

You basically can’t do this. Stan does not work with discrete parameters because it cannot calculate the gradients. You probably can solve this by using a mixture model for the missing observations with two components and a mixture probability .5.

Component 1 for HB_miss = 0

bernoulli_logit(b_0 + b_X * X)

Component 2 for HB_miss = 1

bernoulli_logit(b_0 + b_X * X + b_HB)

Chapter 5 and 7 of the Stan User Guide might give more explanation.

1 Like

Got it!
Indeed I found out I can use log_mix for the mixture of the 2 bernoulli distributions you suggest above. Also need to marginalize over the parameters using bernoulli_logit_lpmf:

for (i in 1:N) {
if (HB[i] == -1) {
target += log_mix(lambda, bernoulli_logit_lpmf(y[i]| b_0 + b_X * X[i] + b_HB), bernoulli_logit_lpmf(y[i]| b_0 + b_X * X[i]));
} else {
HB[i] ~ bernoulli(lambda);
y[i] ~ bernoulli_logit(b_0 + b_X * X[i] + b_HB * HB[i]);
}

1 Like