 Missing binomial data with more than 1 trial

So I am trying to figure out how to (properly) account for missing data when the outcome is binomial and there is more than one trial per observation (i.e., it is not a Bernoulli distribution).

For example, modeling the data below is trivially easy in Stan:

y <- rbinom(100, 10, 0.25)

This Stan model works perfectly (well, the priors could be better, but well enough).

data {
  int N;
  int y[N];
}

parameters {
  real<lower=0, upper=1> p; // constrain to match the uniform(0,1) prior
}

model {
  p ~ uniform(0, 1);
  y ~ binomial(10, p);
}
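(As a sanity check for a fit like this: with a uniform, i.e. Beta(1, 1), prior and a binomial likelihood, the posterior is available in closed form as Beta(1 + Σy, 1 + 10N − Σy). A quick sketch in Python, mirroring the rbinom call above:)

```python
import numpy as np

rng = np.random.default_rng(0)
K, N, p_true = 10, 100, 0.25

# simulate data, mirroring rbinom(100, 10, 0.25)
y = rng.binomial(K, p_true, size=N)

# uniform prior = Beta(1, 1); binomial likelihood => Beta posterior
alpha = 1 + y.sum()
beta = 1 + K * N - y.sum()
posterior_mean = alpha / (alpha + beta)

print(round(posterior_mean, 3))  # should land near the true p of 0.25
```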

If the data were normally distributed and I had missing observations I could model the missing data as a latent variable, say like this:

parameters {
  real mu;
  real<lower=0> sigma;  // the scale must be positive
  real y_mis[N_mis];    // missing observations treated as parameters
}
model {
  mu ~ normal(0, 1);
  sigma ~ normal(0, 1);
  y_mis ~ normal(mu, sigma); // (y_obs and N_mis would come from the data block, omitted here)
  y_obs ~ normal(mu, sigma);
}

If the data were Bernoulli I could use a process similar to the one described here, which looks something like this:

model {
  for (i in 1:N) {
    if (missing[i]) {
      // marginalize over the unobserved 0/1 outcome
      target += log_mix(inv_logit(p),
                        binomial_logit_lpmf(1 | 1, p),
                        binomial_logit_lpmf(0 | 1, p));
    } else {
      target += binomial_logit_lpmf(y[i] | 10, p);
    }
  }
}

But it doesn’t seem like that mixture model will work here since there is now a mismatch in the binomial distribution being used for the missing data (with 1 trial) and the observed data (with 10 trials).

I thought of using the binomial_rng function to simply sample missing values from the expected distribution and use those as imputed values, but that function is only available in the generated quantities and transformed data blocks, so I don't see how it could work here. (I suspect it was deliberately designed that way to prevent someone from doing exactly what I was thinking of doing, and I'm assuming there is a good reason not to do things that way.)

I am fairly new to all this so it is entirely likely that I’m missing something obvious, but any advice would be appreciated.

Update.

I now know that log_mix() can take a vector of mixing proportions and a vector of log densities. So for now I am doing something like this:

if (is_missing[i]) { // however you flag your missing values
  vector[K + 1] probs;
  vector[K + 1] likelihoods;
  for (j in 0:K) { // K is the number of trials in the binomial you're working with
    // probability of observing j successes given p; Stan vectors are 1-indexed, hence j + 1
    probs[j + 1] = choose(K, j) * p^j * (1 - p)^(K - j);
    // binomial_lpmf keeps p on the probability scale, matching probs above
    likelihoods[j + 1] = binomial_lpmf(j | K, p);
  }
  target += log_mix(probs, likelihoods);
} else {
  target += binomial_lpmf(y[i] | K, p);
}

This seems to work, although that doesn’t necessarily mean I’m doing this right. If I’m doing something wrong here I would love to know.
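For what it's worth, here is a small numerical sketch (Python, assuming p is on the probability scale so the pmf in probs and the lpmf agree) of what that log_mix term evaluates to, next to the full marginal over the missing outcome, which sums to one by definition:

```python
import numpy as np
from math import comb, log

def binom_pmf(j, K, p):
    """Probability of j successes in K trials with success probability p."""
    return comb(K, j) * p**j * (1 - p) ** (K - j)

K, p = 10, 0.25
pmf = np.array([binom_pmf(j, K, p) for j in range(K + 1)])

# Stan's log_mix(probs, likelihoods) = log(sum_j probs[j] * exp(likelihoods[j]));
# with probs[j] = pmf[j] and likelihoods[j] = log pmf[j], that is log(sum pmf^2)
log_mix_term = log(np.sum(pmf * pmf))

# marginalizing a fully missing outcome instead sums the pmf itself,
# which is 1 for any p, so the contribution is log(1) = 0
marginal_term = log(np.sum(pmf))

print(log_mix_term, marginal_term)
```

Note that the true marginal is zero for every value of p, while the log_mix expression varies with p, so the two are not interchangeable.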


In almost all cases, missing values in the response cannot provide any information about your model coefficients and can thus simply be ignored; see e.g. MICE missing values on the response. This is different from missing data in predictors, where it may make sense to model them explicitly. So unless you need to use the binomially distributed data as predictors, ignoring them is probably the best you can do.

Best of luck with your model!
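A quick numerical illustration of that point (Python sketch; the observed counts and the number of missing responses are made up for illustration): each fully missing response multiplies the likelihood by Σ_j Binomial(j | K, p) = 1 regardless of p, so including the marginalized terms or dropping the missing rows gives the same unnormalized posterior over p:

```python
import numpy as np
from math import comb

def binom_pmf(j, K, p):
    return comb(K, j) * p**j * (1 - p) ** (K - j)

K = 10
y_obs = np.array([2, 3, 1, 4])   # hypothetical observed counts
n_missing = 5                    # hypothetical fully-missing responses

p_grid = np.linspace(0.01, 0.99, 99)

def log_lik(p, include_missing):
    ll = sum(np.log(binom_pmf(j, K, p)) for j in y_obs)
    if include_missing:
        # each missing response is marginalized over all K + 1 outcomes
        ll += n_missing * np.log(sum(binom_pmf(j, K, p) for j in range(K + 1)))
    return ll

with_missing = np.array([log_lik(p, True) for p in p_grid])
without = np.array([log_lik(p, False) for p in p_grid])

# the marginalized terms are log(1) = 0, so the two curves coincide
print(np.max(np.abs(with_missing - without)))
```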
