Missing binomial data with more than 1 trial

pgok · November 19, 2021, 12:13am

So I am trying to figure out how to (properly) account for missing data when the outcome is binomial and there are more than one trial for the binomial distribution (i.e., it is not a bernouli distribution).

For example, modeling the data below is trivially easy in stan

y=rbinom(100,10,0.25)

This stan model works perfectly (well, the priors could be better, but well enough).

data{
  int N;
  int y[N];
}

parameters{
  real p;
}

model{
  p~uniform(0,1);
  y~binomial(10,p);
}

If the data were normally distributed and I had missing observations I could model the missing data as a latent variable, say like this:

parameters{
    real mu;
    real sigma;
    real y_mis[N_mis];
}
model{
    mu~normal(0,1);
    sigma~normal(0,1);
    y_mis~normal(mu,sigma);
    y_obs~normal(mu,sigma);
}

If the data were bernoulli I could use a process similar to the one described here. Which looks something like this:

model{
   for(i in 1:N){
        if(missing){
            target+=log_mix(inv_logit(p), binomial_logit_lpmf(0|1,p),binomial_logit_lpmf(1|1,p));
        }else{
            target+=binomial_logit_lpmf(y[i]|10,p);
    }
}

But it doesn’t seem like that mixture model will work here since there is now a mismatch in the binomial distribution being used for the missing data (with 1 trial) and the observed data (with 10 trials).

I thought of using the binomial_rng function to simply sample missing data from the expected distribution and then using that as imputed values for the missing data, but since that function only works in the generated quantities or the transformed data blocks I don’t know that it would work here (and I suspect it was deliberately designed that way to prevent someone from doing exactly what I was thinking of doing. I am assuming there is a good reason to not do things that way).

I am fairly new to all this so it is entirely likely that I’m missing something obvious, but any advice would be appreciated.

Update.

I now know that log_mix() can take a vector of probabilities and likelihoods. So for now I am doing something like this:

//somewhere in your model block
if(y[i] is missing){//however you sort out missing variables
    vector[K+1] probs;
    vector[K+1] likelihoods;
    for(j in 0:K){//where K is the number of trials in the binomial distribution you're working with
        probs[j] = choose(K,j)*(p^j)*(1-p)^(K-j); //just the probability of observing j successes given probability p
        likelihoods[j]=binomial_logit_lpmf(j|K,p); 
    }
    target+=log_mix(probs, likelihoods);
}else{
    target+=binomial_logit_lpmf(y[i]|K,p);
}

This seems to work, although that doesn’t necessarily mean I’m doing this right. If I’m doing something wrong here I would love to know.

martinmodrak · November 29, 2021, 8:06pm

In almost all cases, missing values in the response cannot provide any information about your model coefficients and can thus be simply ignored, see e.g. MICE missing values on the response This is different from missing data in predictors where it may make sense to model those explicitly. So unless you need to use the binomially distributed data as predictors, ignoring them is probably best you can do.

Best of luck with your model!

Topic		Replies	Views
Binomial model with number of trials as a variable General	5	1202	April 6, 2022
Accounting for unobserved successes in binary repreated trial model Modeling ecology	3	797	December 21, 2017
Using log_mix when imputing missing observations of a binary predictor variable Modeling specification	2	423	February 14, 2020
Treating a missing count data as a parameter and putting truncated binomial prior for it Modeling specification	7	402	October 26, 2023
Modeling with bernoulli or binomial distribution General brms	3	860	September 16, 2022

Missing binomial data with more than 1 trial

Related topics