Missing count data, use all possible combinations with fractional likelihood weights

Hello, I have a nasty analysis where the data is full of holes (it was not collected by me). I tried to fill in the holes algorithmically (a huge effort), but in some situations that's still not possible:

The above is an example of COVID outbreak data in long-term care facilities, and my goal is to evaluate vaccine effectiveness. In the example, I have some missing values that I cannot impute deterministically: there are 11 asymptomatic cases whose vaccination status is unknown. Since I have the denominators, I can say that those 11 are either partially or fully vaccinated, and that the partially vaccinated ones number between 0 and 2 (and consequently the fully vaccinated ones between 9 and 11).

I was wondering if it's legitimate to create 3 copies of the data, one for each possible imputation, all weighted 1/3 at the likelihood level (the `| weights()` term in brms), with a random intercept at the outbreak level. The same would be done for every outbreak with missing data in the dataset. Would that make sense?
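Concretely, the expansion I have in mind looks something like this (a Python sketch with hypothetical column names; the `weight` column is what would go into brms's `weights()` term):

```python
# Hypothetical sketch of the proposed expansion for one outbreak: the 11
# unknown-status cases split as partially = 0..2, fully = 11 - partially.
# Column names (outbreak, partial, full, weight) are illustrative only.

def expand_missing(outbreak_id, unknown_total, max_partial):
    """Create one weighted copy of the row per admissible imputation."""
    copies = []
    n_copies = max_partial + 1          # here: partial in {0, 1, 2} -> 3 copies
    for partial in range(max_partial + 1):
        copies.append({
            "outbreak": outbreak_id,
            "partial": partial,
            "full": unknown_total - partial,
            "weight": 1.0 / n_copies,   # passed to brms via y | weights(weight)
        })
    return copies

rows = expand_missing("A", unknown_total=11, max_partial=2)
for r in rows:
    print(r)
```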

I know that the correct approach would be to repeat the analysis many times with random imputations and average the results, but the number of possible imputation combinations is practically infinite.

Another possible approach would be to use non-uniform weights, chosen to be coherent with the rest of the data (more complicated, and it adds degrees of freedom in deciding how to do it).

What if you modeled your missing data with suitable distributions?


For a number of reasons:

  • Here we are speaking of discrete count imputation, which I believe Stan cannot model directly (it has no discrete parameters). Also, the counts are low, so I wasn't sure how much bias a continuous approximation would entail.
  • Is the approach you suggested computationally as heavy as multiple imputation?
  • The imputed values need to satisfy a number of hard constraints on their values and sums, which vary by outbreak.
  • But here's the most important one: I'm definitely not versed in Stan and I'm mostly a humble brms user, so I wouldn't know how to manage such a complex data structure.
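To illustrate the constraint point: within each outbreak the admissible imputations can be enumerated exhaustively. A Python sketch (the bounds shown are those of the example outbreak; real outbreaks would each have their own):

```python
from itertools import product

# Enumerate integer imputations satisfying hard per-category (lo, hi) bounds
# and a fixed total, as in the example outbreak: 11 unknown cases split into
# partially vaccinated (0..2) and fully vaccinated (the remainder).

def admissible_imputations(total, bounds):
    """All integer tuples within per-category (lo, hi) bounds summing to total."""
    ranges = [range(lo, hi + 1) for lo, hi in bounds]
    return [combo for combo in product(*ranges) if sum(combo) == total]

# Two categories: partial in 0..2, full in 0..11, must sum to 11.
combos = admissible_imputations(11, [(0, 2), (0, 11)])
print(combos)   # [(0, 11), (1, 10), (2, 9)]
```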

But maybe what I want to do is totally feasible in Stan and some tutorial could help me!

Anyway, aside from the specific problem, I'm also interested in whether the technique would make theoretical sense in general.
To summarise, the idea is to replicate observations with missing values, substituting all possible values (or a sample of them), then downweight their likelihood proportionally (1 over the number of replicates), plus adding a random effect to account for the repeated measurements.
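As a toy numeric check of what those 1/n weights do (made-up Poisson likelihood and numbers, purely illustrative): weighting each replicate's log-likelihood by 1/n averages the log-likelihoods, whereas exact marginalisation over the unknown value averages the likelihoods themselves, so the two quantities differ in general:

```python
import math

# Toy check with a made-up Poisson likelihood. brms's weights() multiplies
# each observation's log-likelihood, so three 1/3-weighted replicates
# contribute the *average of the log-likelihoods*, while marginalising the
# unknown count would contribute the *log of the average likelihood*.

def pois_logpmf(k, lam):
    return k * math.log(lam) - lam - math.lgamma(k + 1)

lam = 5.0                            # illustrative rate parameter
replicates = [9, 10, 11]             # the possible imputed counts
w = 1.0 / len(replicates)

# Weighted pseudo-log-likelihood: (1/n) * sum of log-likelihoods
weighted_ll = sum(w * pois_logpmf(k, lam) for k in replicates)

# Exact marginal log-likelihood under uniform imputation probabilities:
# log( (1/n) * sum of likelihoods )
marginal_ll = math.log(sum(w * math.exp(pois_logpmf(k, lam)) for k in replicates))

print(weighted_ll, marginal_ll)      # the two quantities differ in general
```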