Any big ideas on imputing count/integer variables?

Hi all -

I wanted to check and see if there were any really great ideas for imputing count variables. This is for a project I’m doing where imputing counts is just gonna have to happen, and N is quite large.

Here’s what I’m considering doing:

  1. Define a set of ordered cut points.
  2. Map each cutpoint onto a count value. So cutpoint one is half the sample average, cutpoint two is the sample average itself (i.e. the natural stand-in for missing data), and cutpoint three is double the sample average.
  3. Instead of trying to marginalize over all possible N, marginalize over these cutpoints to obtain an approximate uncertainty distribution as to whether the missing value is at, below, or above the sample average.
  4. Obviously, more cutpoints would make the uncertainty PMF more accurate.
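For concreteness, here's a minimal Python sketch of the cutpoint idea (the 0.5x/1x/2x multipliers and the equal prior weights are illustrative placeholders, not anything from a fitted model):

```python
def cutpoint_distribution(observed, multipliers=(0.5, 1.0, 2.0)):
    """Map a missing count to a small set of candidate values (cutpoints)
    anchored at fractions/multiples of the sample average, with a crude
    discrete uncertainty distribution over them.

    The multipliers and the equal-weight prior are illustrative
    assumptions, not output of a fitted model."""
    mean = sum(observed) / len(observed)
    # Candidate imputed values: round each cutpoint to a valid count.
    candidates = [round(mean * m) for m in multipliers]
    # Equal prior weight on each cutpoint; a real analysis would weight
    # these by how well each candidate explains the rest of the data.
    weights = [1.0 / len(candidates)] * len(candidates)
    return list(zip(candidates, weights))

pmf = cutpoint_distribution([2, 4, 6, 8])  # sample average = 5
```

In a real analysis the weights would come from the model, e.g. the relative likelihood of each candidate value given the observed data.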

What distribution fits the data, aside from the missing values? I.e., are you modelling it as Poisson, negative binomial, etc.?


There exists a continuous version of the negative binomial distribution (and a paper with that title)

In other words, it is a continuous distribution, but if you happen to evaluate it at an integer, you get a negative binomial. So my (unimplemented) big idea is to use it to impute continuous values, round them to the nearest integer, and show that this is basically the same thing as imputing directly from a negative binomial, but workable in Stan.
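To illustrate the "continuous at integers" property, here's a sketch of the negative binomial pmf extended to non-integer arguments via gamma functions (note: the paper's actual continuous negative binomial is a proper density, which this extension by itself is not):

```python
import math

def nb_pmf_continuous(x, r, p):
    """Continuous function of x that coincides with the negative binomial
    pmf at non-negative integers: the factorial x! is replaced by
    Gamma(x + 1), so the function is defined for any real x >= 0.
    Parameterization: r failures-to-stop, success probability p,
    pmf(k) = Gamma(k + r) / (Gamma(r) * k!) * p**r * (1 - p)**k."""
    return math.exp(
        math.lgamma(x + r) - math.lgamma(r) - math.lgamma(x + 1)
        + r * math.log(p) + x * math.log(1 - p)
    )
```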


Hmmm. There’s also the exponential <-> geometric relationship, I wonder if it would be possible to try something similar as well.
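For what it's worth, that relationship is exact in one direction: the floor of an exponential random variable is geometric on {0, 1, 2, ...}. A quick Monte Carlo check in Python (the rate is chosen arbitrarily):

```python
import math
import random

# If X ~ Exponential(rate), then floor(X) ~ Geometric on {0, 1, 2, ...}
# with P(floor(X) = k) = (1 - q) * q**k, where q = exp(-rate).
random.seed(1)
rate = 0.7
q = math.exp(-rate)
draws = [math.floor(random.expovariate(rate)) for _ in range(200_000)]
empirical_p0 = draws.count(0) / len(draws)
theoretical_p0 = 1 - q
```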

I’m using the beta-binomial distribution.

Firstly, if the counts are reasonably large, then a (variance-stabilizing) square-root transformation may be an option. Then, everything that works for normal data might just work (and you can decide to round after imputation or something like that).
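A toy Python version of the square-root-transform idea (a real imputation model would condition on covariates; here the missing values are simply drawn from a normal fitted on the root scale, then squared and rounded back):

```python
import math
import random

def impute_sqrt_normal(counts, n_missing, rng):
    """Variance-stabilizing sketch: move to the square-root scale, where
    counts are roughly normal, draw imputations from a normal fitted
    there, then square and round back to the count scale. Toy version:
    no covariates, no uncertainty in the fitted mean/sd."""
    roots = [math.sqrt(c) for c in counts]
    mu = sum(roots) / len(roots)
    sd = math.sqrt(sum((r - mu) ** 2 for r in roots) / (len(roots) - 1))
    imputed = []
    for _ in range(n_missing):
        z = rng.gauss(mu, sd)
        imputed.append(max(0, round(z * z)))  # back-transform, then round
    return imputed

rng = random.Random(0)
vals = impute_sqrt_normal([9, 16, 25, 36, 4, 49], 5, rng)
```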

Secondly, there are some reasonably good implementations of latent normal models (e.g. the Amelia R package), but for some reason count data is not that commonly covered (e.g. Amelia does not cover it, but covers ordinal, categorical, etc. really nicely). I’m not really sure whether there’s any real reason why Poisson with a latent normal random effect is particularly hard to deal with; I would have thought not.

Thirdly, a random effects Poisson model might be a pretty decent choice. Depending on what you mean by large, rstanarm can fit such a model pretty fast (as these things go with MCMC samplers, anyway). The negative binomial model (or rather Poisson with a Gamma random effect) is usually computationally a bit more tricky, but that has been worked on a good bit, e.g. Keene, O.N., Roger, J.H., Hartley, B.F. and Kenward, M.G., 2014. Missing data sensitivity analysis for recurrent event data using controlled imputation. Pharmaceutical Statistics, 13(4), pp. 258-264, or Roger, J.H., Bratton, D.J., Mayer, B., Abellan, J.J. and Keene, O.N., 2019. Treatment policy estimands for recurrent event data using data collected after cessation of randomised treatment. Pharmaceutical Statistics, 18(1), pp. 85-95; I’m also involved in another planned one on this topic.
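The Poisson-with-Gamma-random-effect formulation can be sketched in a few lines of Python (parameter values are arbitrary; marginally, the draws are negative binomial with mean r(1-p)/p):

```python
import math
import random

def poisson(lam, rng):
    """Knuth's Poisson sampler (fine for the modest rates used here)."""
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= L:
            return k
        k += 1

def gamma_poisson_draw(r, p, rng):
    """Poisson count with a Gamma 'random effect' on the rate;
    marginally this is negative binomial with mean r * (1 - p) / p."""
    lam = rng.gammavariate(r, (1 - p) / p)  # shape r, scale (1-p)/p
    return poisson(lam, rng)

rng = random.Random(42)
draws = [gamma_poisson_draw(3.0, 0.5, rng) for _ in range(50_000)]
mean = sum(draws) / len(draws)  # should be near 3 * 0.5 / 0.5 = 3
```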

I’m a bit confused why a beta-binomial would be appropriate. That’s more for a binary outcome when there’s a beta-distributed random effect on the probability scale across units, is it not?

@Bjoern I’m probably just naive but if his model is beta-binomial then presumably the imputation wants to proceed from the beta-binomial distribution rather than any other? I.e. to use a normal distribution for imputation he’d need to model the problem differently, right?

@saudiwin, are your observed and missing data distributed in the same way? If so, is it an option for you to just model your observed data as planned and impute the missing values by predicting them in the generated quantities block using the posterior predictive distribution?
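For illustration outside Stan, here's a toy Python version of posterior predictive imputation (the posterior draws of the rate are made up; in practice you'd extract them from the fitted model, or equivalently call poisson_rng in the generated quantities block):

```python
import math
import random

def posterior_predictive_impute(lambda_draws, n_missing, rng):
    """Impute each missing count by picking a posterior draw of the
    Poisson rate and sampling a new count from it -- the same thing a
    poisson_rng call in generated quantities would do, one draw per
    posterior iteration. Toy standalone version."""
    def poisson(lam):
        L, k, p = math.exp(-lam), 0, 1.0
        while True:
            p *= rng.random()
            if p <= L:
                return k
            k += 1
    return [poisson(rng.choice(lambda_draws)) for _ in range(n_missing)]

rng = random.Random(7)
# Pretend posterior for the rate, concentrated around 4 (made up).
posterior_lambda = [3.8, 4.0, 4.1, 4.2, 3.9]
imputations = posterior_predictive_impute(posterior_lambda, 1000, rng)
```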

@emiruz, if these are indeed binomial outcomes (rather than count data), then a binomial imputation model does indeed make more sense. One might for example use a logistic regression model (modeling y successes out of n trials, which would be the setting in which beta-binomial would make sense) with a normally distributed random effect on the logit scale*.

* Reason for this suggestion: whether you use a beta-distributed random effect on the probability scale or a normally distributed random effect on the logit-probability scale is often pretty irrelevant (i.e. it gives more or less the same results; the same argument applies to Poisson with a gamma random effect on the rate versus a normal random effect on the log-rate). Unless there are very strong reasons to choose one or the other (e.g. you pre-specified what you’d do, or you were asked/promised to do one thing by a regulatory authority or the like), I’d normally go with computational convenience. Computational convenience with Gibbs sampling/hand-calculation may often point towards conjugate solutions, while good sampling properties with Stan are, in my experience, often easier to achieve with normal random effects on the log-/logit-scales (take that with a pinch of salt; this is experience with a few dozen examples in one domain, and it may well not be a general rule). Plus, you’d be able to just use rstanarm to do this for you.
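A generative sketch of the suggested binomial model with a normally distributed random effect on the logit scale (the intercept and random-effect SD are made-up illustration values):

```python
import math
import random

def binomial_logit_normal(n_trials, beta0, sd_re, rng):
    """One cluster's number of successes under the suggested model:
    a normally distributed random effect on the logit scale, then a
    binomial draw. beta0 and sd_re are made-up illustration values."""
    eta = beta0 + rng.gauss(0.0, sd_re)   # logit-scale linear predictor
    p = 1.0 / (1.0 + math.exp(-eta))      # inverse-logit
    return sum(rng.random() < p for _ in range(n_trials))

rng = random.Random(3)
ys = [binomial_logit_normal(20, 0.0, 1.0, rng) for _ in range(20_000)]
mean_y = sum(ys) / len(ys)  # around 10 by symmetry, since beta0 = 0
```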

I guess another point is that with an explicit random effect (instead of integrating it out, as one does with the beta-binomial or negative binomial), it becomes easier to do certain popular missing-not-at-random analyses. In clinical trials we often do jump-to-reference, where you keep the subject-specific random effect but switch the covariates to the control (“reference”) group, which is very easy when you have an explicit random effect. It makes much less of a difference when you only want to impute under missing at random.

That’s another good topic brought up by @emiruz, i.e. what is reasonable to assume for the missing data. Let’s give an example: say you are interested in the number of Starbucks cafes a person visits per day, and all missing data are missing due to people going to prison and being unable to fill in your online form. The daily number of Starbucks cafes visited probably does not follow the same distribution as for people that are not in prison (which is what you assume when imputing under MAR - assuming MAR makes more sense if you wanted to answer the question of how many Starbucks cafes they would have visited had they not gone to prison). If you want the number they really visited, then you can either make an assumption (e.g. it is 0 if you are in prison - probably a pretty plausible assumption in this example), or you could try to get data for some of these people and impute based on that, or various other ideas that might be reasonable.
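A tiny sketch of the jump-to-reference mechanics for a log-rate model (all coefficient names and values here are invented for illustration):

```python
import math

def jump_to_reference_rate(beta0, beta_trt, b_i, on_treatment):
    """Jump-to-reference sketch: keep the subject-specific random effect
    b_i, but switch the treatment covariate to the control ('reference')
    value when imputing post-dropout data. Coefficient names/values are
    made up for illustration."""
    x_trt = 1.0 if on_treatment else 0.0
    log_rate_mar = beta0 + b_i + beta_trt * x_trt  # MAR: own covariates
    log_rate_j2r = beta0 + b_i + beta_trt * 0.0    # J2R: reference arm
    return math.exp(log_rate_mar), math.exp(log_rate_j2r)

mar_rate, j2r_rate = jump_to_reference_rate(
    beta0=1.0, beta_trt=-0.5, b_i=0.3, on_treatment=True
)
```

With a rate-reducing treatment effect, the jump-to-reference rate is higher than the MAR rate, since the imputed period pretends the subject was on control.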


The BDA3 method, in the event that missing and observed are distributed the same, is in effect to treat the missing values as parameters and fit them; this can’t be done for discrete variables in Stan but can be in pymc3/jags/nimble/Anglican/Figaro/etc. That way, the missing data presumably contributes to the uncertainty. Just using the posterior predictive ultimately would not contribute to the variance, but at least the data is imputed from the right (assumed) distribution.

To me it seems that in this event, Stan offers no ideal solutions, because parameters can’t be discrete and every solution will have to come either from the posterior predictive or from post-processing a continuous solution. Am I wrong?

The trick is usually to have a hidden/latent state variable/random effect, from which the observed discrete variables are realizations. E.g. if you observed a yes/no and a count variable for every subject, you might assume that there’s a bivariate normal random effect with correlated components across subjects (the first component is a random effect on the logit-odds of a yes, the second is a random effect on the log-rate for the count variable). If one of the two is missing, then under missing at random you do not even need to do anything further than fit this model with no contribution to the likelihood from whichever of the two is missing. If you want some hand-crafted imputation under e.g. MNAR, then you may have to write your own imputation after fitting the model.
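A generative Python sketch of that latent structure (all parameter values are illustrative): correlated bivariate-normal random effects drive a yes/no outcome through the logit scale and a count outcome through the log scale, so the observed outcome carries information about the missing one:

```python
import math
import random

def draw_subject(mu_logit, mu_lograte, sd1, sd2, rho, rng):
    """One subject's (yes/no, count) pair. Correlated bivariate-normal
    random effects: component 1 shifts the logit-odds of a yes,
    component 2 shifts the log-rate of the count. If one outcome is
    missing under MAR, you'd simply drop its likelihood term when
    fitting; the correlation still lets the observed outcome inform
    the missing one. All parameter values are illustrative."""
    z1 = rng.gauss(0.0, 1.0)
    z2 = rho * z1 + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)  # Cholesky
    p = 1.0 / (1.0 + math.exp(-(mu_logit + sd1 * z1)))   # inverse-logit
    yes = rng.random() < p
    lam = math.exp(mu_lograte + sd2 * z2)                # count rate
    # Knuth Poisson sampler for the count outcome.
    L, k, acc = math.exp(-lam), 0, 1.0
    while True:
        acc *= rng.random()
        if acc <= L:
            return yes, k
        k += 1

rng = random.Random(11)
pairs = [draw_subject(0.0, 1.0, 0.8, 0.5, 0.6, rng) for _ in range(5000)]
yes_mean = sum(k for y, k in pairs if y) / sum(1 for y, k in pairs if y)
no_mean = sum(k for y, k in pairs if not y) / sum(1 for y, k in pairs if not y)
```

With positive correlation, subjects answering "yes" should have larger counts on average, which is exactly the leverage you get for imputing one outcome from the other.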
