I’ve been trying to wrap my head around how to best go about multiple imputation with Stan code, and I’ve read a number of different posts and vignettes on this forum, StackExchange, and various GitHubs. While all helpful, I have found that there is a fair amount of variability in recommendations and resources, but these recommendations tend to be very use specific while tutorials often focus on cases where only one variable has missing data or where all the missing data can be modeled via a multivariate normal distribution.
My hope in creating this thread is to get a sense for what the community thinks of with respect to implementing imputation models in Stan from problem conceptualization to implementation and verification. If such resources for a workflow already exist, then I’d greatly appreciate any links or recommended readings as I haven’t found anything quite as dedicated to the topic as I seemingly need in order to feel comfortable write a Stan model to accommodate a realistically complicated missing data scenario.
To ground this somewhat, I can share my general situation, though I don’t necessarily mean for this thread to be beholden to this singular context. I’m running item response theory models where there are a number of person-level and item-level covariates being used to explain/predict the person and item parameters. The missingness is strictly in predictors with all but a few covariates of interest missing at least some observations. The covariates that are missing include variables that could be modeled via continuous distributions (likely a simple normal or skew normal in some cases) as well as discrete distributions (including multinomial, ordered, and binary). There are auxiliary variables that can be used to help impute missing values in the covariates of interest as well with similar properties of missingness and distributional considerations.
From my general searches, the major recommendations that I’ve seen are listed below. For each approach, I’m including some thoughts/questions/comments to hopefully give some extra context on what isn’t necessarily clear to a very much so applied data analyst with only a very practical appreciation for concepts like marginalizing out a variable or Jacobian adjustments.
- Use other software like miceto complete the imputation and then pass the data to the Stan model –
This is a very intuitive solution in my mind as other resources for handling missing data typically have a lot of documentation about how to do so and what arguments to adjust for specific kinds of problems. When a model can be run with a package like brms, this is a fine plug-and-play, just-power-through kind of approach since there are built-in functions to help with all the data cleaning on the backend. Even then, how does one conduct appropriate posterior predictive checks when there are technically multiple datasets being used to fit the model? Another concern here is just the time cost of Bayesian analyses: how can this be made efficient so that models that may take many hours to fit on a single dataset aren’t taking days (or longer) to run multiple times over imputed datasets?
- Treat missing variables as parameters to be estimated and impute values for the missing values of each variable –
This seems to be the more common of the general tutorials and guidances that I have been able to find, and the documentation for these approaches is pretty good overall. When there are tutorials for estimating missing discrete variables, they tend to focus on the binary case, and it isn’t always clear (at least to me) how those methods would generalize to categorical or ordered cases. Additionally, when missing values are imputed essentially one variable at a time, it would seem highly inefficient. In essence, a multivariable formula is being used to estimate the linear component of each variable with missingness that is then used for the sampling statement over the combined observed and missing observations vector. When there are multiple covariates with missing data and one is using auxiliary variables in these imputations as well, this seems to very rapidly increase the number of parameters and equations being estimated. Is there a point at which this very straightforward approach is more efficient than something more complicated – e.g., some number of observations by some number of predictors vs auxiliary variables?
- Treat missing variables as parameters but impute values from a multivariate model such as with copulas –
This particular topic seemingly has gained a lot of attention on the forum, and the documentation by @spinkney and the development of these methods for the next brms update seemingly have driven this. At present, I’ve seen implementation of a multivariate gaussian copula with mixed data types: normal, binomial, and Poisson (github/spinkney). On the forum, it also seems that an ordered marginal has been added to this functionality (Speed-up for Mixed Discrete-Continuous Gaussian Copula - Reduce Sum?). It seems relatively trivial to model any particular marginal via a gaussian copula when the marginal distribution has a *_cdf definition in Stan, but many of the discrete distributions lack the CDF, meaning that users need to “do the math” to make the copula work. There are also vine copulas, which seem to be restricted to bivariate copulas but offer additional flexibility in terms of the specified relationship between the variables. In the case of multivariate missing data, is there a multivariate extension of the vine copula or does one need to simply estimate each pairwise bivariate vine copula? Since most discrete distributions can be treated as arising from a continuous latent variable, is it more efficient/ever appropriate to model missingness via a multivariate regression on this latent scale and then use the results to get the discrete marginals?
- Use gaussian process models to impute the multivariate missing data –
This is relatively new to me as I’ve not really ever used GP models before, but I have seen some literature on its use for data imputation (e.g., here and here). This also seems to have been raised on the forum before (Scalable Gaussian process inference - #8 by tillahoffmann). GPs seem to be very flexible, but are they a reasonable default approach to imputing missing data in Stan? They strike me as computationally intensive and a little black-boxy, though this could be my general lack of familiarity with them. I also don’t believe that I’ve ever seen an implementation of GPs for missing data in Stan, though this also might be a bias in terms of the places I tend to look and the kinds of problems that I encounter in neuropsychology versus big data analysis.
Looking forward to the discussion and recommendations that will hopefully come from this!