Bayesian inference after multiple imputation

So, reading BDA3 (as well as Zhou & Reiter 2010 and I think that’s also the brm_multiple approach, right?), it seems one good approach to Bayesian inference after multiple imputation is to

  1. Do lots of imputations (let’s say 100+).
  2. Fit a Stan model to each imputation.
  3. Combine the MCMC samples.

However, with more complex models each fit takes a while, and if you need a lot of imputations, the total time adds up rather quickly (although this is of course extremely parallelizable).

Would it make more sense to randomly sample an imputed dataset for each iteration of the sampler (or to use a mixture likelihood?!) while fitting a single model? Have you seen this implemented, or do you know of similar approaches that actually do this successfully (= faster or otherwise better than the first approach I outlined)?

You cannot sample random numbers during the sampling phase in a way that changes the data set. The data being conditioned on is immutable, for good reasons.

So running many models in parallel is the best thing to do here… given that one can throw this on a cluster, of course.

My intuition is that mixture modeling will not make this any faster… in fact it will very likely slow it down a lot.

How about fitting all data sets in the same model and weighting when accumulating the log_pdf?

One obviously can’t simply use multiple data sets in the same model because this would lead to an underestimation of the parameters’ variance.
But would weighting the log_pdf values solve this? For instance, if one had 100 imputed data sets, could one then use

target += 0.01 * normal_lpdf(y | mu, sigma);

(where y is the outcome from each of the imputed data sets in turn) and would one get posteriors with the correct variance?
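
In model-block terms the idea would read something like the following fragment (not a complete program; y_imp, M, mu, and sigma are placeholder names assuming a simple normal model):

model {
  // hypothetical weighting: each imputed data set contributes 1/M of its log-density
  for (m in 1:M)
    target += (1.0 / M) * normal_lpdf(y_imp[m] | mu, sigma);
}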

The motivation is that this way one could avoid the post-processing that comes with fitting multiple models, and one could still use parallelisation to estimate this bigger model quickly. (That might not be a viable solution if the original data set is very large.)


@Guido_Biele I suspect the approach of having a mixture likelihood should in principle work, but I fear you'd have to use the appropriate way of adding up mixtures (see https://mc-stan.org/docs/2_19/stan-users-guide/vectorizing-mixtures.html, as well as https://statmodeling.stat.columbia.edu/2017/08/21/mixture-models-stan-can-use-log_mix/). I.e., the problem is that you have one likelihood per dataset and then combine them with equal weights, which is not the same as throwing all the observations together. In one of the posts I link to, @wds15 showed how to do that:

real lmix[100];
lmix[1] = log-likelihood for data set 1;
lmix[2] = log-likelihood for data set 2;
// ... one entry per imputed data set
target += log_sum_exp(lmix);

(not that that’s all that bad)

But I also hear him suggesting this may not be the computationally smartest (= fastest) thing to do. I also wondered about that, for exactly the reasons you mention. I wanted to avoid worrying about post-processing etc. - although I admit that until today I had not realized the sflist2stanfit function exists (see https://mc-stan.org/rstan/reference/sflist2stanfit.html). That makes the advice of @wds15 even more appealing (besides him usually being right about efficiency in sampling with Stan).


You are right, one has to exponentiate the log-likelihoods before summing (and then take the log again), which makes this less efficient.

The situation where I could still see this being useful is when you have a relatively small subset of "rows" with missing data. What I mean is that if you have 1000 rows and missing data in only 100 of them, then fitting many models in which 90% of the data is fixed is not very efficient. However, setting up a model that calculates the likelihood for the 90% of fixed data only once, and multiple times for the imputed data, takes its own time (but can be useful if one is experimenting with different models and wants to use multiple processors to run multiple models instead of using them to run through the MI data ;-)).
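
A sketch of that layout, again with a simple normal model and placeholder data names (y_obs for the fully observed rows, y_imp for the M imputed versions of the incomplete rows):

data {
  int<lower=1> M;            // number of imputed data sets
  int<lower=0> N_obs;        // rows without missing data
  int<lower=0> N_mis;        // rows that required imputation
  vector[N_obs] y_obs;       // fully observed outcomes
  matrix[M, N_mis] y_imp;    // row m = imputed outcomes from data set m
}
parameters {
  real mu;
  real<lower=0> sigma;
}
model {
  vector[M] lmix;
  mu ~ normal(0, 10);
  sigma ~ normal(0, 5);
  // the fixed part of the data enters the likelihood only once
  target += normal_lpdf(y_obs | mu, sigma);
  // the imputed rows enter as an equal-weight mixture over the M imputations
  for (m in 1:M)
    lmix[m] = normal_lpdf(y_imp[m] | mu, sigma);
  target += log_sum_exp(lmix) - log(M);
}

Factoring out the fixed-data likelihood is valid here because it is a common factor in every mixture component.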

Anyhow, sflist2stanfit is surely a straightforward option.