Bayesian inference after multiple imputation

So, reading BDA3 (as well as Zhou & Reiter 2010 and I think that’s also the brm_multiple approach, right?), it seems one good approach to Bayesian inference after multiple imputation is to

  1. Do lots of imputations (let’s say 100+).
  2. Fit a Stan model for each imputation.
  3. Combine the MCMC samples.

However, with a more complex model each fit takes a while, and if you need a lot of imputations the time to do all of this adds up rather quickly (although it is of course extremely parallelizable).

Would it make more sense to randomly sample an imputed dataset for each iteration of the sampler (or to use a mixture likelihood?!) while fitting a single model? Have you seen this implemented, or do you know of similar approaches that actually do this successfully (i.e. faster or otherwise better than the first approach I outlined)?

You cannot sample random numbers during the sampling phase in a way that changes the data set. The data being conditioned on is immutable, for good reasons.

So running many models in parallel is the best thing to do here…given one can throw this on a cluster, of course.

My intuition here is that mixture modeling will not make this any faster… in fact, it will very likely slow it down a lot.

How about fitting all data sets in the same model and weighting when accumulating the log_pdf?

One obviously can’t simply use multiple data sets in the same model because this would lead to an underestimation of the parameters’ variance.
But would weighting the log_pdf values solve this? For instance, if one had 100 imputed data sets, could one then use

target += 0.01 * normal_lpdf(y | mu, sigma);

(where y stands for one of the 100 imputed data sets, with this statement repeated for each of them) and would one get posteriors with the correct variance?

The motivation is that this way one could avoid the post-processing that comes with fitting multiple models, and one could still use parallelisation to estimate this bigger model quickly. (That might not be a viable solution if the original data set is very large.)
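For concreteness, here is a minimal Stan sketch of what I have in mind, assuming a simple normal model; the names (M, y_imp) and the priors are purely illustrative:

data {
  int<lower=1> N;                 // observations per imputed data set
  int<lower=1> M;                 // number of imputed data sets (100 above)
  matrix[M, N] y_imp;             // row m holds the outcome vector of imputed data set m
}
parameters {
  real mu;
  real<lower=0> sigma;
}
model {
  mu ~ normal(0, 10);             // illustrative priors
  sigma ~ normal(0, 5);
  for (m in 1:M)
    target += (1.0 / M) * normal_lpdf(y_imp[m] | mu, sigma);   // weight 1/M, i.e. 0.01 for M = 100
}

Whether this 1/M weighting really gives posteriors with the correct variance is exactly the question.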


@Guido_Biele I suspect the approach of having a mixture likelihood should in principle work, but I fear you'd have to use the appropriate way of adding up mixtures (see 5.4 Vectorizing Mixtures | Stan User’s Guide, as well as https://statmodeling.stat.columbia.edu/2017/08/21/mixture-models-stan-can-use-log_mix/). That is, the problem is that you have one likelihood for one dataset and then combine it, with equal weight, with the likelihood for another dataset, which is not the same as throwing all the observations together. In one of the posts I quoted, @wds15 showed how to do that:

array[100] real lmix;
lmix[1] = ...;  // log-likelihood of imputed data set 1
lmix[2] = ...;  // log-likelihood of imputed data set 2
// ... and so on up to lmix[100]
target += log_sum_exp(lmix);

(not that that’s all that bad)

But I also hear him suggesting this may not be the computationally smartest (i.e. fastest) thing to do. I also wondered about that, for exactly the reasons you give. I wanted to avoid worrying about post-processing etc. - although I admit that until today I had not realized the sflist2stanfit function exists (see Merge a list of stanfit objects into one — sflist2stanfit • rstan). That makes the advice of @wds15 even more appealing (besides him usually being right about efficiency in sampling with Stan).


You are right, one has to exponentiate the likelihood before summing (and then log again), which makes the thing less efficient.
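In symbols, writing \ell_m for the log-likelihood of imputed data set m, the equal-weight mixture contributes

\log\left(\frac{1}{M}\sum_{m=1}^{M} e^{\ell_m}\right) = \text{log\_sum\_exp}(\ell_1, \dots, \ell_M) - \log M

to the target, where the -\log M term is a constant that can be dropped during sampling; i.e. one cannot simply add up the \ell_m on the log scale.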

The situation where I could still see this being useful is when you have a relatively small subset of “rows” with missing data. What I mean is that if you have 1000 rows and missing data in only 100 of them, then fitting many models in which “90%” of the data is fixed is not very efficient. However, setting up a model that calculates the likelihood for the “90% fixed data” only once, and multiple times for the imputed data, takes its own time (but it can be useful if one is experimenting with different models and wants to use multiple processors to run different models rather than to run through the MI data sets ;-)).
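A rough Stan sketch of that set-up (again a simple normal model; y_obs holds the fully observed rows, y_imp the M imputed versions of the incomplete rows, and all names and priors are just illustrative):

data {
  int<lower=0> N_obs;             // rows with complete data (the "90%")
  int<lower=0> N_mis;             // rows that needed imputation
  int<lower=1> M;                 // number of imputed data sets
  vector[N_obs] y_obs;            // fixed, fully observed outcomes
  matrix[M, N_mis] y_imp;         // row m = m-th imputation of the incomplete outcomes
}
parameters {
  real mu;
  real<lower=0> sigma;
}
model {
  vector[M] lmix;
  mu ~ normal(0, 10);             // illustrative priors
  sigma ~ normal(0, 5);
  target += normal_lpdf(y_obs | mu, sigma);         // fixed part, evaluated only once
  for (m in 1:M)
    lmix[m] = normal_lpdf(y_imp[m] | mu, sigma);    // imputed part, once per imputation
  target += log_sum_exp(lmix) - log(M);             // equal-weight mixture over imputations
}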

Anyhow, sflist2stanfit is surely a straightforward solution.


Apologies for coming late to this, but I’ve just found this thread and am intrigued by its content: I’d have thought that using a mixture model across multiple imputations would be popular, but I find it hard to find any papers that actually advocate doing this. Is this idea already out there in places I can’t find and/or does it have a name that isn’t what I’m using as a search term?


@s.maskell, are the two not theoretically the same thing? When I discussed this with other people, the argument was usually not about theoretical validity or anything, but rather about convenience of programming/implementation.

If you mix posteriors from separate fits for each imputed dataset, that's valid given enough imputations, and importantly it is pretty straightforward:

  1. There is no need to do anything fancy implementation-wise: you just write the model for complete data, so anything from rstanarm, brms or whatever works out of the box.
  2. You don't need to somehow index the imputations and wrap them in a loop that sums up the likelihoods, which is especially tedious because you first get the likelihood from all records, then exponentiate, then sum that up over all imputations, and only then can you go back to the log scale (of course you'd use tricks like log_sum_exp, but it's still tedious and not as simple as adding up "separate" log-likelihood contributions from each record).
  3. Parallelization is straightforward (across imputed datasets, but even within chain, which would need a totally different implementation for the mixture-likelihood approach, I think), and if you have a nice server with lots of cores the model fitting is really fast (i.e. unlike with the mixture likelihood, it does not need to depend on your number of imputations).
  4. Finally (perhaps least importantly), there's functionality in the standard packages (e.g. rstan) for combining posteriors from multiple model fits, so no extra work there.

The main downside of mixing the posteriors is that automated checks for non-convergence of the MCMC sampling will complain about your samples: if there's enough missing data to change the posterior even a bit, they will (in a sense correctly) detect that what look like different chains (but are really the posterior samples for different imputations) have sampled different distributions. So you actually want to look at the convergence diagnostics for each imputation separately. You certainly avoid that with the mixture likelihood, or you could randomly mix together the draws from the multiple imputations as if they were a single chain (but that might again undermine some diagnostics, like looking at the correlation of draws).

PS: Both are of course approximations to the undoubtedly valid fully Bayesian approach of dealing with everything in one single go (i.e. the model has the missing data as latent parameters that get sampled anew for each MCMC draw), which is more or less equivalent to doing multiple imputation with the number of imputations equal to the number of desired MCMC samples (and keeping only one sample from each fit). I say "more or less equivalent", because by doing the imputation separately first, we are not allowing the model we are really interested in to have any "backwards influence" on the imputation.
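For illustration, a minimal Stan sketch of that fully Bayesian treatment for a simple normal outcome (a real imputation model would of course condition on covariates; names and priors are illustrative):

data {
  int<lower=0> N_obs;
  int<lower=0> N_mis;
  vector[N_obs] y_obs;            // observed outcome values
}
parameters {
  vector[N_mis] y_mis;            // missing values treated as latent parameters
  real mu;
  real<lower=0> sigma;
}
model {
  mu ~ normal(0, 10);             // illustrative priors
  sigma ~ normal(0, 5);
  y_obs ~ normal(mu, sigma);      // observed part of the likelihood
  y_mis ~ normal(mu, sigma);      // missing values are re-sampled at every MCMC iteration
}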

Thanks @Bjoern for the response. I don't think it follows that if the likelihood is an equally weighted mixture, the posterior will also be an equally weighted mixture: if p(y|x) = \sum_i p(y|x,M_i)p(M_i), then

p(x|y) \propto p(y|x)p(x) = \sum_i p(y|x,M_i)p(M_i)p(x) = \sum_i p(x|y,M_i)p(y|M_i)p(M_i),

which means you'd need to give each of the different per-imputation samplers a weight proportional to p(y|M_i), which is unlikely to be equal for all M_i.