Building an importance sampler to refine an approximate model

I have a very computationally intensive model. I am pretty sure there is no simple way to speed it up (e.g., the usual things like reparameterizing the model or using LKJ priors instead of inverse Wisharts, etc.). However, I have been thinking about building an approximate model and then using importance sampling to refine the approximate posterior samples toward the exact posterior (I have heard this referred to as emulation). I am just starting to think about this and would like some advice.

My first idea for how to do this would be to simply use the approximate posterior samples (for example, from the variational inference provided in Stan) as samples from a proposal distribution and use an importance sampler to collect “exact” samples from the posterior of interest. In this case I am interested in ideas for how to efficiently build an importance sampler for a model written in Stan. I feel like this should be really easy, as Stan already stores the log-posterior value for each sample it produces. Perhaps there is a way to quickly calculate the log-posterior for a sample generated in R and passed to some function in the RStan interface?
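For concreteness, here is the kind of thing I am imagining in R (a rough sketch only; as I understand it, rstan exposes log_prob and unconstrain_pars methods for a stanfit object, and fit and draws below are hypothetical names for a fit of the exact model and a list of approximate posterior draws):

    library(rstan)

    # 'fit' is a stanfit object for the exact model (even a token run would do,
    # since we only need it to expose the model's log density);
    # 'draws' is a list of approximate posterior draws on the constrained scale.
    lp_exact <- sapply(draws, function(theta) {
      upars <- unconstrain_pars(fit, theta)          # map draw to unconstrained space
      log_prob(fit, upars, adjust_transform = TRUE)  # log posterior incl. Jacobian
    })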

I would be surprised if the stan-dev team has not already thought along these lines. Any advice?

Of note, I did see the PSIS in the loo package. This makes me think that functions to do what I am looking for may already exist.

In general importance sampling is extremely fragile. If your approximate distribution isn’t close to the true distribution then the importance weights don’t have enough power to reasonably correct anything; indeed, their high variance usually makes estimates even worse. Outside of a few dimensions it is very unlikely that a poor approximation like VB will be anywhere near close enough for importance sampling to be of any use.

PSIS is nice because these failures manifest in clear diagnostics; in other words, unlike naive importance sampling methods, you’ll get a big warning indicating that the result is not to be trusted.
You’ll see many of these warnings if you try to use PSIS to weight samples generated from an approximate algorithm.
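For example, with the loo package the check is just the estimated Pareto shape k (a minimal sketch; lp_exact and lp_approx are hypothetical vectors holding the exact and approximate log densities evaluated at the same approximate draws):

    library(loo)

    # log importance ratios: exact log density minus approximate log density,
    # both evaluated at the approximate posterior draws
    log_ratios <- lp_exact - lp_approx
    psis_fit <- psis(log_ratios)

    # k > 0.7 is the standard signal that the weights cannot be trusted
    print(psis_fit$diagnostics$pareto_k)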

Note also that importance sampling typically assumes independent samples, so for example if you were trying to weight samples generated from a Markov chain (say Stan output from a simpler model) then there are even more insidious issues that can arise.

+1


Thank you for your response. A few questions/comments:

  1. You could build an “importance sampler” that is essentially an MCMC chain where the proposals come from the approximate posterior (see the sketch after this list). Yes, the samples are dependent, but this is not so strange; it is essentially an independence Metropolis sampler, similar in spirit to proposing from a Laplace approximation.

  2. I completely agree with your skepticism; I think in general this is not such a good option. However, having seen the difference between my variational approximations and my posteriors from NUTS, I think this method may provide a reasonable choice, especially if done in blocks. That is, I would actually build a Metropolis-within-Gibbs scheme where the Metropolis proposals come from the approximate posterior. This could get around the extremely low acceptance rate I would otherwise expect if I tried to update all the parameters at once.

  3. I would note that this is not such a new idea and some others have found this works extremely well:
    http://www.tandfonline.com/doi/full/10.1080/01621459.2017.1345742

  4. Are there any internals of Stan that are exposed and would be of use?
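To make the first idea concrete, here is a rough sketch of the independence Metropolis chain I have in mind (all names hypothetical: q_draws is a list of draws from the approximate posterior, lq their log densities under the approximation, and lp() evaluates the exact unnormalized log posterior, e.g., via rstan’s log_prob as above):

    # Independence Metropolis: proposals are i.i.d. draws from the approximation.
    samples <- vector("list", length(q_draws))
    theta <- q_draws[[1]]
    lp_cur <- lp(theta)
    lq_cur <- lq[1]
    samples[[1]] <- theta
    for (i in 2:length(q_draws)) {
      lp_prop <- lp(q_draws[[i]])
      # acceptance ratio for an independence proposal
      log_alpha <- (lp_prop - lq[i]) - (lp_cur - lq_cur)
      if (log(runif(1)) < log_alpha) {
        theta <- q_draws[[i]]
        lp_cur <- lp_prop
        lq_cur <- lq[i]
      }
      samples[[i]] <- theta
    }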

The general idea, however, of “correcting” an approximate sample is a useful one.

Suppose that you have a sample from a diffuse distribution that blankets the typical set of the actual distribution you’re interested in, p(x).

Take your sample x and a copy of it, y. Consider these two ensembles, one of which is supposed to be distributed according to p(x) and the other according to the tempered density p(y)^(1/T) for some T greater than 1 (equivalently, with log probabilities lp1(x) = lp(x) and lp2(y) = lp(y)/T).

Now exp(sum(lp1(x)) + sum(lp2(y))) is your unnormalized density over the two ensembles jointly.

With probability, say, 80%, choose one x and one y and propose swapping them between the ensembles; accept if the overall log probability increases, or with probability exp(dlp) if the change dlp is negative. The other 20% of the time, select a random x and a random y and update each with a diffusion (random-walk) Metropolis step.

After a while, all the points in your x ensemble that are deep in the tails of p(x) will have been swapped into the higher-temperature ensemble, where they help maximize the overall probability, and your x ensemble will have diffused around in such a way that no points overlap each other (which would happen if you duplicated the ensemble and just sorted the points into two temperatures without ever diffusing them). Eventually the probability of swapping a sample into the higher-temperature ensemble declines. At that point you can throw out the high-temperature ensemble, duplicate your low-temperature ensemble, and repeat. After a few repeats, run the whole thing for a while to fully equilibrate, then stop.

I’ve done this successfully in some recent experiments, and I think it has the flavor of importance sampling in that it morphs your sloppy (but fast?) sample toward the more correct distribution by ejecting points deep in the tails in favor of points that are less deep in the tails. Under the assumption that your sloppy sample already blankets the typical set, the fact that the Metropolis diffusions move slowly, like a random walk, isn’t a big deal since you already have coverage.
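Here is a rough sketch of one update in R, under stated assumptions (lp() returns the unnormalized log density, x and y are matrices with one ensemble member per row, and rw_sd is a random-walk scale; all names are made up for illustration):

    # One update of the two-ensemble scheme: x targets p, y targets p^(1/temp).
    ensemble_step <- function(x, y, lp, temp = 2, rw_sd = 0.1) {
      if (runif(1) < 0.8) {
        # swap move: propose exchanging one point from each ensemble
        i <- sample(nrow(x), 1)
        j <- sample(nrow(y), 1)
        dlp <- (lp(y[j, ]) + lp(x[i, ]) / temp) - (lp(x[i, ]) + lp(y[j, ]) / temp)
        if (log(runif(1)) < dlp) {
          tmp <- x[i, ]
          x[i, ] <- y[j, ]
          y[j, ] <- tmp
        }
      } else {
        # diffusion move: random-walk Metropolis on one point in each ensemble
        i <- sample(nrow(x), 1)
        prop <- x[i, ] + rnorm(ncol(x), 0, rw_sd)
        if (log(runif(1)) < lp(prop) - lp(x[i, ])) x[i, ] <- prop
        j <- sample(nrow(y), 1)
        prop <- y[j, ] + rnorm(ncol(y), 0, rw_sd)
        if (log(runif(1)) < (lp(prop) - lp(y[j, ])) / temp) y[j, ] <- prop
      }
      list(x = x, y = y)
    }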

See if you get any mileage out of this. Hope it helps.

Just to reiterate a core question of mine: are there any internals of Stan that are exposed and would be of use?

The entire C++ API is available, but if you want to do things in R or Python then you will probably find that there is not enough access to the underlying code to implement the approaches you have been throwing out.

I understand that proof by authority is wholly unsatisfying and so I encourage you to experiment and see if you recover the same results that we are predicting. Let me just offer the warning that the percentage of problems where these approaches work is essentially zero, so it would be best to consider those experiments as part of the learning experience rather than progress towards a practical solution!

More specifically, the problems for which these approaches work are usually “special” in a way that matches the way the approximation is built. The INLA project is an example of this, where something that wouldn’t work in general works very well (i.e., generates essentially correct posteriors) for a restricted class of “special” models. I cannot, off the top of my head, think of a second example.


+1

GPstuff and some other GP software are other examples of supporting a restricted class of “special” models. For example, GPstuff has a logistic Gaussian process density estimation model for which a Gaussian approximation plus PSIS works well even with a 400-dimensional parameter space (see Pareto Smoothed Importance Sampling, arXiv:1507.02646). But it really is a special model.


A bit (a lot) of self-promotion: we have a paper relating to this kind of idea.

The basic idea is that often we have a model with a relatively small number of hyperparameters and a potentially high-dimensional latent structure. Then we can run a fast approximate MCMC targeting the marginal posterior of the hyperparameters (in the paper we use Gaussian approximations, in a similar fashion as in INLA, but other approximations like variational Bayes, coarse discretization, etc. could also be used), and then perform importance-sampling-type weighting in a postprocessing step in embarrassingly parallel fashion, where we simulate the latent variables as well. Of course there can be issues with the importance weights if the approximate posterior isn’t good enough, but even this can be dealt with to some extent by using tempering etc. In practice, with latent variable models with Gaussian latent structure this method seems to be super effective.
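In its simplest form the postprocessing is just self-normalized importance sampling over the stored approximate draws (a minimal sketch, not the full algorithm in the paper; lp_exact, lp_approx, and draws are hypothetical names for the exact and approximate log densities and a matrix of draws, one per row):

    # self-normalized importance weights, computed on the log scale
    log_w <- lp_exact - lp_approx
    w <- exp(log_w - max(log_w))  # subtract the max for numerical stability
    w <- w / sum(w)

    # importance-weighted posterior mean of each parameter
    post_mean <- colSums(w * draws)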

We are currently rewriting the paper to make it a bit more accessible, but there are a couple of posters illustrating the idea in practice. The first one, from SMC2017, introduces the basic idea and compares it with delayed acceptance MCMC. Here, compared to the current version of the paper, we use a coarse discretization of a continuous-time process as the approximate model and then do the weighting based on a much finer discretization.

Another one, from useR2017, compares this approach with HMC and a hybrid HMC-IS as well. Note that here it looks like combining HMC and IS might not be worth the trouble, but that is mostly because in that implementation the approximation is refined at each log-likelihood and gradient computation. With other types of approximations this might not be necessary, and then combining HMC and IS can make sense (for example, my R/Stan package walker uses global Laplace approximations based on the MLE estimates).

How did you program and parameterize the Stan models, and how long did you run adaptation? What did you then measure as efficiency (total time including compilation and adaptation, or just the marginal cost of a larger effective sample size)? What settings did you use for inference (number of sampling iterations, number of warmup iterations, any step size or acceptance rate modifications from the defaults)?

How did you initialize both algorithms?

How did you verify that both algorithms got the right answer in terms of posterior covariance? Did you compare posterior intervals or at least posterior standard deviations?

There is a small vignette here which gives more details of the experiment, including the Stan model: https://github.com/helske/stannis/blob/master/stannis_vignette.pdf. For the hybrid method, the model file is in the same GitHub repo (exec/x_llt_approx.stan plus some common functions, and then some C++ functions for the IS weighting).

For all cases, I ran the algorithms independently 100 times, each run consisting of 20,000 warmup iterations and then an additional 20,000 iterations. The measure of efficiency was the inverse relative efficiency, defined as average computation time (not including compilation time but including warmup) multiplied by the mean square error (comparing ESS is a bit problematic, as the way it is computed for the IS method differs from how Stan computes it). For pure HMC, I needed to modify adapt_delta in order to get rid of (most of) the divergence warnings. All methods were initialized with the true values of the standard deviations for simplicity; the state variables in pure HMC I tried to initialize with some “reasonable values” (for the hybrid method we don’t need these).

I did compare posterior means and standard deviations, and ran preliminary long runs in order to make sure each method produced the “same” results, and I have previously tested the non-Stan methods quite exhaustively in order to be confident that they work as intended.

I did this experiment mostly because I wanted to get some feeling for how the combination of HMC and the approximate MCMC would work together, and to learn Stan at the same time, without putting too much effort into the details, so I don’t claim this gives definitive answers to anything. This was more like a pilot test; I bet the pure-Stan version could be tweaked to be more efficient (for example, since then I’ve heard some suggestions about modifying the warmup phase in Stan). And the hybrid IS-HMC could perhaps be made faster by using the autodiff stuff of Stan, etc. Also, the results were highly influenced by the test model: for some stochastic volatility models HMC worked much better, and for some other Poisson models I couldn’t get pure HMC to converge at all due to severe divergences, whereas the hybrid version worked fine. I have a feeling that most of the computational differences I saw are due to the warmup, but I haven’t thought much about doing more rigorous experiments, since this didn’t produce that interesting results; the potential gains don’t seem to be worth the extra work.

You probably don’t need inits at all for Stan. Gibbs samplers (or at least BUGS and JAGS) are a lot more sensitive to where they’re initialized.

Do you really need 20K warmup iterations?

You probably also don’t need to save all those transformed parameters—couldn’t they be defined as local variables in the model block?

Also, it appears you’re creating diagonal matrices (for which there is better syntax), but that’s rarely a good idea computationally. For example, we have quadratic forms specifically for diagonal vectors. It’s also much, much better to work with Cholesky factors wherever possible, but I don’t know enough about these models, or have enough time, to sort out whether that’s possible here.

The main thing you need to do for efficiency in Stan is to vectorize operations into matrix operations (you do a lot of them as loops) and also to save repeated subexpressions. For example, in (xbeta[t] + mode[t]) - exp(xbeta[t] + mode[t]) you want to compute xbeta[t] + mode[t] only once, e.g., assign it to a local variable u and write u - exp(u). Every subexpression adds memory allocation and extra time to the reverse-mode derivative calculations.

I’m also not sure whether you are going to run into places where the function you’re defining with conditionals isn’t continuously differentiable; this can make it very hard for HMC to move across the conditional boundary. It would most likely manifest as divergences or underflow/overflow warnings when you run.

It’s rarely efficient to explicitly create a diagonal matrix. I started looking at common_functions.stan, but there’s just too much in there that can be made faster for me to do a thorough review.

I don’t know why there’s a negative in your (log) Jacobian calculation; it’s just sum(theta) if you want to recover what’s going on under the hood, because the transform from unconstrained to constrained is exp(theta), whose derivative is exp(theta), so the log Jacobian is sum(theta).

This kind of thing can be bad news:

    sum(log(approx_var_y));

It indicates that approx_var_y was calculated on a linear scale and then logged; it’s usually much more stable to compute approx_var_y on the log scale from the get-go as

    log_approx_var_y = -(xbeta + mode);

rather than using

    approx_var_y = 1.0 ./ exp(xbeta + mode[1:n]);

and then taking a log. Also, you don’t need all those [1:n] if the structure is n elements long; that just does an unneeded copy (I probably should have checked for this degenerate case in the implementation, as it’s only an integer comparison, rather than leaving it as a user optimization).

No idea how much efficiency improvement you could get by coding the common functions more efficiently.

Thanks, good hints about vectorization etc. This was my first stab at Stan, and I basically just ported my Armadillo-based C++ code to Stan without much knowledge of what the most efficient approach in Stan would be. The saving of the transformed variables and some other stuff is related to the fact that I do some postprocessing (the IS weighting) outside of Stan, where I need these variables. There is also some ugly stuff going on: the mode variable is of length n+1 (so I need [1:n]), and I store something else in the last, (n+1)th element in order to return multiple things from a function (as I understood it, you can’t do pass-by-reference). The sum(log(approx_var_y)) thing is like that because I also need the non-log versions of the variances later. I knew the diagonal stuff wasn’t optimal, but here those matrices are just 2x2, so I thought it wouldn’t make a big difference.

I’m not certain, but I remember having some issues with Stan when the states were poorly initialized, like errors about a non-finite initial posterior.

And yeah, 20,000 warmup iterations was likely overkill.

The most time-consuming part (at least when this was a classic RWM implementation in C++) is the Kalman filter and smoother in the common functions, which are likely difficult to code much more efficiently unless there are some Stan-specific things you could do: due to their recursive nature they basically must be done in loops (although it would be possible to do similar things via large sparse matrices with similar computational cost).

edit: Actually, there is one thing which could perhaps speed up the hybrid method considerably. The approximation scheme is essentially a Newton-type optimization algorithm implemented with the Kalman filter and smoother (this is based on the Durbin & Koopman 2000 JRSS paper), but I wonder if you could find the same approximation more efficiently using the autodiff stuff from Stan.

Thank you for all the good conversation. Like Jouni Helske’s, my model is actually a generalized dynamic linear model (multinomial observations, conditionally Gaussian). @helske, do you think your approach would work for multinomial observations?

I have some interest in the multinomial case as well due to a potential future project, but I haven’t tested the method with multinomial models yet. In principle it should work: the approximation should be easy with the same Laplace approximation idea as in the Poisson case [1], and there are very efficient particle filters available for the weighting phase [2]. Of course, in practice everything likely depends on how many categories and how many observations you have, etc.

[1] See, for example, Durbin and Koopman: Time Series Analysis by State Space Methods, 2012, Sections 9.3.5 and 10.6, and Section 8.1 of our arXiv paper.
[2] The basic idea is to use the so-called psi-APF (Section 8.2 in our paper) with twisting functions based on the same Laplace approximations; this is closely related to the iterated auxiliary particle filter of Guarniero et al.: https://arxiv.org/abs/1511.06286 (they use an iterative scheme for finding optimal twisting functions, whereas we just plug in the approximation).

We can take the discussion off the list if you would like to discuss details of possible implementation etc.

Regarding multinomial models, I just stumbled upon this: https://arxiv.org/pdf/1506.05843.pdf. I have only skimmed the main parts of the paper, but it looks interesting.

Yeah, great paper. Unfortunately the stick-breaking construction does not produce the distribution I need, the logistic normal.

Awesome! I will look into those references.

Thank you!
Justin