# Optimization / adaptive importance sampling loop for rstan

I mentioned this a while ago (@avehtari) and just got around to unpacking and repacking the code into something semi-functional for general purposes. It’s not something I plan to put any real focus on, but I thought it might be helpful for those of us fitting complex / slow models, and/or provide a base for further work. Here’s the description:

A small package containing a 3-step routine for fitting Stan models, plus limited diagnostics. For complex models where the maximum a posteriori (MAP) estimate provides a useful starting point, this approach ‘may’ allow for faster sampling from the posterior. The optional step 1 uses differential evolution, linking to the DEoptim package. Step 2 then uses the Stan optimizer. Step 3 computes a more accurate (in my limited experience) Hessian, or approximation of the Hessian, and uses this either: a) to sample from directly, for a fast but inaccurate representation of the posterior; or b) as the basis for an initial proposal distribution for an adaptive importance sampling procedure.

Install in R:

```r
install.packages("devtools")
library(devtools)
install_github("cdriveraus/stanoptimis")
```

The most obvious shortcoming at the moment is the lack of any summary-type function.

Example:

```r
library(rstan)
library(stanoptimis)

scode <- "
parameters {
  real y[2];
}
model {
  y[1] ~ normal(0, 1);
  y[2] ~ double_exponential(0, 2);
}
"

sm <- stan_model(model_code = scode)
fit <- sampling(sm, iter = 10000)
summary(fit)$summary

## extract samples as a list of arrays
e <- extract(fit, permuted = TRUE)

optimis <- stanoptimis(standata = list(), sm = sm, isloops = 10, issamples = 1000, cores = 3)

apply(optimis$rawposterior, 2, mean)
apply(optimis$rawposterior, 2, sd)
isdiag(optimis)

## compare the importance-sampled posterior against the NUTS draws
plot(density(optimis$rawposterior[, 1]))
points(density(e$y[, 1]))
```

Thanks for sharing.

As with all approximate algorithm suggestions, we want to see diagnostics of how they compare to the exact fits. How are you evaluating whether this gets a good enough approximate answer?

Differential evolution tends not to scale well with dimension and importance sampling is hard to get right without a very good proposal distribution.

We’ve been looking into ideas like posterior (or interior to the algorithm) importance sampling corrections for calculating expectations with our ADVI or Laplace approximation output.

Good question – I should have clarified this. I designed it for use cases that are often loosely multivariate normal, in that the initial proposal is a t distribution based on the Hessian (or its approximation). When this initial distribution fails to cover some substantial region of posterior density, the calculations will be biased, though the bias decreases as the number of adaptive loops is increased. Some computational check of when it is ‘reasonable’ to include the adaptive loops in the final sampling procedure would be a useful addition. A diagnostic plot is included that shows the adaptation of the means and sds of the t distribution, from ‘Hessian based’ to ‘based on the last set of samples drawn’; in my experience this provides a reasonable guide. But: I learnt about importance sampling while coding this, there are no guarantees, and the primary intention is to allow quicker testing and model development. I did some reading on calculating n_eff in an importance sampling context, and it is computed, but I didn’t find it convincing enough to report anywhere. For those interested, it’s in the isdiags subobject.
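For anyone curious what that importance-sampling n_eff looks like, here is a minimal base-R sketch of the standard estimate n_eff = (Σw)² / Σw² from unnormalised log weights (the function name `is_neff` is illustrative, not part of stanoptimis):

```r
# Effective sample size for importance sampling:
# n_eff = (sum of weights)^2 / sum of squared weights.
is_neff <- function(logw) {
  logw <- logw - max(logw)   # stabilise before exponentiating
  w <- exp(logw)
  sum(w)^2 / sum(w^2)
}

# Equal weights recover the nominal sample size; a single dominant
# weight collapses n_eff towards 1.
is_neff(rep(0, 1000))        # exactly 1000
is_neff(c(0, -10, -10))      # close to 1
```

Equal weights give exactly the nominal sample size, and the estimate drops as the weight distribution becomes more skewed, which is what the adaptive loops are trying to avoid.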

I primarily wanted this for use with an unscented Kalman filter and stochastic differential equations – the dimensionality is not too high (so differential evolution can be useful, though this step is optional) but each step is computationally heavy.

Would be happy to take suggestions, happy to see further development, or also happy to see a general Stan tool make this redundant! :)

You can calibrate standard errors empirically using simulation. Then you can work back from that to effective sample size.

That makes a big difference!

You’ll have to ask the experts about that! We are working on these kinds of algorithms, both max marginal posterior mode approximations and variational approximations (approximate marginal mean rather than mode based).

Bob – I don’t understand the n_eff comment. How should one calibrate standard errors using simulation when the true posterior is unknown? With MCMC or the like, if we have a chain where the samples are tightly clustered at a particular point, we can compare to other chains to assess whether this reflects the ‘true’ posterior. But with IS, and only one chain as such, this is not possible.

Re dimensionality – I don’t see why this should affect the performance of the approach, so long as the MAP is informative (e.g., hierarchical structures will only work if the lower level is integrated over). Do you expect otherwise?

I meant you can work backwards from the errors you get to what the `n_eff` is. That does assume knowing the true posterior sd, but presumably you can estimate that well with something like NUTS.
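A minimal sketch of that calibration idea, with plain Monte Carlo draws standing in for the approximate algorithm and an assumed-known posterior sd (in practice, estimated with NUTS): repeat the fit, take the empirical sd of the estimates as the standard error, and back out n_eff = (posterior sd / se)².

```r
set.seed(2)
true_sd <- 1   # posterior sd, here known by construction

# Repeat the "approximate fit" many times; each run returns a
# posterior-mean estimate from 100 draws.
estimates <- replicate(2000, mean(rnorm(100, 0, true_sd)))

se <- sd(estimates)          # empirically calibrated standard error
n_eff <- (true_sd / se)^2    # implied effective sample size
n_eff                        # near 100 for i.i.d. draws
```

For correlated or weighted draws, n_eff would come out below the nominal sample size, which is exactly the gap this calibration is meant to expose.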

Dimensionality is always an issue. I wrote a case study on the “curse of dimensionality” to help people develop intuitions.

Ensemble methods (like differential evolution) scale terribly with dimension (I wrote a blog post on this topic). Importance sampling is also very difficult to scale in dimensionality because it gets harder and harder to get reasonable proposal distributions.


Thanks for getting back on this.
You could use the Pareto-k diagnostic and the n_eff computation presented in the Pareto smoothed importance sampling (PSIS) paper.
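Both are available from the loo package’s `psis()` function. A minimal sketch, assuming loo is installed and using synthetic log ratios in place of real log(target) − log(proposal) values:

```r
library(loo)

set.seed(4)
# Stand-in for log importance ratios: log target density - log proposal density.
log_ratios <- rnorm(1000)

ps <- psis(log_ratios, r_eff = NA)
ps$diagnostics$pareto_k   # k < 0.7 suggests the IS estimate is reliable
ps$diagnostics$n_eff      # PSIS effective sample size
```

The Pareto-k value directly addresses the earlier question of when it is ‘reasonable’ to trust the importance-sampled estimates.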