Speed up adaptation with Variational Approximation?

As usual I seem to be working with SLOW models, and this really makes debugging hard. ;-) I’ve been having trouble with some chains sticking in place at the initial conditions and never moving at all… where other chains seem to converge, etc. Part of it is probably poorly parameterized components to the model, but other issues are that it’s just a big model for a spatially variable dataset with lots of observations… lots of parameters etc.

Currently I’m doing an optimize step, then running the vb algorithm to get a sample, and then picking one of those samples as the initial location for a sampling statement. This at least seems to stop me from getting fully stuck.

It occurs to me though that the adaptation period for sampling could potentially be speeded up by using the variational approximation to inform the mass matrix. Is this something that has been tried by the stan team?

Right now, my model takes about 30 seconds to optimize (and I can use that point estimate to debug obvious problems, so that’s good), about 2 or 3 minutes to get 1000 samples from the vb algorithm (this also helps with finding obvious problems quickly), and then maybe a couple hours to give me 200 iterations of full Bayes, I have it do 150 warmups, and then pass the last 50 samples to me on 4 chains. This is annoyingly slow, and it’d be super cool to be able to get an initial mass matrix from VB and then run say 50 more warmups, and 50 real samples on 4 chains and have the whole thing take… 20 mins

Has anyone looked into the idea of using the VB approximation to short-cut some of the time it takes to estimate the warmup stuff?

You could read this question as a query about whether it would be useful to post a feature request allowing VB to be used for some kind of initial adaptation or if there’s a known reason why such a feature would be pointless.


All of this buisness is risky… having said that, I can understand well the motivation here. What you should try to do is to scale your parameters to be centered at 0 and have standard deviation 1. Say you have the locations as data variable loc and the scales in scale, then the dummy code for your Stan model should look like

parameters {
   real foo_raw;
transformed parameters {
   real foo;

  foo = loc + scale * foo_raw;
model {
  foo ~ normal( wahtever mean, whatever scale);

Of course, this assumes that your parameters are all on the unconstrained scale. With this trick you implicitly give Stan the mass matrix via the scale argument. As the transformation is a linear one and fixed by the data you can still assign the priors on the actual parameter and you can ignore the warnings on the Jacobian.

Such an approach did save me a ton of time already. BUT: for the final model you should probably run once more with loca=0 and scale=1 to make sure you are not tricked into a local trap of the likelihood.


1 Like

It would be super cool if ADVI wasn’t extraordinary bad at estimating variances, even for simple models. Stan’s adaptation routine is based on making the variance estimation as robust as possible to ensure the highest probability of high performance in the sampling regime and the cases the only cases where there are faster alternatives are cases that should be running extremely fast to begin with.

1 Like

I was afraid you would say that ;-)

My model is a spatial model over all 2000+ public use microdata areas in the census. It uses Fourier expansion to model the spatial average and a t distributed error parameter for each microdata area. Because it’s slow, I’m debugging on a small subset of the data so I only have 3 or 4 observations per area on average, and some have no observations at all.

It is slow to fit in part because the Fourier coefficients are not necessarily uni-modal and the fourier expansion intentionally induces nonlinear relationships between all of the 2000+ dimensions

Hey but Stan does actually fit it! It’s just slow to debug and slow to gain effective sample size.

BTW I found that optimizing the model and using the optimum point as the initial point was in fact much much better than letting Stan choose random inits, or trying to init by hand. (edit: and actually as I said, starting from a random VB sample was even better, doing that I haven’t had it get fully stuck and not move at all, which it would do about 1 out of 4 chains using random or manual inits)

If you’re using t-distributed errors AND trimming the data set to test the model you may well be ending up with multi-modal posteriors in models fit to the reduced data set. With no observations and t-distributed errors your group estimates may well get stuck in the tail (or underestimate variance). I’d switch to normal errors for the time being and add t-distributed errors as a final step, or at least do t-distributed errors with a large DF parameter (20?). Any of the above could kill adaptation.

Also for debugging it shouldn’t matter much if the model is slow since you should know after 10-30 iterations if you’ve made things better or not (check stepsize). I know in some sort of problematic models stepsize will crash and then recover to a reasonable value but it’s no thtat common.

Looking at the code, it looks like I did already switch to a normal error model, original it was t with a DOF parameter somewhere above 5 with a mode around 10. Anyway even with normal errors it gets stuck with bad inits.

I’ve tried to reparameterize so that the raw parameters have sd ~ 1, but I haven’t found that to help. The stepsize crashes down and getting a couple hundred samples takes several hours. Single gradient evals take on order 0.04 seconds.

I suspect it’s the fourier expansion that makes this all really slow and have divergent transitions etc. Probably I should rethink things, huge swaths of the country have no people at all, and then there are areas where people are squished in tight, like LA, SF, Boston, NY, and soforth. Tiny adjustments to the coefficients could have no effect on the function anywhere except in a tight region around a couple of major metro-area and suddenly you need a tiny stepsize…

That makes sense. I would work out the kinks with something like that in pieces. If the stepsize is crashing there’s almost no point in running it beyond that.

PyMC3 tries to do that. We haven’t done it because ADVI hasn’t been robust enough in most of the models we have.

If we could get to the typical set quickly, then we should be able to make Stan embarassingly parallel after that.

Yeah, quite honestly VB does a terrible job on my model. Not quite sure why. It also seems to do a terrible job on my revised model using a radial basis function expansion, but that model does seem to fit a bit faster and easier than the Fourier version.

ADVI seems to work well only on problems that are on unit scale. We’re working on the adaptation, but in the current version, ADVI has difficulty getting to the right place.

So, for posterity sake, after getting things to run a little smoother using a RBF expansion for the smooth spatial function and normal errors, I then returned to t-distributed errors, and found that Stan STRONGLY wanted to have small degrees of freedom (long tails). It refused to go above the lower bound of 5 that I’d set, so I had to unconstrain the DOF parameter and re-run, where I found that it was indeed coming out around 2.3 or so. Again, chains seem to get a little stuck and convergence issues are a problem. I will continue to debug and improve things.

Are these t-distributed errors on data or parameters?

t distributed errors on the differences between observed expense data, and a predictive formula involving spatial, temporal, and family size variables.

So you have some parameters with little to no data for spatial regions (the RBF coeff’s for low pop. regions) and t-distributed errors. I can imagine that the tails on those coefficients will end up heavy enough to be a problem but I’m not sure how to get rid of them (other than the standard non-centering etc…).

p.s.- if you could share your code for the RBF sometime I’m interested in how you implemented that.

I picked a random seed, and randomly chose 50 locations, and set xc[i],yc[i] to the x,y coords of those locations, then a parameter vector Coef are the coefficients for the RBF.

  for(i in 1:N){
    F[i] = 0; // my problem is basically zero centered already
    for(k in 1:NTerm){
      F[i] = F[i] + Coef[k] * exp(-((x[i]-xc[k])/a)^2-((y[i]-yc[k])/a)^2);
    F[i] = exp(F[i]+Err[i]);

As you can see I am modeling the logarithm. This quantity F is a multiplicative factor that adjusts a national average to the local conditions. a is a parameter that describes the scale of the RBF functions. Err is a parameter that has a normal prior that describes how much of an error there is between the smooth RBF prediction and the actual “average” results for that region. Later I use all of it like:

Data[i] ~ student_t(dof, MyFunction(F[Region[i]], …), errscale);

where Data is individual responses to surveys, and Region is the region the survey response comes from.

I upped my data sample to 40,000 survey responses, and this means that on average 12 or so responses per region, so it shouldn’t be the case that any regions are terribly data poor (zero or one or two observations for example)

In essence the RBF acts as a regularizer that intentionally makes nearby census regions cluster around some local average, where Err sets the scale over which they can vary from the local regional average, and the student_t allows for the fact that individual survey responses can be significantly outlying. see for example:


Further info for posterity.

It turns out unsurprisingly that randomly selected RBF centers don’t work that well, some of them inevitably are very close to each other, and this causes numerical issues, and then other areas have basically no points, and this causes lack of control for that region. In general, the data is pretty noisy, so the RBF doesn’t really remove a lot of the variation.

What I did was plot all the region centers, and then manually selected 18 or 20 locations spread around the map. This stabilized the RBF much more and now I get inference with about 50 warmups and 120 total samples that are quite reasonable in terms of traceplots etc.

In addition I set the error term in the regression to be t distributed on the parameter, so that the regional cost multiplier is RBF(x) + t_error(dof,0,scale) with unknown dof and scale. This t distribution is much lighter tailed than on the data, with something between 10 and 20 degrees of freedom, unlike the dof for the individual data, which winds up at about 2.3

With all these put into place, I get some fairly stable inferences with no warnings about divergent transitions, and this gives a nice plot of costs as a function of space using alpha for overplotting and a variously saturated red color as expenses climb into the 1.5-3 x the national average range.

Fitting a spatial function to noisy and irregularly spaced census data using full Bayes is no easy task!

Nice! Thanks for posting back.