Possible INLA optimization step concerns, sparsity requirements, and Stan features for large gaussian process inference

It seems Stan would still be the best alternative here, so what about things that can be done in Stan to accelerate these computations? Is GPU acceleration the default for all matrix computations?

No. In fact, none of the GPU stuff is exposed yet.

Thanks. I thought that was the case.

I would have a look at Template Model Builder. It basically extends the functionality of INLA to use automatic sparseness detection to do fast Laplace approximations of latent variables for arbitrary hierarchical models. It’s designed (according to the abstract) to do 10^6 random effects and 10^3 fixed effects with relative ease.

You can also plug the models directly into Stan via the tmbstan R package. So you can code up a model with N random effects vectors and mix and match how to do integration: combinations of the LA or NUTS via Stan, all without recompiling.

It seems to be gaining popularity in Ecology for spatial models like you describe, and may be worth a look. The bummer is you will have to rewrite the model in the C++ template language.

For the Stan dev team this might be a good way to test the LA vs NUTS before going to all the work of coding it up. We do this for a few models in the paper linked above, and found that full NUTS integration was faster (minESS/time) for sampling from fixed effects, but these models were fairly small. I think a more thorough exploration would be worthwhile, particularly on random field models, and potentially very useful to the Stan team.


Thanks for the heads up. That’s an interesting way to go on testing. I barely understood this Laplace approximation stuff when I went to the ADMB/TMB workshop.

Indeed, a lot of the art to these things seems to be in adaptation and implementation. At least that’s what we’re finding with our variational inference (and with NUTS itself).

I’m primarily concerned about figuring out whether it can get a good enough answer at this point.

@Bob_Carpenter No worries, I just wanted to bring to the attention of the Stan folks in case it can be useful.

I don’t know what you mean about tuning. What I mean is you take a model in TMB and run it through Stan in two ways: (1) with the LA turned on (thus NUTS sampling for only the fixed effects) and (2) with the LA turned off (thus sampling fixed and random effects with NUTS). It’s absolutely trivial to do – TMB does all the LA stuff without any special direction/code from the user. For instance you could try it on all the TMB examples and if it’s promising try converting a few key Stan examples to TMB.

This way you can test things like whether the LA is accurate (and in which cases) or how much it could speed up Stan models if that option were available to Stan users. For instance maybe you use the LA on 50,000 spatial random effects and NUTS for the dozen or so fixed effects and the model runs 10-100 times faster than currently in Stan. Or maybe it’s never faster and not worth putting into Stan. Both seem useful to know to me, without someone putting in the time to add it to Stan.

Again, it’s just a heads up and I’m just trying to contribute back to an open source project that I use all the time. I don’t presume to understand what goes on behind the scenes with Stan or what your priorities are.

Cheers

The speed and robustness of our approximate algorithms, like ADVI, depend heavily on tuning parameters, like how many Monte Carlo draws to estimate the nested gradients of expectations. Similarly, adaptation to parameter scales (or rescaling parameters in the model) has a big impact on optimization speed for our ADVI. We’ve found the same thing when implementing black-box max marginal likelihoods.

Now I understand better how TMB works. And yes, I meant we could usefully use it to evaluate how well some of these algorithms work before trying to build them ourselves.

Thanks. This is a big priority for @andrewgelman, and since he funds a lot of us, a big priority for several of us. Some of us are skeptical of approximate methods, others are gung ho.

We’re also working with @anon75146577, one of the INLA developers, to build out some of this functionality directly into Stan where we can do it well or analytically.

@Bob_Carpenter

It would be pretty cool to have as an option for Stan users. The nice thing about TMB is that the whole thing is automated. You don’t have to do anything special in the model code, nor know the sparsity structure a priori (which, as I understand it, limits INLA); you just declare which parameters are random effects after compiling. In Stan it would be just like

stan(file=, data=, iter=, chains=, ...)

for full NUTS, and

stan(file=, data=, iter=, chains=, random='eta', ...)

to apply the LA and only use NUTS on the fixed effects.

As a user of both it would be really cool and useful to see this capability cross over into Stan like this. Hopefully the TMB stuff could accelerate this process. The details are in the paper.

It’s a long way off. The key problem is getting the optimisation fast and stable without persistent memory.

We’ve already got experience from GPStuff as well as some specific tests that it can work, but getting it to work within the Stan Math library is different.

If you put Gaussian priors (maybe with unknown variance) on the fixed effects they can and should be marginalised out with the LA block.

Is there a discussion somewhere of the persistent memory problem? We can add that in our samplers, though the math library itself should largely be stateless at the expose-functions-to-Stan level. Any state could just be saved on the side and passed in as an argument.


No, what would you want to see to diagnose the memory problem? What’s the best way to go about it? Earlier I was running valgrind but just looking at total memory consumption of the program.

I think the best way would be to just start with allocation of the kernels and the autodiff stack for that… is that a good place to start?

This has come up a couple of times before (Algebraic solver problems (Laplace approximation in Stan) [this is @anon75146577], Is it possible to access the iteration/step number inside a Stan program?) for people who are solving optimization problems in Stan. I guess add to that list any use of the algebraic solver as well.

You want to use a good guess so your optimization runs quickly, and the natural good guess in this situation is the last solution you found. In the 2nd thread above Bgoodrich points out you can hack this in with some C++ and memory storage.

I don’t know for sure, but my guess is this is very important.

Lemme quickly point out that these optimizations are done every leapfrog step, so you really do expect your last solution to be reasonably close to your next one (it’s not going to be as big a difference as the solutions between different draws).

It’s going to be easier to change Stan Math to work with the math than to expect the math itself to adapt to the limitations of Stan Math :P. Bring it up in a meeting (or make an issue or a Discourse post) and get it on the problem radar.


I don’t even know what the problem is, so it’s hard to suggest a diagnosis. By “persistent memory”, it sounded like @anon75146577 wanted to store things between log density evaluations, but I wasn’t sure. That seems to be confirmed by @bbbales2’s reply.

That would be possible, but the issue would be how to do it at the math library level. You have to make sure that if there are multiple nested optimizations, each gets the right initialization. So you can’t just store it on the function itself without sensitivity to where it’s used in Stan.

And yes, this should be easier in stan than in math. We’d want to set up some kind of cache.

Huh, that is tricky, especially in the context of parallelization. I guess this goes back to Stan being general purpose: technically, optimizations could come and go at every log density evaluation. We should talk about this at StanCon.

Sorry - I’ve been away from the discourse for a couple of days.

Yes, the problem is that to make the evaluation of the log-density and its derivatives efficient (and not dependent on the “worst case” optimization time across the whole parameter space), we need the option to use the optimum from the previous evaluation as the starting point for the optimization in the new log-density evaluation.

Storing a static vector would work, but I’m not sure how to do it in the case where a Stan program calls the same likelihood function twice. There are reasons to do this (especially when combining data from two different sources). In that case it would naturally use the same static vector, which wouldn’t help.

I’m not sure when everyone gets to Helsinki, but if you want to meet up and talk about this over coffee on Tuesday afternoon, I’m keen.

And just to triple-spam the thread (sorry!) this ability to store doubles (or vectors of doubles) during a likelihood evaluation and then read those stored values during another likelihood evaluation would be extremely useful for the algebraic solver in general. It would both speed it up and probably stabilize it.

We never ended up talking about this, but I only came up with two kinda rudimentary designs:

  1. Pass in a “key” value to every optimization in the program. This key value is used internally for iteration-to-iteration storage.

So something like:

model {
  for(n in 1:N) {
    y = optimize(..., n); // The model specifies an identifier for each optimization
  }
  ...
}

So your optimize function might look like:

auto optimize(auto a, ..., int i) {
  Eigen::VectorXd guess;

  load_guess(i, guess);
  ...
  save_guess(i, guess);
}

Where that map from ids -> initial guesses would live is up in the air, but I figure it wouldn’t be too difficult to nail it down.

The downside to this is that it’d be hard to get these indexes right if your optimizations were buried in weird places:

functions {
  vector myfunction(real a, real b, int n) { // Ugh -- have to carry index around manually
    return optimize(..., n);
  }
}
...
model {
  ...
  for(n in 1:N) {
    // What if myfunction sometimes used 2 optimizations and sometimes 1?
    y[n] = myfunction(a, b, n);
  }
  ...
}

So that’s not great. But in any case, if a wrong index got passed in and the initial guess didn’t exist or wasn’t the right size, the regular initial conditions could be used.

  2. The other easy thing is to maintain two stacks in each leapfrog step: one that gets written to (that will be used by the next leapfrog step), and one that gets read from (which was created in the last leapfrog step). So inside the C++ maybe you have something like:

auto optimize(auto a, ...) {
  Eigen::VectorXd guess;

  pop_from_front_guess(guess);
  ...
  push_to_back_guess(guess);
}

In this way you assume the order and number of your Stan model function evaluations don’t change.

Neither thing is terribly inspirational, but either would probably cover the bulk of the use cases. I don’t think the second thing would work with threading (the first one should be do-able, just annoying). The first thing probably wouldn’t interact well with MPI though.

In either case the initial conditions to the model would be a bit weird. Maybe in terms of nomenclature they’d be more like fallback initial conditions.

In either case we could use the Parameter packs for (de)serialization stuff to make the optimization internals look nice.

The model class is designed to be immutable after construction with data.

Breaking this abstraction is going to require a lot of care because we’ve assumed for multi-threading, for example, that the model doesn’t change. Autodiff gets thread local, but the model is still immutable.

Relaxing this assumption about the model will be hard, but I think it’d be even harder at the math library level.

What exactly do you want to store from iteration to iteration and how will it be indexed so it can be found in the next iteration?

The way this got solved for MPI is to statically generate indexes for each call to MPI. But each usage is blocking there until it finishes, so we don’t need to worry about the loops—the ID is still unique and valid for MPI (at least under current usage).

The trick’s going to be managing the lifecycle of these. They can’t just accumulate statically forever or we’ll leak memory.