Possible INLA optimization step concerns, sparsity requirements, and Stan features for large gaussian process inference

optimization
mcmc

#1

Sorry for the long description but I promise (that I believe) this is going somewhere:

I am trying to infer parameters from a large gaussian process (or maybe this should be called a field, I’m not sure) model with correlated tasks. The total number of tasks is something we can prioritize and include a few dozen, few hundred, or ideally all of it (which would be a few thousand), and the number of observation for each task is in the tens, so inference involves a Cholesky decomposition of a matrix with a few to several thousand elements in each dimension (as well as sampling the latent gaussian function to use a negative binomial likelihood).

As a consequence, running 10k Stan iterations is taking well over a week (or several weeks) depending on whether I run it with all subgroups (for instance to make them share the binomial dispersion) or not, so I’m in need of ways to speed up Stan, or use approximate methods like INLA.

To that end I’ve been reading Rue & Martino (2009). The main limitation of INLA seems to be that the model must be a gaussian latent field for the Laplace approximation of \tilde{\pi}(x_i|\theta,y) to be reasonable, but since gaussian processes are the prototypic LGF this doesn’t seem to be a problem to me right now. Nevertheless several questions came up, so finally here they are:

  1. INLA requires a maximum-likelihood-like optimization step to find the mode of \tilde{\pi}(x_i|\theta,y) and \tilde{\pi}(\theta|y) around which to expand the Laplace approximation. Is this feasible for large parameter spaces?

  2. Sparsity is invoked for computational efficiency, but it doesn’t seem to be required for INLA, is that correct?

About 1., I was under the impression that was the reason why things like simulated annealing existed, and why MCMC approaches wouldn’t get stuck in the same way as regular maximum likelihood methods. And about 2. this seems to apply to any method where the model is a latent gaussian field, so if the covariance matrix happens to be sparse (it’s probably not) it would improve Stan’s performance as well.

Finally, (3.) in the case of Stan for that kind of model, what can be done except for reducing the number of samples? Is GPU acceleration already default for matrix operations, is there specific commands or customization needed?

I guess if you read until here I can only thank you for your patience as well.


#2

My understanding is that INLA works well up to about 15 parameters. But you can have a large intermediate matrix that is a function of just a few parameters.


#3

Right. That’s kind of connecting 1. and 2., but they could be unrelated too. In this case there are definitely more than 15 parameters to optimize in the first step, it’s \mathcal{O}(\#tasks) and that’s my main concern for using INLA. Because it’s a positive-semidefinite matrix maybe some linear constraints could help here, but we’re still talking thousands of parameters.
Lack of sparsity is an unbonus, but probably not a dealbreaker.


#4

INLA software can handle thousands of Gaussian latent variables, but only up to 15 hyperparameters, because INLA software uses CCD for hyperparameters. If MCMC is used for hyperparameters, then it’s possible to handle more hyperparameters. For example, GPstuff allows use of MCMC instead of CCD for hyperparameters, and @Daniel_Simpson would like to get Laplace approximation to Stan so that he could handle thousands of hyperparameters.


#5

Thanks, that makes sense, and it’s more or less what I suspected. But I’m afraid I’m in the regime of thousands of hyperparameters and gaussian variables.

Doesn’t sampling the hyperparameters defeat the purpose of doing INLA anyway, or put another way, isn’t that basically the Laplace approximation to the latent function described in machine learning literature like Rasmussen & Williams? I think they say there that either approach, as well as EP scale similarly in computational cost.
Or is this something different?


#6

It seems Stan would still be the best alternative here, so what about things that can be done in Stan to accelerate these computations, is GPU acceleration default for all matrix computations?


#7

No. In fact, none of the GPU stuff is exposed yet.


#8

Thanks. I thought that was the case.


#9

I would have a look at Template Model Builder. It basically extends the functionality of INLA to use automatic sparseness detection to do fast Laplace approximations of latent variables for arbitrary hierarchical models. It’s designed (according to the abstract) to do 10^6 random effects and 10^3 fixed effects with relative ease.

You can also plug the models directly into Stan via the tmbstan R package. So you can code up a model with N random effects vectors and mix and match how to do integration: combinations of the LA or NUTS via Stan, all without recompiling.

It seems to be gaining popularity in Ecology for spatial models like you describe, and may be worth a look. The bummer is you will have to rewrite the model in the C++ template language.

For the Stan dev team this might be a good way to test the LA vs NUTS before going to all the work of coding it up. We do this for a few models in the paper linked above, and found that full NUTS integration was faster (minESS/time) for sampling from fixed effects, but these models were fairly small. I think a more thorough exploration would be worthwhile, particularly on random field models, and potentially very useful to the Stan team.


#10

Thanks for the heads up. That’s an interesting way to go on testing. I barely understood this Laplace approximation stuff when I went to the ADMB/TMB workshop.

Indeed, a lot of the art to these things seems to be in adaptation and implementation. At least that’s what we’re finding with our variational inference (and with NUTS itself).

I’m primarily concerned about figuring out whether it can get a good enough answer at this point.


#11

@Bob_Carpenter No worries, I just wanted to bring to the attention of the Stan folks in case it can be useful.

I don’t know what you mean about tuning. What I mean is you take a model in TMB and run it through Stan in two ways: (1) with the LA turned on (thus NUTS sampling for only the fixed effects) and (2) with the LA turned off (thus sampling fixed and random effects with NUTS). It’s absolutely trivial to do – TMB does all the LA stuff without any special direction/code from the user. For instance you could try it on all the TMB examples and if it’s promising try converting a few key Stan examples to TMB.

This way you can test things like whether the LA is accurate (and in which cases) or how much it could speed up Stan models if that option were available to Stan users. For instance maybe you use the LA on 50,000 spatial random effects and NUTS for the dozen or so fixed effects and the model runs 10-100 times faster than currently in Stan. Or maybe it’s never faster and not worth putting into Stan. Both seem useful to know to me, without someone putting in the time to add it to Stan.

Again, it’s just a heads up and I’m just trying to contribute back to an open source project that I use all the time. I don’t presume to understand what goes on behind the scenes with Stan or what your priorities are.

Cheers


#12

The speed and robustness of our approximate algorithms, like ADVI, depend heavily on tuning parameters, like how many Monte Carlo draws to estimate the nested gradients of expectations. Similarly, adaptation to parameter scales (or rescaling parameters in the model) has a big impact on optimization speed for our ADVI. We’ve found the same thing when implementing black-box max marginal likelihoods.

Now I understand better how TMB works. And yes, I meant we could usefully use it to evaluate how well some of these algorithms work before trying to build them ourselves.

Thanks. This is a big priority for @andrewgelman, and since he funds a lot of us, a big priority for several of us. Some of us are skeptical of approximate methods, others are gung ho.

We’re also working with @Daniel_Simpson, one of the INLA developers, to build out some of this functionality directly into Stan where we can do it well or analytically.


#13

@Bob_Carpenter

It would be pretty cool to have as an option for Stan users. The nice thing about TMB is the whole thing is automated. You don’t have to do anything special in the model code nor known the sparseness a priori (as I understand limits INLA), just declare which parameters are random effects after compiling. In Stan it would be just like

stan(file=, data=, iter=, chains=, ...)

for full NUTS, and

stan(file=, data=, iter=, chains=, random='eta', ...)

to apply the LA and only use NUTS on the fixed effects.

As a user of both it would be really cool and useful to see this capability cross over into Stan like this. Hopefully the TMB stuff could accelerate this process. The details are in the paper.


#14

It’s a long way off. The key problem is getting the optimisation fast and stable without persistent memory.

We’ve already got experience from GPStuff as well as some specific tests that it can work, but getting it to work within the Stan Math library is different.

If you put Gaussian priors (maybe with unknown variance) on the fixed effects they can and should be marginalised out with the LA block.


#15

Is there a discussion somewhere of the persistent memory problem? We can add that in our samplers, though the math library itself should largely be stateless at the expose-functions-to-Stan level. Any state could just be saved on the side and passed in as an argument.


#16

No, what would you want to see to diagnose the memory problem? What’s the best way to go about it? Earlier I was running valgrind but just looking at total memory consumption of the program.

I think the best way would just start with allocation of the kernels and autodiff stack for that… is that a good place to start?


#17

This has come up a couple times before (Algebraic sovler problems (Laplace approximation in Stan) [this is @Daniel_Simpson] , Is it possible to access the iteration/step number inside a Stan program?) for people who are solving optimization problems in Stan. I guess add to that list any use of the algebraic solver as well.

You want to use a good guess so your optimization runs quickly, and the natural good guess in this situation is the last solution you found. In the 2nd thread above Bgoodrich points out you can hack this in with some C++ and memory storage.

I don’t know for sure, but my guess is this is very important.

Lemme quickly point out that these optimizations are done every leapfrog step, so you really do expect your last solution to be reasonably close to your next one (it’s not going to be as big a difference as the solutions between different draws).

It’s going to be easier to change Stan Math to work with math than expecting math itself to adapt to the limitations of Stan Math :P. Bring it up in a meeting (or make an issue or a Discourse post) and get it on the problem radar


#18

I don’t even know what the problem is, so it’s hard to suggest a diagnosis. By “persistent memory”, it sounded like @Daniel_Simpson wanted to store things between log density evaluations, but I wans’t sure. That seems to be confirmed by @bbbales2’s reply.

That would be possible, but the issue would be how to do it at the math library level. You have to make sure that if there are multiple nested optimizations, each gets the right initialization. So you can’t just store it on the function itself without sensitivity to where it’s used in Stan.

And yes, this should be easier in stan than in math. We’d want to set up some kind of cache.


#19

Huh, that is tricky, especially in the context of parallelization. I guess this goes back to Stan being general purpose and technically optimizations could come and go at every log density evaluation. Should talk about this at Stancon.


#20

Sorry - I’ve been away from the discourse for a couple of days.

Yes, the problem is that to make the evaluation of the log-density and it’s derivatives efficient (and not dependent on the “worst case” optimization time across the whole parameter space), we need the option to use the optimum from the previous evaluation to as a starting point for the optimization in the new log-density evaluation.

Storing a static vector would work, but I’m not sure how to do it in the case where a Stan program calls the same likelihood function twice. There are reasons to do this (especially when combining data from two different sources). In that case it would naturally use the same static vector, which wouldn’t help.