Gaussian process roadmap

bbbales2 · August 14, 2018, 3:07pm

Hopefully I don’t create too much misguided work here. If I do, we can close the thread.

But I suggested to @anon79882417 it’d be useful to have a roadmap for the Gaussian process stuff he’s working on, so it’d be easier for a random review (like myself!) to hop in and understand where his pull request fit into the big picture.

Summoning all concerned parties @Bob_Carpenter, @seantalts, @anon79882417

Since it doesn’t exist now, I don’t think it needs to appear overnight, but if there were examples of a Good Roadmap or something it could be helpful.

I should also maintain one on the adj_jac_apply thing @Bob_Carpenter and I are working on, but it’s easier to ask other people to do stuff haha.

seantalts · August 14, 2018, 3:26pm

I’d want to know things like:

What are the use cases or problems you’re trying to solve?
What is the solution you’re looking at and how does it break down into chunks? Where does your current PR fit into that?
How will this impact users?
How will this impact other developers?

Some of these can probably be answered very quickly but it’s nice even if you just say “nothing will change for developers” or something. Some of us are working on a design review policy and we should incorporate some of these questions, but it’s really a work in progress right now, sorry!

anon75146577 · August 14, 2018, 5:22pm

@avehtari and I did this a while back for sparse matrices sparse_matrix_wishlist.pdf (238.8 KB)

avehtari · August 14, 2018, 9:14pm

And despite the name in the end it discusses also GPs. I can make a draft for my wishlist for GPs which adds a few items to what that spare_matrix_wishlist.pdf has.

anon79882417 · August 16, 2018, 8:54am

Hey all -

Here are some Wiki’s in different places related to Stan’s GP library, as well as things I’ve done or come across. I’m hoping that we can unify these documents, and come up with a concrete plan:

1.1 @rtrangucci’s Roadmap for structured linear alegebra and GP covariance functions
1.2 @betanalpha’s Adding a GP Covariance Function

We have the following kernels that have PRs or are already merged into dev: gp_cov_exp_quad: prim/rev,
gp_dot_prod_cov: prim,
gp_periodic_cov:prim/rev
gp_matern52_cov:prim,
gp_matern32_cov:prim
and then gp_exponential_cov I have on my fork w/ prim,rev, but it needs edits prior to opening a PR, and I’ve been waiting for other PR’s to go through, as this has taken much longer than expected.
We need edit the documentation in the Stan manual, because some of the example models are incorrect, see this issue: Update GP models in Stan Reference Manual. This isn’t very labor intensive, and I’m happy to edit some of them if I could find the Rmarkdown document (this will not go through without peer review, of course).
Having a GP function: For large kernels and operations (summing or element-wise multiplication of them), the memory consumption of the Autodiff stack is too large. I’ve had the idea of using a shared_memory or a some memory mapping that would share the autodiff stack among different cores.
However, this might not be necessary (or at least to pursue immediately) for a couple reasons:
a. It looks like @bbbales2’s Adj_jac_apply could in part solve the memory consumption issue. Ben - I’m seeing in bullet (1.) you mention that “we do not need to compute the jacobian of your operator, just a vector^T jacobian product”… I gone through the code and your comments in adj_jac_apply.hpp, but would you mind elaborating a bit, perhaps how could this reduce memory consumption in the context of GPs in Stan? I’m sorry - sometime I don’t understand things on the first pass.
b. Having sparse matrices could also reduce memory consumption.
Alebra functions for Sparse Matrices. See @anon75146577’s document above in section “Sparse Matrix” support. There is a fair amount of intersect with Rob’s 1.1 wiki.

It’s important to think about workflow and efficiency. In case more that one person wants to work on this, so that we can do things in parallel, we can divide the above tasks into disjoint sets. All of the below bullet points are can be developed independently:

Finish the above set of kernels in prim and rev.
Implement some of the structured linear Algebra types as in Rob’s 1.1, leaving out the specialized toeplitz output for kernels until the above set of kernels are done in ‘rev’.
Implement a GP function (may be after @bbales updates the adj_jac_apply for matrix datatypes, this probably has to be a one man job)
Sparse Matrix types and operations, following requirements in the “Sparse Matrix Wishlist” (there are notes from @Bob_Carpenter in the section entitles “A Sparse Matrix Type”, that we would need to follow, i.e. something about a map from a sparse matrix to a vector).
Implement some of the ncp's of the kernels as in 1.2 above. (but instead of using NCP can’t we just sum a dot product kernel with another kernel, and it’s this the same as incuding a linear regression part into the mean function, something like that? Or can someone explain to me the importance of developing a ncp independently?).

The rng's 1.2 I’m not so sure about. A great feature of GP’s is that we can sum/multiply kernels. And for all combinations of sums and multiplications, we’d need to have a specialization gp_pred_rng. This would create a lot of work that might only be used in specific cases, and we could probably have the predictive posteriors for a given GP kernel combination done automatically if we develop a stan gp function. We could also specialize the matrix computation (i.e. if it’s sparse or a toeplits type) within the function, once we develop some more of the structured linear algebra functions. So I don’t think making gp_pred_rng’s for each kernel would be a good use of time.

Anyway, I think priority is finishing up initial set of kernels, sparse matrix type, and making sure we don’t have memory issues with large kernels.

thoughts??

bbbales2 · August 16, 2018, 9:21am

This does help! Some small comments now. I’ll make some more later but I wanna do some other coding.

Haha, it is the Way of The Pull Request around here. Good practice for the future, hopefully.

adj_jac_apply one thing that might be misleading about that – I keep saying it only creates one vari on the autodiff stack. For N outputs, I mean it creates one vari on the chaining autodiff stack and N - 1 on the nonchaining stack. So in itself isn’t any more efficient with memory on the outputs.

Where it could save memory is in avoiding all the temporary varis that might be created on the way.

Like:

x - y + z

Will create an intermediate vari for x - y before it creates the one for x - y + z. If you do the autodiff manually, you only have varis at the outputs.

Where it counts is you can avoid working with the full Jacobian of the function, which will be number of outputs * number of inputs. Normal autodiff already avoids the full Jacobian, it’s just implementing the Jacobian is usually the simplest way to think about doing custom autodiff.

So it’ll be important but is already happening in the prim stuff kinda.

Cool!

I’ve found it’s super annoying to do posterior predictives with GPs. But I haven’t spent much time thinking about what it might look like.

It makes it easier to sample the hyperparameters with small amounts of data is the basic drift of things. The best way to be convinced of this is to see it get rid of divergences in a model.

There’s a whole ton of stuff here, so we probably want to separate the long-term/short-term stuff a little and break it into pieces.

For breaking things into pieces, I’d probably do it in terms of places where I can answer the question “how I know I’m done”.

@rtrangucci, @avehtari, @anon75146577, if any of you have comments to share on this that would be good. You don’t gotta say anything either – I’m just sending beeps and boops

bbbales2 · August 16, 2018, 9:26am

Actually, I see quite a few people using GPs with Stan through brms these days. @paul.buerkner, do you have any opinions on what @anon79882417 is doing with the GPs (this post: Gaussian process roadmap)? Anything stand out to you?

paul.buerkner · August 16, 2018, 9:41am

I do agree with @anon79882417 that adding new kernels and optimizing their implementation (in terms of memory and speed) is of primary interest. This applies in particular to GPs run via brms, because all the Stan language functions that make GPs easier to use are irrelevant when Stan is just called via brms (unless these functions come with efficiency improvements, of course).

As soon as these kernels are available in rstan, I can make a few tweeks so that they can be used through brms as well.

anon79882417 · August 16, 2018, 10:10am

Every posterior predictive mean function and covariance function for latent f_* can be acquired with the following equations (for non-noisy predictions):

mean function: K_{(X_*,X)} K_{(X,X)}^{-1}f
covariance function: K_{(X_*,X_*)} - K_{(X_*, X)} K_{(X, X)} K_{(X,X_*)}.

Where X_* is test set X is train, and K_{(X, X*)} is covariance matrix of test and train.
In gp_pred, you can take the code in the manual, and replace the cov_exp_quad with the kernel you’re using. I hope that clarifies things. But this brings me to a point: having pred_rngs for each covariance function is not really useful because then we can’t use them when we’re summing kernels. For similar reasons (i.e. including dot product as the mean function), I don’t see why we should include ncp versions of the kernels. Too specialized, won’t get used enough. The labor/usefullness tradeoff isn’t very good.

~~
With that said, after a conversation w/ @avehtari that focuses in on stuff that’s more relevant:

We should finish impementing the specialized kernels in the special cases of the matern kernel.

It’s probably most important to first develop a functor type for GPs in Stan, that focuses on the memory issues (instead of using autodiff to create expression trees for each element in the matrix we should use a matrix exponential). And this function can also take care of predictive rngs as well.

The toeplits and basis stuff is for 1D time series with even spaced time points and for 1 or 2D GPs, so this will not have high impact and therefore has less priority.

We want something with more flexibility. So in summary: finish specializing the kernels and then develop a functor.

avehtari · August 16, 2018, 12:38pm

My personal opinion about use of GPs in Stan

Small to moderate size data (where uncertainty quantification is important).
Non-linear models with implicit interactions (See, e.g. leukemia example in BDA3. These are difficult or infeasible to do with splines etc.).
Hierarchical non-linear models and non-linear latent functions as modular part of bigger models (these limit which speed-up approximations can be used)
GAMs with GPs (easier to set priors than for splines)
Flexibility by allowing user defined covariance functions written as Stan functions (functor approach for building covariance matrix)
Laplace method for integrating over the latent values (this will make inference much faster, but it’s applicable only for restricted set of models)

Especially

We shouldn’t compete with specialized GP software that scale to bigger data (with less worry about uncertainty in covariance function parameters), but have restriction in model structure. That means we are not in a hurry to implement some complicated speed-up approximations which can be used only for very restricted models.
For many spatial and spatio-temporal data it’s better to use Markov models with sparse precision matrices as discussed in “Sparse Matrices for Stan” document (type of models commonly used with INLA software)
Wishlist for dense matrices from “Sparse Matrices for Stan” document lists some things which are already in progress
- Parallel (GPU and/or MPI) dense linear algebra
- General matrix algebra (sum, product, kronecker, matrix_divide, determinant) with reverse-mode derivatives
- Cholesky and derivative on GPU and general parallel
- See Stan manual Section 42.2 (Matrix arithmetic operations), 42.13 (Linear Algebra Functions and
  Solvers)
- Maybe also 42.4 (Elementwise Functions), 42.5 (Dot Products and Specialized Products)
The internal covariance matrix functions @anon79882417 is working on will make some non-linear models with implicit interactions, hierarchical non-linear models and non-linear latent functions as modular part of bigger models faster.
The Laplace method Charles is working in will make inference for combination of GP prior on latent function and log-concave likelihood faster.
A basis function representation of GPs Michael and Gabriel are working on will make GAMs with GPs, and 1D GPs as part of bigger models faster.
The covariance matrix approach (cov_exp_quad etc) is very limited in flexibility and turns out to use also lot of memory if many covariance matrices are combined- That’s why the wishlist has “Functor for GP specification”. “Sparse Matrices for Stan” lists
- Example would be f ~ GP(function cov_function, tuple params_tuple, vector mean, matrix locations ) for the centred parameterisation.
- For the non-centred parameterisation, you would need the back-transform f = GP_non_centered_transform(vector non_centered_variable, function cov_function, tuple params_tuple, vector mean, matrix locations ).
- A GP_predict() function that takes the appropriate arguments and produces predictions.
- Implemention would populate the matrix, do the Cholesky and compute the appropriate quantities
- See also Question about autodiff for potential GP covfun implementation

Note that my wishlist doesn’t have yet such speedup approximations as inducing point approaches, as they are quite complicated to implement and use.

anon75146577 · August 16, 2018, 4:04pm

This I don’t agree on but yes to everything else.

avehtari · August 16, 2018, 4:09pm

Do you disagree “GAMs with GPs” or just “easier to set priors than for splines”?

anon75146577 · August 16, 2018, 4:15pm

The bit on priors. Spines should have linear computational cost and only one tuning parameter. GPs scale cubicly and have 2.

bbbales2 · August 16, 2018, 5:55pm

It sounds like for the short term the GP development is all about efficiently doing the basic GP stuff.

@anon79882417, if you want to try to figure out how much memory things are using in a more granular way than just checking memory usage of the process, you can check the number of varis being allocated on the autodiff stack.

There are two big types in the autodiff system, vars and varis. Vars get shoveled around everywhere, and they can take up lots of space, but varis are the things that matter. @Bob_Carpenter calls it a pointer to implementation design pattern, but every var points to a certain vari (multiple var variables could point at the same vari) and it’s the varis that hold the values and adjoints that matter for autodiff.

At the end of evaluating a log probability, the only vars that might actually be left sitting around are the lp var and the ones on your parameters. Everything in the middle gets tossed, and it all comes down to the varis that were created in the process.

There are three types of varis, and they’re stored here: https://github.com/stan-dev/math/blob/develop/stan/math/rev/core/autodiffstackstorage.hpp#L49

You can get information about the number of varis sitting around with code like:

Eigen::Matrix<stan::math::var, Eigen::Dynamic, Eigen::Dynamic> x(2, 2), y(2, 2);
x << 2.0, 1.0, 0.0, -1.0;

std::cout << "Chaining stack: " << stan::math::ChainableStack::instance().var_stack_.size() << std::endl;
std::cout << "Non-chaining stack: " << stan::math::ChainableStack::instance().var_nochain_stack_.size() << std::endl;
y = stan::math::adj_jac_apply<MatrixSinFunctor>(x);
std::cout << "Chaining stack: " << stan::math::ChainableStack::instance().var_stack_.size() << std::endl;
std::cout << "Non-chaining stack: " << stan::math::ChainableStack::instance().var_nochain_stack_.size() << std::endl;

The output is:

Chaining stack: 4
Non-chaining stack: 0
Chaining stack: 5
Non-chaining stack: 4

This is showing that the high level parameters that we’re going to autodiff each have a vari and then adj_jac_apply itself creates 5 more. One that goes on the chainable autodiff stack and four that go on the non-chaining stack.

The difference in the chaining/non-chaining stack is more to do with performance, but chaining + non-chaining varis should mostly determine the memory usage of your program when using basic autodiff (things implemented in prim only).

What I said is misleading if you consider functions with custom reverse mode autodiff. Those varis are also allowed to allocate memory in the memalloc_ variable here: https://github.com/stan-dev/math/blob/develop/stan/math/rev/core/autodiffstackstorage.hpp#L52

But if you wanted to understand memory usage in different kinds of GP kernels, my suggestion would be write some code and just check before and after how many varis there are. You can also watch how much stuff gets allocated in memalloc_, but varis get allocated in there so it’ll be a little convoluted. That’s gonna tell you a lot.

You should be able to account for all the varis that get created, but it might be a bit finicky. Probably worth the time though if you want to optimize the memory usage.

The stuff Aki is talking about here (Question about autodiff for potential GP covfun implementation) is workable, for sure, but there’s probably value in getting all the basic GPs you listed in your #1 above implemented in prim and rev first?

avehtari · August 16, 2018, 7:51pm

No worries, mate. In 1D we use GPs with linear computational cost and we think we are able to do some fancy things with GPs better than with splines, but we’ll report more about that later. That was just now a placeholder in the roadmap.

anon75146577 · August 16, 2018, 8:38pm

So do I. But my experience is that the range parameter has a tendency to be effectively infinite, so a spline with a sum/integrate-to-zero constraint is fine (given that that’s effectively what you’re fitting with a Markov GP and a large range parameter)

Unless you were trying to sound passive aggressive here, probably avoid the Aussie vernacular.

avehtari · August 16, 2018, 8:42pm

No Markov. Let’s talk more in Helsinki.

Thanks for the info. I’ll keep learning.

bbbales2 · August 17, 2018, 5:20am

Is this something that should be part of the short term GP goals? Linear vs. cubic seems like a big deal.

Is like the stuff @Bonnevie is talking about here: Using GP for autocorrelation in a time-series model: memory pressure, GP kernel definition - #15 by Bonnevie ?

bbbales2 · August 17, 2018, 6:01am

Oh I guess this goes back to the small/medium models + flexibility thing. I guess the linear time GP stuff is kindof specialized.

avehtari · August 17, 2018, 8:07am

It is in the roadmap list Gaussian process roadmap - #10 by avehtari

No. We are testing a specific basis function approach.

Yes, but I think specialized solution is ok if it’s easy to implement as modular component and there is a popular model family like GAMs in this case. So we think we have a flexible linear time solution for 1D GPs, but we’ll make first Stan language implemantion case studies before thinking C++ level speedups.

Topic		Replies	Views
Adding Gaussian Process Covariance Functions Developers gaussian-process	81	7841	May 21, 2018
Making big kernels work General	34	2040	August 10, 2018
Could you please help to find out why my GP model got slower Modeling gaussian-process	12	1236	April 7, 2021
Stan on GPU: looking for model+dataset examples for empirical evaluation of speedups General	36	3449	March 5, 2018
Speeding up hierarchical models Developers	26	2521	June 5, 2020

Gaussian process roadmap

Related topics