Gaussian process roadmap

Hey all -

Here are some wikis in different places related to Stan’s GP library, as well as things I’ve done or come across. I’m hoping we can unify these documents and come up with a concrete plan:

1.1 @rtrangucci’s Roadmap for structured linear algebra and GP covariance functions
1.2 @betanalpha’s Adding a GP Covariance Function

  1. We have the following kernels that have PRs open or are already merged into dev:
    gp_exp_quad_cov: prim/rev
    gp_dot_prod_cov: prim
    gp_periodic_cov: prim/rev
    gp_matern52_cov: prim
    gp_matern32_cov: prim
    I also have gp_exponential_cov on my fork with prim and rev, but it needs edits before I open a PR, and I’ve been waiting for other PRs to go through, as this has taken much longer than expected.

  2. We need to edit the documentation in the Stan manual, because some of the example models are incorrect; see this issue: Update GP models in Stan Reference Manual. This isn’t very labor intensive, and I’m happy to edit some of them if I can find the RMarkdown document (this will not go through without peer review, of course).

  3. Having a GP function: for large kernels and operations on them (summing or element-wise multiplication), the memory consumption of the autodiff stack is too large (see the sketch after this list). I’ve had the idea of using shared memory or some memory mapping that would share the autodiff stack among different cores.
    However, this might not be necessary (or at least not worth pursuing immediately) for a couple of reasons:
    a. It looks like @bbbales2’s Adj_jac_apply could in part solve the memory consumption issue. Ben - I see in bullet (1.) you mention that “we do not need to compute the jacobian of your operator, just a vector^T jacobian product”… I’ve gone through the code and your comments in adj_jac_apply.hpp, but would you mind elaborating a bit, perhaps on how this could reduce memory consumption in the context of GPs in Stan? I’m sorry - sometimes I don’t understand things on the first pass.
    b. Having sparse matrices could also reduce memory consumption.

  4. Algebra functions for sparse matrices. See @anon75146577’s document above, in the section on “Sparse Matrix” support. There is a fair amount of overlap with Rob’s 1.1 wiki.
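
To make the memory point in 3. concrete, here’s the kind of Stan program fragment I have in mind (a sketch with made-up names N, x, alpha1, rho1, alpha2, rho2, sigma, using cov_exp_quad twice as a stand-in for whatever kernels end up exposed in the language). Every intermediate N x N matrix of parameters here puts on the order of N^2 varis on the autodiff stack:

transformed parameters {
  matrix[N, N] K = cov_exp_quad(x, alpha1, rho1)
                   .* cov_exp_quad(x, alpha2, rho2)     // elementwise product: another ~N^2 varis
                   + diag_matrix(rep_vector(square(sigma), N));
  matrix[N, N] L_K = cholesky_decompose(K);
}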

It’s important to think about workflow and efficiency. In case more than one person wants to work on this, we can divide the above tasks into disjoint sets so that we can work in parallel. All of the bullet points below can be developed independently:

  1. Finish the above set of kernels in prim and rev.
  2. Implement some of the structured linear algebra types as in Rob’s 1.1, leaving out the specialized Toeplitz output for kernels until the above set of kernels is done in ‘rev’.
  3. Implement a GP function (maybe after @bbbales2 updates adj_jac_apply for matrix datatypes; this probably has to be a one-man job).
  4. Sparse matrix types and operations, following the requirements in the “Sparse Matrix Wishlist” (there are notes from @Bob_Carpenter in the section entitled “A Sparse Matrix Type” that we would need to follow, i.e. something about a map from a sparse matrix to a vector).
  5. Implement some of the NCPs of the kernels as in 1.2 above. (But instead of using the NCP, can’t we just sum a dot product kernel with another kernel - isn’t this the same as including a linear regression part in the mean function, something like that? Or can someone explain to me the importance of developing an NCP independently? See the sketch after this list.)
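
On 5., this is the comparison I have in mind (a sketch; it assumes the dot product kernel ends up exposed in the Stan language as gp_dot_prod_cov, with x_vec holding the inputs as vectors):

transformed parameters {
  matrix[N, N] K = cov_exp_quad(x, alpha, rho)
                   + gp_dot_prod_cov(x_vec, sigma0);  // smooth component + dot product component
}

Summing in the dot product kernel is equivalent to adding a linear regression component whose coefficients have normal priors and are integrated out; that’s closely related to, though not literally the same as, putting a fixed linear term in the mean function.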

The rngs in 1.2 I’m not so sure about. A great feature of GPs is that we can sum/multiply kernels, and for every combination of sums and multiplications we’d need a specialized gp_pred_rng. That would create a lot of work that might only be used in specific cases, and we could probably generate the predictive posterior for a given GP kernel combination automatically if we develop a Stan GP function. We could also specialize the matrix computation (i.e. if it’s sparse or a Toeplitz type) within that function, once we develop more of the structured linear algebra functions. So I don’t think making gp_pred_rngs for each kernel would be a good use of time.

Anyway, I think the priority is finishing up the initial set of kernels, the sparse matrix type, and making sure we don’t have memory issues with large kernels.

thoughts??

This does help! Some small comments now. I’ll make some more later but I wanna do some other coding.

Haha, it is the Way of The Pull Request around here. Good practice for the future, hopefully.

One thing that might be misleading about adj_jac_apply – I keep saying it only creates one vari on the autodiff stack. For N outputs, I mean it creates one vari on the chaining autodiff stack and N - 1 on the non-chaining stack. So in itself it isn’t any more efficient with memory on the outputs.

Where it could save memory is in avoiding all the temporary varis that might be created on the way.

Like:

x - y + z

Will create an intermediate vari for x - y before it creates the one for x - y + z. If you do the autodiff manually, you only have varis at the outputs.

Where it counts is that you can avoid working with the full Jacobian of the function, which is (number of outputs) x (number of inputs). Normal autodiff already avoids the full Jacobian; it’s just that implementing the Jacobian is usually the simplest way to think about doing custom autodiff.
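
In symbols, just to restate that (standard reverse-mode math, not specific to any kernel): for f : R^N -> R^M with Jacobian J in R^{M x N}, reverse mode only ever needs the adjoint-vector product

\bar{x}^T = \bar{y}^T J,

which can usually be formed directly in O(N + M) memory instead of the O(M N) it takes to store J. For example, for y = A x the product is just \bar{x} = A^T \bar{y}, and J = A never has to be built explicitly.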

So it’ll be important but is already happening in the prim stuff kinda.

Cool!

I’ve found it’s super annoying to do posterior predictives with GPs. But I haven’t spent much time thinking about what it might look like.

The basic drift is that it makes it easier to sample the hyperparameters with small amounts of data. The best way to be convinced of this is to see it get rid of divergences in a model.
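
For reference, the non-centered version looks roughly like the manual’s latent-variable GP (a minimal sketch; data block with N, x, y omitted). The centered version would instead declare f as a parameter and give it a multi_normal prior with covariance K:

parameters {
  real<lower=0> rho;
  real<lower=0> alpha;
  real<lower=0> sigma;
  vector[N] eta;
}
transformed parameters {
  vector[N] f;
  {
    matrix[N, N] K = cov_exp_quad(x, alpha, rho);
    for (n in 1:N)
      K[n, n] = K[n, n] + 1e-9;       // jitter for numerical stability
    f = cholesky_decompose(K) * eta;  // non-centered: f is a deterministic transform of eta
  }
}
model {
  rho ~ inv_gamma(5, 5);
  alpha ~ normal(0, 1);
  sigma ~ normal(0, 1);
  eta ~ normal(0, 1);
  y ~ normal(f, sigma);
}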

There’s a whole ton of stuff here, so we probably want to separate the long-term/short-term stuff a little and break it into pieces.

For breaking things into pieces, I’d probably do it in terms of places where I can answer the question “how do I know I’m done?”.

@rtrangucci, @avehtari, @anon75146577, if any of you have comments to share on this that would be good. You don’t gotta say anything either – I’m just sending beeps and boops

Actually, I see quite a few people using GPs with Stan through brms these days. @paul.buerkner, do you have any opinions on what @anon79882417 is doing with the GPs (this post: Gaussian process roadmap)? Anything stand out to you?

I do agree with @anon79882417 that adding new kernels and optimizing their implementation (in terms of memory and speed) is of primary interest. This applies in particular to GPs run via brms, because all the Stan language functions that make GPs easier to use are irrelevant when Stan is just called via brms (unless these functions come with efficiency improvements, of course).

As soon as these kernels are available in rstan, I can make a few tweaks so that they can be used through brms as well.

Every posterior predictive mean function and covariance function for the latent f_* can be obtained from the following equations (for non-noisy predictions):

mean function: K_{(X_*,X)} K_{(X,X)}^{-1} f
covariance function: K_{(X_*,X_*)} - K_{(X_*,X)} K_{(X,X)}^{-1} K_{(X,X_*)}

where X_* is the test set, X is the training set, and K_{(X,X_*)} is the covariance matrix between training and test inputs.
In gp_pred, you can take the code in the manual and replace cov_exp_quad with the kernel you’re using (see the sketch below). I hope that clarifies things. But this brings me to a point: having pred_rngs for each covariance function is not really useful, because then we can’t use them when we’re summing kernels. For similar reasons (i.e. including the dot product as the mean function), I don’t see why we should include NCP versions of the kernels. Too specialized, won’t get used enough. The labor/usefulness tradeoff isn’t very good.
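
For concreteness, here’s a sketch adapted from the manual’s gp_pred_rng: the three cov_exp_quad calls are the only kernel-specific lines, so you can swap in (or sum) whatever kernels the model uses without needing a specialized rng per kernel (non-noisy latent predictions; delta is a small jitter):

functions {
  vector gp_pred_rng(real[] x2, vector f1, real[] x1,
                     real alpha, real rho, real delta) {
    int N1 = rows(f1);
    int N2 = size(x2);
    vector[N2] f2;
    {
      matrix[N1, N1] K = cov_exp_quad(x1, alpha, rho);          // <- swap in any kernel or sum of kernels
      matrix[N1, N1] L_K;
      vector[N1] K_div_f1;
      matrix[N1, N2] k_x1_x2;
      vector[N2] f2_mu;
      matrix[N1, N2] v_pred;
      matrix[N2, N2] cov_f2;
      for (n in 1:N1)
        K[n, n] = K[n, n] + delta;
      L_K = cholesky_decompose(K);
      K_div_f1 = mdivide_left_tri_low(L_K, f1);
      K_div_f1 = mdivide_right_tri_low(K_div_f1', L_K)';
      k_x1_x2 = cov_exp_quad(x1, x2, alpha, rho);               // <- same kernel, train vs. test inputs
      f2_mu = k_x1_x2' * K_div_f1;
      v_pred = mdivide_left_tri_low(L_K, k_x1_x2);
      cov_f2 = cov_exp_quad(x2, alpha, rho) - v_pred' * v_pred; // <- same kernel, test inputs
      f2 = multi_normal_rng(f2_mu, cov_f2 + diag_matrix(rep_vector(delta, N2)));
    }
    return f2;
  }
}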

~~
With that said, after a conversation w/ @avehtari, here’s a focus on the stuff that’s more relevant:

We should finish implementing the specialized kernels, i.e. the special cases of the Matérn kernel.

It’s probably most important to first develop a functor type for GPs in Stan that addresses the memory issues (instead of using autodiff to create an expression tree for each element in the matrix, we should use a matrix exponential). And this function can also take care of the predictive rngs.

The Toeplitz and basis-function stuff is for 1D time series with evenly spaced time points and for 1 or 2D GPs, so this will not have high impact and therefore has lower priority.

We want something with more flexibility. So in summary: finish specializing the kernels and then develop a functor.

My personal opinion about use of GPs in Stan

  • Small to moderate size data (where uncertainty quantification is important).
  • Non-linear models with implicit interactions (See, e.g. leukemia example in BDA3. These are difficult or infeasible to do with splines etc.).
  • Hierarchical non-linear models and non-linear latent functions as modular part of bigger models (these limit which speed-up approximations can be used)
  • GAMs with GPs (easier to set priors than for splines)
  • Flexibility by allowing user defined covariance functions written as Stan functions (functor approach for building covariance matrix)
  • Laplace method for integrating over the latent values (this will make inference much faster, but it’s applicable only for restricted set of models)

Especially

  • We shouldn’t compete with specialized GP software that scales to bigger data (with less worry about uncertainty in covariance function parameters) but has restrictions on model structure. That means we are not in a hurry to implement some complicated speed-up approximations which can be used only for very restricted models.

  • For many spatial and spatio-temporal data it’s better to use Markov models with sparse precision matrices as discussed in “Sparse Matrices for Stan” document (type of models commonly used with INLA software)

  • Wishlist for dense matrices from “Sparse Matrices for Stan” document lists some things which are already in progress

    • Parallel (GPU and/or MPI) dense linear algebra
    • General matrix algebra (sum, product, kronecker, matrix_divide, determinant) with reverse-mode derivatives
    • Cholesky and derivative on GPU and general parallel
    • See Stan manual Section 42.2 (Matrix arithmetic operations), 42.13 (Linear Algebra Functions and Solvers)
    • Maybe also 42.4 (Elementwise Functions), 42.5 (Dot Products and Specialized Products)
  • The internal covariance matrix functions @anon79882417 is working on will make some non-linear models with implicit interactions, hierarchical non-linear models, and non-linear latent functions as modular parts of bigger models faster.

  • The Laplace method Charles is working on will make inference for the combination of a GP prior on a latent function and a log-concave likelihood faster.

  • A basis function representation of GPs that Michael and Gabriel are working on will make GAMs with GPs, and 1D GPs as parts of bigger models, faster.

  • The covariance matrix approach (cov_exp_quad etc.) is very limited in flexibility and also turns out to use a lot of memory if many covariance matrices are combined. That’s why the wishlist has “Functor for GP specification”. “Sparse Matrices for Stan” lists:

    • Example would be f ~ GP(function cov_function, tuple params_tuple, vector mean, matrix locations ) for the centred parameterisation.
    • For the non-centred parameterisation, you would need the back-transform f = GP_non_centered_transform(vector non_centered_variable, function cov_function, tuple params_tuple, vector mean, matrix locations ).
    • A GP_predict() function that takes the appropriate arguments and produces predictions.
    • Implementation would populate the matrix, do the Cholesky, and compute the appropriate quantities
    • See also Question about autodiff for potential GP covfun implementation

Note that my wishlist doesn’t yet have speedup approximations such as inducing point approaches, as they are quite complicated to implement and use.

This I don’t agree on but yes to everything else.

Do you disagree “GAMs with GPs” or just “easier to set priors than for splines”?

The bit on priors. Splines should have linear computational cost and only one tuning parameter. GPs scale cubically and have two.

It sounds like for the short term the GP development is all about efficiently doing the basic GP stuff.

@anon79882417, if you want to try to figure out how much memory things are using in a more granular way than just checking memory usage of the process, you can check the number of varis being allocated on the autodiff stack.

There are two big types in the autodiff system, vars and varis. Vars get shoveled around everywhere, and they can take up lots of space, but varis are the things that matter. @Bob_Carpenter calls it a pointer to implementation design pattern, but every var points to a certain vari (multiple var variables could point at the same vari) and it’s the varis that hold the values and adjoints that matter for autodiff.

At the end of evaluating a log probability, the only vars that might actually be left sitting around are the lp var and the ones on your parameters. Everything in the middle gets tossed, and it all comes down to the varis that were created in the process.

There are three types of varis, and they’re stored here: https://github.com/stan-dev/math/blob/develop/stan/math/rev/core/autodiffstackstorage.hpp#L49

You can get information about the number of varis sitting around with code like:

#include <stan/math.hpp>
#include <iostream>

int main() {
  // MatrixSinFunctor here is a user-defined adj_jac_apply functor (definition not shown)
  Eigen::Matrix<stan::math::var, Eigen::Dynamic, Eigen::Dynamic> x(2, 2), y(2, 2);
  x << 2.0, 1.0, 0.0, -1.0;

  // count varis before and after the adj_jac_apply call
  std::cout << "Chaining stack: " << stan::math::ChainableStack::instance().var_stack_.size() << std::endl;
  std::cout << "Non-chaining stack: " << stan::math::ChainableStack::instance().var_nochain_stack_.size() << std::endl;
  y = stan::math::adj_jac_apply<MatrixSinFunctor>(x);
  std::cout << "Chaining stack: " << stan::math::ChainableStack::instance().var_stack_.size() << std::endl;
  std::cout << "Non-chaining stack: " << stan::math::ChainableStack::instance().var_nochain_stack_.size() << std::endl;
}

The output is:

Chaining stack: 4
Non-chaining stack: 0
Chaining stack: 5
Non-chaining stack: 4

This is showing that the high level parameters that we’re going to autodiff each have a vari and then adj_jac_apply itself creates 5 more. One that goes on the chainable autodiff stack and four that go on the non-chaining stack.

The difference in the chaining/non-chaining stack is more to do with performance, but chaining + non-chaining varis should mostly determine the memory usage of your program when using basic autodiff (things implemented in prim only).

What I said is misleading if you consider functions with custom reverse mode autodiff. Those varis are also allowed to allocate memory in the memalloc_ variable here: https://github.com/stan-dev/math/blob/develop/stan/math/rev/core/autodiffstackstorage.hpp#L52

But if you wanted to understand memory usage in different kinds of GP kernels, my suggestion would be to write some code and just check before and after how many varis there are. You can also watch how much stuff gets allocated in memalloc_, but varis get allocated in there too, so it’ll be a little convoluted. That’s gonna tell you a lot.

You should be able to account for all the varis that get created, but it might be a bit finicky. Probably worth the time though if you want to optimize the memory usage.

The stuff Aki is talking about here (Question about autodiff for potential GP covfun implementation) is workable, for sure, but there’s probably value in getting all the basic GPs you listed in your #1 above implemented in prim and rev first?

No worries, mate. In 1D we use GPs with linear computational cost, and we think we are able to do some fancy things with GPs better than with splines, but we’ll report more about that later. That was just a placeholder in the roadmap for now.

So do I. But my experience is that the range parameter has a tendency to be effectively infinite, so a spline with a sum/integrate-to-zero constraint is fine (given that that’s effectively what you’re fitting with a Markov GP and a large range parameter).

Unless you were trying to sound passive aggressive here, probably avoid the Aussie vernacular.

No Markov. Let’s talk more in Helsinki.

Thanks for the info. I’ll keep learning.

Is this something that should be part of the short term GP goals? Linear vs. cubic seems like a big deal.

Is it like the stuff @Bonnevie is talking about here: Using GP for autocorrelation in a time-series model: memory pressure, GP kernel definition - #15 by Bonnevie ?

Oh, I guess this goes back to the small/medium models + flexibility thing. I guess the linear-time GP stuff is kind of specialized.

It is in the roadmap list Gaussian process roadmap - #10 by avehtari

No. We are testing a specific basis function approach.

Yes, but I think a specialized solution is ok if it’s easy to implement as a modular component and there is a popular model family, like GAMs in this case. So we think we have a flexible linear-time solution for 1D GPs, but we’ll first make Stan-language implementation case studies before thinking about C++-level speedups.

I think we need a roadmap at a slightly lower level of granularity to understand the flow of PRs.

I would be more comfortable if we developed one of these covariance functions to completion—that is, we work depth-first on the first one, then when that one’s done to everyone’s satisfaction, model the remaining ones after the first one.

It goes through with PRs just like everything else. The doc sources are in stan-dev repo under src/docs/.

No worries—it took me a few passes to get it into my head what the abstraction was and then several more passes to get it implemented.

This we can do moving forward. It’ll be easier if we get functional types and lambdas into Stan itself.

Same with a lot of the other things and ragged arrays.

It’ll probably be at least a year until we get either of those—@mitzimorris will be working on them after the refactor gets finished (hopefully soon).

Thanks for laying out how to probe memory. This calculates the amount of memory used in our arena, which is the relevant location.

A var is a simple pointer to implementation. Because there are no virtual methods, it takes up exactly one pointer (8 bytes everywhere we run). Also, we only use var on the stack—they’re never allocated on the heap.

The varis, on the other hand, all have a vtable pointer (the virtual chain() method means the class is virtual), a value, and an adjoint, so they are a minimum of 24 bytes and are allocated in our arena.
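
Back-of-the-envelope for the GP case (my arithmetic, using that 24-byte minimum): a dense N x N covariance matrix of parameters means roughly N^2 varis, so

24 bytes * N^2 ≈ 24 * 5000^2 bytes ≈ 600 MB

for a single intermediate kernel matrix at N = 5000, before any sums or elementwise products create further intermediates.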

What about gp_exp_quad_cov or gp_periodic_cov? These have prim and rev implementations completed.

The reason we need a GP function is so that we have a function that manages the autodiff memory arena for a given combination of kernels (meaning sums or element-wise products of them).

Thanks for clarifying. I didn’t realize any of them were done yet. I thought there were still basic changes being made to the framework.

Does this mean we can expect one PR per covariance function going forward or are there more low-level pieces that need to go in?

I didn’t follow enough of the details in the original. What is shared_memory? You mean something built on top of MPI? Is there a signature for what the GP function would look like from a Stan user perspective? I find the best way to motivate features is to write out their functionality from a client perspective. Here, that’d be someone in the math lib or writing a Stan model.

We’ll do one PR for the set of Matérn kernels, and then the dot product, so we have this set of kernels, as this will be easily integrated with brms.

So Stan’s autodiff stack is allocated on the heap. And the heap is all RAM, allocated on CPUs. If I have a process that consumes a bunch of memory, it’s up to the application/software to allocate the memory between cores, correct (at least in C++/C)? And is this the purpose of MPI? My understanding of MPI was that it parallelizes log probability evaluations. But can I instead use MPI to share memory for models that consume lots of memory? Anything I’m not understanding?