Gaussian process roadmap

The bit on priors. Spines should have linear computational cost and only one tuning parameter. GPs scale cubicly and have 2.

It sounds like for the short term the GP development is all about efficiently doing the basic GP stuff.

@anon79882417, if you want to try to figure out how much memory things are using in a more granular way than just checking memory usage of the process, you can check the number of varis being allocated on the autodiff stack.

There are two big types in the autodiff system, vars and varis. Vars get shoveled around everywhere, and they can take up lots of space, but varis are the things that matter. @Bob_Carpenter calls it a pointer to implementation design pattern, but every var points to a certain vari (multiple var variables could point at the same vari) and it’s the varis that hold the values and adjoints that matter for autodiff.

At the end of evaluating a log probability, the only vars that might actually be left sitting around are the lp var and the ones on your parameters. Everything in the middle gets tossed, and it all comes down to the varis that were created in the process.

There are three types of varis, and they’re stored here:

You can get information about the number of varis sitting around with code like:

Eigen::Matrix<stan::math::var, Eigen::Dynamic, Eigen::Dynamic> x(2, 2), y(2, 2);
x << 2.0, 1.0, 0.0, -1.0;

std::cout << "Chaining stack: " << stan::math::ChainableStack::instance().var_stack_.size() << std::endl;
std::cout << "Non-chaining stack: " << stan::math::ChainableStack::instance().var_nochain_stack_.size() << std::endl;
y = stan::math::adj_jac_apply<MatrixSinFunctor>(x);
std::cout << "Chaining stack: " << stan::math::ChainableStack::instance().var_stack_.size() << std::endl;
std::cout << "Non-chaining stack: " << stan::math::ChainableStack::instance().var_nochain_stack_.size() << std::endl;

The output is:

Chaining stack: 4
Non-chaining stack: 0
Chaining stack: 5
Non-chaining stack: 4

This is showing that the high level parameters that we’re going to autodiff each have a vari and then adj_jac_apply itself creates 5 more. One that goes on the chainable autodiff stack and four that go on the non-chaining stack.

The difference in the chaining/non-chaining stack is more to do with performance, but chaining + non-chaining varis should mostly determine the memory usage of your program when using basic autodiff (things implemented in prim only).

What I said is misleading if you consider functions with custom reverse mode autodiff. Those varis are also allowed to allocate memory in the memalloc_ variable here:

But if you wanted to understand memory usage in different kinds of GP kernels, my suggestion would be write some code and just check before and after how many varis there are. You can also watch how much stuff gets allocated in memalloc_, but varis get allocated in there so it’ll be a little convoluted. That’s gonna tell you a lot.

You should be able to account for all the varis that get created, but it might be a bit finicky. Probably worth the time though if you want to optimize the memory usage.

The stuff Aki is talking about here (Question about autodiff for potential GP covfun implementation) is workable, for sure, but there’s probably value in getting all the basic GPs you listed in your #1 above implemented in prim and rev first?

1 Like

No worries, mate. In 1D we use GPs with linear computational cost and we think we are able to do some fancy things with GPs better than with splines, but we’ll report more about that later. That was just now a placeholder in the roadmap.

So do I. But my experience is that the range parameter has a tendency to be effectively infinite, so a spline with a sum/integrate-to-zero constraint is fine (given that that’s effectively what you’re fitting with a Markov GP and a large range parameter)

Unless you were trying to sound passive aggressive here, probably avoid the Aussie vernacular.

No Markov. Let’s talk more in Helsinki.

Thanks for the info. I’ll keep learning.

Is this something that should be part of the short term GP goals? Linear vs. cubic seems like a big deal.

Is like the stuff @Bonnevie is talking about here: Using GP for autocorrelation in a time-series model: memory pressure, GP kernel definition ?

Oh I guess this goes back to the small/medium models + flexibility thing. I guess the linear time GP stuff is kindof specialized.

It is in the roadmap list Gaussian process roadmap

No. We are testing a specific basis function approach.

Yes, but I think specialized solution is ok if it’s easy to implement as modular component and there is a popular model family like GAMs in this case. So we think we have a flexible linear time solution for 1D GPs, but we’ll make first Stan language implemantion case studies before thinking C++ level speedups.


I think we need a roadmap at a slightly lower level of granularity to understand the flow of PRs.

I would be more comfortable if we developed one of these covariance functiosn to completion—that is we work depth first on the first one, then when that one’s done to everyone’s satisfaction, model the remaining ones after the first one.

It goes through with PRs just like everything else. The doc sources are in stan-dev repo under src/docs/.

No worries—it took me a few passes to get it into my head what the abstraction was and then several more passes to get it implemented.

This we can do moving forward. It’ll be easier if we get functional types and lambdas into Stan itself.

Same with a lot of the other things and ragged arrays.

It’ll probably be at least a year until we get either of those—@mitzimorris will be working on them after the refactor gets finished (hopefully soon).

Thanks for laying out how to probe memory. This calculates the amount of memory used in our arena, which is the relevant location.

A var is a simple pointer to implementation. Because there are no virtual methods, it takes up exactly one pointer (8 bytes everywhere we run). Also, we only use var on the stack—they’re never allocated on the heap.

The vari on the other hand, all have a vtable pointer (virtual chain() method means it’s virtual), a value and an adjoint, so are minimum 24 bytes and are allocated in our arena.

what about gp_exp_quad_cov, or gp_periodic_cov? this has prim and rev implementations completed.

The reason we need a GP function is so that we have function that manages the autodiff memory arena for a given combination of kernels (meaning sums, or element-wise multiplication of them).

Thanks for clarifying. I didn’t realize any of them were done yet. I thought there were still basic changes being made to the framework.

Does this mean we can expect one PR per covariance function going forward or are there more low-level pieces that need to go in?

I didn’t follow enough of the details in the original. What is shared_memory? You mean something built on top of MPI? Is there a signature for what the GP function would like like from a Stan user perspective? I find the best way to motivate features is to write out their functionality from a client perspective. Here, that’d be someone in the math lib or writing a Stan model.

We’ll do one PR for the set of matern kernels, and the the dot product, so we have this set of kernels, as this will be easily integrated with brms.

So Stan’s autodiff stack is allocated on the heap. And the heap is all RAM, allocated on CPUs. If I have a process that consumes a bunch of memory, it’s up to the application/software to allocate the memory it between cores correct (at least in C++/C)? and this is the purpose of MPI? My understanding of MPI was that it parallelizes log probability evals. But can I instead use MPI to share memory of models that consume lots of memory? Anything I’m not understanding?

MPI is mostly for sharing data between processes that aren’t running in a shared memory architecture (where all the cores can read each other’s memory). You can do MPI on a shared memory computer though, no problem.

Technically yeah, you can use things like MPI to manage on multiple computers more data than would fit in a single computer’s memory, but Stan’s MPI is more for taking advantage of parallel calculations instead of memory.

With regards to memory efficiency. Just make the small problems as efficient as you can, and choose a solution that scales well. Then whatever the Maximum Sized System that works is the Maximum Sized System that works.

1 Like

That’s right. Though hopefully it will mostly spend its time in cache, as RAM is super expensive compared to memory closer to the CPU.

Depends what you’re doing. Memory is usually shared by the cores on a single CPU. But all memory won’t be equally close to all cores in all cases and there’s contention if there’s too much memory traffic.

No. The purpose of MPI is to support multi-process parallelism as opposed to threading, which deals with multiple threads within a single process.

In Stan, we’ve implemented the map_rect function for map-reduce. The typical use case is to parallelize the likelihood as that’s the bit of the log density that scales with data.

This isn’t going to help with scalability in Stan (yet, anyway), because the memory all gets loaded into a single process before being distributed to the cores doing the MPI computations. This does help with memory locality per core, though, which is a big deal.

Is there a proof of concept to show that a specialised implementation of a sum/elementise product GP is faster than specialised implementations of the parts + the ordinary autodiff for the sum? (NB add doesn’t have a specialised mat reverse mode and this would be much easier to add)

1 Like

This will be trivial with the new cool @bbbales2 stuff (the Jacobian is the identity). I will hold him down and force him to teach me how it works next week. (He’s one of the handfull of stan-devs I haven’t met IRL and I am excited)

I think you have missed part of the discussion and “GP function” here is not about specialised but very flexible and generic, but that is not happening soon

and thus this is reasonable step before “covariance function” approach.

No I got that. But I’m not sure where the speed saving is in a functor that only composes density evaluation but not gradients. I suspect for combinations of models that are currently coded up you won’t get any sort of speed up with this.

We could do a “banded GP” implementation for GPs that give banded covariance structures. That would be fast and if their sums and products are computed in the Stan program (and the rev mode functions are appropriately specialized) and the resulting density could be passed in.

Remember the GP wiki: We need construction and explicit derivatives for it to be fast. And in some cases (like banded matrices) we can specialize the non-centering (and its inverse) to be efficient as well.

But this seems to be only talking about constructing the covariance matrix. Alone, that is not a saving. (Or, perhaps, it’s not obvious to me where the saving is)

Birthday GP fails now as the expression tree uses too much memory when summing and multiplying covariance matrices. It’s possible that I’m mistaken, but it seems the functor approach would save memory. The main benefit would also be flexibility, but I would assume we could get similar thing as with ODEs that ODE can be defined as a user defined function and the speed up comes from having specialized derivative computations for ODE solvers.

Let’s talk more next week

My guess would be that this will be fixed with the specialization of sum and elementwise product. We can probably check it next week.

They compute the exact derivative while solving the ODE (because maths). There’s no way to do that with GPs. So I can’t see how it wouldn’t do a hideously inefficient element by element differentiation