Gaussian process roadmap

I think we need a roadmap at a slightly lower level of granularity to understand the flow of PRs.

I would be more comfortable if we developed one of these covariance functions to completion—that is, we work depth first on the first one, then, when that one’s done to everyone’s satisfaction, model the remaining ones after the first one.

It goes through PRs just like everything else. The doc sources are in the stan-dev repo under src/docs/.

No worries—it took me a few passes to get it into my head what the abstraction was and then several more passes to get it implemented.

This we can do moving forward. It’ll be easier if we get functional types and lambdas into Stan itself.

Same with a lot of the other things and ragged arrays.

It’ll probably be at least a year until we get either of those—@mitzimorris will be working on them after the refactor gets finished (hopefully soon).

Thanks for laying out how to probe memory. This calculates the amount of memory used in our arena, which is the relevant location.
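
For reference, here’s a minimal sketch of what that probe could look like. I’m assuming the arena allocator is reachable as ChainableStack::memalloc_ and exposes bytes_allocated(); the exact accessor has moved around between Stan Math versions, so treat the names as illustrative:

```cpp
#include <stan/math.hpp>
#include <iostream>

int main() {
  using stan::math::var;

  // Build a small expression so a few vari get allocated in the arena.
  var x = 2.0;
  var y = 3.0;
  var z = x * y + stan::math::exp(x);

  // Assumption: the arena is the static stack_alloc ChainableStack::memalloc_
  // (newer Stan Math versions route this through an instance_ pointer).
  std::cout << "value: " << z.val() << "\n"
            << "arena bytes: "
            << stan::math::ChainableStack::memalloc_.bytes_allocated() << "\n";

  stan::math::recover_memory();  // frees the arena when we're done
  return 0;
}
```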

A var is a simple pointer to implementation. Because there are no virtual methods, it takes up exactly one pointer (8 bytes everywhere we run). Also, we only use var on the stack—vars are never allocated on the heap.

The vari, on the other hand, all have a vtable pointer (the virtual chain() method makes them polymorphic), a value, and an adjoint, so they’re a minimum of 24 bytes, and they’re allocated in our arena.
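
As a quick sanity check of those numbers, here’s a minimal sketch (assuming a 64-bit build with the Stan Math headers on the include path):

```cpp
#include <stan/math.hpp>
#include <iostream>

int main() {
  // var has no virtual methods and just wraps a vari*, so it is the size
  // of one pointer: 8 bytes on the 64-bit platforms we run on.
  std::cout << "sizeof(var)  = " << sizeof(stan::math::var) << "\n";

  // vari carries a vtable pointer (chain() is virtual) plus a double value
  // and a double adjoint, so it is at least 8 + 8 + 8 = 24 bytes.
  std::cout << "sizeof(vari) = " << sizeof(stan::math::vari) << "\n";
  return 0;
}
```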

What about gp_exp_quad_cov or gp_periodic_cov? Those have prim and rev implementations completed.

The reason we need a GP function is so that we have a function that manages the autodiff memory arena for a given combination of kernels (meaning sums of them, or element-wise products of them).

Thanks for clarifying. I didn’t realize any of them were done yet. I thought there were still basic changes being made to the framework.

Does this mean we can expect one PR per covariance function going forward or are there more low-level pieces that need to go in?

I didn’t follow enough of the details in the original. What is shared_memory? You mean something built on top of MPI? Is there a signature for what the GP function would look like from a Stan user perspective? I find the best way to motivate features is to write out their functionality from a client perspective. Here, that’d be someone in the math lib or writing a Stan model.

We’ll do one PR for the set of Matérn kernels and the dot product, so that we have a set of kernels that will be easily integrated with brms.

So Stan’s autodiff stack is allocated on the heap, and the heap is all RAM, allocated on CPUs. If I have a process that consumes a bunch of memory, it’s up to the application/software to allocate the memory between cores, correct (at least in C++/C)? And is that the purpose of MPI? My understanding of MPI was that it parallelizes log probability evaluations. But can I instead use MPI to share memory for models that consume lots of memory? Anything I’m not understanding?

MPI is mostly for sharing data between processes that aren’t running in a shared memory architecture (where all the cores can read each other’s memory). You can do MPI on a shared memory computer though, no problem.

Technically, yeah, you can use things like MPI to spread more data across multiple computers than would fit in a single computer’s memory, but Stan’s MPI is more for taking advantage of parallel calculations than for memory.

With regard to memory efficiency: just make the small problems as efficient as you can, and choose a solution that scales well. Then whatever the Maximum Sized System that works is the Maximum Sized System that works.


That’s right. Though hopefully it will mostly spend its time in cache, since RAM access is super expensive compared to the memory closer to the CPU.

Depends what you’re doing. Memory is usually shared by the cores on a single CPU. But all memory won’t be equally close to all cores in all cases and there’s contention if there’s too much memory traffic.

No. The purpose of MPI is to support multi-process parallelism as opposed to threading, which deals with multiple threads within a single process.

In Stan, we’ve implemented the map_rect function for map-reduce. The typical use case is to parallelize the likelihood as that’s the bit of the log density that scales with data.

This isn’t going to help with scalability in Stan (yet, anyway), because the memory all gets loaded into a single process before being distributed to the cores doing the MPI computations. This does help with memory locality per core, though, which is a big deal.
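
To make the map-reduce shape concrete, here’s a toy sketch of the pattern (the sharding and the log_lik_shard function are made-up names for illustration; this is not the real map_rect signature):

```cpp
#include <vector>

// Hypothetical per-shard likelihood: each shard only sees its own slice of
// the data, so the data-scaling part of the log density can be evaluated in
// parallel (threads, or separate processes under MPI).
double log_lik_shard(const std::vector<double>& shard, double theta) {
  double lp = 0;
  for (double y : shard)
    lp += -0.5 * (y - theta) * (y - theta);  // toy normal log kernel
  return lp;
}

// Map: evaluate each shard independently. Reduce: sum the contributions.
double log_lik(const std::vector<std::vector<double>>& shards, double theta) {
  double lp = 0;
  for (const auto& shard : shards)  // this loop is what gets parallelized
    lp += log_lik_shard(shard, theta);
  return lp;
}
```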

Is there a proof of concept to show that a specialised implementation of a sum/elementwise-product GP is faster than specialised implementations of the parts plus the ordinary autodiff for the sum? (NB: add doesn’t have a specialised matrix reverse mode, and that would be much easier to add.)


This will be trivial with the new cool @bbbales2 stuff (the Jacobian is the identity). I will hold him down and force him to teach me how it works next week. (He’s one of the handful of stan-devs I haven’t met IRL and I am excited.)

I think you have missed part of the discussion: the “GP function” here is not about being specialised but about being very flexible and generic, but that is not happening soon

and thus this is a reasonable step before the “covariance function” approach.

No, I got that. But I’m not sure where the speed saving is in a functor that only composes density evaluation but not gradients. I suspect that, for combinations of models that are currently coded up, you won’t get any sort of speedup with this.

We could do a “banded GP” implementation for GPs that give banded covariance structures. That would be fast if their sums and products are computed in the Stan program (and the rev mode functions are appropriately specialized) and the resulting density could be passed in.

Remember the GP wiki: We need construction and explicit derivatives for it to be fast. And in some cases (like banded matrices) we can specialize the non-centering (and its inverse) to be efficient as well.

But this seems to be only talking about constructing the covariance matrix. Alone, that is not a saving. (Or, perhaps, it’s not obvious to me where the saving is)

The birthday GP fails now because the expression tree uses too much memory when summing and multiplying covariance matrices. It’s possible that I’m mistaken, but it seems the functor approach would save memory. The main benefit would also be flexibility, but I would assume we could get something similar to the ODEs, where the ODE can be defined as a user-defined function and the speedup comes from having specialized derivative computations in the ODE solvers.

Let’s talk more next week

My guess would be that this will be fixed with the specialization of sum and elementwise product. We can probably check it next week.

They compute the exact derivative while solving the ODE (because maths). There’s no way to do that with GPs. So I can’t see how it wouldn’t do a hideously inefficient element-by-element differentiation.

@anon79882417 Are you gonna be in Helsinki? You should be in on this since you’re doing the hard part.

No worries, mate. We can fix that right up

It’s like nerdy Pokemon.

We should maybe write a small Markdown document on how to use your new function with add as a case study. Because it’s far and away the easiest one :p

Yeah, I’ll be here but I had to take a Friday 8/31 flight out because it was much cheaper. I’m available through Thursday 8/30, pretty much any time.

EDIT: I just looked at the schedule, so it might be hard to find time to get together, but hopefully there is time around some of these events. If anyone’s in town earlier, Monday and Tuesday I’m around Aalto’s CS department working. Just one meeting Tuesday 10am (EST) and 17:00 (EEST)

I talked to Aki. I’m gonna come by sometime in the afternoon Tuesday. Seeyah then!


Yes, please! Analytic gradients for the NCP Choleskys will be a bit tricky but once one is done it will be much easier to transfer to other kernels and the speedup will be significant.

Combining GPU, sparse matrix, and lazy evaluation sounds great.

I wrote a quick spec as a GitHub issue. Talking with various people here, I understood something that I hadn’t before: that the functor can be autodiffed through to help compute the derivatives.

For the evaluation of the GP log-density and its gradients, the maths is easy (linked in Rasmussen and Williams). But we need to do the same math to get the derivative

\frac{\partial}{\partial \theta_k} \left( \mathbf{L}(\theta)^T \mathbf{z} \right),

where \mathbf{\Sigma}(\theta) = \mathbf{L}(\theta)\mathbf{L}(\theta)^T is the Cholesky factorization of the covariance matrix, [\mathbf{\Sigma}(\theta)]_{ij} = c(x_i, x_j), and \mathbf{z} is a constant vector. We have the derivatives of all of the constituent parts (matrix-vector product w.r.t. the matrix, Cholesky w.r.t. the matrix, matrix w.r.t. the parameters), but we need to chain them together properly.
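
For completeness, here is one standard way to chain those pieces, written as a sketch rather than anything already implemented in the math library. Differentiating \mathbf{\Sigma}(\theta) = \mathbf{L}(\theta)\mathbf{L}(\theta)^T gives the usual forward-mode rule for the Cholesky factor,

\frac{\partial \mathbf{L}}{\partial \theta_k} = \mathbf{L}\, \Phi\!\left( \mathbf{L}^{-1} \frac{\partial \mathbf{\Sigma}}{\partial \theta_k} \mathbf{L}^{-T} \right),

where \Phi(\mathbf{A}) keeps the lower triangle of \mathbf{A} and halves its diagonal. Since \mathbf{z} is constant,

\frac{\partial}{\partial \theta_k} \left( \mathbf{L}(\theta)^T \mathbf{z} \right) = \left( \frac{\partial \mathbf{L}}{\partial \theta_k} \right)^T \mathbf{z},

and [\partial \mathbf{\Sigma} / \partial \theta_k]_{ij} = \partial c(x_i, x_j) / \partial \theta_k comes directly from the kernel.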


Some “for completeness” notes for the GitHub issue (thanks to the magic of Discourse LaTeX). The notation here is not necessarily consistent with the rest of the thread but is consistent with the issue.

1. If y(x) is a GP and \mathbf{y} is a vector with [\mathbf{y}]_i = y(x_i), then

\mathbf{y} \sim N(\mathbf{0}, \mathbf{C})

for some covariance matrix \mathbf{C}.

2. The partial derivative, with respect to any parameter \theta_j used to define \mathbf{C}, of the log prior density for \mathbf{y} is (Equation 5.9 in R&W, with \mathbf{\alpha} = \mathbf{C}^{-1}\mathbf{y})

\frac{\partial}{\partial \theta_j} \log p( \mathbf{y} | x, \theta) = \frac{1}{2} \operatorname{tr}\left[ \left(\mathbf{\alpha} \mathbf{\alpha}^T - \mathbf{C}^{-1}\right) \frac{\partial \mathbf{C}}{\partial \theta_j}\right]

3. The derivative w.r.t. a component of x is

\frac{\partial}{\partial x_j} \log p( \mathbf{y} | x, \theta) = \frac{1}{2} \operatorname{tr}\left[ \left(\mathbf{\alpha} \mathbf{\alpha}^T - \mathbf{C}^{-1}\right) \frac{\partial \mathbf{C}}{\partial x_j}\right]
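
For one hyperparameter, Equation 5.9 is cheap to evaluate once you have a Cholesky of \mathbf{C}. Here’s a minimal Eigen sketch under that assumption; the function name and signature are made up for illustration:

```cpp
#include <Eigen/Dense>

// Gradient of the GP log marginal density w.r.t. one hyperparameter,
// following R&W Equation 5.9:
//   d/dtheta log p(y | x, theta) = 0.5 * tr[(alpha alpha^T - C^{-1}) dC/dtheta]
// with alpha = C^{-1} y. The name and signature here are illustrative only.
double gp_log_density_grad(const Eigen::MatrixXd& C,
                           const Eigen::MatrixXd& dC_dtheta,
                           const Eigen::VectorXd& y) {
  Eigen::LLT<Eigen::MatrixXd> llt(C);              // C = L L^T
  Eigen::VectorXd alpha = llt.solve(y);            // alpha = C^{-1} y

  // tr(alpha alpha^T dC) = alpha^T dC alpha, so we never form alpha alpha^T.
  double term1 = alpha.dot(dC_dtheta * alpha);
  Eigen::MatrixXd Cinv_dC = llt.solve(dC_dtheta);  // C^{-1} dC/dtheta
  double term2 = Cinv_dC.trace();                  // tr(C^{-1} dC/dtheta)
  return 0.5 * (term1 - term2);
}
```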