Question about autodiff for potential GP covfun implementation


#1

I start a new thread to continue the discussion from here Gaussian Process Predictive Posterior Function

I think this is specially for @Bob_Carpenter

Assume we have data x as a NxD matrix, with N the number of observations and D the number of dimensions.

We would have a user defined function

real my_gp_covfun(vector x1, vector x2, vector params)

where vector x1, and vector x2 have length D, and they would usually be two rows of x. vector params includes parameters of the covariance function.

We would have a function added in math and exposed in language

matrix gp_cov(function covfun, matrix x, vector params)

which we would call as

C = gp_cov(my_gp_covfun, x, params)

or

L = gp_cov_chol(my_gp_covfun, x, params)

(or something like this)

Inside gp_cov and gp_cov_chol we would call the user defined function, e.g. my_gp_covfun to construct the covariance matrix by calling it N(N+1)/2 times (need to compute only triangular matrix) with different row combinations of matrix x.

Question: If we would write other parts of gradient computation in C++ similar to what now is done inside cov_exp_quad, can we use autodiff to compute just the gradients of user defined function my_gp_covfun when we need the derivative of the covariance matrix with respect to one covariance function parameter in params? If this works, then we can continue with hand written matrix derivative operations to get the scalar gradient for that parameter, and then repeat for the the next covariance function parameter in params. This could lead to less memory use and possibility to use user defined covariance functions would give a lot of flexibility in defining models.


Gaussian process roadmap
Gaussian process roadmap
#2

When we’ve done this for the algebraic solvers, definite integrators, and ODE integrators, we also allowed two additional data arguments, one real[] and one int[]. I don’t know if there’s a use for that here, but it might be worth thinking about. There’s no other good way to provide data to my_gp_covfun.

Also, are the vectors x1 and x2 required to be data? If so, there’s now a data qualifier you can put in front of their variable declarations, e.g., my_gp_covfun(data vector x1, data vector x2, vector params).

These are all minor points. The whole design is fine and will work with autodiff as written, however the above choices are made.

We could use nested autodiff to compute the gradient of my_gp_covfun w.r.t. params internally if that’s what you’re asking. I don’t see from the design why you’d want to do that, though.

You’re probably going to want to think about coding this using the adjoint-Jacobian product abstraction for which @bbbales2 has an active PR on stan-dev/math. It encapsulates all the fussing with the autodiff stack to make sure there’s a single virtual function call that computes the adjoint-Jacobian product and it takes care of all the stack bookkeeping. The advantage is that you don’t need to ever explicitly store the Jacobian itself.

I didn’t understand this point.


#3

I really think that Aki’s point is that we’re trying to design a function that reduces the memory requirements of the autodiff.

In order to do so, we want to be able to approximate the memory requirements for GP models ahead of time, and make sure this implementation will minimize memory requirements.

Looking at Autodiff paper 8.6 Memory Usage we have: “storage cost in the area is as follows for a function with K operands,” and let’s assume for simplicity we’re looking at only exponential quadratic kernel (but this is one of the less arithmetically complex kernels), let

k_n = \sigma^2 exp(\frac{d(x, x')}{2l^2})

and consider kernel: K = k_1 + k_2 + ... + k_8

where for each k_n we have 7 operands. Then the memory usage of one of these Kernels per element is then: 24 + 7 * 8 = 80 bytes. Then, if we have 8000x8000 kernel (and we only use the lower triangular including diagonal with n * (n - 1) / 2 + n elements), and we have sum of 8 kernels, our memory consumption for autodiff for a problem like this is: 80 * (
32004000) * 8 ~= 24 GB, this is more ram than most common CPUs have. Ideally we’d want to get argmax to be somewhere around 12 GB (or 8GB).

Am I missing anything in the calculation? Anything else I should consider when approximating memory requirements for Stan programs (not just autodiff)?

It’s also common to sum and multiply kernels, adding further memory requirements for autodiff… for bday with 7 kernels (2 of which are exp_quad .* periodic, so really 9 kernels) I’m getting memory leaks with 32GB of ram.

Nested Autodiff could be a good idea.

It looks like there is some kind of nested implementation for the autodiff stack as there’s some conversation about it PR 181. I see there’s a “reasonable speed increase” (albeit a bit arbitrary because we know nothing about the model or the size of the dataset), so I’m assuming there’s already some kind of framework here already. However, before any coding is done, it’s important to know that Nested autodiff would give us the kind of memory usage reduction we want. Has anyone worked out an approximate calculation as to the memory consumption of nested autodiff? This would be a good first step. I’ll look into it a bit later.

I think this is what he means:

Another option would be to calculate the function value and gradient WRT each function value at each kernel (perhaps work this out for all common kernels), as @syclik mentioned in the link below. This way we’d only have a single expression on the expression graph. In the example above, this would reduce memory requirements to (I think?): 32 * 32004000 * 8 ~= 8 GB, which is better. But adding or multilpying only a few more kernels could make it increase.

Conversation with information about autodiff memory consumption: bypassing autodiff.

Ultimately, we’re worried about memory consumption of the autodiff here. Thoughts?


#4

The best thing to do is go into the actual memory class and instrument it. It has methods to measure how much memory has been allocated that we use in some of the unit tests.

Some of the matrix stuff is done more efficiently than the basic expressions.

Whether memory leaks or not isn’t dependent the amount of memory you have. There shouldn’t be any autodiff memory leaks in our org.