Derivatives of kernel with respect to columns of X

Hey everybody, hopefully this is an easier question than it seems (to me)!

I’m looking for a way to obtain the derivatives of a kernel with respect to the columns of the input X. As I understand it, I could “easily” get the derivatives of the kernel with respect to all inputs X_ij, resulting in an object of size (NxN)x(NxD), with X being an NxD matrix. The derivatives I need form an NxNxD array instead. Maybe I overlooked some simple rule for getting the derivatives with respect to the columns from the derivatives with respect to all the X_ij? In any case, I’m very new to Stan internals and this is the first time I’ve gone down to C++ to code something with stan::math (I haven’t had the need so far), so I’m surely missing some functionality of the math library.

As a temporary solution I have hand-coded the derivatives for the squared exponential kernel, but if I want to scale to other kernels it’s probably easier to just use autodiff (if possible, of course).
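For reference, assuming the standard parameterization with marginal standard deviation \sigma and length-scale \ell (notation added here just for concreteness), the closed form for the squared exponential case is

k(x_i, x_j) = \sigma^2 \exp\left( -\frac{\lVert x_i - x_j \rVert^2}{2 \ell^2} \right),
\qquad
\frac{\partial k(x_i, x_j)}{\partial x_{id}} = -\frac{x_{id} - x_{jd}}{\ell^2} \, k(x_i, x_j),

but every new kernel means deriving and hard-coding another set of formulas like these.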

Any suggestion appreciated, maybe @charlesm93 knows something? (I know you were working with derivatives of kernels during the summer). Thanks!

Alex


In C++ or in a Stan program?

In Stan, we almost always wind up with a single log density, so we never explicitly have to express a Jacobian in any of our calculations: Jacobians remain implicit in the backward pass of autodiff, where we do an adjoint-Jacobian product and then increment the adjoints of the operands to a function. So you usually want to code that adjoint-Jacobian product efficiently rather than creating the whole Jacobian. For example, if we have an N x N matrix inverse, the Jacobian has O(N^4) entries, but we can compute the entire adjoint-Jacobian product in O(N^3).
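To make that last point concrete (this is the standard reverse-mode rule for the matrix inverse, not anything specific to Stan’s implementation): if B = A^{-1} and \bar B is the adjoint of the output, the backward pass only has to perform the update

\bar A \leftarrow \bar A - B^\top \bar B \, B^\top,

which is two N x N matrix multiplications, i.e. O(N^3); the O(N^4) Jacobian is never formed.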


Thanks for the answer. I’m doing it in C++ so I can lift the function to R and use it from other places; I hope this helps clarify the question.

But what do you want to do with it? Do you literally need the full Jacobian?

The top-level overview of how our autodiff works is the arXiv paper. Basically, if you can write your function in templated C++ that only relies on functions we have implemented, then you can autodiff it with Stan.
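For instance, here is a minimal sketch of the pattern with a toy function (the function itself and the exact calling conventions are my own illustration; double-check them against the version of the Math library you have):

#include <stan/math.hpp>
#include <Eigen/Dense>
#include <iostream>

// Toy functor: any vector-to-vector function written in templated C++ that
// only uses operations Stan Math supports can be autodiffed.
struct toy_fun {
  template <typename T>
  Eigen::Matrix<T, Eigen::Dynamic, 1>
  operator()(const Eigen::Matrix<T, Eigen::Dynamic, 1>& x) const {
    using std::exp;
    using stan::math::exp;
    Eigen::Matrix<T, Eigen::Dynamic, 1> y(2);
    y(0) = exp(x(0)) * x(1);
    y(1) = stan::math::dot_self(x);  // x' * x
    return y;
  }
};

int main() {
  Eigen::VectorXd x(2);
  x << 0.5, 2.0;

  Eigen::VectorXd fx;  // function value f(x)
  Eigen::MatrixXd J;   // Jacobian, J(i, j) = d f_i / d x_j

  // Reverse-mode autodiff: one backward sweep per output of f.
  stan::math::jacobian(toy_fun(), x, fx, J);

  std::cout << "f(x) =\n" << fx << "\nJ =\n" << J << std::endl;
}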

Yes, I have read that paper but I should probably read it again. I want to compute the full Jacobian of the kernel matrix with respect to all D columns. Even worse, I need to compute the Hessian, but that should become clearer once I implement the Jacobian. That should be a matrix of size (NxN)xD. Looking at the paper I saw the function stan::math::jacobian, which accepts the function and the parameters. The problem is that I’m not entirely sure how to frame my function to fit that. Currently I have implemented a basic k_ij functor that computes the kernel for two vectors x_i and x_j. Now I have to frame it in such a way that stan::math::jacobian only computes D derivatives and not one for every X_ij. Is that clearer? Maybe I can achieve this another way, like direct differentiation without using stan::math::jacobian?

Hi Alejandro!

There’s code I wrote for the embedded Laplace approximation that does exactly what you’re looking for; in particular see this file, lines 168 - 196 and 252. Note that this code lives on an experimental branch of Stan, so you won’t find it in any released version (yet).

You’re right about using Jacobian(). Unfortunately, this (higher-order) function places constraints on the signature of f, the function which gets differentiated. Specifically, f needs to take in and return a vector, and Jacobian() then computes the full Jacobian \partial f_i / \partial x_j. In C++, f is a structure with an operator()(y) that returns f(y), where y holds the elements for which you require sensitivities / derivatives.

So you need to coerce the kernel function, K: x, y \rightarrow K(x, y) into a functor f_x: y^* \rightarrow K^*(x, y), where y^* stores the elements of y as an Eigen::Vector, and K^* those of K.

In my code, lines 168 - 196 build a structure around K:

  template <typename K>
  struct covariance_sensitivities {
    /* input data for the covariance function. */
    std::vector<Eigen::VectorXd> x_;
    /* number of latent variables. */
    int theta_size_;
    /* structure to compute the covariance function. */
    K covariance_function_;
    
    covariance_sensitivities (const std::vector<Eigen::VectorXd>& x,
                              int theta_size,
                              const K& covariance_function) :
    x_(x), theta_size_(theta_size),
    covariance_function_(covariance_function) { }

    template <typename T>
    Eigen::Matrix<T, Eigen::Dynamic, 1>
    operator() (const Eigen::Matrix<T, Eigen::Dynamic, 1>& phi) const {
      return to_vector(covariance_function_(phi, x_, theta_size_));
    }
  };

and line 252 calls Jacobian:

jacobian_fwd(f, value_of(phi), covariance_vector, diff_cov);

Note that Jacobian exists under both the rev and fwd directories, to do reverse- and forward-mode derivatives respectively. Which mode to use depends on your map f: \mathbb R^n \rightarrow \mathbb R^m, specifically on the dimensions m and n: roughly, reverse mode costs one sweep per output (so it wins when m is much smaller than n) and forward mode one sweep per input (so it wins when n is much smaller than m). Comparing both is an easy and enlightening exercise. In this particular example, there is one autodiff technique we don’t use in Stan that could be quite useful, namely taping; it’s worth thinking about.

To complement @Bob_Carpenter’s reference on autodiff, see my review on autodiff.

EDIT: the code is a prototype; there are a lot of things that can be improved, and it’d be worth discussing how.


Hi Charles!

You, sir, are a master.

I will take a look at your code and play with it (I did look at your branch because I remembered you were working on something related, but I completely missed the golden point). That will make my life so much easier; I was just about to manually add support for the kernels I need by hard-coding the derivatives, and this will scale so much better. From this I believe I can easily get to my next step, computing the Hessian (my first thought is to wrap jacobian_fwd and apply jacobian_fwd over it).

Looking at your edit, and after I take a couple of looks at the code, I would be very much up for that discussion on improving this prototype; I want to stick this piece into projpred in the near future, and performance may be a concern for some models.

Thank you very much.

I would be very much up for that discussion on improving this prototype

Awesome.

From this I believe I can easily get to my next step, computing the Hessian (my first thought is to wrap jacobian_fwd and apply jacobian_fwd over it).

It’s worth spending some time thinking this through. You’ll want to do one reverse-mode and one forward-mode pass for the Hessian, and exploit the fact that autodiff computes directional derivatives. It’s not straightforward, but it’s a well-known problem; my guess is a Google search will spit out the solution.
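For the scalar-valued case, this forward-over-reverse pattern is already packaged in Stan Math as stan::math::hessian under the mix directory. A minimal sketch with a toy scalar functor (my own example, just to show the pattern; the kernel case is matrix-valued, so it needs the extra care discussed above):

#include <stan/math/mix.hpp>  // stan/math/mix/mat.hpp in older Math versions
#include <Eigen/Dense>
#include <iostream>

// Toy scalar functor: f(x) = 0.5 * x' * x.
struct quad_fun {
  template <typename T>
  T operator()(const Eigen::Matrix<T, Eigen::Dynamic, 1>& x) const {
    return 0.5 * stan::math::dot_self(x);
  }
};

int main() {
  Eigen::VectorXd x(3);
  x << 1, 2, 3;

  double fx;             // f(x)
  Eigen::VectorXd grad;  // gradient of f at x
  Eigen::MatrixXd H;     // Hessian of f at x

  // Forward-over-reverse: fvar<var> is nested so that each forward-mode
  // direction yields a directional derivative, and a reverse pass over it
  // fills one column of the (symmetric) Hessian.
  stan::math::hessian(quad_fun(), x, fx, grad, H);

  std::cout << "grad =\n" << grad << "\nH =\n" << H << std::endl;
}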
