So question regarding the matrix autodiff. Say I have a function that’s dependent on some parameters. Let’s say we have a squared exponential kernel:
\sigma^2 exp(\frac{d(x, x')}{2l^2}), with for example the partial derivative WRT \sigma:
\frac{d}{d\sigma} = 2\sigma exp(\frac{d(x,x')}{2l^2}).
This will have a different value of the lower triangular, for example if the covariance matrix is 3x3 there are 3(2)/2 = 3 elements. So what’s the Jacobian look like for this function? If we have some function f, that’s dependent on x, \sigma and l so that we have f(x, \sigma, l), then we can write out the Jacobian as follows:
J_f = [\frac{df}{dx} \frac{df}{d\sigma} \frac{df}{dl}],
But this function exists on a matrix, so really, we have these evaluations at each position in the matrix. So really, J_f above will have \frac{n(n+1)}{2} evaluations, for a symmetric matrix (the diag and either the lower or upper triangular).
So, for the specialization, do I need to calculate the partial derivative with respect to each parameter, at each point in the covariance matrix? If so, that brings me to another question. The current rev
mode implementation of this covariance function looks like this, just the part where we’re chaining the partial derivatives:
virtual void chain() {
double adjl = 0;
double adjsigma = 0;
for (size_t i = 0; i < size_ltri_; ++i) {
vari *el_low = cov_lower_[i];
double prod_add = el_low->adj_ * el_low->val_;
adjl += prod_add * dist_[i];
adjsigma += prod_add;
}
for (size_t i = 0; i < size_; ++i) {
vari *el = cov_diag_[i];
adjsigma += el->adj_ * el->val_;
}
l_vari_->adj_ += adjl / (l_d_ * l_d_ * l_d_);
sigma_vari_->adj_ += adjsigma * 2 / sigma_d_;
}
But here’s what I noticed. If we need different gradient evaluations at each point in the covariance matrix, for say, sigma, why am I doing this operation: adjsigma += prod_add;
. The partial deritive for a matrix, w.r.t. \sigma is only one value? This will then be the same value at all points in the lower triangular of the covariance matrix, correct? Am I missing something?
Alternatively shouldn’t initialize sigma_vari_
as a vector of vari
that has a different value at each point in the covariance martix?
Anything incorrect in my understanding?