Algebraic Solver Differentiation Speedup

Hi all,

I’m a first-year Ph.D. student in Sharad Goel’s lab at Stanford. @charlesm93 has generously been explaining an idea he has for speeding up differentiation in the algebraic solver. (I.e., the topic posted here.) Over the next few weeks, I’m planning to implement this alternative method of propagating derivatives through implicit solutions to algebraic equations.

The basic idea is to use the implicit function theorem to get an “almost adjoint” method. Charles has a nice writeup here. Just at a high level, given a target scalar j(u, v), with u \in \mathbb{R}^p, v \in \mathbb{R}^n, and v depending implicitly on u via f : \mathbb{R}^{p + n} \to \mathbb{R}^n by requiring that f(u, v) = 0, we can calculate

\frac {dj} {du} = \frac {\partial j} {\partial u} + \frac {\partial j} {\partial v} \frac {dv} {du}

and further simplify, using the implicit function theorem,

\frac {dv} {du} = - \left[ \frac {\partial f} {\partial v} \right]^{-1} \frac {\partial f} {\partial u}.

The key insight is that if we can calculate (with a matrix solve)

\eta = - \left[ \frac {\partial f} {\partial v} \right]^{-T} \left[ \frac {\partial j} {\partial v} \right]^T,

then it only requires one additional reverse-mode sweep to calculate

\frac {\partial j} {\partial v} \frac {dv} {du} = \eta^T \frac {\partial f} {\partial u}.

The hard part is calculating the Jacobian \left[ \frac {\partial f} {\partial v} \right]. This can either be done with n forward-mode sweeps, or possibly gotten for free from the solver itself.

Looking forward to working with all of you! Hopefully this will result in some speedup.

9 Likes

@jgaeb I think everyone on the Stan dev team agrees this is a feature we should try out. The next step is to post an issue on

and then work out the details. I propose we start with the newton solver, although both solvers should admit the same differentiation.

1 Like

In order to implement the final reverse-mode pass with \eta, based on the conversation here and here, it seems like the right thing to do is to use adj_jac_apply(). Unfortunately, the file stan/math/rev/functor/adj_jac_apply.hpp no longer seems to exist. Based on this PR, it was merged into develop at some point, but I haven’t been able to determine whether it was moved or removed.

@bbbales2 — Is that still the right way to calculate the sensitivities of f relative to u given the initial cotangent \eta (i.e., \eta^T \tfrac {\partial f} {\partial u})? I think I understand how to accomplish this explicitly, but ideally, I’d like to avoid calculating the whole Jacobian of f, and there doesn’t seem to be any way of avoiding calculating \tfrac {\partial f} {\partial v}, since it has to be inverted.

Thanks!

Edit:

So I found in the 3.4.0 release notes that adj_jac_apply() has been replaced by reverse_pass_callback(), although I’m still a little confused about how to replicate the functionality of adj_jac_apply() with reverse_pass_callback(). My understanding was that with adj_jac_apply one could do something like adj_jac_apply(Function f, RowVectorXd s), where f is the function whose sensitivities we want to calculate and s is the initial cotangent, and it would return the sensitivities. I’ve been looking at this blog post, but I still haven’t really figured out how to accomplish the former with the latter.

Ooo oo, adj_jac_apply has been replaced. Now we do this thing called reverse_pass_callback. (And I see you have found this)

Check out the implementation of dot_self (so dot product of a vector and itself): math/dot_self.hpp at develop · stan-dev/math · GitHub

This is the matrix multiplication code: math/multiply.hpp at develop · stan-dev/math · GitHub.

There’s kindof a lot going on there. I’m headed to bed, but can follow up tomorrow. Can you describe to the function you want to write? Like how many arguments it has and what return type you want and such? I can (hopefully) write template code for you that works with reverse_pass_callback then.

I’m afraid the docs for all this are a work in progress, but they are here (though they might be more confusing than useful).

1 Like

Great, thank you! Following up tomorrow would be great!

(Just a head’s up that I’m using slightly different notation below from the original post. Here y is the input and x is the output of the implicit function. This is slightly confusing but matches the current code in algebra_solver_powell.hpp better. Also, a lot of this is for me to wrap my head around what’s going on — I obviously still have a lot to learn! — so my apologies if it’s a little long or some of the terminology isn’t quite right.)

Suppose we have some function f : \mathbb R^{n + p} \to \mathbb R^p, and we define the function g : \mathbb R^n \to \mathbb R^p by setting g(y) \in \mathbb R^p to be the unique x such that f(y,x) = 0.

Then, if j(y, g(y)) is the function whose gradient we’re trying to extract, we can do so with the formula given by the implicit function theorem:

\bar y = \frac {dj} {dy} = \frac {\partial j} {\partial y} - \bar x \left[ \frac {\partial f} {\partial x} \right]^{-1} \left[ \frac {\partial f} {\partial y} \right]

where \bar x = \tfrac {dj} {dx}.

So, in the chain method for the algebra_solver_vari in stan/math/rev/functor/algebra_solver_powell.hpp, that means we need to do four things:

  1. Get the sensitivities \bar x . This is just going to be stored in the x_size_-size array of varis, theta_ in our algebra_solver_vari object. So, concretely, I think we should just be able to call:
    VectorXd x_bar_(x_size_);
    for (int i = 0; i < x_size_; ++i) {
      x_bar_[i] = theta_[i]->adj;
    }

(I’m not sure if this is actually the best way to initialize x_bar_, but that seems like a question for a later stage!)

  1. We need to contract \bar x and \left[ \tfrac {\partial f} {\partial x} \right]^{-1}. This we can just do with Eigen, e.g.,
    MatrixXd(x_size_, x_size_) Jf_y = fx.get_jacobian(theta_dbl);
    VectorXd(x_size_) eta_ = - Jf_y.partialPivLU.solve(x_bar_);

(The functor fx is just f with its y parameters all fixed to whatever the values of the inputs are, and fy is just f with its x parameters all fixed to whatever the values of the solution are.)

  1. We need to contract \eta = - \bar x \left[ \tfrac {\partial f} {\partial x} \right]^{-1} and \left[ \tfrac {\partial f} {\partial y} \right]. Here, I would like to do something like
    bar_y_ = f.adj_jac_apply(f, eta_);

but need to replace this step with a callback. (Another possibility @charlesm93 discussed with me was actually just getting \left[ \tfrac {\partial f} {\partial y} \right] from the solver itself, in which case you could just use straightforward linear algebra, but that actually seems a little more complicated, so I’d like to try implementing it the “lazy” way first!)

  1. The last step is just to assign the result to the varis that make up y:
    for (int j = 0; j < y_size_; ++j)
        y_[j]->adj_ += dot(x_bar_, Jf_.col(j));

I’ll look at all the docs you link (and maybe come back and edit this a little bit if that ends up resolving some of my confusion), but please let me know if that doesn’t make sense or answer your question!

You can compute \eta^T \frac{\partial f}{\partial y} with a nested reverse pass:

So if f takes a vector input and spits out a vector, do something like:

{
  stan::math::nested_rev_autodiff();
  Eigen::Matrix<stan::math::var, Eigen::Dynamic, 1> y = Eigen::VectorXd::Random(5);
  auto x = stan::math::eval(f(y)); // use eval to evaluate
                                   // expressions to plain types
  x.adj() = eta;
  stan::math::grad();
  std::cout << y.adj() << std::endl; // this is the result!
}

Does that make sense?

Edit: Also is that the question you’re after here

2 Likes

Yes, I think this is exactly what I was looking for. Thanks!

1 Like

I have a few questions regarding @charlesm93’s writeup attached in the first post (thank you for the nice summary @jgaeb); mostly on expressions that I am not used to.

  1. Why is \delta named cotangent vector? Member of cotangent space is a linear map from T_{x}M -> R. So do we name partial derivative of scalar w.r.t n-sized vector, a cotangent vector because of its size, or are they any deeper meaning?

  2. When do you use a dagger? Is this a transpose?

  3. From below, what is the relationship between using basis vector and replacing with “.”? Is this because LHS of line2 (Alg1) is independent of basis vector?

When we compute the full Jacobian, we use basis vectors as our initial vectors, and replace the last term with “·”

1 Like

Hi @hyunji.moon

  1. I’m not sure about the nomenclature for \delta. Someone who knows more than me would have to weigh in to really answer, but I would guess that it’s because if f : \mathbb{R}^n \to \mathbb{R}^m, then any y \in \mathbb{R}^m gives rise to a unique map from the tangent space at x to \mathbb{R}. In particular, if \mathbf{t} is any tangent vector at x, then (assuming x is a regular point) there is some \mathbf{x} \in \mathbb{R}^n such that \mathbf{t} = J_x \mathbf{x}, and then we can set
\text{map}_{\mathbf{y}}(\mathbf{t}) = \mathbf{y}^T J_x \mathbf{x}.
  1. Yeah, the dagger is just transpose. (I posted an early version, so there are one or two places where a dagger is missing because of a typo.)

  2. That notation is just shorthand for computing the matrix

\begin{bmatrix} D^n_{\text{fw}} (f, u, \mathbf{e}_1) & \cdots & D^n_{\text{fw}} (f, u, \mathbf{e}_n) \end{bmatrix}.

Hope that’s helpful! (And anyone please feel free to chime in if I got anything wrong!)

1 Like

Because just like forward mode autodiff works with tangent vectors, reverse mode audodiff works with cotangent vectors. Indeed reverse mode autodiff is just a sequence of cotangent vector pullbacks and forward mode autodiff is just a sequence of tangent vector pushforwards. For a more in-depth discussion on this connection between automatic differentiation and geometry see https://arxiv.org/pdf/1812.11592.pdf.

2 Likes

So, for some reason, when I attempted to implement this, algebraic_solver_vari.solve() seems to be calling itself recursively when I get to the line where stan::math::grad() is called, which is eventually causing a seg fault.

The way I’ve tried to implement a first pass of your suggestion above is:

/**
 * The vari class for the algebraic solver. We compute the  Jacobian of
 * the solutions with respect to the parameters using the implicit
 * function theorem. The call to Jacobian() occurs outside the call to
 * chain() -- this prevents malloc issues.
 */
template <typename Fy, typename T, typename Fx>
struct algebra_solver_vari : public vari {
  /** number of parameters */
  const int y_size_;
  /** value of parameters */
  const Eigen::Matrix<T, Eigen::Dynamic, 1> y_val_;
  /** System functor (f_) w.r.t. inputs (y_) */
  Fy fy_;
  /** array of parameters */
  vari** y_;
  /** number of unknowns */
  const int x_size_;
  /** value of unknowns */
  const Eigen::VectorXd& x_val_;
  /** Hybrj functor solver (f_) w.r.t. outputs (x_) */
  Fx fx_;
  /** array of unknowns */
  vari** x_;
  /** Jacobian of f w.r.t. outputs (x_) */
  const Eigen::MatrixXd Jf_x_;

  algebra_solver_vari(Fy& fy, const Eigen::Matrix<T, Eigen::Dynamic, 1>& y,
                      Fx& fx, const Eigen::VectorXd& x)
      : vari(x(0)),
        y_size_(y.size()),
        y_val_(y),
        fy_(fy),
        y_(ChainableStack::instance_->memalloc_.alloc_array<vari*>(y_size_)),
        x_size_(x.size()),
        x_val_(x),
        Jf_x_(fx.get_jacobian(x_val_)),
        fx_(fx),
        x_(ChainableStack::instance_->memalloc_.alloc_array<vari*>(x_size_))
  {
    using Eigen::Map;
    using Eigen::MatrixXd;

    for (int i = 0; i < y_size_; ++i) {
      y_[i] = y(i).vi_;
    }

    x_[0] = this;
    for (int i = 1; i < x_size_; ++i) {
      x_[i] = new vari(x(i), false);
    }
  }

  void chain() {
    using Eigen::Matrix;
    using Eigen::MatrixXd;
    using Eigen::VectorXd;
    using Eigen::Dynamic;

    // Compute (transpose of) specificities with respect to x.
    VectorXd x_bar_(x_size_);
    for (int i = 0; i < x_size_; ++i) {
      x_bar_[i] = x_[i]->adj_;
    }

    // Contract specificities with inverse Jacobian of f with respect to x.
    VectorXd eta_ = - Jf_x_.transpose().fullPivLu().solve(x_bar_);

    // Contract with Jacobian of f with respect to y using a nested reverse
    // autodiff pass.
    VectorXd y_bar_(y_size_);
    {
      stan::math::nested_rev_autodiff();
      Matrix<var, Eigen::Dynamic, 1> y_nrad_ = y_val_;
      auto x_nrad_ = stan::math::eval(fy_(y_nrad_));
      x_nrad_.adj() = eta_;
      stan::math::grad();
      y_bar_ = y_nrad_.adj();
    }

    for (int j = 0; j < y_size_; j++) {
        y_[j]->adj_ += y_bar_[j];
    }
  }
};

I think I must be initializing y_bar_ or y_nrad_ incorrectly, or somehow the way I’m using fy_ means that something is getting added to the nested autodiff stack that shouldn’t be there? (If it’s helpful, in the way that this algebra_solver_vari is being constructed, fy_ is a system_functor which takes a vector input of size y_size_ and applies f to that input and a cached copy of the output x.)

Can you put this code up in a branch somewhere? Also what test are you running?

Not obvious to me what’s breaking. I’d have to run the code. Sounds like something weird though.

In general these sorts of member functions can cause problems too, but not this one:

const Eigen::VectorXd& x_val_;
const Eigen::MatrixXd Jf_x_;

The first it’s likely the thing x_val_ references is out of scope during the reverse pass.

Jf_x_ will work, but it will leak memory. Things that inherit from vari don’t have destructors. But send this over and let me first look at this recursive thing and then we can handle these other points.

2 Likes

Great, thank you so much! Here’s a link to the branch on my fork of the main repo where I’ve been making these changes: GitHub - jgaeb/math at scratch.

As I’ve been trying to figure out exactly what I need for the implementation, I’ve added a bunch of members, some of which I now realize are superfluous. Once the chain method is working, I’ll try to work on dealing with the memory leaks!

(Again, I really appreciate you jumping in to help—I hope I’m not wasting too much of your time!)

Edit:

I accidentally posted this not as a reply before.

Also—I’ve been running math/test/unit/math/rev/functor/algebra_solver_test.

Would you mind opening up a draft PR? It might be easier to discuss this on github instead of copy pasting lines back and forth

1 Like

Aaah, that was tricky. Change:

stan::math::nested_rev_autodiff();

To something like:

stan::math::nested_rev_autodiff nested;

We used to use the syntax:

start_nested();
... // autodiff code
recover_memory_nested();

We switched to using a class for this so that the nested start/stop are handled in the constructor/destructor of nested_rev_autodiff:

{
  nested_rev_autodiff nested;
  ... // autodiff code
}

I think this works better with exceptions and stuff cause C++ handles the cleanup.

With nested_rev_autodiff(); the object gets created and immediately destructed and the function evaluation isn’t going onto the nested autodiff stack.

Also +1 on what @stevebronder said about the pulls. You can open a draft pull request and it makes it easier to dig through the source and leave comments and stuff.

3 Likes

Great, thanks so much for your help! That seems to have solved the issue.

I’ve opened a draft pull-request here. I couldn’t find anything specific in the wiki about how to format draft pull requests (some of the pieces, e.g., tests, aren’t there yet), so I just filled out the PR information informally for the time being while I get the code into shape. If there are guidelines I should follow more closely or missing info that you’d like to see at this point, just let me know and I can edit / delete as necessary.

1 Like

Cool! I’ll have a look through it tomorrow and see if anything generally sticks out.

At what stage would you call the development? Ready to benchmark? Or still debugging edge cases and stuff? Or ready to be polished up and merged?

1 Like

Definitely still debugging edge cases and stuff. I think once it actually compiles and runs (and doesn’t leak memory!) @charlesm93 and I are going to try to do some benchmarking. I expect to be at that point probably some time next week?

We’re indeed at a stage where we can start running benchmarks. We welcome suggestions, should people have problems they are interested in solving.

Our first example is a steady state model for pharmacokinetic. A patient takes a drug at a regular time interval: at the first the average drug mass in the patient’s system increases but after a few cycles it reaches a steady state. The drug mass y evolves over time according to the ODE

y_1' = - k_1 y_1 \\ y_2' = k_1 y_1 - k_2 y_2

with initial conditions (y_1^0, y_2^0). Often times, y_1^0 is the initial drug dose and y_2^0 = 0. The system is then readily solved by

y_1 = y_1^0 e^{-k_1 t} \\ y_2 = y_1^0 \frac{k_2}{k_1 - k_2} k_2(e^{-k_2t} - e^{-k_1 t}).

There is also an analytical solution when y_2^0 \neq 0, but it’s more complicated and I’m blinking on it, right now. We will need the latter solution when computing the system at steady state. Solving for the steady state means finding the initial conditions at the beginning of a cycle, such that we return to this state at the beginning of the next cycle. See https://charlesm93.github.io/files/2018-Margossian.pdf.

We measure the drug concentration, c, over time (for example over one or two cycles). Our likelihood distribution is, for example,

p(c \mid \theta) = \mathrm{Normal}\left (\frac{y_1}{V}, \sigma^2 \right),

where V is the volume of the central compartment and \theta the model parameters, here \theta = (k_0, k_1, V).
For many algorithms we care about, we need to compute

\nabla_\theta \log p(c \mid \theta).

We’ll do this with the adjoint and non-adjoint algebraic solvers.

The system can be scaled by introducing more patients and expanding the ODE. This ODE would only be pairwise coupled, but this can still help us learn about the scalability of our differentiation methods. I ran a similar experiment in the review for autodiff (https://arxiv.org/pdf/1811.05031.pdf, page 23).

Once we have this up and running, we can try a more challenging example where the ODE doesn’t admit an analytical solution and where we have more parameters and states per patient. I’m thinking a Friberg-Karlsson semi-mechanistic model – although I’m not sure patients would reach steady states in realistic applications.

Tagging some folks you might be interested in these experiments: @yizhang @wds15 @billg

Some values we can use for the simulation:

k1 = 2
k1 = 1
sigma = 1
y0  = 1000  // dose per pill / cycle
time = (0.083, 0.167, 0.25, 1, 2, 4)
ii = 5   // time between doses / cycles
y_obs = (58.1913, 104.996, 140.298, 206.566, 120.457, 23.1845)

The observed values were not simulated at steady state, but that shouldn’t really matter, since we’re primarily concerned in evaluating the derivatives, not doing a full model fit.

5 Likes