Higher-Order Autodiff, testing framework & RHMC


#1

@seantalts, @rok_cesnovar, @tadej

To continue our discussion: we’d be interested in finishing the deployment of (higher-order) AD and the testing framework, at least to the point where it wouldn’t stand in the way of something like including RHMC in Stan.

Regarding mixed-mode testing - @bbbales2, @Bob_Carpenter, so, if I understand correctly, AD testing is under review, but does not yet test mixed-mode/higher-order? But you think it would be relatively easy to extend what has been done to higher-order testing? And in terms of forward-mode support, the basic functions/pdfs are supported, but most of the advanced functionality isn’t (solvers, integrators, etc.)?

As @betanalpha pointed out there are also two things that need to be dealt with in Stan itself:

  • how to gracefully fail when you use something without forward mode support and
  • how to compile higher-order support only when needed, because it would be slow.
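As a concrete illustration of the first bullet, graceful failure could be as simple as a compile-time trait plus a runtime check with a clear message. This is a minimal sketch with hypothetical names (`supports_fwd`, `check_fwd_support`), not Stan’s actual mechanism:

```cpp
#include <stdexcept>
#include <string>

// Hypothetical compile-time trait: does this function/functor support
// forward mode? Defaults to false; supported functions would specialize it.
template <typename F>
struct supports_fwd {
  static constexpr bool value = false;
};

// Hypothetical guard the algorithm layer could call before running
// higher-order autodiff, so the user sees a clear error message instead
// of a template-instantiation failure or silent misbehavior.
template <typename F>
void check_fwd_support(const std::string& name) {
  if (!supports_fwd<F>::value)
    throw std::invalid_argument(
        "function '" + name + "' has no forward-mode (fvar) support; "
        "RHMC requires higher-order derivatives");
}
```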

Any ideas/comments on these would be appreciated. Is there something else that needs to be done?


#2

For this we should try to unify with MPI, threads, and GPU (though I think there are at least 2 slightly different ways of doing it in those 3 subsystems). Basically putting something like STAN_RHMC=true or CXXFLAGS += -DSTAN_RHMC in the makefile would work.
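A sketch of what that could look like in `make/local`, mirroring the existing `STAN_THREADS`/`STAN_OPENCL`/`STAN_MPI` conventions (`STAN_RHMC` is the hypothetical flag from above, not an existing one):

```make
# In make/local, either a makefile toggle that the build rules pick up:
STAN_RHMC = true

# ...or passing the preprocessor flag directly:
CXXFLAGS += -DSTAN_RHMC
```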


#3

We should probably start by getting it off the ground. However, for RHMC, multi-threading could well be a killer feature, given that we can probably parallelize the higher-order derivatives (one thread per direction needed)?


#4

Oh sorry, I just meant to mimic an existing makefile convention from one of those. We don’t have to refactor other stuff as part of RHMC.


#5

One open question is how we decide to do higher-order autodiff. The usual approach is to do two passes of first-order autodiff; this is what currently lurks in Stan with forward and reverse mode. That said, @betanalpha is well under way with a geometric theory of higher-order autodiff, which prescribes several implementation techniques. (I’m still reading the draft he sent me, so I don’t have all the details yet.)

More ambitious, but I think implementing the latter could lead to better results. At least, it would be interesting to find out whether it does.


#6

Honestly, I would much rather see RHMC in action sooner rather than later… unless the current plan to nest things is not feasible.

I would really hope that RHMC is able to sample smoothly through horseshoe-prior problems, and I’d much rather have that available sooner rather than later.

(but it is beyond me to judge which is the faster way to get this working)


#7

At some point we are due to sit down and rethink the autodiff architecture, especially as we move towards higher orders. That shouldn’t necessarily obstruct a project like this, but we should keep in mind that there are many ways to proceed on higher-order methods.


#8

Off topic: this NIPS paper about efficient higher-order autodiff is on the front page of HN today: http://www.matrixcalculus.org/matrixcalculus.pdf. It makes some impressive performance claims (two orders of magnitude on the CPU, no multiple passes required), but who knows how implementable it is.


#9

This paper just argues that using array-valued* adjoints makes implementing matrix calculations in autodiff easier, which is well known and something we already take advantage of in some of our better-implemented matrix operations. They also only claim an extension to higher orders in a throwaway sentence, without any useful discussion.

  • Beyond first order they are not tensors, just higher-dimensional arrays. One should be careful not to throw Ricci calculus around while getting the geometry vocabulary wrong!

#10

Absolutely. The Hessian is \mathcal{O}(n^2), made up of n independent \mathcal{O}(n) operations. So yes, it’ll be embarrassingly parallelizable up to the dimensionality of the parameter vector.

var and fvar<var> work the same way—they both make a single forward pass to build an expression graph (in the former for the value, in the latter for a derivative w.r.t. some input), and then do a single reverse-mode pass to propagate the chain rule.

That’s a long way from having our math lib implemented.

Michael already has a previous implementation you can see in the stan-dev/nomad repo. I’m not sure it was calling the proper higher-order autodiff in the speed tests vs. Stan—I think it may have just been forward mode all the way down, which imposes a nasty \mathcal{O}(n) penalty.

Here’s the outstanding PR from @bbbales2. It does not actually include any forward-mode testing, but it solves the problem of testing variadic functions. The testing framework I have in place for the basic operators and arithmetic shows how to do the testing at all orders. These just need to be put together and then applied to all of our higher-order operations.


#11

Just to summarize (@Bob_Carpenter and others, correct me if I’m wrong) - the main obstacles in the way of having RHMC in Stan are:

  • completing the higher-order AD tests so mixed-mode AD can be included in Stan and
  • providing an interface to Michael’s RHMC implementation and again testing everything.

But there is no reason why the existing approach to higher-order AD wouldn’t be appropriate? Even if it might not be the most efficient.

If that’s the case, we’ll have a serious go at this.

By the way, does anyone have a readily available and as simple as possible Stan model that uses the algebraic/ODE solver or anything else that is guaranteed to break higher-order AD?


#12

Parallel autodiff will probably speed up the current mixed operations. RMHMC will still be slow as hell—it requires multiple eigendecompositions at each leapfrog step, which would make using RMHMC several times more expensive than using a GP model with the same number of parameters. Which is to say that the higher-order autodiff doesn’t have to be the fastest possible: if we ignore compilation time, the cost of computing the derivatives will fairly quickly be swamped by the linear-algebra cost.


#13

That’s actually our main goal - to see how much of that we can speed up on the GPU. Getting it all to work is just an intermediate step. :)


#14

This is the primary obstacle. There are two paths forward, both of which assume the primitive instantiations (double, int) have been tested.

  1. We do exhaustive tests of each derivative instantiation:

    • (rev) rev
    • (fwd) fvar<double> and fvar<fvar<double>>
    • (mix) fvar<var> and fvar<fvar<var>>

    We only just figured out how to solve the variadic polymorphism problem involved, with @bbbales2’s active stan-dev/math PR. This needs to be combined with the pattern I used for testing the core C++ mathematical operators using the finite-difference functionals. All of our derivative functionals like gradient() and hessian() are implemented with (a) forward only, (b) mixed, and (c) finite differences.

  2. We test fvar<double> against rev (or finite diffs) and then assume by induction that it will work elsewhere. This is sound as long as all the relevant function calls are implemented and correct; if they build, they will already have been tested by induction. I’m pretty sure @syclik agreed this would be an OK route to test. What slowed us down so much before was trying to write out all the mixed tests by hand. This only makes sense when there are no specializations of any functions for mixed mode—those would need to be tested independently, and thus form a pretty strong argument against this approach, because specializing the mixed implementations is where we’d get single-thread performance gains (which often stack superlinearly with parallelization because of memory-locality improvements, etc.).
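The comparison at the heart of path 2 can be sketched as a small functional that checks a candidate first-order derivative against central finite differences; the names here are illustrative, not Stan’s actual test API:

```cpp
#include <cmath>
#include <functional>

// If fvar<double> matches finite differences and the reverse-mode var path
// is known-good, fvar<var> is argued correct by induction. This helper does
// the fvar<double>-vs-finite-differences half of that argument for a
// univariate function.
bool derivative_matches_finite_diff(
    const std::function<double(double)>& f,
    const std::function<double(double)>& df,  // candidate autodiff derivative
    double x, double tol = 1e-6) {
  const double h = 1e-6;
  double fd = (f(x + h) - f(x - h)) / (2.0 * h);  // central difference
  return std::fabs(fd - df(x)) < tol;
}
```

In a real test suite this would be run over a grid of inputs per function rather than a single point.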

This includes

  • creating service interfaces in stan-dev/stan, and
  • implementing these services with callbacks through the low-level user-facing interfaces (CmdStan, RStan, PyStan), and the real hard one,
  • a build system for the interfaces which does not require instantiating higher-order autodiff (in the C++ sense of “instantiate”—that’s what causes the object code to be optimized and assembled).

@betanalpha should be deciding what exactly needs to be tested on the algorithmic side.

It’s the standard approach to higher-order autodiff and it works. Stan’s implementation has been tested very thoroughly outside of multivariate matrix functions. Having said that, we want to change the fundamental behavior and

  • remove the NaN fiddling in the constructors

because it’s really slowing things down.

To see how everything works, look at the functional interfaces like the two forms of hessian() (one uses only forward mode and the other reverse nested in forward).


#15

Yup. In agreement.

Btw, the minimum level of testing I’d want before releasing is for each function:

  1. We test one case in which we know the function value and we verify we’re getting the correct value + derivatives.
  2. We test one error condition to verify that it errors in the way it should. (Meaning it throws the right exception, it has the correct message, it doesn’t seg fault.)

That won’t tell us that the implementation is great across the domain of inputs, but I think that’s the minimum level before I’d say we could release it.
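A minimal sketch of that two-part minimum for a stand-in function (my_log here is hypothetical, not a Stan function): one case with a known value and derivative, plus one error condition that must throw the right exception type rather than segfault:

```cpp
#include <cmath>
#include <stdexcept>

// Stand-in function with a checked domain: throws std::domain_error with
// an informative message instead of returning garbage or crashing.
double my_log(double x) {
  if (x <= 0.0)
    throw std::domain_error("my_log: x must be positive");
  return std::log(x);
}

// Its derivative, with the same domain check.
double my_log_deriv(double x) {
  if (x <= 0.0)
    throw std::domain_error("my_log: x must be positive");
  return 1.0 / x;
}
```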

We’re going to think through the (software) integration issues, though. We will need to figure out how to conditionally compile with higher-order autodiff. I haven’t verified whether we can compile with the newly suggested Windows compilers; we couldn’t with the older g++ 4.6.3.


#16

I have attempted to make RHMC work within cmdstan. My changes to stan and cmdstan are here https://github.com/bstatcomp/stan/tree/RHMC and here https://github.com/bstatcomp/cmdstan/tree/RHMC.

With these changes I managed to compile simple models such as a bernoulli or linear model. However, in both, parameter values sometimes become NaN. In the bernoulli model this happens once every few runs, while in the linear model sigma always increases until it reaches infinity.

Can someone (maybe @betanalpha) check if I did something stupid, or is this a problem with RHMC?


#17

I’d actually want a bit more. I want to test the derivative versions vs. the base versions for value and exception behavior and with finite diffs for derivatives. What I think we can cut down is that if fvar<double> works and var works, then fvar<var> should work by construction. The bottleneck in the past has been testing all the higher-order mixed cases.

I think it’s inevitable that we won’t be able to fully support Windows—we just don’t have the resources. I think it’s OK to have things that only work on Linux/Mac OS X.

Cool! Thanks for sharing. I hope @betanalpha has some idea of what’s going on here. It would be nice to have a working RHMC, if only to be able to compare it more broadly. I believe @betanalpha said there are some residual tuning issues that haven’t been fully fleshed out.