Higher-Order Autodiff, testing framework & RHMC

@seantalts, @rok_cesnovar, @tadej

To continue our discussion - we’d be interested in finishing the deployment of (higher-order) AD and the testing framework, at least to the point where it wouldn’t get in the way of something like including RHMC in Stan.

Regarding mixed-mode testing - @bbbales2, @Bob_Carpenter - so, if I understand correctly, AD testing is under review, but does not test mixed-mode/higher-order? But you think it would be relatively easy to extend what has been done to higher-order testing? And in terms of fwd support, the basic functions/pdfs are supported, but most of the advanced stuff isn’t (solvers, integrators, etc.)?

As @betanalpha pointed out there are also two things that need to be dealt with in Stan itself:

  • how to gracefully fail when you use something without forward mode support and
  • how to compile higher-order support only when needed, because it would be slow.

Any ideas/comments on these would be appreciated. Is there something else that needs to be done?


For this we should try to unify with MPI, threads, and GPU (though I think there are at least 2 slightly different ways of doing it in those 3 subsystems). Basically putting something like STAN_RHMC=true or CXXFLAGS += -DSTAN_RHMC in the makefile would work.
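To make this concrete, here is a minimal sketch of how such a flag could gate the expensive code, mirroring the existing STAN_THREADS / STAN_OPENCL pattern (the STAN_RHMC name just follows the suggestion above, and the header paths are illustrative and vary across math versions):

    // make/local (or CXXFLAGS) would define the macro:
    //   CXXFLAGS += -DSTAN_RHMC
    // and the C++ side would only pull in mixed-mode headers when it is set.
    #ifdef STAN_RHMC
    #include <stan/math/mix.hpp>  // fvar<var> / fvar<fvar<var>> support
    #else
    #include <stan/math/rev.hpp>  // plain reverse mode only
    #endif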

We should probably start with getting it off the ground. However, for RHMC it could well be that multi-threading is an absolute killer feature given that we can probably parallelize the higher-order derivatives (one thread per direction needed)?

Oh sorry, I just meant to mimic an existing makefile convention from one of those. We don’t have to refactor other stuff as part of RHMC.

One open question is how we decide to do higher-order autodiff. The usual approach would be to do two passes of first-order autodiff. This is what is currently lurking in Stan with fwd and rev mode. That said, @betanalpha is well under way with a geometric theory of higher-order autodiff, which prescribes several implementation techniques. (I’m still reading the draft he sent me, so I don’t have all the details yet.)

More ambitious, but I think implementing the latter could lead to better results. At least, it would be interesting to figure out whether it does.

Honestly, I would much more like to see RHMC in action sooner rather than later… unless the current plan to nest things is not feasible.

I would really hope that RHMC is able to sample smoothly through horseshoe prior problems, and I would much rather have that available sooner rather than later.

(but it is beyond me to judge which is the faster way to get this to work)


At some point we are due to sit down and rethink the autodiff architecture, especially moving towards higher orders. That shouldn’t necessarily obstruct any project like this, but we should keep in mind that there are many ways to proceed on higher-order methods.


Off topic: This NIPS paper about efficient higher-order autodiff is on the front page of HN today: http://www.matrixcalculus.org/matrixcalculus.pdf It makes some impressive performance claims (2 orders of magnitude on the CPU, doesn’t require multiple passes), but who knows how implementable it is.

This paper just argues that using array-valued* adjoints makes implementing matrix calculations in autodiff easier, which is well known and something we already take advantage of in some of our better-implemented matrix operations. They also only claim an extension to higher orders in a throwaway sentence without any useful discussion.

  • Beyond first order they are not tensors, just higher-dimensional arrays. One should be careful to not throw Ricci calculus around while getting the geometry vocabulary wrong!

Absolutely. The Hessian is \mathcal{O}(n^2), made up of n independent \mathcal{O}(n) operations. So yes, it’ll be embarrassingly parallelizable up to the dimensionality of the parameter vector.

var and fvar<var> work the same way—they both make a single forward pass to build an expression graph (in the former for the value, in the latter for a derivative w.r.t. some input), and then do a single reverse-mode pass to propagate the chain rule.

That’s a long way from having our math lib implemented.

Michael already had a previous improvement along these lines that you can see in the stan-dev/nomad repo. I’m not sure it was calling the proper higher-order autodiff in the speed tests vs. Stan; I think it may have just been forward mode all the way down, which imposes a nasty \mathcal{O}(n) penalty.
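To make the var / fvar<var> mechanics described above concrete, here is a minimal sketch (not taken from our test suite; the header path and input values are illustrative) of how one Hessian row falls out of a single forward pass with a nested reverse sweep:

    #include <stan/math/mix.hpp>  // mixed mode; path differs in older math versions

    using stan::math::fvar;
    using stan::math::var;

    int main() {
      // differentiate in the x1 direction: value 1.5, tangent 1
      fvar<var> x1(1.5, 1.0);
      fvar<var> x2(0.7, 0.0);

      // single forward pass builds the expression graph for f and df/dx1
      fvar<var> f = x1 * x1 * x2;

      // f.d_ is the directional derivative df/dx1 as a var; one reverse pass
      // over it propagates the chain rule and yields the Hessian row
      f.d_.grad();
      // x1.val_.adj() == d2f/dx1^2, x2.val_.adj() == d2f/dx1 dx2
      return 0;
    }

Repeating this for each of the n input directions gives the full Hessian, which is exactly the n independent sweeps mentioned above.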

Here’s the outstanding PR from @bbbales2. It does not actually include any forward mode testing, but solves the problem of testing variadic functions. The testing framework I have in place for the basic operators and arithmetic shows how to do the testing at all orders. These just need to be put together and then applied to all of our higher-order operations.

Just to summarize (@Bob_Carpenter and others, correct me if I’m wrong) - the main obstacles in the way of having RHMC in Stan are:

  • completing the higher-order AD tests so mixed-mode AD can be included in Stan and
  • providing an interface to Michael’s RHMC implementation and again testing everything.

But there is no reason why the existing approach to higher-order AD wouldn’t be appropriate? Even if it might not be the most efficient.

If that’s the case, we’ll have a serious go at this.

By the way, does anyone have a readily available and as simple as possible Stan model that uses the algebraic/ODE solver or anything else that is guaranteed to break higher-order AD?

Parallel autodiff will probably speed up the current mixed operations. RMHMC will still be slow as hell - it requires multiple eigendecompositions at each leapfrog step, which would make using RMHMC several times more expensive than using a GP model with the same number of parameters. Which is to say that the higher order doesn’t have to be the fastest possible - if we ignore compilation time, the cost of computing the derivatives will fairly quickly be swamped by the linear algebra cost.
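(A rough back-of-the-envelope version of that comparison, assuming a dense n \times n metric and the \mathcal{O}(n) per-sweep cost mentioned earlier in the thread: the Hessian via n autodiff sweeps costs \mathcal{O}(n^2), while each eigendecomposition costs \mathcal{O}(n^3), so as n grows the dense linear algebra rather than the derivatives sets the pace.)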


That’s actually our main goal - to see how much of that we can speed up on the GPU. Getting it all to work is just an intermediate step. :)

This is the primary obstacle. There are two paths forward, both of which assume the primitive instantiations (double, int) have been tested.

  1. We do exhaustive tests of each derivative instantiation:

    • (rev) rev
    • (fwd) fvar<double> and fvar<fvar<double>>
    • (mix) fvar<var> and fvar<fvar<var>>

    We only just figured out how to solve the variadic polymorphism problem this involves via @bbbales2’s active stan-dev/math PR. This needs to be combined with the pattern I used for testing the core C++ mathematical operators using the finite difference functionals. All of our derivative functionals like gradient() and hessian() are implemented with (a) forward only, (b) mixed, and (c) [seriously, Discourse?] finite differences. (A rough sketch of this kind of finite-difference check follows after this list.)

  2. We test fvar<double> against rev (or finite diffs) and then assume by induction that it will work elsewhere. This is sound as long as all the relevant function calls are implemented and correct. If they build, these will already have been tested by induction. I’m pretty sure @syclik agreed this would be an OK route to test. What’s slowed us down so much before is trying to write out all the mixed tests by hand. This only makes sense when there are not specializations of any functions for mixed mode—those would need to be independently tested and thus form a pretty strong argument against this approach, because specializing the mix implementations is where we’d get single-thread performance gains (which of course stack with parallelization often superlinearly because of memory locality improvements, etc.).
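Here is the rough sketch of the finite-difference check referenced in option 1 above - not the actual framework, just the shape of the comparison. The helper name, step size, and tolerances are all illustrative, and the header path varies across math versions.

    #include <stan/math/mix.hpp>
    #include <gtest/gtest.h>

    // f must be a functor with a templated operator() taking an Eigen vector,
    // the same requirement the hessian() functional already imposes
    template <typename F>
    void expect_hessian_near_finite_diff(const F& f, const Eigen::VectorXd& x) {
      double fx;
      Eigen::VectorXd grad;
      Eigen::MatrixXd H;
      stan::math::hessian(f, x, fx, grad, H);  // fvar<var> under the hood

      double h = 1e-4;
      for (int i = 0; i < x.size(); ++i) {
        for (int j = 0; j < x.size(); ++j) {
          Eigen::VectorXd xpp = x, xpm = x, xmp = x, xmm = x;
          xpp(i) += h; xpp(j) += h;
          xpm(i) += h; xpm(j) -= h;
          xmp(i) -= h; xmp(j) += h;
          xmm(i) -= h; xmm(j) -= h;
          // centered second difference for the (i, j) mixed partial
          double fd = (f(xpp) - f(xpm) - f(xmp) + f(xmm)) / (4 * h * h);
          EXPECT_NEAR(H(i, j), fd, 1e-4);
        }
      }
    }

The same loop structure works for the forward-only hessian() and for grad_hessian() at third order; the framework question is how to generate these comparisons automatically for every instantiation instead of writing them out by hand.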

This includes

  • creating service interfaces in stan-dev/stan, and
  • implementing these services with callbacks through the low-level user-facing interfaces (CmdStan, RStan, PyStan), and the real hard one,
  • a build system for the interfaces which does not require instantiating higher-order autodiff (in the C++ sense of “instantiate”—that’s what causes the object code to be optimized and assembled).
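As a minimal illustration of the instantiation point (the function here is hypothetical, not something in the math library): a template only turns into object code for the types it is actually used with, so an interface build that never touches fvar<var> never pays the higher-order compile cost.

    template <typename T>
    T neg_half_square(const T& x) {
      return -0.5 * x * x;
    }

    void demo() {
      // only the double instantiation is generated, optimized, and assembled
      double a = neg_half_square(1.0);
      (void)a;
      // a call like neg_half_square(fvar<var>(1.0)) somewhere in the build is
      // what would force the expensive higher-order instantiation
    }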

@betanalpha should be deciding what exactly needs to be tested on the algorithmic side.

It’s the standard approach to higher-order autodiff and it works. Stan’s implementation has been tested very thoroughly outside of multivariate matrix functions. Having said that, we want to change the fundamental behavior and

  • remove the NaN fiddling in the constructors

because it’s really slowing things down.

To see how everything works, look at the functional interfaces like the two forms of hessian() (one uses only forward mode and the other reverse nested in forward).
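As a quick usage sketch of that functional interface (the test functor and values are illustrative; if I recall the overload layout correctly, the double-specialized overload is the mixed-mode one and the scalar-templated overload is the forward-only one):

    #include <stan/math/mix.hpp>  // header path may differ by version
    #include <iostream>

    // simple test functor; templated operator() as the functionals require
    struct quad {
      template <typename T>
      T operator()(const Eigen::Matrix<T, Eigen::Dynamic, 1>& x) const {
        return x(0) * x(0) * x(1);
      }
    };

    int main() {
      Eigen::VectorXd x(2);
      x << 1.5, 0.7;
      double fx;
      Eigen::VectorXd grad;
      Eigen::MatrixXd H;
      // double arguments select the mixed-mode overload (reverse nested in forward)
      stan::math::hessian(quad(), x, fx, grad, H);
      std::cout << H << "\n";
      return 0;
    }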

Yup. In agreement.

Btw, the minimum level of testing I’d want before releasing is for each function:

  1. We test one case in which we know the function value and we verify we’re getting the correct value + derivatives.
  2. We test one error condition to verify that it errors in the way it should. (Meaning it throws the right exception, it has the correct message, it doesn’t seg fault.)

That won’t tell us that the implementation is great across the domain of inputs, but I think that’s the minimum level before I’d say we could release it.
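A minimal sketch of those two checks for a single function (test names, values, and the choice of exp / normal_lpdf are just for illustration, and the header path may differ by version; this isn’t proposed framework code):

    #include <stan/math/fwd.hpp>  // header path may differ by version
    #include <gtest/gtest.h>
    #include <cmath>

    TEST(fwdScalarFun, exp_value_and_derivative) {
      stan::math::fvar<double> x(1.3, 1.0);        // tangent 1 => d/dx
      stan::math::fvar<double> y = stan::math::exp(x);
      EXPECT_FLOAT_EQ(std::exp(1.3), y.val_);      // known value
      EXPECT_FLOAT_EQ(std::exp(1.3), y.d_);        // d/dx exp(x) = exp(x)
    }

    TEST(fwdScalarFun, normal_lpdf_rejects_bad_scale) {
      stan::math::fvar<double> sigma(-1.0, 1.0);   // invalid scale
      // errors the way it should: throws rather than seg faulting
      EXPECT_THROW(stan::math::normal_lpdf(0.0, 0.0, sigma), std::domain_error);
    }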

We’re going to think through the (software) integration issues, though. We will need to figure out how to conditionally compile with higher-order autodiff. I haven’t verified if we can compile with the new suggested Windows compilers. We couldn’t with the older g++ 4.6.3.

I have attempted to make RHMC work within cmdstan. My changes to stan and cmdstan are here https://github.com/bstatcomp/stan/tree/RHMC and here https://github.com/bstatcomp/cmdstan/tree/RHMC.

With these changes I managed to compile simple models such as the bernoulli and linear regression examples. However, in both of them the parameter values sometimes become NaN. In the bernoulli model this happens once in a few runs, while in the linear model sigma always increases until it reaches infinity.

Can someone (maybe @betanalpha) check if I did something stupid, or is this a problem with RHMC?

I’d actually want a bit more. I want to test the derivative versions vs. the base versions for value and exception behavior and with finite diffs for derivatives. What I think we can cut down is that if fvar<double> works and var works, then fvar<var> should work by construction. The bottleneck in the past has been testing all the higher-order mixed cases.

I think it’s inevitable that we won’t be able to fully support Windows—we just don’t have the resources. I think it’s OK to have things that only work on Linux/Mac OS X.

Cool! Thanks for sharing. I hope @betanalpha has some idea of what’s going on here. It would be nice to have a working RHMC if only to be able to compare it more broadly. I believe @betanalpha said there are some residual tuning issues that haven’t been fully fleshed out.

Hello everyone, I have been assigned the task (by @Erik_Strumbelj) of pushing forward with the higher-order AD testing. I’ve looked into @bbbales2’s code for testing variadic functions. As far as I can see there is still some work that needs to be done on his code, which has sparked a lively debate on GitHub in the past few days, mainly about how the parameters should be handled and about issues with compiling it on different OSes (I did manage to compile the tests on Windows). The next step, as @Bob_Carpenter mentioned, is to merge it with the framework that is already in place for the basic operators and functions.

My question is where I could help to make things go faster. I assume it is not yet feasible to start writing tests, since the framework is not yet fully defined?

Welcome! Before trying to guide you to what would be useful in getting this up and running, a few questions.

  • How good is your templated C++? (Does looking at template metaprogramming faze you? Does our code base or Eigen’s code base look like something you understand?)

  • How familiar are you with Stan’s automatic differentiation?

  • Are you familiar with terms like linking and translation units?

There really isn’t a debate happening. It’s more of prototyping toward a solution. We didn’t have a full spec before we started and we’re evaluating a prototype that @bbbales2 put together.

For us to be confident that higher order autodiff works properly, we have a few things we need to do:

  • test every math function (right now we don’t have confidence that forward mode implements the directional derivative correctly, even where we are able to compute the gradient properly)

  • find a way to build it properly on all the OSes we care about. This is not a solved problem.

Things that can be done immediately without waiting for a framework for testing:

  • add tests for higher order derivatives by hand. These would not be an idle effort. We would use these test cases when we have a framework. It would also give you a feel for what the test framework would require.

  • reimplement functions so they work with higher order AD or write template specializations.
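For the second item, here is a rough sketch of what an fvar specialization for a simple scalar function looks like (the function name is hypothetical; it just mirrors the pattern used throughout stan/math/fwd):

    #include <stan/math/fwd/core.hpp>

    // propagate value and tangent explicitly: d/dx x^2 = 2x
    template <typename T>
    inline stan::math::fvar<T> my_square(const stan::math::fvar<T>& x) {
      return stan::math::fvar<T>(x.val_ * x.val_, 2 * x.val_ * x.d_);
    }

Functions written this way work at any order because T itself can be another fvar or a var.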

Welcome! All our discussion about higher-order AD happens on these forums, so if you have any questions, please ask here (preferably in new threads that are specific).

I just realized this isn’t true. I think if we verified that fvar<var> works, then by construction, fvar<fvar<T>> should work.

I don’t think we were always careful with using the partials_return_type<T> metaprogram. We probably have functions that explicitly set the type of the adjoint term (in reverse mode) as double, which would be the wrong type for fvar<var> or fvar<fvar<T>>. Testing for just fvar<double> would mask this error.

Unfortunately, it wouldn’t be a compile-time error when running with fvar<var> since double can be promoted to var. We might be able to stop it if we cleverly stopped operands_and_partials from allowing that promotion. That might be worth it.

(@Davor_Sluga… let me know if you’re able to follow that technical conversation. I’m trying to estimate how familiar you are with the Math library so we can make rapid progress.)