Higher-Order Autodiff, testing framework & RHMC

Wow. While that is a lot of testing, it doesn’t actually test the implementation. It just tests the different error conditions, exhaustively. It doesn’t try to take the derivative or check the value of the function. So we know it won’t crash, but we don’t know if it’s a correct implementation.

I don’t think taking away the boilerplate code would have helped with the testing of this function. There should have been a little more guidance at the pull request stage.

For: grad_F32_test.cpp
That’s 30 different test cases in there. Each one is ~30 lines long. And you’re right, if we had the test framework in place, each test case would have been ~5 lines long instead (so maybe around ~150 lines total). But we could easily have used the existing Google Test framework to build that for this test.

This one has all the arguments being the same type, so this is pretty easy. (That’s how the function was written.) If it actually allowed different types, it’d be a lot harder.

Thanks for the thorough explanation. I definitely need to write a few test cases to get familiar with the overall approach. I tried to implement the test for F32, and it looks like it has some problems as you mentioned, or maybe I am missing something (also: we should probably move this discussion somewhere else). I wrote a test for the function value and it fails to compile with

error: no matching function for call to 'F32(stan::math::fvar<double>&, double&, double&, double&, double&, double&)'
  fvar<double> r = F32(a1,a2c,a3c,b1c,b2c,zc);

the test code being:

TEST(Fwd, F32) {
	using stan::math::fvar;
	using stan::math::F32;
	using stan::math::exp;

	fvar<double> a1(1.0, 1.0);
	fvar<double> a2(1.0, 1.0);
	fvar<double> a3(1.0, 1.0);
	fvar<double> b1(1.0, 1.0);
	fvar<double> b2(1.0, 1.0);
	fvar<double> z(0.6, 1.0);
	
	double a1c = 1.0;
	double a2c = 1.0;
	double a3c = 1.0;
	double b1c = 1.0;
	double b2c = 1.0;
	double zc = 0.6;
	
	fvar<double> r = F32(a1,a2c,a3c,b1c,b2c,zc);
	EXPECT_FLOAT_EQ(F32(1.0,1.0,1.0,1.0,1.0,0.6), r.val_);
}

meaning that it can’t be instantiated with different argument types, or at least that is how I understand it?

Next, I tried to tackle the distance function for the fwd mode; here is the full code. I would really appreciate some comments, to see if I’m on the right track. Well, this test also fails, the problem being the value of the derivative if you set both argument values to be equal. As I understand it, the explicit derivative d/dx1 of the distance function should be (x1-x2)/abs(x1-x2). This should result in NaN if you put the same value for x1 and x2, but the derivative comes out as 0. Why is that?

I think this is fine, unless we spend the whole time talking about F32, but I don’t think we will.

Yes, you’re missing something critical here. The function implemented is here:

If you take a look at the definition of the function:

template <typename T>
 T F32(const T& a1, const T& a2, const T& a3, const T& b1, const T& b2,
      const T& z, double precision = 1e-6, int max_steps = 1e5)

you’ll see that there’s only one template type, T, and all the arguments (besides precision and max_steps) are of type T. That means you can’t use fvar<double> for the first argument and double for the rest… they all have to be the same type.
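As a rough sketch (reusing the variables from your test), the constraint looks like this:

// all arguments fvar<double>: T deduces to fvar<double>, so this compiles
fvar<double> ok = F32(a1, a2, a3, b1, b2, z);

// mixed fvar<double> and double: there is no single T that matches, which is
// exactly the "no matching function for call to 'F32(...)'" error you saw
// fvar<double> bad = F32(a1, a2c, a3c, b1c, b2c, zc);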

Would it be possible to put this on GitHub so I can make inline comments? It’s hard to do this over pastebin.

You’ll have to dig through the implementation.

For forward mode, this starts here:

Inside dot_product, you see that there’s just element-wise multiplication. 0 * 0 has a derivative of 0.
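A tiny sketch of why that comes out as 0 (assuming the usual fvar product rule, (u * v).d_ = u.d_ * v.val_ + u.val_ * v.d_):

using stan::math::fvar;

fvar<double> u(0.0, 1.0);   // value 0, tangent 1
fvar<double> v(0.0, -1.0);  // value 0, tangent -1
fvar<double> w = u * v;     // w.val_ == 0; w.d_ == 1.0 * 0.0 + 0.0 * (-1.0) == 0

Once the element-wise differences are all zero, the tangents get wiped out no matter what they were.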

So… that’s the implementation.

The question you’re asking is whether that’s the correct implementation; that’s a good question to be asking. At discontinuous points, we have to make a decision. Do we go with NaN, 0, inf, or something else that may or may not make sense?

I don’t remember what we decided here. It really would have helped to have more documentation in the function or a test that shows the behavior on this boundary, but we don’t have that. Once upon a time, I think we decided that at single discontinuities, in some cases, it makes sense to return a finite derivative (like 0) that won’t mess up sampling. I think this is a case where that might hold. In other cases, throwing an exception makes sense, but here, I think it’s reasonable to want to take the distance of two identical vectors and get 0. I don’t think we really want the derivatives / gradients to be destroyed at that point just because duplicate vectors were passed in. So 0 makes sense here?

It’s an open question, though. Thoughts?

Would you consider a template default policy as an answer?

Yes, but I don’t think it’s worth the maintenance. I could be convinced otherwise.

I think we could be explicit and say what happens at that boundary.

Thanks for the answers.

Yeah, I see that now, but how do you then test the directional derivatives for each parameter, if you can’t vary the argument type?

Sure thing, I put the code on a github repo.

Yes, I did that, but the implementation of the distance between vectors is different from the distance function between scalars, which just takes the absolute value of the difference, and I was testing the latter in forward mode: https://github.com/bstatcomp/math/blob/add_tests/stan/math/prim/scal/fun/distance.hpp

@Davor_Sluga: sorry about the delay.

You have to run the function many times. You set the tangent to 1 for the derivative you want (everything else 0).
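For example (a sketch, not the exact test code we’d want), to get dF32/da1 you promote everything to fvar<double> and give only a1 a nonzero tangent:

using stan::math::fvar;
using stan::math::F32;

fvar<double> a1(1.0, 1.0);  // tangent 1: this is the derivative we're asking for
fvar<double> a2(1.0, 0.0);
fvar<double> a3(1.0, 0.0);
fvar<double> b1(1.0, 0.0);
fvar<double> b2(1.0, 0.0);
fvar<double> z(0.6, 0.0);

fvar<double> r = F32(a1, a2, a3, b1, b2, z);
// r.val_ is the function value; r.d_ is dF32/da1 at this point.
// Repeat with the 1 moved to a2, a3, ... to get the other directional derivatives.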

Thanks. Looks pretty good. It might help to use better variable names inside the test. And cosmetic, but it matters: use spaces instead of tabs.

The functionals we implement, like jacobian, hessian, gradient, etc., can be helpful to see how to compose all these autodiff operations. They do everything in a safe, encapsulated way.
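For instance, I believe the gradient functional usage looks roughly like this (a sketch; toy_fun is just a made-up example functor, so check the header for the exact signature):

#include <stan/math.hpp>
#include <Eigen/Dense>

// Toy functor: f(x) = x[0]^2 + 3 * x[1]. It only needs a templated
// operator() so the functional can instantiate it with autodiff types.
struct toy_fun {
  template <typename T>
  T operator()(const Eigen::Matrix<T, Eigen::Dynamic, 1>& x) const {
    return x(0) * x(0) + 3 * x(1);
  }
};

int main() {
  Eigen::VectorXd x(2);
  x << 1.0, 2.0;
  double fx;
  Eigen::VectorXd grad_fx;
  // fills fx with the value and grad_fx with the reverse-mode gradient,
  // recovering the autodiff stack afterwards
  stan::math::gradient(toy_fun(), x, fx, grad_fx);
}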

@syclik sure thing, I’ll fix my editor settings. Btw, I made a script which goes through all prim, rev, and fwd functions (scalar and matrix), checks which modes each function was implemented for, and checks whether there is a test written for each case. If the function or test exists, it prints out SLOC (source lines of code), so I have some indication of the complexity of each function/test. This way we can quickly see what is missing and what is bloated. Here is the output.

Cool. That looks super useful. A lot of the checks for these functions are in the mix directory and test all of the instantiations other than double. That includes all the core operators and almost all of the unary vectorized functions.

@Davor_Sluga: how’s it going? Were you waiting on me for something? Btw, you can find me on the Stan slack channel if you need quicker interaction: Mc-stan community slack

Slowly, because I haven’t had enough time on my hands for this. I wrote a few tests and I’ll open an issue on GitHub in the next few days and put my code up; I’d be grateful if you could check it. Great, I’ll probably ping you here and there on Slack.

Great! If I don’t respond quickly, feel free to keep pinging me. I want to make sure I don’t miss this.

An update on our efforts to get RHMC into Stan.

@henine has been running experiments with RHMC with the Softabs metric and the symplectic integrator from the Softabs paper. Our experience so far is that the two steps of the integrator that require root solving are often very unstable unless we set the regularization parameter of the metric relatively high. But then the exploration is not very efficient.

In other words, the regularization parameter is difficult to tune. Things do not improve if Newton’s method is used instead of fixed-point iteration.

A couple of questions (@betanalpha, I’m hoping you can shed some light on this):

  1. Is this kind of behavior to be expected and, if so, is this approach even viable for general-purpose use in Stan?

  2. It seems like we need a better integrator. Any ideas? What if we replace the integration scheme with one where all the steps (or at least the steps that involve the log probability) are explicit? There seem to be some useful results on explicit integration of non-separable Hamiltonians: https://journals.aps.org/pre/abstract/10.1103/PhysRevE.94.043303

@bbbales2, @Bob_Carpenter mentioned that you might be interested in this topic, so I’m pinging you as well.

Ah! Yes, thanks for pinging. I ran a couple experiments recently too. Summoning @avehtari as well.

Yeah, that’s my experience too. The fixed point iterations are rough. Making them more stable is probably gonna be pretty problem/metric dependent.

I think there are mechanisms to make things like fixed-point methods work better: https://epubs.siam.org/doi/10.1137/10078356X. Given that the name is “acceleration,” I don’t know whether it does much for the stability of the solve. I haven’t tried it. If you guys get the time and inclination, I’d be curious whether it does anything.

I was reading some stuff about these implicit solves (Hairer). I think when the solve fails (because of a bad initial guess or a questionable method), the book called it a divergence. The fun thing here, which I think @betanalpha said, is that the implications of one of these solves failing are different from the sort of thing we call a divergence. The things we call divergences can happen even when the solves are going well. So there are divergences that are messing up our divergences? Hooray!

Anyway I ended up changing the metric around a bit so I could do Newton solves easily. I’m surprised both that you figured out how to do Newtons with softabs and that they didn’t help. How’d you do the Newtons?

There are also these line search and trust region methods that are used to stabilize these sorts of solves. My buddy @arya did some stuff with implicit integrators in HMC and these were important for him.

In these solves, you don’t try to do the full implicit hop. You break things down into smaller hops. It’s usually pretty easy to tell when things are blowing up so you can detect and keep decreasing your hop size.

You’re still solving the same eqs in the end, it just takes longer. The issue with these is even allowing super super super tiny steps (2^{-10} \epsilon), sometimes things still just explode anyway.
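A rough sketch of the kind of loop I mean (not real Stan code; implicit_update and state are stand-ins for whatever fixed-point/Newton solve does a single hop):

double h = epsilon;       // start by trying the full hop
double t = 0.0;
while (t < epsilon) {
  // hypothetical single implicit update of length h; returns false if the solve blows up
  bool ok = implicit_update(state, h);
  if (ok) {
    t += h;               // made progress toward a full step of length epsilon
  } else {
    h *= 0.5;             // shrink the hop and retry
    if (h < epsilon / 1024.0)   // even ~2^-10 * epsilon sometimes still explodes
      break;              // give up and flag this trajectory
  }
}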

I want to swap a few around and see what they do. The difficulty for me now is it’s been hard to understand the nuts and bolts of what the integrator is doing when it’s wrapped up in Stan’s stuff. @arya wrote a package in R for doing HMC/NUTS outside of Stan that he used to investigate the behavior of implicit midpoint that should be useful here.

There might be some retooling to do – I dunno. I need to look into this. I’ll keep you (and @henine) posted. Of course, if you want to dig into it yourself, hit up @arya directly; I’m sure he’ll point you to where it’s at.

The book on symplectic integrators is https://www.cambridge.org/us/academic/subjects/mathematics/computational-science/simulating-hamiltonian-dynamics?format=HB . It’s really accessible and concise. You should definitely grab that.

Back to the metric, the thing I’ve been trying to work with is a metric whose orientation is fixed and whose scale changes according to the curvature of the log density in each of those fixed directions. So kinda writing it out like an eigendecomposition (for a 2x2):

M^{-1}(q) = [v_1, v_2] \begin{bmatrix} -v_1^T \nabla^2 L(q) v_1 & 0 \\ 0 & -v_2^T \nabla^2 L(q) v_2 \end{bmatrix} [v_1, v_2]^T

There’s no math going in to justify that. I’m just enforcing it. In the odd case that the posterior did have an eigendecomposition where the directions of changing curvature were fixed it’d be okay as long as you can identify those directions (I do this in adaptation by computing the actual Hessian at a random point).

Anyway, you can compute the gradients of this pretty easily because the [v_1, v_2] matrix is constant, and the terms like v_1^T\nabla^2 L(q) v_1 can be computed with two fvars or a finite difference, so you can get the gradients with a reverse mode. You hit each of those with a softabs first to keep ’em pointed in the right direction, though.
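The reason two nested fvars (or a finite difference) are enough, as far as I understand it, is that each of those terms is just a second directional derivative of the log density along a fixed direction:

v_i^T \nabla^2 L(q)\, v_i = \left.\frac{d^2}{dt^2} L(q + t\, v_i)\right|_{t=0}

So pushing q + t v_i through nested forward mode (with tangent v_i at both levels) gives that scalar, and wrapping the whole thing in reverse mode gives its gradient with respect to q.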

For P parameters, it takes P reverse modes, so it’s as expensive as the softabs metric. The cost is you don’t change directions, the advantage is you can work with the terms that involve derivatives of the metric very easily and do things like Newton solves.

Anyway, not sure how useful that assumption is or not. Certainly in the current state of things, I can get it to run centered eight-schools and all but the results are not encouraging in terms of Neff/sample, so I probably have buggy wuggies somewhere. There are some others I’d like to try, but the integrator has been such a hassle that I want to break it out and examine it more closely.

I’m not even sure there isn’t a bug in my current implementation of the integrator. Just need to look at these things step by step and think and maybe switch solve types and such.

I’m also glad someone else hit the same issues with softabs lol. Makes me feel better about not missing something, though I 'spose we both could be.

I can try to push my code up to a repo later if you guys want to see it, but it’s kind of a mess.


Are you using the code already in Stan or writing your own?

There are three interacting scales here – the root-finding length scale, the symplectic integration length scale, and the SoftAbs metric length scale. One of the biggest research problems in moving forwards with something like HMC with a Riemannian-Gaussian disintegration (in other words RHMC…) is understanding that interaction in order to be able to tune them all properly. On top of that there’s the issue of understanding how to diagnose poor tuning and isolating which of those scales is the problematic one.

This is why I was giving the ominous warnings earlier.

Explicit integrators for these systems are not going to be viable. They’re far too unstable even if they preserve the symplectic structure exactly, and any compromise of that symplectic structure leads to immediate problems. The implicit integration step is really necessary to stabilize the numerical approximation in a principled way, but then you have to deal with the failure of the fixed-point iterations, as you’ve seen.

There’s a reason why I haven’t been pushing this hard for some time!


I stuck my nose in multiple books on multivariate derivatives and produced what I think are derivatives of the SoftAbs related functions. (I would hesitate to claim that they were entirely correct, but I did verify them using finite differences.)

I was testing it on a simple linear regression model with unknown variance, which produced areas where one of the eigenvalues of the Hessian became 0 and exploded the function values in the Newton method. Further examination revealed function shapes that look a lot like those found in papers on functions that are pathological for Newton. Playing with the SoftAbs parameter can help soften the problems there, but I couldn’t get the sampler to behave unless I regularized SoftAbs into what is essentially a constant diagonal metric (even worse than diag_e).

As a side note, there also appears to be a potential for multiple roots (multiple fixed points). I’m not sure how to interpret that, but I would assume it doesn’t stand up to a physical interpretation.

For now I am, instead, focusing on explicit integrators.

I started out with the implicit leapfrog solver in Stan. I modified it with Newton to not much gain.

I’m not sure I see the relation between implicit integrators and scale robustness. The iterative nature of implicit integrators’ internals may help with adjusting to the offsets and scaling, but I don’t think that necessarily precludes explicit (or at least mostly explicit) integrators.

The biggest problem I see with fixed-point iteration is that it has some fairly strict convergence requirements, which, I feel, would be hard to guarantee for models in general (even if they could be hammered into shape for the momenta).

Either way, from what I can currently see, Newton and its derivatives will probably not be viable for arbitrary models, unless we find a very good method of choosing initial values, and possibly constraints, that make sure it always converges to the correct root and that it even has a path to the correct root.

All of the SoftAbs derivatives are described in [1212.4693] A General Metric for Riemannian Manifold Hamiltonian Monte Carlo.

Multiple fixed points occur when the step size is too big. Hairer et al has lots of discussion of the convexity requirements for implicit integrators, although some of it is in the implicit Runge-Kutta section rather than the symplectic integrator section.

Explicit integrators are great, but only if the component Hamiltonians are well-behaved. Otherwise the modified Hamiltonian ends up too unstable for reasonable step sizes. Unfortunately, in the case of non-separable Hamiltonians, the explicit integrators that one can derive are not particularly well-behaved – the component updates require exponentials of the momenta, which very quickly get out of hand.

For example see this preliminary version of the original RHMC paper, [0907.1100] Riemannian Manifold Hamiltonian Monte Carlo. This early version worked out an explicit integrator but it ended up being way too unstable to use in practice and the authors used an implicit integrator for the final version of the paper.

What Stan flags as a divergence is when the Hamiltonian is not conserved by the leapfrog integrator. In other words, the solve of the Hamiltonian dynamics is failing.

I believe that Ben was referring to the solves internal to each implicit integrator update. The idea is that the shadow Hamiltonian of the implicit integrator may be well behaved, so the corresponding numerical trajectories are fine and non-divergent, but you can’t actually implement those numerical trajectories in practice because the implicit integrator updates can’t be realized numerically.

This is the challenge with implicit integrators – in some sense you’re improving stability, and the behavior of the shadow Hamiltonian, by moving curvature into the implicit updates. That’s great if you can solve the implicit updates exactly, or close enough numerically. But what does it mean when the implicit updates fail?