# AutoDiff & min/max functions

Hi all,

That is my first post there and I really like Stan :-)

Just a few remarks on AutoDiff min/max implementations. It is a topic I already have ruminated some time ago, my blog post.

Autodiff min/max implementations

IMHO the fvar code for min/max:

`````` template <typename T>
inline fvar<T> fmax(const fvar<T>& x1, const fvar<T>& x2) {

using ::fmax;

if (unlikely(is_nan(x1.val_))) {

if (is_nan(x2.val_))

return fvar<T>(fmax(x1.val_, x2.val_), NOT_A_NUMBER);

else

return fvar<T>(x2.val_, x2.d_);

} else if (unlikely(is_nan(x2.val_))) {

return fvar<T>(x1.val_, x1.d_);

} else if (x1.val_ > x2.val_) {

return fvar<T>(x1.val_, x1.d_);

} else if (x1.val_ == x2.val_) {

return fvar<T>(x1.val_, NOT_A_NUMBER);

} else {

return fvar<T>(x2.val_, x2.d_);

}

}
``````

can be greatly simplified:

``````template <typename T>
inline fvar<T> fmax(const fvar<T>& x1, const fvar<T>& x2)
{
using ::fmax;

if(x2.val_ == fmax(x1.val_, x2.val_))
{
return fvar<T>(x2.val_, x2.d_);
}

return fvar<T>(x1.val_, x1.d_);
``````

}

I have added some tests to explicitly check the behavior toward NaN values:

using stan::math::fvar;

fvar x(stan::math::NOT_A_NUMBER, stan::math::NOT_A_NUMBER);
fvar y(3.0, 4.0);

fvar a = fmax(x, y);
EXPECT_FLOAT_EQ(3.0, a.val_);
EXPECT_FLOAT_EQ(4.0, a.d_);
}

using stan::math::fvar;

fvar x(1.0, 2.0);
fvar y(stan::math::NOT_A_NUMBER, stan::math::NOT_A_NUMBER);

fvar a = fmax(x, y);
EXPECT_FLOAT_EQ(1.0, a.val_);
EXPECT_FLOAT_EQ(2.0, a.d_);
}

using stan::math::fvar;

fvar x(stan::math::NOT_A_NUMBER, stan::math::NOT_A_NUMBER);
fvar y(stan::math::NOT_A_NUMBER, stan::math::NOT_A_NUMBER);

fvar a = fmax(x, y);
EXPECT_TRUE(stan::math::is_nan(a.val_));
EXPECT_TRUE(stan::math::is_nan(a.d_));
}

Also note that the implementation:

if(x2.val_ == fmax(x1.val_, x2.val_))

insures the arbitrary, but documented, behavior

min(a,b) returns b if a.val=b.val

Another thing concerning fvar constructors:

If one want to use fvar in a nested way (for forward mode n-order diff), IMHO the constructors have to be fixed:

`````` template <typename TV, typename TD>
fvar(const TV& val, const TD& deriv)
: val_(val), d_(deriv) {
if (unlikely(is_nan(val)))
d_ = val;
}
``````

must be modified as follow:

`````` template <typename TV, typename TD>
fvar(const TV& val, const TD& deriv)  : val_(val), d_(unlikely(is_nan(val)) ? val : deriv)
{}
``````

This is not equivalent, in one case the op= is used, in the second one, constructors are truly called recursively (propagating NaN values for instance).

Reverse mode

I have not touched this part as I am not enough famialiar with the code, however I think the same simplifications hold.

Avalaibility

I have created a branch fminfmax_fwd_code_cleaning in my GitHub repo, runTests.py is OK

Bibliography

A new framework for the computation of Hessians
I did not have the time to examine the details yet but the method looks very interesting for 2nd order reverse mode computations. I have seen a toy implementation there (not tested): https://github.com/BachiLi/had

Regards,
Vincent P.

Hi Vincent,

First: awesome! Great blog post. When we started this project, we learned a lot of the differences with math on paper and floating point math the hard way.

I’m going to add you as a collaborator on the math repo. It’s a little easier dealing with branches directly on the math repo instead of forks because we test upstream dependencies. With forks, the git hash doesn’t exist on the stan-dev owned repo, so tests fail and we need to do some work by hand. You’ll be able to directly branch and push branches to the repo.

Regarding fmin and fmax: great! I think Bob may have simplified a little bit of it with the latest pull request that got merged (for removing use of `math.h`), but I don’t recall exactly what happened to fmin and fmax. Feel free to make a pull request out of this. By the way, we’ve stopped using `math.h` in our math library. We were having to branch for MSVC, which was making the behavior different with that compiler. Bob’s done the tedious work of removing our last uses of it in the math library. See: https://github.com/stan-dev/math/issues/370

One last point: we haven’t really been using forward-mode autodiff. If you want to help with that, we’d love any help we can get. Our reverse-mode is tested (can’t determine how well or poorly), but our forward-mode isn’t fully functioning. Doing exactly what you did is very useful to the development of Stan! Thank you!

Thank you for your comment Daniel, I am happy if it can be useful.

I do not think I will be able to contribute a lot to the stan-dev/math repo the next few months because I m quite busy for my work at least until december :-(. However I will try to get more familiar with the code.
I will do a pull request for my fvar code cleaning, feel free to see if it is ok or not for merging.

Concerning AutoDiff, I think the top priority must be the reverse mode that allows efficient computation of the gradients. I am not familiar with the reverse mode implementation used by Stan, but most of the time, the frameworks allowing reverse mode computations also allow forward mode ones (mainly useful to compute directional derivatives and the main application I see is to solve nonlinear system of equations with Hybrid Newton-GMRES like methods).

My point of view is:

• purely forward mode computation allows an “easy” implementation of n-order derivatives, but IMHO can not be really efficient as soon as we have more than some dozens of variables: so it is risky to spend too much time there unless you have identified an application requiring only few variables.
• reverse mode is OK and effective for first-order derivatives but is tricky for higher orders. It is a critical component that must be optimized. For 2nd order I have great hopes for the method I cited in the previous message (A new framework…). However I have to understand the details and to test it.

Vincent

Thanks for what you’ve done already. I like your update for the fvar
constructor, but the fmin/fmax change won’t provide the same results
because of the way the original code provides NaN tangents if any of
the values are NaN. It means our testing is insufficient if your changes
pass. I’m not sure that was the right design for this function, but
that’s why it’s defined the way it is.

As to plans and focus, we’ve spent much more time on reverse mode than
forward, especially for vectorized densities. You can check out our
arXiv paper for more details:

https://arxiv.org/abs/1509.07164

We’ve been focusing on matrix derivatives most recently.

We need Hessians for some of our algorithms and we will be
getting that with reverse nested inside forward. At least
it’s better than finite differences. We even need third-order
derivatives for Riemannian Monte Carlo, for which we use
fvar<fvar > types.

• Bob

Hi Bob,

You are right, I had not noticed the change tangent = NaN if one of a or b is already NaN.
I can confirm that ./runTests.py test/unit/ passes with my modifications; hence a test is missing to check this behavior.
My point of view here is to consider fmax(a,b) as a switch that forward a or b without any further modification. For the moment I do not see any pros or cons concerning these two approaches. Maybe mine is simpler.

Thank you for the clarification concerning reverse nested inside forward, I understand your point.
What is not clear for me is what is the most effective way to compute these derivatives, but it is only for personal investigations. (I just realized that my link to the article I cite is not free, I added some free links at the end)

Vincent

vincent-picaud Developer
October 4
Hi Bob,

You are right, I had not noticed the change tangent = NaN if one of a or b is already NaN.
I can confirm that ./runTests.py test/unit/ passes with my modifications; hence a test is missing to check this behavior.
My point of view here is to consider fmax(a,b) as a switch that forward a or b without any further modification. For the moment I do not see any pros or cons concerning these two approaches. Maybe mine is simpler.

Thank you for the clarification concerning reverse nested inside forward, I understand your point.
What is not clear for me is what is the most effective way to compute these derivatives, but it is only for personal investigations.

It depends on a bunch of factors: will they be reused?
do you have essentially free parallelism up to the dimensionality
of the problem? what is the sparsity pattern? how much memory
do you have relative to the size of all the partials in the computation
graph?

(I just realized that my link to the article I cite is not free, I added some free links at the end)

Vincent

Hessian Matrices via Automatic Differentiation
Higher-order Reverse Automatic Differentiation

The latter’s a tricky business. We’ve done some experimentation
with something like that in the nomad repo on stan-dev. It’s about
an order of magnitude faster than reverse nested inside of forward,
but requires more memory.

• Bob