Precomputed Jacobians or vector-Jacobian products?

I had previously discovered precomputed_gradients for a function such as real foo(vector theta), and now I am looking to generalize to more complex arguments (vector theta, real a, matrix b); if I understand the corresponding test correctly, precomputed_gradients(val, _, _, ops, grads) can be used for that. However, what should one do if the output is not a scalar, e.g. vector foo(vector theta) or matrix bar(vector theta)? For instance, the inner time-stepping loop of an SDE model calls a function such as

vector diff_coupling(matrix weights, vector x) {
  vector[rows(x)] out = rep_vector(0, rows(x));
  for (i in 1:rows(x))
      out[i] = weights[, i]' * (x - x[i]);
  return out;
}

which generates, I guess, at least rows(x) AD graph nodes, while the vector-Jacobian product for the vector x argument is just a vector-matrix multiply. From reading the Stan Math paper, it seems the only choice is then to implement a vari with a chain() that applies the vector-Jacobian product, and then instantiate it in a function in the model header. Are there any shortcuts like precomputed_gradients for this, or is that just how it works?
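Concretely (my own working, so worth double-checking), writing $W$ for weights, $c_i = \sum_j W_{ji}$ for its column sums, and bars for adjoints, the function is $\mathrm{out}_i = \sum_j W_{ji}(x_j - x_i)$, so

$$\frac{\partial\,\mathrm{out}_i}{\partial x_k} = W_{ki} - \delta_{ik}\,c_i, \qquad J = W^\top - \operatorname{diag}(c), \qquad \bar{x} \mathrel{+}= J^\top\,\bar{\mathrm{out}} = W\,\bar{\mathrm{out}} - c \odot \bar{\mathrm{out}},$$

i.e. one matrix-vector product plus an elementwise product per call.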

A related, less concrete question is whether it is also worth coding these partials by hand for time-stepping models, to improve performance (especially when there are many time steps). Bugs aside, I would assume so, but then I went looking at the source for gaussian_dlm_obs_log and saw that it uses AD as well. My guess is this is OK because the time spent in Eigen vector and matrix arithmetic dominates the pointer-chasing time in AD (or the C++ compiler optimizes all that away?), but it would be nice to know if this guess is off, and how to tell when it’s worth reimplementing in C++ with AD versus reimplementing in C++ with hand-coded partials. Thanks in advance for any advice.


There are better alternatives for that. Look for the reverse pass callback facility in Stan Math. The softmax function uses it, so head there. This is very new stuff, and others may be able to give you more help getting started.
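The general shape is a forward pass on plain values plus a lambda that updates the adjoints. A minimal toy sketch of my own (not the actual softmax source, and square_rpc is a made-up name):

#include <stan/math/rev.hpp>

// Toy example of the reverse_pass_callback pattern: y = x^2 with the
// partial dy/dx = 2x applied by hand.
inline stan::math::var square_rpc(const stan::math::var& x) {
  stan::math::var y = x.val() * x.val();  // forward pass on plain doubles
  stan::math::reverse_pass_callback([x, y]() mutable {
    x.adj() += 2.0 * x.val() * y.adj();   // reverse pass: apply the chain rule
  });
  return y;
}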


I agree with @wds15. Use reverse_pass_callback. There is a blog post by @stevebronder about this: https://blog.mc-stan.org/2020/11/23/thinking-about-automatic-differentiation-in-fun-new-ways/

Take a look at softmax or inverse for functions with one argument, though in your case you will have more than one argument so maybe take a look at multiply.

For Stan Math functions with more arguments we need to cover all cases [(arg1 data, arg2 var), (arg1 var, arg2 data), (both var)]. If you are writing a custom C++ function you have a pretty good idea which arguments are going to be var, so there is no need to cover the other cases.


@maedoc yeah, the doc isn’t ready for this stuff yet. You’ll want to use the develop version of Math if you’re going to work on this, because it changes rapidly (for the better).

The API doc is here and it’s updated to develop I believe: https://mc-stan.org/math/. So any time you’re curious about a weird looking function, check there. That’ll get you the doxygen in the code.

Here’s something like your function. I didn’t check it or finish it, but you can start here:

template <typename T1, typename T2,
          require_matrix_t<T1>* = nullptr,
          require_col_vector_t<T2>* = nullptr,
          require_any_st_var<T1, T2>* = nullptr>
auto myfunc(const T1& m, const T2& v) {
  using inner_ret_type = Eigen::VectorXd;
  using ret_type = return_var_matrix_t<inner_ret_type, T1, T2>;
  if (!is_constant<T1>::value && !is_constant<T2>::value) {
    arena_t<promote_scalar_t<var, T1>> arena_m = m;
    arena_t<promote_scalar_t<var, T2>> arena_v = v;
    arena_t<ret_type> ret = arena_m.val() * arena_v.val() - (arena_v.val().array() * arena_m.val().rowwise().sum().array()).matrix();
    reverse_pass_callback([ret, arena_m, arena_v]() mutable {
      arena_m.adj() += ... something;
      arena_v.adj() += ... something;
    });
    return ret_type(ret);
  } else if (!is_constant<T1>::value) {
    arena_t<promote_scalar_t<var, T1>> arena_m = m;
    arena_t<promote_scalar_t<double, T2>> arena_v = v;
    arena_t<ret_type> ret = arena_m.val() * arena_v - (arena_v.array() * arena_m.val().rowwise().sum().array()).matrix();
    reverse_pass_callback([ret, arena_m, arena_v]() mutable {
      arena_m.adj() += ... something;
    });
    return ret_type(ret);
  } else if (!is_constant<T2>::value) {
    arena_t<promote_scalar_t<double, T1>> arena_m = m;
    arena_t<promote_scalar_t<var, T2>> arena_v = v;
    arena_t<ret_type> ret = arena_m * arena_v.val() - (arena_v.val().array() * arena_m.rowwise().sum().array()).matrix();
    reverse_pass_callback([ret, arena_m, arena_v]() mutable {
      arena_v.adj() += arena_m.val().transpose() * ret.adj_op(); // Maybe?
    });
    return ret_type(ret);
  }
}
  1. The first new thing is the requires. Instead of typing types explicitly we use C++ template SFINAE-sorta stuff to accept arguments. This is because for any argument there are a ton of different types that work. The first here says, “Make sure the first argument is a matrix”. The second says “Make sure the second is a column vector”. The last says, “Make sure at least one is an autodiff type (has scalar type var)”. Search for these in the upper right of https://mc-stan.org/math/

  2. return_var_matrix_t – there are actually two types of autodiff matrices in Stan now (design doc for the second one here), and this picks the right return type given the types of the input autodiff variables.

  3. There are a lot of arena_t<T> expressions in the code. arena_t<T> says “give me a variable equivalent to the type T that is stored in the autodiff arena”. This is used to save variables in the forward pass that are needed in the reverse pass.

  4. Because your function has two arguments and requires at least one of them to be an autodiff type, you need to handle the three combinations of double/autodiff types. That’s what the if/else stuff with is_constant is handling (is_constant<T1>::value == true means T1 is not an autodiff type).

  5. reverse_pass_callback takes a lambda argument. This function will be called in the reverse pass. It’s the equivalent of chain, but thanks to lambda captures it’s way easier to write. Capture everything by copying; arena_t types are cheap to copy by design. The code inside this is responsible for incrementing the adjoints of the input variables (which you saved copies of in arena_m and arena_v). The mutable keyword is required on the lambda so that you can write to the adjoints (otherwise you’ll get an error about the adjoints being const).

  6. There are a lot of accessors .val() and .val_op() and .adj() and .adj_op() that are Stan extensions to Eigen types (so you won’t find them in the Eigen docs). val() and adj() give you the values and adjoints of a Stan matrix type. Sometimes with multiplications you’ll get compile errors, and in that case use .val_op() and .adj_op() (I’m not sure of the difference, and hopefully this gets cleaned up because it’s confusing).

  7. The way we test these things is the expect_ad function in test/unit/math/mix. The assumption of expect_ad is that you’ve written a double-only version of your function that you trust. It compares finite-difference derivatives to the ones computed with autodiff, up through third derivatives I think. The tests for elt_multiply (elementwise multiplication) are here; look at them as an example. expect_ad_matvar tests the new matrix autodiff type (there’s a sketch of such a test after this list).

  8. Here’s a recently completed new function as an example of all this
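Following up on point 7, here is a rough sketch of what such a test could look like (hypothetical names; it assumes a plain double overload of myfunc also exists, since expect_ad finite-differences a trusted double version):

#include <test/unit/math/test_ad.hpp>

TEST(mathMixFun, myfunc) {
  auto f = [](const auto& m, const auto& v) { return stan::math::myfunc(m, v); };
  Eigen::MatrixXd m(2, 2);
  m << 1.0, 2.0, 3.0, 4.0;
  Eigen::VectorXd v(2);
  v << 0.5, -1.3;
  stan::test::expect_ad(f, m, v);         // checks var/fvar combinations against finite differences
  stan::test::expect_ad_matvar(f, m, v);  // checks the new var_value<Eigen::MatrixXd> type
}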

I would start with a one-argument version of your function and work from there. Just set the vector or the matrix equal to a constant inside the function. The one-argument versions of this are much simpler. The multiple-argument handling adds some complexity that isn’t so bad once you know what it is, but can be pretty annoying to work through.
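For instance, here is my untested attempt at a one-argument sketch of diff_coupling along those lines, following the original Stan definition out[i] = weights[, i]' * (x - x[i]): weights enters as plain data so x is the only autodiff argument, and the adjoint update uses the Jacobian J = W' - diag(c) with c the column sums of W (double-check the algebra before trusting it):

#include <stan/math/rev.hpp>

namespace stan {
namespace math {

// One-argument sketch: weights is data, x is the only autodiff argument.
// Forward pass:  out = W' * x - c .* x, with c = column sums of W.
// Reverse pass:  x.adj += J' * out.adj = W * out.adj - c .* out.adj.
inline Eigen::Matrix<var, Eigen::Dynamic, 1> diff_coupling(
    const Eigen::MatrixXd& weights,
    const Eigen::Matrix<var, Eigen::Dynamic, 1>& x) {
  using vec_v = Eigen::Matrix<var, Eigen::Dynamic, 1>;
  // arena copies of everything the reverse pass needs
  arena_t<Eigen::MatrixXd> arena_w = weights;
  arena_t<Eigen::VectorXd> arena_c = weights.colwise().sum().transpose();
  arena_t<vec_v> arena_x = x;
  // forward pass on plain doubles
  arena_t<vec_v> ret = weights.transpose() * arena_x.val()
                       - (arena_c.array() * arena_x.val().array()).matrix();
  reverse_pass_callback([ret, arena_w, arena_c, arena_x]() mutable {
    Eigen::VectorXd out_adj = ret.adj();
    arena_x.adj() += arena_w * out_adj
                     - (arena_c.array() * out_adj.array()).matrix();
  });
  return vec_v(ret);  // convert back; don't return the arena_t itself
}

}  // namespace math
}  // namespace stan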


Thanks for the replies! Marking @bbbales2’s answer as the solution, as it anticipates several other questions I had in mind after grepping around in the develop branch a bit. The blog post makes it look quite easy, but I see why the arena is important for container types. I will follow up with this function, work through some multi-argument cases, and see if additional confusion comes up.


A couple of years ago I did some investigation (no actual coding/benchmarking) into whether it could be worthwhile to hand-code the partials of state-space models, and drew the conclusion that for time-invariant matrices there exist tricks that could potentially result in better performance (see e.g. https://doi.org/10.1007/s00180-012-0352-y), but that for time-varying matrices (which is what I needed), AD is probably just as performant as hand-coded partials.


Oh yeah be careful not to return an arena_t (this can lead to silent errors). We should add tests to test_ad to make sure that doesn’t happen, but for now you need to check that manually.


It took me a while to try this out, but for a nonlinear state-space model (174 ODEs with a Heun scheme), a C++ version of the ODEs + Heun scheme is 30-50% faster than plain Stan. In my case, though, I’m not building the whole matrices at every time step but doing VJPs through the ODEs themselves, which may be more compact memory-wise ¯_(ツ)_/¯
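(For reference, the Heun / improved-Euler step being differentiated here is the standard two-stage update

$$\tilde{x}_{n+1} = x_n + \Delta t\, f(x_n), \qquad x_{n+1} = x_n + \tfrac{\Delta t}{2}\bigl(f(x_n) + f(\tilde{x}_{n+1})\bigr),$$

so each step’s reverse pass only needs VJPs through two evaluations of $f$ rather than a full Jacobian.)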
