Closures discussion

The closures design doc was accepted in December but there’s been no activity since then. I think it’s time to do this.

Tagging everyone who participated in the design-doc discussion:
@Bob_Carpenter, @seantalts, @wds15, @Matthijs, @bbbales2

Unfortunately, the design doc has no plan for how to implement it. The implementation section is only a sentence about C++ lambdas, and that will not work because plain C++ lambdas cannot be autodiffed correctly in higher-order functions.

Probably the easiest C++ implementation is to convert captured variables into extra parameters (“lambda-lifting”) and piggyback on the variadic arguments proposal. I think I could make this work by the time Stan math supports variadic arguments.
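As a rough sketch of what lambda-lifting could look like (the names here are hypothetical, and stan::math::square is assumed to be in scope): a closure that would capture c, m and s is instead generated as a plain functor whose captures become extra trailing arguments, which a variadic higher-order function can thread through.

struct foo_lifted__ {
  // Hypothetical lambda-lifted functor: the "captured" c, m, s arrive as
  // ordinary trailing arguments instead of being stored in the object.
  template <typename T0__, typename T1__, typename T2__, typename T3__>
  auto operator()(const T0__& x, std::ostream* pstream__,
                  const T1__& c, const T2__& m, const T3__& s) const {
    return -square(x - m) / s + c;
  }
};
// A variadic higher-order function would then be called as, e.g.,
//   hof(foo_lifted__(), x0, msgs, c, m, s);
// instead of hof(foo, x0, msgs) with a capturing closure.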

But let’s consider supporting proper closures directly in the math library.
Currently the higher-order functions take a “functor” struct.

struct foo_functor__ {
  template <typename T0__>
  typename boost::math::tools::promote_args<T0__>::type
  operator()(const T0__& x, std::ostream* pstream__) const {
    return foo(x, pstream__);
  }
};

The functor struct contains no data and is default-constructible. ODE solvers store the struct and call it with either var or double arguments depending on where they need autodiff.
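For instance, a hypothetical call sequence (assuming the user-defined foo from the snippet above is in scope) would be:

// Illustration only: the same stateless functor instance works for plain
// double evaluation and for reverse-mode autodiff with var arguments.
foo_functor__ f;                                    // no data, default-constructible
std::ostream* msgs = nullptr;
double d = f(1.5, msgs);                            // double call
stan::math::var v = f(stan::math::var(1.5), msgs);  // var call, autodiffable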

The proposed variadic arguments are implemented by passing along an additional opaque tuple.
The parameter pack is accessed with helper functions (a usage sketch follows the list):

  • count_vars(...) – total number of autodiffable vars
  • value_of(...) – create an identical double-only tuple (same data, no autodiff)
  • accumulate_adjoints(double*, ...) – copy the gradients into a vector
  • save_varis(vari**, ...) – copy the vari pointers into a vector
  • deep_copy_vars(...) – allocate new independent varis for nested autodiff
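As a usage sketch (the higher-order function some_hof__ and its bookkeeping are hypothetical; only the helper signatures listed above are assumed), a variadic higher-order function would use them roughly like this:

template <typename F, typename... Args>
void some_hof__(F&& f, std::ostream* msgs, Args&&... args) {
  // how many autodiffable vars are hiding in the pack
  size_t n = count_vars(args...);

  // record the vari pointers of every autodiffed input
  vari** varis = ChainableStack::instance_->memalloc_.alloc_array<vari*>(n);
  save_varis(varis, args...);

  // value_of(args)... gives double-only copies for gradient-free evaluation,
  // deep_copy_vars(args)... gives independent varis for nested autodiff

  // ... call f, run the gradient sweep, then collect the adjoints
  std::vector<double> adjoints(n, 0.0);
  accumulate_adjoints(adjoints.data(), args...);
}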

A “closure” object would combine the callable functor and the parameter pack.
The closure then exposes an API that should be equivalent to the above.

Here’s what I think it would look like.

The user defines a closure and passes it to a higher-order function hof

data { real c; ... }
parameters {
  real m;
  real s;
  ...
} model {
  real foo(real x) {
    return -(x-m)^2/s + c;
  }
  ... = hof(foo, ...);
}

The compiler translates it to C++

foo_fn__ foo(c, m, s);
... = hof(foo, ...);

where the class is

class foo_fn__ {
  double& c;
  var& m;
  var& s;
 public:
  int num_vars__;
  foo_fn__(double& c_, var& m_, var& s_)
    : c(c_), m(m_), s(s_), num_vars__(2) { }
  template<typename T>
  var operator()(T x) {
    return -square(x-m)/s + c;
  }
  // double interface
  class foo_fn_dbl__ {
    double& c;
    double m;
    double s;
   public:
    foo_fn_dbl__(double& c_, double m_, double s_)
      : c(c_), m(m_), s(s_) { }
    template<typename T>
    double operator()(T x) {
      return -square(x-m)/s + c;
    }
  };
  foo_fn_dbl__ value_of_() {
    return foo_fn_dbl__(c, value_of(m), value_of(s));
  }
  // vari interface
  void set_zero_adjoints() {
    m.vi_->set_zero_adjoint();
    s.vi_->set_zero_adjoint();
  }
  void accumulate_adjoints(double* gradients) {
    gradients[0] += m.adj();
    gradients[1] += s.adj();
  }
  void save_varis(vari** varis) {
    varis[0] = m.vi_;
    varis[1] = s.vi_;
  }
  foo_fn__ deep_copy_vars() {
    // captured vars are references, can't store the copies in the struct
    // place the copies e.g. on the nested ad stack
    var* m_ = ChainableStack::instance_->memalloc_.alloc_array<var>(1);
    *m_ = var(new vari(value_of(m), false));
    var* s_ = ChainableStack::instance_->memalloc_.alloc_array<var>(1);
    *s_ = var(new vari(value_of(s), false));
    return foo_fn__(c, *m_, *s_);
  }
};

Additionally, MPI map_rect requires some way to serialize the object, but other than that, is this API sufficient for all of our higher-order functions?


This looks promising!

Right now map_rect and reduce_sum enforce statelessness by requiring that you can default-construct the functor object you give them. So this would need to be reworked, but from scanning over your proposal this looks fine.
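For reference, the statelessness requirement looks roughly like this today (an illustrative sketch, not the actual reduce_sum code): the user's functor type is a template parameter that the library default-constructs internally, so it can never carry captured state.

template <typename UserFunctor, typename... Args>
auto call_user_functor__(Args&&... args) {
  // the functor is default-constructed here, inside the library,
  // so it cannot hold any captured data
  return UserFunctor()(std::forward<Args>(args)...);
}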

Maybe this piggyback technique can be extended to tuples directly?

(this would be a bit of work I am afraid, but closures are definitely cool to have)

If you wanna do this, absolutely have at it. That would be great!

Oh wow I need to get in gear on polishing the ODE stuff. If we could have closures and variadic arguments for writing ODEs, that would be slick.

I’ll take a closer look this afternoon (gotta go do something now). I’ll have to think about how any deep-copy stuff should work.

Ideally none of that would be necessary because the variable capture would do the right thing. But reduce_sum and the ODE solvers do the weird nested autodiff, and in those cases we do need to be careful with the captured variables.

Alright so the use case for pre-computing Jacobians during the forward sweep (for reduce_sum) would be something like:

template <typename F, typename T>
auto myfunc(const F& f, const T& x) {
  nested_rev_autodiff nest;
  // work on independent copies so the nested sweep can't disturb the outer tape
  auto f_local = f.deep_copy();
  auto x_local = deep_copy_vars(x);

  auto out = f_local(x_local);
  out.grad();

  Eigen::VectorXd adj = Eigen::VectorXd::Zero(f.count_vars() + count_vars(x));
  f_local.accumulate_adjoints(adj.data());
  accumulate_adjoints(adj.data() + f.count_vars(), x_local);

  return std::make_tuple(value_of(out), adj);
}

template <typename F, typename T>
var outside(const F& f, const T& x) {
  int n = f.count_vars() + count_vars(x);
  vari** varis = ChainableStack::instance_->memalloc_.alloc_array<vari*>(n);
  f.save_varis(varis);
  save_varis(varis + f.count_vars(), x);

  auto t = myfunc(f, x); // element 0 is the value, element 1 is the adjoint vector

  // copy the adjoints into autodiff arena memory so they outlive this call
  double* grads = ChainableStack::instance_->memalloc_.alloc_array<double>(n);
  for (int i = 0; i < n; ++i)
    grads[i] = std::get<1>(t).coeff(i);

  return var(new precomputed_gradients_vari(std::get<0>(t), n, varis, grads));
}

I don’t think this needs a set_zero_adjoints(). Those would get handled by the regular nested::set_zero_adjoints() call.

Do you see any ways to make this cleaner or anything? Not that it’s not clean enough, but some things still look a bit manual.

I think what you’re proposing would work (though it would need count_vars added, and each of deep_copy/save_varis/accumulate_adjoints should defer to the underlying implementations; see the sketch below).
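For concreteness, a minimal sketch of that delegation inside the foo_fn__ class from above (hypothetical; the stan::math:: qualification is only there so the member names don't shadow the free functions, and deep_copy would forward to deep_copy_vars in the same spirit):

  // each closure method forwards to the corresponding variadic helper,
  // applied to the captured variables
  size_t count_vars() const { return stan::math::count_vars(m, s); }
  void save_varis(vari** ptr) const { stan::math::save_varis(ptr, m, s); }
  void accumulate_adjoints(double* ptr) const {
    stan::math::accumulate_adjoints(ptr, m, s);
  }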


Have any more thoughts on this?

My current progress is on this stanc3 branch.
The parser accepts closures and creates corresponding callable objects. The object exposes an API for accessing the captured vars, but it doesn't do anything yet. I haven't tested anything, but as long as you capture only data it hopefully works.
integrate_ode_bdf now always invokes the variadic ode_bdf_tol, and here's a math branch where ode_bdf uses the closure object API.

So, this compiles

data {
  int<lower = 0> N;          // number of measurement times
  real ts[N];                // measurement times > 0
  vector[2] y_init;          // initial measured populations
  real<lower = 0> y[N, 2];   // measured populations
}
transformed data {
    real x_r[0];
    int x_i[0];
}
parameters {
  real<lower = 0> alpha;
  real<lower = 0> beta;
  real<lower = 0> gamma;
  real<lower = 0> delta;
  vector<lower = 0>[2] z_init;// initial population
  real<lower = 0> sigma[2];   // measurement errors
}
transformed parameters {
  functions
  vector dz_dt(real t, vector z) {
    real du_dt = (alpha - beta * z[2]) * z[1];
    real dv_dt = (-gamma + delta * z[1]) * z[2];
    return [du_dt, dv_dt]';
  }
  vector[2] z[N] = integrate_ode_bdf(dz_dt, z_init, 0.0, ts);
}
model {
  alpha ~ normal(1, 0.5);
  gamma ~ normal(1, 0.5);
  beta ~ normal(0.05, 0.05);
  delta ~ normal(0.05, 0.05);
  sigma ~ lognormal(-1, 1);
  z_init ~ lognormal(log(10), 1);
  for (k in 1:2) {
    y_init[k] ~ lognormal(log(z_init[k]), sigma[k]);
    y[ , k] ~ lognormal(log(z[, k]), sigma[k]);
  }
}

The generated C++ class is

template <typename captured_t__>
class closure0__ {
  const captured_t__ alpha;
  const captured_t__ beta;
  const captured_t__ delta;
  const captured_t__ gamma;
  public:
  using captured_scalar_t__ = captured_t__;
  const int num_vars__;
  closure0__(const captured_t__ alpha__, const captured_t__ beta__,
             const captured_t__ delta__, const captured_t__ gamma__)
  : alpha(alpha__), beta(beta__), delta(delta__),
  gamma(gamma__), num_vars__(stan::is_var<captured_t__>::value ? 1 + 1 + 1
                             + 1 : 0) { }
  template <typename T0__, typename T1__>
  Eigen::Matrix<typename boost::math::tools::promote_args<captured_t__,
typename boost::math::tools::promote_args<T0__,
T1__>::type>::type, -1, 1>
  operator()(const T0__& t, const Eigen::Matrix<T1__, -1, 1>&
             z, std::ostream* pstream__) const {
  return closure0_impl__(t, z, alpha, beta, delta, gamma, pstream__);
  }
  
  class ValueOf__ {
    const double alpha;
    const double beta;
    const double delta;
    const double gamma;
    public:
    ValueOf__(const closure0__& init) : alpha(value_of(init.alpha)),
    beta(value_of(init.beta)), delta(value_of(init.delta)),
    gamma(value_of(init.gamma)) { }
    template <typename T0__, typename T1__>
    Eigen::Matrix<typename boost::math::tools::promote_args<T0__,
T1__>::type, -1, 1>
    operator()(const T0__& t, const Eigen::Matrix<T1__, -1, 1>&
               z, std::ostream* pstream__) const {
    return closure0_impl__(t, z, alpha, beta, delta, gamma, pstream__);
    }
    };
  class DeepCopy__ {
    const captured_t__ alpha;
    const captured_t__ beta;
    const captured_t__ delta;
    const captured_t__ gamma;
    public:
    DeepCopy__(const closure0__& init) : alpha(deep_copy_vars(init.alpha)),
    beta(deep_copy_vars(init.beta)), delta(deep_copy_vars(init.delta)),
    gamma(deep_copy_vars(init.gamma)) { }
    template <typename T0__, typename T1__>
    Eigen::Matrix<typename boost::math::tools::promote_args<captured_t__,
typename boost::math::tools::promote_args<T0__,
T1__>::type>::type, -1, 1>
    operator()(const T0__& t, const Eigen::Matrix<T1__, -1, 1>&
               z, std::ostream* pstream__) const {
    return closure0_impl__(t, z, alpha, beta, delta, gamma, pstream__);
    }
    
    void accumulate_adjoints(double*) const { /* alpha *//* beta */
      /* delta *//* gamma */}
    };
  void set_zero_adjoints() const { /* alpha *//* beta *//* delta */
    /* gamma */ }
  void accumulate_adjoints(double*) const { /* alpha *//* beta *//* delta */
    /* gamma */}
  void save_varis(vari**) const { /* alpha *//* beta *//* delta */
    /* gamma */}
};

Filling in the missing parts shouldn’t be too hard.

One annoying thing I noticed is that when ode_bdf calls the function it puts the output stream as the third argument, but most other callers put the stream last. That's not a problem for the above code, where the function takes only three arguments, but in general we need to do something about the inconsistency.


The positioning of the pstream argument is why we produce the rsfunctor for reduce_sum and the odefunctor in the variadic ODE code: https://github.com/stan-dev/stanc3/pull/525/files#diff-05c667b4d63efa8b39abcae56e7b9834R188

I am not sure how we can make this consistent unless we make some sort of wrappers for the existing integrate_*, map_rect, etc.

It would be nice to not need this, that is for sure.

Also: Very nice!

Wow, indeed there have been more thoughts lol.

That code looks amazing!

Is this something you need any help or info on? Or are you good?

Yeah, I think we need to standardize this.

The reason I put the variadic arguments last is that I can then rely on C++ to do all the type deduction for the template parameter packs. This makes the math code cleaner.
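A tiny signature sketch of that point (hypothetical names; only the argument ordering matters): with msgs second to last and the pack last, everything is deduced at the call site.

// because the parameter pack comes last, callers never spell out template
// arguments; F, T0 and Args... are all deduced from the call
template <typename F, typename T0, typename... Args>
auto call_rhs__(const F& f, const T0& y0, std::ostream* msgs, Args&&... args) {
  return f(y0, msgs, args...);
}
// e.g. call_rhs__(rhs, y0, nullptr, theta, x_r, x_i);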

I think we just update all the other things. We'll have to anyway for these closures. I like variadic last, msgs second to last, because then C++ handles type deduction.


I think I got it. It just means this part

  void set_zero_adjoints() const { /* alpha *//* beta *//* delta */
    /* gamma */ }
  void accumulate_adjoints(double*) const { /* alpha *//* beta *//* delta */
    /* gamma */}
  void save_varis(vari**) const { /* alpha *//* beta *//* delta */
    /* gamma */}

needs to be filled in with something like

  void set_zero_adjoints() const {
    alpha.vi_->set_zero_adjoint();
    beta.vi_->set_zero_adjoint();
    gamma.vi_->set_zero_adjoint();
    delta.vi_->set_zero_adjoint();
  }
  void accumulate_adjoints(double* ptr) const {
    *ptr++ += alpha.adj();
    *ptr++ += beta.adj();
    *ptr++ += gamma.adj();
    *ptr++ += delta.adj();
  }
  void save_varis(vari** ptr) const {
    *ptr++ = alpha.vi_;
    *ptr++ = beta.vi_;
    *ptr++ = gamma.vi_;
    *ptr++ = delta.vi_;
  }

And loops for vectors and arrays.
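For example, if the closure captured an Eigen vector of parameters (call it v here, hypothetically), the loop versions would be along the lines of:

  // hypothetical captured member: Eigen::Matrix<var, -1, 1> v;
  void accumulate_adjoints(double* ptr) const {
    for (int i = 0; i < v.size(); ++i)
      *ptr++ += v.coeff(i).adj();
  }
  void save_varis(vari** ptr) const {
    for (int i = 0; i < v.size(); ++i)
      *ptr++ = v.coeff(i).vi_;
  }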


Update: I haven’t forgotten about this.

The Lotka-Volterra model runs and gives the correct result. It's a bit slower, though. In fact, all the integrate_ode_* functions are based on mostly the same code, so they should now work with closures.
algebra_solver, reduce_sum, and map_rect do not compile.
Is the plan to migrate all of the higher-order functions to variadic arguments? How far along is that?

User-defined higher-order functions are also allowed. The main restriction is that a function cannot return a closure.

In C++ a closure object looks like

class closure<captured_scalar_t__> {
  auto operator()(Args...); // call the function
  class ValueOf__; // closure without internal autodiff
  class DeepCopy__; // closure with independent autodiff varis
  const int num_vars__;
  void set_zero_adjoints();
  void accumulate_adjoints(double*);
  void save_varis(vari**);
};

For an example of what the implementation looks like, see the ODE interface adaptor class. The nested class hierarchy is a bit unwieldy; if anyone has ideas to simplify this, please share. In theory closure<var>::ValueOf__ is the same as closure<double>::DeepCopy__, but clang did not like “instantiating a template within its own definition”.

Arrays are captured by reference. In the following model dz_dt does not make a copy of x and y but stores only a pointer.

data {
  real x[100];
}
parameters {
  real y[100];
}
transformed parameters {
  functions
  vector dz_dt(real t, vector z) {
    // do something with x and y
  }
  vector[K] z[N] = integrate_ode(dz_dt, z0, t0, ts);
}

That’s memory-efficient but creates a problem with adjoint sensitivity methods. The adjoint ODE must call dz_dt during the reverse pass and by that time y has gone out of scope. I guess closures would have to copy everything that’s not static data.


Very cool!

We have not decided yet whether we will make everything variadic. Though the ODE functions are being cast into that form right now by @bbbales2, and @charlesm93 wanted to attack the algebra_solver.

… but if closures are indeed doable and run without too high a performance hit, then variadics are less needed, obviously. Or does it still make sense to have most things variadic as well as closures? Maybe for performance reasons?

We should discuss a bit more and reflect… though having everything variadic and closures won't be wrong. Or is the issue that variadic arguments do not work with what you do here (they should, as I understand it)?

A closure object is basically a functor struct with a parameter pack inside it. Closures use much the same C++ plumbing as variadic arguments, so it makes sense to do both at the same time. Also, with closures the function signatures no longer need the three extra arguments for params + real data + int data. I was thinking that replacing those with variadic arguments would simplify the interface while keeping it backwards-compatible.
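To illustrate the signature difference (a sketch: old_rhs is the classic integrate_ode right-hand-side shape, new_rhs follows the variadic/closure convention discussed above):

// old style: parameters and data travel through three fixed extra arguments
template <typename T0, typename T1, typename T2>
std::vector<typename boost::math::tools::promote_args<T0, T1, T2>::type>
old_rhs(const T0& t, const std::vector<T1>& y, const std::vector<T2>& theta,
        const std::vector<double>& x_r, const std::vector<int>& x_i,
        std::ostream* msgs);

// closure/variadic style: anything the right-hand side needs is either
// captured by the closure or passed as trailing variadic arguments
template <typename T0, typename T1, typename... Args>
auto new_rhs(const T0& t, const Eigen::Matrix<T1, -1, 1>& y,
             std::ostream* msgs, Args&&... args);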

Yeah, I think so. I also like this use of the word static.

Yeah.

The problem with the old interface is that there are optional arguments. There's no way to tell whether someone means to pass double, double, int as arguments to their ODE right-hand side or as tolerances.

I think the old interface just has to be deprecated at this point, and we plumb it into whatever future thing we do.


As a way forward on this, I guess I'd like to open a pull request as a branch off the variadic ODEs and get the math component of this working.

@nhuurre does this seem like a reasonable plan? If so, do you want to put together the closure-functor ODE branch? If you're busy I can do this, but I'd like you to give me the functor object to test with (so I'm not making too many bad assumptions). I like the 1D ODE system y' = a*y*t for testing.

Also, we want every function generated by stanc3 to be one of these functor objects, right? Or will there be some delineation?

Another question I have is how functors can handle higher-order autodiff. This deep_copy/accumulate_adjoints sort of stuff we put together explicitly in a case where we weren't doing higher-order autodiff, but we should think about whether there's a way to generalize it before we spread it too far and wide in the code.

Also, summoning @Bob_Carpenter for comments

In C++ maybe, but during code generation the Stan compiler knows exactly how many parameters the passed function/closure consumes. If it wants double, double, int then those are variadic arguments; if it doesn’t then they are tolerances.

I have a closure-prototype branch for math and stanc3. Do you want me to open a (draft) pull request? Tests are broken on the math branch because the ODEs expect a functor that implements all the closure stuff, but I didn't update the test functors. There needs to be some way to create an adapter class that wraps a non-closure functor in a dummy closure.

Hmm, good point. I don’t know the other implications of that.

Yo that rocks. Yeah open a pull.

I suppose we could do that. What's the downside of just converting everything to functor objects? I guess the obvious one is that the Stan C++ gets more awkward to use.


I pointed the math pull at the variadic ODE stuff so I could see the diff there.


Oops, thanks.

The diff still looks too large. How did those HMM commits get there? Did I merge in a wrong branch at some point?

Nah, I think the variadic ODEs stuff just fell behind develop and your branch is more up to date.

I’ll clean up the merge when I get to looking at this closer (in the next few days).