# A new continuation-based autodiff by refactoring

## Part I: Lambda-based Closures

Autodiff is where Stan spends 95% of its execution time, so I always devote cycles to figuring out how to make it faster and easier to use.

#### Motivations

• The underlying theory is really about continuations—the reverse mode pass to apply the chain rule is just a bunch of functions queued up in the forward pass. The current autodiff system does this implicitly, but I hadn’t really thought about it formally until talking to Barak Pearlmutter and @Matthijs

• For years @rtrangucci and others who’ve looked closely at the matrix autodiff have been clamoring for matrix autodiff without having to explicitly copy out an Eigen matrix (which entails heap allocation)

• C++11 makes a lot of this stuff a lot easier than it used to be. We’ll start with a literal continuation-based implementation, then I’ll refactor the closures into custom, memory tight objects that could be used for serious autodiff.

I’m going to make a lot of references to how things are implemented now. If you’re not familiar, we have an arXiv paper on Stan’s autodiff that explains how pretty much everything’s implemented now.

#### The Code

So let’s get started with the implementation. I’m going to start with some nice C++11 tricks, then over a few posts I’ll refactor into something with custom implementations. For a bit of a spoiler, here’s a post on lambdas explaining why std::function types are so heavy. Here’s the code. I’ll go back through it repeating peaces to highlight, but I thought it’d be good if you have it all in one place:

#include <cmath>

#include <functional>
#include <limits>
#include <iostream>
#include <vector>

std::vector<chain_t> stack_;

std::size_t next_idx_ = 0;
inline std::size_t next_idx() {
return next_idx_++;
}

struct var {
double val_;
int idx_;
var(const var& v) : val_(v.val_), idx_(v.idx_) { }
var(double val, int idx) : val_(val), idx_(idx) { }
var(double val) : val_(val), idx_(next_idx()) { }
var() : val_(std::numeric_limits<double>::quiet_NaN()), idx_(next_idx()) { }
};

struct matrix_var {
MatrixXd val_;
int** idx_;

}

inline var operator+(const var& x1, const var& x2) {
var y(x1.val_ + x2.val_);
});
return y;
}
inline var operator+(const var& x1, double x2) {
var y(x1.val_ + x2);
});
return y;
}
inline var operator+(double x1, const var& x2) {
var y(x1 + x2.val_);
});
return y;
}

inline var operator*(const var& x1, const var& x2) {
var y(x1.val_ * x2.val_);
});
return y;
}
inline var operator*(const var& x1, double x2) {
var y(x1.val_ * x2);
});
return y;
}
inline var operator*(double x1, const var& x2) {
var y(x1 * x2.val_);
});
return y;
}

inline var exp(const var& x) {
var y(std::exp(x.val_));
});
return y;
}

std::vector<double> grad(const var& y, const std::vector<var>& x) {
for (auto chain_f = stack_.crbegin(); chain_f != stack_.crend(); ++chain_f)
std::vector<double> dy_dx(x.size());
for (std::size_t i = 0; i < x.size(); ++i)
return dy_dx;
}

int main() {
stack_.clear();
next_idx = 0;
var x1 = 10.3;
var x2 = -1.1;
std::vector<var> x = { x1, x2 };
var y = x1 * x2 * 2 + 7;
std::cout << "y = " << y.val_ << std::endl;
for (std::size_t i = 0; i < x.size(); ++i)
std::cout << "dy / dx[" << (i + 1) << "] = " << dy_dx[i] << std::endl;
return 0;
}


Here’s how it runs (the c++1y flag says to use C++11, C++14 [edit: and C++17]).

$clang++ -std=c++1y ad.cpp$ ./a.out
y = -15.66
dy / dx[1] = -2.2
dy / dx[2] = 20.6


#### Autodiff type: var

Everything other than the example application gets dropped into the namespace agrad, which is the original namespace I used for Stan’s first autodiff (it was short for “automatic gradients”); we’ve since put everything in stan::math.

The top-level type we use is quite a bit different than our current pointer-to-implementation.

struct var {
double val_;
int idx_;
var(const var& v) : val_(v.val_), idx_(v.idx_) { }
var(double val, int idx) : val_(val), idx_(idx) { }
var(double val) : val_(val), idx_(next_idx()) { }
var() : val_(std::numeric_limits<double>::quiet_NaN()), idx_(next_idx()) { }
};


It stores two values, a value and an index (with padding to 8-byte alignment, it’s still going to be a 16-byte object); later, I’ll typedef out the index types and value types for flexibility. We almost always pass these var by reference and they all live on the function call stack (not on the heap), so the extra weight here isn’t a problem.

The main difference to what we’re doing is that the value is stored locally. This means we don’t have to chase a pointer to get a value. It also means we won’t have to allocate values in the arena-based memory unless we need them in the reverse pass. Finally, rather than allocating vari and pointing to it, we only maintain and index. That index structure is maintained with a global variable next_idx and a wrapper function to get the next one and increment:

std::size_t next_idx_ = 0;  // global!

inline std::size_t next_idx() {
return next_idx_++;
}


From the constructor, you can see that when we construct a var from a double, we’ll allocate a new index:

  var(double val) : val_(val), idx_(next_idx()) { }


There’s no memory being allocated for that index. Everything remains local to the var, which is great for increasing memory locality and reducing pointer chasing and subsequent cache misses (which can cost dozens of floating-point operations).

One must be very careful with globals, translation units, etc., but I’m not going to focus on that here. Also, we can optionally wrap up our global variables in one or more thread-local variables; the cost is a bit of a synchronization slowdown and the benefit is full multi-threading with independent autodiff (that makes things like parallel map as was just released in 2.18 easy).

#### The autodiff stack

Conceptually, reverse mode autodiff works by evaluating an expression, and for each subexpression, pushing a continuation to propagate the chain rule in reverse onto a stack. These are then visited in reverse order to propagate the chain rule from the final value expression down to the input arguments.

The first implementation follows the theory and literally creates a stack of continuations. We need to include the <functional> header in order to define the appropriate types.

typedef std::vector<double> adj_t;

std::vector<chain_t> stack_;


The type adj_t is for sequence of adjoints, indexed by expression id (as held in a var). The The type chain_t is the type of the continuations held in the autodiff stack, and stack_ is the global variable holding the continuation stack.

The continuations are functions that work on a vector of adjoints. Each expression has an index, and that expression’s adjoint value is in the adjoint stack (which is type adj_t and created when derivatives are needed rather than in advance). More formally, we use the C++11 functional lib std::function to provide a type for the continuations, std::function<void(adj_t&)>. Thus each continuation applies to a vector of adjoints to propagate the chain rule one or more steps.

To run more than one autodiff, stack_.clear() will have to be called. We’ve swept details like that under the rug to keep this example simple. We’ll ramp up to much cleaner implementations by the third post.

It’s probably easiest to see how autodiff works for a non-trivial unary function. Here’s the full definition of exp:

inline var exp(const var& x) {
var y(std::exp(x.val_));
});
return y;
}


First, the variable y is constructed with the value given by exponentiating the value of the argument. Then the fun starts. Rather than having to define new vari classes for each autodiff, we’re going to use the magic of lambdas. First, note that stack_.emplace_back(...) is going to construct an object of the type std::function<void(adj_t&)> (the type of elements held by stack_) using the memory in the stack. This conveniently avoids a copy (which is heavy for std::function). Now let’s break down the lambda expression that will be emplaced,

[=](adj_t& adj) {
}


The initial [=] controls the closure behavior of the lambda, with = denoting pass-by-value. The values being bound by the closure are x.idx, y.val_ and y.idx_. Copies of these are made in the closure object with their current value. This creates a function that applies to an adj_t, which is the vector of adjoint values. The action here says to update the adjoint of x by adding the value of y (i.e., exp(x)) times the adjoint of y. The adjoints are held in the vector of scalar values in adj and are indexed by the variable index. When the chain rule’s done, we wind up with each adjoint being the derivative of the final value with respect to the expression represented by the index of the adjoint.

Let’s consider one more example, operator+. It has three instantiations, one of which we consider here:

inline var operator+(const var& x1, double x2) {
var y(x1.val_ + x2);
});
return y;
}


Now there are two arguments, but only one is a var. The value is calculating in the obvious way to construct a new var (with a freshly generated index). Then the continuation is defined to update the adjoint of x with the adjoint of the result y. As for exponentiation, x1.idx and y.idx are stored by value.

#### Applying the chain rule

This is pretty easy now with a function that takes a result variable y and computes derivatives w.r.t. the variables in x, returning them as a vector of floating point values.

std::vector<double> grad(const var& y, const std::vector<var>& x) {
for (auto chain_f = stack_.crbegin(); chain_f != stack_.crend(); ++chain_f)
std::vector<double> dy_dx(x.size());
for (std::size_t i = 0; i < x.size(); ++i)
return dy_dx;
}


We start by initializing the adjoint vector of double values to a large enough size to use y.idx_ and assign all the values to 0.0. Because we take y as the result, each adjoint slot will hold the derivative of y with respect to the expression with its index. So we start by setting the derivative of y with respect to itself to be 1.

Then we just walk over the stack of continuations from last added down to the first (the iterator crbegin() is the top of the stack and crend() is the bottom of the stack. Each updates the adjoints by propagating the chain rule from an expression down to its operands; it’s as easy as applying the function *chain_f to the stack of adjoints, (*chain_f)(adj).

Next we allocate a standard vector for the results and copy the adjoints for the x values into dy_dx to return.

#### Using it

Here’s the main function again, which is defined outside of the agrad namespace.

int main() {
stack_.clear();
next_idx = 0;
var x1 = 10.3;
var x2 = -1.1;
std::vector<var> x = { x1, x2 };
var y = x1 * x2 * 2 + 7;
std::cout << "y = " << y.val_ << std::endl;
for (std::size_t i = 0; i < x.size(); ++i)
std::cout << "dy / dx[" << (i + 1) << "] = " << dy_dx[i] << std::endl;
return 0;
}


First we clear the global variables (this will not free memory in the stack, just reset the size to zero, which is the behavior we want for Stan). Then se set the next index for the next expression to zero.

Next, we just set two variables of type var to 10.3 and -1.1 respectively. There is no explicit assignment, but there is an implicit var(double) constructor which will get used. Then we group the variables into the x vector to pass to grad; this uses C++11 brace initializers.

Next, we have the expression we evaluate, var y = x1 * x2 * 2 + 7;. This finds the overloaded operator* and operator+ implementations defined in the agrad namespace through argument-dependent lookup (ADL).

Finally, we calculate the gradient by calling agrad::grad and assign the result to dy_dx fo printing.

Et voila.

#### Warning!!!

This works and hopefully easy to follow. But it’s going to be slow and a memory hog.

The culprit is in the way that std::function works. There is a really great post explaining why this approach isn’t tenable (I hot linked the relevant section, but the whole thing’s super clear and very worth reading):

#### Next time

We’ll fix this problem by defining our own custom continuations. Then we’ll be back implementing someting like our current vari, but without any built-in member variables.

13 Likes

## Part II: Custom Closures

This is part II of the series in rewriting autodiff. We started with a very direct implementation of continuations with lambdas and implemented the container type using std::function. What we can do to improve this code is write custom closures to implement the reverse pass continuations.

In Part III, we’ll come back to using efficient memory arenas for all of our objects. For now, it’s still wasteful with its allocation.

#### What went wrong: std::function

The problem with the first version was that it used a std::function on a stack. That’s wasteful in terms of memory and would limit the size of expressions we could evaluate. For example,

#include <functional>
#include <iostream>

int main() {
int a = 2;
auto f_closure = [=](int x) { return x + a; };
std::function<int(int)> f = f_closure;

std::cout << "sizeof(a) = " << sizeof(a) << std::endl;
std::cout << "sizeof(f_closure) = " << sizeof(f_closure) << std::endl;
std::cout << "sizeof(f) = " << sizeof(f) << std::endl;
std::cout << "f_closure(3) = " << f_closure(3) << std::endl;
std::cout << "f(3) = " << f(3) << std::endl;
}


evaluates to

sizeof(a) = 4
sizeof(f_closure) = 4
sizeof(f) = 48
f_closure(3) = 5
f(3) = 5


The size penalty for assiging to the std::function type is a deal breaker here.
Note that the size of the closure itself is quite small—just the size of the data it stores.

As tempting as that closure size is, there’s no way to export a bunch of things that size and put them on a list. It’s only clever shuffling and local use that allows the closure to be this size locally. For one thing, we need the function pointer as well as the data.

Key point: It’s worth stressing that there’s no way to get away from having some kind of pointer internally in ther reverse mode autodiff stack. That will be a vtable pointer in what we have now and in what I’ll be doing in this post; it could be a function pointer. That all gets wrapped up inside std::function very cleverly. The consequence is that reverse mode acts as interpreted code where we have to keep looking up functions and chasing pointers rather than compiled code where addresses are known statically. But it’s nicely behaved virtual code because everything just gets executed in order without any branching.

For more information, I’d like to reiterate my recommendation for Shahar Mike’s blog post on lambdas and function types.

#### How to fix it—custom continuations

We’ll do exactly what we did in Stan—write custom continuation types.

Here’s the whole program, which I will then break down.

#include <functional>
#include <limits>
#include <iostream>
#include <cstdio>
#include <vector>

struct chainable {
};

// globals required for autodiff
std::vector<chainable*> STACK_;
std::size_t NEXT_ID_ = 0;

inline void push_chainable(chainable* c) {
STACK_.push_back(c);
}

inline void clear_memory() {
NEXT_ID_ = 0;
STACK_.clear();
}

inline std::size_t next_id() {
return NEXT_ID_++;
}

struct var {
double val_;
int id_;
var(const var& v) : val_(v.val_), id_(v.id_) { }
var(double val, int id) : val_(val), id_(id) { }
var(double val) : val_(val), id_(next_id()) { }
var() : val_(std::numeric_limits<double>::quiet_NaN()), id_(next_id()) { }
};

void grad(const var& y, const std::vector<var>& x,
std::vector<double>& dy_dx) {
for (auto chain_f = STACK_.crbegin(); chain_f != STACK_.crend(); ++chain_f)
dy_dx.clear();
for (auto xi : x)
}

struct multiply_vv : public chainable {
size_t x1, x2, y;
double x1v, x2v;
multiply_vv(size_t x1, size_t x2, size_t y, double x1v, double x2v)
: x1(x1), x2(x2), y(y), x1v(x1v), x2v(x2v) { }
}
};
var operator*(const var& x1, const var& x2) {
var y(x1.val_ * x2.val_);
push_chainable(new multiply_vv(x1.id_, x2.id_, y.id_, x1.val_, x2.val_));
return y;
}

struct add_vv : public chainable {
size_t x1, x2, y;
add_vv(size_t x1, size_t x2, size_t y) : x1(x1), x2(x2), y(y) { }
}
};
var operator+(const var& x1, const var& x2) {
var y(x1.val_ + x2.val_);
return y;
}

int main() {
var x1 = 10.3;
var x2 = -1.1;
std::vector<var> x = { x1, x2 };
var y = x1 * x2;
std::vector<double> dy_dx;
std::cout << "y = " << y.val_ << std::endl;
for (std::size_t i = 0; i < x.size(); ++i)
std::cout << "dy / dx[" << (i + 1) << "] = " << dy_dx[i] << std::endl;
return 0;
}


#### Same var class

The var types are just the same as in the last post, with the same global identifier management.

std::size_t NEXT_ID_ = 0;
inline std::size_t next_id() {
return NEXT_ID_++;
}

struct var {
double val_;
int id_;
var(const var& v) : val_(v.val_), id_(v.id_) { }
var(double val, int id) : val_(val), id_(id) { }
var(double val) : val_(val), id_(next_id()) { }
var() : val_(std::numeric_limits<double>::quiet_NaN()), id_(next_id()) { }
};


#### Continuation base class: chainable

The chainable base functor defines a virtual method for the continuation signature and implements it as a no-op (unlike our existing code base for Stan, this should probably just be pure virtual with = 0).

typedef std::vector<double> adj_t;

struct chainable {
};


As before, the continuations operate on a vector of adjoint values, indexed by expression id.

#### Stack of chainables

The stack is now going to be a stack of pointers to chainable objects.

std::vector<chainable*> STACK_;

inline void push_chainable(chainable* c) {
STACK_.push_back(c);
}


We have a convenience method now to clear (not free) memory.

inline void clear_memory() {
NEXT_ID_ = 0;
STACK_.clear();
}


They have to be pointers to avoid slicing the member variables and virtual functions. As I said above, we can’t get around paying this pointer-chasing penalty—this structure just makes it explicit.

#### Reverse pass

This is the same as before:

void grad(const var& y, const std::vector<var>& x,
std::vector<double>& dy_dx) {
for (auto&& chain_f = STACK_.crbegin(); chain_f != STACK_.crend(); ++chain_f)
dy_dx.clear();
for (auto&& xi : x)
}


Now it’s been rewritten to take the derivative vector as an argument, which it clears and then populates. The use of auto&& here take the appropriately const-qualified reference to the specified object, even if it’s an rvalue.

#### Closure implementations

The only thing left to do is implement the closures. Let’s look at multiplication.

struct multiply_vv : public chainable {
size_t x1, x2, y;
double x1v, x2v;
multiply_vv(size_t x1, size_t x2, size_t y, double x1v, double x2v)
: x1(x1), x2(x2), y(y), x1v(x1v), x2v(x2v) { }
}
};
var operator*(const var& x1, const var& x2) {
var y(x1.val_ * x2.val_);
push_chainable(new multiply_vv(x1.id_, x2.id_, y.id_, x1.val_, x2.val_));
return y;
}


This multiply_vv class is the custom closure we’ll use to populate the autodiff continuation stack. It’s declared to be a subclass of chainable so that it can be pushed onto the stack.

It has a constructor which sets the indexes for x1, x2, and y, as well as the values of x1 and x2 as x1v and x2v. After storing these, the actual functor definition of void operator()(adj_t&) const is straightforward.

Finally, the actual operator*(const var&, const var&) used for multiplication requires creating the value as before, then pushing a chainable onto the stack with instantiations for all the variables. This probalby should’ve been done with just x1, x2, and y arguments with the multiply_vv constructor doing the fiddly member extraction.

#### Running it

That’s really pretty much it. We replaced std::function captures of lambdas with implicit closures with custom closures and replaced calling the functions with functor application.

The main looks pretty much as before other than the different usage pattern for dy_dx.

int main() {
var x1 = 10.3;
var x2 = -1.1;
std::vector<var> x = { x1, x2 };
var y = x1 * x2;
std::vector<double> dy_dx;
std::cout << "y = " << y.val_ << std::endl;
for (std::size_t i = 0; i < x.size(); ++i)
std::cout << "dy / dx[" << (i + 1) << "] = " << dy_dx[i] << std::endl;
return 0;
}

4 Likes

## Part III: Arena-Based Memory and Functional Application

#### What’s still wrong

Allocating that stack of chainables with malloc and then freeing them when the stack is cleared is going to be very slow, with a memory-affecting constructor and destructor for each subexpression being evaluated.

#### The fix: arena-based memory

Our third version of this program is going to clean up a bunch of things at once. It’s going to give us

• configurable sizes for indexes (32 or 64-bit being the only logical choices) and floating point scalars (32 bit and 64 bit being most obvious choices)—we’ll just go with the machine size of 64 and 64

• arena-based memory management with custom operator new for the chainable base class (and descendants)

• functional for for evaluating gradients like the one in stan::math::gradient

• more operators including += and *= and double-var versions of binary ops

The code still isn’t very well organized, documented, or broken down into functions. So I’ll step through it all like in the last posts.

#include <limits>
#include <iostream>
#include <vector>
#include <cmath>

// CONFIGURABLE SIZES

typedef unsigned int index_t;          // determines max expression size
typedef double real_t;                 // basic scalar type
typedef std::vector<real_t> vec_t;     // vector of scalars

// SIMPLE MEMORY ARENA

std::size_t MEM_SIZE = 1024 * 1024 * 1024;  // 1 GB

char* ARENA = static_cast<char*>(malloc(MEM_SIZE));      // ** global **

char* ARENA_NEXT = ARENA;                                // ** global **

inline void clear_arena() {
ARENA_NEXT = ARENA;
}

inline char* arena_malloc(std::size_t nbytes) {
char* bytes = ARENA_NEXT;
ARENA_NEXT += nbytes;
return bytes;
}

template <typename T>
inline T* arena_array_malloc(std::size_t n) {
return static_cast<T*>(arena_malloc(sizeof(T) * n));
}

// STACK OF CHAINABLES FOR REVERSE PASS IN ARENA

struct chainable {
virtual void operator()(vec_t& adj) const { }
static inline void* operator new(std::size_t nbytes) {
std::cout << "arena allocating nbytes = " << nbytes << std::endl;
return arena_malloc(nbytes);
}
};

std::vector<chainable*> STACK;                          // ** global **

inline void clear_stack() {
STACK.clear();  // leaves capacity unchanged
}
inline void push_chainable(chainable* c) {
STACK.push_back(c);
}

index_t NEXT_ID = 0;                                    // ** global **

inline index_t next_id() {
return NEXT_ID++;
}

// CLEAR MEMORY (RESETS, BUT DOES NOT FREE)

inline void clear_memory() {
NEXT_ID = 0;
clear_stack();
clear_arena();
}

// AUTODIFF VARIABLE

struct var {
real_t val_;
int id_;
var(const var& v) : val_(v.val_), id_(v.id_) { }
var(real_t val, int id) : val_(val), id_(id) { }
var(real_t val) : val_(val), id_(next_id()) { }
var() : val_(std::numeric_limits<real_t>::quiet_NaN()), id_(next_id()) { }
var& operator+=(const var& x);
var& operator+=(double x);
};

// REVERSE PASS GRADIENT CALCULATIONS (PURE CONTINUATIONS)

void grad(const var& y, const std::vector<var>& x,
std::vector<real_t>& dy_dx) {
for (auto&& chain_f = STACK.crbegin(); chain_f != STACK.crend(); ++chain_f)
dy_dx.clear();
for (auto&& xi : x)
}

/**
* Set the value y = f(x) and derivatives dy_dx = d/dx f(x).
*/
const std::vector<real_t>& x,
real_t& y, std::vector<real_t>& dy_dx) {
clear_memory();
std::vector<var> xv;
xv.reserve(x.size());
for (auto&& xi : x) xv.emplace_back(xi);
var yv = f(xv);
y = yv.val_;
}

// allocated in arena
struct add_vv : public chainable {
index_t x1, x2, y;
add_vv(index_t x1, index_t x2, index_t y) : x1(x1), x2(x2), y(y) { }
}
};
struct add_v : public chainable {
index_t x1, y;
add_v(index_t x1, index_t y) : x1(x1), y(y) { }
}
};
inline var operator+(const var& x1, const var& x2) {
var y(x1.val_ + x2.val_);
return y;
}
inline var operator+(const var& x1, double x2) {
var y(x1.val_ + x2);
return y;
}
inline var operator+(double x1, const var& x2) {
return x2 + x1;
}

inline var& var::operator+=(const var& x) {
val_ = val_ + x.val_;
index_t old_id = id_;
id_ = next_id();
return *this;
}
inline var& var::operator+=(double x) {
val_ = val_ + x;
index_t old_id = id_;
id_ = next_id();
return *this;
}

// MULTIPLICATION

// allocated in arena
struct multiply_vv : public chainable {
index_t x1, x2, y;
real_t x1v, x2v;
multiply_vv(index_t x1, index_t x2, index_t y, real_t x1v, real_t x2v)
: x1(x1), x2(x2), y(y), x1v(x1v), x2v(x2v) { }
}
};
struct multiply_v : public chainable {
index_t x1, y;
real_t  x2v;
multiply_v(index_t x1, index_t y, real_t x2v) : x1(x1), y(y), x2v(x2v) { }
}
};
inline var operator*(const var& x1, const var& x2) {
var y(x1.val_ * x2.val_);
push_chainable(new multiply_vv(x1.id_, x2.id_, y.id_, x1.val_, x2.val_));
return y;
}
inline var operator*(const var& x1, double x2) {
var y(x1.val_ * x2);
push_chainable(new multiply_v(x1.id_, y.id_, x2));
return y;
}
inline var operator*(double x1, const var& x2) {
return x2 * x1;
}

// EXPONENTIATION

struct exp_v : public chainable {
index_t x1, y;
real_t yv;
exp_v(index_t x1, index_t y, real_t yv) : x1(x1), y(y), yv(yv) { }
}
};
inline var exp(const var& x) {
var y(std::exp(x.val_));
push_chainable(new exp_v(x.id_, y.id_, y.val_));
return y;
}

// EXAMPLE

struct my_fun {
template <typename T>
T operator()(const std::vector<T>& x) const {
return exp(x[0]) * x[1] * x[1];
}
};

int main() {
using std::vector;
vec_t x = { 5.2, -3.9 };
vec_t dy_dx;
double y;
my_fun f;
std::cout << "y = f(x) = " << y << std::endl;
for (std::size_t i = 0; i < x.size(); ++i)
std::cout << "dy / dx[" << (i + 1) << "] = " << dy_dx[i] << std::endl;

return 0;
}


#### Configurable Sizes

typedef unsigned int index_t;          // determines max expression size
typedef double real_t;                 // basic scalar type
typedef std::vector<real_t> vec_t;     // vector of scalars


Using smaller integers means closures will be smaller and thus programs may be more memory local. It will limit number of subexpressions to around 4 billion. I’m still using double-precision floating point for real scalars and also vectors (the vec_t is replacing adj_t as I’ll need it in multiple places).

#### Simple arena-based memory

WARNING: Do not try this at home.
This will segfault in the face of a program that’s too large. To get around that issue, the memory arena either needs to be carefully sized or it needs to be dynamically expandable. The memory arena in Stan provides the same interface with a slightly slower (due to branching and pointer chasing) but much safer implementation.

std::size_t MEM_SIZE = 1024 * 1024 * 1024;  // 1 GB

char* ARENA = static_cast<char*>(malloc(MEM_SIZE));      // ** global **

char* ARENA_NEXT = ARENA;                                // ** global **

inline void clear_arena() {
ARENA_NEXT = ARENA;
}

inline char* arena_malloc(std::size_t nbytes) {
char* bytes = ARENA_NEXT;
ARENA_NEXT += nbytes;
return bytes;
}

template <typename T>
inline T* arena_array_malloc(std::size_t n) {
return static_cast<T*>(arena_malloc(sizeof(T) * n));
}


The memory size here is fixed at 1GB. That’s it. Go above that and segfault. You have now been warned and I’ve assumed the appropriate waiver before going on.

The arena itself is just a bunch of bytes (char in C++). The pointer to the next available memory location is ARENA_NEXT and it starts at the beginning.

The arena_malloc function has the same signature as the built in malloc function. It just returns a pointer to the next position in the arena then advances the pointer. The arena_array_malloc is just a convenience for arena-allocating arrays.

#### Closures allocated in the arena

This won’t do anything for us on its own. C++ has a really nifty feature where you can overload the operator new implementation for a class and all of its subclasses. Here’s what that looks like for our new chainable class:

struct chainable {
virtual void operator()(vec_t& adj) const { }

static inline void* operator new(std::size_t nbytes) {
std::cout << "arena allocating nbytes = " << nbytes << std::endl;
return arena_malloc(nbytes);
}
};


Same functor signature for performing continuations. But now there’s an operator new definition that calls the arena (and prints out how many bytes were allocated). Now whenever new is called on a suclass of chainable, its memory will be drawn from the arena.

The arena winds up being 8-byte aligned because we always allocate objects and in clang++ and g++ at least, that produces 8-byte alignment. That’s good, because we need it for actually operating on the objects efficiently.

#### Stack of continuations

These closures are used as continuations to propagate the chain rule in reverse mode. We hold them in a stack just like before, which has a couple convenience methods.

std::vector<chainable*> STACK;                          // ** global **

inline void clear_stack() {
STACK.clear();  // leaves capacity unchanged
}

inline void push_chainable(chainable* c) {
STACK.push_back(c);
}


Note again that clear() does not affect capacity. So the stack will keep growing until it doesn’t need to grow any more and then stay at that size until its memory is freed directly.

Autodiff variables and the ids behave exactly as before

index_t NEXT_ID = 0;                                    // ** global **

inline index_t next_id() {
return NEXT_ID++;
}

struct var {
real_t val_;
int id_;
var(const var& v) : val_(v.val_), id_(v.id_) { }
var(real_t val, int id) : val_(val), id_(id) { }
var(real_t val) : val_(val), id_(next_id()) { }
var() : val_(std::numeric_limits<real_t>::quiet_NaN()), id_(next_id()) { }
var& operator+=(const var& x);
var& operator+=(double x);
};


We’ve added operator+= declarations for var that will be defined out of the class. The class is still incomplete compared to a full arithmetic class.

#### Defining functions and continuations

Function definitions are just like in the last instance—the arena memory management remains implicit.

struct exp_v : public chainable {
index_t x1, y;
real_t yv;
exp_v(index_t x1, index_t y, real_t yv) : x1(x1), y(y), yv(yv) { }
}
};
inline var exp(const var& x) {
var y(std::exp(x.val_));
push_chainable(new exp_v(x.id_, y.id_, y.val_));
return y;
}


The compound add-assign operators are defined using the existing add_vv closure type,

inline var& var::operator+=(const var& x) {
val_ += x.val_;
index_t old_id = id_;
id_ = next_id();
return *this;
}


The argument for the object being operated on is implicit. Thus val_ refers to the value of that object, whereas x.val_ picks out the value of the argument. The first step updates the object’s value by incrementing it. Then the old index is saved and a new one is generated. The old and new indexes are used to construct the add_vv insance, which is pushed onto the stack of contiuations. Then a reference to the object is returned by return *this in order to match the return type.

The one for double values is similar, but uses add_v.

inline var& var::operator+=(double x) {
val_ += x;
index_t old_id = id_;
id_ = next_id();
return *this;
}


The basic gradient calculation is just like last time,

void grad(const var& y, const std::vector<var>& x,
std::vector<real_t>& dy_dx) {
for (auto&& chain_f = STACK.crbegin(); chain_f != STACK.crend(); ++chain_f)
dy_dx.clear();
for (auto&& xi : x)
}


But now we will define the interface that Stan actually uses. This uses the std::function argument because the storage cost isn’t high and it avoids templating on the functor type (i.e., type of f).

void gradient(const std::function<var(const std::vector<var>&)>& f,
const std::vector<real_t>& x,
real_t& y, std::vector<real_t>& dy_dx) {
clear_memory();
std::vector<var> xv;
xv.reserve(x.size());
for (auto&& xi : x) xv.emplace_back(xi);
var yv = f(xv);
y = yv.val_;
}


This takes an argument function f that operators on a vector<var> to return a var. Here is the example of such a function we use in the evaluation:

struct my_fun {
template <typename T>
T operator()(const std::vector<T>& x) const {
return exp(x[0]) * x[1] * x[1];
}
};


An instance of my_fun will be assignable to std::function<var(const std::vector<var>&)>. Note that the user here just writes my_fun and never has to refer to autodiff types—everything’s just implicit in the template parameter T.

Going back to the gradient function, it takes the functor as an argument, along with real-valued vectors of argument and a mutable real values for the result and multable vector of real values for the gradient.

The first step is to clear the memory. The clear function is simple, clearing the stack, the arena, and setting the next identifier back to zero.

inline void clear_memory() {
NEXT_ID = 0;
clear_stack();
clear_arena();
}


The rest of the function first creates var instances of the argument scalar types.

  std::vector<var> xv;
xv.reserve(x.size());
for (auto&& xi : x) xv.emplace_back(xi);


By using emplace, the new var is constructed in place in the vector’s memory rather than being copied.

Then it applies the function, extracts the value, and calls the other gradient function to calculate the gradient.

  var yv = f(xv);
y = yv.val_;


#### Running it all

From a user’s persepctive, it’s just a matter of defining a functor to differentate, which I’ll repeate here again:

struct my_fun {
template <typename T>
T operator()(const std::vector<T>& x) const {
return exp(x[0]) * x[1] * x[1];
}
};


Then we can call it by setting up the argument vectors and result types, instantiating the functor f, then calling the gradient functional.

int main() {
using std::vector;
vec_t x = { 5.2, -3.9 };
vec_t dy_dx;
double y;
my_fun f;
std::cout << "y = f(x) = " << y << std::endl;
for (std::size_t i = 0; i < x.size(); ++i)
std::cout << "dy / dx[" << (i + 1) << "] = " << dy_dx[i] << std::endl;

return 0;
}


That’s it. I believe this will be faster than how we do autodiff now. It’s smaller and more memory local for most operations, so it should be faster.

#### Next up: matrix types

The real target is matrix types. I haven’t written this part yet, but the idea is to have a matrix_var that stores a MatrixXd value and int[] of ids.

7 Likes

If you’re gonna do big posts like this you should really go for a fantasy genre title. “The Autodiff Archive” maybe? Especially if it’s a series… “Lord of the Refactorings”, “Vari Revelations”

1 Like

Question on this.

How will this affect fvars and second order autodiff?

Right now when we work with the math library and fvar<var>s, the prim implementations of everything gets used, everything gets blown out into (literally) billions of primitive operations, and we lose all the work that went into making the var bit of Stan math fast.

1. What do we have to do on the fvar side so that when we’re doing reverse passes we don’t lose all our custom reverse mode optimizations?

2. Does this autodiff system change this situation at all? Is it easier or harder for fvar stuff to take advantage of var components in the new autodiff? Or does it all still get blown up into lots of tiny primitive operations?

edit: fixed typos

Unless we could get enough reflection to run reverse inside reverse, we’d have to do it the same way we do it now using fvar<var> and fvar<fvar<var>>.

First, make sure the Hessian code you care about is using the fvar<var> version, not the fvar<fvar<double>> version.

The fvar implementations call whichever reverse-mode function is used. If they call x * y, then reverse-mode’s multiplication is used. If they call multiply_self_transpose() then the compound reverse-mode function is used. So it’s just a matter of the primitives you use in the fvar implementations.

Write better forward-mode code that calls specialized reverse-mode implementations. I’d urge you to work through exp(x) and x * y for x and y of type fvar<fvar<var>> to see what happens and see the room for improvement. The big gains in harder cases come from exploiting sparsity patterns to avoid doing irrelevant work.

We haven’t even started optimizing forward mode. First order of business is to test it for matrices. Then we can start using it instead of finite-differencing gradients, as RStan does now.

And if I astroturf a bunch of fake reviews on Amazon, I could even sell the audio book.

1 Like

Here’s a first stab at a vector class with proper move semantics for constructors and assignments. It does its own RAII management instead of delegating to std::vector or Eigen::Matrix. It looks like they [added a move constructor Eigen::Matrix], so it would’ve been possible with std::move() to indicate the rvalue status of the copy.

I really liked Kholdstare’s blog post explaining move semantics, though it oddly didn’t talk about assignment or the rule of five (an upgrade of the rule of three for move semantics).

Anyway, here goes. The goal is a data structure that lets you access the raw data memory locally without copying (see the Map returns) and lets you do sets and gets. I think this accomplishes that, but I’d very much like feedback on this. I’m very new to all this C++11/14 stuff particularly.

Becaue I manage my own memory, I kick things off with some helpful subroutines (“subroutine” sounds so 1970s, but then so does malloc).

template <typename T>
inline T* array_malloc(std::size_t n) {
return static_cast<T*>(malloc(sizeof(T) * n));
}

template <typename T>
inline T* copy_array_malloc(const T* x, index_t N) {
T* y = array_malloc<T>(N);
for (index_t n = 0; n < N; ++n) y[n] = x[n];
return y;
}

template <typename M>
inline typename M::Scalar* copy_array_malloc(const Eigen::EigenBase<M>& x) {
typename M::Scalar* y = array_malloc<typename M::Scalar>(x.size());
for (index_t n = 0; n < x.size(); ++n) y[n] = x(n);
return y;
}


The third example is being careful with Eigen argument types so as to be able to accept expression templates; see the guide to Eigen argument types for more information. I could’ve used a reference without a template parameter here, I think—does anyone know which would be better?

Anyway, here’s the vector class, meant to be used for Stan’s vector type.

struct vector_var {
int N_;
real_t* val_;
index_t* id_;

vector_var() : N_(0), val_(nullptr), id_(nullptr) { }

// Eigen reference constructor (allow template expression args)
vector_var(const Eigen::Ref<Eigen::Matrix<real_t, -1, 1>>& v)
: N_(v.size()),
val_(copy_array_malloc(&v[0], v.size())),
id_(array_malloc<index_t>(v.size())) {
for (int i = 0; i < v.size(); ++i)
id_[i] = next_id();
}

// Eigen base constructor
template <typename M>
vector_var(const Eigen::EigenBase<M>& v)
: N_(v.size()),
val_(copy_array_malloc(v)),
id_(array_malloc<index_t>(v.size())) {
for (int i = 0; i < v.size(); ++i)
id_[i] = next_id();
}

// copy constructor
vector_var(const vector_var& v)
: N_(v.N_),
val_(copy_array_malloc(v.val_, v.N_)),
id_(copy_array_malloc(v.id_, v.N_)) {
}

// move constructor (shallow copy in initializers, nullify source in body)
vector_var(vector_var&& v) : N_(v.N_), val_(v.val_), id_(v.id_) {
v.N_ = 0;
v.val_ = nullptr;
v.id_ = nullptr;
}

// destructor frees memory to complete RAII pattern
~vector_var() { free(val_); free(id_); }

// are these still necessary with move constructor?

// copy assignment
vector_var& operator=(const vector_var& v) {
N_ = v.N_;
val_ = copy_array_malloc(v.val_, v.N_);
id_ = copy_array_malloc(v.id_, v.N_);
return *this;
}
// move assignment (destructive; see move constructor)
vector_var& operator=(vector_var&& v) {
N_ = v.N_;
val_ = v.val_;
val_ = nullptr;
id_ = v.id_;
id_ = nullptr;
return *this;
}

inline real_t val(index_t i) const {
return val_[i];
}
template <typename M>
inline decltype(auto) val() const {
return Eigen::Map<real_vector_t>(val_, N_);
}

inline index_t id(index_t i) const {
return id_[i];
}
inline decltype(auto) id() const {
return Eigen::Map<index_vector_t>(id_, N_);
}

var operator[](index_t i) const {
return var(val(i), id(i));
}

void set(index_t i, real_t v) {
val_[i] = v;
id_[i] = next_id();
}
void set(index_t i, const var& v) {
val_[i] = v.val_;
id_[i] = v.id_;
}
};


I’m pretty sure I could use a direct memory copy here instead of most of these loops, because the contents are primitive data types. At the very least, I could probably use a clever std::algorithm one-liner. Any suggestions?

I’ll leave this post with a promissory note. I’ve already written some operators for the last version of this I did before completely refactoring the vector_var class. The basic idea is to write those helper functions for the arena malloc and then implement the matrix derivatives using maps. The derivative code will wind up looking like the adjoint-Jacobian product formulation @bbbales2 just coded up, only unfolded one step.

1 Like

This covers the unevaluated types in Eigen?

What @ChrisChiasson said makes sense though. If these things are always going to be copied out of, then wouldn’t an implicit conversion happen and vector_var(const Eigen::EigenBase<M>& v) handle everything?

Is there a reason to avoid std::copy? Is that something that can’t be fast for some weird reason?

Also, and maybe I’m not thinking about what I’m saying very well, but what do you think about using the curiously recurring template pattern like Eigen does and try to get special matrix types worked into this?

Like, cholesky_decompose takes in a positive_definite matrix, and then the output is guaranteed to lower_triangular. We want to write functions that in general will operate on all these matrices but then specializations that take advantage of the hidden structure. Ideally this happens so nobody has to look up weird solves in the manual.

I guess it could all be runtime checks, but compile time on types seems more glamorous.

Also what about statically sized types? Is there a way to work that in here? These vector_vars are going to get malloc’d in the Bob arena anyway, right? And we know sizes at compile-time in Stan.

I’m just thinking in terms of working with this all – if we can get statically sized types then all of a studden Eigen variables don’t go off malloc/freeing and carrying around Eigen::LLT and Eigen::LDLT objects isn’t so hard any more.

edit: changed some wording for clarity

Also I wouldn’t limit yourself to C++14. C++17’s constexpr stuff is very cool: https://medium.com/@LoopPerfect/c-17-vs-c-14-if-constexpr-b518982bb1e2 . Since this is all pie in the sky…

It makes compile time type logic start looking a whole lot more like regular C++ and not crazy template shenanigans.

I’m confused about what is and what isn’t supported in constexpr through the versions.

1 Like

Good point. I should probably change that to act like the current version (which does sometimes confuse people). We usually don’t use the no-arg constructor for var instances.

I’m still not entirely clear on all of the rules for the compiler adding constructors, so I was just explicit.

It would make sense to initialize like double. We’ve usually initialized Stan with NaN for debugging purposes—it’s too easy for 0 to slip through density functions, etc., unnoticed.

We could do that. I do something like that in the code generated for a Stan program.

Much smaller. As I pointed out and the article I linked pointed out, std::function is huge—it’s minimum 40+ bytes.

That we could do.

There is apparently move semantics, but it’d just be moving the heap memory in—I want everything in the arena here. So I don’t want to copy to Matrix, then copy to arena.

Nope. I just need a more comprehensive knowledge of the algorithms lib. I see from the doc that

In practice, implementations of std::copy avoid multiple assignments and use bulk copy functions such as std::memmove if the value type is TriviallyCopyable

Neat.

That might be possible aong with something like Adept’s expression templates.

We might also be able to tackle some of this in the generator.

No, we don’t know the sizes statically. Just consider:

int N;
real y[N];   // size isn't known statically


Even if you statically size, beyond a point Eigen’s going to malloc anyway. The cutoff’s pretty low because you don’t want to be cluttering the stack with memory, especially in multi-threaded apps.

Thanks for the pointer. I need to learn more about constexpr. Most of our use of autodiff won’t be able to have much that’s constant, though.

This is so cool!

c++1y was the name of the c++14 standard before it was released - we use that flag because gcc 4.9.3 came out before it C++14 was released so that’s how we’re able to use c++14 features cross compiler. I only mention this pedantry in case you or others thought we needed to wait for C++17 to use this code.

A cool C++11 feature is to defer to another constructor:

  var(const var &v) : var(v.val_, v.idx_) {}


Would a C++11 for each loop work here?

      for (auto&& chain_f : STACK)


How does this work with nested autodiff? I’m still reading all of these posts so if there’s an answer somewhere I’ll find it soon and edit this out. What I’m imagining is that we have a global stack of indices indicating the vari index where a nested autodiff starts, and we push an index onto it when we get a call to start_nested() and then when we need the current nested run’s gradient, we pop & run the chain thunks just from that index to the end?

I’m doing some really silly microbenchmarks with Eigen types and finding they might be similar performance wise (probably because closing over two pointers fits within the small size optimized buffer a std::function object contains). I will try to expand this a little and see where it breaks… [edit - it’s way way way slower, I had a bug]

Thanks. Not pedantic at all. I’d much rather be corrected than continue in error.

I assume the constructor delegation is going to all get inlined.

We need to loop over the stack in reverse and unfortunately the reverse algorithm is super heavy compared to the reverse iterators.

The ids are still in topological order, so that’d work the same way—like you outline.

Similar to what? The std::vector type is just too heavy to make a stack of them for autodiff, I think.

I had meant that the std::function stack was similar to the chainable custom continuation stack, but I had a mistake in my benchmark code, haha. It’s like 2-3 orders of magnitude worse. I’m sad to lose the really elegant lambda derivatives. Are you thinking we should try switching from our current vari pointer system to this one?

Indeed. They don’t have types, though, so you need something like std::function, which is tres elegante, but not very useful in this high throughput situation.

Even more so. It’s possible to do a lot more statically with code generation, but that brings in a ridiculous number of other headaches.

Can the lambda be automated behind the scenes, sort of like the current operands and partials, and then specialized as desired?

For example, let’s say the the adjoints are all Eigen VectorXd to generalize to component functions with multiple inputs and outputs.
In first-order reverse mode the updates will always be of the form
\alpha^{*}_{i} = J^{j}_{i} \alpha_{j} ,where J is the Jacobian of the component function, \alpha^{*} are the adjoints on the inputs, and \alpha are the adjoints on the outputs.

Consequently if a user could provide a function that evaluates the partial derivatives at the current point then a lambda implementing this update could be constructed automatically without the user having to know anything about the inner workings of the autodiff. Similarly, if a user could provide a function that evaluates the product of the Jacobian times an arbitrary vector then we could constructs a lambda automatically that would be just about as efficient as hand-written code but the inner workings of the autodiff would be completely abstracted away from the user.

2 Likes

In other words, can we create the continuation by first checking for a specialized function, then if that doesn’t exist looking for a Jacobian-vector function, then if that doesn’t exist looking for a Jacobian function, and if none of those exist erring out?

Exactly. I haven’t included it here, but it’s the basis of the recent PR of @bbbales2 that just got merged to allow users to write the value and adjoint-Jacobian product functions as a functor. That would be easy to do here, too.

The version @bbbales2 just wrote is really nice in that it uses a minimal number of virtual functions and handles all the autodiff stack management and in fact all the var types behind the scenes. If the value function is written with templates, it could be used for forward-mode, too. The impetus was that ongoing email discussion with David Duvenaud that finally drove the adjoint-Jacobian formulation through my thick head.

I suspect the expression templates would be a bit heavy for univariate uses as I don’t think there’s a way to avoid the constructor overhead. There, I think something more like the current vectorization framework is in order. That’s based on static class definitions that don’t have constructor overhead because they don’t need to store intermediate calculations during the value calculation that might be needed for the gradients.

@bbbales2 is working to generalizing beyond unary functions with single outputs to at least functions with two or three possibly multivariate inputs.

2 Likes

#### Part IV: Matrices and Matrix Functions

I haven’t quite had time to get this working, but the basic organization’s there of how it’ll work. I tried to make the enhancements others suggested in earlier responses like automatic constructor definitions, and I added a few of my own like brace initialization.

If you can see why the vector autodiff isn’t working yet, please let me know!

The idea is just like ordinary autodiff—the vector stores a vector of values and a vector of int indexes. This makes getting and setting both very easy, though I haven’t been able to figure out how to do reference sets without creating expression templates with back pointers to the originals. Then the autodiff is exactly the same.

By my rough calculation, with integer indexes, the memory used should be about the same as for the current approach, but with benefits of much better memory locality and no need to copy out vector or matrix values (they can be used directly with Eigen::Map).

I’m still a little hazy on how the Eigen reference and base class pointer arguments work and which gets instantiated in the constructors and other classes. Or which is more efficient, so if anyone knows that, please let me know.

#include <Eigen/Eigen>
#include <algorithm>
#include <cmath>
#include <initializer_list>
#include <iostream>
#include <limits>
#include <utility>
#include <vector>

// CONFIGURABLE SIZES

typedef unsigned int index_t;          // determines max expression size
typedef double real_t;                 // basic scalar type
typedef std::vector<real_t> vec_t;     // vector of scalars
typedef Eigen::Matrix<real_t, -1, 1> real_vector_t;
typedef Eigen::Matrix<index_t, -1, 1> index_vector_t;

// HEAP MEMORY MANAGEMENT

template <typename T>
inline T* array_malloc(std::size_t n) {
return static_cast<T*>(malloc(sizeof(T) * n));
}

template <typename T>
inline T* copy_array_malloc(const T* x, index_t N) {
T* y = array_malloc<T>(N);
std::copy(x, x + N, y);
return y;
}
template <typename T, int R, int C>
inline T* copy_array_malloc(const Eigen::Matrix<T, R, C>& x) {
T* y = array_malloc<T>(x.size());
std::copy(&x[0], &x[0] + x.size(), y);
for (index_t n = 0; n < x.size(); ++n) y[n] = x(n);
return y;
}
template <typename M>
inline typename M::Scalar* copy_array_malloc(const Eigen::DenseBase<M>& x) {
typename M::Scalar* y = array_malloc<typename M::Scalar>(x.size());
// can't use std::copy because not knowing M requires x(n) access
for (index_t n = 0; n < x.size(); ++n) y[n] = x(n);
return y;
}

// ARENA MEMORY MANAGEMENT

std::size_t MEM_SIZE = 1024 * 1024 * 1024;  // 1 GB

char* ARENA = static_cast<char*>(malloc(MEM_SIZE));      // ** global **

char* ARENA_NEXT = ARENA;                                // ** global **

inline void clear_arena() {
ARENA_NEXT = ARENA;
}

inline void* arena_malloc(std::size_t nbytes) {
char* bytes = ARENA_NEXT;
ARENA_NEXT += nbytes;
return bytes;
}

template <typename T>
inline T* array_arena_malloc(std::size_t n) {
return static_cast<T*>(arena_malloc(sizeof(T) * n));
}

template <typename T>
inline T* copy_array_arena_malloc(const T* x, index_t N) {
T* y = array_arena_malloc<T>(N);
std::copy(x, x + N, y);
return y;
}
template <typename T, int R, int C>
inline T* copy_array_arena_malloc(const Eigen::Matrix<T, R, C>& x) {
T* y = array_arena_malloc<T>(x.size());
std::copy(&x[0], &x[0] + x.size(), y);
for (index_t n = 0; n < x.size(); ++n) y[n] = x(n);
return y;
}
template <typename M>
inline typename M::Scalar* copy_array_arena_malloc(const Eigen::DenseBase<M>& x) {
typename M::Scalar* y = array_arena_malloc<typename M::Scalar>(x.size());
for (index_t n = 0; n < x.size(); ++n) y[n] = x(n);
return y;
}

// STACK OF CHAINABLES FOR REVERSE PASS IN ARENA

struct chainable {
virtual void operator()(vec_t& adj) const { }
static inline void* operator new(std::size_t nbytes) {
return arena_malloc(nbytes);
}
};

std::vector<chainable*> STACK;                          // ** global **

inline void clear_stack() {
STACK.clear();  // leaves capacity unchanged
}
inline void push_chainable(chainable* c) {
STACK.push_back(c);
}

index_t NEXT_ID = 0;                                    // ** global **

inline index_t next_id() {
return NEXT_ID++;
}

inline index_t* next_id_array_malloc(int N) {
index_t* y = array_malloc<index_t>(N);
for (index_t n = 0; n < N; ++n) y[n] = next_id();
return y;
}

// CLEAR MEMORY (RESETS, BUT DOES NOT FREE)

inline void clear_memory() {
NEXT_ID = 0;
clear_stack();
clear_arena();
}

// AUTODIFF VARIABLE

struct var {
real_t val_;
index_t id_;
~var() = default;
var() = default;
var(const var& v) = default;
var(real_t val, index_t id) : val_(val), id_(id) { }
var(real_t val) : var(val, next_id()) { }
var& operator=(const var& x) {
val_ = x.val_;
id_ = x.id_;
return *this;
}
var& operator=(double x) {
val_ = x;
id_ = next_id();
return *this;
}
var& operator+=(const var& x);
var& operator+=(real_t x);
};

struct vector_var {
int N_;
real_t* val_;
index_t* id_;

vector_var() = default;

// Eigen reference constructor (allow template expression args)
vector_var(const Eigen::Ref<Eigen::Matrix<real_t, -1, 1>>& v)
: N_(v.size()),
val_(copy_array_malloc(v)),
id_(next_id_array_malloc(v.size())) { }

// argument copies on purpose
vector_var(std::initializer_list<real_t> v)
: N_(v.size()),
val_(array_malloc<real_t>(v.size())),
id_(next_id_array_malloc(v.size())) {
}

// // Eigen base constructor
template <typename M>
vector_var(const Eigen::DenseBase<M>& v)
: N_(v.size()),
val_(copy_array_malloc(v)),
id_(next_id_array_malloc(v.size())) { }

// copy constructor
vector_var(const vector_var& v)
: N_(v.N_),
val_(copy_array_malloc(v.val_, v.N_)),
id_(copy_array_malloc(v.id_, v.N_)) {
}

// move constructor (shallow copy in initializers, nullify source in body)
// can this be = default with same behavior?
vector_var(vector_var&& v) noexcept
: N_(v.N_), val_(v.val_), id_(v.id_) {
v.N_ = 0;
v.val_ = nullptr;
v.id_ = nullptr;
}

// allocated, but unitialized (like Eigen, unlike std::vector)
vector_var(size_t N)
: N_(N), val_(array_malloc<real_t>(N)), id_(array_malloc<index_t>(N)) { }

// destructor frees memory to complete RAII pattern
~vector_var() { free(val_); free(id_); }

// are these still necessary with move constructor?

// copy assignment
vector_var& operator=(const vector_var& v) {
N_ = v.N_;
val_ = copy_array_malloc(v.val_, v.N_);
id_ = copy_array_malloc(v.id_, v.N_);
return *this;
}
// move assignment (destructive; see move constructor)
vector_var& operator=(vector_var&& v) {
N_ = v.N_;
val_ = v.val_;
val_ = nullptr;
id_ = v.id_;
id_ = nullptr;
return *this;
}

index_t size() const {
return N_;
}

real_t val(index_t i) const {
return val_[i];
}
decltype(auto) val() const {
return Eigen::Map<real_vector_t>(val_, N_);
}

index_t id(index_t i) const {
return id_[i];
}
decltype(auto) id() const {
return Eigen::Map<index_vector_t>(id_, N_);
}

// std::vector-style indexing
var operator[](index_t i) const {
return var(val(i), id(i));
}

// Eigen::Vector-style indexing
var operator()(index_t i) const {
return var(val(i), id(i));
}

void set(index_t i, real_t v) {
val_[i] = v;
id_[i] = next_id();
}
void set(index_t i, const var& v) {
val_[i] = v.val_;
id_[i] = v.id_;
}
};

// index_t: (2 * N + 1)    real_t: (N + 1)   size pointer: 1
// bytes (int index_t, double real_t): 16N + 28
// current:  N pointers to y, N ptrs to x1,  pointer to x2, length (8), vtable
//           pointer: 2 N + 3  size: 2
//           bytes: 16N + 32  (+ vari being pointed to, but consider it a wash)
struct vector_scalar_multiply_vv : public chainable {
const index_t N_;
const index_t* x1_id_;
const index_t x2_id_;
const index_t* y_id_;
real_t* x1_val_;
real_t x2_val_;
vector_scalar_multiply_vv(const vector_var& x1,
const var& x2,
const vector_var& y)
: N_(x1.N_),
x1_id_(copy_array_arena_malloc<index_t>(x1.id_, x1.N_)),
x2_id_(x2.id_),
y_id_(array_arena_malloc<index_t>(y.N_)),
x1_val_(copy_array_arena_malloc<real_t>(x1.val_, x1.N_)),
x2_val_(x2.val_)
{ }

for (int n = 0; n < N_; ++n) {
std::cout << "\n**** n = " << n << std::endl;
std::cout << "x1_id_[n] = " << x1_id_[n] << std::endl;
std::cout << "y_id_[n] = " << y_id_[n] << std::endl;
std::cout << "x2_val_ = " << x2_val_ << std::endl;
std::cout << "x1_val_[n] = " << x1_val_[n] << std::endl;
}
}
};

inline vector_var multiply(const vector_var& x1, var& x2) {
vector_var y(x1.val() * x2.val_);
push_chainable(new vector_scalar_multiply_vv(x1, x2, y));
return y;
}

// REVERSE PASS GRADIENT CALCULATIONS (PURE CONTINUATIONS)

void grad(const var& y, const std::vector<var>& x,
std::vector<real_t>& dy_dx) {
for (auto&& chain_f = STACK.crbegin(); chain_f != STACK.crend(); ++chain_f)
dy_dx.clear();
for (auto&& xi : x)
}

/**
* Set the value y = f(x) and derivatives dy_dx = d/dx f(x).
*/
const std::vector<real_t>& x,
real_t& y, std::vector<real_t>& dy_dx) {
clear_memory();
std::vector<var> xv;
xv.reserve(x.size());
for (auto&& xi : x) xv.emplace_back(xi);
var yv = f(xv);
y = yv.val_;
}

// allocated in arena
struct add_vv : public chainable {
index_t x1, x2, y;
add_vv(index_t x1, index_t x2, index_t y) : x1(x1), x2(x2), y(y) { }
}
};
struct add_v : public chainable {
index_t x1, y;
add_v(index_t x1, index_t y) : x1(x1), y(y) { }
}
};
inline var operator+(const var& x1, const var& x2) {
var y(x1.val_ + x2.val_);
return y;
}
inline var operator+(const var& x1, double x2) {
var y(x1.val_ + x2);
return y;
}
inline var operator+(double x1, const var& x2) {
return x2 + x1;
}

inline var& var::operator+=(const var& x) {
val_ += x.val_;
index_t old_id = id_;
id_ = next_id();
return *this;
}
inline var& var::operator+=(double x) {
val_ += x;
index_t old_id = id_;
id_ = next_id();
return *this;
}

// MULTIPLICATION

// allocated in arena
struct multiply_vv : public chainable {
index_t x1, x2, y;
real_t x1v, x2v;
multiply_vv(index_t x1, index_t x2, index_t y, real_t x1v, real_t x2v)
: x1(x1), x2(x2), y(y), x1v(x1v), x2v(x2v) { }
}
};
struct multiply_v : public chainable {
index_t x1, y;
real_t  x2v;
multiply_v(index_t x1, index_t y, real_t x2v) : x1(x1), y(y), x2v(x2v) { }
}
};
inline var operator*(const var& x1, const var& x2) {
var y(x1.val_ * x2.val_);
push_chainable(new multiply_vv(x1.id_, x2.id_, y.id_, x1.val_, x2.val_));
return y;
}
inline var operator*(const var& x1, double x2) {
var y(x1.val_ * x2);
push_chainable(new multiply_v(x1.id_, y.id_, x2));
return y;
}
inline var operator*(double x1, const var& x2) {
return x2 * x1;
}

// EXPONENTIATION

struct exp_v : public chainable {
index_t x1, y;
real_t yv;
exp_v(index_t x1, index_t y, real_t yv) : x1(x1), y(y), yv(yv) { }
}
};
inline var exp(const var& x) {
var y(std::exp(x.val_));
push_chainable(new exp_v(x.id_, y.id_, y.val_));
return y;
}

// EXAMPLE

struct my_fun {
int n_;
my_fun(int n) : n_(n) { }
my_fun() : my_fun(0) { }
template <typename T>
T operator()(const std::vector<T>& x) const {
T x1 = x[0];
for (size_t i = 0; i < x2.size(); ++i) x2.set(i, x[i + 1]);
}
};

int main() {
using std::vector;
vec_t x = { 5.2, -3.9, 1.2 };
vec_t dy_dx;
double y;
my_fun f;