Unless we’re evaluating Jacobians in forward mode or parallelizing reverse mode, this won’t scale for general-purpose functions. Forward mode will make the forward pass faster, but reverse mode will still be single-threaded and slow.
The lpdfs will be a bit misleading because we’re computing their Jacobians manually in forward mode with the ops-and-partials stuff.
I think this makes sense still. Within one process, there is only ever one sampler running one model.
The separate AD stacks we’re making are subservient to the main one, for sure. But I don’t think we really need them to be standalone.
And the single primary-stack design keeps global operations like `grad`/`set_zero_adjoint` doable.
Cool cool, couple comments:
> - independent chunks of work can be reordered (no more unique topological sort of operations)
>
> An alternative approach is to allow the AD tape to grow in many independent sub-trees. So after each parallel phase the chunks are not merged together to form a single AD tape. These sub-trees are then linked together in a tree-like structure which must allow for traversal in reverse order as needed during the reverse sweep.
The simpler version of parallel reverse mode doesn’t change how varis on the primary stack need to be organized.
Each of our parallel calls (`parallel_reduce`, `parallel_map`, whatever) has a vari on the primary autodiff stack. These varis can manage a bunch of sub-stacks that are allocated in the forward pass and then accessed again in reverse mode, and they can manage the specific parallelism however they like.
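Roughly what I’m picturing, as a minimal sketch with made-up types (`SubStack`, `parallel_map_vari` are hypothetical here, not Stan Math’s actual `vari`/stack classes):

```cpp
// Hypothetical sketch: a parallel_map-style vari that owns per-task
// sub-stacks. The sub-stacks are filled during the forward pass and
// swept (possibly in parallel) when the primary reverse sweep reaches
// this vari's chain().
#include <memory>
#include <vector>

struct vari {                    // node on an AD stack
  double val_;
  double adj_ = 0.0;
  explicit vari(double v) : val_(v) {}
  virtual void chain() {}        // propagate adjoints to parents
  virtual ~vari() = default;
};

struct SubStack {                // one nested tape per parallel task
  std::vector<std::unique_ptr<vari>> nodes_;
  void reverse_sweep() {         // local reverse pass, newest node first
    for (auto it = nodes_.rbegin(); it != nodes_.rend(); ++it)
      (*it)->chain();
  }
};

// The single vari that the parallel call leaves on the *primary* stack.
struct parallel_map_vari : public vari {
  std::vector<SubStack> sub_stacks_;  // built in the forward pass

  explicit parallel_map_vari(std::vector<SubStack>&& stacks)
      : vari(0.0), sub_stacks_(std::move(stacks)) {}

  void chain() override {
    // The primary reverse sweep hits this node exactly once; it is then
    // free to run the independent sub-stack sweeps however it likes
    // (serially here; a thread pool or TBB in practice).
    for (auto& s : sub_stacks_)
      s.reverse_sweep();
  }
};
```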
> (all global operations like `set_zero_adjoint`, `recover_memory`, ... all need a rewrite)
This is true, but the primary stack just needs to maintain a list of sub-stacks. If parallel tasks are limited to not launching other tasks, then sub-stacks only ever get allocated off the primary stack, so it’s easy to track them. No need for it to be that exotic.
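To make that bookkeeping concrete, a rough sketch (continuing the made-up `vari`/`SubStack` types from above; `PrimaryStack` and its methods are illustrative, not the real Stan Math API):

```cpp
// Hypothetical sketch: the primary stack keeps a registry of the
// sub-stacks handed out by parallel calls, so global operations only
// need one extra loop over that registry.
#include <vector>

struct PrimaryStack {
  std::vector<vari*> nodes_;           // the usual single tape
  std::vector<SubStack*> sub_stacks_;  // sub-stacks created by parallel calls

  // Parallel calls register their sub-stacks here during the forward pass.
  // Because tasks may not launch further tasks, nothing else ever needs to.
  void register_sub_stack(SubStack* s) { sub_stacks_.push_back(s); }

  void set_zero_adjoint() {
    for (vari* v : nodes_) v->adj_ = 0.0;
    for (SubStack* s : sub_stacks_)
      for (auto& v : s->nodes_) v->adj_ = 0.0;
  }

  void recover_memory() {
    nodes_.clear();        // in practice: free the arena(s)
    sub_stacks_.clear();
  }
};
```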