Parallel autodiff v3

This is all racing ahead much faster than I can keep up with. I have no idea how map_rect works now, much less how it could be made to work with closures.

The whole notion of data was meant to cover things that we don’t autodiff against and that we can pass to functions like ode_integrate. I think @wds15 is now imagining it as something that’s immutable after the data is read.

I don’t even know if map_rect works correctly now.

What happens if I do this in a Stan program:

for (i in 1:2)
  y = map_rect(f, phi, thetas, { { 1.0 }, { 2.0 } }, { { i }, { i } });

The integer data changes on each iteration. But I thought we tagged memory batches by instance of map_rect. So I don’t see how a single map_rect call can be assigned to a processor with fixed data when the data changes from call to call.
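To make the worry concrete, here is a minimal C++ sketch. It assumes data is cached once per call site and reused on later calls, which is my reading of the tagging scheme and may not match what the code really does. `CallSiteCache`, `fetch`, and `run_loop` are hypothetical names, not anything in Stan Math.

```cpp
#include <map>
#include <vector>

// Hypothetical per-call-site cache of integer data; not the Stan Math API.
class CallSiteCache {
 public:
  // Returns the data a worker would see for this call site.
  // The first call stores x_i; later calls ignore their argument
  // and return whatever was stored.
  const std::vector<int>& fetch(int call_id, const std::vector<int>& x_i) {
    auto it = cache_.find(call_id);
    if (it == cache_.end()) {
      it = cache_.emplace(call_id, x_i).first;
    }
    return it->second;
  }

 private:
  std::map<int, std::vector<int>> cache_;
};

// Mimics the loop above: one call site, integer data {i, i} for i = 1, 2.
// Returns the first integer the worker sees on each iteration.
std::vector<int> run_loop() {
  CallSiteCache cache;
  const int call_id = 1;  // same source location on every iteration
  std::vector<int> seen;
  for (int i = 1; i <= 2; ++i) {
    std::vector<int> x_i = {i, i};
    seen.push_back(cache.fetch(call_id, x_i).front());
  }
  return seen;
}
```

Under that assumption the second iteration sees the data from the first, so the worker gets 1 both times instead of 1 and then 2.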

Also, what if the function f in my map_rect call also calls map_rect? That’s all fine under serial evaluation, but I have no idea what’ll happen with MPI or threading. Does anything make sure we don’t wind up spawning hundreds of threads or processes?
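As a back-of-the-envelope sketch of how fast this could grow, suppose each nested level naively launches one worker per job and never reuses workers. That is an assumption for illustration, not a claim about what the MPI or threading backends do, and `workers_launched` is a made-up helper.

```cpp
#include <cstddef>

// Counts the workers a naive nested map would launch if every level
// starts one worker per job and never reuses them.
// jobs:  number of jobs mapped at each level
// depth: nesting levels (1 = a single, non-nested map)
std::size_t workers_launched(std::size_t jobs, std::size_t depth) {
  if (depth == 0) return 0;
  // one worker per job at this level, plus whatever each job launches
  return jobs + jobs * workers_launched(jobs, depth - 1);
}
```

With ten jobs per level, `workers_launched(10, 1)` is 10 but `workers_launched(10, 2)` is already 110, so a single level of nesting is enough to reach hundreds.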

The current doc doesn’t say what the requirements are.

So I’d like to start with fixing the current doc to indicate the requirements on the data arguments, etc. Then maybe we can fix the parser to deal with that.

Why not?

Is this only going to apply to functions with a single scalar output? Why that restriction? And what is “automatic parallelism” (I think that’s the same question @seantalts asked).

Even if it weren’t, there’s no way to write a stateful function in the Stan language. Of course, you could in the math lib, but it’d be very tricky now given our basic abstraction that log density evals are independent.

So closures won’t be any use at all here. Their main utility in Stan would be implicit capture of parameters and data, so that we wouldn’t need to do all the packing and unpacking. But that won’t work with the parallelization we have in place, so I don’t see the point of adding closures. Sure, they’re cool, but I don’t think they’ll be much use in Stan if we can’t allow them anywhere they’d matter.

How can start_nested work in parallel? It’s using a resizable contiguous stack.
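Here is a simplified C++ sketch of what worries me. The `Tape` struct is a hypothetical stand-in for a single contiguous autodiff stack with nested markers, not the actual Stan Math implementation, and the interleaving of two tasks is simulated sequentially rather than run on real threads.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical, simplified tape; not the Stan Math implementation.
struct Tape {
  std::vector<int> nodes;                  // one contiguous, resizable stack
  std::vector<std::size_t> nested_starts;  // size of nodes at each start_nested

  void push(int node) { nodes.push_back(node); }
  void start_nested() { nested_starts.push_back(nodes.size()); }
  // Drops everything recorded since the most recent start_nested().
  void recover_nested() {
    nodes.resize(nested_starts.back());
    nested_starts.pop_back();
  }
};

// Two logical tasks, A and B, interleave on one shared tape.
// Node ids 100+ belong to A, 200+ to B. Returns the surviving nodes.
std::vector<int> interleaved() {
  Tape tape;
  tape.push(1);           // outer node
  tape.start_nested();    // task A opens its nested section
  tape.push(100);
  tape.start_nested();    // task B opens its nested section
  tape.push(200);
  tape.push(101);         // A records again, above B's marker
  tape.recover_nested();  // B finishes and pops back to its marker
  return tape.nodes;      // A's node 101 is gone though A has not finished
}
```

When B recovers its nested section it also throws away the node A recorded in the meantime. With a stack like this, each task would presumably need its own tape.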

Is there an example? I never figured out how we could split out all the arguments given that we have shared parameters, mapped parameters, mapped real data, and mapped integer data. How could the function call know which is which? You can’t have four parameter packs in a function; it’d be ambiguous.
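For comparison, here is how the ambiguity can be sidestepped in C++ (not in the Stan language): wrap each group in its own tuple so the boundaries are explicit. This is only a sketch of the idea. `group_sizes` is a hypothetical helper, and nothing here is a proposal for the actual signature.

```cpp
#include <array>
#include <cstddef>
#include <tuple>

// Each std::tuple marks one argument group, so all four packs can be
// deduced independently. Returns the number of arguments in each group,
// in the order: shared params, mapped params, mapped reals, mapped ints.
template <typename... Shared, typename... Params, typename... Reals,
          typename... Ints>
std::array<std::size_t, 4> group_sizes(const std::tuple<Shared...>&,
                                       const std::tuple<Params...>&,
                                       const std::tuple<Reals...>&,
                                       const std::tuple<Ints...>&) {
  return {sizeof...(Shared), sizeof...(Params), sizeof...(Reals),
          sizeof...(Ints)};
}
```

Calling `group_sizes(std::make_tuple(1.0, 2.0), std::make_tuple(0.5), std::make_tuple(1.0), std::make_tuple(1, 2, 3))` gives sizes 2, 1, 1, and 3. The cost is that the caller has to do the grouping by hand, which is the packing we were hoping to avoid.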

It already does with print(). There shouldn’t be a problem with functions that take variable-length argument lists if they’re well defined.