I understand there is a lot of development in this area that I’m not fully aware of; this is just me floating an idea that works simply and neatly in my case and doesn’t ‘seem’ conceptually hard to implement. So:
In order to parallelise a Stan program that does the same thing over many different subjects, in R I just split the data for the subjects into groups and compute grad_log_prob separately for each group, either with a correction to the prior for each group passed in as data or (slightly more efficiently) by only computing the prior for one group.
This only works (at present) for custom computations in R and not with Stan’s sampler, but some kind of layer between fitting algorithms and Stan models, one that allowed multiple datasets / models, might be useful for a number of things, I think.
That looks like a nice improvement, but it wouldn’t be general enough for my case – packing a complex model up into a function that returns a single value is just not always feasible / desirable. Though maybe once ragged arrays also make it in, this ends up being sufficient…
Have a look again at reduce_sum. The function you need to write can take any number of arguments! This facility is extremely flexible. I have myself ported very complex models over to reduce_sum in a matter of 30 minutes or so.
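To make that argument flexibility concrete, here is a minimal sketch (using the array syntax of the Stan versions that introduced reduce_sum; the negative-binomial likelihood and the names partial_sum, y, X, beta, phi are placeholders, not anyone’s actual model). Everything passed to reduce_sum after the grainsize is forwarded unchanged to the slicing function after its start and end arguments:

```stan
functions {
  // partial log likelihood over one slice of observations; any number of
  // further arguments after start/end are passed straight through
  real partial_sum(int[] y_slice, int start, int end,
                   vector eta, real phi) {
    return neg_binomial_2_log_lpmf(y_slice | eta[start:end], phi);
  }
}
data {
  int<lower=1> N;
  int y[N];
  matrix[N, 3] X;
}
parameters {
  vector[3] beta;
  real<lower=0> phi;
}
model {
  int grainsize = 1;  // let the scheduler choose the slice sizes
  beta ~ normal(0, 1);
  phi ~ exponential(1);
  // reduce_sum slices y, evaluates partial_sum on each slice in parallel,
  // and adds the total to target
  target += reduce_sum(partial_sum, y, grainsize, X * beta, phi);
}
```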
@wds15 – yes, the input flexibility side of things is solved by reduce_sum, which is a wonderful improvement, but output flexibility is still limited. I could probably use a reduce_sum version of the model for fitting and the regular version for computing output after fitting, but that’s still messy… assignment from within functions to the call site would also do the trick, something like the <<- operator in R, though that’s getting a little wild ;)
For fast fitting you really only need to worry about getting the log likelihood and its gradient fast. From one iteration to the next, Stan internally makes a number of calls to that function. So it will be most efficient to fit your model first using reduce_sum… and then parallelise over the different draws with whatever other mechanism you like (R’s parallel tooling or whatever).
This forces you to write your model twice (once for reduce_sum and once for simulation), but that’s a matter of arranging functions accordingly. Computationally this is most efficient, and I think that if you want to use these parallelisation techniques, then following good coding practice is worth it. I mean, performance in Stan is constrained by the size of the AD graph, and this is kept small by doing the reduce immediately.
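One way of “arranging functions accordingly” (just a sketch, not necessarily what is meant above; the file and function names are hypothetical) is Stan’s #include mechanism: keep the shared per-subject log likelihood in one file and include it from both the reduce_sum fitting program and the simulation program, so only the thin wrappers are written twice:

```stan
// fit_model.stan -- a separate simulation / generated-quantities program
// would #include the same file in its own functions block
functions {
  // subject_lp.stan is assumed to define
  //   real subject_lp(int[] y, vector eta, real phi)
#include subject_lp.stan
  real partial_sum(int[] y_slice, int start, int end,
                   vector eta, real phi) {
    return subject_lp(y_slice, eta[start:end], phi);
  }
}
```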
Sure, simulation requires a rewrite anyway, but there can be a range of interesting quantities (e.g. a subject-specific correlation matrix) that are computed during the fitting process. That could then mean two rewrites: one for saving incidental quantities, one for simulating new ones. Maybe I’m doing it wrong, and it’s always more efficient not to save stuff along the way anyway?
Yes! Do not save stuff if you are worried about fitting performance. If your concern is performance, then compute as little as possible beyond what is needed for the log-lik value (the gradient is handled in Stan’s back end for you). Then do a separate run for the quantities you want to calculate, where you parallelise over the draws of the posterior.
I mean, think about it: the sampler often has a tree depth of ~6 in my problems. That means we do 2^6 = 64 leapfrog moves between iterations. If you start saving stuff, 63 of those 64 evaluations are discarded anyway, since you only keep the temporaries from the one move that actually becomes the saved draw.
So it’s much better to first do the inference and then the generated quantities, which is supported by the standalone generated-quantities facilities in RStan (gqs) and CmdStan, for example.
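For concreteness, a standalone generated-quantities program just repeats the data and parameters declarations of the fitted model and puts the derived or simulated quantities in a generated quantities block; it can then be run over the saved draws (with rstan::gqs() or CmdStan’s generate_quantities method) and, since each draw is processed independently with plain double arithmetic, parallelised over draws. A sketch reusing the hypothetical names from above:

```stan
// gq_model.stan: recompute derived / simulated quantities from saved draws,
// without touching the sampler
data {
  int<lower=1> N;
  int y[N];
  matrix[N, 3] X;
}
parameters {
  // must match the parameters block of the fitted model
  vector[3] beta;
  real<lower=0> phi;
}
generated quantities {
  vector[N] eta = X * beta;
  int y_rep[N] = neg_binomial_2_log_rng(eta, phi);
}
```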
Ok, but just so I understand, where do the costs come from? From my fairly naive perspective those stored calculations are discarded 63x anyway, even if they occur in a local block and are not explicitly saved. The crude tests I just ran also suggest that, at least in my case, the computational costs outweigh the I/O-related costs to such an extent that no difference is noticeable when shifting a large transformed parameters block (saving quite a few pieces) into the model block.
Moving things between the transformed parameters and model blocks is not too helpful… but do not calculate things in transformed parameters just for the purpose of having them in the final output.
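In other words (a sketch with a hypothetical correlation-matrix example, in the spirit of the subject-specific correlation matrix mentioned above): a quantity that is only needed for reporting can live in generated quantities, where it is computed once per saved draw and stays out of the autodiff graph, rather than in transformed parameters, where it is recomputed and taped at every leapfrog move:

```stan
parameters {
  cholesky_factor_corr[4] L_corr;
}
model {
  L_corr ~ lkj_corr_cholesky(2);
  // ... likelihood ...
}
generated quantities {
  // computed once per saved draw, outside the AD graph; declaring this in
  // transformed parameters would recompute it at every leapfrog move
  corr_matrix[4] Omega = multiply_lower_tri_self_transpose(L_corr);
}
```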
This is exactly what reduce_sum does for you internally: the sum is added to the target and all the gradients are accumulated behind the scenes.
The difficulty is in generality. It’s not always so easy to split the data into groups with minimal numbers of parameters when there’s hierarchical structure. Or spatio-temporal structure. Or you have something like a GP structure.
I’m not sure where it’s at, but @bbbales2 and @wds15 are writing docs with examples.
We use references so we can call back into memory, which doesn’t require R/Javascript-style environment-reflection shenanigans.