Before I dive into this, I wanted to ask whether it is in principle possible to implement within-chain parallelization for a model with multivariate outcomes. As an example, consider the SUR model in chapter 1.15 of the Stan user guide (https://mc-stan.org/docs/stan-users-guide/multivariate-outcomes.html). For now, let's just assume that reduce_sum should be implemented as in chapter 26.1 of the user guide (https://mc-stan.org/docs/stan-users-guide/reduce-sum.html), i.e., slicing over y. The complication here is that y is neither an array nor a vector; it is an array of vectors. Is this possible? And how would one go about programming this?
OpenCL, unfortunately, doesn't help much here, because multi_normal_cholesky does not (yet) seem to be optimized for this.
You can see that the sliced variable is declared as T[], so it can be any type, including an array of vectors. So it's programmed exactly the same way as in the univariate case.
It's important when doing this that the inner calls within reduce_sum themselves use vectorization and don't just loop over items. Each call to the multivariate normal has to do a solve over the covariance matrix, which is why a Cholesky-factor parameterization is so much more efficient (quadratic rather than cubic).
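For concreteness, here is a minimal sketch (my own, not from the user guide) of the SUR model with y sliced directly via reduce_sum; the name partial_sum and grainsize = 1 are arbitrary choices, and the priors just mirror the guide's example:

```stan
functions {
  real partial_sum(array[] vector y_slice, int start, int end,
                   array[] vector mu, matrix L_Sigma) {
    // one vectorized call over the whole slice, not a loop over
    // observations, so the Cholesky factor is shared within the slice
    return multi_normal_cholesky_lpdf(y_slice | mu[start:end], L_Sigma);
  }
}
data {
  int<lower=0> N;
  int<lower=1> K;
  int<lower=1> J;
  array[N] vector[J] x;
  array[N] vector[K] y;
}
transformed data {
  int grainsize = 1;  // 1 lets the scheduler pick the slice sizes
}
parameters {
  matrix[K, J] beta;
  cholesky_factor_corr[K] L_Omega;
  vector<lower=0>[K] L_sigma;
}
model {
  array[N] vector[K] mu;
  matrix[K, K] L_Sigma = diag_pre_multiply(L_sigma, L_Omega);
  for (n in 1:N) {
    mu[n] = beta * x[n];
  }
  to_vector(beta) ~ normal(0, 5);
  L_Omega ~ lkj_corr_cholesky(4);
  L_sigma ~ cauchy(0, 2.5);
  target += reduce_sum(partial_sum, y, grainsize, mu, L_Sigma);
}
```

The single vectorized multi_normal_cholesky_lpdf call inside the partial sum is what keeps each slice from redoing per-observation work.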
You do not have to slice your data y via the first argument. You can instead pass a dummy argument as the first argument, which is then sliced over. That dummy argument just has to have as many elements as your actual data, which go in as shared arguments in this case. You can take a look at the code generated by brms to see an example of this approach. It does not incur a noticeable performance cost at all.
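A minimal sketch of that dummy-slicing idea, assuming a simplified common-mean model rather than brms's actual generated code; seq, partial_sum_seq, and grainsize are illustrative names of my own:

```stan
functions {
  real partial_sum_seq(array[] int seq, int start, int end,
                       array[] vector y, array[] vector mu,
                       matrix L_Sigma) {
    // seq is sliced only to obtain start and end; the actual data
    // are indexed out of the shared arguments
    return multi_normal_cholesky_lpdf(y[start:end] | mu[start:end],
                                      L_Sigma);
  }
}
data {
  int<lower=0> N;
  int<lower=1> K;
  array[N] vector[K] y;
}
transformed data {
  int grainsize = 1;
  array[N] int seq = linspaced_int_array(N, 1, N);  // dummy slicing arg
}
parameters {
  vector[K] mu_common;
  cholesky_factor_corr[K] L_Omega;
  vector<lower=0>[K] L_sigma;
}
model {
  matrix[K, K] L_Sigma = diag_pre_multiply(L_sigma, L_Omega);
  array[N] vector[K] mu = rep_array(mu_common, N);
  target += reduce_sum(partial_sum_seq, seq, grainsize, y, mu, L_Sigma);
}
```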
Once Stan implements the tuple feature, it would be nice to have a function that takes variadic tuples and parameters as input. The parameters are shared, and for each element of the tuple(s) the statements within the function are executed in parallel using a limited thread pool. The result is given back as a tuple. Something like what MATLAB does with its parfor statement, but more in a functional-programming sense, like the awesome reduce_sum function of Stan.
The current reduce_sum works with a limited thread pool.
So what you are asking for is essentially a parallel foreach? The key difference to reduce_sum would be that one would not run the summation reduction on the outputs, if I get you right. Instead, one would return an array with as many elements as were given to the foreach.
Maybe an issue on GitHub would be good to get this started, and it should include pseudocode for how this could work in Stan. I'd place that in the Stan-math repo, but I am not 100% sure where to stick it. Ideally this would be a design doc, but I would not ask for that now.
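To make such an issue concrete, something like the following purely hypothetical pseudocode might go into it. To be clear, parallel_foreach and its signature are invented here for illustration and do not exist in Stan; theta and phi stand for some array of parameter vectors and a shared parameter vector declared elsewhere:

```stan
// Hypothetical pseudocode only -- nothing below exists in Stan.
functions {
  // user function applied independently to each element
  vector f(vector theta_j, vector phi) {
    return theta_j + phi;  // stand-in for the real per-element work
  }
}
model {
  // would evaluate f over every element of theta in parallel on the
  // shared thread pool and return all outputs as an array (or a tuple
  // for ragged shapes), instead of summing them as reduce_sum does
  array[J] vector[K] out = parallel_foreach(f, theta, phi);
  // ... use out in the model ...
}
```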
I am also not sure whether I would have time to work on it… reduce_sum is already quite powerful for inference, such that the parfor seems like a convenience to some extent at least… which is not bad at all.
It’s about a parallel foreach thing, correct.
Let us first summarise what we have so far: we have map_rect, reduce_sum, and the option to use an external BLAS library for parallelism. I get high CPU utilization with reduce_sum combined with OpenBLAS. A parallel foreach would be similar to map_rect.

Why would a parallel foreach be beneficial, apart from a nicer syntax than map_rect? The motivation is cache locality. The parameters in each iteration of a parallel foreach can be sized to fit the L2 cache of the CPU core assigned to the executing sub-task. I have observed that parallelization utilizes all my CPUs highly, yet the speedup was below what I would expect. One explanation for this may be that the caches are used poorly.
I think the implementation should be similar to an _lp function, where we can add to the derivative graph and return the results. Furthermore, the solution should support ragged arrays, for which tuples are an excellent solution. So, given the time constraints of your work schedule and the fact that we first need tuples in Stan, we should postpone the idea until then.
Many thanks for the awesome reduce_sum functionality; it's a great and lean solution for parallelism.