Since I am using a lot the log_sum_exp function I thought it would not hurt to give it analytic gradients using the operands_and_partials facility. This cuts down quite a bit of code in comparison to a bunch of vari definitions which are used currently. I think I did the obvious definition of the function with operands_and_partials, see below. However, the code with the analytic gradients is ~15% slower than the original one. All tests defined for log_sum_exp pass with the definition below.
This is counter-intuitive to me and if someone has an idea why things get slower even if analytic gradients are given, that would be helpful. With some luck this means that our operands_and_partials could be tweaked which would be a noticeable speedup. The only thing I see is that there is additional copying going when using the operands_and_partials…maybe c++11 moving could remove this? Not sure.
hmm… compilers do not make a difference here. clang++ and g++ 6 show the same behavior. However, I think I was able to shorten the performance penalty by a few percent (maybe ~3%, so down to 12% slowdown) by simply declaring the build function of operands_and_partials as const since it does not change at all the state of operands_and_partials.
I think this is also what Matthijs is recently finding with his GLM refactors. Maybe that means we have to take a closer look at operands and partials.
Yes, this is why I am reporting it here. Any optimization on operands_and_partials should have a direct impact on overall Stan performance.
The only issue I can see so far is the additional copying which is going on which we probably can avoid. So what the class does is
upon construction allocate storage for the partials in doubles/vectors/Eigen structures; this depends on types of the edges passed in
the client code fills in the partials
the build method allocates memory on the ad stack and then copies things over
Hence, the memory for the partials is allocated twice. The way operands_and_partials is written would allow to reuse any instance many times with the same operands, but I am not sure we use this behavior. All functions I have seen end with a return ops_partials.build() statement. So maybe the double allocation of the memory for the partials could be eliminated. If the compiler is smart enough that will be done already… is there an easy way to figure this out?
As requested by @seantalts yesterday I have pushed the branch with the analytic gradients for log_sum_exp to github. In that branch I have already started to guess a bit at what could improve the performance. The perfomance tests themselves are not part of that branch. What I am running as test is the test model which is part of the cmdstan mpi branch. That model is a PK model which does a ton of log_sum_exp calls.
If we figure out why there is such a slow down and how to avoid, then this would be a very likely an appreciable speedup for Stan itself - that’s my guess here.
That’s exactly what I’m thinking. If we can preallocate into the autodiff expression graph, we can avoid this copy.
There’s also the work of assuming it’s really six elements long all the time—I haven’t looked at it closely enough to see how much of that can be compiled away in space and time.
The other issue to think about is memory locality.
I am glad we think the same here. This should be easily tested with some prototype code.
Now… this discussion got me thinking and maybe the gradient precomputation thing can be masivley parallelized in the following way: The issue with parallelism and AD is the requirement to get the AD tree build up with the pointers in order. So operands need to be linked with their results and the gradients - and this needs to happen in sync. However, we have many functions which use the operands and partials. If now a new operands and partials just pre-allocates the storage needed for everything on the stack, then we have the door open to parallelizing it, I think. So what I think of could be done is:
A function with analytic gradients pre-allocates all needed memory on the AD stack as a first step.
Then the function registers the operands.
Next the function value is calculated (but not more).
Instead of now calculating the gradients, the function returns instantly and submits to a threadpool the task to fill in the gradients.
This way the program can keep building the AD tree while the analytic gradients are being calculate asynchronously! C++11 brings all we need for this with it - as I understood the future concept would probably be the right thing for this.
Should the above idea work, then we would probably see quite some speed gains due to this alone. Have I missed anything? The key idea is to maintain the only operation which needs correct order in sync (memory allocation, operands linking), but let everything else execute on demand in parallel.
Thoughts? (This only works for gradients though, not higher order)
Yes, that’ll work. I thought this is basically what you’re doing with MPI.
Three things. First, some of those computations have a lot of serial dependencies for intermediate quantities. Second, we need more cores to use all the speed gains. Third, all that coordination.
MPI is slightly different. With MPI we put everything into distinct processes. Each process has its own AD tree and as such there are no constraints within each process.
What I am suggesting here is essentially an alternative to openMP.
The serial dependencies could be a problem, yes; we would need to check if this is a showstopper or not. However, with closures and stuff like that it should be at least easy to program (I mean using the intermediates).
I am not worried about the cores. For high performance computing you can assume clusters with many cores. Moreover, MPI allows parallelization between machines while what I have described above works only on a given machine.
I am a bit overloaded with all those Stan projects such that I won’t be able to do a prototype any time soon, but we should keep this in the back of our minds.
This sounded like how you have to put the autodiff tree back together with the result like you do for MPI.
If you pull up 100 cores and then ask each one to parallelize low-level calcs over 20 cores, you’re going to be requesting a whole lot of cores. I might be on a 16-core desktop or a shared cluster at Columbia rather than at Google or on AWS.
I can promise you: For any problem which involves to solve an ode for multiple subjects you will have a huge smile in your face once you went through the trouble and coded things up using MPI map rect and use for the start 4 cores with it. Stan then just prints those iteration messages so much faster!
I am really curious how this approaches fares for variable selection or the like.