I am parallelizing a Stan model with the map_rect() function, but the speedup is smaller than I expected (N times faster with N shards). I am wondering whether this is because HMC slows the model down. Does anyone know of a way to measure the time spent in HMC alone, or in the map_rect() function alone?
I don’t think that’s currently possible without writing your own C++ code (and IMHO it would require some clever hacks even in C++).
Unfortunately, an N-fold speedup is not something map_rect can deliver: the benefit of parallelizing over N cores will always be less than an N-fold speedup, sometimes noticeably so (even if HMC took no time at all). Without knowing the specifics of your model, it is impossible to say how much of a speedup you should expect.
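To make the "always less than N times" point concrete, here is a minimal sketch based on Amdahl's law. It is a textbook idealization, not a measurement of map_rect: the function name `amdahl_speedup` and the parallel fractions used below are hypothetical, chosen only for illustration, and the formula ignores communication overhead entirely.

```python
def amdahl_speedup(parallel_fraction: float, n_cores: int) -> float:
    """Ideal speedup when only `parallel_fraction` of the runtime can be parallelized.

    Amdahl's law: speedup = 1 / ((1 - p) + p / N).
    This ignores all overhead, so real speedups will be lower still.
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be in [0, 1]")
    if n_cores < 1:
        raise ValueError("n_cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_cores)


if __name__ == "__main__":
    # Hypothetical parallel fractions; the true value depends on the model.
    for p in (0.5, 0.9, 0.99):
        for n in (2, 4, 8):
            print(f"p={p:.2f}, N={n}: speedup = {amdahl_speedup(p, n):.2f}")
```

Only when the whole runtime is parallelizable (p = 1) does the formula give an N-fold speedup; any serial part, such as work done outside the sharded function, caps the gain below N.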
Thanks for your reply. I just noticed that Stan 2.21.0 was released with a new map_rect() function, which is said to be much faster. I will try that new version. Also, do you know of any situation in which parallelizing would make the model even slower than the original?
In brief, parallelization comes with some overhead, which is offset only if the work done in the parallel regions is big enough. If the work is trivial instead, the cost of communication and of spawning threads can exceed the time saved by running in parallel.
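As a rough illustration of that trade-off, the sketch below uses a toy cost model. Everything in it is an assumption made for the example: the names `parallel_time` and `is_worth_parallelizing` are hypothetical, and the idea that overhead grows linearly with the number of shards is a simplification, not a documented property of map_rect.

```python
def parallel_time(work: float, n_shards: int, overhead_per_shard: float) -> float:
    """Toy estimate of wall time when `work` is split evenly across shards.

    Assumption: each shard adds a fixed overhead (communication, thread
    start-up), and those overheads add up instead of overlapping.
    """
    if n_shards < 1:
        raise ValueError("n_shards must be at least 1")
    return work / n_shards + overhead_per_shard * n_shards


def is_worth_parallelizing(work: float, n_shards: int, overhead_per_shard: float) -> bool:
    """True if the toy parallel estimate beats running all the work serially."""
    return parallel_time(work, n_shards, overhead_per_shard) < work


if __name__ == "__main__":
    # Hypothetical numbers in arbitrary time units.
    for work in (100.0, 2.0):
        t = parallel_time(work, n_shards=4, overhead_per_shard=1.0)
        print(f"work={work}: serial={work}, parallel estimate={t}")
```

With these made-up numbers, the heavy workload benefits from four shards, while the light one ends up slower than the serial run because the overhead dominates.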