My guess is that with N = 10^4 the second approach, without map_rect, will still be faster, but at some larger N the map_rect solution will overtake it.
The parallel reduce I am working on would speed up this kind of thing fully automatically (you would then get automatic sharding and vectorization), but it will take a while until it is ready (after Stan 3 at least).
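
To make "sharding" and "reduce" a bit more concrete, here is a rough conceptual sketch in Python. It is not Stan code and not how the parallel reduce is or will be implemented. The normal likelihood and the function names (`normal_lpdf_sum`, `make_shards`, `sharded_log_lik`) are all assumptions made up for illustration. The idea is only that the N terms of the log-likelihood are split into contiguous shards, each shard's partial sum is computed in one go, and the partial sums are then added up.

```python
from concurrent.futures import ThreadPoolExecutor
import math


def normal_lpdf_sum(ys, mu, sigma):
    """Sum of normal log densities over one shard of the data."""
    const = -0.5 * math.log(2.0 * math.pi) - math.log(sigma)
    return sum(const - 0.5 * ((y - mu) / sigma) ** 2 for y in ys)


def make_shards(ys, n_shards):
    """Split ys into n_shards contiguous slices of near-equal size."""
    size, extra = divmod(len(ys), n_shards)
    shards, start = [], 0
    for k in range(n_shards):
        # The first `extra` shards take one additional element each.
        stop = start + size + (1 if k < extra else 0)
        shards.append(ys[start:stop])
        start = stop
    return shards


def sharded_log_lik(ys, mu, sigma, n_shards=4):
    """Map: one partial sum per shard. Reduce: add the partial sums."""
    shards = make_shards(ys, n_shards)
    with ThreadPoolExecutor(max_workers=n_shards) as pool:
        partials = pool.map(lambda s: normal_lpdf_sum(s, mu, sigma), shards)
        return sum(partials)
```

Threads in plain Python will not give a real speedup here, so this only shows the structure of the computation. The trade-off from above shows up in the same place: each shard carries a fixed overhead, so splitting only pays off once the work per shard is large enough.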