Did I implement reduce_sum correctly?

First, I’d check that you’re getting the same answers (up to seed-to-seed variation) from runs with reduce_sum as from runs without it. It’s generally considered easier to optimize a correct program than to debug an optimized one, or as Knuth famously wrote, “premature optimization is the root of all evil.”
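One concrete way to run that check is to write the likelihood both ways in the same program and confirm the fits agree. Here’s a minimal sketch, assuming a hypothetical normal model (the names `partial_sum`, `mu`, and `sigma` are illustrative, not from your code); the `reduce_sum` call should match the commented-out direct sum.

```stan
functions {
  // log density summed over one slice of the data
  real partial_sum(array[] real y_slice, int start, int end,
                   real mu, real sigma) {
    return normal_lpdf(y_slice | mu, sigma);
  }
}
data {
  int<lower=0> N;
  array[N] real y;
  int<lower=1> grainsize;
}
parameters {
  real mu;
  real<lower=0> sigma;
}
model {
  // parallel version; should agree with the direct version
  //   target += normal_lpdf(y | mu, sigma);
  // up to floating-point reordering of the sum
  target += reduce_sum(partial_sum, y, grainsize, mu, sigma);
}
```

Because `reduce_sum` partitions the data nondeterministically, the two versions will only agree up to floating-point reordering of the sum; if you need a reproducible partition for debugging, `reduce_sum_static` with a fixed grainsize gives deterministic slices.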

This is often true if you use our random initializations, which are uniform between -2 and 2 on the unconstrained parameter scale. It’s not uncommon to see a factor of 2 or more difference in speed across seeds. Reducing the range from (-2, 2) to (-0.5, 0.5) can often stabilize things, at the risk of missing outlier modes, failing to diagnose bad mixing, etc. This is usually OK to do; @andrewgelman, for example, has been urging us to use less diffuse initializations (which seems to be a reversal of decades of advice urging people to use more diffuse initializations to diagnose poor mixing).
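If you’re running via CmdStan, for instance, the initialization range is controlled by the `init` argument: a positive real `x` draws initial values uniformly from (-x, x) on the unconstrained scale. A sketch (the model and data file names are hypothetical):

```shell
# default inits are uniform(-2, 2) on the unconstrained scale;
# init=0.5 narrows that to uniform(-0.5, 0.5)
./my_model sample init=0.5 data file=my_data.json
```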

reduce_sum is only likely to provide a big speed boost when the amount of work done on each thread (or process, if using MPI) dominates the communication cost. For example, it’s almost never worth it for a simple GLM, but it’s almost always worth it for models with ordinary differential equation solvers nested inside the likelihood, or for models that require a lot of matrix solves. All of the code inside your reduce_sum computation appears to be relatively simple arithmetic.
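To illustrate the kind of model where it does pay off, here’s a hedged sketch of one ODE solve per observation inside the partial-sum function (a hypothetical exponential-decay model; `decay`, `k`, `subj`, etc. are illustrative names, not your code). Each slice does an expensive `ode_rk45` call, so the per-thread work swamps the communication overhead:

```stan
functions {
  vector decay(real t, vector y, real k) {
    return -k * y;  // simple one-compartment exponential decay
  }
  real partial_sum(array[] real y_slice, int start, int end,
                   array[] vector y0, array[] real t_obs,
                   array[] int subj, vector k, real sigma) {
    real lp = 0;
    for (n in start:end) {
      // one ODE solve per observation: this is the expensive work
      // that makes parallel slices worthwhile
      array[1] vector[1] sol
        = ode_rk45(decay, y0[n], 0, {t_obs[n]}, k[subj[n]]);
      // y_slice is the sliced argument, so it is indexed locally
      lp += normal_lpdf(y_slice[n - start + 1] | sol[1][1], sigma);
    }
    return lp;
  }
}
data {
  int<lower=0> N;
  int<lower=1> J;
  array[N] int<lower=1, upper=J> subj;
  array[N] vector[1] y0;
  array[N] real<lower=0> t_obs;
  array[N] real y_obs;
  int<lower=1> grainsize;
}
parameters {
  vector<lower=0>[J] k;
  real<lower=0> sigma;
}
model {
  k ~ lognormal(0, 1);
  sigma ~ lognormal(0, 1);
  target += reduce_sum(partial_sum, y_obs, grainsize,
                       y0, t_obs, subj, k, sigma);
}
```

Note the indexing convention: shared arguments (`y0`, `t_obs`, `subj`) use the global index `n`, while the sliced argument (`y_slice`) uses the local index `n - start + 1`; mixing these up is a common source of wrong answers with reduce_sum.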