reduce_sum: cores, chains, and threads

Following the case study Reduce Sum: A Minimal Example, I get essentially no speed-up. I have tried both my local machine and a cluster, with different values for chains, cores, and threads, but I rarely see any speed-up, and never anything close to the 2.7x speed-up reported in the case study.

I am now trying regular CmdStan to see whether any speed-up is possible there, following the Cmdstanr reduce sum case study, but it fails with: unused argument (threads = TRUE).