Stan significantly slower after incorporating multithreading?

I was under the impression that grainsize = 1 means the grainsize is chosen automatically (see here). I added some print statements to the partial sum function and found that the groups each contained around 3 or 4 elements. But your point about operations per group is well taken. In fact, I found this discussion, which covers exactly what you mentioned.
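
For anyone who wants to repeat the check, here is a minimal sketch of a print statement in a partial sum function. It is not my actual model: the logistic regression and the names partial_sum_lpmf, y, x and beta are placeholders along the lines of the reduce_sum example in the Stan User's Guide, and it assumes a Stan version with the array[] syntax.

```stan
functions {
  // Partial sum over one slice of y; start and end index into the full data.
  real partial_sum_lpmf(array[] int y_slice, int start, int end,
                        vector x, vector beta) {
    // Report how many elements this call was handed.
    print("slice size: ", end - start + 1);
    return bernoulli_logit_lupmf(y_slice | beta[1] + beta[2] * x[start:end]);
  }
}
data {
  int<lower=0> N;
  array[N] int<lower=0, upper=1> y;
  vector[N] x;
}
parameters {
  vector[2] beta;
}
model {
  int grainsize = 1;  // the setting discussed above
  beta ~ std_normal();
  target += reduce_sum(partial_sum_lupmf, y, grainsize, x, beta);
}
```

The print fires on every evaluation of the log density, so it is worth running only a handful of iterations when doing this.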