That’s going to be true for most mathematical operations in decently optimized software. Getting data from memory into the CPU registers is expensive relative to the arithmetic itself.
I don’t know how much the interfaces parallelize compilation, but you can do a lot in parallel from CmdStan on the build side.
For parallelizing chains, that’s useful up to four or so—enough to diagnose multimodality and non-convergence, but beyond that you’re mainly getting proportionately higher effective sample sizes for the same wall time. Each extra chain still has to pay the price of warmup.
When we get MPI done, the way to speed things up will be parallelizing within chains. That’ll be able to use a lot of cores. So 16, 32, or a cluster full of cores would be good in that situation if you really need to scale. Personally, I think I’m going to get one of the new iMacs because I won’t have to learn a new OS and I want a nice 5K screen.
If it’s even a question, a good solid-state disk is a necessity, since compiling Stan pulls in an awful lot of little files.