Optimal num_stan_threads when using multiple chains

I thought I was having the same issue. I'm running Stan in Docker (DigitalOcean) with 6 vCPUs (Intel Xeon Gold 6140 @ 2300 MHz, 25 MB cache). I was hoping to use 16 vCPUs, but even with 6 I don't get more than 66% CPU utilization.

I made sure I had the correct Makevars file ($HOME/.R/Makevars):
CXXFLAGS=-O3 -mtune=native -march=native -Wno-unused-variable -Wno-unused-function -flto -ffat-lto-objects -Wno-unused-local-typedefs -Wno-ignored-attributes -Wno-deprecated-declarations
and these settings in R:
options(mc.cores = parallel::detectCores())
rstan_options(auto_write = TRUE)
Sys.setenv(LOCAL_CPPFLAGS = '-march=native')

I get 4 chains on 4 CPUs at 100%, but the other 2 CPUs sit totally unused, which is where the 66% (4 of 6) comes from.
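To make that arithmetic explicit, here is a minimal sketch in base R. It assumes each chain runs single-threaded on its own core, which matches what I see here. The helper name `expected_utilization` is made up for illustration and is not part of rstan.

```r
# Hypothetical helper (not an rstan function): upper bound on overall CPU
# utilization, assuming every chain is single-threaded and pinned to one core.
expected_utilization <- function(chains, cores) {
  busy <- min(chains, cores)  # a chain cannot keep more than one core busy
  busy / cores                # fraction of the machine doing work
}

# My setup: the default 4 chains on a 6 vCPU droplet.
print(expected_utilization(chains = 4, cores = 6))

# Matching the number of chains to the number of cores fills the machine.
print(expected_utilization(chains = 6, cores = 6))
```

Under that assumption, the only way to keep all 6 cores busy is to run at least 6 chains (for example via the `chains` argument of `sampling()`); that gives more draws in the same wall time, but it does not make any single chain finish faster.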

This seems to be a limitation of Stan, as noted by @Mike_Terrell:

It would save a lot of time if we could split 1 chain over 2 CPUs (or even 3) and, hopefully, cut the sampling time accordingly. But as @Bob_Carpenter points out: