I see the following suggestion in the vignette about within-chain parallelization using brms:
For a given Stan model one should usually choose the number of chains and the number of threads per chain to be equal to the number of (physical) cores one wishes to use.
Suppose I have 24 CPUs on a computer. To take full advantage of the potential speed gain, I would choose 6 threads per chain for a model using 4 chains. I expected all 24 CPUs to be active most of the time, but that does not seem to be the case: when I checked the CPU usage, I rarely saw more than 4 CPUs in use.
If I want to simultaneously run 5 separate models, can I still choose 6 threads per chain for each of the 5 models? Is this considered hyper-threading?
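For reference, a minimal sketch of how such a run would be set up in brms (the formula, data object `df`, and variable names here are hypothetical placeholders, not from the original post):

```r
library(brms)

# Hypothetical model; `df` stands in for your 32005-row data.frame.
fit <- brm(
  y ~ x + (1 | g), data = df,
  chains  = 4, cores = 4,      # 4 chains run as 4 parallel processes
  threads = threading(6),      # 6 threads per chain -> 24 threads in total
  backend = "cmdstanr"         # within-chain threading requires the cmdstanr backend
)
```

With this configuration, 5 such models running simultaneously would request 120 threads in total, far more than the 24 physical cores.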
How large is your data set? The minimal grainsize enforced is 100, which limits parallelism in cases with few data rows.
You should not fire off more Stan threads at the same time than you have physical CPU cores… but there can always be exceptions to this “rule of thumb”.
The data.frame has 32005 rows and 4 columns in long format.
Could you elaborate on how grainsize is defined?
In my test of a model with 4 chains and 8 threads per chain, I didn’t see more than 4 CPUs involved on the few occasions I checked the CPU usage. Still, the runtime was 4 times shorter than the original job with no within-chain parallelization, which is why I thought I might be able to run several models simultaneously through hyper-threading.
The default grainsize is max(100, data rows / (2 * # of threads requested))… I do not think you need to bother with this.
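Under that formula, the default for this data set works out as follows (a sketch of the stated rule, not the exact brms source; the function name is made up for illustration):

```r
# Default grainsize as described above: max(100, rows / (2 * threads))
default_grainsize <- function(n_rows, n_threads) {
  max(100, floor(n_rows / (2 * n_threads)))
}

default_grainsize(32005, 8)  # 8 threads per chain -> grainsize 2000
default_grainsize(500, 8)    # few rows -> the enforced minimum of 100 kicks in
```

So with 32005 rows the minimum of 100 is not binding, and each chain has roughly 16 chunks of work to spread across its 8 threads.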
Getting a 4x speedup with 8 threads per chain is very good.
Just try things out. It’s very hard to give general advice here… whether you want the greatest throughput or the shortest walltime per model run will determine what you should do.