Hey all,
I am very confused by the performance of parallelism in Stan. I built several models with different makefiles and ran them with different numbers of cores assigned to the job.
The job names below use this naming:
c1 means the model was built with the original makefile.
c2 means the model was built after adding -pthread and -DSTAN_THREADS to the makefile.
*core means the number of cores I assigned to the job.
ori means the original model, without the map_rect() function.
multi means the modified model using the map_rect() function; the number after it is the number of shards.
It looks to me like, even without multithreading, simply assigning more cores to the job improves the speed a lot?
Does the number of shards have to be the same as the number of cores?
It looks like, for the same number of cores, more shards slow the model down. Does the amount of data in each shard matter? Is there a lower threshold below which it starts to slow down?
And does it have anything to do with STAN_NUM_THREADS?
Should STAN_NUM_THREADS match the number of cores assigned?
(I set STAN_NUM_THREADS=-1 for all of them.)
Any ideas or advice would be helpful!! Thank you all!
JobName                        Elapsed    State      NCPUS
------------------------------ ---------- ---------- ----------
c1_1core_multi20               01:16:34   COMPLETED  1
batch                          01:16:34   COMPLETED  1
c1_1core_ori                   02:55:08   COMPLETED  1
batch                          02:55:08   COMPLETED  1
c1_10cores_ori                 00:47:39   COMPLETED  11
batch                          00:47:39   COMPLETED  11
c2_1core_ori                   04:31:20   RUNNING    1
c2_10cores_ori                 01:10:38   COMPLETED  11
batch                          01:10:38   COMPLETED  11
c2_10cores_multi20             01:01:26   COMPLETED  11
batch                          01:01:26   COMPLETED  11
c2_20cores_multi20             00:33:33   COMPLETED  21
batch                          00:33:33   COMPLETED  9
c2_20cores_multi50             00:41:44   COMPLETED  21
batch                          00:41:44   COMPLETED  21
c2_20cores_multi100            00:53:48   COMPLETED  21
batch                          00:53:48   COMPLETED  21
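
To make the comparison easier to read, here is a small Python sketch that converts the elapsed times from the table above into speedups relative to c1_1core_ori. The helper names (to_seconds, speedups) are just for illustration, and c2_1core_ori is left out because it was still running when I copied the table.

```python
# Elapsed wall-clock times copied from the job table above (HH:MM:SS).
# c2_1core_ori is omitted because its state was still RUNNING.
elapsed = {
    "c1_1core_multi20": "01:16:34",
    "c1_1core_ori": "02:55:08",
    "c1_10cores_ori": "00:47:39",
    "c2_10cores_ori": "01:10:38",
    "c2_10cores_multi20": "01:01:26",
    "c2_20cores_multi20": "00:33:33",
    "c2_20cores_multi50": "00:41:44",
    "c2_20cores_multi100": "00:53:48",
}


def to_seconds(hms):
    """Convert an 'HH:MM:SS' string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return 3600 * hours + 60 * minutes + seconds


def speedups(times, baseline):
    """Return baseline_time / job_time for every job in `times`."""
    base = to_seconds(times[baseline])
    return {job: base / to_seconds(t) for job, t in times.items()}


if __name__ == "__main__":
    # Values above 1 mean the job finished faster than the baseline.
    for job, ratio in speedups(elapsed, "c1_1core_ori").items():
        print(f"{job:<22} {ratio:5.2f}x")
```

These are single runs, so the ratios are only a rough guide; run-to-run variation and adaptation differences between chains could shift them.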