Performance of Parallel Computation with map_rect()

Hey all,
I am very confused by the performance of parallelism in Stan. Here I tried different makefiles with different models and different numbers of cores assigned to the job.
c1: model built with the original makefile
c2: model built after adding -pthread and -DSTAN_THREADS to the makefile flags (see the snippet after this list)
*core(s): the number of cores I assigned to the job
ori: the original model, without map_rect()
multi: the modified model using map_rect(); the trailing number is the number of shards
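
For reference, the c2 setup looks roughly like this (a sketch assuming CmdStan; the model path is a placeholder):

```bash
# Enable threading in CmdStan by adding the flags to make/local,
# then rebuild the model from scratch so the flags take effect.
echo "CXXFLAGS += -DSTAN_THREADS -pthread" >> make/local
make clean-all
make path/to/my_model   # placeholder for the actual model path
```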

It looks to me that, even without multithreading, simply assigning more cores to the job improves the speed a lot?
Does the number of shards have to be the same as the number of cores?
It looks like, for the same number of cores, more shards slow the model down. Does the amount of data in each shard matter? Is there a lower threshold on shard size below which it slows down?
And does it have anything to do with STAN_NUM_THREADS?
Should STAN_NUM_THREADS match the number of cores assigned?

(I set STAN_NUM_THREADS=-1 for all of them)

Any ideas or advice would be helpful!! Thank you all!

                       JobName    Elapsed      State      NCPUS 
------------------------------ ---------- ---------- ---------- 
              c1_1core_multi20   01:16:34  COMPLETED          1 
                         batch   01:16:34  COMPLETED          1 
                  c1_1core_ori   02:55:08  COMPLETED          1 
                         batch   02:55:08  COMPLETED          1 
                c1_10cores_ori   00:47:39  COMPLETED         11 
                         batch   00:47:39  COMPLETED         11 
                  c2_1core_ori   04:31:20    RUNNING          1 
                c2_10cores_ori   01:10:38  COMPLETED         11 
                         batch   01:10:38  COMPLETED         11 
            c2_10cores_multi20   01:01:26  COMPLETED         11 
                         batch   01:01:26  COMPLETED         11 
            c2_20cores_multi20   00:33:33  COMPLETED         21 
                         batch   00:33:33  COMPLETED          9 
            c2_20cores_multi50   00:41:44  COMPLETED         21 
                         batch   00:41:44  COMPLETED         21 
           c2_20cores_multi100   00:53:48  COMPLETED         21 
                         batch   00:53:48  COMPLETED         21 

Looking at your figures, you seem to benefit a lot from map_rect - that's great.

If you set STAN_NUM_THREADS=-1, the program detects how many CPUs are available on that machine and uses that many threads. On a cluster node this is typically all the physical cores of the node, not just the ones allocated to your job. I am not sure what you mean by assigning CPUs. Do you mean the number of CPUs your cluster scheduler allocates to the job? If so, then you should definitely set STAN_NUM_THREADS to the number of cores you actually requested.
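
For example, in a Slurm batch script (your table looks like sacct output) you can tie the thread count directly to the allocation; the model and data file names below are placeholders:

```bash
#!/bin/bash
#SBATCH --cpus-per-task=10
# Slurm sets SLURM_CPUS_PER_TASK when --cpus-per-task is given;
# matching STAN_NUM_THREADS to it makes Stan use exactly the granted CPUs.
export STAN_NUM_THREADS=$SLURM_CPUS_PER_TASK
./my_model sample data file=my_data.R
```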

Regarding the shard question: internally, Stan does very simple scheduling. It splits the shards into equal blocks of work, choosing the block size so that you end up with exactly as many blocks as assigned CPUs. As long as each shard represents a sufficient amount of work, you should not need to worry. Many people get hung up on this, but it is not super important. If the sharding question really mattered, your problem would likely not benefit from map_rect anyway, I think.
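
To make that concrete, here is a toy illustration of the static split in plain shell arithmetic (not Stan's actual C++ implementation): 20 shards on 10 CPUs gives blocks of 2 shards each.

```bash
# Toy illustration of static scheduling: cut the shards into
# equally sized blocks, one block per CPU.
shards=20
cpus=10
block=$(( (shards + cpus - 1) / cpus ))   # ceiling division, here 2
for (( first = 1; first <= shards; first += block )); do
  last=$(( first + block - 1 < shards ? first + block - 1 : shards ))
  echo "one CPU evaluates shards ${first}-${last}"
done
```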

We will use the Intel TBB for scheduling in the near future. At that point we will have dynamic work scheduling, which should eliminate the need to put any thought into shard sizes.

Thank you for the reply! Yes, by assigning CPUs I mean the CPUs allocated in the cluster.

I think the data in each shard is too small (only 10 data points per shard when there are 100 shards), which causes the slowdown, so I will try different shard sizes. Also, do you have any idea why assigning more CPUs to the model without map_rect(), and even without building with -DSTAN_THREADS, would speed the model up? Any guess?
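
For reference, the shard-size sweep I have in mind looks roughly like this (run_model.sh and its shard-count argument are placeholders for my actual job script):

```bash
# Resubmit the same model with different shard counts and compare
# elapsed times afterwards with sacct.
for shards in 10 20 50 100; do
  sbatch --cpus-per-task=10 --job-name="c2_10cores_multi${shards}" run_model.sh "${shards}"
done
```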

Not really, no idea about the speed-up without threading. Maybe reserving more cores on the target machine means you benefit more from things like Intel's Turbo Boost? Just a wild guess.