What happens if the number of shards or partial sums is greater than the number of threads per chain?

I am trying to use “map_rect()” and “reduce_sum()” in my Stan code.
If the number of shards in “map_rect()” is very high, how are the operations parallelized when the number of threads is fixed in the $sample call?

Why is that important?

I am just trying to understand how the choice of the number of shards and the number of threads per chain affects runtime.

Just write your program down, preferably with reduce_sum, and then see where you land. Focus on getting it coded correctly before thinking about the last bits of performance.

Other than that, you can look up the Intel TBB documentation on chunking. The TBB drives all of this.
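
To make the chunking idea concrete, here is a rough Python analogy, not Stan or TBB code. It assumes a plain thread pool stands in for the scheduler, and the names `partial_sum`, `chunked_sum`, `grainsize` and `n_threads` are made up for this illustration. The point it sketches: when there are more chunks than threads, the extra chunks simply wait in a queue and get picked up as threads become free, and the total is the same either way. Fixed-size chunks are a simplification here; as far as I understand, reduce_sum treats its grainsize as a hint and lets the scheduler decide the actual partitioning.

```python
from concurrent.futures import ThreadPoolExecutor
import threading


def partial_sum(chunk):
    # Return the chunk's sum and the name of the worker thread that computed it.
    return sum(chunk), threading.current_thread().name


def chunked_sum(values, grainsize, n_threads):
    # Split the data into chunks of at most `grainsize` elements (hypothetical,
    # fixed-size chunking; a real scheduler may partition differently).
    chunks = [values[i:i + grainsize] for i in range(0, len(values), grainsize)]

    # The pool never runs more than `n_threads` chunks at once; the remaining
    # chunks are queued and processed as workers become available.
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        results = list(pool.map(partial_sum, chunks))

    total = sum(s for s, _ in results)
    workers_used = {name for _, name in results}
    return total, len(chunks), len(workers_used)


# 1000 terms in chunks of 50 gives 20 chunks, served by at most 4 threads.
total, n_chunks, n_workers = chunked_sum(list(range(1000)), grainsize=50, n_threads=4)
print(total, n_chunks, n_workers)
```

Changing `grainsize` in this sketch changes how many chunks there are, not the result. What it affects is the balance between scheduling overhead (many tiny chunks) and idle threads (too few large chunks), which is why it is usually tuned by timing the real model.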

@msk98, just to make sure: is your Stan model currently running, and are the chains converging properly, irrespective of the shard/thread question?