In some cases one or more chains are significantly slower than the others: they get stuck during warmup but eventually adapt appropriately. As a result, the total wait time is dominated by those chains. With within-chain parallelization, it would be nice if threads_per_chain could adjust dynamically as the faster chains finish, so that the idle CPUs can speed up the slower chains later in their life cycle.
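To put rough numbers on this, here is a toy calculation comparing a fixed threads_per_chain with an idealized scheduler that hands freed cores to the chains still running. It assumes perfect linear speedup from extra threads, which real within-chain parallelization will not reach, and the function names and work figures are made up for illustration.

```python
def static_wall_time(work, threads_per_chain):
    """Wall time when every chain keeps a fixed number of threads.

    `work` lists the single-thread runtime of each chain (any time unit).
    Assumes perfect linear scaling, so this is an optimistic estimate.
    """
    return max(w / threads_per_chain for w in work)


def ideal_dynamic_wall_time(work, n_cores):
    """Wall time if freed cores were always handed to the remaining chains.

    Under the same perfect-scaling assumption, all cores stay busy until
    the last chain finishes, so the total work is simply spread over them.
    """
    return sum(work) / n_cores


# Hypothetical example: three quick chains and one slow chain, 8 cores.
work = [10, 10, 10, 40]
fixed = static_wall_time(work, threads_per_chain=2)
dynamic = ideal_dynamic_wall_time(work, n_cores=8)
print(f"fixed allocation: {fixed}, ideal dynamic allocation: {dynamic}")
```

With these made-up numbers the slow chain alone determines the fixed-allocation wait time, while six of the eight cores sit idle for most of the run.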
You could try requesting more threads in total than you have cores, so that when some chains finish, the threads of the remaining chains take over the freed cores. The threading library is supposed to keep this scheduling sane, though I don’t know whether oversubscribing like this adds significant overhead.
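A minimal sketch of how one might pick such an oversubscribed setting. The helper `thread_plan` and its `oversubscribe` factor are hypothetical, not part of any Stan interface; the resulting value would be passed as threads_per_chain to whichever interface you use, and whether the extra threads help or hurt is something you would have to measure.

```python
import math
import os


def thread_plan(n_chains, n_cores=None, oversubscribe=1.0):
    """Suggest a threads_per_chain value (hypothetical helper).

    oversubscribe=1.0 splits the cores evenly across chains; larger values
    request more threads in total than there are cores, so that chains
    still running can occupy cores freed by chains that finish early.
    """
    if n_cores is None:
        n_cores = os.cpu_count() or 1  # cpu_count() may return None
    threads_per_chain = max(1, math.ceil(oversubscribe * n_cores / n_chains))
    return {
        "threads_per_chain": threads_per_chain,
        "total_threads": threads_per_chain * n_chains,
        "n_cores": n_cores,
    }


# Even split versus 2x oversubscription for 4 chains on 8 cores.
print(thread_plan(n_chains=4, n_cores=8, oversubscribe=1.0))
print(thread_plan(n_chains=4, n_cores=8, oversubscribe=2.0))
```

Setting `oversubscribe` equal to the number of chains gives each chain as many threads as there are cores, which is the extreme case where the last surviving chain can use the whole machine.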