In case it helps: we also had a complex hierarchical model that started out in a regime where it would have taken months to finish. We got it down to hours or days by replacing for loops with matrix operations wherever possible and by using within-chain parallelization with reduce_sum(). Our mega-thread cataloguing that entire process is here.
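
To give a rough idea of what the two changes look like together, here is a minimal sketch. It assumes a plain logistic regression, not our actual hierarchical model, and the names `partial_sum`, `X`, `beta`, and `grainsize` are placeholders chosen for illustration. It uses the `array[] int` syntax, so it needs a reasonably recent Stan version.

```stan
functions {
  // Log-likelihood contribution of one slice of the outcomes.
  // reduce_sum() supplies y_slice, start, and end; X and beta are
  // the shared arguments passed through unchanged.
  real partial_sum(array[] int y_slice, int start, int end,
                   matrix X, vector beta) {
    // One matrix-vector product replaces a loop over observations.
    return bernoulli_logit_lpmf(y_slice | X[start:end, :] * beta);
  }
}
data {
  int<lower=1> N;                      // number of observations
  int<lower=1> K;                      // number of predictors
  array[N] int<lower=0, upper=1> y;    // binary outcomes
  matrix[N, K] X;                      // design matrix
  int<lower=1> grainsize;              // slice size hint for reduce_sum
}
parameters {
  vector[K] beta;
}
model {
  beta ~ normal(0, 2);                 // illustrative prior only
  // Slices y across threads and sums the partial log-likelihoods.
  target += reduce_sum(partial_sum, y, grainsize, X, beta);
}
```

The model has to be compiled with threading enabled (for example `STAN_THREADS=true` in CmdStan) and run with more than one thread per chain to see any speedup. A `grainsize` of 1 lets the scheduler pick the slice sizes, which is a reasonable starting point before tuning.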