We were just discussing this in another thread starting with this comment from @Red-Portal.
Memory pressure becomes a much bigger deal when parallelizing, as typically only the relatively tiny L1 cache (and on many chips the L2) is private to each core, while the L3 cache and RAM are shared. There’s also the problem of memory locality with a bunch of ad-hoc allocations. Allocating everything together in contiguous memory could be a big saving. This is something I really messed up in coding Stan arrays as C++ `std::vector`, as we have this problem all over the place.
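
To make the locality point concrete, here is a minimal sketch contrasting a nested `std::vector` (one heap allocation per row, plus one for the outer vector) with a single contiguous buffer. The names `FlatMatrix`, `sum_nested`, and `sum_flat` are made up for illustration and are not Stan types or functions; this only shows the layout difference, not how Stan would actually fix it.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Nested layout: every inner vector owns a separate heap allocation,
// so consecutive rows can land far apart in memory.
using NestedMatrix = std::vector<std::vector<double>>;

// Contiguous layout (hypothetical helper, not part of Stan):
// a single allocation with row-major indexing.
class FlatMatrix {
 public:
  FlatMatrix(std::size_t rows, std::size_t cols)
      : rows_(rows), cols_(cols), data_(rows * cols, 0.0) {}

  double& operator()(std::size_t i, std::size_t j) {
    assert(i < rows_ && j < cols_);
    return data_[i * cols_ + j];
  }

  double operator()(std::size_t i, std::size_t j) const {
    assert(i < rows_ && j < cols_);
    return data_[i * cols_ + j];
  }

  std::size_t size() const { return data_.size(); }
  const double* data() const { return data_.data(); }

 private:
  std::size_t rows_;
  std::size_t cols_;
  std::vector<double> data_;  // all elements live in one block
};

// Walks rows that may be scattered across the heap.
double sum_nested(const NestedMatrix& m) {
  double total = 0.0;
  for (const auto& row : m) {
    for (double x : row) total += x;
  }
  return total;
}

// Walks one contiguous block from start to end.
double sum_flat(const FlatMatrix& m) {
  double total = 0.0;
  const double* p = m.data();
  for (std::size_t k = 0; k < m.size(); ++k) total += p[k];
  return total;
}
```

With the flat layout, an `R x C` matrix costs one allocation instead of `R + 1`, and element `(i, j)` sits exactly `i * C + j` doubles from the start, so a linear scan touches memory in order. Whether that translates into a measurable speedup depends on sizes and access patterns, so it would need benchmarking.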