Similar thing hit me in a different context: if we can control the behavior of some math function according to where the sampler is, there are some performance gain can be made. Like in ODE models sometimes even if I know the problem is non-stiff I have to run it in stiff solver because the sampler stepped into regions the solution becomes stiff.
Within-chain parallelization can be done in various flavors, and currently we are not designing it in a organized way(which is the point of the thread, I guess). There are things easier/better to be done in distributed level, and things better in thread level. Even in a same level we can have multiple tiers. Unless we can reach a design decision there are going to be many iterations in future designs.