If one has access to a supercomputer, would it be possible to use 24 chains in Stan (24 cores/node) and then just combine those chains in the summary? Would this affect the quality of sampling at all? Would there be a functional argument against doing this?
When you have lots of CPUs (call that number N), you have several options:
1. Run J warmup iterations on each chain followed by K sampling iterations on each, yielding N*K post-warmup iterations. Now, if your model samples efficiently, you generally only need a thousand or so post-warmup iterations in total, even for talking about relatively remote tail probabilities (e.g. 95% credible intervals), so if N is big, K doesn't need to be very big: with N = 24 chains, K = 50 already gives 1200 post-warmup draws. However, J should still be reasonably large so that the geometry of the parameter space is well characterized by each chain by the end of its independent warmup period.
2. Use the map_rect feature in existing versions of Stan to run fewer chains but multiple cores per chain. This is probably better than option 1, but it takes more work, since the model has to be rewritten to use map_rect (see the sketch after this list).
3. Use the reduce_sum feature in the release candidate of Stan 2.23 to do the same as option 2 with an easier, though arguably more limited, interface: it only applies when the model contains a large sum (typically the likelihood) whose terms can be computed in chunks in parallel (see the sketch after this list).
4. Use the experimental "campfire" Stan branch to do option 1, but with chains that communicate information about the geometry to one another and automatically terminate warmup once they've collected enough information, which usually happens much earlier than with the standard independent warmup approach.
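
For option 2, here's a minimal map_rect sketch, assuming a simple normal likelihood and data that have already been split into equal-sized shards; the function name shard_ll and the data layout are just illustrative, not a fixed recipe:

```stan
functions {
  // log-likelihood contribution of one shard of data
  vector shard_ll(vector phi, vector theta, real[] x_r, int[] x_i) {
    real mu = phi[1];
    real sigma = phi[2];
    return [normal_lpdf(to_vector(x_r) | mu, sigma)]';
  }
}
data {
  int<lower=1> n_shards;
  int<lower=1> shard_size;
  real y[n_shards, shard_size];  // observations, pre-split into equal shards
}
transformed data {
  // map_rect requires per-shard parameter and integer-data arguments,
  // even when (as here) they are empty
  vector[0] theta[n_shards];
  int x_i[n_shards, 0];
}
parameters {
  real mu;
  real<lower=0> sigma;
}
model {
  mu ~ normal(0, 5);
  sigma ~ normal(0, 5);
  // each shard's log-likelihood can be evaluated on a different core
  target += sum(map_rect(shard_ll, [mu, sigma]', theta, y, x_i));
}
```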
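
And for option 3, the same model as a reduce_sum sketch (again, partial_ll and the variable names are just placeholders); the data are sliced automatically and the shared parameters are passed after the grainsize:

```stan
functions {
  // partial log-likelihood over one slice of the observations
  real partial_ll(real[] y_slice, int start, int end, real mu, real sigma) {
    return normal_lpdf(to_vector(y_slice) | mu, sigma);
  }
}
data {
  int<lower=1> N;
  real y[N];
}
parameters {
  real mu;
  real<lower=0> sigma;
}
model {
  int grainsize = 1;  // 1 lets the scheduler pick the slice sizes
  mu ~ normal(0, 5);
  sigma ~ normal(0, 5);
  // the sum over observations is split into chunks computed in parallel
  target += reduce_sum(partial_ll, y, grainsize, mu, sigma);
}
```

In either case, within-chain parallelism only kicks in if the model is compiled with threading support (STAN_THREADS=true in CmdStan) and the number of threads per chain is set at run time via STAN_NUM_THREADS.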