I had something similar happen when fitting a Gaussian Process model with a few thousand parameters, but it did have an impact on convergence of the different chains (I believe it was the main factor, but there could have been others).

To that my answer would be “possibly”. I am not familiar with the outputs of rstan, but I’m assuming that is not a trace of metrics, only a single matrix (the one fixed for the sampling period), and the correlation is among the matrix entries, not the same matrix entry across several warmup iterations, which may give a better results due to a larger sample size. However, it is not obvious to me that there should be some expected and high correlation between chain metrics, and that it will hold when scaling up to large models. I would have to think about what metric correlation between chains means, how to best calculate it, and how it’s expected to scale.

As far as I understand the process of computing the space metric, or mass matrix, is a very empirical one: just take the expectation of the Kronecker product of the parameter vector for (some part of) the warmup iterations, so it is sensitive to what’s happening there and it is possible that it “doesn’t converge” in the sense that difference chains may be at different regions at that point and end up with different mass matrices. However, it is also possible that, though different, they are still “good enough” to explore the space: they would just explore the parameter space more or less efficiently (maybe your are seeing the effect on run times, but it’s also hard to say what’s a good or bad metric to say how they should impact performance). Step sizes also differ between chain, but may still work, you could ask the same question about how similar they should be across chains.

For my problem, I considered using other criteria (for instance, average posterior probability of the chain, run time, etc) to decide what the good mass matrix was and use that as an argument for a fixed metric throughout all chains. I guess that could have worked, but in practice I was able to break down the problem into several small problems and run the separately. My guess is in your case you are trying to fix what’s not broken, but maybe others will have other ideas or opinions.