Variation in elapsed time of parallel chains and best practices for computing expectations from multiple chains

Yes, I realize, but if you have within-chain but not between-chain convergence, that’s exactly the figure you’d expect from the permuted samples, right? (Assuming this “permutation” just concatenates the thinned chains, and that’s why there are 8 “steps” here.)
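Just to make that point concrete, a throwaway numpy illustration with made-up numbers: chains that each mix well around different values produce exactly such a stepped trace once concatenated.

```python
import numpy as np

rng = np.random.default_rng(0)
# 8 chains, each well mixed around its own level: within-chain but not
# between-chain convergence.
chains = rng.normal(loc=np.arange(8)[:, None], scale=0.1, size=(8, 1000))
# Concatenating ("permuting") them gives a trace with 8 visible steps.
pooled = chains.reshape(-1)
```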
Shouldn’t the algorithmic parameters then be shared between chains, with convergence left to the likelihood and the stochastically drawn momenta? I didn’t see any tree depth or divergence warnings, only positive semi-definite rejection warnings, which are kind of expected.

The model is a multi-channel Gaussian process of the form (for M = 2 channels):

y \sim \mathcal{N}(\boldsymbol{\mu}, K), \qquad K = \begin{bmatrix} \sigma_{11} C_{11} & \sigma_{12} C_{12} \\ \sigma_{21} C_{21} & \sigma_{22} C_{22} \end{bmatrix}

where \sigma_{ij} are the signal variance parameters for the different channels, C_{ij} are the correlations between the data points under whatever kernel is chosen, and \mu is just a vector of M concatenated means, one per channel. The likelihood is negative binomial, so I’m sampling two latent functions (two experimental replicates) that share all other parameters.
Each channel has ~50 data points, so the total number of data points is \sim 50M, and I get to choose M, with 2 \leq M \lesssim 10^4 (but I won’t go that high, because there are \mathcal{O}(M^2) parameters).
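For concreteness, here is a toy numpy sketch of how K is assembled from the \sigma_{ij} and C_{ij} blocks (placeholder names and kernel choice, not the actual Stan code):

```python
import numpy as np

def block_cov(sigma, C):
    """Assemble the (M*N) x (M*N) covariance from the M x M signal
    (co)variances sigma[i, j] and the N x N kernel correlation blocks
    C[i][j]. Illustrative names only."""
    M = sigma.shape[0]
    return np.block([[sigma[i, j] * C[i][j] for j in range(M)]
                     for i in range(M)])

# Toy numbers: M = 2 channels, N = 50 points each, one shared
# squared-exponential kernel for every correlation block C_ij.
M, N, ell = 2, 50, 0.3
x = np.linspace(0, 1, N)
C_se = np.exp(-0.5 * ((x[:, None] - x[None, :]) / ell) ** 2)
C = [[C_se for _ in range(M)] for _ in range(M)]
sigma = np.array([[1.0, 0.4],
                  [0.4, 1.5]])   # must keep K positive (semi-)definite
K = block_cov(sigma, C)
print(K.shape)   # (100, 100): 50*M data points, O(M^2) sigma parameters
```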

There’s not much more in the Stan code, I think, but if you think it will really be helpful I can post it. The model is large but simple, and I don’t think there’s much to be done with the priors: I started with uniform priors on the off-diagonal block parameters (we have no idea what these could be) but then switched to zero-mean normal priors with a variance vaguely related to the data (which can’t really be known either).
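By “variance vaguely related to the data” I mean something along these lines (a purely hypothetical numpy sketch, not what’s actually in the model):

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.negative_binomial(5, 0.5, size=(2, 50))   # stand-in counts, M = 2

# e.g. give sigma_ij a zero-mean normal prior whose scale is the geometric
# mean of the two channels' log-scale sample variances: still vague, but
# on a scale the data at least hints at.
v = np.log1p(y).var(axis=1)
prior_sd = np.sqrt(np.outer(v, v))   # sigma_ij ~ Normal(0, prior_sd[i, j])
```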

As far as I understand, this could be fixed with a longer warmup (burn-in) and raising adapt_delta above its default. That’s my main question here.
If that is possible, everything should converge and all the ad hoc stuff goes away. Otherwise there is a finite probability that the sampler’s initial adaptation ends up with different values in some chains (the more chains, the worse), runs are bound to have “outlier” chains that screw up the whole analysis, and I’d be stuck in the ridiculous situation of rerunning the same analysis over and over until all chains converge.
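Concretely, something like the following is what I have in mind (a cmdstanpy sketch with placeholder file and data names, not my actual setup):

```python
from cmdstanpy import CmdStanModel

# Placeholder names; only the sampler settings matter here.
model = CmdStanModel(stan_file="multichannel_gp.stan")
fit = model.sample(
    data="gp_data.json",
    chains=8,
    iter_warmup=2000,    # longer adaptation than the default 1000
    iter_sampling=1000,
    adapt_delta=0.95,    # raised from the default 0.8
    seed=1234,
)
print(fit.diagnose())    # per-run check: divergences, treedepth, E-BFMI
```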

Thanks again.