Variation in elapsed time of parallel chains and best practices for computing expectations from multiple chains

Thanks for the feedback, @betanalpha. I looked into your case studies on Gaussian processes; my implementation is more or less similar. Some small differences are that it uses a negative binomial likelihood, the process means are not zero, and I have one latent function for each of the two replicates.
The main difference is that the covariance matrix describes multiple channels: instead of the single signal variance parameter \sigma^2 in the kernel k(x,x') = \sigma^2\ \exp\left(-\frac{|x-x'|^2}{2\ell^2}\right), each channel has its own signal variance plus one for each other channel, such that k_{ij}(x,x') = \sigma^2_{ij}\ \exp\left(-\frac{|x-x'|^2}{\ell_i^2 + \ell_j^2}\right).
The main parameters of interest are these off-diagonal signal variances, as a measure of the interaction between channels.
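
As a Stan function, one entry of this kernel is essentially the sketch below (the name k_ij and its signature are just for illustration, not from my actual code):

functions {
    // Cross-channel squared exponential kernel entry; sigma2_ij may be
    // negative for i != j, so it is left unconstrained here.
    real k_ij(real x1, real x2, real sigma2_ij, real ell_i, real ell_j) {
        return sigma2_ij * exp(-square(x1 - x2) / (square(ell_i) + square(ell_j)));
    }
}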

Here’s the Stan code:

data {
    int<lower=1> N;
    int<lower=1> M;
    int trilM;  // number of cross-channel (lower-triangle) entries
    vector[N] x;

    int y1a[M*N];  // observed counts, M channels of N points stacked
    int y2a[M*N];
    int y1b[M*N];
    int y2b[M*N];

    vector[M*N] normFact1a;  // per-observation normalization factors
    vector[M*N] normFact2a;
    vector[M*N] normFact1b;
    vector[M*N] normFact2b;

    real mu_est[M];     // prior locations for the channel means
    real sigma_est[M];  // prior scales
}
transformed data {
    real delta = 1e-8;  // jitter added to the covariance diagonal

    real sigma[M] = sigma_est;     // prior scales, reused below
    real sigma2 = max(sigma_est);  // common scale for the cross-channel terms
}
parameters {
    real mu[M];

    real<lower=0> kfDiag[M];  // per-channel signal variance parameters
    real kfTril[trilM];       // cross-channel signal variance parameters (lower triangle)

    real<lower=0> ell[M];

    vector[M*N] f1_til;  // standardized latents (non-centered parameterization)
    vector[M*N] f2_til;

    real<lower=0> alpha[M];
}
model {
    vector[M*N] muVector = mean_vector(x, mu, M, N);
    vector[M*N] alphaVector = dispersion_vector(alpha, M, N);

    matrix[M*N, M*N] K = multi_k_matrix(x, ell, kfDiag, kfTril, M, N, delta);

    matrix[M*N, M*N] L = cholesky_decompose(K);

    // latent functions recovered from the standardized draws
    vector[M*N] f1 = exp(muVector + L * f1_til);
    vector[M*N] f2 = exp(muVector + L * f2_til);

    mu ~ normal(mu_est, sigma);
    ell ~ gamma(2, 2);
    kfDiag ~ normal(0, sigma);
    kfTril ~ normal(0, sigma2);

    f1_til ~ normal(0, 1);
    f2_til ~ normal(0, 1);

    alpha ~ uniform(0, 1e9);

    y1a ~ neg_binomial_2((f1 .* normFact1a), alphaVector);
    y1b ~ neg_binomial_2((f1 .* normFact1b), alphaVector);
    y2a ~ neg_binomial_2((f2 .* normFact2a), alphaVector);
    y2b ~ neg_binomial_2((f2 .* normFact2b), alphaVector);
}

mean_vector and dispersion_vector just concatenate the appropriate means and negative binomial dispersion parameters to match the data, and multi_k_matrix sets up the covariance matrix as described above and in my previous reply (I can't use the cov_exp_quad function because of the additional structure).
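
For reference, multi_k_matrix is essentially the sketch below (in the functions block, reusing the k_ij function from above); the exact ordering of kfTril in my actual implementation may differ slightly:

matrix multi_k_matrix(vector x, real[] ell, real[] kfDiag,
                      real[] kfTril, int M, int N, real delta) {
    matrix[M*N, M*N] K;
    int t = 1;  // position in the lower-triangle entries of kfTril
    for (i in 1:M) {
        for (j in 1:i) {
            // diagonal blocks use kfDiag, off-diagonal blocks kfTril
            real s2 = (i == j) ? kfDiag[i] : kfTril[t];
            for (n in 1:N) {
                for (m in 1:N) {
                    real kv = k_ij(x[n], x[m], s2, ell[i], ell[j]);
                    K[(i - 1) * N + n, (j - 1) * N + m] = kv;
                    K[(j - 1) * N + m, (i - 1) * N + n] = kv;  // symmetry
                }
            }
            if (i != j) t += 1;
        }
    }
    for (d in 1:(M * N))
        K[d, d] += delta;  // jitter so cholesky_decompose succeeds
    return K;
}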

As I said, I didn't see any warnings about divergences, but from the divergences case study I see that you sample a standard normal and then translate/scale it. This is not a hierarchical model, so here it would just mean shifting and scaling by constant parameters (mu_est, sigma_est, max(sigma_est)).
Is this supposed to always improve inference in the hierarchical case, and/or would it help in this case?
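
To make sure I understand, for mu that reparameterization would look something like this excerpt (since mu_est and sigma_est are data, it amounts to a constant shift and scale):

parameters {
    real mu_til[M];
    // (other parameters as before)
}
transformed parameters {
    real mu[M];
    for (m in 1:M)
        mu[m] = mu_est[m] + sigma_est[m] * mu_til[m];
}
model {
    mu_til ~ normal(0, 1);  // implies mu ~ normal(mu_est, sigma_est)
    // (rest of the model as before)
}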

Although more data is added with an increasing number of channels, I'd expect that the number of parameters may become too large to get good estimates; but that's part of the objective here: seeing how much this model can handle.
Thanks for the attention; if you have specific suggestions I'd really appreciate them.