Variation in elapsed time of parallel chains and best practices for computing expectations from multiple chains

I am running 8 parallel chains for a large Gaussian process model, which will probably be scaled up further, and from some preliminary results I anticipate issues with the mixing of the chains. I have a few main questions.

  • The first issue I see is that different parallel chains may take as little as 1 hour or more than 10; I know some variation is expected, but I didn’t expect differences of that order, so I’m assuming there could be a computational explanation. The machines running this have plenty of CPUs and more than enough memory for the job. Is this expected in general, or is something else going on here?

  • The second issue is with convergence and mixing itself. The likelihood of each chain seems to be stationary, but not all chains are mixing properly. I am already juggling the amount of data/model size (ideally I want to include as much as possible), the length of the chains, and the total runtime (it could easily run for weeks). I am using PyStan, and one of the first suggestions seems to be to add control={'adapt_delta': 0.9} (or 0.99) and to increase the warmup (burn-in) length in the sampling call (I sketch the call I have in mind below, after these questions), but I’m guessing this may affect performance. I read the Stan and PyStan references and it was not clear what the trade-offs are here. Is there a limit to how much this can help? (Like I said, I’ll probably be operating at the limit of time and data set size.)

  • Finally, suppose the likelihood of half of the chains converges and mixes, but the rest seem stationary around lower values. Of course, if they all mix, the expectations can be computed from any of the samples, but what would be the recommendation in this other case?

I could think of using only the chains converging around the higher likelihood (assuming there are enough of them to assess between- and within-chain convergence), but I’m not sure that’s correct.
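For reference, the sampling call I have in mind looks roughly like this (a sketch with placeholder names for my compiled model and data dict):

fit = model.sampling(
    data=stan_data,                  # placeholder for my data dict
    chains=8,
    iter=4000,                       # total iterations per chain, including warmup
    warmup=2000,                     # longer warmup ("burn-in")
    control={'adapt_delta': 0.99},
    seed=1234,
)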

Bonus question: how does the extract function’s permuted option work exactly? It seemed like it randomly permutes the samples, but from a shorter run I got this figure that makes it look like it’s just stacking the traces.

[figure: trace of the extracted (permuted) samples, showing 8 step-like segments]
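For reference, this is roughly how I’m extracting the samples (a sketch assuming the PyStan 2.x interface; comparing the permuted output with the permuted=False array should show whether those steps line up with chain boundaries):

pooled = fit.extract(permuted=True)      # dict of arrays; warmup dropped, chains merged
print(pooled['lp__'].shape)              # (iterations * chains,)

by_chain = fit.extract(permuted=False)   # array of shape (iterations, chains, flattened parameters + lp__)
print(by_chain.shape)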

thanks.

Can you show code? 1D, 2D, or more dimensions? Amount of data: 1e3, 1e4, 1e5, 1e6 or more?

My guess is that adaptation ends up with different step sizes, which leads to different treedepths. I don’t know how to check that in Python, but I know some PyStan users also use ShinyStan for diagnostics. You might check @betanalpha’s stan_utility functions, they have something for treedepth at least. Do you get warnings about divergences or max treedepth exceedances?

A GP is likely to have a funnel-shaped posterior, which makes the sampling time sensitive to the step size. You might improve performance with better priors, but I can’t say much without the code. Check @betanalpha’s GP case studies 1, 2, and 3 at Writing - betanalpha.github.io

Try to fix the model and priors.

No, it’s not correct.

Figure looks really bad.

Yes, I realize, but if you have within-chain but not between-chain convergence, that’s exactly the figure you’d expect from the permuted samples, right? (Assuming this “permutation” just concatenates the thinned chains, and that’s why there are 8 "steps" here.)
Shouldn’t the algorithmic parameters then be shared between chains, and convergence be left to the likelihood and the stochastically drawn momenta? I didn’t see any treedepth or divergence warnings, only positive semi-definite rejection warnings, which are kind of expected.
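I’ll double-check the sampler diagnostics explicitly with something like this (a sketch, assuming the PyStan 2.x interface and a fit object called fit):

sampler_params = fit.get_sampler_params(inc_warmup=False)  # one dict of diagnostics per chain

for i, chain in enumerate(sampler_params):
    n_div = int(chain['divergent__'].sum())    # number of divergent transitions
    max_td = int(chain['treedepth__'].max())   # deepest tree reached
    step = chain['stepsize__'][-1]             # step size after adaptation
    print('chain {}: divergences={}, max treedepth={}, stepsize={:.3g}'.format(i, n_div, max_td, step))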

The model is a multi-channel Gaussian process of the form (for M=2 channels):

y \sim \mathcal{N}(\boldsymbol{\mu}, K), \quad K = \begin{bmatrix} \sigma_{11} C_{11} & \sigma_{12} C_{12} \\ \sigma_{21} C_{21} & \sigma_{22} C_{22} \end{bmatrix}

where \sigma_{ij} are the signal variance parameters for the different channels, C_{ij} are the correlations between the data points given whatever kernel, and \mu is just a vector of M concatenated means, one for each channel. The likelihood is negative binomial, so I’m sampling two latent functions (two experimental replicates) that otherwise share all other parameters.
Each channel has ~50 data points, so the total number of data points is \sim 50 M, and I get to choose 2 \leq M \lesssim 10000 (but I won’t go there because there are \mathcal{O}(M^2) parameters).

There’s not much more in the Stan code, I think, but if you think it will really be helpful I can post it. The model is large but simple, and I don’t think there’s much to be done with the priors. I started with uniform priors for the off-block-diagonal parameters (we have no idea what these could be) but then switched to zero-mean normal priors with a variance vaguely related to the data (which can’t really be known either).

As far as I understand this could be fixed by having a longer burn-in and changing the adapt_delta default. That’s my main question here.
If that is possible, everything should converge and all the ad hoc stuff goes away. Otherwise there is a finite probability that the initial adaptation of the sampler ends up with different values for some chains (the more chains, the worse), runs are bound to have “outlier” chains that ruin the whole analysis, and I end up in the ridiculous situation of having to run the same analysis over and over until all chains converge.

Thanks again.

Longer warmup rarely solves fitting problems, and increasing adapt_delta works only for very mild pathologies. More serious problems like this indicate substantial problems with your model – the exact nature of those problems depends on the exact model and the structure of the observed data.

For more see the first two sections of [1701.02434] A Conceptual Introduction to Hamiltonian Monte Carlo, pystan_workflow, and Diagnosing Biased Inference with Divergences. And please do look through the Gaussian process case studies to see that robust prior specification for even seemingly simple Gaussian process models is subtle.


Thanks for the feedback, @betanalpha. I looked into your case studies for Gaussian processes; my implementation is more or less similar. Some small differences are that it uses a negative binomial likelihood, the process means are not zero, and I have one latent function for each of the two replicates.
The main difference is that the covariance matrix describes multiple channels, so instead of a single \sigma^2 signal variance parameter in the kernel k(x,x') = \sigma^2 \exp\left(-\frac{|x-x'|^2}{2\ell^2}\right), each channel has that plus a signal variance parameter for each other channel, such that k_{ij}(x,x') = \sigma^2_{ij} \exp\left(-\frac{|x-x'|^2}{\ell_i^2 + \ell_j^2}\right).
The main parameters of interest are these off-diagonal signal variances, as a measure of interaction between channels.
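If it helps, this is roughly how the covariance blocks are assembled (a quick numpy sketch of the formula above, not the actual Stan function; names and shapes are only illustrative):

import numpy as np

def multi_channel_K(x, sigma2, ell, jitter=1e-8):
    # x: (N,) inputs shared by all channels
    # sigma2: (M, M) signal variances sigma_{ij}^2
    # ell: (M,) per-channel length scales
    N, M = len(x), len(ell)
    d2 = (x[:, None] - x[None, :]) ** 2                  # squared distances, (N, N)
    K = np.zeros((M * N, M * N))
    for i in range(M):
        for j in range(M):
            k_ij = sigma2[i, j] * np.exp(-d2 / (ell[i] ** 2 + ell[j] ** 2))
            K[i * N:(i + 1) * N, j * N:(j + 1) * N] = k_ij
    return K + jitter * np.eye(M * N)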

Here’s the Stan code:

data {
    int<lower=1> N;
    int<lower=1> M;
    int trilM;
    vector[N] x;

    int y1a[M*N];
    int y2a[M*N];
    int y1b[M*N];
    int y2b[M*N];

    vector[M*N] normFact1a;
    vector[M*N] normFact2a;
    vector[M*N] normFact1b;
    vector[M*N] normFact2b;

    real mu_est[M];
    real sigma_est[M];
}
transformed data {
    real delta = 1e-8;

    real sigma[M] = sigma_est;
    real sigma2 = max(sigma_est);
}
parameters {
    real mu[M];

    real<lower=0> kfDiag[M];
    real kfTril[trilM];

    real<lower=0> ell[M];

    vector[M*N] f1_til;
    vector[M*N] f2_til;

    real<lower=0> alpha[M];
}
model {
    vector[M*N] muVector = mean_vector(x, mu, M, N);
    vector[M*N] alphaVector = dispersion_vector(alpha, M, N);

    matrix[M*N, M*N] K = multi_k_matrix(x, ell, kfDiag, kfTril, M, N, delta);  // user-defined multi-channel kernel (functions block not shown)

    matrix[M*N, M*N] L = cholesky_decompose(K);

    // non-centered latent functions, one per replicate
    vector[M*N] f1 = exp(muVector + L * f1_til);
    vector[M*N] f2 = exp(muVector + L * f2_til);

    mu ~ normal(mu_est, sigma);
    ell ~ gamma(2, 2);
    kfDiag ~ normal(0, sigma);
    kfTril ~ normal(0, sigma2);

    f1_til ~ normal(0, 1);
    f2_til ~ normal(0, 1);

    alpha ~ uniform(0, 1e9);

    y1a ~ neg_binomial_2((f1 .* normFact1a), alphaVector);
    y1b ~ neg_binomial_2((f1 .* normFact1b), alphaVector);
    y2a ~ neg_binomial_2((f2 .* normFact2a), alphaVector);
    y2b ~ neg_binomial_2((f2 .* normFact2b), alphaVector);
}

mean_vector and dispersion_vector just concatenate the appropriate means and negative binomial dispersion parameters to match the data, and multi_k_matrix sets up the covariance matrix as I described above and in my previous reply (I can’t use the cov_exp_quad function because of the additional structure).

Like I said, I didn’t see any warnings about divergences, but from the divergences case study I see that you sample a standard normal and translate/scale it. This is not a hierarchical model, so it would just be shifting/scaling by constants (mu_est, sigma_est, max(sigma_est)).
Is this supposed to always improve inference in the hierarchical case, and/or would it help in this case?

Although more data is added with an increasing number of channels, I expect that the number of parameters may be too large to get good estimates, but that’s part of the objective here: seeing how much this model can handle.
Thanks for the attention, if you have specific suggestions I’d really appreciate it.

Only the number of iterations, warmup iterations, initial step size, adapt_delta and the other parameters controlling or initializing adaptation are shared. During the adaptation each chain will get a different step size and mass matrix.

This will not solve the convergence problem, but you will get some speedup by using

y1a ~ neg_binomial_2_log(f1 + logNormFact1a, alphaVector)
...

where logNormFact1a = log(normFact1a)

Do you have a reference for that “signal variance parameter for each other channel”?

alpha ~ uniform(0, 1e9);

Are you sure uniform is a good prior for alpha? Maybe a hierarchical prior would be better?

The negative binomial can produce a multimodal posterior for certain values of alpha, so it’s also possible that this would explain the behavior.

Depending on how the distance between the observations is distributed

ell ~ gamma(2,2);

might also be too vague.

I recommend looking at pair plots of ell, kfDiag, kfTril, alpha and lp__ to learn about possible multimodality or funnels.
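For example, something like this (a sketch with matplotlib; the parameter names follow your Stan code, and this only plots the first channel against lp__ as a start; a full scatter-plot matrix of all pairs would be even better):

import numpy as np
import matplotlib.pyplot as plt

draws = fit.extract(permuted=True)   # pooled draws; keys follow the Stan parameter names
lp = draws['lp__']

pairs = {
    'log ell[1]': np.log(draws['ell'][:, 0]),
    'log kfDiag[1]': np.log(draws['kfDiag'][:, 0]),
    'kfTril[1]': draws['kfTril'][:, 0],
    'log alpha[1]': np.log(draws['alpha'][:, 0]),
}

fig, axes = plt.subplots(1, len(pairs), figsize=(16, 3))
for ax, (name, vals) in zip(axes, pairs.items()):
    ax.scatter(vals, lp, s=2, alpha=0.3)   # look for funnels or separated clumps
    ax.set_xlabel(name)
    ax.set_ylabel('lp__')
plt.tight_layout()
plt.show()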


Terrible, terrible prior. Maybe something like this?

I would suggest a half-normal, exponential, or half-t as a prior on the appropriate transformation of alpha, as suggested in the final section of this blog post: https://andrewgelman.com/2018/04/03/justify-my-love/ (make sure the parameterizations match!!)


Thanks for the tips, @avehtari.

I described this in more detail (albeit informally) here and in the references therein; the main reference is probably Melkumyan and Ramos plus a couple of others, but the manuscript on this work is not published or preprinted yet, so it may take a few more references to describe it properly.

Well, ideally I would not have potentially overdispersed data and would just use a Poisson, but I’m not messing with some “standards”, since I’ll already have to be convincing about not using frequentist statistics and linear models. So the negative binomial is here to stay, and alpha seems to be well estimated, but you are likely right that this may not be as innocuous as is usually believed.
Not sure what I’d use for a hyperprior in a hierarchical implementation. Could an inverse gamma be recommended as a prior directly instead?

We have very little prior knowledge for this. Unity is the scale of observation, but change can happen at any scale between that and 13 (the total length of observation); ideally I would actually prefer to have it uniform, because biologically we cannot know the scale of change of each channel, but I know it could get too messy, so I decided to constrain it for the good of the chain. So unless it’s crucial, it’s already less vague than I’d like.

I think maybe the chains are too short for the warmup tuning of the sampler parameters, but now I have a few ideas to work with here, so maybe that will solve it. Thanks for the help, I truly appreciate it.

Hi, Max,

Nice seeing you again around here. Thanks for the subtlety. Care to elaborate?
I know uniform is frequently not optimal, but I was hoping to get away with not messing with that. I’d be happy to try something different, though.

That looks pretty interesting. Will look into it. Thanks.

These “vague but proper” priors are often very bad. They tend to be sensitive to the endpoint and are often actually quite informative. So while it’s hard to say that they are never a good idea, I can’t come up with a situation where that prior would be a good choice. If you’re lucky it doesn’t hurt the inference, but the more common experience is that priors like that lead to bad inference.

Most of the literature on this focuses on priors on variances (Gelman has a paper from 2005 in Bayesian analysis that covers this case well)


If text output is fine:

fit.get_adaptation_info()

See Dan’s answer. See the recommendations page for some general pointers on building and choosing priors. To see whether your particular choice of prior is sensible, I recommend fixing \mu and simulating some data from a negative binomial with your uniform prior on \phi. Compare the simulations with, say, a half-normal on 1/\sqrt{\phi}. To begin with, just look at the induced distribution of \text{Var}(X) = \mu + \mu^2/\phi under both priors.

# Compare the variance Var(X) = mu + mu^2/phi induced by the two priors on phi.
mu <- 3

# Uniform(0, 1e9) prior on phi (the current choice).
phis.uniform <- runif(1e6, 0, 1e9)
vars.uniform <- mu + mu^2/phis.uniform

# Half-normal(0, 1) prior on 1/sqrt(phi).
inverse.sqrt.phis.halfnormal <- abs(rnorm(1e6, 0, 1))
phis.halfnormal <- 1/inverse.sqrt.phis.halfnormal^2
vars.halfnormal <- mu + mu^2/phis.halfnormal

par(mfrow = c(1, 2))
hist(vars.uniform, probability = TRUE, xlab = "Var(X)", main = "Uniform")
hist(vars.halfnormal, probability = TRUE, xlab = "Var(X)", main = "Half-normal")

[figure: histograms of Var(X) under the uniform and half-normal priors]

It sounds like you didn’t read all three of the case studies – the last one goes into extensive detail about why both small and large length scales are poorly identified for these kinds of kernels, and why you need to use domain expertise to suppress length scales near zero and near infinity. A gamma(2, 2) prior admits significant mass near small length scales; if everything isn’t calibrated correctly, the resulting posterior can be challenging to fit.

The multiple kernel marginal deviations and summed length scales can also induce poor identification that would need to be compensated with more informative priors, but the model is sufficiently complicated that it’s just speculation without spending a significant amount of time delving deeper into its details. I strongly recommend that you start simpler and introduce the desired structure sequentially to better isolate what features are causing the sampling problems.


Thanks. I am aware of the widespread confusion between uniform and uninformative priors, but when they don’t cause trouble in small models it is sometimes easier to justify them when implementing a Bayesian version of what people normally do using MLE, even if the implementation is not the most statistically justified. That doesn’t seem to be the case here at all, so I completely agree this can be improved.

Nonetheless the priors will still require justification, which I would normally be able to do on biological grounds, but here I can’t really. Therefore it’s important that I am able to justify the choices statistically and not be exposed to reviewers’ criticism of the form “you chose the priors based on what parameters would work for your results”. I think that (long) blog post and its references will go a long way to help me justify that, but it would be nice to have a less technical peer-reviewed publication directed at natural scientists et al. to try to make this point clear more broadly (maybe you are aware of something like that).

Other than that I think my question is mostly answered, although I’m not sure I can pick a single one of the answers as the “solution”. Thanks a lot.

I didn’t suggest not using the negative binomial. I suggested considering the sensible range of alpha values. The negative binomial has a log-concave likelihood for most alpha values, and when a log-concave likelihood is combined with a Gaussian prior the posterior is unimodal. For small alpha values the negative binomial is not log-concave, and when combined with a Gaussian prior the posterior can be multimodal. There is no single cutoff value, since it also depends on mu, but any alpha below 1 is very small and likely to be problematic. If you use the default initialization, uniform on [-2, 2] on the unconstrained scale, then you have a 50% chance to get alpha < 1, multimodality, and a stuck chain. Having a uniform prior on alpha doesn’t help here. That’s why I recommend looking at the alpha values and considering an alternative prior.
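For example, a quick check of the sampled alpha values (a sketch, assuming a PyStan 2.x fit object called fit):

alphas = fit.extract(permuted=True)['alpha']   # pooled draws, shape (draws, M)
print((alphas < 1).mean(axis=0))               # fraction of draws with alpha < 1, per channel
print(alphas.min(axis=0))                      # smallest sampled alpha per channel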

The default initialization is also likely to cause problematic values for all the other parameters. Parameters with real<lower=0> are initialized uniformly on [-2, 2] on the unconstrained (log) scale, i.e. roughly between 0.14 and 7.4, which is quite a wide range for kfDiag, ell, and alpha. mu, f1_til and f2_til don’t have constraints, but since they enter through exp(muVector + L * f1_til) and exp(muVector + L * f2_til), the latent functions are likely to have an even wilder range due to the sum and exponentiation.

This is an example of a model where it will cause problems. Anything with a GP isn’t a small model.

Right, you didn’t. I was the one who was wishing I didn’t have to deal with that, and was otherwise hoping I could start the chain at “large” values of alpha and hope it didn’t drift into that dangerous area. But you’re right, and after hearing several suggestions I will start with the half-normal prior on the inverse square root of the dispersion, i.e. declare a parameter \phi \sim \mathcal{N}(0,1) with <lower=0> and set \alpha = 1/\phi^2.

Also, just to be clear, in

y1a ~ neg_binomial_2_log(f1 + logNormFact1a, alphaVector)

f1 is no longer exponentiated, it seems to me. The _log version does that implicitly, right?

Agreed. Thanks.

Yes. Still, the same problem holds for the initialization, and you can choose more reasonable inits by thinking in terms of mu for the negative binomial.
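For example, something along these lines (a sketch for PyStan 2.x; the values are only illustrative, and M, N, trilM, mu_est stand for the same quantities as in your data block):

init_dict = {
    'mu': list(mu_est),           # centre the means at the data-based estimates
    'kfDiag': [1.0] * M,
    'kfTril': [0.0] * trilM,
    'ell': [1.0] * M,
    'f1_til': [0.0] * (M * N),
    'f2_til': [0.0] * (M * N),
    'alpha': [10.0] * M,          # keep the dispersion well away from the alpha < 1 region
}

fit = model.sampling(data=stan_data, chains=8, init=[init_dict] * 8)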
