Imagine I have 6 chains running the HMC algorithm (with Stan). After warm-up, which took 2000 iterations, three chains converged (I know the ‘true’ parameter values), but the other three failed to do so. Sampling in both cases took 500 iterations.

Then I used the method suggested on the site and ran warm-up for the three failed chains again - say, 2000 more iterations. All of them finally converged. My question is rather general: is it methodologically fine to use different warm-up (burn-in) sizes for different chains? Or should I also add some warm-up and resample the successful chains as well (which, I believe, will not add much to the results)?

Technically there’s nothing wrong with that. But I’d first investigate why the 3 chains fail to converge. It may also help to initialize the failed chains with the step size and/or mass matrix from the successful chains.

Oh, thank you for the reply. What are the ways to investigate those failures? I thought it could just happen because of unfortunate starting points (I usually take them at random).

My example is complex enough (a mixed logit aggregated demand model).
Stan’s optimization procedure (e.g., Newton) finds the global optimum reliably, but HMC usually fails (the variational algorithm usually fails as well).
HMC only converges if I take initial values not far from the optimization estimates.
I’ve tested it even with 10,000 warm-up iterations.

Typically we want to make sure that there are no divergent transitions reported. That’s the usual problem with non-convergence due to sub-optimal parameterization. The usual fix is to reparameterize. The most common cause is a centered parameterization of a hierarchical model.
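For concreteness, here is a minimal sketch of the non-centered reparameterization (the model and names are illustrative, not from this thread). Instead of declaring `theta` directly and writing `theta ~ normal(mu, tau)` (the centered form, which tends to produce divergences when the data only weakly identify `tau`), we sample a standardized parameter and rescale it:

```stan
data {
  int<lower=1> J;            // number of groups (illustrative)
}
parameters {
  real mu;                   // population location
  real<lower=0> tau;         // population scale
  vector[J] theta_raw;       // standardized group effects
}
transformed parameters {
  // theta has the same marginal prior as the centered theta ~ normal(mu, tau),
  // but the sampler now works on the unit-scale theta_raw
  vector[J] theta = mu + tau * theta_raw;
}
model {
  theta_raw ~ std_normal();
}
```

This decouples the geometry of `theta` from `tau`, which is what removes the funnel that HMC struggles with.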

The second quote here says that HMC succeeds. Was the first just referring to our defaults failing to run for enough iterations?

If you have something like a high-dimensional regression, you can tighten the initialization from uniform(-2, 2) to something like uniform(-1, 1) or even narrower. That tends to start you less far out in the tail.

You also want to parameterize the model, so far as you can, so that the posterior is centered and roughly unit scale (in the unconstrained parameters). Then the default initialization will be roughly correct.

I’m not sure we talk about it directly in a single section. The basic idea is that we would like our posterior distribution to be standard normal (i.e., unit scale, centered at zero, and every parameter is independent). We talk about ways to do that in the section of the user’s guide on reparameterization:

It also comes up in the problematic posteriors chapter:

It’s pretty easy for simple parameters with fixed priors. For example, if I have this

parameters {
real alpha;
}
model {
alpha ~ normal(100, 500);
}

Then the model isn’t centered at the origin and has a scale of 500 rather than 1. To adjust the model so that it samples over the right space, we can use an affine transform:

parameters {
real alpha_raw;
}
transformed parameters {
real alpha = 100 + 500 * alpha_raw;
}
model {
alpha_raw ~ normal(0, 1);
}

which puts alpha_raw’s prior on the unit scale. It turns out Stan can apply the same affine transform directly in the parameter declaration.
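Sketching what that declaration looks like with Stan’s offset/multiplier syntax (the numbers match the example above; this is equivalent to the explicit alpha_raw version):

```stan
parameters {
  // Stan applies the affine transform under the hood: the sampler works
  // on a unit-scale variable while the model sees alpha directly
  real<offset=100, multiplier=500> alpha;
}
model {
  alpha ~ normal(100, 500);
}
```

This keeps the model readable, since there is no separate `alpha_raw` to carry around.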

Oh, I see, that’s very clear. Reading about non-centered reparameterization, I had missed the point that both the location and the scale can hamper the sampling process.

Just a final remark to close the topic: if I have a vector of parameters that I have to keep as a vector because of the data structure, is there a proper way to reparameterize it?