Is it fine to use different warm-up (burn-in) sizes for different chains (by manually continuing some chains)?

Imagine I have 6 chains running the HMC algorithm (with Stan). After warm-up, which took 2000 iterations, three chains have converged (I know the ‘true’ parameter values), but the other three have failed to do so. Sampling in both cases took 500 iterations.

Then I used the method suggested on this site and ran warm-up for the three failed chains again - say, 2000 more iterations. All of them finally converged.
My question is rather general: is it methodologically fine to use different warm-up (burn-in) sizes for different chains? Or should I also add some warm-up and resample the successful chains as well (which, I believe, will not add much to the results)?

Technically there’s nothing wrong with that. But I’d first investigate why the 3 chains fail to converge. It may also help to initialize the new runs with the step size and/or mass matrix from the successful chains.


Oh, thank you for the reply. What are the ways to investigate those failures? I thought this can happen simply because of unfortunate starting points (usually I choose them at random).

My example is fairly complex (a mixed logit aggregated demand model).
Stan’s optimization procedure (i.e. Newton) finds the global optimum reliably, but HMC usually fails (the variational algorithm usually fails as well).
HMC only converges if I take initial values not far from the optimization estimates.
I’ve tested this even with 10,000 warm-up iterations.

Typically we first want to make sure that there are no divergent transitions reported; those are the usual symptom of non-convergence due to a sub-optimal parameterization. The usual fix is to reparameterize. The most common cause is centered parameterizations of hierarchical models.
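
For concreteness, here is what that looks like on a generic toy hierarchical model (just a sketch; J, y, sigma, mu, tau and theta are placeholder names, not anything from your model). The centered version would declare theta as a parameter and write theta ~ normal(mu, tau); the non-centered version samples a unit-scale variable and rescales it:

data {
  int<lower=0> J;
  vector[J] y;
  vector<lower=0>[J] sigma;
}
parameters {
  real mu;
  real<lower=0> tau;
  vector[J] theta_raw;                      // unit-scale auxiliary variable
}
transformed parameters {
  vector[J] theta = mu + tau * theta_raw;   // implies theta ~ normal(mu, tau)
}
model {
  mu ~ normal(0, 5);
  tau ~ normal(0, 5);
  theta_raw ~ normal(0, 1);                 // sampled geometry no longer depends on tau
  y ~ normal(theta, sigma);
}

In the centered version, small values of tau create a funnel in the posterior of theta that HMC signals with divergent transitions; moving the scaling into transformed parameters removes that coupling.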

The second quote here says that HMC succeeds. Was the first just referring to our defaults failing to run for enough iterations?

If you have something like a high-dimensional regression, you can step down the initialization from uniform(-2, 2) to something like uniform(-1, 1) or even tighter. That tends to put you less far into the tail.

You also want to parameterize the model as much as you can so the posterior is centered and roughly unit scale (in the unconstrained parameters). Then the default initialization will be roughly correct.
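
For example, if a block of coefficients has a normal(0, 3) prior, you can let the sampler work with a unit-scale raw parameter and rescale it in transformed parameters. A minimal sketch with placeholder names (K, beta_raw, beta):

data {
  int<lower=1> K;
}
parameters {
  vector[K] beta_raw;               // what HMC actually explores, roughly unit scale
}
transformed parameters {
  vector[K] beta = 3 * beta_raw;    // coefficients on their natural scale
}
model {
  beta_raw ~ normal(0, 1);          // implies beta ~ normal(0, 3)
}

The transform is linear with a constant Jacobian, so no adjustment is needed, and the default uniform(-2, 2) initialization on beta_raw then lands at roughly the right scale.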


Thank you. I read it many times but it seems more appropriate for a hierarchical model. My model is indeed a kind of multidimensional regression of the form below:

Is there room for non-centered parameterizations here?

model {
  // priors

  theta1[1] ~ normal(0, 25)T[-100,0];           // truncation assumes theta1[1] is constrained to [-100, 0] in its declaration (not shown)
  theta1[2:] ~ normal(0, 3);                    // less informative
  
  to_vector(Demo) ~ normal(0, 5);
  Sigma ~ normal(0, 5);
  
  // model of sales 
  {
    matrix[T, J+1] pred_shares;
    
    for(t in 1:T) {
      
      pred_shares[t] = shares1( .... , v[t], demogr[t], theta1, Sigma, Demo, NS, J, P );
      
      sales[t] ~ multinomial(pred_shares[t]');
    }
  }
}

where shares1 is essentially a softmax function.
The main part of shares1 is:

row_vector shares1(....., theta1, Sigma, Delta ){
   ...
   utilities[1:NS, 1:J] = delta + (bigX2 * diag_matrix(Sigma) * V)' + (bigX2 * Demo * D)';   
   ...
}

There are 45 parameters altogether.

Well, HMC can converge with 10,000 iterations and your defaults if I take initial values not far from the optimum parameters. If I take them at random, most often the chains do not even move far from values of 0.

Do you mean that I have to normalize parameters somehow?