Is it fine to use different warm-up (burn-in) sizes for different chains (by manually continuing some chains)?

Imagine I have 6 chains running the HMC algorithm (with Stan). After warm-up, which took 2000 iterations, three chains have converged (I know the ‘true’ parameter values), but the other three have failed to do so. Sampling in both cases took 500 iterations.

Then I used the method suggested on this site and ran warm-up for the three failed chains again - say, 2000 more iterations. All of them finally converged.
My question is rather general: is it methodologically fine to use different warm-up (burn-in) sizes for different chains? Or should I also add some warm-up and resample the successful chains as well (which, I believe, will not add much to the results)?

Technically there’s nothing wrong with that. But I’d first investigate why the 3 chains fail to converge. It may also help to initialize the new runs with the step size and/or mass matrix from the successful chains.


Oh, thank you for the reply. What are the ways to investigate those failures? I thought this can happen simply because of unfortunate starting points (usually I choose them at random).

My example is fairly complex (a mixed logit aggregated demand model).
Stan’s optimization procedure (i.e. Newton) finds the global optimum reliably, but HMC usually fails (the variational algorithm usually fails as well).
HMC only converges if I take initial values not far from the optimization estimates.
I’ve tested this even with 10,000 warm-up iterations.

Typically we first want to make sure that there are no divergent transitions reported; those are the usual symptom of non-convergence due to a sub-optimal parameterization. The usual fix is to reparameterize. The most common cause is centered parameterizations of hierarchical models.
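
For concreteness, here is what that looks like on a generic toy hierarchical model (just a sketch; J, y, sigma, mu, tau and theta are placeholder names, not anything from your model). The centered version would declare theta as a parameter and write theta ~ normal(mu, tau); the non-centered version samples a unit-scale variable and rescales it:

data {
  int<lower=0> J;
  vector[J] y;
  vector<lower=0>[J] sigma;
}
parameters {
  real mu;
  real<lower=0> tau;
  vector[J] theta_raw;                      // unit-scale auxiliary variable
}
transformed parameters {
  vector[J] theta = mu + tau * theta_raw;   // implies theta ~ normal(mu, tau)
}
model {
  mu ~ normal(0, 5);
  tau ~ normal(0, 5);
  theta_raw ~ normal(0, 1);                 // sampled geometry no longer depends on tau
  y ~ normal(theta, sigma);
}

In the centered version, small values of tau create a funnel in the posterior of theta that HMC signals with divergent transitions; moving the scaling into transformed parameters removes that coupling.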

The second quote here says that HMC succeeds. Was the first just referring to our defaults failing to run for enough iterations?

If you have something like a high-dimensional regression, you can step down the initialization from uniform(-2, 2) to something like uniform(-1, 1) or even tighter. That tends to put you less far into the tail.

You also want to parameterize the model as much as you can so the posterior is centered and roughly unit scale (in the unconstrained parameters). Then the default initialization will be roughly correct.
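
For example, if a block of coefficients has a normal(0, 3) prior, you can let the sampler work with a unit-scale raw parameter and rescale it in transformed parameters. A minimal sketch with placeholder names (K, beta_raw, beta):

data {
  int<lower=1> K;
}
parameters {
  vector[K] beta_raw;               // what HMC actually explores, roughly unit scale
}
transformed parameters {
  vector[K] beta = 3 * beta_raw;    // coefficients on their natural scale
}
model {
  beta_raw ~ normal(0, 1);          // implies beta ~ normal(0, 3)
}

The transform is linear with a constant Jacobian, so no adjustment is needed, and the default uniform(-2, 2) initialization on beta_raw then lands at roughly the right scale.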


Thank you. I read it many times but it seems more appropriate for a hierarchical model. My model is indeed a kind of multidimensional regression of the form below:

Is there room for non-centered parameterizations here?

model {
  // priors

  theta1[1] ~ normal(0, 25)T[-100,0];           // truncation assumes theta1[1] is constrained to [-100, 0] in its declaration (not shown)
  theta1[2:] ~ normal(0, 3);                    // less informative
  
  to_vector(Demo) ~ normal(0, 5);
  Sigma ~ normal(0, 5);
  
  // model of sales 
  {
    matrix[T, J+1] pred_shares;
    
    for(t in 1:T) {
      
      pred_shares[t] = shares1( .... , v[t], demogr[t], theta1, Sigma, Demo, NS, J, P );
      
      sales[t] ~ multinomial(pred_shares[t]');
    }
  }
}

where shares1 is essentially a softmax function.
The main part of shares1 is:

row_vector shares1(....., theta1, Sigma, Delta ){
   ...
   utilities[1:NS, 1:J] = delta + (bigX2 * diag_matrix(Sigma) * V)' + (bigX2 * Demo * D)';   
   ...
}

There are 45 parameters altogether.

Well, HMC can converge with 10,000 iterations and your defaults if I take initial values not far from the optimum parameters. If I take them at random, most often the chains do not even move far from values of 0.

Do you mean that I have to normalize parameters somehow?