Idea for additional convergence metric: Rhat for the warmup info?

Would it make any sense to compute something like Rhat for the information that comes out of the warmup period? I’m thinking primarily of the HMC parameters, but the model-parameter draws might also be helpful. (I know the draws during warmup are not samples from the posterior, but maybe they contain information useful for detecting bad adaptation: dramatic differences between chains suggest the chains explored different regions during warmup, so warmup may have been insufficient.)

If some thresholds were developed for that, sampling could be terminated at the end of adaptation whenever adaptation was deemed insufficient.
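As a minimal sketch of the idea, here is the basic split-Rhat (per Gelman et al.; not Stan’s rank-normalized version) applied to per-chain traces from a late warmup window. The function name, array shapes, and synthetic data are all illustrative, not an existing Stan interface:

```python
import numpy as np

def split_rhat(draws):
    """Basic split-Rhat for one quantity.

    draws: array of shape (n_chains, n_iter); each chain is split in
    half so within-chain trends also inflate the statistic.
    """
    n_chains, n_iter = draws.shape
    half = n_iter // 2
    # Split each chain in two, doubling the number of "chains".
    splits = np.concatenate([draws[:, :half], draws[:, half:half * 2]], axis=0)
    m, n = splits.shape
    chain_means = splits.mean(axis=1)
    chain_vars = splits.var(axis=1, ddof=1)
    w = chain_vars.mean()              # within-chain variance
    b = n * chain_means.var(ddof=1)    # between-chain variance
    var_plus = (n - 1) / n * w + b / n
    return np.sqrt(var_plus / w)

# Hypothetical usage: draws from the last warmup window only,
# since earlier warmup is strongly non-stationary.
rng = np.random.default_rng(0)
late_warmup = rng.normal(size=(4, 200))  # 4 chains, 200 iterations
print(split_rhat(late_warmup))           # near 1 when chains agree
```

A value far above 1 on such a window would flag that the chains were still in visibly different regions at the end of adaptation, which is the kind of trigger a termination threshold could use.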

Obviously the very cool ideas for pooled warmup would thwart this, but I thought I’d post the idea anyhow.

  • stepsize__ is fixed within each chain, so the within-chain variance is 0; there is no sense in comparing within- and between-chain variances, and thus Rhat is not valid. ESS, whether single-chain or multi-chain, is invalid for the same reason. The diagnostic information is in the actual values and the between-chain variation, which can be summarised, e.g., with mean, sd, and quantiles.
  • accept_stat__, treedepth__, n_leapfrog__, and divergent__ are conditional on the fixed stepsize. They may have within-chain variation, but their asymptotic between-chain variance depends on the between-chain variance of stepsize__, so Rhat is not valid; multi-chain ESS is invalid for the same reason. Single-chain ESS would be meaningful if we cared about the MCSE of accept_stat__, treedepth__, n_leapfrog__, and divergent__, and if the series of their values were shown to be Markovian. The diagnostic information is mostly in the actual values and overall variation, which can be summarised, e.g., with mean, sd, and quantiles. In addition, comparing the between-chain variation of these quantities to the between-chain variation of stepsize__ would tell the sensitivity of the algorithm to stepsize__, but calling this comparison Rhat is invalid. Within-chain autocorrelation can also reveal insights into the algorithm’s behavior, but this is better described in direct terms instead of as ESS.
  • Reporting Rhat and ESS for stepsize__, accept_stat__, treedepth__, n_leapfrog__, and divergent__ has been a convenience choice: it allows easy use of just one CSV file, and summary computes everything for all quantities. I would change what is reported.
  • Practically, an experienced user can use Rhat and ESS for stepsize__, accept_stat__, treedepth__, n_leapfrog__, and divergent__ for diagnostics, but they are theoretically invalid, and it would be better for experienced users to also use the correct terms.
  • For less experienced users, showing Rhat and ESS for these quantities is confusing, which we have seen several times in the Stan discussion forum. Convenience of implementation should not be a reason to unnecessarily confuse users.
  • Summaries for stepsize__, accept_stat__, treedepth__, n_leapfrog__, and divergent__ should be reported separately, with a clear indication that they are diagnostics; then it would be possible to display appropriate summaries.
  • The warmup period is expected to be non-stationary, so Rhat and ESS would not be that useful over the whole period. For some later warmup windows they could tell something, but that is part of the adaptive warmup campfire work.
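As a sketch of the kind of separate diagnostic summary suggested above: per-chain values and between-chain spread for stepsize__, and mean/sd/quantiles for the quantities that vary within a chain. The column names follow Stan’s CSV sampler parameters, but the function, input layout, and output structure are illustrative, not an existing interface:

```python
import numpy as np

def sampler_diagnostics_summary(chains):
    """Summarise sampler parameters without Rhat/ESS.

    chains: list of dicts (one per chain) mapping a sampler-parameter
    name to a 1-D array of its post-warmup values.
    """
    out = {}
    # stepsize__ is constant within a chain: report the per-chain
    # values and their between-chain spread directly.
    stepsizes = np.array([c["stepsize__"][0] for c in chains])
    out["stepsize__"] = {"per_chain": stepsizes,
                         "mean": stepsizes.mean(),
                         "sd": stepsizes.std(ddof=1)}
    # The other quantities vary within a chain: pool them and report
    # mean, sd, and tail quantiles as the overall variation.
    for name in ("accept_stat__", "treedepth__", "n_leapfrog__", "divergent__"):
        vals = np.concatenate([c[name] for c in chains])
        out[name] = {"mean": vals.mean(),
                     "sd": vals.std(ddof=1),
                     "q5": np.quantile(vals, 0.05),
                     "q95": np.quantile(vals, 0.95)}
    return out
```

This keeps the diagnostics in a table of their own, so a summary printer could clearly label them as sampler diagnostics rather than mixing them in with posterior quantities.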

Thanks for such a thorough answer!