How does Stan work during warm up (Stan warm-up algorithm/Stan burn-in algorithm)?

Dear Stan folks,

I am wondering how Stan works during the warm-up phase. To do prior predictive checks, I used the blavaan package in R to draw samples from the prior distributions, with the following code (I omit the model syntax since it is not the focus of this question).

fit <- bsem(model=model.syntax, data=dat, n.chains = 2, prisamp = TRUE,
            burnin = 2500, sample = 2500)

With the prisamp = TRUE argument, I get samples from the prior distributions. Meanwhile, I also specified the number of burn-in iterations for the sampling. Since convergence is not the main interest in prior predictive checks, I considered setting burnin = 0 instead.

When I discussed this with Ed (the author of the blavaan package), he said it might still be safer to keep the warm-up, since the starting values could have some influence on the draws. In this context, I would like to know how Stan actually works during the warm-up phase. Given that I only want samples from the priors, does Stan's warm-up affect the drawn samples? If so, by how much? And if warm-up is needed, how many burn-in iterations should be used? I know this depends on the model specification and the data, but I want to get a rough sense of it.



The first 150 iterations (I think that's right) are the period in which adaptation of the HMC parameters occurs; see this section from the reference manual for more information.

The iterations after this cover the process of the sampler finding the typical set, and so may not be representative of the posterior distribution; hence these samples are discarded before assessing convergence/ESS.
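For reference, when calling Stan directly through rstan (which blavaan can use as a backend), the warm-up length and the adaptation buffers are exposed through the `control` list. A sketch only: `"model.stan"` and `dat` here are placeholders, not from this thread.

```r
# Illustrative sketch: explicit warmup/adaptation settings in rstan.
# The control names below are rstan's adaptation knobs (defaults shown).
library(rstan)

fit <- stan(
  file = "model.stan", data = dat,
  warmup = 500, iter = 1000,
  control = list(
    adapt_init_buffer = 75,   # initial fast interval (reaching the typical set)
    adapt_window      = 25,   # base slow window; doubles each window
    adapt_term_buffer = 50    # final fast interval (step-size re-tuning)
  )
)
```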


Hi @andrjohns,
Thanks for your reply.
If I understood your answer correctly, the initial 150 iterations are used for adaptation of the HMC sampler, and the iterations after those first 150 might still not be representative, so in my situation at least 150 burn-in iterations (and probably more) are necessary?

Yep, that’s right


Cool, thanks a lot!!

This is incorrect.

The first 150 iterations have little adaptation and are meant to give the sampler a chance to get close to, if not into, the typical set. Adaptation that’s too aggressive early on tends to adapt more towards the irrelevant structure of the posterior (outside of the typical set) rather than the relevant structure (inside the typical set).

After that initial window the main adaptation routine turns on, exploring the typical set and then updating the sampler configuration in a sequence of expanding windows. There's no guarantee that the sampler will always have found the typical set by the time the main adaptation turns on, but usually it's close enough for the adaptation to be sufficiently robust.

The iterations start becoming “representative” once the sampler has found the typical set – which again happens early during warmup – but the variation in sampling behavior induced by the adaptation can introduce awkward behavior in those samples. By removing all of the warmup iterations we avoid not only the early non-representative samples (including those that extend beyond the nominal 150 iteration initial window) but also any awkward behavior due to the active adaptation.
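To make the three phases concrete, here is a small R sketch (my own illustration, not Stan's source code) of the documented default warm-up schedule: a 75-iteration initial buffer, doubling adaptation windows starting at 25 iterations, and a 50-iteration terminal buffer, with the final window expanded to fill the remaining warm-up.

```r
# Illustrative sketch of Stan's default windowed-adaptation schedule,
# assuming warmup is long enough for all three phases (>= 150 iterations).
warmup_windows <- function(num_warmup,
                           init_buffer = 75,   # phase I: initial fast interval
                           base_window = 25,   # phase II: first slow window
                           term_buffer = 50) { # phase III: terminal fast interval
  adapt_end <- num_warmup - term_buffer
  start <- init_buffer
  w <- base_window
  ends <- c()
  while (start + w < adapt_end) {
    if (start + 3 * w >= adapt_end) {
      # The window after this one would not fit before the terminal
      # buffer, so expand the current window to absorb the remainder.
      return(list(init = init_buffer,
                  window_ends = c(ends, adapt_end),
                  term = term_buffer))
    }
    start <- start + w
    ends <- c(ends, start)
    w <- 2 * w
  }
  list(init = init_buffer, window_ends = c(ends, adapt_end), term = term_buffer)
}

# For the default 1000 warmup iterations this gives windows ending at
# iterations 100, 150, 250, 450, 950 (sizes 25, 50, 100, 200, 500).
warmup_windows(1000)$window_ends
```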


Great, thanks for clarifying!
