When reviewing traceplots of a model with some convergence issues, I noticed one of the chains started its warmup in an area of the parameter space well away from the other 3 chains. It also got ‘stuck’ there, whilst the other 3 chains began their fuzzy caterpillar and apparent effective sampling.

I noticed that the area where the rogue chain began was an area with very low probability under the prior. I was originally under the impression that the prior would influence the initial values, but after seeing this, and also what’s written in the manual (which says the default is uniform(-2, 2)), I now realise this is not the case.

Are initial values prior agnostic? Do strong priors however help chains ‘recover’ from beginning in areas well away from the likely parameter space?

I feel like manually setting initial values is an ugly hack around this issue, and that instead I should be re-parameterising my model to have likely coefficient values in the range of uniform(-2, 2), but maybe not.

The initial values are only influenced by the hard constraints, not the priors. The uniform(-2, 2) draw happens on the unconstrained scale. You can make the interval narrower in ‘rstan’ with the ‘init_r’ argument.
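To make that concrete, here is a minimal Python sketch of how a uniform(-2, 2) draw on the unconstrained scale maps back to a lower-bounded parameter. This mimics Stan’s behaviour rather than being Stan’s actual code; `default_init` is a made-up helper name.

```python
import random
import math

def default_init(init_r=2.0):
    # Stan draws each unconstrained parameter uniformly from (-init_r, init_r);
    # rstan's init_r argument narrows this default of 2.
    return random.uniform(-init_r, init_r)

# A parameter declared with lower=0 lives on the log scale when
# unconstrained, so its initial value after transforming back is exp(u):
u = default_init()
sigma_init = math.exp(u)

# With the default init_r = 2, sigma_init falls in
# (exp(-2), exp(2)), roughly (0.135, 7.39).
assert math.exp(-2) <= sigma_init <= math.exp(2)
```

Shrinking ‘init_r’ tightens this interval around 1 on the constrained scale, which is why it can help for parameters whose likelihood is very sensitive near the default inits.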

In my experience, convergence issues are rarely fixed by changing initial values (though this does happen; measurement error models would be an example of such a case). I’d suggest checking whether the place your chain got stuck isn’t a second local maximum of your posterior, and if so, try to reparametrize to have only one posterior mode. Even relatively strong priors may not wipe out a second mode (local maximum), so getting rid of the second mode by tightening priors might require making your priors stronger than is defensible.

Yes, initial values are always prior agnostic and are uniform(-2, 2) on the unconstrained scale of the parameter, as @stijn mentioned. As you experienced, if you don’t supply initial values you can start in an area with very low prior probability. For example, if I have a parameter that represents the weight of a person in grams and the prior is Normal(70,000, 10,000), then the default initial values will clearly be in an area of very low log-probability. The prior helps Stan’s sampler get to where there’s higher log-probability by changing the gradient to take you in the right direction, but sometimes you can get stuck in a spurious mode like you said, or, as I experience in ODE land, you’ll be in an area of parameter space where your ODEs can’t even evaluate properly. For that reason, I would recommend supplying initial values if you have an idea of what they should be.
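To put a number on the weight-in-grams example, a short Python check (the Normal log-density written out by hand, no external libraries) shows how far below the prior mode the default inits start:

```python
import math

def normal_lpdf(x, mu, sigma):
    # Log density of Normal(mu, sigma).
    return (-0.5 * math.log(2 * math.pi) - math.log(sigma)
            - 0.5 * ((x - mu) / sigma) ** 2)

mu, sigma = 70_000, 10_000   # weight in grams, as in the example above

# The default init for an unconstrained parameter is at most 2,
# nowhere near 70,000.
at_init = normal_lpdf(2.0, mu, sigma)
at_mode = normal_lpdf(mu, mu, sigma)

gap = at_init - at_mode
# gap is about -24.5: the default init sits ~24.5 log units below
# the prior mode, so the sampler starts deep in the prior's tail.
```

Rescaling the parameter (e.g. working in units of 10 kg) or supplying inits near the prior mode avoids starting in that tail.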

Also, if you know that parameter values in that spurious mode shouldn’t be possible, then you can adjust your priors to put even less or zero mass there.

You can make this less hacky by drawing initial, random values from the priors. You have to do this outside Stan. Like @arya and @martinmodrak indicate, a lot will depend on the parameter space that you are working in. If smallish variation in the parameters can have a big effect on the likelihood (power functions, hierarchical sds, ODE land), it can be better to supply your own initial values or start narrower in the automatically provided ones. If it’s really a convergence problem, then maybe your model is weakly identified, not optimally parameterized, or your data are less informative than you might think. This would require an investigation of your model.
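As a sketch of drawing inits from the priors outside Stan: the parameter names and priors below (mu ~ Normal(0, 5), sigma ~ half-Normal(0, 2)) are made up for illustration; substitute your own model’s.

```python
import random

def draw_inits(n_chains=4, seed=1234):
    # One init per chain, each drawn from the (assumed) priors.
    rng = random.Random(seed)
    inits = []
    for _ in range(n_chains):
        inits.append({
            "mu": rng.gauss(0, 5),
            "sigma": abs(rng.gauss(0, 2)),  # keep sigma positive
        })
    return inits

inits = draw_inits()
# In rstan you would pass something like this via the init argument,
# one named list of initial values per chain.
```

Because each chain gets its own prior draw, chains still start overdispersed (which convergence diagnostics rely on), just overdispersed in a region the prior considers plausible.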

I vaguely remember a blog post from either @Bob_Carpenter or @andrewgelman where they were talking about making it so Stan automatically draws the initial values from the priors specified in the model. I don’t remember if it was a thought or an actual feature that’s being worked on. I figure it might be kind of hard to do because somehow Stan would have to detect what part of the model is the prior.

Thanks to everyone that’s posted, your discussion has been really helpful.

I definitely notice that setting stronger priors helps the sampler move away from areas of lower expected log-probability and into areas of fuzzy caterpillar and convergence. However, as @martinmodrak warns, I also worry that I’m steering the sampler away from possible alternative posterior modes.

I’ll keep experimenting and try to find the right balance.

Strictly speaking, you can’t tell Stan to draw from the priors, because a Stan program doesn’t know anything about priors; all it does is construct the objective function or log-posterior.

With the new Stan, in which you can write declarations such as:

vector[N] theta ~ normal(0, 1);

There, I guess one could consider the model statements that are attached to declarations as the “prior” for the purpose of drawing inits.

This sort of thing actually comes up a lot, especially in model building where we want to start by setting a parameter to a fixed value and then relaxing it. For example, the model is not converging well, so we set some parameters to fixed values. We set alpha to the value 27.0. Then we decide to relax alpha; we give it a normal(27, 1) prior. With current Stan this can be a disaster because “27” is so far from the default starting values. So when putting this into Stan we either need to specify the starting values (which is awkward because the starting values are part of the call to Stan, not part of the Stan program itself) or else we need to do some clumsy reparameterization, e.g.,

theta_0 = 27; // or input theta_0 as data

real e_theta ~ normal(0,1);

theta = theta_0 + e_theta;
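A quick Python sanity check (a Monte Carlo sketch, not Stan) confirms that this shifted parameterization gives theta the intended normal(27, 1) distribution while e_theta starts near the default uniform(-2, 2) inits:

```python
import random
import statistics

random.seed(0)
theta_0 = 27.0

# e_theta ~ Normal(0, 1); theta = theta_0 + e_theta,
# so theta ~ Normal(27, 1) by a simple shift of location.
draws = [theta_0 + random.gauss(0, 1) for _ in range(100_000)]

m = statistics.mean(draws)
s = statistics.stdev(draws)
# m is close to 27 and s is close to 1
```

The point is that e_theta, the quantity the sampler actually initializes, now lives on a unit scale where the default inits are sensible, while theta carries the substantive scale.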

Instead, perhaps it would be good to allow theta to have some scaling in its declaration. Or, to step back a moment, it would be good to have some mechanism for letting a parameter relax from a fixed value, without either needing hard-coded inits (as that’s redundant with the model) or an awkward reparameterization.