Changing default inits

Stan uses a default init of 2; that is, initial values for parameters are drawn from uniform(-2, 2) on the internal (unconstrained) scale.

I think it would be better to use the default of 0.1.

It would be even better to use inits from Pathfinder, and that’s fine too. My thought is that switching to a default of 0.1 is very low cost, and it can be done while waiting for Pathfinder inits to be put in.

P.S. See my comment below. My concern here is not Stan crashing or being unable to draw any samples (I think that’s what Steve and Aki are talking about when they refer to the sampler “failing”). Rather, my concern is that the default starting values are in many problems so far away from the mass of the distribution that Stan gets lost trying to get anywhere. My concern is not Stan failing in the sense of it not running at all, but rather Stan having a practical failure in the sense of going very slowly and having poor mixing because it’s wasting time spinning its wheels out in the boondocks.

Do you mean sampling from a range of -0.1 to 0.1? I think we could change the default to try (-2, 2) and then, if that fails, try (-0.1, 0.1).

My suggestion is to start with the current default (-2, 2) so that when it works, the behavior doesn’t change. But in case of rejection, halve the range to (-1, 1) and keep halving until success. This keeps the original idea of aiming for large variation in the initial values, but by the 5th try we’re already in the range @andrewgelman proposes. We talked about this with Andrew yesterday, and he thinks the halving would be too complicated to implement, but I think changing the default init that much is too big a change with unknown consequences, too. @stevebronder’s suggestion of trying (-2, 2) and then (-.1, .1) is less flexible than what I propose. Of course, in my approach it is still possible that even a (-ε, ε) init doesn’t work, because 0 is bad for some parameters.
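The halving schedule can be sketched as follows (a Python sketch with hypothetical names — `try_init` stands in for the sampler’s actual rejection check, which lives in C++):

```python
# Sketch (not Stan's actual code) of the halving schedule proposed above.
def find_init_radius(try_init, init_radius=2.0, max_tries=20):
    """Halve the init radius after each rejected draw until try_init succeeds."""
    radius = init_radius
    for _ in range(max_tries):
        # try_init stands in for drawing inits from uniform(-radius, radius)
        # and checking that the log density and its gradient are finite
        if try_init(radius):
            return radius
        radius /= 2.0  # reject: (-2,2) -> (-1,1) -> (-0.5,0.5) -> ...
    raise RuntimeError("initialization failed after max_tries attempts")
```

After four halvings the radius is 2/16 = 0.125, so the fifth try already draws from roughly the (-0.1, 0.1) range proposed above.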

Implementation isn’t difficult; all you need to do is adjust the user-supplied init_radius on each loop iteration.

I don’t like the halving strategy, though.
The current code retries up to 100 rejections, but floating-point arithmetic doesn’t really support more than 50 halvings. Actually, I’d guess that even 20 halvings remove so much entropy that you might as well zero-initialize at that point.
I propose shrinking the radius linearly until it reaches zero:

init_radius_adjusted = init_radius * (MAX_INIT_TRIES - num_init_tries) / MAX_INIT_TRIES
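A runnable sketch of a linear schedule that actually reaches zero (normalizing by MAX_INIT_TRIES so the first try uses the full user-supplied radius; names are illustrative, mirroring the current retry cap of 100):

```python
MAX_INIT_TRIES = 100  # mirrors the sampler's current retry cap

def adjusted_radius(init_radius, num_init_tries):
    """Shrink linearly: full init_radius on try 0, zero when the count hits MAX_INIT_TRIES."""
    return init_radius * (MAX_INIT_TRIES - num_init_tries) / MAX_INIT_TRIES
```

Halfway through the retries the radius is half the user-supplied value, and it hits exactly zero (a zero-initialization) when the retry count reaches MAX_INIT_TRIES.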

That’s what I thought :D

The benefit of halving is that we don’t need to retry 100 times; I think it would be fine to stop after 20 retries when using halving. I guess 100 retries was used just because [-2, 2] was such a bad choice originally, and increasing the number of retries was the wrong fix. Shrinking the radius linearly will waste more retries.


Just to clarify the above discussion . . . There are two issues:

  1. Often with the default inits Stan moves very slowly, has trouble converging, gets stuck in bad parts of parameter space. This problem sometimes goes away when we use more reasonable inits. I suspect that if we replace the default inits setting of 2.0 with a new default of 0.1, we will get better performance on average.

As a first step here, I think it could make sense to do some experimentation, starting with the examples in posteriordb, with different default inits values to see what happens.

  2. Sometimes with the default inits, Stan can never find a starting point that avoids underflow/overflow, and it just gives up. In that case it seems to make sense to try smaller inits.

The above discussion is all about issue 2, which is fine. But my concern is about issue 1.

I added a P.S. to my post above to clarify.


This may be naive, but I’ve often wondered why we can’t just have the option of a user-supplied min and max, rather than forcing [-2, 2] or [-x, x], come to that. It would be convenient at the point where my model misbehaves and I want to check that inits are not the issue. At present I have to make the jump to specifying values or a function.

Adding one data point: Right now I’m testing a posterior (model + data) with Pathfinder for which init=0.1 is bad but init=2 is good.

Is there a common instance where you know a good initialization region but it isn’t symmetric around zero on the unconstrained scale?

A good initial value vector is a point of non-negligible (and, more generally, non-flat, when something has gone wrong in model specification) posterior density. I just think it would be kind and helpful to users to give them a quick way of specifying uniform distributions for inits.
Otherwise we are telling everyone with asymmetric-about-zero posteriors (including Pathfinder init people) to change their models or write init functions.

(Important to recognise here that, with large n, and even worse with large p, likelihoods are extremely concentrated in a tiny part of parameter space, and there can be big open prairies where the log posterior is digitally rounded to -infinity. That makes this matter more to some users than others. But in extreme cases, specifying marginal uniforms is problematic because it creates an init hypercuboid, only a tiny proportion of which will contain the (typically) hyperellipsoid of useful init vectors. So people do have to go to functions or to supplying the vector(s) sooner or later; I just think this would be a nice user-centric stepping stone for not-so-Big-data users.)

Would this mean different uniform distributions for different parameters? I’m having a hard time imagining how that would be passed to Stan. If you just wanted a global but not-zero-centered uniform distribution I think we could support that relatively easily

Hi, yes I think the only use is where we specify one uniform distribution to pick inits for all parameters. Otherwise, people need to write functions or supply values.
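In the meantime, a user can already get this behavior by generating the inits outside Stan and passing them in; a Python sketch (the parameter names and shapes are made up for a hypothetical model, and the CmdStanPy call in the comment is one interface that accepts per-chain init dicts):

```python
import numpy as np

def uniform_inits(param_shapes, lower, upper, seed=None):
    """Draw every parameter's inits from the same uniform(lower, upper)."""
    rng = np.random.default_rng(seed)
    return {name: rng.uniform(lower, upper, size=shape).tolist()
            for name, shape in param_shapes.items()}

# hypothetical model with a scalar mu and a length-3 vector beta
chains = 4
inits = [uniform_inits({"mu": (), "beta": (3,)}, lower=0.5, upper=1.5, seed=k)
         for k in range(chains)]
# e.g. pass these per-chain dicts to an interface such as CmdStanPy:
#   fit = model.sample(data=data, chains=chains, inits=inits)
```

This is just the "one uniform for all parameters" idea above done by hand, so it also shows why a built-in min/max option would be a small convenience rather than a new capability.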

From a teaching perspective, it will be interesting to see how pathfinder inits develop, and whether that would be a good standard approach for beginners to adopt. If that is easy and reliable, it might be better than my idea of specifying a min and max. So you might not want to rush into this.