Why can initialization fail despite appropriate parameter scales and constraints that should support it?

I don’t have a specific model for this, but it’s happened to me a decent amount and folks report it here pretty frequently too:

I’ll have a model where all the parameters are on scales such that weakly informative priors place most of their mass in the -3 to 3 range, and with constraints that should yield a finite likelihood for any possible combination of parameter values, yet default initialization fails for all chains. Can anyone explain why this can happen? Possibly something nuanced about the gradient that wouldn’t be obvious?

If it helps at all, at least for the cases I’ve personally encountered, most of the time I can get initialization to succeed by specifying either init=0 or init=x, where x is some value smaller than the default of 2.
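(For readers landing here later: a minimal sketch of what that looks like, assuming the cmdstanr interface; the model file and data object names are hypothetical, and in rstan the analogous knob is init_r.)

```r
library(cmdstanr)

mod <- cmdstan_model("my_model.stan")   # hypothetical model file

# Default behaviour: unconstrained inits drawn uniformly from [-2, 2]
fit_default <- mod$sample(data = my_data)

# Narrower random inits: uniform on [-0.5, 0.5] on the unconstrained scale
fit_narrow <- mod$sample(data = my_data, init = 0.5)

# All unconstrained parameters initialized at 0
fit_zero <- mod$sample(data = my_data, init = 0)
```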

1 Like

Only the constraints are used for initialization, not the priors. The [-2, 2] initialization is on the unconstrained scale, so for a variable bounded below at zero the initialization looks like hist(exp(runif(1000, -2, 2))).
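(To make that concrete, a slightly expanded version of that snippet; the variable names are just illustrative, and this only covers the plain <lower=0> case, where the constraining transform is exp().)

```r
set.seed(1)

# Unconstrained initial values: uniform on [-2, 2]
z <- runif(1000, -2, 2)

# For a <lower=0> parameter, Stan maps unconstrained -> constrained with exp(),
# so the constrained-scale inits land roughly between exp(-2) ~ 0.14 and exp(2) ~ 7.4
sigma_init <- exp(z)
range(sigma_init)
hist(sigma_init, breaks = 40)
```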

The argument for keeping initializations wide rather than narrow comes from Rhat: if we start the chains farther apart from each other, then Rhat is more reliable at telling us when they have mixed.

The argument against this is that, practically, it doesn’t seem that important (we have a ton of cases where the numerics fail because of wide initialization, and not so many cases where Rhat gets tricked, though of course detecting Rhat getting tricked would be difficult).

(Edit: I wouldn’t mind narrower inits myself)

2 Likes

I’m interested to see what @avehtari has to say on this. I personally worry a lot about diagnostics getting “tricked”, as @bbbales2 puts it. But I guess this can be checked empirically (not exhaustively, and not proof of anything, mind you): run a bunch of models with a new initialisation scheme and plot the obtained Rhats against each other to look for some persistent trend, something like the sketch below.
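(Something along these lines, assuming cmdstanr plus the posterior package; compare_rhats, stan_files, and data_lists are hypothetical names, so treat this as a sketch rather than a ready-made harness.)

```r
library(cmdstanr)
library(posterior)

# Hypothetical helper: fit one model under two init schemes and return paired Rhats
compare_rhats <- function(stan_file, data_list) {
  mod <- cmdstan_model(stan_file)
  fit_wide   <- mod$sample(data = data_list, init = 2)    # current default range
  fit_narrow <- mod$sample(data = data_list, init = 0.5)  # candidate narrower scheme
  data.frame(
    rhat_wide   = summarise_draws(fit_wide$draws(), rhat)$rhat,
    rhat_narrow = summarise_draws(fit_narrow$draws(), rhat)$rhat
  )
}

# Run over a bunch of models and look for a persistent trend
# rhats <- do.call(rbind, Map(compare_rhats, stan_files, data_lists))
# plot(rhats$rhat_wide, rhats$rhat_narrow); abline(0, 1)
```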

2 Likes

@andrewgelman is a founding member of the [-1, 1] inits club. I think @avehtari is in that camp as well.

I think you’d have to try pretty hard to trick it. Aki does that here and the experiments are cool: [1903.08008] Rank-normalization, folding, and localization: An improved $\widehat{R}$ for assessing convergence of MCMC

I gained a lot of faith in multi-chain Rhat from doing the cross-chain warmup stuff. It worked really well (especially compared to trying to do a one-chain sort of have-I-mixed check). But I wasn’t really trying to break it either.

The general worry about “am I overdispersed enough for mixing” is probably kind of unresolvable too. Like, in general, are we overdispersed enough at [-2, 2] for the chains to mix reliably? Eh, maybe or maybe not, but we’re definitely dispersed enough to regularly get annoying numerical problems lol – so we’ve got to take that annoyance into account in the objective function.

2 Likes

Ah, yes, my OP didn’t reflect this aspect, but I was aware that the init range of [-2, 2] (emphasis, for future readers, to distinguish it from the declared parameter constraints; that’s a distinction I learned only later in my Stan use) is on the unconstrained scale.

But when initialization fails, I see error messages saying the log probability evaluated to -Inf, and I don’t understand how that’s even possible when the constrained-scale parameter values and priors cannot possibly yield -Inf. Am I maybe under-appreciating floating point error or something?

It’s just pretty easy for stuff to go to infinity, I think, and gradients are always a little more fragile than values. But to say anything specific here we’d have to talk about a specific model.
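(A toy R illustration of that kind of blow-up, not from any particular model in this thread: a Bernoulli likelihood where the success probability underflows to exactly zero, so the log likelihood evaluates to -Inf even though every quantity going in looked finite.)

```r
set.seed(1)
y <- rbinom(50, 1, 0.5)     # hypothetical binary outcomes

# A finite-looking linear predictor, e.g. a coefficient init of -2
# times a badly scaled covariate of 400
eta <- -2 * 400
p <- plogis(eta)            # underflows to exactly 0 in double precision

# As soon as any y == 1 meets log(0), the whole log likelihood is -Inf
sum(dbinom(y, 1, p, log = TRUE))
#> -Inf
```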

It’s totally possible too that there are fixable numerical problems with different bits of Stan – I think you’d have to start adding lots of prints and stuff to the code to find out exactly how whatever is blowing up is blowing up.

Since it’s so easy to just shrink the inits, I guess we do that most of the time instead of trying to investigate Stan numerics :D.

2 Likes

Ok, I think I observe it most frequently with hierarchical models with huge correlation matrices (even when using the Cholesky parameterization). I’m working on a case study on hierarchical models that, with a little expansion, should fall into this category, so I’ll add some prints and see if we can track down the numerics in that case at least.

2 Likes

Ooo, yeah. In this thread: Multivariate prior for hierarchical model; missing something? - #15 by Merlin_Heidemanns there was a model that wouldn’t init at [-2, 2] but also wouldn’t init at 0. It was confusing and I went off looking in the wrong places!

1 Like

Yeah… that’s how I track this problem down when I run into it. Getting back to your original questions:

I’m probably stating what you already know, but it’ll happen when:

  1. the evaluation of the model’s log density underflows to -Inf. This is just evaluation of the function value. The error messages should be good enough that this is pretty straightforward to catch, but I’m sure there are spots where we don’t point out exactly where it happens.

  2. something a little trickier… if any part of the model evaluation ends up being NaN. It’s possible to do something that evaluates to NaN without using it as part of the likelihood calculation (think of something like an unused expression). Unfortunately, this has a side effect that’s hard to debug: although the target would end up as something sensible, the autodiff stack would have extra variables on it. In certain cases, this can actually affect the gradient calculation, and you’ll see parts of the gradient go to NaN. It has to do with how the autodiff stack is traversed; it just goes from the end to the beginning without pruning any of those out.

    How I debug this: I usually knock out a bunch of code to see if it initializes, then essentially binary search back until I track down a small chunk that causes the problem. And then a ton of prints. Occasionally, I’d dig into the C++ to track something like this down, but only after I’m sure I’ve narrowed it down.

  3. a bug somewhere. But if you’re seeing this happen across multiple inits, then I’m hoping it’s not that.

The initialization happens in stages. It starts by evaluating the model without gradients, trying different initial values until it finds a place where the model evaluation is finite. Once it does, it tries that initial value with gradients. When both are ok, it starts warmup / sampling from that spot.
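(A cartoon of that staged logic in R, not the actual implementation; log_prob and log_prob_grad stand in for the model’s density and gradient evaluations, and the up-to-100-attempts limit matches the usual “Initialization between (-2, 2) failed after 100 attempts” error message.)

```r
# Cartoon of the init loop, not Stan's actual code
find_init <- function(n_params, log_prob, log_prob_grad,
                      radius = 2, max_attempts = 100) {
  for (attempt in seq_len(max_attempts)) {
    theta_unc <- runif(n_params, -radius, radius)   # draw on the unconstrained scale
    lp <- log_prob(theta_unc)                       # stage 1: value only
    if (!is.finite(lp)) next
    grad <- log_prob_grad(theta_unc)                # stage 2: gradient at that point
    if (all(is.finite(grad))) return(theta_unc)     # usable starting point found
  }
  stop(sprintf("Initialization between (-%s, %s) failed after %s attempts.",
               radius, radius, max_attempts))
}
```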

3 Likes

With more parameters and too large an init range, many of the random draws produce very unlikely parameter values, leading to underflow and other floating point issues.
The range width and location are difficult to automate completely, so if you have something that works better, use that. The Bayesian workflow paper discusses initialization issues a bit, too. Stan’s dynamic HMC is very efficient, and if the inits are not exactly equal then 1) the total variance is likely to be larger than the within-chain variance to start with, 2) the stricter threshold we suggested in the new Rhat paper makes false positives less likely, 3) the chains are usually run long enough that getting a low Rhat just by chance is unlikely, and 4) also requiring decent ESS makes the Rhats more reliable. Thus overdispersion with respect to the true posterior is likely not needed, but some variation is useful; preferably, don’t init everything at the mode.

2 Likes