The manual mentions that the L-BFGS algorithm used for optimization is deterministic. The initial values for all my model parameters are generated deterministically, and if I run the optimization twice on the same data on my workstation, I do indeed get exactly the same output. (I set the seed parameter for stan_model.optimizing() just in case, although if I understand correctly it shouldn't matter when initial values are specified.)

I was surprised to see, however, that when I run the same code inside a Docker container (roughly a virtual machine with its own OS) the results can sometimes change significantly. My model is currently ill-conditioned in the sense that optimization fails for maybe 5% of real data sets (the line search gives up), and the way I first noticed this phenomenon was that some data sets that failed in one environment didn't fail in the other. When both environments do converge to a solution, the differences in parameter values are quite small (4-5 significant figures agree), so I suppose it's conceivable that my current model gives rise to numerically unstable optimization problems. Even the initial log joint probability differed significantly (-2000 vs. -2400) in one case where one environment blew up and the other didn't, and I would assume that, at least, depends only on the data and the initial values (although possibly in a numerically unstable manner in my case).
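For concreteness, the call I'm making looks roughly like this (sm, data_dict, make_inits, and "model.stan" are placeholders standing in for my compiled model, data, and deterministic init generator, not the actual code):

```python
import pystan

# Placeholder model file and data; the real ones can't be shared.
sm = pystan.StanModel(file="model.stan")

fit = sm.optimizing(
    data=data_dict,
    init=make_inits(),   # same initial values on every run
    seed=12345,          # should be irrelevant with fixed inits, but set just in case
    algorithm="LBFGS",
)
```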
The model does produce good results whenever the optimization converges, so it’s not complete garbage, just somewhat ill-conditioned.
The Docker container should have exactly the same Python packages, in particular the same version of PyStan (2.19.1.1). The OS is different (some flavor of Linux vs. macOS 10.14.5) and things like the C++ compiler may well differ, but I optimistically assumed that details like these would at most cause insignificant differences due to the way floating point computation works. Of course, if my numerical stability hypothesis is true, then this might be enough to explain the behavior I've observed.
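In case it matters, this is roughly how I've been comparing the two environments (just a quick fingerprint, nothing model-specific):

```python
import platform
import sys

import numpy
import pystan

# Quick environment fingerprint to compare the Docker container and macOS.
print("python  :", sys.version.replace("\n", " "))
print("platform:", platform.platform())
print("pystan  :", pystan.__version__)
print("numpy   :", numpy.__version__)
```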
Unfortunately I can’t really share the model or the data, and the details would be unreasonably involved anyway. I just thought I’d check if it’s possible to say anything about the following questions in general:
Apart from numerical issues, should the optimization output depend only on the data, the initial values and the version of (py)stan that is used?
Is it typical for ill-conditioned models to produce numerically unstable optimization problems, in which case all bets are naturally off? (I realize that this is an ill-conditioned question. :) What I'm going for is that, while I'm sure it's trivial to construct a model that gives rise to a numerically unstable optimization problem, someone with experience or more understanding of numerical analysis and Stan's internals may be able to say whether this is likely to happen in practice when there are no glaring problems with parameter scaling, matrix condition numbers, etc.)
I'm quite certain that what you see in your case is due specifically to the way floating point computation works. Floating point arithmetic is not universally defined, and different parties can have different implementations. Different compilers, and even the same compiler with different compilation options, can change the order of computations, too. All of these are likely to have a big effect on ill-conditioned optimization. In the same way, even for a well-conditioned posterior, MCMC chains starting from the same point but run in different environments will quickly diverge (diverge from each other, that is, not diverge as in the HMC divergence diagnostic) as the very small differences eventually accumulate.
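A tiny illustration of the order-of-operations point (plain Python, nothing Stan-specific):

```python
# Floating point addition is not associative, so a compiler or library that
# reorders a sum can change the result.
a, b, c = 1e16, -1e16, 1.0
print((a + b) + c)  # 1.0
print(a + (b + c))  # 0.0, because 1.0 is lost when added to -1e16 first
```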
Optimization is deterministic, but you can't assume the same result bit for bit unless the hardware, OS, compiler, compiler options, Stan version, and linked libraries are all the same.
Unfortunately, yes. You should try to improve conditioning.
Great, thanks a lot. I didn't expect the results to be the same bit for bit, of course, but it hadn't occurred to me that the optimization might actually be unstable enough for the differences to be significant.
I’m still trying to figure out what exactly is causing the instability. I assume it’s related to the warnings about the gradient or the log probability being non-finite, but it’s not clear to me what would cause that here. At first I thought that the optimization might be trying to push a strictly positive parameter towards 0, but fixing that parameter to a reasonable constant value didn’t have any effect. I also tried to remove positivity constraints (just for debugging) to no avail.
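One thing I've been meaning to try is evaluating the log density and its gradient directly at my initial values to see which components blow up. Something along these lines should work, if I understand the PyStan fit object correctly (the throwaway sampling fit is only there to get access to the log_prob helpers; make_inits and data_dict are my placeholders again):

```python
import numpy as np

# Throwaway fit: only used for unconstrain_pars / log_prob / grad_log_prob,
# the draws themselves are ignored.
fit = sm.sampling(data=data_dict, chains=1, iter=10, seed=12345)

upars = fit.unconstrain_pars(make_inits())  # initial values on the unconstrained scale

print("log prob at init:", fit.log_prob(upars))

grad = np.asarray(fit.grad_log_prob(upars))
bad = np.where(~np.isfinite(grad))[0]
print("non-finite gradient components (unconstrained indices):", bad)
```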
I've reparameterized the model so that I have two normal(0,1) parameter vectors and two gamma(1,5) vectors. One of the normal vectors gets transformed into a multinormal vector whose covariance matrix is given as part of the data. I noticed that the covariance matrix has a really high condition number, about 8e6 for the Cholesky factor, but regularizing it by setting the smallest eigenvalues to something reasonable had no effect on the non-finite values. (Regularizing the matrix did speed up the optimization somewhat in the cases where it converges.)
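The regularization I tried is essentially this (the eigenvalue floor is an arbitrary value I picked, and the random SPD matrix below just stands in for my actual covariance matrix):

```python
import numpy as np

def regularize_cov(cov, eig_floor):
    """Clip the smallest eigenvalues of a symmetric covariance matrix to eig_floor."""
    eigvals, eigvecs = np.linalg.eigh(cov)
    eigvals = np.maximum(eigvals, eig_floor)
    return eigvecs @ np.diag(eigvals) @ eigvecs.T

# Random SPD matrix standing in for the covariance matrix from my data.
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 20))
cov = A @ A.T + 1e-6 * np.eye(20)

L = np.linalg.cholesky(cov)
print("cond(Cholesky factor):", np.linalg.cond(L))   # roughly sqrt of cond(cov)
print("cond(covariance)     :", np.linalg.cond(cov))

cov_reg = regularize_cov(cov, eig_floor=1e-3)
print("cond(regularized)    :", np.linalg.cond(cov_reg))
```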
Are there in general other possible explanations for this kind of behavior?
The only other possibly significant problem with the model that I’m already aware of (although I don’t know if it could cause this type of instability) is that the variables that I’m modeling as multinormal may not be quite multinormal enough. The one-dimensional marginals are all unimodal, but quite a few of them are somewhat skewed. The variables are highly correlated, and the multinormal approximation has seemed decent in the sense that if I assume multinormality and know the values of a few variables, the conditional means of the remaining variables will usually be quite close to their true values. (The mean distance from the true value to the conditional mean is orders of magnitude smaller than the marginal standard deviation, for instance.)
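The check I mean is roughly the following, where mu and cov stand in for the fitted mean and covariance, x for one observed data vector, and obs_idx for the variables I pretend to know:

```python
import numpy as np

def conditional_mean(mu, cov, x, obs_idx):
    """Mean of the unobserved components given the observed ones,
    under a multivariate normal with mean mu and covariance cov."""
    n = len(mu)
    obs = np.asarray(obs_idx)
    rest = np.setdiff1d(np.arange(n), obs)
    s_ro = cov[np.ix_(rest, obs)]  # cross-covariance, rest vs. observed
    s_oo = cov[np.ix_(obs, obs)]   # covariance of the observed block
    return mu[rest] + s_ro @ np.linalg.solve(s_oo, x[obs] - mu[obs])
```

The check is then how far conditional_mean(mu, cov, x, obs_idx) lands from the held-out true values x[rest], compared to the marginal standard deviations.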
The covariance matrix was actually estimated with another Stan model, and that model also had issues that I interpreted to mean it just wasn't a very good fit for the data from Stan's perspective. Sampling was slow and there were N_eff and R_hat issues. (I couldn't figure out how to get more than a boolean flag out of PyStan's HMC diagnostics, but N_eff and R_hat were flagged and the others weren't.) Optimization was relatively quick and had no issues, and the ML estimate returned by the optimization looked visually identical to the mean of the sampled matrices. (A few variables had some missing values that were also modeled, so it's possible the issues had to do with that, but at least the inferred values for the missing values seemed reasonable.)
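(For the N_eff and R_hat part I can at least pull per-parameter numbers out of fit.summary() instead of relying on the boolean flags, roughly like this; it's the sampler-specific checks where I only have the flags:)

```python
import numpy as np

# Per-parameter n_eff and Rhat from a PyStan 2 sampling fit.
s = fit.summary()
summary = np.asarray(s["summary"])
cols = list(s["summary_colnames"])
rows = list(s["summary_rownames"])

n_eff = summary[:, cols.index("n_eff")]
rhat = summary[:, cols.index("Rhat")]

# Show the ten parameters with the worst Rhat.
worst = np.argsort(rhat)[::-1][:10]
for i in worst:
    print(f"{rows[i]:20s}  n_eff={n_eff[i]:8.1f}  Rhat={rhat[i]:6.3f}")
```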