The manual mentions that the L-BFGS algorithm used for optimization is deterministic. I generate the initial values for all my model parameters deterministically, and if I run the optimization twice on the same data on my workstation, I do get exactly the same output. (I also set the seed parameter for stan_model.optimizing() just in case, although if I understand correctly it shouldn't matter when initial values are specified explicitly.)

I was surprised to see, however, that when I run the same code inside a Docker container (roughly, a virtual machine with its own OS), the results can sometimes change significantly. My model is currently ill-conditioned in the sense that optimization fails for maybe 5% of real data sets (the line search gives up), and I first noticed this phenomenon because some data sets that failed in one environment didn't fail in the other. When both environments do converge to a solution, the differences in parameter values are quite small (the first 4-5 significant figures agree), so it's conceivable that my current model gives rise to numerically unstable optimization problems. Even the initial log joint probability differed substantially (-2000 vs. -2400) in one case where one environment blew up and the other didn't, and I would assume that, at least, depends only on the data and the initial values (although possibly in a numerically unstable manner in my case).
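For concreteness, my setup looks roughly like the sketch below (the parameter names, and the `sm`/`stan_data` objects in the comment, are placeholders, not my real model; the actual call needs a compiled PyStan 2.19 model):

```python
import numpy as np

def make_inits(seed=2019):
    # Deterministic initial values from a fixed seed; "mu" and
    # "sigma" are placeholder parameter names for illustration.
    rng = np.random.RandomState(seed)
    return {"mu": float(rng.normal(0.0, 1.0)),
            "sigma": float(np.exp(rng.normal(0.0, 0.5)))}

# With a compiled model `sm` and a data dict `stan_data`:
# fit = sm.optimizing(data=stan_data, init=make_inits(),
#                     seed=2019, algorithm="LBFGS")
```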

The model does produce good results whenever the optimization converges, so it’s not complete garbage, just somewhat ill-conditioned.

The Docker container should have exactly the same Python packages, in particular the same version of PyStan (2.19.1.1). The OS is different (some flavor of Linux vs. macOS 10.14.5) and things like the C++ compiler may well differ, but I optimistically assumed that details like these would at most cause insignificant differences due to the way floating-point arithmetic works. Of course, if my numerical-instability hypothesis is true, then this might be enough to explain the behavior I've observed.
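For what it's worth, I understand that different compilers (or the same compiler with different flags or SIMD settings) are free to evaluate floating-point expressions in different orders, and reordering alone is enough to change the last bits of a result, since floating-point addition is not associative:

```python
# Floating-point addition is not associative, so a change in
# evaluation order (different compiler, optimization flags, or
# vectorization) can perturb results in the last ulp.
a = (0.1 + 0.2) + 0.3
b = 0.1 + (0.2 + 0.3)
print(a == b)  # False: the two orderings give different doubles
```

On a well-conditioned problem such last-ulp differences should stay negligible, but they give the two environments slightly different starting points for every downstream computation.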

Unfortunately I can’t really share the model or the data, and the details would be unreasonably involved anyway. I just thought I’d check if it’s possible to say anything about the following questions in general:

- Apart from numerical issues, should the optimization output depend only on the data, the initial values and the version of (py)stan that is used?
- Is it typical for ill-conditioned models to produce numerically unstable optimization problems, in which case all bets are naturally off? (I realize this is an ill-conditioned question. :) What I'm going for is that, while I'm sure it's trivial to construct a model that gives rise to a numerically unstable optimization problem, someone with experience or a deeper understanding of numerical analysis and Stan's internals may be able to say whether this is likely to happen in practice when there are no glaring problems with parameter scaling, matrix condition numbers, etc.)
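To illustrate the kind of amplification I have in mind (a toy quadratic, nothing to do with my actual model): for an ill-conditioned problem, an input perturbation on the order of rounding noise can move the optimum by many orders of magnitude more, roughly in proportion to the condition number.

```python
import numpy as np

# Minimizing 0.5*x'Ax - b'x has optimum x = A^{-1} b.
# A is nearly singular (condition number ~ 4e10), so a tiny
# perturbation of b moves the optimum enormously.
A = np.array([[1.0, 1.0],
              [1.0, 1.0 + 1e-10]])
b = np.array([2.0, 2.0])
x = np.linalg.solve(A, b)                         # optimum ~ [2, 0]
x_pert = np.linalg.solve(A, b + np.array([0.0, 1e-8]))

print(np.linalg.cond(A))           # ~ 4e10
print(np.linalg.norm(x_pert - x))  # ~ 141: a 1e-8 input change
                                   # moves the optimum by O(100)
```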