Weird inconsistent behavior between OSX and Linux cluster on same Stan model

Hi all,

I’m struggling to troubleshoot a model on a Linux cluster and I’m utterly stumped. The model and data are large and complicated, and I’m not sure whether the problem will be reproducible on other systems or not. I’m posting here partly because I wonder if my weirdness exposes a subtlety in the way that Stan interacts with different systems that would be useful for people to know about.

Here’s a summary of the weirdness:

  • I have one version of the model (v1) that works locally on Mojave and works fine on the cluster.
  • I have a second version (v2) that should be more computationally efficient, but otherwise should encode the same posterior (assuming it’s not buggy).
  • When I run v2 on the cluster, it compiles and samples without complaint. However, the step-size in early exploration quickly and reliably crashes to an arbitrarily small value and never recovers. This is consistent across multiple runs. Thus, clearly either the model is encoding the wrong posterior geometry or it’s running into a bizarre numerical issue.
  • However, when I run v2 locally on Mojave, everything works absolutely fine. Sampling proceeds as expected, and the step-size adaptation behaves just like v1. Moreover, the posterior recovered from v2 is similar to that from v1. (I’ve not been able to thoroughly evaluate whether the posteriors are strictly the same to within MCMC error because the model contains 2e+6 parameters and I don’t yet have a finished run of v2 locally; it takes days.)
  • In both cases, I am running cmdstan 2.26.1 from cmdstanr. The make/local are identical, and cmdstan is rebuilt after updating make/local. The cmdstanr calls to $sample() are also identical, save for the number of threads per chain. The data are identical.
  • The compiled executables on the two systems are not at all identical. Perhaps this is expected, since the C++ compilers are different (GCC 9.3.0 on Linux; Apple clang version 11.0.0 (clang-1100.0.33.17) on OSX) and the executables are being compiled to run on different OS’s. However, the compiled executables are really different. The Linux version of v2_threads is 3.0 MB, the OSX version is 4.1 MB.
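The step-size collapse described in the list above can be checked directly from the sampler output. Here is a minimal sketch, assuming the run saved its warmup draws and that the output follows CmdStan's usual CSV layout (comment lines starting with `#`, one header row, and a `stepsize__` column). The function names, the `floor` threshold, and the example rows are hypothetical and made up for illustration; they are not taken from the actual runs.

```python
import csv


def stepsize_trace(csv_lines):
    """Return the stepsize__ column from CmdStan-style CSV lines as floats."""
    # Skip blank lines and '#' comment lines, which can appear anywhere in the file.
    rows = [ln for ln in csv_lines if ln.strip() and not ln.startswith("#")]
    reader = csv.DictReader(rows)
    return [float(row["stepsize__"]) for row in reader]


def has_collapsed(stepsizes, floor=1e-8):
    """Hypothetical check: True if the last recorded step size is below `floor`."""
    return bool(stepsizes) and stepsizes[-1] < floor


# Made-up example rows, only to show the expected shape of the input.
example = [
    "# illustrative comment line",
    "lp__,accept_stat__,stepsize__",
    "-10.0,0.9,0.1",
    "-12.0,0.2,0.001",
    "-12.5,0.0,1e-12",
]
trace = stepsize_trace(example)
print(trace, has_collapsed(trace))
```

Running the same check on the output from both systems would show at which warmup iteration the two step-size traces start to diverge.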

Has anybody seen behavior like this before? Is there something system-related (not Stan-related) that I should be doing on Linux to deal with the problem? Am I probably just being dumb somewhere? I’ll mention that this particular model is big and complicated, with over 2e+6 parameters, and it’s broken intuitions about Stan before (an earlier version of this model originally exposed the sticky boundary issue with offset/multiplier).

Here’s v1

And here’s v2

Data are shareable through private channels.

Cheers
Jacob

If the reduce sums should return the same thing (maybe they are off a bit because of the optimization, but they should be close), you can check this with:

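generated quantities {
  // reduce_sum_A() and reduce_sum_B() are placeholders for the
  // log density calculations from the two versions of the model.
  real lpA = reduce_sum_A();
  real lpB = reduce_sum_B();
  real diff = lpA - lpB;  // should be near zero, up to floating point error
}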

If they’re much different (not just floating point error), then maybe there is a difference in the model.
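To make "much different" concrete, here is a minimal sketch, assuming the draws of `lpA` and `lpB` have been pulled out of the fit into two plain Python lists. The function name, the tolerances, and the example numbers are hypothetical. A relative tolerance is used because a log density summed over millions of terms can be large in magnitude, so a fixed absolute cutoff may be too strict.

```python
import math


def compare_log_densities(lp_a, lp_b, rel_tol=1e-8, abs_tol=1e-8):
    """Return (max_abs_diff, all_close) for paired log-density draws."""
    if len(lp_a) != len(lp_b):
        raise ValueError("lp_a and lp_b must have the same number of draws")
    # Largest absolute gap between the two versions across all draws.
    max_abs_diff = max((abs(a - b) for a, b in zip(lp_a, lp_b)), default=0.0)
    # True only if every pair agrees to within the given tolerances.
    all_close = all(
        math.isclose(a, b, rel_tol=rel_tol, abs_tol=abs_tol)
        for a, b in zip(lp_a, lp_b)
    )
    return max_abs_diff, all_close


# Made-up values for illustration only.
lp_a = [-1234.5678, -1240.0012]
lp_b_rounding = [-1234.5678000001, -1240.0012000002]  # differs by rounding only
lp_b_shifted = [-1230.1, -1251.3]                     # differs by whole units

print(compare_log_densities(lp_a, lp_b_rounding))
print(compare_log_densities(lp_a, lp_b_shifted))
```

If `all_close` is false for more than a handful of draws, the two versions are probably not computing the same target. The tolerances here are a guess and may need loosening if the two versions sum terms in a different order.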

I don’t know how to dig into the Mojave vs. Cluster differences. It seems easier to try to figure out if there’s a difference in how target is computed in model 1 and model 2.

Turns out there was an error in v2, and now both v1 and v2 run as expected both locally and on the cluster. However, there’s still something deeply weird going on. The old v2 returns reasonable fits on my desktop running Mojave, but when the identical model is run with identical data (and identical compiler flags) on a Linux cluster, the stepsize reliably crashes to arbitrarily small values during the early part of warmup and just keeps going down.

Since this is no longer relevant to my own modeling needs I won’t pursue this further unless somebody else is interested.

By the way, thanks so much @bbbales2 for the advice on how to check that the models are the same. I would never have caught this otherwise, because the models are similar enough to imagine that any differences between them are due to poor mixing.
