Weird inconsistent behavior between OSX and linux cluster on same Stan model

jsocolar · April 9, 2021, 7:58pm

Hi all,

I’m struggling to troubleshoot a model on a Linux cluster and I’m utterly stumped. The model and data are large and complicated, and I’m not sure whether the problem will be reproducible on other systems or not. I’m posting here partly because I wonder if my weirdness exposes a subtlety in the way that Stan interacts with different systems that would be useful for people to know about.

Here’s a summary of the weirdness:

I have one version of the model (v1) that works locally on Mojave and works fine on the cluster.
I have a second version (v2) that should be more computationally efficient, but otherwise should encode the same posterior (assuming it’s not buggy).
When I run v2 on the cluster, it compiles and samples without complaint. However, the step-size in early exploration quickly and reliably crashes to an arbitrarily small value and never recovers. This is consistent across multiple runs. Thus, clearly either the model is encoding the wrong posterior geometry or it’s running into a bizarre numerical issue.
However, when I run v2 locally on Mojave, everything works absolutely fine. Sampling proceeds as expected, and the step-size adaptation behaves just like v1. Moreover, the posterior recovered from v2 is similar to that from v1. (I’ve not been able to thoroughly evaluate whether the posteriors are strictly the same to within MCMC error because the model contains 2e+6 parameters and I don’t yet have a finished run of v2 locally; it takes days.)
In both cases, I am running cmdstan 2.26.1 from cmdstanr. The make/local are identical, and cmdstan is rebuilt after updating make/local. The cmdstanr calls to $sample() are also identical, save for the number of threads per chain. The data are identical.
The compiled executables on the two systems are not at all identical. Perhaps this is expected, since the C++ compilers are different (GCC 9.3.0 on Linux; Apple clang version 11.0.0 (clang-1100.0.33.17) on OSX) and the executables are being compiled to run on different operating systems. However, the compiled executables are really different: the Linux version of v2_threads is 3.0 MB, while the OSX version is 4.1 MB.
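
One way to put a number on the step-size collapse described in the summary above is to read the `stepsize__` column straight from the CmdStan output CSV of each system and compare the two. The sketch below is a minimal illustration, not part of the original workflow: the function names and the `floor` threshold are hypothetical, and it assumes the run was made with warmup draws saved (for example `save_warmup = TRUE` in cmdstanr) so that the early step sizes are present in the file.

```python
import csv
import io


def read_stepsizes(csv_text):
    """Extract the stepsize__ column from the text of a CmdStan output CSV."""
    # CmdStan writes configuration and adaptation details as lines starting with '#';
    # drop them (and blank lines) so only the header row and the draws remain.
    rows = [
        line for line in csv_text.splitlines()
        if line and not line.startswith("#")
    ]
    reader = csv.DictReader(io.StringIO("\n".join(rows)))
    return [float(row["stepsize__"]) for row in reader]


def stepsize_collapsed(stepsizes, floor=1e-6):
    """Flag a run whose most recent step size is below `floor`.

    `floor` is an arbitrary illustrative threshold, not a Stan default.
    """
    return bool(stepsizes) and stepsizes[-1] < floor
```

Running `read_stepsizes` on one chain’s CSV from each machine and plotting the two sequences against iteration number would show at which point during warmup the Linux run starts to diverge from the OSX run.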

Has anybody seen behavior like this before? Is there something system-related (not Stan-related) that I should be doing on Linux to deal with the problem? Am I probably just being dumb somewhere? I’ll mention that this particular model is big and complicated, with over 2e+6 parameters, and it’s broken intuitions about Stan before (an earlier version of this model originally exposed the sticky boundary issue with offset/multiplier).

Here’s v1

github.com

jsocolar/colombiaBeta/blob/master/stan_files/full_colombia_model/occupancy_v6_1.stan

// This is a Stan model for the full Colombia bird dataset, version 6.1
// Changes:   Include spatial effect of subregion, include effect of observer:species
//            Combined variances for spatial effects and taxonomic effects
//            Remove effects of floodSpecialist
//            Remove range-restriction effects in favor of just the "barrier effects"
//            Zero-center all random effects
//            Switch from offset/multiplier to transformed parameters for noncentering 
//              to avoid "sticky boundaries."  See https://discourse.mc-stan.org/t/offset-multiplier-initialization/20712
//            Better priors that interact as desired with effects coding
//            IMPORTANT: data structure changes to effects coding for this model!
//            Removal of offset multiplier soft-constraints (significant reparameterization 
//              probably invalidates some of the old soft-constraints)
//            Better consistent formatting throughout, and slight rearrangement of 
//              parameters and order of partial_sum arguments to keep biogeographic 
//              parameters together in one block

functions{
  real partial_sum(
    // Function arguments:
      // Data slicing and indexing

This file has been truncated.

And here’s v2

github.com

jsocolar/colombiaBeta/blob/master/stan_files/full_colombia_model/occupancy_v9.stan

// This is a Stan model for the full Colombia bird dataset, version 9.0, which is built on 7.0
// Changes:   switching to slice the occupancy intercept
//            better naming of data containers

// function to form a matrix with the same dimensions as ind, whose elements i,j are given by cov_u[ind[i,j]].
// This strategy for vectorizing the operation is due to Juho Timonen in Stan Slack post on 3 March 2021.
functions{
    matrix rt_mat(
    int r, // number of rows
    int c, // number of columns
    int[,] ind, // indices
    vector cov_u // unique covariate values
  ){
    int a_flat[r*c] = to_array_1d(ind);
    matrix[r,c] out = to_matrix(cov_u[a_flat], r, c);
    return(out);
  }
  
  
  real partial_sum(

This file has been truncated.

Data are shareable through private channels.

Cheers
Jacob

bbbales2 · April 10, 2021, 7:03pm

If the two reduce_sum calls should return the same thing (maybe they are off a bit because of the optimization, but they should be close), you can check this with:

generated quantities {
  real lpA = reduce_sum_A();
  real lpB = reduce_sum_B();
  real diff = lpA - lpB;
}

If they’re much different (not just floating point error), then maybe there is a difference in the model.
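
To make “not just floating point error” concrete, the saved draws of `lpA` and `lpB` can be compared with a relative tolerance rather than by eyeballing `diff`. This is only a sketch: the function name is hypothetical and the tolerances are assumptions that would need loosening for a target as large as this one, where accumulated round-off across millions of terms can be bigger than the defaults allow.

```python
import math


def lp_diff_is_roundoff(lp_a, lp_b, rel_tol=1e-8, abs_tol=1e-8):
    """Return True when every paired draw of lpA and lpB agrees within tolerance.

    rel_tol and abs_tol are illustrative defaults, not values recommended by Stan.
    """
    if len(lp_a) != len(lp_b):
        raise ValueError("lpA and lpB must have one value per draw")
    return all(
        math.isclose(a, b, rel_tol=rel_tol, abs_tol=abs_tol)
        for a, b in zip(lp_a, lp_b)
    )
```

If this returns False for even a handful of draws, the two versions are probably not encoding the same target, and the next step is to look for the term that differs.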

I don’t know how to dig into the Mojave vs. Cluster differences. It seems easier to try to figure out if there’s a difference in how target is computed in model 1 and model 2.

jsocolar · April 15, 2021, 4:44pm

Turns out there was an error in v2, and now both v1 and v2 run as expected both locally and on the cluster. However, there’s still something deeply weird going on. The old v2 returns reasonable fits on my desktop running Mojave, but when the identical model is run with identical data (and identical compiler flags) on a Linux cluster, the stepsize reliably crashes to arbitrarily small values during the early part of warmup and just keeps going down.

Since this is no longer relevant to my own modeling needs I won’t pursue this further unless somebody else is interested.

By the way, thanks so much @bbbales2 for the advice on how to check that the models are the same. I would never have caught this otherwise, because the models are similar enough to imagine that any differences between them are due to poor mixing.
