Inconsistent posteriors when changing number of threads (map_rect)


#1

Hello Stanimals (in particular @wds15) ,

when using map_rect with threading support, shouldn’t the posterior have no dependence (modulo differences due to numerical precision) on the number of threads (leaving the data, random number seed and number of chains constant in this comparison)?

Of course, the number of shards explicitly appears in such Stan code, but the number of threads doesn’t. So I would expect that my posterior should not be sensitive to a change in the number of threads while leaving the number of shards constant (obviously this does not apply to the run time).

Have a look at this figure, which is based on an experiment that I did varying the number of threads in a simple survival model:

Here the intercept_{1,2,3,4} correspond to running the same model, with the same data and the same seed, just with different values of STAN_NUM_THREADS=1,2,3,4. intercept_0 is a separate model, written without map_rect, that uses the same seed, data and number of chains, as a reference.

For more details, please have a look at this R reproducible R example. Also, in my repository here you also find the corresponding Stan models.

Maybe I did something wrong here in my code which causes this, but I can’t see it…

Cheers,

Eren.


#2

I have coded up map_rect such that it will not matter how many threads you use or if you use any of the parallel backends. The only difference is using it or not.

However, not using map_rect results in a different AD tree and then we expect to get numerically different results. Now, the differences you are seeing are quite large and raise questions. I mean, is the model really stable? Did the adaptation end up in the same tuning parameters? Did you see the same behavior with MPI? Do you have a run for comparison which uses map_rect, but has threading turned off (ahh… this is the intercept_0, right)?

Glossing through your code it looks ok… maybe also set init=0 and only use a single chain for the moment.

So, yes… this should not happen and we should look into it.


#3

Thanks!

Would the plot below help answering this affirmatively? (this time with only 1 chain and init=0 and iter=20000).

Would the info from get_adaptation_info help?

"Without map_rect:  # Adaptation terminated\n# Step size = 0.630663\n# Diagonal elements of inverse mass matrix:\n# 0.00208629, 0.000758928\n
"STAN_NUM_THREADS=1:# Adaptation terminated\n# Step size = 0.555883\n# Diagonal elements of inverse mass matrix:\n# 0.00203831, 0.000744684\n
"STAN_NUM_THREADS=2:# Adaptation terminated\n# Step size = 0.664499\n# Diagonal elements of inverse mass matrix:\n# 0.00196825, 0.000722572\n
"STAN_NUM_THREADS=3:# Adaptation terminated\n# Step size = 0.555883\n# Diagonal elements of inverse mass matrix:\n# 0.00203831, 0.000744684\n"
"STAN_NUM_THREADS=4:# Adaptation terminated\n# Step size = 0.555883\n# Diagonal elements of inverse mass matrix:\n# 0.00203831, 0.000744684\n

I do not have MPI installed/configured.

No, the intercept_0 case is the model exponential_survival.stan which does not use map_rect. The intercept_1 corresponds to 3 shards and STAN_NUM_THREADS=1, I guess this means threading turned off.

Yes did that (see also above and I also updated the repository accordingly). Still, the situation doesn’t change:


#4

Can you try fixing the seed? It should run the same with the same seed, right @wds15?


#5

Don’t I do this already, see here?


#6

Yes. Missed it (hard to tell since its buried).

Can you check one thing for me: if you run one of those commands again with the exact same seed, does it reproduce exactly? (If you extract draws in order, they should all be exactly the same as a previous run.)


#7

Did so, i.e.

Sys.setenv(STAN_NUM_THREADS=2)
fit2 <- sampling(sm, 
                data=stan_data, 
                seed=42, 
                init=0,
                chains=1, 
                cores=1, 
                iter=20000)
#############################################################################
Sys.setenv(STAN_NUM_THREADS=2)
fit2_2 <- sampling(sm, 
                 data=stan_data, 
                 seed=42, 
                 init=0,
                 chains=1, 
                 cores=1, 
                 iter=20000)
all.equal(as.matrix(fit2), as.matrix(fit2_2))

leads to

[1] TRUE

#8

Thanks for the reproducible example. I’ve verified that I get the weird results… I’m digging to see if there’s anything obvious.


#9

I don’t know if this sounds crazy, but I’m running from CmdStan and seeing that everything lines up identically for threads = 1, 3, 4, 5. But… not for threads = 2.


#10

@wds15, I might need your help. Looks like something isn’t right. This just happened while I was running it.

exponential_survival_parallel_2(7299,0x70000f2d3000) malloc: *** error for object 0x7fa6274073d0: pointer being freed was not allocated

exponential_survival_parallel_2(7299,0x70000f2d3000) malloc: *** set a breakpoint in malloc_error_break to debug



Abort trap: 6

#11

@wds15: that might be because I put print(...) inside the function that’s being mapped. Is that not safe?


#12

@ermeel: I think you’ve stumbled on a legit bug. I think it happens when STAN_NUM_THREADS = 2. I think I have enough to create a reproducible bug.


#13

Can you file an issue with that reproducible example? I will look at it.


#14

Yes, will put something up on both Stan and Math. The issue is with Math, but I’ll file an issue in Stan too so that we know that it’s there.

Here’s the example (I’ll put more detail in the issue):

functions {
  vector mapped_function(vector beta, vector theta, real[] x, int [] y) {
    return [0]';
  }
}
data {
}
transformed data {
  int shards = 3;
  vector[0] theta[shards];
  real x_r[shards, 0];
  int x_i[shards, 0];
}
parameters {
  vector[1] beta;
}
transformed parameters {
  vector[shards] lps = map_rect(mapped_function, beta, theta, x_r, x_i);
}
model {
  beta ~ normal(0, 1);
  target += lps;
}

with STAN_NUM_THREADS=2 this is what happens:

$ STAN_NUM_THREADS=2 ./parallel sample
num_threads = 2
method = sample (Default)
  sample
    num_samples = 1000 (Default)
    num_warmup = 1000 (Default)
    save_warmup = 0 (Default)
    thin = 1 (Default)
    adapt
      engaged = 1 (Default)
      gamma = 0.050000000000000003 (Default)
      delta = 0.80000000000000004 (Default)
      kappa = 0.75 (Default)
      t0 = 10 (Default)
      init_buffer = 75 (Default)
      term_buffer = 50 (Default)
      window = 25 (Default)
    algorithm = hmc (Default)
      hmc
        engine = nuts (Default)
          nuts
            max_depth = 10 (Default)
        metric = diag_e (Default)
        metric_file =  (Default)
        stepsize = 1 (Default)
        stepsize_jitter = 0 (Default)
id = 0 (Default)
data
  file =  (Default)
init = 2 (Default)
random
  seed = 2186709839
output
  file = output.csv (Default)
  diagnostic_file =  (Default)
  refresh = 100 (Default)

Unrecoverable error evaluating the log probability at the initial value.
Exception: assign: Rows of left-hand-side (21) and rows of right-hand-side (20) must match in size  (in '../withinchain_surv_stan/parallel.stan' at line 18)

Exception: assign: Rows of left-hand-side (21) and rows of right-hand-side (20) must match in size  (in '../withinchain_surv_stan/parallel.stan' at line 18)

#15

FYI… using MPI parallelism seems to work just fine and is not affected by this scheduling problem.