2.5x speed difference in runs of the identical model

I have been working on some performance improvements to a model. By accident, I started two runs of the same model, and noticed hugely different performance. One finished in 7 hours, the other took 19 hours.

As far as I can tell these should have been identical, and am looking for other sources of variation that I have not thought of. Here are some details:

  • There are huge differences in runtime:
$ tail run_r.sh.o52028819 run_r.sh.o52028600
==> run_r.sh.o52028819 <==

Iteration: 475 / 500 [ 95%]  (Sampling)
Iteration: 500 / 500 [100%]  (Sampling)

 Elapsed Time: 45921 seconds (Warm-up)
               21372.7 seconds (Sampling)
               67293.7 seconds (Total)

==> run_r.sh.o52028600 <==

Iteration: 475 / 500 [ 95%]  (Sampling)
Iteration: 500 / 500 [100%]  (Sampling)

 Elapsed Time: 14974.4 seconds (Warm-up)
               9979.23 seconds (Sampling)
               24953.7 seconds (Total)

  • They were started from the same seed
  • The md5sum of the parameters are the same
$ grep -v '^\#' out/stan_1_52028600*1.csv| md5sum           
d62450961f980a338170972314b60434  -
$ grep -v '^\#' out/stan_1_52028819*1.csv| md5sum
d62450961f980a338170972314b60434  -
  • They were started on the cluster with identical submit commands:
< submit_cmd:                 qsub -v SCRIPT=run_stan.R -l mfree=6G -l h_rt=3:0:0:0 -j y -cwd -q short.q run_r.sh
< start_time            1:    05/01/2018 14:56:44.286
> submit_cmd:                 qsub -v SCRIPT=run_stan.R, -l mfree=6G -l h_rt=3:0:0:0 -j y -cwd -q short.q run_r.sh
> start_time            1:    05/01/2018 14:04:39.937
  • They were run on different hosts, but the hosts have identical architectures, if I do a diff on /proc/cpuinfo on the two hosts, it is only minor things like this.
< cpu MHz		: 2494.035
> cpu MHz		: 2494.365
< bogomips	: 4988.08
> bogomips	: 4988.09
  • They both were compiled with identical CXXFLAGS:
@ cxxflags    : chr "CXXFLAGS = -g -O3 -march=native $(LTO)"
  • The delay is consistent between chains. I ran four chains on each in parrallel, on the slow run, the chains finished within 80 minutes of eachother, on the fast run the chains finished within 40 minutes of each other.
  • I am launching the chains in RStan, and then saving the results to csv.
  • I pulled the stan code out from the stan objects to confirm that they are the same:
$ md5sum 52028819.stan 52028600.stan
4127c58a8d3b1258f8b4bb0e8e37af8c  52028819.stan
4127c58a8d3b1258f8b4bb0e8e37af8c  52028600.stan

Any idea what is going on? If there is anyway to make sure I don’t end up with the slow, chains, I’d like to know. Also, with this kind of variability in the same model, it will be impossible to find changes that improve performance.

1 Like

Do you have LTO set to one thing on one host and something different on another?

I don’t have LTO set on either host. Actually, looking closer, I’m not sure that R got the CXXFLAGS correct. I am setting them using R_MAKEVARS_USER, and if I recall correctly, this isn’t actually captured in the saved stan output. These are the CXXFLAGS that I am setting:

CXXFLAGS =            -O3 -pipe -Wno-unused -march=native -DEIGEN_NO_DEBUG \

Actually, I have some custom caching code, so it is compiled the first time (which ended up being the faster run), and then reloaded from the saved RDS for the second run.

Check the distribution of number of leapfrog steps – if your model is tricky for HMC to explore then adaptation can be variable and you can get some chains that encounter nastier geometry earlier in their explorations running much longer than others.

1 Like

The differences were between successive runs with the same seeds, so the distribution of leapfrog steps (and samples generated) were identical.

I ran it again, this time there was only a 25% difference in runtime between the two runs. Still not sure what was going on before, and this still seems somewhat higher than expected…

Is your cluster configured right? Were other processes competing with Stan?

Good question! I emailed the cluster admin with it. Apparently the two nodes were on different chassis, one of which had an old firmware, that would sometimes put a power cap on the node which can throttle the max CPU. Sounds like that was the culprit!