I have been working on some performance improvements to a model. By accident, I started two runs of the same model, and noticed hugely different performance. One finished in 7 hours, the other took 19 hours.
As far as I can tell these runs should have been identical, and I am looking for other sources of variation that I have not thought of. Here are some details:
- There are huge differences in runtime:
$ tail run_r.sh.o52028819 run_r.sh.o52028600
==> run_r.sh.o52028819 <==
Iteration: 475 / 500 [ 95%] (Sampling)
Iteration: 500 / 500 [100%] (Sampling)
Elapsed Time: 45921 seconds (Warm-up)
21372.7 seconds (Sampling)
67293.7 seconds (Total)
==> run_r.sh.o52028600 <==
Iteration: 475 / 500 [ 95%] (Sampling)
Iteration: 500 / 500 [100%] (Sampling)
Elapsed Time: 14974.4 seconds (Warm-up)
9979.23 seconds (Sampling)
24953.7 seconds (Total)
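The same timings can also be pulled out of the saved fit objects in R. A minimal sketch, assuming the fits were saved as .rds files (the paths are placeholders):

library(rstan)

# Placeholder paths to the saved stanfit objects of the two runs
fit_slow <- readRDS("out/fit_52028819.rds")
fit_fast <- readRDS("out/fit_52028600.rds")

# get_elapsed_time() gives one row per chain with "warmup" and "sample"
# columns, in seconds
get_elapsed_time(fit_slow)
get_elapsed_time(fit_fast)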
- They were started from the same seed
- The md5sums of the sampled parameters are the same:
$ grep -v '^\#' out/stan_1_52028600*1.csv| md5sum
d62450961f980a338170972314b60434 -
$ grep -v '^\#' out/stan_1_52028819*1.csv| md5sum
d62450961f980a338170972314b60434 -
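The same check can be done in R by reading the CSVs with the comment lines skipped; a sketch with placeholder file names standing in for the globs above:

# Read one chain's CSV from each run, ignoring the "#" comment lines
csv_fast <- read.csv("out/fast_run_chain1.csv", comment.char = "#")
csv_slow <- read.csv("out/slow_run_chain1.csv", comment.char = "#")

# TRUE if the sampled parameters are identical
identical(csv_fast, csv_slow)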
- They were started on the cluster with identical submit commands:
< submit_cmd: qsub -v SCRIPT=run_stan.R -l mfree=6G -l h_rt=3:0:0:0 -j y -cwd -q short.q run_r.sh
< start_time 1: 05/01/2018 14:56:44.286
---
> submit_cmd: qsub -v SCRIPT=run_stan.R, -l mfree=6G -l h_rt=3:0:0:0 -j y -cwd -q short.q run_r.sh
> start_time 1: 05/01/2018 14:04:39.937
- They were run on different hosts, but the hosts have identical architectures. If I diff /proc/cpuinfo on the two hosts, the only differences are minor things like this:
< cpu MHz : 2494.035
---
> cpu MHz : 2494.365
1243c1243
< bogomips : 4988.08
---
> bogomips : 4988.09
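In case host-level differences do matter, something like this sketch could be dropped into the job script to record which machine and CPU each run landed on (the output path is a placeholder):

# Log the host name and CPU details from inside the job, so slow runs
# can be matched to a machine afterwards
cpuinfo <- readLines("/proc/cpuinfo")
host_info <- list(
  node  = Sys.info()[["nodename"]],
  model = grep("^model name", cpuinfo, value = TRUE)[1],
  mhz   = grep("^cpu MHz", cpuinfo, value = TRUE)[1]
)
saveRDS(host_info, paste0("out/hostinfo_", Sys.getpid(), ".rds"))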
- They were both compiled with identical CXXFLAGS:
@ cxxflags : chr "CXXFLAGS = -g -O3 -march=native $(LTO)"
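Assuming the flags are stored on the compiled model's DSO slot (which is what the str()-style line above suggests), they can be compared directly from the loaded fit objects:

# Compiler flags recorded with each compiled model
# (slot path is an assumption based on the output above)
fit_slow@stanmodel@dso@cxxflags
fit_fast@stanmodel@dso@cxxflags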
- The delay is consistent between chains. I ran four chains in parallel on each run; on the slow run the chains finished within 80 minutes of each other, and on the fast run they finished within 40 minutes of each other.
- I am launching the chains in RStan and then saving the results to CSV (a rough sketch of this is below the list).
- I pulled the Stan code out of the fit objects to confirm that it is identical:
$ md5sum 52028819.stan 52028600.stan
4127c58a8d3b1258f8b4bb0e8e37af8c 52028819.stan
4127c58a8d3b1258f8b4bb0e8e37af8c 52028600.stan
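For reference, here is a minimal sketch of the kind of launch script I mean; the model file, seed value, and output prefix are placeholders, not my actual values:

library(rstan)

fit <- stan(
  file        = "model.stan",     # placeholder model file
  seed        = 12345,            # the same fixed seed for both runs
  chains      = 4,
  cores       = 4,                # four chains in parallel
  iter        = 500,
  sample_file = "out/stan_1.csv"  # rstan writes one CSV per chain from this name
)
saveRDS(fit, "out/fit.rds")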
Any idea what is going on? If there is any way to make sure I don't end up with the slow chains, I'd like to know. Also, with this kind of variability in the same model, it will be impossible to find changes that improve performance.