Yep, I grabbed the repo and used just the CmdStan models, with CmdStan 2.25.0 and the A/C adapter plugged in. One chain with logistic0 finishes with
Elapsed Time: 23.943 seconds (Warm-up)
              20.146 seconds (Sampling)
              44.089 seconds (Total)
and logistic1
Elapsed Time: 30.971 seconds (Warm-up)
              35.471 seconds (Sampling)
              66.442 seconds (Total)
This is with the default build options. I then added to make/local, perhaps unnecessarily,
CXX14FLAGS = -DSTAN_THREADS -pthread
CXX14FLAGS += -O3 -march=native -mtune=native
CXX14FLAGS += -fPIC
rebuilt Stan (clean-all and -B -j build takes 21s), rebuilt the logistic models in the example, then took logistic1 for a spin:
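In case it helps anyone reproduce, the rebuild was roughly this (a minimal sketch; the model paths depend on where the example repo puts the .stan files, so treat them as hypothetical):

# from the CmdStan directory
make clean-all
make -B -j build                          # the ~21s step on the M1
make path/to/logistic0 path/to/logistic1  # point make at the .stan files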
% for i in {1..8}; do STAN_NUM_THREADS=$i ./logistic1 sample data file=redcard_input.R | tail -n2 | head -n1; done
62.691 seconds (Total)
32.349 seconds (Total)
23.489 seconds (Total)
18.14 seconds (Total)
18.922 seconds (Total)
19.632 seconds (Total)
16.842 seconds (Total)
17.232 seconds (Total)
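Just arithmetic on the numbers above, but the speedup over the 1-thread run makes the flattening easy to see:

for t in 62.691 32.349 23.489 18.14 18.922 19.632 16.842 17.232; do
  echo "scale=2; 62.691 / $t" | bc   # speedup = 1-thread time / n-thread time
done

which comes out to about 1.0, 1.9, 2.7, 3.5, and then plateaus in the 3.2–3.7 range past 4 threads.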
With logistic0 (recompiled) as a baseline again,
% ./logistic0 sample data file=redcard_input.R | tail -n2
44.929 seconds (Total)
As an apples-to-oranges comparison, a production HPC site we use has E5-2690 v3 nodes (12-core Haswells), and the timings for the two models (v2.25.0, same make/local, GCC 8.3) are
$ ./logistic0 sample data file=redcard_input.R | tail -n2 | head -n1
140.746 seconds (Total)
$ for i in {1..12}; do STAN_NUM_THREADS=$i ./logistic1 sample data file=redcard_input.R | tail -n2 | head -n1; done
142.539 seconds (Total)
67.85 seconds (Total)
52.471 seconds (Total)
38.98 seconds (Total)
32.013 seconds (Total)
27.088 seconds (Total)
22.436 seconds (Total)
20.239 seconds (Total)
17.952 seconds (Total)
16.299 seconds (Total)
17.573 seconds (Total)
14.551 seconds (Total)
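Running the same kind of arithmetic as parallel efficiency (1-thread time divided by n times the n-thread time):

n=0
for t in 142.539 67.85 52.471 38.98 32.013 27.088 22.436 20.239 17.952 16.299 17.573 14.551; do
  n=$((n+1))
  echo "scale=2; 142.539 / ($n * $t)" | bc   # efficiency at n threads
done

the Xeon stays near 0.9 out to 10 threads, while the M1 numbers above fall off after 4.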
So the two chips are on par at roughly 9 Xeon threads vs 4 M1 threads (17.95 s vs 18.14 s).
Given the M1 is an eight-core chip but non-uniform, I think the flattening in scaling with cores is expected: 4 cores are “fast” and 4 are “slow”, which is probably not what TBB expects, and smarter sharding, e.g. less work for the slower cores, would probably help… or those cores don’t have the ALUs to keep up, who knows. It’s still cool to see this in a fanless netbook which doesn’t even heat up while running this, especially matching an HPC compute node within a margin of 20%. The new Mac minis & MBPs should be able to avoid thermal throttling for longer workloads, not to mention whatever arm they cook up for the bigger machines.