Link error on Apple M1

I grabbed the latest CmdStan release (2.25.0) from GitHub and did a make -j build, which went well. Building the Bernoulli example, however, failed while building TBB:

$ make  examples/bernoulli/bernoulli
...
...
clang: warning: argument unused during compilation: '-mrtm' [-Wunused-command-line-argument]
clang++ -fPIC -o libtbb.dylib concurrent_hash_map.o concurrent_queue.o concurrent_vector.o dynamic_link.o itt_notify.o cache_aligned_allocator.o pipeline.o queuing_mutex.o queuing_rw_mutex.o reader_writer_lock.o spin_rw_mutex.o x86_rtm_rw_mutex.o spin_mutex.o critical_section.o mutex.o recursive_mutex.o condition_variable.o tbb_thread.o concurrent_monitor.o semaphore.o private_server.o rml_tbb.o tbb_misc.o tbb_misc_ex.o task.o task_group_context.o governor.o market.o arena.o scheduler.o observer_proxy.o tbb_statistics.o tbb_main.o concurrent_vector_v2.o concurrent_queue_v2.o spin_rw_mutex_v2.o task_v2.o   -ldl -lpthread -dynamiclib -install_name @rpath/libtbb.dylib -stdlib=libc++ -m32 -mmacosx-version-min=10.11 -Wl,-L,"/Users/duke/Downloads/cmdstan-2.25.0/stan/lib/stan_math/lib/tbb" -Wl,-rpath,"/Users/duke/Downloads/cmdstan-2.25.0/stan/lib/stan_math/lib/tbb"   -Wl,-exported_symbols_list,tbb.def
ld: unknown/unsupported architecture name for: -arch armv4t
clang: error: linker command failed with exit code 1 (use -v to see invocation)
make[1]: *** [libtbb.dylib] Error 1

Is there a known fix, or a way to not use TBB?

You need to apply this fix: https://github.com/stan-dev/math/pull/2208
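If you'd rather not wait for a release that includes it, one way to apply the fix by hand is to download the PR's diff and patch the bundled math library. A sketch, assuming the diff still applies cleanly to the stan_math shipped with 2.25.0 (m1-tbb-fix.diff is just a scratch filename):

% cd cmdstan-2.25.0/stan/lib/stan_math
% curl -L https://github.com/stan-dev/math/pull/2208.diff -o m1-tbb-fix.diff
% git apply m1-tbb-fix.diff    # or: patch -p1 < m1-tbb-fix.diff

followed by a make clean-all and a fresh build in the CmdStan directory so the patched build files take effect.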

2 Likes

Thanks, that fixed the build and allowed me to discover

% file bin/stanc 
bin/stanc: Mach-O 64-bit executable x86_64
% bin/stanc
zsh: bad CPU type in executable: bin/stanc

which is OK because I can just build stanc2,

% make -B bin/stanc2

which issues warnings but works, then

% make STANC2=1 examples/bernoulli/bernoulli

succeeds and the model runs so fast I need a better benchmark model…

Yeah, you could probably rename bin/stanc to something else and then make bin/stanc a script that runs the x86 stanc under emulation. We will make something like that for the next release if we do not find a way to build an arm stanc.
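Such a wrapper could be as small as the sketch below, assuming Rosetta 2 is installed and the original binary has been renamed to the hypothetical bin/stanc-x86_64 (arch -x86_64 forces a process to run under the translation layer):

#!/bin/sh
# Hypothetical bin/stanc wrapper: forward all arguments to the
# renamed x86_64 binary, forcing it to run under Rosetta 2.
exec arch -x86_64 "$(dirname "$0")/stanc-x86_64" "$@"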

I don’t think that’s even required. As I only now figured out, macOS has a translation layer that practically everyone would have installed already, and once it is installed, invoking an x86_64 binary like bin/stanc automatically runs it through the translation layer.

1 Like

Oh, I thought Rosetta 2 came preinstalled. Thanks for the info! Slowly but surely we will gather all the info :)

There are also quite a few people here and on Twitter wondering how M1 Macs run with Stan, so any info would be greatly appreciated!

I think Rosetta 2 was supposed to come preinstalled, but with 11.0.1 it seems to be optional. It’s a pretty quick install, with no reboot required.
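For anyone else setting up an M1, it can be installed from the terminal:

% softwareupdate --install-rosetta --agree-to-license

(drop --agree-to-license to be shown the license prompt interactively).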

I’m happy to run some models and post numbers with & without the translation layer, if you can point me to something.

If you really wouldn’t mind, I would be interested in seeing the reduce_sum redcard example run: https://mc-stan.org/users/documentation/case-studies/reduce_sum_tutorial.html

Thanks in advance!

Yep, I grabbed the repo and used just the CmdStan models, with 2.25.0 and the AC adapter plugged in. One chain with logistic0 finishes with

 Elapsed Time: 23.943 seconds (Warm-up)
               20.146 seconds (Sampling)
               44.089 seconds (Total)

and logistic1

 Elapsed Time: 30.971 seconds (Warm-up)
               35.471 seconds (Sampling)
               66.442 seconds (Total)

This is with the default build options. I then added the following to make/local, perhaps unnecessarily:

CXX14FLAGS = -DSTAN_THREADS -pthread
CXX14FLAGS += -O3 -march=native -mtune=native
CXX14FLAGS += -fPIC

rebuilt Stan (a clean-all plus -B -j build takes 21 s), rebuilt the logistic models in the example, and then took logistic1 for a spin:

% for i in {1..8}; do STAN_NUM_THREADS=$i ./logistic1 sample data file=redcard_input.R | tail -n2 | head -n1; done
               62.691 seconds (Total)
               32.349 seconds (Total)
               23.489 seconds (Total)
               18.14 seconds (Total)
               18.922 seconds (Total)
               19.632 seconds (Total)
               16.842 seconds (Total)
               17.232 seconds (Total)

With logistic0 (recompiled) as a baseline again:

% ./logistic0 sample data file=redcard_input.R | tail -n2           
               44.929 seconds (Total)

As an apples-to-oranges comparison, one production HPC site we use has E5-2690 v3 nodes (12-core Haswells), and the timings for the two models (v2.25.0, same make/local, GCC 8.3) are

$  ./logistic0 sample data file=redcard_input.R | tail -n2 | head -n1
               140.746 seconds (Total)
$ for i in {1..12}; do STAN_NUM_THREADS=$i ./logistic1 sample data file=redcard_input.R | tail -n2 | head -n1; done
               142.539 seconds (Total)
               67.85 seconds (Total)
               52.471 seconds (Total)
               38.98 seconds (Total)
               32.013 seconds (Total)
               27.088 seconds (Total)
               22.436 seconds (Total)
               20.239 seconds (Total)
               17.952 seconds (Total)
               16.299 seconds (Total)
               17.573 seconds (Total)
               14.551 seconds (Total)

So the two chips are on par at 9 Xeon threads vs 4 M1 threads.

Given the M1 is an eight-core chip with non-uniform cores, I think the flattening in scaling is expected: 4 cores are “fast” and 4 are “slow”, which is probably not what TBB expects, so smarter sharding (e.g. less work for the slower cores) would probably help… or maybe those cores don’t have the ALUs to keep up, who knows. It’s still cool to see this in a fanless netbook that doesn’t even heat up while running this, especially matching an HPC compute node within a margin of 20%. The new Mac minis & MBPs should be able to avoid thermal throttling for longer workloads, not to mention whatever ARM chip they cook up for the bigger machines.

4 Likes