Speed difference between rstan and cmdstan for a simple model

mitzimorris · November 7, 2021, 4:46pm

data generation - more fine-grained instrumentation of the algorithms
data transport layer - protobufs etc.
downstream assembly - xarray et al.

items 1 and 2 fall under the core Stan C++ lib; items 2 and 3 are for downstream consumers (including Stan interfaces CmdStan, CmdStanX, RStan, PyStan, et al)

I think this fits in well with what we want from a data transport layer: streaming outputs with low-latency and I’m all for any similar protocol that’s robust and has C++, Python, R, and Julia interfaces.

as for items 1 and 3, the mismatch persists - Stan emits data row-wise, downstream analysis operates column-wise. the challenge is creating downstream monitors that can provide online diagnostics and estimates.

mike-lawrence · November 7, 2021, 5:02pm

Ah, aspects I’m sure you’ve probably already educated me on but no I’ve forgotten; thank you for the reminder!

rfc · November 7, 2021, 7:35pm

I used linux perf to measure where the CPU spends time when sampling from the model with both cmdstan 2.27.0 and cmdstan 2.28.1 .

# n.b "bench" is the name of the model
perf record -F 99 ./bench sample data file=bench-data.json output file=/dev/null random seed=1234
perf report

The perf report output shows the percentage of CPU time spent inside each function call. The raw output is a bit overwhelming, but when we print the time by shared library name per cmdstan version, we can see a big difference in the amount of time spent in libc-2.31.so:

Shared Object	cmdstan 2.27.0	cmdstan 2.28.1
[kernel.kallsyms]	6.05%	4.15%
bench	65.33%	44.51%
libc-2.31.so	19.15%	38.42%
libm-2.31.so	0.21%	0.03%
libpthread-2.31.so	0.27%	0.30%
libstdc++.so.6.0.28	8.68%	12.42%
total	99.69%	99.83%

When we drill down and compare the CPU time spent inside the top 5 function calls inside libc-2.31.so there’s quite a big difference in __GI___printf_fp_l:

Shared Object	Symbol	cmdstan 2.27.0	cmdstan 2.28.1
libc-2.31.so	__GI___printf_fp_l	7.24%	17.45%
libc-2.31.so	__memmove_avx_unaligned_erms	4.93%	4.24%
libc-2.31.so	hack_digit	1.53%	3.59%
libc-2.31.so	__vfprintf_internal	1.21%	2.98%
libc-2.31.so	__mpn_mul_1	0.82%	1.44%

Both __printf_fp_l and hack_digit appear to be internal parts of glibc used to print floating point values: glibc/printf_fp.c at glibc-2.31 · bminor/glibc · GitHub

This suggests cmdstan 2.28.1 might be spending 2x as much time as cmdstan 2.27.0 printing floating point values! Is this expected? Superficially, when I compare the file sizes of both output.csv files, they’re the same, so it doesn’t seem like the user is getting twice as many floats printed in 2.28.1 to compensate them for roughly the double CPU time.

rok_cesnovar · November 7, 2021, 7:45pm

Thanks @rfc!

I have identified the changes that caused this slowdown and also that the slowdown is not so evident with models that aren’t as trivial as this one.

The changes are from here: Running multiple chains in one Stan program by SteveBronder · Pull Request #987 · stan-dev/cmdstan · GitHub

I will move the performance regression discussion to Performance regression 2.28 · Issue #1055 · stan-dev/cmdstan · GitHub

rfc · November 7, 2021, 8:14pm

Looks suspiciously like the work to format and print diagnostic output (and throw it into the aether) is toggled on by default: Performance regression 2.28 · Issue #1055 · stan-dev/cmdstan · GitHub

rok_cesnovar · November 7, 2021, 8:31pm

Perfect catch @rfc! Spot on.

Topic		Replies	Views
Rstan 2.19.2 slower than 2.18.1 Developers rstan	15	1242	August 27, 2019
Runtime difference between cmdstan v2.28.2 and v2.27.0 CmdStan	7	958	December 10, 2021
Rstan versus cmdstan run times Developers cmdstan , rstan , cmdstanr	1	551	December 29, 2023
Slow cmdstanr/posterior vs. rstan summary CmdStan cmdstanr	5	1297	November 16, 2021
New computer...Mac or Windows, RStan or CmdStanR? General rstan , windows , mac , cmdstanr	23	3961	July 18, 2022

Speed difference between rstan and cmdstan for a simple model

Related topics