Hardware Advice

With seriously memory-bound models, only more and faster CPU caches (L1, L2, L3) will help you. Accessing RAM is enormously slow compared to doing computation.
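To get a feel for that gap, here is a minimal Julia sketch (my own illustration, not from the original post) comparing the per-element cost of summing a vector that fits in cache against one that has to stream from RAM:

using BenchmarkTools

small = randn(2^12)   # 32 KB: fits comfortably in L1/L2 cache
big   = randn(2^27)   # ~1 GiB: must stream from RAM

t_small = @belapsed sum($small)
t_big   = @belapsed sum($big)

# nanoseconds per element; expect the RAM-bound case to be several times slower
(1e9 * t_small / length(small), 1e9 * t_big / length(big))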

Aside from hardware, it's even better, as usual, to formulate the model with good geometry. The other thing is to always take advantage of Stan-provided reductions, if you are not using them already.


The new Ryzen CPUs look very good, with increases in both frequency and IPC.
The 12- and 16-core models feature 64 MB of L3 cache, which should help.

I do see good benefit from multithreading even when testing memory-bound functions.
For example, dot products using MKL in Julia:

julia> using LinearAlgebra, BenchmarkTools

julia> x = randn(10^6); y = randn(10^6);

julia> BLAS.set_num_threads(1);

julia> @btime dot($x, $y)
  673.124 μs (0 allocations: 0 bytes)
1194.8538568936592

julia> BLAS.set_num_threads(18);

julia> @btime dot($x, $y)
  12.164 μs (0 allocations: 0 bytes)
1194.8538568936654

I see a tremendous, superlinear speedup: about 55x faster when using 18 cores.
If the computation were compute-bottlenecked, we'd expect at best an 18x improvement from 18 threads.
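That factor comes straight from the two timings above (a quick sanity check on the numbers):

673.124 / 12.164  # single-threaded μs over 18-thread μs ≈ 55.3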

Being memory-bottlenecked means we can actually see much bigger gains from multithreading. Why?
Run serially, all the data must churn through the caches of a single core. But run in parallel, the memory we're computing on can be divided up among cores and sit in the core-local L2 caches:

julia> (sizeof(x) + sizeof(y)) / (18 * (2 ^ 20)) # sizeof(x) = length(x) * 8
0.8477105034722222

My CPU has 18 cores, each with 2^20 bytes (1 MiB) of L2 cache. So if I do the dot product in parallel, the memory can simply be divided up among the L2 caches of the 18 cores doing the computation (taking up only about 85% of the available space), essentially eliminating any L2 <-> L3 memory churn and greatly reducing the memory-bandwidth problem.
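To make the chunking concrete, here is a minimal hand-rolled threaded dot product (an illustrative sketch of the idea, not MKL's actual implementation): each task reduces its own contiguous chunk, so that chunk can stay resident in its core's private L2.

using Base.Threads

function tdot(x, y)
    n = length(x)
    nt = nthreads()
    partials = zeros(nt)
    @threads for t in 1:nt
        # contiguous chunk of ~n/nt elements for this task
        lo = div((t - 1) * n, nt) + 1
        hi = div(t * n, nt)
        s = 0.0
        @inbounds @simd for i in lo:hi
            s += x[i] * y[i]
        end
        partials[t] = s
    end
    return sum(partials)
end

With 18 threads, each task's slice of x plus the matching slice of y comes to roughly 0.85 MiB, the ~85% L2 occupancy computed above.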

Greatly reducing it, that is; the problem is still there and still quite bad:

julia> 2e-9 * length(x) / 12.164e-6 # GFLOPS: 2 flops (a multiply and an add) per element
164.41959881617888

because the CPU gets over 2 teraflops when performing matrix multiplication:

julia> A = randn(16_000, 16_000);

julia> B = randn(16_000, 16_000);

julia> C = similar(A);

julia> @time mul!(C, A, B); # run to compile
  4.053175 seconds (20 allocations: 1.562 KiB)

julia> 2e-9 * 16_000^3 / 4.053175 # GFLOPS: 2N^3 flops for an N×N matmul
2021.1315820313703

julia> @time mul!(C, A, B);
  4.021825 seconds

julia> 2e-9 * 16_000^3 / 4.021825
2036.8862394559685

This is about 12x the floating-point operations per second that we saw with the dot product (2037 / 164 ≈ 12.4).

The theoretical peak flops for a dot product on my CPU is only half that of matrix multiplication. The CPU can issue at most 2 memory loads and 2 multiply-adds per cycle; matrix multiplication can sustain 2 multiply-adds per cycle, but a dot product needs 2 loads per multiply-add, limiting it to at best a single multiply-add per cycle. Yet the dot product was 12x slower, not 2x, meaning something else (presumably memory) was the bottleneck.
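As a back-of-the-envelope check, here is that peak arithmetic. The specific per-core numbers are assumptions on my part (an 18-core AVX-512 chip with 2 FMA units per core at roughly 3.5 GHz all-core turbo), not figures from the post:

cores, ghz = 18, 3.5        # assumed core count and all-core AVX-512 turbo
lanes, fma_units = 8, 2     # Float64 lanes per AVX-512 vector; FMA ports per core
flops_per_cycle = lanes * fma_units * 2       # each FMA counts as 2 flops
peak_matmul = cores * ghz * flops_per_cycle   # GFLOPS when compute-bound
peak_dot = peak_matmul / 2                    # 2 loads per multiply-add halves it
(peak_matmul, peak_dot)                       # (2016.0, 1008.0)

Under those assumptions the compute-bound peak lands right around the ~2 TFLOPS measured for the matmul, while the dot product's measured 164 GFLOPS sits far below even its halved ~1 TFLOPS ceiling, which is what points to memory as the real bottleneck.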
