Stan SIMD & Performance

A few approaches would be possible, like CxxWrap.jl, but as seantalts suggested, it’d be easier to use one of the native Julia HMC libraries. They all implement Stan’s algorithms / are based on the same set of references.

Julia and C++ code compiled with Clang should be about equally fast when both are implemented in the same way.
Implementation differences, not the language, account for differences in performance.

Of course, some languages make certain types of implementations easier than others; e.g., neither C nor Fortran has an equivalent to template metaprogramming, making it more difficult to optimize generic algorithms in them than it is in C++ or Julia.
You can do the following in C++ with templates as well, but I’d be interested in a C/Fortran implementation:

# Dispatching on Val{N} moves the recursion to compile time via specialization
# and constant folding (the analogue of a C++ template metaprogram).
function fib(::Val{N}) where {N}
    N > 1 ? fib(Val(N-1)) + fib(Val(N-2)) : N
end
@time fib(Val(46))
using BenchmarkTools; @btime fib(Val(46))
@code_typed fib(Val(46))

(On the other hand, it’s a lot easier to end up with extremely slow code in Julia than in the other languages, e.g., if the code is type unstable.)
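
For example, here’s a minimal sketch of a classic type instability (the function names are just for illustration):

# Type-unstable: `acc` starts as an Int and becomes a Float64 inside the loop,
# so the compiler infers it as Union{Int64, Float64} and generates slower code.
function unstable_sum(x::Vector{Float64})
    acc = 0
    for xi in x
        acc += xi
    end
    return acc
end

# Type-stable version: start the accumulator with the element type.
function stable_sum(x::Vector{Float64})
    acc = zero(eltype(x))
    for xi in x
        acc += xi
    end
    return acc
end

# @code_warntype unstable_sum(rand(100))  # flags acc as Union{Float64, Int64}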

Ah, yes, Bob said this earlier.
While the array-of-structs code is (likely) much slower, the compiler can sometimes use clever permutations to vectorize it, or at least some of the computations. A bigger benefit (vs what Stan does) is probably that it doesn’t have to constantly chase pointers. Adding to the Julia points comparison:

using StaticArrays, BenchmarkTools, StructArrays
mutable struct MPoint3D{T} <: FieldVector{3, T}
    x::T
    y::T
    z::T
end

By declaring it a mutable struct instead of a struct, each instance will be heap allocated with a stable memory address, meaning the points will not be stored inline in an array.
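
A quick way to see the difference (assuming the Point3D definition from the earlier benchmark):

isbitstype(Point3D{Float64})  # true:  a Vector{Point3D{Float64}} stores the coordinates inline
isbitstype(MPoint3D{Float64}) # false: a Vector{MPoint3D{Float64}} stores pointers to heap-allocated objects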

v_aos = [Point3D(rand(3))  for i = 1:1000]; # array of structs
v_soa = StructArray(v_aos); # struct of arrays
v_aop = [MPoint3D(v)  for v = v_aos]; # array of pimpl

Rerunning the previous benchmark (times for the array of structs and struct of arrays were about 1.6 and 0.5 microseconds, respectively) on the same laptop:

julia> @benchmark f!($v_aop)
BenchmarkTools.Trial: 
  memory estimate:  31.25 KiB
  allocs estimate:  1000
  --------------
  minimum time:     12.643 μs (0.00% GC)
  median time:      15.287 μs (0.00% GC)
  mean time:        20.145 μs (18.37% GC)
  maximum time:     5.701 ms (99.06% GC)
  --------------
  samples:          10000
  evals/sample:     1

If we instead modify the old point in place to avoid all these allocations:

julia> function f2!(v)
           @inbounds @simd ivdep for i in eachindex(v)
               vi = v[i]
               x, y, z = vi
               vi.x = 2x; vi.y = x*y; vi.z = x*z
           end
       end
f2! (generic function with 1 method)

julia> @benchmark f2!($v_aop)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     3.041 μs (0.00% GC)
  median time:      3.061 μs (0.00% GC)
  mean time:        3.075 μs (0.00% GC)
  maximum time:     13.013 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     8

About 6 times slower than the struct-of-arrays version. Up to 4x of that could be attributed to AVX2, and the rest probably to data locality. (Stan’s memory pool might mean things are a little better there?)
The assembly now also shows no packed (pd) operations; there is no SIMD at all:

julia> @code_native debuginfo=:none f2!(v_aop)
	.text
	pushq	%rax
	movq	%rsi, (%rsp)
	movq	(%rsi), %rcx
	movq	24(%rcx), %rax
	testq	%rax, %rax
	jle	L97
	movq	%rax, %rdx
	sarq	$63, %rdx
	andnq	%rax, %rdx, %rax
	movq	(%rcx), %rcx
	xorl	%edx, %edx
	nopw	%cs:(%rax,%rax)
	nopl	(%rax)
L48:
	movq	(%rcx,%rdx,8), %rsi
	testq	%rsi, %rsi
	je	L109
	vmovsd	(%rsi), %xmm0           # xmm0 = mem[0],zero
	vaddsd	%xmm0, %xmm0, %xmm1
	vmovsd	%xmm1, (%rsi)
	vmulsd	8(%rsi), %xmm0, %xmm1
	vmovsd	%xmm1, 8(%rsi)
	vmulsd	16(%rsi), %xmm0, %xmm0
	vmovsd	%xmm0, 16(%rsi)
	incq	%rdx
	cmpq	%rax, %rdx
	jb	L48
L97:
	movabsq	$jl_system_image_data, %rax
	popq	%rcx
	retq
L109:
	movabsq	$jl_throw, %rax
	movabsq	$jl_system_image_data, %rdi
	callq	*%rax
	nopw	%cs:(%rax,%rax)
	nopl	(%rax)

Not everything can be made roughly analogous, but some things can be.

Does Eigen support in-place operations? If so, that might be a simpler approach, as it wouldn’t require hacking on Eigen. Before sampling begins, you’d preallocate every Eigen object needed, and then write into them each time you evaluate the density / gradient.

It would likely require more memory (compared to incrementing a stack pointer, you’d probably lose the ability to reuse the same memory more than once within a single log density/gradient calculation), but it should be a solid improvement.

The Julia library ReverseDiff.jl does this. It allocates each object needed while recording the tape, and then saves them for reuse.
Shouldn’t something like this be possible in Stan (even though it doesn’t use a compile-able tape)?
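
For illustration, a minimal sketch of that pattern with ReverseDiff.jl (the objective f here is just a stand-in): record and compile the tape once, then reuse its preallocated buffers on every gradient call.

using ReverseDiff

f(x) = sum(abs2.(x .- 3.0))

x = rand(10)
tape = ReverseDiff.compile(ReverseDiff.GradientTape(f, x))  # records f and preallocates every intermediate

g = similar(x)
ReverseDiff.gradient!(g, tape, x)  # reverse pass reuses the stored buffers; no new allocations

The usual caveat applies: a compiled tape assumes the recorded operations don’t depend on the input values, i.e., no value-dependent control flow.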

FWIW, it’d be great if the generic auto version could be made to work generally. If every function were defined in that manner, switching to struct of arrays should be relatively simple, e.g., define a class that holds two Eigen::Matrix<double,-1,-1> (value and adjoint), as well as something similar for function pointers if you need that.
Same story for using sparse matrices, as Bob Carpenter pushes for.
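
In Julia terms, the struct-of-arrays idea is roughly the following (a sketch, with made-up names):

# One object per matrix variable, holding a matrix of values and a matrix of adjoints,
# instead of a matrix of pointers to individual value/adjoint pairs.
struct MatrixVar{T}
    value::Matrix{T}
    adjoint::Matrix{T}
end

MatrixVar(value::Matrix{T}) where {T} = MatrixVar(value, zeros(T, size(value)))

A = MatrixVar(rand(3, 3))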

However, in one of your links some folks discuss pitfalls with the all-auto approach.

Disclaimer: unless you want to get into the weeds / help implement the library, I wouldn’t yet recommend trying it out. It’s currently still much too rough, and the master branch is likely to spend a high fraction of the time broken until I finally hit the first 0.1.0 release.

The simplistic idea is that using the sort of dual approach is slow, while operating on dense arrays is fast. Therefore, prefer the latter and fall back to the former when the dense approach fails.
Many functions have versions that also return an adjoint operator, which updates the adjoints.
The source-to-source part simply swaps each function for the version that returns this operator, and adds a backwards pass.
Importantly, I don’t support control flow yet; my workaround is to cover those cases with a fallback approach.
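
To make that concrete, here’s a sketch of the value-plus-adjoint-operator idea for matrix multiplication (the names are made up; the adjoint rules are the standard ones, Abar = Cbar * B' and Bbar = A' * Cbar):

# Forward pass returns the value together with a closure that maps the output
# adjoint to the input adjoints.
mul_with_adjoint(A, B) = (A * B, Cbar -> (Cbar * B', A' * Cbar))

A, B = rand(3, 3), rand(3, 3)
C, adjoint_op = mul_with_adjoint(A, B)  # the source-to-source pass swaps `A * B` for this call

Cbar = ones(3, 3)              # adjoint seed arriving from the rest of the reverse pass
Abar, Bbar = adjoint_op(Cbar)  # the added backwards pass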

Along with handling control flow, functions without adjoint-returning versions will also be handled via the fallback approach, which will either be something like forward-mode differentiation to get a dense Jacobian matrix (from which to get the reverse-mode adjoint), or a Zygote pullback.
IMO both have unacceptably bad performance at the moment.
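
As a sketch of the first fallback (ForwardDiff used for illustration; the target function is made up): compute the dense Jacobian with forward mode, then form the reverse-mode adjoint as xbar = J' * ybar.

using ForwardDiff

g(x) = [sin(x[1]) * x[2], exp(x[2])]

x = rand(2)
ybar = ones(2)                  # adjoint of the output
J = ForwardDiff.jacobian(g, x)  # dense Jacobian via forward mode
xbar = J' * ybar                # reverse-mode adjoint of the input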

Stan could perhaps be said to somewhat take the reverse approach.
When multiplying matrices, Stan copies into matrices of doubles for the matrix multiplication.
How does Stan handle the reverse update of the adjoints? Also by copying into matrices of doubles and performing the multiplication? If so, then it does in a sense take the reverse approach: defaulting to being generic / flexible, and switching to specialized routines in the cases that can be highly optimized.

Source to source can be very efficient for linear algebra (admittedly, I have a long way to go to actually support this myself). Here is a paper providing lots of derivative results useful for calculating adjoints in the reverse pass.

Of course, you can probably also do this in the overload approach.
I’d argue that this is another point in favor of the struct-of-arrays data layout. For all of these, you’d want to take advantage of the optimized linear algebra routines on doubles both for calculating the values and again for calculating the adjoints. It’s faster not to have to first copy everything into an array of doubles.

I’d sign up myself! =)
Here is a paper (by Goto, whose GotoBLAS turned into OpenBLAS, and van de Geijn). I need to study this paper more myself.
So far, all I’ve been doing is optimizing micro kernels and crossing my fingers that I don’t get memory throttled too badly (and dispatching to MKL for some operations, like multiplying matrices larger than 60x60 or so).

My applications have been more along the lines of fitting a model on a million small data sets, rather than fitting one model on a very large data set, meaning that scaling with the size of the inputs hasn’t been a major concern. It is something I plan to start addressing, though.

It’s Julia’s take on expression templates. Unfortunately, they don’t have fusion of matrix multiplication yet. I figured I might try to implement it by having a different operator (e.g., \times, although that may be interpreted as a cross product) return an expression template, roughly as sketched below.
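
A rough sketch of that idea (the ⊠ operator and the wrapper type are hypothetical; none of this exists yet):

using LinearAlgebra

# Lazy wrapper: A ⊠ B builds an expression instead of multiplying immediately.
struct LazyMul{TA,TB}
    A::TA
    B::TB
end

⊠(A, B) = LazyMul(A, B)

# Evaluating into a destination fuses the multiplication with the write.
evaluate!(C, m::LazyMul) = mul!(C, m.A, m.B)

A, B = rand(4, 4), rand(4, 4)
C = similar(A)
evaluate!(C, A ⊠ B)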

julia> x = collect(1:10); x'
1×10 LinearAlgebra.Adjoint{Int64,Array{Int64,1}}:
 1  2  3  4  5  6  7  8  9  10

julia> y = fill(3, 5); y'
1×5 LinearAlgebra.Adjoint{Int64,Array{Int64,1}}:
 3  3  3  3  3

julia> sin.(x .- y') .+ exp(0.4)
10×5 Array{Float64,2}:
 0.582527  0.582527  0.582527  0.582527  0.582527
 0.650354  0.650354  0.650354  0.650354  0.650354
 1.49182   1.49182   1.49182   1.49182   1.49182
 2.3333    2.3333    2.3333    2.3333    2.3333
 2.40112   2.40112   2.40112   2.40112   2.40112
 1.63294   1.63294   1.63294   1.63294   1.63294
 0.735022  0.735022  0.735022  0.735022  0.735022
 0.5329    0.5329    0.5329    0.5329    0.5329
 1.21241   1.21241   1.21241   1.21241   1.21241
 2.14881   2.14881   2.14881   2.14881   2.14881

julia> Meta.@lower sin.(x .- y') .+ exp(0.4)
:($(Expr(:thunk, CodeInfo(
    @ none within `top-level scope`
1 ─ %1 = Base.adjoint(y)
│   %2 = Base.broadcasted(-, x, %1)
│   %3 = Base.broadcasted(sin, %2)
│   %4 = exp(0.4)
│   %5 = Base.broadcasted(+, %3, %4)
│   %6 = Base.materialize(%5)
└──      return %6
))))

julia> bc = Base.broadcasted(+, Base.broadcasted(sin, Base.broadcasted(-, x, y')), exp(0.4))
Base.Broadcast.Broadcasted(+, (Base.Broadcast.Broadcasted(sin, (Base.Broadcast.Broadcasted(-, ([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], [3 3 … 3 3])),)), 1.4918246976412703))

julia> typeof(bc)
Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{2},Nothing,typeof(+),Tuple{Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{2},Nothing,typeof(sin),Tuple{Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{2},Nothing,typeof(-),Tuple{Array{Int64,1},LinearAlgebra.Adjoint{Int64,Array{Int64,1}}}}}},Float64}}

julia> Base.materialize(bc)
10×5 Array{Float64,2}:
 0.582527  0.582527  0.582527  0.582527  0.582527
 0.650354  0.650354  0.650354  0.650354  0.650354
 1.49182   1.49182   1.49182   1.49182   1.49182
 2.3333    2.3333    2.3333    2.3333    2.3333
 2.40112   2.40112   2.40112   2.40112   2.40112
 1.63294   1.63294   1.63294   1.63294   1.63294
 0.735022  0.735022  0.735022  0.735022  0.735022
 0.5329    0.5329    0.5329    0.5329    0.5329
 1.21241   1.21241   1.21241   1.21241   1.21241
 2.14881   2.14881   2.14881   2.14881   2.14881

The Meta.@lower macro reveals that the “dots” get translated into calls to the Base.broadcasted function, which builds up the lazy expression.
materialize is the equivalent of Eigen’s .eval(), turning the lazy expression into an array. There is also materialize!, which stores the results into a preallocated array.
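
For example (reusing x, y, and bc from the REPL session above):

out = Matrix{Float64}(undef, 10, 5)
Base.materialize!(out, bc)  # fused evaluation into the preallocated array
# equivalent to: out .= sin.(x .- y') .+ exp(0.4)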

I think making this part of the language’s semantics makes it simple to choose when an expression is evaluated and which operations get fused.

So, to implement it similarly in Stan, you’d have to lower these operations into calls that build up the Eigen expressions (perhaps also adding rep_(row_)vector/rep_matrix, as appropriate) with a .eval() at the end.

Speaking of virtual functions: I believe a virtual function call in C++ is much faster than a dynamic dispatch in Julia (where you’d try to eliminate all dynamic dispatches in performance-critical code), but can these calls be devirtualized and inlined by the compiler, and is that feasible in practice?

Is it because a full history of every vari object must exist in order to perform the reverse pass at the end?
As is, the old information would be destroyed; hence the need to copy on assignment to preserve it.

In that case, a static (immutable) type makes sense. It should still support linear algebra, broadcasting, and elementwise operations (including vectorized distribution syntax).

Most uses of vectors/arrays could probably be expressed with immutable arrays, although I haven’t done a survey of models.
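
For instance, with StaticArrays the immutable array types already support all of these operations:

using StaticArrays

x = @SVector rand(3)  # immutable, stored inline / on the stack
A = @SMatrix rand(3, 3)

y = A * x             # linear algebra returns a new immutable vector
z = exp.(y) .+ 1.0    # broadcasting / elementwise operations do too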

@bbbales2 Thanks, when I’m back in town I’ll rerun that benchmark and update the results.
However, the minimum, median, and mean times (out of 27 samples) were all pretty similar, at 172, 193, and 192 milliseconds, respectively. That doesn’t look suspicious, or like what I’d imagine a linearly increasing runtime to look like; I’d expect the minimum to be much lower than the mean and median in that case, and the maximum isn’t much higher here either.
Although reclaiming memory may still boost performance, so please disregard the old results until I rerun them.
