Stan SIMD & Performance

Is that done in this code here? Feel like a half semester of a C++ optimization class could be spent on Eigen’s product folder haha

I would love for Stan to steal this idea. A consistent and nice shorthand for broadcasting would be awesome

The rest of what you said above this line is extremely cool but a bit higher than my paygrade haha. But one other thing about our var type is that the base vari inside of it is 24 bytes (two doubles plus the vtable pointer its virtual functions require). This has always felt very cache-unfriendly: since most modern cache lines are 64 bytes, we can only fit 2 varis in a cache line at once. If we could remove those virtual functions we could fit 4 varis in each cache line. @Bob_Carpenter I think you said before you had some scheme for that?

Bob and I have spoken about this before and I think the post here talks about adding a static matrix type for that. I never fully understood why the static matrix type would force a copy on setting

I wonder if we made Stan’s stack allocator more general if we could do something like this as well? When I ran heaptrack over one of my stan programs I found it took a lot of time allocating and deallocating the point_ps types when it built the tree.

I think with a bit of weirdness and elbow grease this is feasible in our current stuff. Something something using our own stack allocator and Eigen::Map to construct matrices instead of Eigen’s allocator.

Orrr you could help us do some of this! :-)

I tried this as well though I also don’t think I tried hard enough. I think you need the local allocator for each object to then actually point to a global allocator that’s never destroyed. Then do the standard tracking of space used.

Also goofed around with this, I think we would need to use Eigen::Map if we want to allocate our own memory for Eigen. They don’t have any macros or templates for using your own allocator (that I know of).

Yeah if we can remove the pimpl idiom we currently use in var->vari that would remove a lot of pointer chasing. I’ve tried goofing with this but can’t find a pattern that works well. Though I’m sure there is something out there to do. The main thing is dealing with the global AD tape that holds a pointer to those varis in var.

Yes and it’s excellent! Tadej also worked on something similar to this for the elementwise functions that’s seeing a nice average speedbump of 3% on the performance benchmarks. (meta, it’s v rare to see PRs that give speedups to all of the benchmarks so that is very nice)

Make sure when you’re benchmarking the autodiff stack to call recover_memory() after you’re done with each gradient evaluation.

When you call lp.grad(), Stan will go through everything on the stack even if it wasn’t part of the calculation. So like if you do:

var a = 1;
var b = a + a;
var c = 2 * a;


I’m pretty sure the adjoint computation for c will still happen when you call b.grad(). It’s just that the adjoint for c is zero, so you don’t see it. The tape isn’t really a linked list or anything that is careful about dependencies.

Anyway, this can lead to weird benchmarking behavior, because if you do some calculations and then don’t clear them from the stack when you’re done with them, they’ll get redone and everything will look super slow.

I imagine a few approaches would be possible, like CxxWrap.jl, but as seantalts suggested, it’d be easier to use one of the native Julia HMC libraries. They’re all copying Stan’s algorithms / based on the same set of references.

Julia and C++ code compiled with Clang should be about equally fast when both are implemented in the same way.
The implementation differences account for the difference in performance, not the language.

Of course, some languages make certain types of implementations easier than others, e.g., neither C nor Fortran has an equivalent of template metaprogramming, making it more difficult to optimize generic algorithms in them than it is in C++ or Julia.
You can also do this in C++ with templates, but I’d be interested in a C/Fortran implementation:

function fib(::Val{N}) where {N}
    N > 1 ? fib(Val(N-1)) + fib(Val(N-2)) : N
end
@time fib(Val(46))
using BenchmarkTools; @btime fib(Val(46))
@code_typed fib(Val(46))

(On the other hand, it’s a lot easier to end up with extremely slow code in Julia than it is in the other languages, eg if it is type unstable.)

Ah, yes, Bob said this earlier.
While (likely) much slower, the compiler can sometimes use clever permutations to vectorize the array of structs code, or at least some of the computations. A bigger benefit (vs what Stan does) is probably that it doesn’t have to constantly chase pointers. Adding to the Julia points comparison:

using StaticArrays, BenchmarkTools, StructArrays
mutable struct MPoint3D{T} <: FieldVector{3, T}
    x::T
    y::T
    z::T
end

By declaring it a mutable struct instead of a struct, each instance is heap allocated with a stable memory address, meaning instances will not be stored inline in an array.

v_aos = [Point3D(rand(3))  for i = 1:1000]; # array of structs
v_soa = StructArray(v_aos); # struct of arrays
v_aop = [MPoint3D(v)  for v = v_aos]; # array of pimpl

Rerunning the previous benchmark (times for the array of structs and struct of arrays were about 1.6 and 0.5 microseconds, respectively) on the same laptop:

julia> @benchmark f!($v_aop)
  memory estimate:  31.25 KiB
  allocs estimate:  1000
  minimum time:     12.643 μs (0.00% GC)
  median time:      15.287 μs (0.00% GC)
  mean time:        20.145 μs (18.37% GC)
  maximum time:     5.701 ms (99.06% GC)
  samples:          10000
  evals/sample:     1

If we instead modify the old point in place to avoid all these allocations:

julia> function f2!(v)
           @inbounds @simd ivdep for i in eachindex(v)
               vi = v[i]
               x, y, z = vi
               vi.x = 2x; vi.y = x*y; vi.z = x*z
           end
       end
f2! (generic function with 1 method)

julia> @benchmark f2!($v_aop)
  memory estimate:  0 bytes
  allocs estimate:  0
  minimum time:     3.041 μs (0.00% GC)
  median time:      3.061 μs (0.00% GC)
  mean time:        3.075 μs (0.00% GC)
  maximum time:     13.013 μs (0.00% GC)
  samples:          10000
  evals/sample:     8

About 6 times slower than the struct-of-arrays version. Up to 4x could be attributed to AVX2, and the rest probably to data locality. (Stan’s memory pool might mean things are a little better?)
The assembly now also shows no packed (pd) operations, only scalar (sd) ones; there is no SIMD at all:

julia> @code_native debuginfo=:none f2!(v_aop)
	pushq	%rax
	movq	%rsi, (%rsp)
	movq	(%rsi), %rcx
	movq	24(%rcx), %rax
	testq	%rax, %rax
	jle	L97
	movq	%rax, %rdx
	sarq	$63, %rdx
	andnq	%rax, %rdx, %rax
	movq	(%rcx), %rcx
	xorl	%edx, %edx
	nopw	%cs:(%rax,%rax)
	nopl	(%rax)
	movq	(%rcx,%rdx,8), %rsi
	testq	%rsi, %rsi
	je	L109
	vmovsd	(%rsi), %xmm0           # xmm0 = mem[0],zero
	vaddsd	%xmm0, %xmm0, %xmm1
	vmovsd	%xmm1, (%rsi)
	vmulsd	8(%rsi), %xmm0, %xmm1
	vmovsd	%xmm1, 8(%rsi)
	vmulsd	16(%rsi), %xmm0, %xmm0
	vmovsd	%xmm0, 16(%rsi)
	incq	%rdx
	cmpq	%rax, %rdx
	jb	L48
	movabsq	$jl_system_image_data, %rax
	popq	%rcx
	movabsq	$jl_throw, %rax
	movabsq	$jl_system_image_data, %rdi
	callq	*%rax
	nopw	%cs:(%rax,%rax)
	nopl	(%rax)

Not everything can be roughly analogous, but some can be.

Does Eigen support in place operations? If so, that might be a simpler approach, as it wouldn’t require hacking on Eigen. Before sampling begins, you’d preallocate every Eigen object needed, and then rewrite into these each time you evaluate the density / gradient.

It would likely require more memory (you wouldn’t be able to reuse the same memory more than once within a single log-density/gradient calculation the way incrementing and resetting a stack pointer allows), but it should be a solid improvement.

The Julia library ReverseDiff.jl does this. It allocates each object needed while recording the tape, and then saves them for reuse.
Shouldn’t something like this be possible in Stan (even though it doesn’t use a compile-able tape)?

FWIW, it’d be great if the generic auto version can be made to work generally. If every function were defined in that manner, switching to struct of arrays should be relatively simple. E.g., define a class that holds two Eigen::Matrix&lt;double,-1,-1&gt; (value and adjoint), as well as something similar for function pointers if you need that.
Same story for using sparse matrices, like Bob Carpenter pushes for.

However, in one of your links some folks discuss pitfalls with the all-auto approach.

Disclaimer: unless you want to get into the weeds / help implement the library, I wouldn’t yet recommend trying it out. It’s currently still much too rough, and the master branch is likely to spend a high fraction of its time broken until I finally hit the first 0.1.0 release.

The simplistic idea is that using the sort of dual approach is slow, while operating on dense arrays is fast. Therefore prefer the latter, and fallback to the former when the first fails.
Many functions also have versions that additionally return an adjoint operator, which updates the adjoints in the reverse pass.
The source to source part simply swaps functions with the version returning this, and adds a backwards pass.
Importantly, I don’t support control flow yet, but my workaround is to cover those with a fallback approach.

Along with handling control flow, functions without versions returning adjoints will also be handled via the fall back approach, which will either be something like forward mode differentiation to get a dense Jacobian matrix (with which to get the reverse mode adjoint), or use a Zygote pullback.
IMO both have unacceptably bad performance at the moment.

Stan could perhaps somewhat use the reverse approach.
When multiplying matrices, Stan copies into matrices of doubles for matrix multiplication.
How does Stan handle the reverse update of the adjoints? Also by copying into matrices of doubles and performing the multiplication? If so, then it does in a sense take the reverse approach – defaulting to be generic / flexible, and in cases that can be highly optimized, it switches to specialized routines.
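For reference, the standard reverse-mode updates for the matrix product $C = AB$, which are themselves plain double matrix multiplications and so fit the copy-into-doubles strategy:

```latex
\bar{A} \mathrel{+}= \bar{C}\,B^{\top},
\qquad
\bar{B} \mathrel{+}= A^{\top}\,\bar{C}
```

where $\bar{X}$ denotes the adjoint of $X$.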

Source to source can be very efficient for linear algebra (admittedly, I have a long way to go to actually support this myself). Here is a paper providing lots of derivative results useful for calculating adjoints in the reverse pass.

Of course, you can also probably do this in the overload approach.
I’d argue that this is another point in favor of the struct-of-arrays data layout. For all of these, you’d want to take advantage of the optimized linear algebra routines on doubles both for calculating the values and again for calculating the adjoints. It’s faster to not first copy everything into an array of doubles.

I’d sign up myself! =)
Here is a paper (by Goto, whose GotoBLAS turned into OpenBLAS, and van de Geijn). I need to study this paper more myself.
So far, all I’ve been doing is optimizing micro kernels and crossing my fingers that I don’t get memory throttled too badly (and dispatching to MKL for some operations, like multiplying matrices larger than 60x60 or so).

My applications have been more along the lines of fitting a model on a million small data sets, rather than fitting one model on a very large data set. Meaning scaling with size of the inputs hasn’t been a major concern, but it’s something I plan to start addressing.

It’s Julia’s take on expression templates. Unfortunately, they don’t have fusion of matrix multiplication yet. I figured I might try to implement it by having a different operator (e.g., \times, although that may be interpreted as cross product) return an expression template.

julia> x = collect(1:10); x'
1×10 LinearAlgebra.Adjoint{Int64,Array{Int64,1}}:
 1  2  3  4  5  6  7  8  9  10

julia> y = fill(3, 5); y'
1×5 LinearAlgebra.Adjoint{Int64,Array{Int64,1}}:
 3  3  3  3  3

julia> sin.(x .- y') .+ exp(0.4)
10×5 Array{Float64,2}:
 0.582527  0.582527  0.582527  0.582527  0.582527
 0.650354  0.650354  0.650354  0.650354  0.650354
 1.49182   1.49182   1.49182   1.49182   1.49182
 2.3333    2.3333    2.3333    2.3333    2.3333
 2.40112   2.40112   2.40112   2.40112   2.40112
 1.63294   1.63294   1.63294   1.63294   1.63294
 0.735022  0.735022  0.735022  0.735022  0.735022
 0.5329    0.5329    0.5329    0.5329    0.5329
 1.21241   1.21241   1.21241   1.21241   1.21241
 2.14881   2.14881   2.14881   2.14881   2.14881

julia> Meta.@lower sin.(x .- y') .+ exp(0.4)
:($(Expr(:thunk, CodeInfo(
    @ none within `top-level scope`
1 ─ %1 = Base.adjoint(y)
│   %2 = Base.broadcasted(-, x, %1)
│   %3 = Base.broadcasted(sin, %2)
│   %4 = exp(0.4)
│   %5 = Base.broadcasted(+, %3, %4)
│   %6 = Base.materialize(%5)
└──      return %6

julia> bc = Base.broadcasted(+, Base.broadcasted(sin, Base.broadcasted(-, x, y')), exp(0.4))
Base.Broadcast.Broadcasted(+, (Base.Broadcast.Broadcasted(sin, (Base.Broadcast.Broadcasted(-, ([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], [3 3 … 3 3])),)), 1.4918246976412703))

julia> typeof(bc)

julia> Base.materialize(bc)
10×5 Array{Float64,2}:
 0.582527  0.582527  0.582527  0.582527  0.582527
 0.650354  0.650354  0.650354  0.650354  0.650354
 1.49182   1.49182   1.49182   1.49182   1.49182
 2.3333    2.3333    2.3333    2.3333    2.3333
 2.40112   2.40112   2.40112   2.40112   2.40112
 1.63294   1.63294   1.63294   1.63294   1.63294
 0.735022  0.735022  0.735022  0.735022  0.735022
 0.5329    0.5329    0.5329    0.5329    0.5329
 1.21241   1.21241   1.21241   1.21241   1.21241
 2.14881   2.14881   2.14881   2.14881   2.14881

The macro Meta.@lower reveals that the “dots” get translated into calls to the Base.broadcasted function, which creates the lazy expressions.
materialize is the equivalent of Eigen’s .eval(); it turns them into an array. There is also materialize!, which stores the results into a pre-allocated array.

I think this makes it simple to choose when the expression is evaluated, and which operations get fused by making it a part of the language’s semantics.

So, to implement it similarly in Stan, you’d have to lower the expressions into calls that build up the Eigen expressions (perhaps also adding rep_(row_)vector/rep_matrix, as appropriate) with a .eval() at the end.

Speaking of virtual functions: I believe a virtual function call in C++ is much faster than a dynamic dispatch in Julia (where you’d try to eliminate all dynamic dispatches in performance-critical code), but can these be devirtualized and inlined by the compiler, and is that feasible in practice?

Is it because a full history of every vari object must exist to be able to perform the reverse pass at the end?
As is, the old information would be destroyed. Hence, the need to copy upon setting to preserve it.

In that case, a static (immutable) type makes sense. It should still support linear algebra, broadcasting, and elementwise operations (including vectorized distribution syntax).

Most uses of vectors/arrays could probably be expressed with immutable arrays, although I haven’t done a survey of models.

@bbbales2 Thanks, when I’m back in town I’ll rerun that benchmark and update the results.
However, the minimum, median, and mean times (out of 27 samples) were all pretty similar, at 172, 193, and 192 milliseconds, respectively. That doesn’t look suspicious, or like what I’d imagine a linearly increasing runtime to look like; I’d expect the minimum to be much lower than the mean and median in that case. The maximum also isn’t much higher here.
Although reclaiming memory may still boost performance, so please disregard the old results until I rerun them.


What were those doing? Presumably nothing involving linear algebra like matrix multiplication or Cholesky decomposition, which are \mathcal{O}(N^3) operations with \mathcal{O}(N^2) data to copy. Whenever I profiled Stan, it was always calls to log and exp taking up all the time (they’d take much less time if they could be well pipelined in the CPU).

For background, the reason we do this all dynamically is to allow flexibility in the language. That was a good choice for flexibility and letting us get off the ground, but it may turn out to be a bad long-term choice for efficiency. Stan is like PyTorch in this regard, but unlike the standard static TensorFlow graphs (though TF has added an eager mode, which I don’t fully understand, that I think adds back some of this flexibility).

That’s what Autograd does. The restriction you wind up with is that it’s illegal to assign into a matrix because there’s no way to maintain locality when you do that without copying the whole matrix. We’ve been discussing adding a static matrix type with comprehension functions to define the entries—that could be kept memory local.

Stan’s memory isn’t allocated one struct at a time, but Eigen::Matrix and std::vector both call out to malloc, so they have that problem.

I agree. But it’s a tricky problem in that we don’t want everything allocated on the autodiff stack, because a lot of our standard vectors and Eigen matrices are temporary.

That’s sad, but true. It’s the only way to support assignable matrices in general. But like I said above, we can add a less flexible matrix type that doesn’t allow assignment into it and has the nice locality properties. Then we could get away with

struct static_matrix_vari {
  Eigen::MatrixXd val_;
  Eigen::MatrixXd adj_;
};

The only problem is that we can’t support:

static_matrix a = ...;

We need things like this all over the place in Stan programs, but I think we can replace some of it, like our use in Gaussian processes, with comprehension functions.