A few approaches would be possible, like CxxWrap.jl, but as seantalts suggested, it’d be easier to use one of the native Julia HMC libraries. They’re all copying Stan’s algorithms / based on the same set of references.
Julia and C++ code compiled with Clang should be about equally fast when both are implemented in the same way.
Implementation differences, not the language, account for the difference in performance.
Of course, some languages make certain types of implementations easier than others, eg neither C nor Fortran has an equivalent to template metaprogramming, making it more difficult to optimize generic algorithms in them than it is in C++ or Julia.
You can also do this in C++ with templates, but I’d be interested in a C/Fortran implementation:
function fib(::Val{N}) where {N}
    N > 1 ? fib(Val(N-1)) + fib(Val(N-2)) : N
end
@time fib(Val(46))
using BenchmarkTools; @btime fib(Val(46))
@code_typed fib(Val(46))
(On the other hand, it’s a lot easier to end up with extremely slow code in Julia than it is in the other languages, eg if it is type unstable.)
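For instance, here’s a minimal (made-up) illustration: the same loop with an accumulator that starts as an Int and later becomes a Float64, versus a concretely typed one.

function unstable_sum(n)
    x = 0            # starts as an Int...
    for i in 1:n
        x += i / 2   # ...becomes a Float64 on the first iteration, so x is a Union
    end
    x
end

function stable_sum(n)
    x = 0.0          # concretely Float64 throughout
    for i in 1:n
        x += i / 2
    end
    x
end

@code_warntype unstable_sum(10) # highlights the Union{Float64,Int64} variable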
Ah, yes, Bob said this earlier.
While (likely) much slower, the array of structs code can sometimes still be vectorized by the compiler using clever permutations, or at least some of the computations can. A bigger benefit (vs what Stan does) is probably that it doesn’t have to constantly chase pointers. Adding to the Julia points comparison:
using StaticArrays, BenchmarkTools, StructArrays
mutable struct MPoint3D{T} <: FieldVector{3, T}
    x::T
    y::T
    z::T
end
By declaring it a mutable struct instead of a struct, instances will be heap allocated with a constant memory address, meaning they will not be stored inline in an array.
v_aos = [Point3D(rand(3)) for i = 1:1000]; # array of structs
v_soa = StructArray(v_aos); # struct of arrays
v_aop = [MPoint3D(v) for v = v_aos]; # array of pimpl
Rerunning the previous benchmark (times for the array of structs and struct of arrays were about 1.6 and 0.5 microseconds, respectively) on the same laptop:
julia> @benchmark f!($v_aop)
BenchmarkTools.Trial:
memory estimate: 31.25 KiB
allocs estimate: 1000
--------------
minimum time: 12.643 μs (0.00% GC)
median time: 15.287 μs (0.00% GC)
mean time: 20.145 μs (18.37% GC)
maximum time: 5.701 ms (99.06% GC)
--------------
samples: 10000
evals/sample: 1
If we instead modify the old point in place to avoid all these allocations:
julia> function f2!(v)
           @inbounds @simd ivdep for i in eachindex(v)
               vi = v[i]
               x, y, z = vi
               vi.x = 2x; vi.y = x*y; vi.z = x*z
           end
       end
f2! (generic function with 1 method)
julia> @benchmark f2!($v_aop)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 3.041 μs (0.00% GC)
median time: 3.061 μs (0.00% GC)
mean time: 3.075 μs (0.00% GC)
maximum time: 13.013 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 8
About 6 times slower than the struct of arrays version. Up to 4x could be attributed to AVX2, and the rest probably to data locality. (Stan’s memory pool might mean things are a little better?)
The assembly now also shows no pd operations; there is no SIMD at all:
julia> @code_native debuginfo=:none f2!(v_aop)
.text
pushq %rax
movq %rsi, (%rsp)
movq (%rsi), %rcx
movq 24(%rcx), %rax
testq %rax, %rax
jle L97
movq %rax, %rdx
sarq $63, %rdx
andnq %rax, %rdx, %rax
movq (%rcx), %rcx
xorl %edx, %edx
nopw %cs:(%rax,%rax)
nopl (%rax)
L48:
movq (%rcx,%rdx,8), %rsi
testq %rsi, %rsi
je L109
vmovsd (%rsi), %xmm0 # xmm0 = mem[0],zero
vaddsd %xmm0, %xmm0, %xmm1
vmovsd %xmm1, (%rsi)
vmulsd 8(%rsi), %xmm0, %xmm1
vmovsd %xmm1, 8(%rsi)
vmulsd 16(%rsi), %xmm0, %xmm0
vmovsd %xmm0, 16(%rsi)
incq %rdx
cmpq %rax, %rdx
jb L48
L97:
movabsq $jl_system_image_data, %rax
popq %rcx
retq
L109:
movabsq $jl_throw, %rax
movabsq $jl_system_image_data, %rdi
callq *%rax
nopw %cs:(%rax,%rax)
nopl (%rax)
Not everything can be roughly analogous, but some can be.
Does Eigen support in-place operations? If so, that might be a simpler approach, as it wouldn’t require hacking on Eigen. Before sampling begins, you’d preallocate every Eigen object needed, and then write into these each time you evaluate the density / gradient.
It would likely require more memory (with a stack pointer you can sometimes avoid an allocation by reusing the same memory more than once within a single log density / gradient calculation), but it should be a solid improvement.
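As a rough sketch of that idea in Julia (names are made up; this is not how Stan or Eigen would spell it), you’d preallocate the buffers once and write into them on every evaluation:

using LinearAlgebra

struct PreallocatedDensity{T}
    X::Matrix{T}   # data
    Xβ::Vector{T}  # buffer reused on every call
end

function (d::PreallocatedDensity)(β)
    mul!(d.Xβ, d.X, β)             # writes X*β into the existing buffer, no allocation
    -0.5 * sum(abs2, d.Xβ)         # toy log density
end

d = PreallocatedDensity(randn(100, 5), zeros(100))
d(randn(5))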
The Julia library ReverseDiff.jl does this. It allocates each object needed while recording the tape, and then saves them for reuse.
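For reference, using a prerecorded tape looks roughly like this (see the ReverseDiff.jl readme for the exact API):

using ReverseDiff

f(x) = sum(abs2, x) + prod(x)
x = rand(5)

tape = ReverseDiff.compile(ReverseDiff.GradientTape(f, x))  # records the tape and allocates its buffers once
grad = similar(x)
ReverseDiff.gradient!(grad, tape, x)                        # reuses the tape’s preallocated objects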
Shouldn’t something like this be possible in Stan (even though it doesn’t use a compile-able tape)?
FWIW, it’d be great if the generic auto version could be made to work generally. If every function were defined in that manner, switching to struct of arrays should be relatively simple. Eg, define a class that holds two Eigen::Matrix<double,-1,-1> (value and adjoint), as well as something similar for function pointers if you need that.
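A hypothetical Julia sketch of that struct-of-arrays layout (one contiguous value array plus one contiguous adjoint array, rather than an array of value/adjoint pairs):

struct VarMatrix{T}
    value::Matrix{T}    # all values, stored contiguously
    adjoint::Matrix{T}  # all adjoints, stored contiguously
end
VarMatrix(value::Matrix{T}) where {T} = VarMatrix(value, zeros(T, size(value)))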
Same story for using sparse matrices, like Bob Carpenter pushes for.
However, in one of your links some folks discuss pitfalls with the all-auto approach.
Disclaimer – unless you want to get into the weeds / help implement the library, I wouldn’t yet recommend trying it out. It’s currently still much too rough, and the master branch is likely to spend a high fraction of the time broken, until I finally hit the first 0.1.0 release.
The simplistic idea is that the dual/overload sort of approach is slow, while operating on dense arrays is fast. Therefore, prefer the latter and fall back to the former when the latter fails.
Many functions have versions that also return an adjoint operator, which updates the adjoints.
The source to source part simply swaps functions with the versions returning these, and adds a backwards pass.
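For example, a hypothetical adjoint-returning version of matrix multiplication might look like this (names made up; the reverse-pass rules are the standard Abar += Cbar*B' and Bbar += A'*Cbar):

function mul_adjoint(A, B)
    C = A * B
    # adjoint operator: given Cbar (the adjoint of C), accumulate the adjoints of A and B
    update_adjoints! = (Abar, Bbar, Cbar) -> begin
        Abar .+= Cbar * B'
        Bbar .+= A' * Cbar
    end
    C, update_adjoints!
end

A, B = randn(3, 4), randn(4, 2)
Abar, Bbar = zero(A), zero(B)
C, back! = mul_adjoint(A, B)
back!(Abar, Bbar, ones(size(C)))  # backwards pass, seeded with ones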
Importantly, I don’t support control flow yet; my workaround is to cover those cases with a fallback approach.
Along with handling control flow, functions without adjoint-returning versions will also be handled via the fallback approach, which will either be something like forward mode differentiation to get a dense Jacobian matrix (with which to compute the reverse mode adjoint), or using a Zygote pullback.
IMO both have unacceptably bad performance at the moment.
To some extent, Stan perhaps already takes the reverse approach.
When multiplying matrices, Stan copies into matrices of doubles for matrix multiplication.
How does Stan handle the reverse update of the adjoints? Also by copying into matrices of doubles and performing the multiplication? If so, then it does in a sense take the reverse approach – defaulting to being generic / flexible, and switching to specialized routines in cases that can be highly optimized.
Source to source can be very efficient for linear algebra (admittedly, I have a long way to go to actually support this myself). Here is a paper providing lots of derivative results useful for calculating adjoints in the reverse pass.
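As one example of the kind of result in there: if C = inv(A), the reverse pass updates Abar by -C' * Cbar * C'. A hypothetical helper:

function inv_adjoint!(Abar, C, Cbar)
    Abar .-= C' * Cbar * C'  # accumulate the adjoint of A, given C = inv(A) and its adjoint Cbar
    Abar
end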
Of course, you can also probably do this in the overload approach.
I’d argue that this is another point in favor of the struct-of-arrays data layout. For all of these, you’d want to take advantage of the optimized linear algebra routines on doubles both for calculating the values and again for calculating the adjoints. It’s faster not to have to first copy everything into an array of doubles.
I’d sign up myself! =)
Here is a paper (by Goto, whose GotoBLAS turned into OpenBLAS, and van de Geijn). I need to study this paper more myself.
So far, all I’ve been doing is optimizing micro kernels and crossing my fingers that I don’t get memory throttled too badly (and dispatching to MKL for some operations, like multiplying matrices larger than 60x60 or so).
My applications have been more along the lines of fitting a model on a million small data sets, rather than fitting one model on a very large data set. Meaning scaling with size of the inputs hasn’t been a major concern, but it’s something I plan to start addressing.
It’s Julia’s take on expression templates. Unfortunately, they don’t have fusion of matrix multiplication yet. I figured I may try to implement it by having a different operator (eg, \times, although that may be interpreted as a cross product) return an expression template.
julia> x = collect(1:10); x'
1×10 LinearAlgebra.Adjoint{Int64,Array{Int64,1}}:
1 2 3 4 5 6 7 8 9 10
julia> y = fill(3, 5); y'
1×5 LinearAlgebra.Adjoint{Int64,Array{Int64,1}}:
3 3 3 3 3
julia> sin.(x .- y') .+ exp(0.4)
10×5 Array{Float64,2}:
0.582527 0.582527 0.582527 0.582527 0.582527
0.650354 0.650354 0.650354 0.650354 0.650354
1.49182 1.49182 1.49182 1.49182 1.49182
2.3333 2.3333 2.3333 2.3333 2.3333
2.40112 2.40112 2.40112 2.40112 2.40112
1.63294 1.63294 1.63294 1.63294 1.63294
0.735022 0.735022 0.735022 0.735022 0.735022
0.5329 0.5329 0.5329 0.5329 0.5329
1.21241 1.21241 1.21241 1.21241 1.21241
2.14881 2.14881 2.14881 2.14881 2.14881
julia> Meta.@lower sin.(x .- y') .+ exp(0.4)
:($(Expr(:thunk, CodeInfo(
@ none within `top-level scope`
1 ─ %1 = Base.adjoint(y)
│ %2 = Base.broadcasted(-, x, %1)
│ %3 = Base.broadcasted(sin, %2)
│ %4 = exp(0.4)
│ %5 = Base.broadcasted(+, %3, %4)
│ %6 = Base.materialize(%5)
└── return %6
))))
julia> bc = Base.broadcasted(+, Base.broadcasted(sin, Base.broadcasted(-, x, y')), exp(0.4))
Base.Broadcast.Broadcasted(+, (Base.Broadcast.Broadcasted(sin, (Base.Broadcast.Broadcasted(-, ([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], [3 3 … 3 3])),)), 1.4918246976412703))
julia> typeof(bc)
Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{2},Nothing,typeof(+),Tuple{Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{2},Nothing,typeof(sin),Tuple{Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{2},Nothing,typeof(-),Tuple{Array{Int64,1},LinearAlgebra.Adjoint{Int64,Array{Int64,1}}}}}},Float64}}
julia> Base.materialize(bc)
10×5 Array{Float64,2}:
0.582527 0.582527 0.582527 0.582527 0.582527
0.650354 0.650354 0.650354 0.650354 0.650354
1.49182 1.49182 1.49182 1.49182 1.49182
2.3333 2.3333 2.3333 2.3333 2.3333
2.40112 2.40112 2.40112 2.40112 2.40112
1.63294 1.63294 1.63294 1.63294 1.63294
0.735022 0.735022 0.735022 0.735022 0.735022
0.5329 0.5329 0.5329 0.5329 0.5329
1.21241 1.21241 1.21241 1.21241 1.21241
2.14881 2.14881 2.14881 2.14881 2.14881
The macro Meta.@lower reveals that the “dots” get translated into calls to the Base.broadcasted function, which creates the lazy expressions. materialize is the equivalent of Eigen’s .eval(), turning them into an array. There is also materialize!, which stores the results into a pre-allocated array.
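For example, reusing bc from above and writing into a preallocated array:

julia> out = Matrix{Float64}(undef, 10, 5);

julia> Base.materialize!(out, bc);  # evaluates the fused expression into out, allocating nothing new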
I think this makes it simple to choose when the expression is evaluated, and which operations get fused by making it a part of the language’s semantics.
So, to implement it similarly in Stan, you’d have to lower the broadcasts into calls that build up the Eigen expressions (perhaps also adding rep_(row_)vector/rep_matrix, as appropriate) with a .eval() at the end.
Speaking of virtual functions – I believe a virtual function call in C++ is much faster than a dynamic dispatch in Julia (where you’d try to eliminate all dynamic dispatches in performance-critical code), but can these be devirtualized and inlined by a real compiler, or is that not feasible?
Is it because a full history of every vari object must exist to be able to perform the reverse pass at the end?
As is, the old information would be destroyed. Hence, the need to copy upon setting to preserve it.
In that case, a static (immutable) type makes sense. It should still support linear algebra, broadcasting, and elementwise operations (including vectorized distribution syntax).
Most uses of vectors/arrays could probably be expressed with immutable arrays, although I haven’t done a survey of models.
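A minimal sketch with StaticArrays: an immutable vector still supports broadcasting and elementwise operations, returning new values instead of mutating.

using StaticArrays

v = @SVector [1.0, 2.0, 3.0]  # immutable, stored inline
w = exp.(2 .* v .+ 1)         # broadcasting returns a new SVector; no mutation needed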
@bbbales2 Thanks, when I’m back in town I’ll rerun that benchmark and update the results.
However, the minimum, median, and mean times (out of 27 samples) were all pretty similar, at 172, 193, and 192 milliseconds, respectively. That doesn’t look suspicious, or what I’d imagine a linearly increasing runtime to look like. I’d expect the minimum to be much lower than the mean and median in this case. The maximum also isn’t much higher here.
Reclaiming memory may still boost performance, though, so please disregard the old results until I rerun them.