A clarification: the Julia library was faster than Stan.

I am using source-to-source autodiff, and I tried to ensure that as much of the computation as possible is amenable to SIMD.

Unless I’m mistaken, Stan’s autodiff causes autovectorizers a lot of problems. For example, an array (or Eigen data structure) of `var`s would interleave the actual data with pointers, making it harder to efficiently load vectors of data into registers.

Then there’d have to be some trickery to get SIMD on the reverse pass. I’m not sure how much of that is handled by methods overloaded on `var`s.

Eigen itself is bad at vectorizing small matrix multiplication, even when the matrices hold plain doubles (at least with AVX-512).

I benchmarked my Julia library against Eigen, Blaze, and Intel MKL JIT here for all combinations M = 3,…,32 and N = 3,…,32 of `C = A * B`, where C is MxN, A is Mx32, and B is 32xN.

Eigen’s statically sized arrays were far behind the other three libraries; it was closer to `gfortran`’s `matmul` intrinsic.

My library isn’t doing any real optimizations right now either, so I would be interested in learning about any that are planned, and in figuring out what is possible.

A couple ideas I have:

- Don’t just do a full forward pass followed by a full reverse pass for gradients. Where possible, reorder the computation so that parts of the reverse pass follow their associated parts of the forward pass more closely, to get more done with the associated memory while it is still hot in the cache.
- Try to find a way to eliminate unnecessary computations, like the `log(sigma)` needed to calculate a normal pdf when sigma is a simple `real<lower=0>` (meaning sigma = exp(unconstrained_sigma)); ideally the normal density could access the original unconstrained parameter directly.