Stanc3 optimization and analyses walkthrough during StanCon

Hey all,

@enetsee wants to talk about the optimization and analyses passes in stanc3 with @Matthijs sometime during StanCon, hoping to get a walk through of what the code is intending to do (especially for lazy code motion). I’m also interested in attending if I can make it. Anyone else interested in this topic?



I’m in, obviously!

Me and @tadej are also interested, even if its probably above our “paygrade”. The issue is that we will only be at StanCon on Thursday & Friday. But no worries if others would rather schedule it on the tutorial days.

1 Like

Will this material be made available online / elsewhere?

I’m very interested, but wont be at StanCon.

I have been slowly working on ProbabilityModels.jl, a Julia library that provides a DSL for fitting with NUTS. It is light on features, and basic control flow like if/else and loops are a ways away (but I plan on at least supporting broadcasting soon), but the focus has been on high performance.
The example in the README sampled something like 20 times the effective samples per second, so I am interested in all the optimization work going on.

1 Like

There’s no optimization you’d be seeing yet, haha. You may be more interested in the Stan automatic differentiation paper for looking into why Stan is so fast.

A clarification: the Julia library was faster than Stan.
I am using source to source auto diff, and tried to ensure as much of the computation is amenable to SIMD as possible.

Unless I’m mistaken, Stan’s autodiff causes autovectorizers a lot of problems. For example, an array (or Eigen data structure) of vars would interleave the actual data with pointers, making it harder to efficiently load vectors of data into registers.
Then there’d have to be some tricky to get SIMD on the reverse path. Not sure how much of that is handled with methods overloaded on vars.

Eigen itself is bad at vectorizing small matrix multiplication, even when all they hold is simple doubles (at least for avx512).
I benchmarked my Julia library against Eigen, Blaze, and Intel MKL JIT here for all combinations M = 3,…,32, N = 3,…32 for C = A * B, where C is MxN, A is Mx32, and B is 32xN.
Eigen’s statically sized arrays were far behind the other three libraries. It was closer to gfortran's matmul intrinsic.

My library isn’t doing any real optimizations either right now, so I would be interested in learning about any that are planned, and figuring out what is possible.
A couple ideas I have:

  1. Don’t just do a forward and then reverse pass for gradients. Reorder computation so that parts of the reverse pass more closely follow the associated parts of the first pass when it’s possible, to try and get more done with the associated memory while it is still hot in the cache.
  2. Try and find a way to eliminate unnecessary computations like the log(sigma) necessary for calculation a normal pdf, where sigma is a simple real[lower=0] (meaning sigma = exp(unconstrained_sigma)); ideally the normal density could access the original constrained parameter directly.

You raise some interesting items for discussion, though this thread is about a specific meeting at Stan Con tomorrow (it will be 1-2pm btw). That meeting is about stanc3 compiler optimizations, which for the moment doesn’t do much w.r.t. Eigen or anything specific to the Stan Math backend.

I’m confused about what you’re saying and you may also be confused about what Stan is doing under the hood, haha. Take a look at the autodiff paper, I think it’s the best source on some of this but it also doesn’t talk about everything you mention. For the rest I’d suggest starting a new thread about the Math library and algorithms