Stanc3 optimization and analyses walkthrough during StanCon

seantalts · August 14, 2019, 2:43pm

Hey all,

@enetsee wants to talk about the optimization and analyses passes in stanc3 with @Matthijs sometime during StanCon, hoping to get a walk through of what the code is intending to do (especially for lazy code motion). I’m also interested in attending if I can make it. Anyone else interested in this topic?

-Sean

Matthijs · August 15, 2019, 8:20am

I’m in, obviously!

rok_cesnovar · August 16, 2019, 7:20am

Me and @tadej are also interested, even if its probably above our “paygrade”. The issue is that we will only be at StanCon on Thursday & Friday. But no worries if others would rather schedule it on the tutorial days.

chriselrod · August 18, 2019, 5:20pm

Will this material be made available online / elsewhere?

I’m very interested, but wont be at StanCon.

I have been slowly working on ProbabilityModels.jl, a Julia library that provides a DSL for fitting with NUTS. It is light on features, and basic control flow like if/else and loops are a ways away (but I plan on at least supporting broadcasting soon), but the focus has been on high performance.
The example in the README sampled something like 20 times the effective samples per second, so I am interested in all the optimization work going on.

seantalts · August 19, 2019, 5:37am

There’s no optimization you’d be seeing yet, haha. You may be more interested in the Stan automatic differentiation paper for looking into why Stan is so fast.

chriselrod · August 22, 2019, 10:09am

A clarification: the Julia library was faster than Stan.
I am using source to source auto diff, and tried to ensure as much of the computation is amenable to SIMD as possible.

Unless I’m mistaken, Stan’s autodiff causes autovectorizers a lot of problems. For example, an array (or Eigen data structure) of vars would interleave the actual data with pointers, making it harder to efficiently load vectors of data into registers.
Then there’d have to be some tricky to get SIMD on the reverse path. Not sure how much of that is handled with methods overloaded on vars.

Eigen itself is bad at vectorizing small matrix multiplication, even when all they hold is simple doubles (at least for avx512).
I benchmarked my Julia library against Eigen, Blaze, and Intel MKL JIT here for all combinations M = 3,…,32, N = 3,…32 for C = A * B, where C is MxN, A is Mx32, and B is 32xN.
Eigen’s statically sized arrays were far behind the other three libraries. It was closer to gfortran's matmul intrinsic.

My library isn’t doing any real optimizations either right now, so I would be interested in learning about any that are planned, and figuring out what is possible.
A couple ideas I have:

Don’t just do a forward and then reverse pass for gradients. Reorder computation so that parts of the reverse pass more closely follow the associated parts of the first pass when it’s possible, to try and get more done with the associated memory while it is still hot in the cache.
Try and find a way to eliminate unnecessary computations like the log(sigma) necessary for calculation a normal pdf, where sigma is a simple real[lower=0] (meaning sigma = exp(unconstrained_sigma)); ideally the normal density could access the original constrained parameter directly.

seantalts · August 22, 2019, 10:25am

You raise some interesting items for discussion, though this thread is about a specific meeting at Stan Con tomorrow (it will be 1-2pm btw). That meeting is about stanc3 compiler optimizations, which for the moment doesn’t do much w.r.t. Eigen or anything specific to the Stan Math backend.

I’m confused about what you’re saying and you may also be confused about what Stan is doing under the hood, haha. Take a look at the autodiff paper, I think it’s the best source on some of this but it also doesn’t talk about everything you mention. For the rest I’d suggest starting a new thread about the Math library and algorithms

Topic		Replies	Views
Stan SIMD & Performance Algorithms	23	4506	January 23, 2020
Stanc3 question: Vectorization of loops Developers	6	503	January 28, 2020
Integrating GPU support Developers	23	5473	January 31, 2017
Addressing Stan speed claims in general Stan Governance	81	13840	January 17, 2021
Any way to make Stan competitive with Tensorflow for maximum likelihood? Algorithms	43	6249	July 31, 2019

Stanc3 optimization and analyses walkthrough during StanCon

Related topics