Stan SIMD & Performance

Is that done in this code here? Feels like a half semester of a C++ optimization class could be spent on Eigen’s product folder haha

I would love for Stan to steal this idea. A consistent, nice shorthand for broadcasting would be awesome.

The rest of what you said above this line is extremely cool but a bit above my pay grade haha. But one other thing about our var type: the base vari inside of it is 24 bytes (two doubles plus the vtable pointer its two virtual functions require). That has always felt very cache-unfriendly, since most modern cache lines are 64 bytes we can only fit 2 whole varis in a line at once. If we could remove those virtual functions we could fit 4 varis in each cache line. @Bob_Carpenter I think you said before you had some scheme for that?
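Quick sanity check on those numbers, with a throwaway sketch (not Stan’s actual vari, just the layout argument): any virtual function drags in a vtable pointer, and that pointer is the extra 8 bytes.

```cpp
#include <cstdio>

// Sketch only: same member layout as the argument above, names invented here.
struct vari_virtual {
  double val_;
  double adj_;
  virtual void chain() {}           // forces a vtable pointer (+8 bytes)
  virtual ~vari_virtual() = default;
};

struct vari_plain {
  double val_;
  double adj_;
  void chain() {}                   // no virtual dispatch, no vptr
};

int main() {
  std::printf("with vptr:    %zu bytes -> %zu per 64-byte cache line\n",
              sizeof(vari_virtual), 64 / sizeof(vari_virtual));
  std::printf("without vptr: %zu bytes -> %zu per 64-byte cache line\n",
              sizeof(vari_plain), 64 / sizeof(vari_plain));
}
```

On a typical 64-bit build that prints 24 bytes (2 per line) versus 16 bytes (4 per line), which is exactly the 2-vs-4 varis-per-cache-line point.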

Bob and I have spoken about this before, and I think the post here talks about adding a static matrix type for that. I never fully understood why the static matrix type would force a copy on setting, though.

I wonder whether, if we made Stan’s stack allocator more general, we could do something like this as well. When I ran heaptrack over one of my Stan programs, I found it spent a lot of time allocating and deallocating the point_ps types when it built the tree.

I think with a bit of weirdness and elbow grease this is feasible in our current setup: something along the lines of using our own stack allocator plus Eigen::Map to construct matrices, instead of going through Eigen’s allocator.
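Rough illustration of what I mean. The toy_arena here is a made-up stand-in for the real stack allocator (the real thing would hand out memory from the autodiff pool); the Eigen::Map part is the actual Eigen API.

```cpp
#include <Eigen/Dense>
#include <cstddef>
#include <vector>

// Hypothetical monotonic arena, purely for the sketch.
struct toy_arena {
  std::vector<double> storage_;
  std::size_t used_ = 0;
  explicit toy_arena(std::size_t n) : storage_(n) {}
  double* alloc(std::size_t n) {
    double* p = storage_.data() + used_;
    used_ += n;  // standard "track space used" bookkeeping
    return p;
  }
};

int main() {
  toy_arena arena(1024);
  // The matrix lives in arena memory; Eigen::Map gives it the full matrix API
  // and Eigen never calls its own allocator.
  Eigen::Map<Eigen::MatrixXd> m(arena.alloc(3 * 4), 3, 4);
  m.setOnes();
  Eigen::Map<Eigen::VectorXd> v(arena.alloc(4), 4);
  v.setLinSpaced(4, 0.0, 3.0);
  Eigen::VectorXd w = m * v;  // behaves like any other Eigen expression
  return static_cast<int>(w.size());
}
```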

Orrr you could help us do some of this! :-)

I tried this as well, though I also don’t think I tried hard enough. I think you need the local allocator for each object to actually point to a global allocator that’s never destroyed, and then do the standard tracking of space used.
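Something like the pattern below is what I have in mind; global_arena and arena_allocator are invented names for this sketch, not anything that exists in Stan Math. Each container carries a cheap allocator handle, but every handle forwards to one process-wide arena that is deliberately leaked so it outlives all of its users.

```cpp
#include <cstddef>
#include <cstdlib>
#include <vector>

struct global_arena {
  std::size_t used_ = 0;  // the "standard tracking of space used"
  void* alloc(std::size_t bytes) {
    used_ += bytes;
    return std::malloc(bytes);  // a real arena would carve from big blocks
  }
  static global_arena& instance() {
    // heap-allocated and never deleted on purpose, so no destruction-order issues
    static global_arena* a = new global_arena();
    return *a;
  }
};

// Minimal std-compatible allocator that just forwards to the global arena.
template <typename T>
struct arena_allocator {
  using value_type = T;
  arena_allocator() = default;
  template <typename U>
  arena_allocator(const arena_allocator<U>&) {}
  T* allocate(std::size_t n) {
    return static_cast<T*>(global_arena::instance().alloc(n * sizeof(T)));
  }
  void deallocate(T*, std::size_t) {}  // arena frees nothing until reset
};

template <typename T, typename U>
bool operator==(const arena_allocator<T>&, const arena_allocator<U>&) { return true; }
template <typename T, typename U>
bool operator!=(const arena_allocator<T>&, const arena_allocator<U>&) { return false; }

int main() {
  std::vector<double, arena_allocator<double>> xs;
  xs.resize(8, 1.0);  // the allocation goes through the global arena
  return static_cast<int>(xs.size());
}
```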

Also goofed around with this; I think we would need to use Eigen::Map if we want to allocate our own memory for Eigen. They don’t have any macros or templates for plugging in your own allocator (that I know of).

Yeah, if we could remove the pimpl idiom we currently use in var -> vari, that would remove a lot of pointer chasing. I’ve tried goofing with this but can’t find a pattern that works well, though I’m sure something out there would. The main thing is dealing with the global AD tape, which holds pointers to the varis inside each var.
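For anyone following along, here’s the shape of the thing being discussed, heavily simplified and not the real headers: var is a thin handle around a vari allocated elsewhere, the tape is a vector of vari pointers, and every access and every chain() call on the reverse pass is a pointer hop.

```cpp
#include <vector>

struct vari {
  double val_;
  double adj_ = 0.0;
  explicit vari(double val) : val_(val) {}
  virtual void chain() {}
  virtual ~vari() = default;
};

// Global AD tape: just remembers every vari created, in order.
static std::vector<vari*> tape;

struct var {
  vari* vi_;  // the pimpl-style indirection: one pointer hop per access
  // Leaked here for brevity; the real varis live in the arena allocator.
  explicit var(double val) : vi_(new vari(val)) { tape.push_back(vi_); }
  double val() const { return vi_->val_; }
  double adj() const { return vi_->adj_; }
};

int main() {
  var a(1.5), b(2.5);
  // Reverse pass: walk the tape backwards, one pointer dereference per node.
  for (auto it = tape.rbegin(); it != tape.rend(); ++it) (*it)->chain();
  return static_cast<int>(a.val() + b.adj());
}
```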

Yes, and it’s excellent! Tadej also worked on something similar for the elementwise functions that’s showing a nice average speedup of about 3% on the performance benchmarks. (Meta: it’s very rare to see PRs that speed up all of the benchmarks, so that is very nice.)