Operands and partials with tuples as edge container?


#1

Hi!

… as I have dealt now a lot with this super cool beast I was wondering if a tuple based design would make sense for the edges. Right now we have 5 edges maximum fixed such that any function using less will have added overhead due to the extra edges never being used. A tuple as edge container should allow to adjust at compile time the number of available edges. Is it worthwhile to think more about this? I mean is a reduced size really relevant for performance?

Sebastian


#2

You’re right—I thought we got rid of that, but it’s still there. In the old design we had, everything was an array, so there wasn’t any wasted space. This seems like we should be specializing to the number of edges we need. Those ops_partials_edge structs are storing double values, so they are 64 bits minimum.


#3

The overhead is not that bad. @seantalts introduced a nice trick with the empty_broadcast_array classes which cut down the overhead per unused edge to a single byte. That leaves you with 5 bytes of wasted storage for the empty configuration. That does not sound like a lot, but as this class is used a lot it could add up.

Before embarking on a large refactor here, it would be good to know if saving those bits is worthwhile. The other advantage of a tuple based design is that you are not any more limited to a fixed number of edges, but could extend just as needed.


#4

I measured the speed between the old and new and found them to be the same, for whatever that’s worth.

I actually started with a tuple-based design when I first attempted this. I think ultimately it’s probably the way to go, but I encountered a lot of issues and found that others had a lot of difficulty understanding tuples and the design based on tuples. That said, now what we hammered out the non-tuple design more thoroughly, it might be a much shorter jump to tuples and save those bytes.

As far as prioritization and return-on-investment go, I would suspect this to be fairly marginal vs. other things that need work in Stan. But it might be worth measuring and seeing how much space we could save in the best-case scenario (i.e. we could assume a tuple-based design saved 4 bytes per instantiation of operands_and_partials for the sake of an upper-bound on memory) for some larger models and see if that buys us anything tangible in memory-land (since it didn’t give us a speed decrease to use the extra memory in the first refactor). Tuples also take longer to compile, for what it’s worth.


#5

Though I also thought the current design didn’t use more memory than the old one… I forget the details though.


#6

I have never seen Stan programs run into memory issues. I think the extra size may lead to worse use of high-speed caches and this could be the critical. It is probably not really worth to spent resources on this at this point as it is a big effort for likely little return if I understand you right. If we haven’t seen a speed decrease with this refactor (which increased the size) then we should be good.

Do we have a facility to easily do performance tests? Currently it is a bit tedious to do as I compile two binaries with redefining the MATH variable in between. All of this could be automated a lot.


#7

Not that I know of - maybe @syclik has done something like this? The only performance test I know of is in the Stan repo and just does a logistic regression 100 (or 1000?) times.

I’m not sure what the facility you’re thinking of looks like, but if there was something that made it easy to actually answer these questions empirically that would probably be pretty high ROI given how much of what we do is in the name of performance.


#8

Yes!


#9

+1!

We used to have performance tests against all the BUGS models. They were pulled out ages ago. Perhaps that was a mistake.

I’ve benchmarked specific things, but haven’t come up with a general framework for benchmarking the Stan C++ code. (Last time we thought about it, the conversation diverged.)