The AD stack size (or whatever more adequate statistics) should be a useful quantity for users to quickly diagnose if their attempts to optimize their Stan program were successful.
Would others agree and would it make sense to make those statistics somehow accessible? Or is this already there and I haven’t found it…
I think it’d be useful. It’s available in C++, but we don’t print it anywhere. We could add it as part of the diagnostic mode. (I thought we’d have more in the diagnostic mode, but we haven’t really done much since it was introduced.)
Of course, size alone doesn’t determine it, but it’s a very useful start.
Did you have a suggestion for where to add it? Diagnostic mode is problematic because in high dimensions it essentially hangs due to the finite differences. I'd think the right spot is where we print the time per log density evaluation. And I think it makes sense to report it both as memory allocated (the blocks, which is what it looks like to the OS) and as memory used (how big the expression graph is within that space). I also think this would be a good opportunity to change that message so it no longer tries to anticipate how long the whole run will take, but simply warns the user that the figure is for a single log density and gradient evaluation, of which there may be many per iteration.
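To make the allocated-vs-used distinction concrete, here is a minimal sketch of a block (arena) allocator in the style of the one backing the AD stack: memory is requested from the OS in doubling blocks, while the expression graph only fills a prefix of that space. The class and method names here are illustrative assumptions, not Stan's actual API.

```cpp
#include <cstddef>
#include <vector>

// Sketch of an arena allocator with two separate statistics:
// bytes_allocated() -- what the process looks like to the OS, and
// bytes_used()      -- how much of that the expression graph occupies.
class arena {
  std::vector<std::vector<char>> blocks_;  // blocks requested from the OS
  std::size_t used_ = 0;                   // bytes handed out to AD nodes
  std::size_t next_block_ = 1 << 10;       // first block 1 KB, then doubling
  std::size_t free_in_block_ = 0;          // free bytes left in current block

 public:
  void* alloc(std::size_t n) {
    if (n > free_in_block_) {              // current block exhausted
      while (next_block_ < n) next_block_ *= 2;
      blocks_.emplace_back(next_block_);   // grab a fresh, larger block
      free_in_block_ = next_block_;
      next_block_ *= 2;
    }
    char* p = blocks_.back().data()
              + (blocks_.back().size() - free_in_block_);
    free_in_block_ -= n;
    used_ += n;
    return p;
  }

  // Memory allocated: sum of all block sizes (the OS-visible footprint).
  std::size_t bytes_allocated() const {
    std::size_t total = 0;
    for (const auto& b : blocks_) total += b.size();
    return total;
  }

  // Memory used: how big the expression graph is inside that space.
  std::size_t bytes_used() const { return used_; }
};
```

Printing both numbers at the same spot as the per-evaluation timing would show users that the gap between them is expected (block doubling), not a leak.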
No, I hadn't thought that far. I was just wondering whether others agree that this metric would be useful, and it seems the answer is yes. Reporting it in diagnose mode certainly makes sense, and putting it into the CmdStan startup message may also make sense for the reason you mention, Bob.
The issue with showing this number by default is that it is very technical, and most users will simply be confused by it. On the other hand, the message is purely informational, so it may just spark users' interest in reading a bit more about it in the manual, which would be a good thing.
Maybe we have a short discussion on our next meeting on this?
Personally, I would find it interesting to see the scaling of the AD stack size with the problem size for a given model (which I would have to script myself once this feature is available, of course).
I don’t think it’ll be so confusing if we report it as memory allocated for log density gradient evaluations.
There’s a constant for the prior, then it grows linearly with data (unless you’re doing something crazy like a GP). That linear growth with data has a constant factor based on the structure of the data and how it’s used in the program.
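The constant-plus-linear scaling can be illustrated with a toy reverse-mode tape that just counts nodes (a stand-in for AD stack size). This is a sketch, not Stan's implementation: for a model like `mu ~ normal(0,1); y[n] ~ normal(mu,1)`, the prior contributes a fixed-size subgraph and each data point adds a constant number of nodes.

```cpp
#include <cstddef>
#include <vector>

// Toy reverse-mode "tape" that only records how many expression-graph
// nodes an evaluation would create.
struct tape {
  std::size_t nodes = 0;  // stand-in for the AD stack size
  double add(double a, double b) { ++nodes; return a + b; }
  double mul(double a, double b) { ++nodes; return a * b; }
  double neg_half_sq(double a)   { ++nodes; return -0.5 * a * a; }
};

// Node count for the log density of mu ~ normal(0,1); y[n] ~ normal(mu,1),
// up to constants: one node for the prior, four per data point, so the
// graph size is 1 + 4 * N -- a constant plus linear growth in the data.
std::size_t graph_size(const std::vector<double>& y) {
  tape t;
  double mu = 0.3;                // some parameter value
  double lp = t.neg_half_sq(mu);  // prior: constant-size subgraph
  for (double yn : y)             // likelihood: fixed cost per data point
    lp = t.add(lp, t.neg_half_sq(t.add(yn, t.mul(-1.0, mu))));
  return t.nodes;
}
```

Doubling the data here exactly doubles the linear term, which is the kind of scaling plot one could script against problem size once the statistic is exposed.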