Proposal for profiling Stan models

I think it would be helpful if there were a way to see which math contributes to the size of the AD stack, to guide model refactoring. If I know that some expression accounts for a large proportion of the AD stack, then I can think about how to factor out terms to avoid redundant computations.


That’s really clever. And it’d be great to have something like this.

One hint: we probably don’t want to evaluate the first execution of the log density because that has to do the malloc.

A few questions:

  • What happens if the start and end aren’t matched, as in the start is in one block and the end is in another?

  • What happens if both start and end are in a loop?

  • How would we profile transforms due to constraints?

  • How would we profile operations in generated quantities?

No, because we encapsulate memory for them using RAII :-) But scope does exist, and @bbbales2’s suggestion to treat this as a block would solve the first problem.

I like that idea and it’d be easy to implement with the var on the stack approach (though it’s not quite as simple as a pointer diff because of the growing stacks).

Yes, please.

I don’t think anyone ever worries about that! If someone wants to do something, great.


In both cases those would be empty profile sections, meaning 0 execution time. I was also thinking that maybe those cases could throw runtime errors, as well as a double start (a start on an already running profile section) and a double stop, but I am not sure if that would be helpful or annoying.

That counts towards the same section which is how I would expect a profiler to work. Open to suggestions though.

Good question. Need to test transforms.

The start/stop functions at the C++ level will use a template parameter to decide whether to use the “prim” version (no var and no backward pass) or the “rev” version. That is going to complicate things a bit for user-defined functions, as they will have to pass another template parameter, much like propto for user-defined _lpdf functions.

For now my local implementation does not support using profiling start/stop in UDFs, but I will have to add that before I go live with this.

Good call, yes. Have to think about how to make that work.

I will definitely use names, that is already done. Thanks for the questions!

I think this would be a great feature, and a lot of the proposed options are reasonable for a first implementation. In addition, I would like to propose a somewhat more ambitious goal state: not that it should be implemented right now, but to make sure that some doors are not closed unnecessarily.

In short, I would like Stan to eventually have the same profiling experience as most high-level languages. This would mean:

  • No changes to model code needed for profiling
  • Every statement is automatically instrumented (i.e. each Stan statement gets its own profiling block)
  • Every Stan block (anything within {}) is automatically instrumented
  • Recursion (multiple nested entry-exits in a block) is supported
  • Standardized output to let me use external analysis/visualisation tools (I think the Callgrind format is one of the more commonly used; profilers for Ruby and PHP use it, but it is a somewhat odd format…)
  • Speculatively: allow a “sampling” approach to profiling: instead of actually measuring the time spent in each block (which can have huge overhead and skew results towards frequently called functions), you just keep track of the currently executing block and, every short_time_interval, record which block is currently executing.

Also speculatively: wouldn’t the sampling approach actually be easier to implement? (not sure how the backward pass factors into that though).

Looking forward to being able to use this…


Good point… but isn’t that exactly what we are timing right now? At least I am not sure about this point. We should check that.

Me, too! Though I have to confess I find profilers hard to use and wind up mainly doing what is suggested in this PR even with C++. The problem is that debuggers/tracers are just too invasive to the code, and that they tend to consolidate calls to a function across contexts and are very hard to trace when they don’t.

No, because it requires everything the non-sampling method requires, plus some low-level system magic to do the sampling.

If it’s like traditional programming, the chain() functions are different things to profile than the constructor of the vari. The profiling would presumably be at the particular type level and not instance level.

To get the kind of profiling being suggested here in a traditional profiler, you have to go into the stepwise-debugger-like piece and set profile points. In my experience that is a huge pain compared to just instrumenting the code.

Also, calls to multivariate normal, etc., will be consolidated if this runs like a traditional profiler.

This doesn’t require changes to code, but it does require change to compilation, and having all that recording can negatively impact performance.

Just bumping this. Was there any progress on outputting the AD size? Going forward, if one wanted to visualize the AD tree (similar to what Dask does for compute graphs in Python: https://docs.dask.org/en/latest/graphviz.html ), would that data be available anywhere?


Still living on a branch on my computer. Not prime time ready yet. Still too busy unfortunately. We are able to output the AD size, that will definitely be a part of the final implementation.

I am not sure if you would be able to get info for visualization just from what this thread is proposing. Visualizing all the expressions and dependencies would require some help from the compiler side.


Thanks for the update. Of course the impetus behind the question was for profiling/debugging gradient calculations in Stan code.

There is a diagnostic mode already… it’s not well documented. We could add a few more things there immediately. There is a way to get the gradient at an exact input: use the init file to specify the constrained parameters. If that’d help immediately, please let me know and I can write that up somewhere.

That would help! Basically, understanding where in the gradient most of the calculation time is being spent would help immensely!


So… this will help with identifying problematic gradient computations, but it won’t do what you’re looking for.

For any CmdStan generated executable, you can run:

./program diagnose

with the same arguments for data, seed, etc. This will provide the gradient computed with autodiff and the gradient computed by finite differences. (The finite difference computation might take a long time.) Any place where the difference is really large could be problematic.

To check the gradient for specific values of the parameter space, set up an initialization file with the parameters set to that value.


A design doc for this feature is now live at https://github.com/stan-dev/design-docs/pull/31

Feel free to comment. Thanks!
