Proposal for profiling Stan models

Hi all,

this is an initial proposal for profiling Stan models to identify bottlenecks in the gradient evaluation in the model/transformed parameters blocks and in the function evaluations in the generated quantities and transformed data blocks (though the latter two are hardly ever bottlenecks).

I will explain my idea with an example, which will hopefully spark some discussion so we can hash out whether I am missing something really obvious.

A user would specify profiling sections inside a model with calls to start_profiling and stop_profiling, which would be supplied the ID of the section. In this example the IDs are integer values, but we could also make the section labels strings.

data {
  int<lower=1> N;
  real x[N];
  int k[N];
}

parameters {
  real<lower=0> rho;
  real<lower=0> alpha;
  vector[N] f_tilde;
}

transformed parameters {
  vector[N] f;
  {
    start_profiling(0);
    matrix[N, N] cov =   cov_exp_quad(x, alpha, rho) + diag_matrix(rep_vector(1e-10, N));
    matrix[N, N] L_cov = cholesky_decompose(cov);
    stop_profiling(0);

    start_profiling(1);
    f = L_cov * f_tilde;
    stop_profiling(1);
  }
}

model {
  start_profiling(2);
  rho ~ gamma(25, 4);
  alpha ~ normal(0, 2);
  f_tilde ~ normal(0, 1);

  k ~ poisson_log(f);
  stop_profiling(2);
}

If the model is compiled with the stanc flag --profile, the profiling would get compiled to the C++ level; without the flag, the start_profiling and stop_profiling calls would not be written to the C++ code at all.

If profiling is used, then after sampling completes the profiling output would either get printed to standard output or written to a file (format TBD). An example of the output could be something like:

id,forward_pass_time,backward_pass_time,stack_used 
0,12.5,30.6,1000
1,6.5,12.7,1200

We could print the forward and backward passes of the autodiff separately, just the cumulative time, or both. Users will probably only care about the cumulative time, while the separate times would be interesting to developers. We could additionally supply other information, like the amount the autodiff stack grew between the start and stop calls. Any other ideas on what could be profiled are obviously welcome.

There is a preliminary math implementation here: https://github.com/stan-dev/math/pull/1902
This currently assumes that start_profiling and stop_profiling each add a vari to the autodiff stack so we can profile the backward pass. This is a bit wasteful if the profiled operations don’t use vars, but it only adds 2*num_of_profile_ids varis to the autodiff stack, which is probably not that bad. The alternative would be to have separate calls for use in transformed data/generated quantities, but I am not sure it’s really worth the extra maintenance cost.
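
To make the backward-pass idea concrete, here is a rough sketch (all names here are made up for illustration and are not the API from the PR): a marker vari whose chain() runs when the reverse pass reaches its position on the stack, so one marker at start and one at stop bracket the section. Note that the reverse pass visits them in the opposite order, so the stop marker fires first:

#include <chrono>
#include <iostream>
#include <stan/math/rev/core.hpp>

// Hypothetical bookkeeping hook; a real implementation would store the
// timestamp in a per-id structure instead of printing it.
static void record_reverse_timestamp(int id, bool is_start) {
  auto t = std::chrono::steady_clock::now().time_since_epoch().count();
  std::cout << "profile " << id << (is_start ? " start " : " stop ") << t << "\n";
}

// A vari whose only job is to run when the reverse pass reaches its
// position on the autodiff stack; constructing it pushes it onto the stack.
class profile_marker_vari : public stan::math::vari {
  int id_;
  bool is_start_;
 public:
  profile_marker_vari(int id, bool is_start)
      : stan::math::vari(0.0), id_(id), is_start_(is_start) {}
  void chain() { record_reverse_timestamp(id_, is_start_); }
};

start_profiling(id) would then essentially just do new profile_marker_vari(id, true); the arena allocator owns the memory, so nothing needs to be freed explicitly.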

The stanc3 and cmdstan interface details are waiting for things to get hashed out.

Open questions:

  • Would this be useful to Stan users?
  • Does the Stan model API look reasonable?
  • Any suggestions for the names used?
  • Any comments on the Math implementation?

I think it would be interesting to users, as it’s generally hard to identify the bottleneck in a model without deep knowledge of the Math library. This would help them avoid optimizing or parallelizing a part of their model that represents something like 10% of the entire execution time.

p.s.:

We don’t need to focus on whether it’s worth the “resources” for someone to work on this. My PhD committee has told me to add some stuff that requires this type of profiling, so some variation of this will be made regardless of what we decide here. It’s just a question of whether we want to make a version suitable for merging into the Math/stanc3 repositories and whether we would want to expose it to Stan users.

13 Likes

This would be super cool!

As a start, we should output somewhere the size of the AD tree, which we do not do right now as I recall. The size of the AD tree is a good measure of how expensive a Stan program is. That could be added to the cmdstan diagnose output very quickly.

A few thoughts on what I would do differently to what you proposed:

  • Don’t require start/stop. Instead, just let users declare a profiling object which they name, with the scope of the object determining the range of things being captured. Basically RAII-style. Maybe make the object go out of scope at the end of each block, should that be possible.
  • Either use the names of the objects as suggested, or use the Stan model line number as the ID of the profiling block. I would prefer automatic names, at least as a default.

I really think we need this to help users navigate their performance issues.

2 Likes

As a Stan user, I definitely want some sort of profiling.

Basically RAII-style. Maybe make the object go out of scope at the end of each block, should that be possible.

Fair, but RAII doesn’t exist as a concept in the Stan language. I think you’re right as far as the implementation goes, but I don’t know what the Stan syntax would look like.

Maybe:

profile("profile_name") {
  ...
}

Maybe:

{
  profile("profile_name");
  ...
}

I think the second is better because then we can put them at the top of any block and just profile that block. I guess a downside to start/stop is that currently in Stan we can’t put a function call before a define-declare, though I suppose that limitation could change.

This reminds me of this library: GPTL timing library Home Page

That was a very handy way to profile things.

I guess the behavior of these profilers needs to be defined with respect to threads and processes? Or maybe that was in the design doc. I assume we want them to be as fine-grained as possible (not automatically lumping together all threads).

2 Likes

I said that with waaay too much confidence. I don’t know anything about languages or compilers lol, just for clarity.

1 Like

Ok, it seems like we have dislikes on the start/stop. I was probably thinking too much of the C++ level implementation, which does not have to be the same at stan-level. My bad.

This one (the profile("name") { ... } version) would probably be easier to implement, and to me it’s easier to read what belongs to a certain “profile”.

Profiling each line gets messy with loops, and handling loops separately would complicate the stanc implementation. One problem that this approach has, but start/stop functions do not, is variable scope. Example:

profile("cov_exp") {
matrix[N, N] cov =   cov_exp_quad(x, alpha, rho) + diag_matrix(rep_vector(1e-10, N));
}
matrix[N, N] L_cov = cholesky_decompose(cov); // cov does not exist in this scope

You could define cov before the scope, but that makes profiling more invasive to the model.

I did think about this; I just did not want to throw too many details into this preliminary post. We would use the thread ID + profile section ID as the key to the map that will hold this profiling information. std::map is thread-safe as long as distinct threads access distinct elements.
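
As a sketch of that bookkeeping (names here are illustrative, not the actual implementation): while distinct threads can safely access distinct existing elements, concurrent insertion of new keys into a std::map is not safe, so insertion needs a guard:

#include <chrono>
#include <cstddef>
#include <map>
#include <mutex>
#include <thread>
#include <utility>

// Hypothetical per-section record: accumulated time and stack growth.
struct profile_info {
  std::chrono::steady_clock::time_point start;
  std::chrono::steady_clock::duration total{};
  std::size_t stack_growth = 0;
};

// Keyed by (section id, thread id) so each thread only ever reads and
// writes its own entries once they exist.
using profile_key = std::pair<int, std::thread::id>;
static std::map<profile_key, profile_info> profiles;
static std::mutex profiles_mutex;  // guards insertion of new keys only

profile_info& get_profile(int id) {
  std::lock_guard<std::mutex> guard(profiles_mutex);
  return profiles[{id, std::this_thread::get_id()}];
}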

For MPI, I think we should just profile the rank 0 process. Otherwise we would need to receive profiling info from the other processes at the end, and I am not sure that brings much additional value.

Just throwing one additional idea:

transformed parameters {
  vector[N] f;
  {
    profile("cov+chol");
    matrix[N, N] cov =   cov_exp_quad(x, alpha, rho) + diag_matrix(rep_vector(1e-10, N));
    matrix[N, N] L_cov = cholesky_decompose(cov);

    profile("mul");
    f = L_cov * f_tilde;
  }
}

model {
  profile("model");
  rho ~ gamma(25, 4);
  alpha ~ normal(0, 2);
  f_tilde ~ normal(0, 1);

  k ~ poisson_log(f);
}

The section is defined from the profile call to the next profile call or the end of the current block. Or is that difficult to document/explain?

EDIT: On second thought, this is probably very close to Ben’s second example.

RAII is a C++ idiom. It means the start/stop logic is encapsulated inside the constructor and destructor of a manager class:

class profiler {
  int id_;
 public:
  profiler(int id) : id_(id) {
    start_profiling(id_);
  }
  ~profiler() {
    stop_profiling(id_);
  }
};

Then you create a local profiler object at the start of the block to be profiled. Profiling ends automatically when the object goes out of scope. It ensures you always call stop_profiling, even if an exception is thrown in the middle. The sampler can recover from exceptions, so proper exception handling is a must.
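
A usage sketch with the class above (cholesky_decompose is just an arbitrary example of an operation that can throw):

#include <Eigen/Dense>
#include <stan/math.hpp>

// Profiling starts at construction and stops when p goes out of scope,
// even if cholesky_decompose throws on a non-positive-definite matrix.
void profiled_section(const Eigen::MatrixXd& cov) {
  profiler p(0);
  Eigen::MatrixXd L_cov = stan::math::cholesky_decompose(cov);
}  // ~profiler() runs here and calls stop_profiling(0)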

On the Stan language side, I don’t think profiled sections need explicit names. The compiler can insert the line numbers automatically, and that’s enough.
Also, there is no need to mark the entire model block for profiling; that should be on by default. Hmm, how much overhead does profiling have? Maybe you could just profile every local block when profiling is enabled, so no special markers are needed?

1 Like

Nice, I really like that. Plus the exception point is a good one yeah.

Minimal: it basically calls the chrono timer, checks the stack size, and that is it.
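Something like this sketch, roughly (names are hypothetical; autodiff_stack_size() stands in for however the Math library exposes the current stack size):

#include <chrono>
#include <cstddef>

// Hypothetical stand-in for a query of the current autodiff stack size.
static std::size_t autodiff_stack_size() { return 0; }

static std::chrono::steady_clock::time_point section_start;
static std::size_t stack_at_start;

void start_profiling_sketch(int id) {
  (void)id;
  stack_at_start = autodiff_stack_size();
  section_start = std::chrono::steady_clock::now();
}

void stop_profiling_sketch(int id) {
  auto elapsed = std::chrono::steady_clock::now() - section_start;
  std::size_t growth = autodiff_stack_size() - stack_at_start;
  // ... accumulate elapsed and growth under this id ...
  (void)id; (void)elapsed; (void)growth;
}
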
So you mean like this? (the profiled sections are labeled in comments)

transformed parameters { // section line W
    vector[N] f;
    matrix[N, N] cov;
    matrix[N, N] L_cov;
    { // section line X
        cov =   cov_exp_quad(x, alpha, rho) + diag_matrix(rep_vector(1e-10, N));
        L_cov = cholesky_decompose(cov);
    }
    { // section line Y
        f = L_cov * f_tilde;
    }
}

model { // section line Z
    rho ~ gamma(25, 4);
    alpha ~ normal(0, 2);
    f_tilde ~ normal(0, 1);

    k ~ poisson_log(f);
}

1 Like

I personally like the names.

Easier to keep thinking organized.

If the high-level blocks were automatically instrumented, those already have names, so I suppose we don’t need to supply any.

So I guess we have two ways of doing this:

  • profile local blocks with no names (using line numbers)
  • the profile("name") { ... } solution

In both cases we would profile entire model/tr. params/tr. data/gq blocks.

For the first solution we could optionally add names with the annotation idea that has been circulating around. Something like
@label("name")
{
  ...
}

But that obviously makes the first implementation sub-optimal, as we would only get names if this annotation thing ever comes around. Or we could hold off on this until it does.

1 Like

In that case I think it makes sense to expose a start(label)/stop(label) version of this in a beta branch somewhere, so we can work out exactly what the I/O looks like, how well it works, etc.

2 Likes

I am fine with labels, but requiring start and stop sounds error-prone. I would prefer to have blocks define the regions of what is profiled (I mean any block, not just the model blocks).

I agree. I just don’t want this hung up on a language feature. We can do whatever before there’s a real release.

Sure. We have needed such a feature for a while now.

… this is why I would really suggest adding a printout of the AD tree size to the diagnose output of cmdstan. That should be a very good measure of model efficiency: the smaller the AD tree, the better the model is using efficient representations.

That would be very quick to do, I think.

2 Likes

What do you mean by diagnose output? Do you mean the diagnose method or just the standard output/csv?

bernoulli diagnose …

So I guess the diagnose method, in your terms… though we could also include it right after “Gradient evaluation took … s” in the startup output of every model.

I don’t mind where it goes, but we need it. Maybe @mitzimorris has an opinion on this?

1 Like

Thanks Sebastian, of course I have an opinion! First off, agreed, this would be very cool.

At the interface level, this looks like what CmdStan calls a “method” - currently one of {sample, optimize, variational, diagnose, generated quantities}.

For the API, this should be added to the stan::services package - the best name would be profile. As Sebastian notes, this would look a lot like diagnose - cf. https://github.com/stan-dev/stan/tree/develop/src/stan/services/diagnose

1 Like

Thanks everyone for the very valuable feedback! It seems that we agree that this would be useful.

I will make a test branch of cmdstan with stanc3 + math to test this. The stanc3 parts will be done in a way that is easiest/fastest to implement. Like Ben said, the first focus will be “if it works” and the I/O. Then the language API can get optimized.

It does make sense to have a method “profile” for the bigger profiling mode @rok_cesnovar wants to have.

I am suggesting that we add, in a “light” way, a very valuable output - the AD stack size - to the outputs we have now. This would be easy and quick to implement. How about we revise the current message printed when sampling starts, which reads

Gradient evaluation took 1.5e-05 seconds
1000 transitions using 10 leapfrog steps per transition would take 0.15 seconds.
Adjust your expectations accordingly!

to something even more useful, such as

Gradient evaluation with XXX terms took 1.5e-05 seconds
1000 transitions using 10 leapfrog steps per transition would take 0.15 seconds.
Adjust your expectations accordingly!

I mean, this message is intended to serve as a quick and easy way to profile things - but it leaves out the information on the number of terms, which I would suggest adding in there.

Does that make sense to everyone?

… and we should also change from CPU time in terms of clock cycles to wall time here (as well as in the outputs).
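
For anyone wondering about the difference, a minimal sketch: std::clock measures processor time (summed across threads on most platforms), while std::chrono measures elapsed wall time, which is what users actually wait for:

#include <chrono>
#include <ctime>

int main() {
  std::clock_t c0 = std::clock();
  auto t0 = std::chrono::steady_clock::now();
  // ... run the sampler here ...
  // CPU time: can exceed wall time for multi-threaded models.
  double cpu_s = static_cast<double>(std::clock() - c0) / CLOCKS_PER_SEC;
  // Wall time: actual elapsed time.
  double wall_s = std::chrono::duration<double>(
                      std::chrono::steady_clock::now() - t0).count();
  (void)cpu_s; (void)wall_s;
  return 0;
}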

2 Likes

Yes. That is ready, just needs a review: Replace clock_t with std::chrono by rok-cesnovar · Pull Request #2922 · stan-dev/stan · GitHub