Proposal for profiling Stan models

This would be super cool!

As a start, we should output the size of the AD tree somewhere, which we do not do right now as I recall. The size of the AD tree is a good measure of how expensive a Stan program is. That could be added to the cmdstan diagnose output very quickly.

A few thoughts on what I would do differently to what you proposed:

  • don’t require start / stop. Instead, let users declare a named profiling object whose scope determines the range of code being captured, basically an RAII style of working. Maybe make the object go out of scope at the end of each block, should that be possible.
  • Either use the names of the objects as suggested, or use the Stan model line number as the ID of the profiling block. I would prefer automatic names, at least as a default.

I really think we need this to help users navigate their performance issues.


As a Stan user, I definitely want some sort of profiling.

Basically an RAII style of working. Maybe make the object go out of scope at the end of each block, should that be possible.

Fair, but RAII doesn’t exist as a concept in the Stan language. I think you’re right as far as implementation but I don’t know what the Stan syntax would look like.

Maybe:

profile("profile_name") {
  ...
}

Maybe:

{
  profile("profile_name");
  ...
}

I think the second is better because then we can put them at the top of any block and just profile that block. I guess a downside to the start/stop style is that currently in Stan we can’t put a function call before a declare-define, though I suppose that limitation could change.

This reminds me of this library: https://jmrosinski.github.io/GPTL/

That was a very handy way to profile things.

I guess the behavior of these profilers needs to be defined with respect to threads and processes? Or maybe that was in the design doc. I assume we want them to be as fine-grained as possible (not automatically lumping together all threads).


I said that with waaay too much confidence. I don’t know anything about languages or compilers lol, just for clarity.


Ok, it seems like we have dislikes on the start/stop. I was probably thinking too much of the C++ level implementation, which does not have to be the same at stan-level. My bad.

This one would probably be easier to implement and to me it’s easier to read what belongs to a certain “profile”:

Profiling each line gets messy with loops, and handling loops separately would complicate the stanc implementation. One problem this approach has that start/stop functions do not is variable scope. Example:

profile("cov_exp") {
  matrix[N, N] cov = cov_exp_quad(x, alpha, rho) + diag_matrix(rep_vector(1e-10, N));
}
matrix[N, N] L_cov = cholesky_decompose(cov); // cov does not exist in this scope

You could define cov before the scope, but that makes profiling more invasive to the model.

I did think about this; I did not want to throw too many details into this preliminary post. We would use the thread ID + profile section ID as the key to the map that holds the profiling information. std::map allows concurrent access as long as distinct threads access distinct, already-inserted elements; concurrent insertion still needs synchronization.

For MPI, I think we should just profile the rank 0 process. Otherwise we need to receive profiling info from the other processes at the end, and I am not sure that brings much additional value.

Just throwing one additional idea:

transformed parameters {
  vector[N] f;
  {
    profile("cov+chol");
    matrix[N, N] cov = cov_exp_quad(x, alpha, rho) + diag_matrix(rep_vector(1e-10, N));
    matrix[N, N] L_cov = cholesky_decompose(cov);

    profile("mul");
    f = L_cov * f_tilde;
  }
}

model {
  profile("model");
  rho ~ gamma(25, 4);
  alpha ~ normal(0, 2);
  f_tilde ~ normal(0, 1);

  k ~ poisson_log(f);
}

A section extends from the profile call to the next profile call or the end of the current block. Or is that difficult to document/explain?

EDIT: On second thought, this is probably very close to Ben’s second example.

RAII is a C++ idiom: the start/stop calls are encapsulated inside the constructor and destructor of a manager class.

class profiler {
  int id_;

 public:
  explicit profiler(int id) : id_(id) {
    start_profiling(id_);
  }
  ~profiler() {
    stop_profiling(id_);
  }
};

Then you create a local profiler object at the start of the block to be profiled. Profiling ends automatically when the object goes out of scope. It ensures you always call stop_profiling, even if an exception is thrown in the middle. The sampler can recover from exceptions so proper exception handling is a must.

On the Stan language side, I don’t think profiled sections need explicit names. The compiler can insert the line numbers automatically and that’s enough.
Also, there is no need to mark the entire model block for profiling; that should be on by default. Hmm, how much overhead does profiling have? Maybe you could just profile every local block when profiling is enabled, so no special markers are needed?


Nice, I really like that. Plus the exception point is a good one yeah.

Minimal: it basically calls the chrono timer, checks the stack size, and that is it.
So you mean like this? (the profiled sections are labeled in comments)

transformed parameters { // section line W
    vector[N] f;
    matrix[N, N] cov;
    matrix[N, N] L_cov;
    { // section line X
        cov = cov_exp_quad(x, alpha, rho) + diag_matrix(rep_vector(1e-10, N));
        L_cov = cholesky_decompose(cov);
    }
    { // section line Y
        f = L_cov * f_tilde;
    }
}

model { // section line Z
    rho ~ gamma(25, 4);
    alpha ~ normal(0, 2);
    f_tilde ~ normal(0, 1);

    k ~ poisson_log(f);
}

I personally like the names.

Easier to keep thinking organized.

If the high-level blocks were automatically instrumented, those already have names, so I suppose we don’t need to supply any.

So I guess we have two ways of doing this:

  • profile local blocks with no names (using line numbers)
  • the profile("name") { ... } solution

In both cases we would profile entire model/tr. params/tr. data/gq blocks.

For the first solution we could optionally add names with the annotation idea that has been circulating around. Something like

@label("name")
{

}

But that obviously makes the first implementation sub-optimal, as we would only get names if this annotation thing ever comes around. Or we hold off on this until it does.


In that case I think it makes sense to expose a start(label) / stop(label) version of this in a beta branch somewhere so we can work out exactly what the I/O looks like, how well it works, etc.


I am fine with labels, but requiring start and stop sounds error-prone. I would prefer to have blocks define the regions of what is profiled (I mean any block, not just the model blocks).

I agree. I just don’t want this hung up on a language feature. We can do whatever before there’s a real release.

Sure. We have needed such a feature for a while.

… this is why I would really suggest adding a printout of the AD tree size to the diagnose output of cmdstan. That should be a very good measure of model efficiency: the smaller the AD tree, the better the model is using efficient representations.

That would be very quick to do, I think.


What do you mean by diagnose output? Do you mean the diagnose method or just the standard output/csv?

bernoulli diagnose …

so I guess the diagnose method in your terms… though we could also include it right after “gradient evaluation took … s” in the startup output of every model.

I don’t mind where it goes, but we need it. Maybe @mitzimorris has an opinion on this?


thanks Sebastian, of course I have an opinion! first off, agreed, this would be very cool.

At the interface level, this looks like what CmdStan calls a “method” - currently one of {sample, optimize, variational, diagnose, generated quantities}.

For the API, this should be added to the stan::services package - the best name would be profile. As Sebastian notes, this would look a lot like diagnose - cf. https://github.com/stan-dev/stan/tree/develop/src/stan/services/diagnose


Thanks everyone for the very valuable feedback! It seems that we agree that this would be useful.

I will make a test branch of cmdstan with stanc3 + math to test this. The stanc3 parts will be done in whatever way is easiest/fastest to implement. Like Ben said, the first focus will be “if it works” and I/O. Then the language API can get optimized.

It does make sense to have a method “profile” for the bigger profiling mode @rok_cesnovar wants to have.

I am suggesting that we add, in a “light” way, a very valuable output - the AD stack size - to the outputs we have now. This would be easy and quick to implement. How about we revise the current message printed when sampling, which reads

Gradient evaluation took 1.5e-05 seconds
1000 transitions using 10 leapfrog steps per transition would take 0.15 seconds.
Adjust your expectations accordingly!

toward something even more useful, such as

Gradient evaluation with XXX terms took 1.5e-05 seconds
1000 transitions using 10 leapfrog steps per transition would take 0.15 seconds.
Adjust your expectations accordingly!

I mean, this message is intended as a quick and easy way to profile things - but it leaves out the number of terms, which I would suggest adding.

Does that make sense to everyone?

… and we should also change from CPU time in terms of clock cycles to wall time here (as well as in the outputs).


Yes. That is ready, just needs a review: https://github.com/stan-dev/stan/pull/2922

I think it would be helpful if there were a way to see which math operations contributed to the size of the AD stack, to guide model refactoring. If I know that some expression accounts for a large proportion of the AD stack, then I can think about how to factor out terms to avoid redundant computations.
