Hi all,
this is an initial proposal on profiling Stan models to identify bottlenecks in the gradient evaluation in the model/transformed parameters blocks and in the function evaluations in the generated quantities and transformed data blocks (though the latter two are hardly ever bottlenecks).
I will explain my idea with an example, which will hopefully spark some discussion so we can hash out whether I am missing something really obvious.
A user would mark profiling sections inside a model with calls to start_profiling and stop_profiling, each of which takes the ID of the section. In this example the IDs are integers, but we could also use strings as section labels.
data {
  int<lower=1> N;
  real x[N];
  int k[N];
}
parameters {
  real<lower=0> rho;
  real<lower=0> alpha;
  vector[N] f_tilde;
}
transformed parameters {
  vector[N] f;
  {
    start_profiling(0);
    matrix[N, N] cov = cov_exp_quad(x, alpha, rho) + diag_matrix(rep_vector(1e-10, N));
    matrix[N, N] L_cov = cholesky_decompose(cov);
    stop_profiling(0);
    start_profiling(1);
    f = L_cov * f_tilde;
    stop_profiling(1);
  }
}
model {
  start_profiling(2);
  rho ~ gamma(25, 4);
  alpha ~ normal(0, 2);
  f_tilde ~ normal(0, 1);
  k ~ poisson_log(f);
  stop_profiling(2);
}
If the model is compiled with a stanc flag --profile, the profiling would get compiled to the C++ level; without the flag, the start_profiling and stop_profiling calls would not be written to the C++ level.
If profiling is used, then after sampling completes the profiling output would either be printed to standard output or written to a file (format TBD). The output could look something like this:
id,forward_pass_time,backward_pass_time,stack_used
0,12.5,30.6,1000
1,6.5,12.7,1200
We could print the forward and backward pass times of the autodiff separately, just the cumulative time, or both. Users will probably only care about the cumulative time, while the separate times would be interesting to developers. We could also report other information, such as how much the autodiff stack grew between the start and stop calls. Any other ideas on what could be profiled are obviously welcome.
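To make the output format concrete, here is a minimal C++ sketch, assuming the per-section totals are collected in a map keyed by section ID. The `ProfileInfo` struct and the `format_profiles` function are hypothetical names used only for illustration and are not taken from the Math PR; the columns simply mirror the example header above.

```cpp
#include <cstddef>
#include <map>
#include <sstream>
#include <string>

// Hypothetical per-section totals (illustration only, not the Math API).
struct ProfileInfo {
  double forward_pass_time = 0.0;   // accumulated forward-pass time
  double backward_pass_time = 0.0;  // accumulated backward-pass time
  std::size_t stack_used = 0;       // autodiff stack growth in the section
};

// Render the collected profiles as CSV, one row per section ID,
// in ascending ID order (std::map iterates in key order).
std::string format_profiles(const std::map<int, ProfileInfo>& profiles) {
  std::ostringstream out;
  out << "id,forward_pass_time,backward_pass_time,stack_used\n";
  for (const auto& entry : profiles) {
    const ProfileInfo& p = entry.second;
    out << entry.first << ',' << p.forward_pass_time << ','
        << p.backward_pass_time << ',' << p.stack_used << '\n';
  }
  return out.str();
}
```

The same string could be sent to standard output or to a file, so the choice of destination would not need to affect how the numbers are collected.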
There is a preliminary math implementation here: https://github.com/stan-dev/math/pull/1902
This currently assumes that start_profiling and stop_profiling each add a vari to the autodiff stack so that we can profile the backward pass. This is a bit wasteful if the profiled operations don't use vars, but it only adds 2*num_of_profile_ids varis to the autodiff stack, which is probably not that bad. The alternative would be to have separate calls for use in transformed data/generated quantities, but I am not sure it's really worth the extra maintenance cost.
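The idea of timing the backward pass through extra nodes on the stack can be sketched with a toy tape. This is a simplified illustration under my own assumptions, not the code in the PR: `ToyTape`, `Profile` and `grad` are hypothetical stand-ins, and each tape entry is just a callback run during the reverse sweep. Because the sweep visits nodes newest-first, the node pushed by stop_profiling starts the backward timer and the node pushed by start_profiling stops it.

```cpp
#include <chrono>
#include <cstddef>
#include <functional>
#include <map>
#include <vector>

using Clock = std::chrono::steady_clock;

// Hypothetical per-section timings, in seconds.
struct Profile {
  double forward_time = 0.0;
  double backward_time = 0.0;
  Clock::time_point forward_start;
  Clock::time_point backward_start;
};

// Toy stand-in for an autodiff stack: each node is a reverse-pass callback.
struct ToyTape {
  std::vector<std::function<void()>> nodes;
  std::map<int, Profile> profiles;

  void start_profiling(int id) {
    profiles[id].forward_start = Clock::now();
    // Visited last for this section in the reverse sweep:
    // stop the backward timer.
    nodes.push_back([this, id] {
      Profile& p = profiles[id];
      std::chrono::duration<double> dt = Clock::now() - p.backward_start;
      p.backward_time += dt.count();
    });
  }

  void stop_profiling(int id) {
    Profile& p = profiles[id];
    std::chrono::duration<double> dt = Clock::now() - p.forward_start;
    p.forward_time += dt.count();
    // Visited first for this section in the reverse sweep:
    // start the backward timer.
    nodes.push_back([this, id] { profiles[id].backward_start = Clock::now(); });
  }

  // Reverse sweep: run the callbacks from the newest node to the oldest.
  void grad() {
    for (std::size_t i = nodes.size(); i-- > 0;) {
      nodes[i]();
    }
  }
};
```

In this sketch every profiled section costs exactly two extra nodes, whether or not anything between the two calls touches the tape, which is the overhead discussed above.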
The stanc3 and cmdstan interface details will wait until the design is hashed out.
Open questions:
- would this be useful to the Stan users?
- does the Stan model API look reasonable?
- any suggestions for the names used?
- any comments on the Math implementation?
I think it would be interesting to users, as it is generally hard to identify the bottleneck in a model without deep knowledge of the Math library. This would help them avoid optimizing or parallelizing a part of their model that accounts for only something like 10% of the entire execution time.
p.s.:
We don’t need to focus on whether it’s worth the “resources” for someone to work on this. My PhD committee has told me to add some stuff that requires this type of profiling, so some variation of this will be made regardless of what we decide here. It’s just a question of whether we want to make a version that would be suitable to merge into the Math/Stanc3 repositories and whether we would want to expose it to Stan users.