Has anyone tried coding a profiler for Stan programs?

@yizhang asked this question and I thought I would put up the question and answer here so people can find it.

No. There’s a good reason. First, the only block that takes a while is the model block, unless there’s an error like an infinite loop or something really bad in transformed data, transformed parameters, or generated quantities. (Think <5% of total time in those blocks, maybe down to <1% for complicated models.)

Second, speeding up the model block doesn’t necessarily translate into increased n_eff / time or even reduced total wall time! Yes, it’s a good measurement to have, but it’s not the right target when trying to speed up a model.

Third, almost all of the time is spent computing the log joint density and its gradient with respect to the parameters. There once was a time when building up the expression graph (computing the function value) and applying the chain rule to that graph (computing all the gradients) were separate sweeps, so measuring the two would have given a good estimate of how much time was spent computing the gradient. Not anymore. Now we greedily compute the partials needed for the adjoints in the chain rule when we can reuse computations from evaluating the function. This makes the time to compute the value longer, but greatly decreases the time to compute the gradients. So, it’s tricky.

Hopefully that explains some of the reason why we haven’t tried to build one generally. If you want to do it, by all means, you have my support. If you have any more questions, fire away.


If there’s no mixing, n_eff / time is zero, so there’s no speedup to be had. Otherwise, speeding up the evaluation of the log density speeds up sampling by the same factor. Do you mean that’s only part of the issue and there’s also mixing to consider?

The transformed parameters are evaluated as part of the log density evaluation, so they’re hard to separate computationally from the rest of the model block.

Yes! (I typed that response on my phone in a few minutes, so it wasn’t really spelled out.)