I need to benchmark different variants of my Stan model. The simple approach would be to record the wall-clock time immediately before and after sampling, but I worry that this would be inaccurate, since iterations may be slower before the step size has been automatically tuned during warm-up. If this is not an issue in practice, I would appreciate someone pointing that out.
Instead, I would like to obtain low-level logs of the sampling, including the time at which each draw was made and the step size in use at that point. I could then compute the time intervals using only the portion of the run in which the step size is stable.
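To make the idea concrete, here is a minimal Python sketch of the post-processing I have in mind. It assumes I could somehow get one record per iteration containing a completion time and the step size used. The `IterationRecord` type, the function names, and the example numbers are all hypothetical, and I do not know of an interface that logs per-iteration timestamps in this form.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class IterationRecord:
    """Hypothetical per-iteration log entry (not an actual Stan output format)."""
    timestamp: float  # seconds since the run started, taken when the draw completed
    step_size: float  # step size used for this iteration


def stable_tail_start(records: List[IterationRecord]) -> int:
    """Return the index of the first record in the final run of constant step size."""
    final_step = records[-1].step_size
    start = len(records) - 1
    # Walk backwards while the preceding record has the same step size as the last one.
    while start > 0 and math.isclose(records[start - 1].step_size, final_step):
        start -= 1
    return start


def mean_seconds_per_iteration(records: List[IterationRecord]) -> float:
    """Mean time between consecutive draws, using only the stable-step-size tail."""
    tail = records[stable_tail_start(records):]
    if len(tail) < 2:
        raise ValueError("need at least two iterations with a stable step size")
    return (tail[-1].timestamp - tail[0].timestamp) / (len(tail) - 1)


# Made-up numbers for illustration only: the step size shrinks, then settles.
records = [
    IterationRecord(0.0, 1.0),
    IterationRecord(2.0, 0.5),
    IterationRecord(3.5, 0.25),
    IterationRecord(4.5, 0.125),
    IterationRecord(5.0, 0.125),
    IterationRecord(5.5, 0.125),
    IterationRecord(6.0, 0.125),
]

print(mean_seconds_per_iteration(records))
```

With these made-up records, only the last four draws share the final step size, so the early, slower iterations do not enter the average.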
I am aware that I could let the model warm up to find the step size and then use that as a fixed step size in a second run with zero warm-up iterations. However, because I don't know how many warm-up iterations are needed, this would mean running more iterations than is ideal (the models are very large, so they are slow to fit).
I am happy to use the Python, R, or command-line interface. Any advice?