After seeing larger-than-expected performance differences, I think we should definitely leave things as they are for the current release. I wonder whether we have made optimisations to Stan Math in the meantime that benefit the non-threaded version more. My impression is that the SIR example runs quite a bit faster than it did a while back.
As suggested on the respective PR, here is the behavior which I think we should implement for CmdStan:
- We build the prebuilt parts of CmdStan both with and without threads.
- If the user does not define STAN_THREADS, we simply auto-detect it from the model. That is relatively easy for the moment, since it is merely a grep for “map_rect” or “reduce_sum”. If either is found in the model, you get threading; otherwise you don't.
- Whenever the user defines STAN_THREADS, they get what they ask for.
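
To make the proposed rule concrete, here is a minimal sketch of the decision logic in Python. It is not CmdStan code: the function name `use_threads` and the idea of passing the user's STAN_THREADS setting as an optional boolean (`None` meaning "not defined") are assumptions made purely for illustration.

```python
from typing import Optional

# The two function names mentioned above whose presence implies threading.
THREADED_FUNCTIONS = ("map_rect", "reduce_sum")


def use_threads(model_code: str, stan_threads: Optional[bool] = None) -> bool:
    """Decide whether the threaded build should be used (hypothetical helper).

    stan_threads is None when the user did not define STAN_THREADS;
    any explicit value overrides the auto-detection.
    """
    if stan_threads is not None:
        # The user asked for something specific, so they get it.
        return stan_threads
    # Plain substring search, equivalent to a simple grep on the model file.
    return any(name in model_code for name in THREADED_FUNCTIONS)
```

Because this is a plain text search like the grep, a commented-out call would also switch threading on, and variants such as `reduce_sum_static` match as well.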
Still, we should keep exploring the option of getting rid of the non-threading stuff entirely, to make our lives as developers easier. We just need a bit more data and time to make that decision. I can imagine that with better compilers (RTools 4.0) this question will need to be revisited anyway.
So from my perspective 2.23 can go out as is (WITHOUT merging the PR that makes STAN_THREADS the default).
EDIT: sorry I wanted to say that we DO NOT merge the PR for now…