following up on discussion during the 4/16 math meeting -
there’s currently a problem with the CmdStan makefiles in that the compiler flag -DSTAN_THREADS must have the same value when compiling the model as when compiling the rest of CmdStan and the stan libraries, because the autodiff implementation is different if STAN_THREADS is on or off. If there’s a mismatch, then model compilation will fail with a linker error.
Two options:
We could change the makefiles so that -DSTAN_THREADS is always on - this will be a win for users who use reduce_sum and map_rect.
We could eliminate use of STAN_THREADS altogether and just go with threading everywhere.
When this was first implemented, it made sense to have the STAN_THREADS flag because this was an experimental feature. Are we now more confident that this is a good thing?
Option #2 would make life much easier for CmdStan, CmdStanR, and CmdStanPy.
Is there a way to check how CmdStan was built? If not, perhaps we should inject that information from the build process into the executable so it can tell us at build time.
I think that’d help… if we were able to stop the build process before starting to compile by checking flags. That way we could trap it and put out a reasonable message.
Do we know the consequences of this? Things I’m thinking about:
is this going to be slower for some users? And under what conditions?
will this prevent a subset of users from being able to install?
are there any systems where this introduces more installation problems? I’m remembering the days when getting threading working was non-trivial; could some mismatch of a system library cause issues?
we found very few models which stress the ad tape so much that these are 2-3% slower when threading is on. However, these are models which are fast anyway as all these models do is fill the ad tape rather really computing things. Thus the slowdown affects models which anyway run in seconds.
I think the current implementation is very robust since we use the __thread keyword which is around for a very long time and is thus well supported by compilers. That change even allows windows users with 4.9.3 g++ to run threaded.
I am not aware of system library issues…and again we use a very old compiler feature which has been around very long.
I am all for making Stan threads the only way there is. Given that reduce sum is now there, the risk to confuse users and frustrating them is worthwhile the minimal downside to slow down only minimally very few models. That’s a judgement call, of course! I am for threads only.
Edit: one thing to check again is the speed penalty on windows. Maybe the old g++ was not ideal there which is hopefully resolved with the new g++ 8 from rtools 4.
I am also for turning threads on by default for the following reasons (most were already written by other):
There are no differences in what libraries are built/linked. So no changes in the installation. We already build and link TBB and we are past the initial TBB hiccups with Windows makefiles and CI and all. It wasnt an issue for other OS.
the slowdown was only observed on really fast models where a 2% slowdown is not noticeable (at least that is my personal judgement call)
simplifies the interface for users. I suspect reduce_sum could be quite popular. And this is really the largest feature (the only really important one) of 2.23 I think its worth doing this.
The only regression we would observe is with single threaded reduce_sum/map_rect. At least on Windows. Because threaded reduce_sum/map_rect with 1 thread is slower than by using “sequential” reduce_sum/map_rect without threading. But as I dont know why anyone would try to use reduce_sum/map_rect with a single thread (outside of debugging their model), this seems more of a non-issue. Performance in debugging should not be a priority imo.
I also want to point out that if a user wants to turn threading off for some reason they can always do that with
I would like to do this for 2.23 by merging this PR https://github.com/stan-dev/cmdstan/pull/862
Do we wait for the Stan meeting on Thursday or just go with it if no one objects?
After seeing larger than expected performance differences, I think we should just leave things as they are for the current release for sure. I wonder if we have done some optimisations to Stan Math in the meantime which did lead to greater benefits for the non-threaded version. My impression is that the SIR example runs quite a bit faster than it used to a while back.
As suggested on the respective PR, here is the behavior which I think we should implement for CmdStan:
we build the pre-build stuff of CmdStan with and without threads
if the user does not define STAN_THREADS, then we simply auto-detect it from the model. That is relatively easy for the moment since this is merely a grep for “map_rect” or “reduce_sum”. If these are found in the model, then you get threading otherwise not.
Whenever the user defines STAN_THREADS, then he gets what he asks for
Yet, we should keep exploring the option to get fully rid of the non-threading stuff for the purpose of easing our lives as developers. It’s just that we need a bit more data and time to make that decision. I can imagine that with better compilers (RTools 4.0) this question should anyway be revisited.
So 2.23 can go out as is (WITHOUT that PR merged which makes STAN_THREADS a default) from my perspective.
EDIT: sorry I wanted to say that we DO NOT merge the PR for now…
agree that we shouldn’t merge this PR for the 2.23 release. that will give us time to investigate completely, also to work out the best way to set up defaults and overrides and document them properly.