… I think we need to revise our test platforms to get threading properly tested. We just broke threading support under clang in develop due to this part not being tested. This is due to a complicated situation:
Travis only has a super old Linux Ubuntu LTS 14.04 which causes a ton of problems => we can test things with g++, but not with clang (which needs extra hacks which we do not ship at the moment with stan-math as its beyond me to reliably configure the build system to detect systems which need this fix apparently needed for 14.04).
our Jenkins runs with a pre-2016 clang Apple compiler which does not support the C++11 thread_local keyword. Hence, we cannot test on the Jenkins box any of the threading stuff.
So much for the background. A solution from my angle would be to use on the Jenkins box an open-source clang which does support the thread_local keyword. My suggestion here would be to use the so-called clang4 as used for CRAN binaries for macosx. Would others agree on that this is a good idea to switch the target compiler on macos to the clang4?
On an additional note I wonder what our policy is for supported compilers. That is, with the threading I am running into all sorts of walls as travis shipping too old distributions, compiler having bugs, etc… It is really hard for me to digest all this and even harder to get a handle on systems which I do not have access to. I have started to download virtualbox to have a chance to understand this, but honestly, I do not really like to go down that route unless we decide that’s needed. I mean, it is a lot to expect from me to juggle virtualbox LTS images in order to run tests - my opinion.
Threading is super important, but this feature does not fly on all compilers/os combinations equally well. How can we deal with that in a way that makes development easy for us?
Any thoughts here would be helpful. The goal is to find a good solution for developers and users.
It sounds like you’re getting bogged down with this. I can prioritize helping testing this over my pet projects (sigh…) once the next conference is over (April 30th). I agree that what we need to finalize the PR is to stop moving the target (discovering new broken configurations that “might” be important).
On Linux would mean covering: stable and latest supported red-hat, stable and latest supported Ubuntu b/c of clusters and desk-tops respectively. Even clusters with old red-hat installs seem to usually allow you to select an up-to-date compiler. I tend to use odd distributions for personal stuff (Gentoo and Arch) but those are usually more up-to-date than Ubuntu/Red-Hat so this is a non-issue there.
For Windows we’re pretty well tied to whatever Rtools distributes
For OS X given the trouble the native compiler has given us we could just go with whatever is current on Homebrew if we link/maintain decent install instructions.
We can always evolve our target and fix things if we accidentally cut out users we think we should be supporting but hidden requirements are just going to kill progress on any deep changes like this.
ubuntu trusty is too old for threading. It requires weird hacks which can make it work, but we decided against including these. trusty is from April 2014 and super old.
Yeah, this threading stuff is taking a serious amount of time by now. There is good news: It looks like that I can test on travis on osx with the Xcode 9.3 clang compiler suite. That is a recent enough Apple clang which supports thread_local.
Having gone through a lot of trouble here is what I suggest for the threading:
The code will by default (no threading) stick to an implementation which should give no performance penalty. This will use static variables of a class and as such gives us external linkage. This is essentially the old behavior (which suffers from being prone against the static initialization order catastrophe).
When threading is turned on I am suggesting to use the apparently standard C++ approach to let a static function return a reference to a static variable declared inside that function. This will incur some performance penalty (we have seen up to ~15%, but it really depends on the model). On the upside is that this is the standard way to implement thread-safe singletons. Hence, we can expect that this sort of code is well supported by the compilers. From messing with all of this we have learned the hard way that deviations from it easily kill portability.
To make this happen, I am affraid that we will need to define a macro to access the stack. This macro definition will be different depending on threads being used or not. I hope everyone is OK with this.
I was just saying in the meeting that it could be worth investigating what I think is essentially @ChrisChiasson’s idea presented here: https://github.com/stan-dev/math/pull/840#issuecomment-382127187 . The only thing this buys us is the lack of a macro, and it could potentially have a performance penalty although the way he suggests it, we wouldn’t face that in anything Stan or even RStanArm. But I won’t have much time to devote to it and Bob doesn’t care much about the macro, so I’m open to just doing it that way too.
Ah. So you mean we always call a context() function, but for the non-threading case this just returns the global and in the threading case we return the function local static?
I had that idea as well. It would look very odd…but could do it. Is that how you understood it as well?
Yeah - but I haven’t had time to dig into what it would actually look like. So if you think a macro would be cleaner feel free to go with that, or experiment and choose one, either way.