Future rstan is broken with clang++

@ariddell is this the same problem we have with httpstan/pystan?

I think so, yes.

So that linking stuff was the root cause!!!

I just compiled with clang++ (Mojave, Apple clang) and I am happily sampling with very little RAM use the mvn_orange model which was compiled relatively fast:

> post  <- sampling(sm, init=list(init), data=data, seed=1234, chains=1)

SAMPLING FOR MODEL 'mvn_orange' NOW (CHAIN 1).
Chain 1: 
Chain 1: Gradient evaluation took 0.000115 seconds
Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 1.15 seconds.
Chain 1: Adjust your expectations accordingly!
Chain 1: 
Chain 1: 
Chain 1: Iteration:    1 / 2000 [  0%]  (Warmup)
Chain 1: Iteration:  200 / 2000 [ 10%]  (Warmup)
Chain 1: Iteration:  400 / 2000 [ 20%]  (Warmup)
Chain 1: Iteration:  600 / 2000 [ 30%]  (Warmup)
Chain 1: Iteration:  800 / 2000 [ 40%]  (Warmup)
Chain 1: Iteration: 1000 / 2000 [ 50%]  (Warmup)
Chain 1: Iteration: 1001 / 2000 [ 50%]  (Sampling)
Chain 1: Iteration: 1200 / 2000 [ 60%]  (Sampling)
Chain 1: Iteration: 1400 / 2000 [ 70%]  (Sampling)
Chain 1: Iteration: 1600 / 2000 [ 80%]  (Sampling)
Chain 1: Iteration: 1800 / 2000 [ 90%]  (Sampling)
Chain 1: Iteration: 2000 / 2000 [100%]  (Sampling)
Chain 1: 
Chain 1:  Elapsed Time: 9.74904 seconds (Warm-up)
Chain 1:                1.01898 seconds (Sampling)
Chain 1:                10.768 seconds (Total)
Chain 1: 
> 

I pushed stuff to its own branch, see below. @bgoodri note that I have commented out things I did not understand, so there could be changes which you do not want to take over. What is currently not super efficient is the stan_fit model class handling (a few more than necessary copies are done)ā€¦ but anyway. The key is really to link in the stan services statically with the model. Then it works just fine. Letā€™s cross fingers that this still works for packages like rstanarm with multiple modelsā€¦ to separate out everything as needed I had to make a virtual stan_fit_base class and trick Rcpp a little bit to play with it.

@ahartikainen & @ariddell I do think that PyStan has the same problems. This would explain the really weird crashes which we saw.

This issue was really hard to get to the bottom to. @bgoodri let me know if you need any further explanations. Looking forward to 2.21.1 rstan!

Branch:

3 Likes

This sounds good. The diff is kind of large because there are a lot of unrelated changes due to clang-format or something. Can you undo the formatting changes until after we are sure we have figured everything out?

One thing I am worried about is whether sampling in parallel is safe? The way it works now

  1. If we are in a terminal, parallel::mclapply is called, which does forking, but calls the command function as many times as the cores argument
  2. Otherwise, if cores > 1, parallel::parLapplyLB is called, which does sockets, but still calls the command function as many times as the cores argument

If the rstan shared object is statically linked into the userā€™s shared object, then wouldnā€™t the multiple processes be trying to use the same global variables? If so, and perhaps even if not, we could handle the chain parallelism from C++ using TLS, which I think @ariddell is planning to do for PyStan soon anyway.

I donā€™t run clang-format over it. The diff is large because you used to keep class declaration and class definition in the same file. I had to separate that out for stan_fit class.

The C++ changes are really just about separating declaration & definition of stan_fit and making sure we statically link the services.

I donā€™t think that technically anything changed wrt to parallel sampling. So if it worked before then it should work now. However, we can also just be safe and turn on STAN_THREADS by default. I recall that we have only a ~2% performance loss due to that (depending on the model it can be less or a bit more - heavier models tend to have a smaller loss while fast models a bit moreā€¦so it doesnā€™t matter in practice, I think).

Turning on STAN_THREADS for the mvn_orange model gives me 11.1s total sampling time, which is basically the same as the run without and actually within noise.

EDIT: Actually, there is one more consideration for turning on STAN_THREADS permanentlyā€¦as we pre-compile the services, this pre-compiles the AD stack as well. Thus, the choice of STAN_THREADS is made at the time-point of compiling rstan. So rstan users cannot anymore turn on STAN_THREADS or not - it is on or not, depending on how it was installed.

OK, I was able to add ?w=1 to the URL for the diff to ignore all the formatting changes due to whatever it was that changed them. I guess I should just pull this into for_2.20 and we can work on it more from there.

I agree it does not crash when sampling in parallel, but I still donā€™t understand. When we were compiling header-only, the dynamic shared object got loaded into four isolated R sessions. Now, that happens but they all have the same copy of rstan.so linked into them that is allocating the global stuff.

If RStan links to TBB and the userā€™s model statically links to RStan, does the userā€™s model need to link to TBB separately or is it transitive?

Do we have a plan for how this is going to work on Windows where all linking is dynamic? We only have to worry about g++, so maybe it will work for whatever reason g++ was working on Linux.

No. Since we link the services into each model I donā€™t think that something changesā€¦ but given that STAN_THREADS does not cost you performance (from what I have seenā€¦maybe double check Windows) and it is not anymore a choice after rstan is compiled, I think we should turn STAN_THREADS on and thatā€™s it.

For Windows: Maybe it all just works, donā€™t know. The services are linked in just like any other .o file. Note that I havenā€™t updated the Makevars.win.

Sure. Do that. Note that I created a rstan_next include dir to avoid filename clashes with stan_fit.hpp. Feel free to change that so that it makes sense.

When we compile the users model, we give the linker the -ltbb flag which tells it that the tbb symbols can be resolved. The TBB then just needs to be loaded in the context of R and all is good, I think.

The TBB must be working just fine, since the global AD stack would otherwise not be initialized. TBB code is required to make that work.

Do you know what to do about this warning:

In file included from file54782f6766ab.cpp:7:
In file included from /home/ben/r-devel/library/Rcpp/include/Rcpp.h:46:
/home/ben/r-devel/library/Rcpp/include/Rcpp/XPtr.h:31:5: warning: delete called on ā€˜rstan::stan_fit_baseā€™ that is abstract but has non-virtual destructor [-Wdelete-non-virtual-dtor]
delete obj;
^
/home/ben/r-devel/library/Rcpp/include/Rcpp/XPtr.h:46:26: note: in instantiation of function template specialization ā€˜Rcpp::standard_delete_finalizerrstan::stan_fit_baseā€™ requested here
void Finalizer(T*) = standard_delete_finalizer,
^

We may need to register a finalizer as mentioned in

I would try to make the destructor virtual.

It appears we do not have an explicit destructor. Do we need to create one and free memory manually?

On Windows, the models actually seem to compile, but then the linker says

cannot find -ltbb

RcppParallelLibs() says

-LC:/Users/Stan/Documents/R/win-library/3.6/RcppParallel/lib/x64 -ltbb -ltbbmalloc

but dir("C:/Users/Stan/Documents/R/win-library/3.6/RcppParallel/libs/x64") says

[1] ā€œRcppParallel.dllā€ ā€œsymbols.rdsā€

So, are we supposed to pass -lRcppParallel to get TBB?

Never mind about that. The RcppParallel package puts the TBB stuff under the lib directory rather than libs. Although when I fix that, I get a different linker error

file1ebc12ba81e.o:file1ebc12ba81e.cpp:(.text$_ZN4Rcpp13Constructor_2I17stan_model_holderN5rstan2io21rlist_ref_var_contextEjE7get_newEPP7SEXPRECi[_ZN4Rcpp13Constructor_2I17stan_model_holderN5rstan2io21rlist_ref_var_contextEjE7get_newEPP7SEXPRECi]+0x1ba): undefined reference to `rstan::stan_fit::stan_fit(Rcpp::XPtr<stan::model::model_base, Rcpp::PreserveStorage, &(void Rcpp::standard_delete_finalizerstan::model::model_base(stan::model::model_base*)), false>, int)ā€™
collect2.exe: error: ld returned 1 exit status

That is fixed by commit 194c9177 . So, we are down to the destructor thing and that @ssp3nc3r says that the for_2.20 branch does not fix the Catalina problems.

Edit: And the fact that Windows now eats all the RAM with the mvn_orange demo.

I am a bit unsure why the ChainableStack thing is part of command rather than part of the userā€™s model class. Suppose someone does some AD stuff with StanHeaders but not RStan like in the vignette

https://cran.r-project.org/web/packages/StanHeaders/vignettes/stanmath.html

where do they do ChainableStack?

The ChainableStack stuff is a left-over which I just removed. It should not be necessary in client code if everything works out. I had it put there to make sure I do not need to worry about it.

The warning about the destructor should be straightforward to fix. I will look into it later today.

I also just reworked the compile logic for windows. So the makefiles now also build for windows the services to a static library which is linked into the model binary during linking of the model. I donā€™t quite see why that is not possible. I mean, the services can be linked together statically with the model and then we load dynamically the whole thing into R. I am not able to test it right now as I donā€™t have a virtual windows available here; can do that later.

What do you think about turning on STAN_THREADS as a default? If you do not want that, then you would have to compile the stan services twice - once with and once without it. The compilation calls would then need to know what the user wants to select the right services to link to. Unless we still find a real obstacle we should just turn it on; thatā€™s what I would do.

Can the Rstan package as is still support the old packages on CRAN? Or is this the next headache to sort out?

I think setting STAN_THREADS by default is fine. We do need to try to make it as simple as possible for rstanarm, RBesT, etc. to work with rstan >= 2.21 but I think their header-only C++ will continue to compile and work when rstan 2.21 is released on CRAN. It is just when they regenerate their C++ code using rstan 2.21 that C++ code with be CRTP and then they will have to change their Makevars{.win} before they can upload to CRAN.

I will look into the Windows thing. The documentation

https://cran.r-project.org/doc/manuals/r-release/R-exts.html#Windows

refers to DLLs, whereas above for non-Windows they recommend static libraries. But I guess the RStudio people have static libraries for RcppParallel on Windows, so maybe it can work.

Canā€™t we strongly encourage everyone to move to the rstantools setup? This way the make stuff will be generated from rstantools in the standard case. This way the make could even change with a new Rstan release and the R package source would still work just fine as long as rstantools is available in a newer version.

I can spin on a windows virtual machine later to verify things work unless you sort that out before me.

We should then issue a -DSTAN_THREADS in the plugins.R for all user models and also make that define for building Rstan itself. Thatā€™s really the easiest and safest choice to make here.

Yeah, I donā€™t think there are any packages that come with compiled models that never used rstantools. The rstantools on CRAN now is a lot different than what it used to be, but there is a function to mostly convert and old package to a new format and it can set the Makevars{.win} appropriately. So, I am not too worried about those packages.

I got it to work and not eat all the RAM on Windows with a few fixes that I have pushed to the branch.

2 Likes

Whow! Windows worksā€¦ thatā€™s really magic.

Whatā€™s left to be done? As far as I understand everything is working, rightā€¦ and thatā€™s a lot as it includes fast compilation & the TBB.

Should we turn on linking against the tbbmalloc_proxy on MacOS? That gives a significant performance boost (~20%) for any stan program (including single-core runs). Itā€™s just that there should be an option to turn it on or off, I thinkā€¦ so maybe we leave that for another day. Thoughts?

@ahartikainen & @ariddell : PyStan+httpstan is almost certainly suffering from the very same issues. So in short, you have to link the model statically with the services and load that into the running process. The services must not be loaded already. Otherwise reference to the global AD tape is messed up by the linker. Very annoying.

I would say go ahead and put in whatever stuff like -DSTAN_THREADS and tbbmalloc_proxy you think would be good.

What remains is that exceptions are still messed up on Catalina. And a lot of testing.

Okā€¦ so we hard-code STAN_THREADS for all platforms and the tbbmalloc_proxy for macOSā€¦ fine for me. That corresponds to the default in cmdstan. I will add that then.

I can upgrade to Catalinaā€¦ which takes Mojave away from me, but it could be helpful to get a sense of the misery. I am not really keen to do so given all the hiccups we are seeing, but wellā€¦