Future rstan is broken with clang++

ahartikainen · November 3, 2019, 10:49am

@ariddell is this the same problem we have with httpstan/pystan?

wds15 · November 3, 2019, 11:31am

I think so, yes.

wds15 · November 3, 2019, 12:33pm

So that linking stuff was the root cause!!!

I just compiled with clang++ (Mojave, Apple clang) and I am happily sampling with very little RAM use the mvn_orange model which was compiled relatively fast:

> post  <- sampling(sm, init=list(init), data=data, seed=1234, chains=1)

SAMPLING FOR MODEL 'mvn_orange' NOW (CHAIN 1).
Chain 1: 
Chain 1: Gradient evaluation took 0.000115 seconds
Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 1.15 seconds.
Chain 1: Adjust your expectations accordingly!
Chain 1: 
Chain 1: 
Chain 1: Iteration:    1 / 2000 [  0%]  (Warmup)
Chain 1: Iteration:  200 / 2000 [ 10%]  (Warmup)
Chain 1: Iteration:  400 / 2000 [ 20%]  (Warmup)
Chain 1: Iteration:  600 / 2000 [ 30%]  (Warmup)
Chain 1: Iteration:  800 / 2000 [ 40%]  (Warmup)
Chain 1: Iteration: 1000 / 2000 [ 50%]  (Warmup)
Chain 1: Iteration: 1001 / 2000 [ 50%]  (Sampling)
Chain 1: Iteration: 1200 / 2000 [ 60%]  (Sampling)
Chain 1: Iteration: 1400 / 2000 [ 70%]  (Sampling)
Chain 1: Iteration: 1600 / 2000 [ 80%]  (Sampling)
Chain 1: Iteration: 1800 / 2000 [ 90%]  (Sampling)
Chain 1: Iteration: 2000 / 2000 [100%]  (Sampling)
Chain 1: 
Chain 1:  Elapsed Time: 9.74904 seconds (Warm-up)
Chain 1:                1.01898 seconds (Sampling)
Chain 1:                10.768 seconds (Total)
Chain 1: 
>

I pushed stuff to its own branch, see below. @bgoodri note that I have commented out things I did not understand, so there could be changes which you do not want to take over. What is currently not super efficient is the stan_fit model class handling (a few more than necessary copies are done)… but anyway. The key is really to link in the stan services statically with the model. Then it works just fine. Let’s cross fingers that this still works for packages like rstanarm with multiple models… to separate out everything as needed I had to make a virtual stan_fit_base class and trick Rcpp a little bit to play with it.

@ahartikainen & @ariddell I do think that PyStan has the same problems. This would explain the really weird crashes which we saw.

This issue was really hard to get to the bottom to. @bgoodri let me know if you need any further explanations. Looking forward to 2.21.1 rstan!

Branch:

bgoodri · November 3, 2019, 2:23pm

This sounds good. The diff is kind of large because there are a lot of unrelated changes due to clang-format or something. Can you undo the formatting changes until after we are sure we have figured everything out?

One thing I am worried about is whether sampling in parallel is safe? The way it works now

If we are in a terminal, parallel::mclapply is called, which does forking, but calls the command function as many times as the cores argument
Otherwise, if cores > 1, parallel::parLapplyLB is called, which does sockets, but still calls the command function as many times as the cores argument

If the rstan shared object is statically linked into the user’s shared object, then wouldn’t the multiple processes be trying to use the same global variables? If so, and perhaps even if not, we could handle the chain parallelism from C++ using TLS, which I think @ariddell is planning to do for PyStan soon anyway.

wds15 · November 3, 2019, 2:59pm

I don’t run clang-format over it. The diff is large because you used to keep class declaration and class definition in the same file. I had to separate that out for stan_fit class.

The C++ changes are really just about separating declaration & definition of stan_fit and making sure we statically link the services.

I don’t think that technically anything changed wrt to parallel sampling. So if it worked before then it should work now. However, we can also just be safe and turn on STAN_THREADS by default. I recall that we have only a ~2% performance loss due to that (depending on the model it can be less or a bit more - heavier models tend to have a smaller loss while fast models a bit more…so it doesn’t matter in practice, I think).

Turning on STAN_THREADS for the mvn_orange model gives me 11.1s total sampling time, which is basically the same as the run without and actually within noise.

EDIT: Actually, there is one more consideration for turning on STAN_THREADS permanently…as we pre-compile the services, this pre-compiles the AD stack as well. Thus, the choice of STAN_THREADS is made at the time-point of compiling rstan. So rstan users cannot anymore turn on STAN_THREADS or not - it is on or not, depending on how it was installed.

bgoodri · November 3, 2019, 4:33pm

OK, I was able to add ?w=1 to the URL for the diff to ignore all the formatting changes due to whatever it was that changed them. I guess I should just pull this into for_2.20 and we can work on it more from there.

I agree it does not crash when sampling in parallel, but I still don’t understand. When we were compiling header-only, the dynamic shared object got loaded into four isolated R sessions. Now, that happens but they all have the same copy of rstan.so linked into them that is allocating the global stuff.

If RStan links to TBB and the user’s model statically links to RStan, does the user’s model need to link to TBB separately or is it transitive?

Do we have a plan for how this is going to work on Windows where all linking is dynamic? We only have to worry about g++, so maybe it will work for whatever reason g++ was working on Linux.

wds15 · November 3, 2019, 4:41pm

No. Since we link the services into each model I don’t think that something changes… but given that STAN_THREADS does not cost you performance (from what I have seen…maybe double check Windows) and it is not anymore a choice after rstan is compiled, I think we should turn STAN_THREADS on and that’s it.

For Windows: Maybe it all just works, don’t know. The services are linked in just like any other .o file. Note that I haven’t updated the Makevars.win.

Sure. Do that. Note that I created a rstan_next include dir to avoid filename clashes with stan_fit.hpp. Feel free to change that so that it makes sense.

When we compile the users model, we give the linker the -ltbb flag which tells it that the tbb symbols can be resolved. The TBB then just needs to be loaded in the context of R and all is good, I think.

The TBB must be working just fine, since the global AD stack would otherwise not be initialized. TBB code is required to make that work.

bgoodri · November 3, 2019, 7:09pm

Do you know what to do about this warning:

In file included from file54782f6766ab.cpp:7:
In file included from /home/ben/r-devel/library/Rcpp/include/Rcpp.h:46:
/home/ben/r-devel/library/Rcpp/include/Rcpp/XPtr.h:31:5: warning: delete called on ‘rstan::stan_fit_base’ that is abstract but has non-virtual destructor [-Wdelete-non-virtual-dtor]
delete obj;
^
/home/ben/r-devel/library/Rcpp/include/Rcpp/XPtr.h:46:26: note: in instantiation of function template specialization ‘Rcpp::standard_delete_finalizerrstan::stan_fit_base’ requested here
void Finalizer(T*) = standard_delete_finalizer,
^

We may need to register a finalizer as mentioned in

wds15 · November 3, 2019, 7:31pm

I would try to make the destructor virtual.

bgoodri · November 3, 2019, 7:56pm

It appears we do not have an explicit destructor. Do we need to create one and free memory manually?

On Windows, the models actually seem to compile, but then the linker says

cannot find -ltbb

RcppParallelLibs() says

-LC:/Users/Stan/Documents/R/win-library/3.6/RcppParallel/lib/x64 -ltbb -ltbbmalloc

but dir("C:/Users/Stan/Documents/R/win-library/3.6/RcppParallel/libs/x64") says

[1] “RcppParallel.dll” “symbols.rds”

So, are we supposed to pass -lRcppParallel to get TBB?

bgoodri · November 3, 2019, 8:11pm

Never mind about that. The RcppParallel package puts the TBB stuff under the lib directory rather than libs. Although when I fix that, I get a different linker error

file1ebc12ba81e.o:file1ebc12ba81e.cpp:(.text$_ZN4Rcpp13Constructor_2I17stan_model_holderN5rstan2io21rlist_ref_var_contextEjE7get_newEPP7SEXPRECi[_ZN4Rcpp13Constructor_2I17stan_model_holderN5rstan2io21rlist_ref_var_contextEjE7get_newEPP7SEXPRECi]+0x1ba): undefined reference to `rstan::stan_fit::stan_fit(Rcpp::XPtr<stan::model::model_base, Rcpp::PreserveStorage, &(void Rcpp::standard_delete_finalizerstan::model::model_base(stan::model::model_base*)), false>, int)’
collect2.exe: error: ld returned 1 exit status

bgoodri · November 3, 2019, 8:22pm

That is fixed by commit 194c9177 . So, we are down to the destructor thing and that @ssp3nc3r says that the for_2.20 branch does not fix the Catalina problems.

Edit: And the fact that Windows now eats all the RAM with the mvn_orange demo.

bgoodri · November 3, 2019, 11:35pm

I am a bit unsure why the ChainableStack thing is part of command rather than part of the user’s model class. Suppose someone does some AD stuff with StanHeaders but not RStan like in the vignette

https://cran.r-project.org/web/packages/StanHeaders/vignettes/stanmath.html

where do they do ChainableStack?

wds15 · November 4, 2019, 8:14am

The ChainableStack stuff is a left-over which I just removed. It should not be necessary in client code if everything works out. I had it put there to make sure I do not need to worry about it.

The warning about the destructor should be straightforward to fix. I will look into it later today.

I also just reworked the compile logic for windows. So the makefiles now also build for windows the services to a static library which is linked into the model binary during linking of the model. I don’t quite see why that is not possible. I mean, the services can be linked together statically with the model and then we load dynamically the whole thing into R. I am not able to test it right now as I don’t have a virtual windows available here; can do that later.

What do you think about turning on STAN_THREADS as a default? If you do not want that, then you would have to compile the stan services twice - once with and once without it. The compilation calls would then need to know what the user wants to select the right services to link to. Unless we still find a real obstacle we should just turn it on; that’s what I would do.

Can the Rstan package as is still support the old packages on CRAN? Or is this the next headache to sort out?

bgoodri · November 4, 2019, 2:58pm

I think setting STAN_THREADS by default is fine. We do need to try to make it as simple as possible for rstanarm, RBesT, etc. to work with rstan >= 2.21 but I think their header-only C++ will continue to compile and work when rstan 2.21 is released on CRAN. It is just when they regenerate their C++ code using rstan 2.21 that C++ code with be CRTP and then they will have to change their Makevars{.win} before they can upload to CRAN.

I will look into the Windows thing. The documentation

https://cran.r-project.org/doc/manuals/r-release/R-exts.html#Windows

refers to DLLs, whereas above for non-Windows they recommend static libraries. But I guess the RStudio people have static libraries for RcppParallel on Windows, so maybe it can work.

wds15 · November 4, 2019, 3:07pm

Can’t we strongly encourage everyone to move to the rstantools setup? This way the make stuff will be generated from rstantools in the standard case. This way the make could even change with a new Rstan release and the R package source would still work just fine as long as rstantools is available in a newer version.

I can spin on a windows virtual machine later to verify things work unless you sort that out before me.

We should then issue a -DSTAN_THREADS in the plugins.R for all user models and also make that define for building Rstan itself. That’s really the easiest and safest choice to make here.

bgoodri · November 4, 2019, 4:22pm

Yeah, I don’t think there are any packages that come with compiled models that never used rstantools. The rstantools on CRAN now is a lot different than what it used to be, but there is a function to mostly convert and old package to a new format and it can set the Makevars{.win} appropriately. So, I am not too worried about those packages.

I got it to work and not eat all the RAM on Windows with a few fixes that I have pushed to the branch.

wds15 · November 4, 2019, 4:53pm

Whow! Windows works… that’s really magic.

What’s left to be done? As far as I understand everything is working, right… and that’s a lot as it includes fast compilation & the TBB.

Should we turn on linking against the tbbmalloc_proxy on MacOS? That gives a significant performance boost (~20%) for any stan program (including single-core runs). It’s just that there should be an option to turn it on or off, I think… so maybe we leave that for another day. Thoughts?

@ahartikainen & @ariddell : PyStan+httpstan is almost certainly suffering from the very same issues. So in short, you have to link the model statically with the services and load that into the running process. The services must not be loaded already. Otherwise reference to the global AD tape is messed up by the linker. Very annoying.

bgoodri · November 4, 2019, 5:02pm

I would say go ahead and put in whatever stuff like -DSTAN_THREADS and tbbmalloc_proxy you think would be good.

What remains is that exceptions are still messed up on Catalina. And a lot of testing.

wds15 · November 4, 2019, 5:05pm

Ok… so we hard-code STAN_THREADS for all platforms and the tbbmalloc_proxy for macOS… fine for me. That corresponds to the default in cmdstan. I will add that then.

I can upgrade to Catalina… which takes Mojave away from me, but it could be helpful to get a sense of the misery. I am not really keen to do so given all the hiccups we are seeing, but well…

Topic		Replies	Views
R 4.0.0 and CRAN MacOS binaries RStan	27	7983	April 6, 2021
Warning to others: compilation errors and memory/cache corruption General	3	1081	June 15, 2017
Stan now (mostly) working on RStudio Cloud General	58	7849	March 30, 2020
RStan crashing with large files on AWS Modeling	2	736	May 23, 2019
Not able to run stan due to compiler RStan	1	325	July 4, 2021

Future rstan is broken with clang++

Related topics