From fast to slow sampling on cluster after reset and older rstan version installed

willemhc · January 26, 2021, 9:26pm

Hello all,

Throughout the late summer and fall, I have been working on a large simulation study evaluating a few candidate hierarchical models for an applied problem (essentially a hierarchical meta-analysis). I have been running my simulations on a cluster at my university’s center for high performance computing (HPC). Everything was going well until mid-November, when there was some scheduled cluster downtime. After that downtime, I was no longer able to fit models using rstan without error. A staff member of our HPC was able to help me install a version of rstan so that I would not be getting such errors.

However, the sampling for my models is now so slow that I am stuck again. This is without having made any substantial changes to my R or Stan code. On my own laptop, for example, fitting one of my models to a simulated dataset, I am able to run three chains in parallel for 100,000 iterations in about 2 hours. On any core in the HPC environment that my jobs get submitted to, even 12 hours will only get me through 20% of the sampling for one chain.

I am wondering if anyone can help me figure out what has changed or is going wrong. I am not a developer, so it’s difficult for me to understand why the same R/Stan code would have worked a few months ago and does not now.

The instructions I received from our HPC staff were the following (I implemented these instructions). As a starting point, does anyone recognize issues here?

“I have this working. I installed rstan version 2.21.1 into our gcc-compiled version of R 3.5.1 along with its dependencies and successfully ran both your SLURM scripts last night. Here is how you can install and use it:

Your $HOME/.R/Makevars file should contain the following:
CXX14=g++
CXX14FLAGS=-std=c++14
CXX14FLAGS += -DBOOST_PHOENIX_NO_VARIADIC_EXPRESSION
CXX14FLAGS += -fPIC
LDFLAGS += -fPIC

If you have other settings in that file that you want to keep, either make a copy of the file, or comment them out with a # character.

You can execute my install script which is ~ID/Incidents/YourName_rstan/install_rstan.sh . That is a bash script that will create the directory R/mropen under your home directory and install rstan with its dependencies there. The installation is performed with our copy of Microsoft R version 3.5.1, which I chose because it is compiled with gcc. I couldn’t get this working with any of our copies of R compiled with the Intel compiler.
In your csh scripts like Analysis.sh or Simple_Analysis.sh you must add the lines:
module load mropen gcc/4.9.2
setenv R_LIBS_USER $HOME/R/mropen”

If there are no obvious issues with this setup, I can provide my R and Stan code, or of course any additional detail that might be helpful. I will note that all cores being utilized have CPUs at 100% with low memory use, so the cluster does appear to be working hard. Thanks for any help or suggestions!

mike-lawrence · January 27, 2021, 11:18am

What version of RStan was the cluster running before the change and what version is it running now?

andrjohns · January 27, 2021, 2:06pm

While I don’t think it’s responsible for the full slowdown, the Makevars file is omitting compiler optimisations which will result in a slower model. Try updating the CXX14FLAGS to:

CXX14FLAGS += -fPIC -O3 -march=native -mtune=native

willemhc · January 27, 2021, 3:06pm

Prior to the change, I had to reinstall rstan a few times to keep things working over the course of August-October. The most recent version I had used was 2.21.0, but if there had been updates to rstan during that timeframe, I had also used others prior to 2.21.0 successfully. I believe I was wrong to say that the HPC staff helped me install an older version. We are now running 2.21.1.

willemhc · January 27, 2021, 3:10pm

If I add these CXX14FLAGS to my Makevars file, I end up getting the same issue I was having before the HPC staff helped me re-install rstan. The short version is that once the chains begin sampling, they produce an error “double free or corruption (out)” and then begin to display pages’ worth of output under either “memory map” or “back trace.” Does this help indicate a problem in any way? Thank you for your help.

andrjohns · January 27, 2021, 3:13pm

That is odd. Try with just O3:

CXX14FLAGS += -fPIC -O3
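
A quick way to confirm that an optimisation level is actually set in the file is sketched below. This is only an illustration: the helper name `has_opt_flag` is hypothetical, the default path assumes the `$HOME/.R/Makevars` location mentioned earlier in the thread, and the check only looks at `CXX14FLAGS` lines (it does not understand trailing comments or flags set elsewhere).

```shell
#!/bin/sh
# Hypothetical helper: succeed if any CXX14FLAGS line in the given
# Makevars file carries an -O optimisation flag (e.g. -O2, -O3).
has_opt_flag() {
    # $1: path to a Makevars file
    grep -E '^[[:space:]]*CXX14FLAGS[[:space:]]*\+?=' "$1" |
        grep -Eq -e '(^|[[:space:]=])-O(fast|[0-3sg])?([[:space:]]|$)'
}

# Default location is an assumption taken from the thread.
MAKEVARS="${MAKEVARS:-$HOME/.R/Makevars}"

if [ -f "$MAKEVARS" ]; then
    if has_opt_flag "$MAKEVARS"; then
        echo "Optimisation flag found in $MAKEVARS"
    else
        echo "No -O flag in CXX14FLAGS in $MAKEVARS"
    fi
else
    echo "No Makevars file at $MAKEVARS"
fi
```

The flags in Makevars are applied when a model is compiled, so a model compiled before the change needs to be recompiled before the new flags have any effect.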

willemhc · January 27, 2021, 3:38pm

With just O3 the sampling is working without that error. But I am not yet able to tell if things have sped up.

mike-lawrence · January 27, 2021, 4:10pm

To gauge the speed, maybe try using ezStan, which has a progress bar with ETA.

Also, have you considered using cmdstanr? If that’s possible in the HPC environment, it’ll give you the latest/greatest speed/features.

willemhc · January 27, 2021, 9:15pm

Hi everyone. Thanks for taking the time to respond to this post. I will look into cmdstanr as I move forward with my projects. In the meantime, adding “O3” to the CXX14FLAGS seems to have fixed the slow sampling issue. Obviously it’s probably possible to find a way to speed things up further, but I am at least moving as quickly as I was back in November with this small fix. I am a bit dumbfounded but very happy, so thanks again!!
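
For anyone landing here later, the resulting setup can be sketched roughly as follows. This assumes the working file is simply the HPC staff's Makevars with `-O3` added to the `-fPIC` line, which is my reading of the thread rather than a verified copy of the final file. The function name `write_makevars` is hypothetical; it keeps a `.bak` copy of any existing file, in line with the staff's advice to save other settings first.

```shell
#!/bin/sh
# Sketch: write the Makevars settings discussed in this thread to the
# given path, keeping a copy of any existing file as <path>.bak.
# The file contents are an assumption: the HPC staff's settings plus -O3.
write_makevars() {
    target="$1"
    mkdir -p "$(dirname "$target")"
    # Preserve existing settings before overwriting them.
    if [ -f "$target" ]; then
        cp "$target" "$target.bak"
    fi
    cat > "$target" <<'EOF'
CXX14=g++
CXX14FLAGS=-std=c++14
CXX14FLAGS += -DBOOST_PHOENIX_NO_VARIADIC_EXPRESSION
CXX14FLAGS += -fPIC -O3
LDFLAGS += -fPIC
EOF
}
```

It would be called as `write_makevars "$HOME/.R/Makevars"`, after which the Stan models need to be recompiled so the new flags are picked up.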
