Using RStan on a cluster

rstan
loo

#1

Hello Everyone,

  • Operating System: CentOS 7
  • RStan Version: 2.17.3
  • Output of writeLines(readLines(file.path(Sys.getenv("HOME"), ".R/Makevars")))
    CXXFLAGS=-O3 -Wno-unused-variable -Wno-unused-function
    CXXFLAGS=-O3 -Wno-unused-variable -Wno-unused-function
    CXXFLAGS += -Wno-ignored-attributes -Wno-deprecated-declarations
  • Output of devtools::session_info("rstan"):
> devtools::session_info("rstan")
    Session info ------------------------------------------------------------------
    setting value
    version R version 3.5.0 (2018-04-23)
    system x86_64, linux-gnu
    ui unknown
    language (EN)
    collate en_US.UTF-8
    tz America/Winnipeg
    date 2018-08-06

Packages ----------------------------------------------------------------------
package * version date source
assertthat 0.2.0 2017-04-11 CRAN (R 3.5.0)
BH 1.66.0-1 2018-02-13 CRAN (R 3.5.0)
cli 1.0.0 2017-11-05 CRAN (R 3.5.0)
colorspace 1.3-2 2016-12-14 CRAN (R 3.5.0)
crayon 1.3.4 2017-09-16 CRAN (R 3.5.0)
dichromat 2.0-0 2013-01-24 CRAN (R 3.5.0)
digest 0.6.15 2018-01-28 CRAN (R 3.5.0)
fansi 0.2.3 2018-05-06 CRAN (R 3.5.0)
ggplot2 3.0.0 2018-07-03 CRAN (R 3.5.0)
glue 1.3.0 2018-07-17 CRAN (R 3.5.0)
graphics * 3.5.0 2018-05-17 local
grDevices * 3.5.0 2018-05-17 local
grid 3.5.0 2018-05-17 local
gridExtra 2.3 2017-09-09 CRAN (R 3.5.0)
gtable 0.2.0 2016-02-26 CRAN (R 3.5.0)
inline 0.3.15 2018-05-18 CRAN (R 3.5.0)
labeling 0.3 2014-08-23 CRAN (R 3.5.0)
lattice 0.20-35 2017-03-25 CRAN (R 3.5.0)
lazyeval 0.2.1 2017-10-29 CRAN (R 3.5.0)
magrittr 1.5 2014-11-22 CRAN (R 3.5.0)
MASS 7.3-49 2018-02-23 CRAN (R 3.5.0)
Matrix 1.2-14 2018-04-13 CRAN (R 3.5.0)
methods * 3.5.0 2018-05-17 local
mgcv 1.8-23 2018-01-21 CRAN (R 3.5.0)
munsell 0.5.0 2018-06-12 CRAN (R 3.5.0)
nlme 3.1-137 2018-04-07 CRAN (R 3.5.0)
pillar 1.3.0 2018-07-14 CRAN (R 3.5.0)
plyr 1.8.4 2016-06-08 CRAN (R 3.5.0)
R6 2.2.2 2017-06-17 CRAN (R 3.5.0)
RColorBrewer 1.1-2 2014-12-07 CRAN (R 3.5.0)
Rcpp 0.12.18 2018-07-23 CRAN (R 3.5.0)
RcppEigen 0.3.3.4.0 2018-02-07 CRAN (R 3.5.0)
reshape2 1.4.3 2017-12-11 CRAN (R 3.5.0)
rlang 0.2.1 2018-05-30 CRAN (R 3.5.0)
rstan 2.17.3 2018-01-20 CRAN (R 3.5.0)
scales 0.5.0 2017-08-24 CRAN (R 3.5.0)
StanHeaders 2.17.2 2018-01-20 CRAN (R 3.5.0)
stats * 3.5.0 2018-05-17 local
stats4 3.5.0 2018-05-17 local
stringi 1.2.4 2018-07-20 CRAN (R 3.5.0)
stringr 1.3.1 2018-05-10 CRAN (R 3.5.0)
tibble 1.4.2 2018-01-22 CRAN (R 3.5.0)
tools 3.5.0 2018-05-17 local
utf8 1.1.4 2018-05-24 CRAN (R 3.5.0)
utils * 3.5.0 2018-05-17 local
viridisLite 0.3.0 2018-02-01 CRAN (R 3.5.0)
withr 2.1.2 2018-03-15 CRAN (R 3.5.0)

When I load the RStan package on a compute node of a cluster I get the following error message, but when I load the package on the management node I don't get any error.

Traceback:
1: stanc(file = file, model_code = model_code, model_name = model_name, verbose = verbose, obfuscate_model_name = obfuscate_model_name, allow_undefined = allow_undefined)
2: stan_model(file, model_name = model_name, model_code = model_code, stanc_ret = NULL, boost_lib = boost_lib, eigen_lib = eigen_lib, save_dso = save_dso, verbose = verbose)
3: stan(model_code = model1, data = data_irt, init = initf, control = list(adapt_delta = 0.99, max_treedepth = 15), iter = 2500, warmup = 1000, chains = 4)

Could you please help me with this issue?


#2

The error message does not seem very informative to me. It would also help to know which scheduler you are using, and to see your job script.
In any case, you could check:

  • Is the (correct) compiler available on the node?
  • Does the job have enough RAM? (Compilation can use more RAM than sampling.)
  • Did you get another error message from the scheduler?

One way to go is to compile your model on a management node with auto_write = TRUE, and to load only the resulting RDS file when doing the sampling on a compute node.
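That workflow can be sketched as follows (a sketch only; the file names and data object are illustrative, not from the thread, and the compiled model is only portable between machines with compatible CPUs and toolchains):

```r
library(rstan)
rstan_options(auto_write = TRUE)  # cache the compiled model next to the .stan file

# On the management node: compile once and save the model object.
sm <- stan_model(file = "model.stan")
saveRDS(sm, file = "model.rds")

# On the compute node: load the compiled model and sample without recompiling.
sm <- readRDS("model.rds")
fit <- sampling(sm, data = my_data, chains = 4, iter = 2000)
```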

Maybe you are already doing this, but if you are on a Slurm cluster you can use qlogin to start an interactive job on a node, where debugging is easier (that's surely also possible with other schedulers, but I only know Slurm).
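On many Slurm installations an interactive session can also be requested directly with srun; the options below are illustrative and depend on the cluster's configuration:

```shell
# Request an interactive shell on a compute node (Slurm;
# the time and resource values are illustrative).
srun --ntasks=1 --time=01:00:00 --pty bash
```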


#3

Dear Guido Biele,

Thank you for your quick reply.
We generally use Slurm as the scheduler. We have several compute nodes with at least 64 GB of memory each.

Here’s my job script:
#!/bin/bash
#SBATCH -J MyProgram
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH -o out.%j
module load R
Rscript RSTAN.R

However, the problem is that when I ssh into a compute node I cannot even load the library; I get the following error:

R version 3.4.2 (2017-09-28) -- "Short Summer"
Copyright (C) 2017 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

[Previously saved workspace restored]

During startup - Warning messages:
1: Setting LC_CTYPE failed, using "C"
2: Setting LC_COLLATE failed, using "C"
3: Setting LC_TIME failed, using "C"
4: Setting LC_MESSAGES failed, using "C"
5: Setting LC_MONETARY failed, using "C"
6: Setting LC_PAPER failed, using "C"
7: Setting LC_MEASUREMENT failed, using "C"

library('rstan')
Loading required package: ggplot2

*** caught illegal operation ***
address 0x7fad0e397efa, cause 'illegal operand'

Traceback:
1: dyn.load(file, DLLpath = DLLpath, ...)
2: library.dynam(lib, package, package.lib)
3: loadNamespace(j <- i[[1L]], c(lib.loc, .libPaths()), versionCheck = vI[[j]])
4: asNamespace(ns)
5: namespaceImportFrom(ns, loadNamespace(j <- i[[1L]], c(lib.loc, .libPaths()), versionCheck = vI[[j]]), i[[2L]], from = package)
6: loadNamespace(i, c(lib.loc, .libPaths()), versionCheck = vI[[i]])
7: namespaceImport(ns, loadNamespace(i, c(lib.loc, .libPaths()), versionCheck = vI[[i]]), from = package)
8: loadNamespace(package, lib.loc)
9: doTryCatch(return(expr), name, parentenv, handler)
10: tryCatchOne(expr, names, parentenv, handlers[[1L]])
11: tryCatchList(expr, classes, parentenv, handlers)
12: tryCatch({ attr(package, "LibPath") <- which.lib.loc ns <- loadNamespace(package, lib.loc) env <- attachNamespace(ns, pos = pos, deps)}, error = function(e) { P <- if (!is.null(cc <- conditionCall(e))) paste(" in", deparse(cc)[1L]) else "" msg <- gettextf("package or namespace load failed for %s%s:\n %s", sQuote(package), P, conditionMessage(e)) if (logical.return) message(paste("Error:", msg), domain = NA) else stop(msg, call. = FALSE, domain = NA)})
13: library(pkg, character.only = TRUE, logical.return = TRUE, lib.loc = lib.loc, quietly = quietly)
14: .getRequiredPackages2(pkgInfo, quietly = quietly)
15: library(“rstan”)

Possible actions:
1: abort (with core dump, if enabled)
2: normal R exit
3: exit R without saving workspace
4: exit R saving workspace
Selection:


#4

I am not sure how to interpret the warning messages you are getting. A guess is that some environment variables are not found. Are you sure that you are in the same shell on the management node as when you log in to a compute node, and that .bash_login and similar files were read?

Anyhow, I would send the error message you get to the helpdesk for your cluster.

Quickly, about working memory on nodes: the fact that a node has 64 GB of RAM does not mean that each worker (CPU core) gets that much, because each node typically has more than one CPU core. As far as I know, by default each worker is allocated only a fraction of the node's total RAM, so it makes sense to specify the amount of RAM you want for your workers.
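In a Slurm job script, memory can be requested explicitly with one of the following directives (the values are illustrative; the two options are mutually exclusive, so the second is commented out here):

```shell
# Request total memory for the job...
#SBATCH --mem=16G
# ...or, alternatively, memory per allocated CPU:
##SBATCH --mem-per-cpu=4G
```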


#5

Judging from what I've Googled, one trigger for the "illegal operand" error is R trying to load the wrong BLAS library, such as one compiled for a different CPU.


#6

RStan might also be trying to write somewhere it doesn't have permission to write. Have you tried setting the working directory to your home directory in the Slurm job prior to calling R?


#7

This post might have the answer: Compilation error on Linux server

The catch is that I don't see "-mtune=native -march=native" in your CXXFLAGS, but an "illegal operand" error does look like the kind of error you'd get from trying to run or load a binary compiled for one CPU on a slightly different CPU.


#8

It actually looks like rstan was installed with incorrect cflags. Could you try reinstalling it on a cluster node with the current cflags?
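If you go that route, a reinstall from source on a compute node would look roughly like this (a sketch; the CRAN mirror URL is illustrative, and StanHeaders is included since rstan depends on it):

```r
# Rebuild rstan with the compiler flags available on the compute node.
remove.packages(c("rstan", "StanHeaders"))
install.packages(c("StanHeaders", "rstan"),
                 repos = "https://cloud.r-project.org", type = "source")
```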


#9

Hi sakrejda,
I have the same problem. I made the changes you suggested and set the working directory to the home directory in the Slurm job, but it did not help; I got the same error message.


#10

Hi jjramsey,

I removed that statement from my Makevars file, but I'm still getting the same illegal operand error.


#11

If you change flags you might need to recompile to see if it fixes things (so uninstall and reinstall rstan)


#12

OK, after uninstalling and reinstalling rstan, the error message changed to the following:

During startup - Warning messages:

1: Setting LC_CTYPE failed, using "C"
2: Setting LC_COLLATE failed, using "C"
3: Setting LC_TIME failed, using "C"
4: Setting LC_MESSAGES failed, using "C"
5: Setting LC_MONETARY failed, using "C"
6: Setting LC_PAPER failed, using "C"
7: Setting LC_MEASUREMENT failed, using "C"

Loading required package: ggplot2
Loading required package: StanHeaders
rstan (Version 2.17.3, GitRev: 2e1f913d3ca3)
For execution on a local, multicore CPU with excess RAM we recommend calling
options(mc.cores = parallel::detectCores()).

To avoid recompilation of unchanged Stan programs, we recommend calling
rstan_options(auto_write = TRUE)

Error in system2(file.path(R.home(component = "bin"), "R"), args = paste("CMD config", :
error in running command
Calls: stan_model -> get_CXX -> system2
Execution halted

Any idea what this error means?

Thank you


#13

Did you load the gcc module on the cluster before doing this?


#14

Sorry, I don't follow. What do you mean by loading the gcc module?

I verified that the toolchain works by executing the following code in R and checking that it returns the value 10:

fx <- inline::cxxfunction(signature(x = "integer", y = "numeric"), '
  return ScalarReal(INTEGER(x)[0] * REAL(y)[0]);
')

fx(2L, 5) # should be 10

I got the correct number, 10.


#15

On the cluster I use, you always have access to gcc, but some of the related configuration for R is not loaded until you do 'module load gcc/'.


#16

The cluster did not recognize 'module load gcc/'. I got the following:

Lmod has detected the following error: The following module(s) are unknown: "gcc"

Please check the spelling or version number. Also try "module spider ..."
It is also possible your cache file is out-of-date; it may help to try:
$ module --ignore-cache load "gcc"

Also make sure that all modulefiles written in TCL start with the string #%Module


#17

Right, this will vary by cluster so it’s a good question to ask the cluster admins. It looks like rstan is having trouble getting the compiler or linker to run.
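One way to see what rstan is trying to run is to ask R for its compiler configuration directly from the shell; if this command fails, stan_model() will fail the same way (a diagnostic sketch; g++ is assumed to be the compiler in use):

```shell
# Show the C++ compiler R is configured to use, then check it exists.
R CMD config CXX
which g++ && g++ --version
```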


#18

Ok. I will ask them. Thank you so much for your help.


#19

To see what modules are available on the cluster, try running "module avail", without the quotes.


#20

After running "module avail", there is no module called "gcc" on the cluster.

But I found a way to get around the compilation error. First, I compiled the Stan model, without data, on the management server and saved the compiled model to an .RData file as follows:

sm <- stan_model(file = 'a.stan', save_dso = TRUE)
save('sm', file = 'sm.RData')

Then I submitted my R file to Slurm and it worked. :)

The R file includes the following:

load("sm.RData")

… simulated data…

fit <- sampling(sm, data = list(K, N, J, y, dir_alpha), pars = c("pi", "mu", "theta", "beta", "alpha", "prob"), warmup = 2000, iter = 5000, chains = 3)
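A small side note on the save()/load() pair used here: saveRDS()/readRDS() is an equivalent alternative that stores a single object per file and lets you choose the variable name at load time:

```r
# Equivalent to save()/load(), but with an explicit object name on load.
saveRDS(sm, file = "sm.rds")
sm <- readRDS("sm.rds")
```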

Thanks a lot to everyone who helped solve this problem, and thanks to the Stan forums.