Error in unserialize(socklist[[n]]) with large-ish input data


#1
  • Operating System : RHEL 7
  • RStan Version : 2.18.1
  • Output of writeLines(readLines(file.path(Sys.getenv("HOME"), ".R/Makevars"))):
CXX14FLAGS=-O3 -mtune=native -march=native -Wno-unused-variable -Wno-unused-function -fPIC -std=c++14
CXX14=g++

CXX14FLAGS+=-flto -Wno-unused-local-typedefs -Wno-ignored-attributes -Wno-deprecated-declarations
  • Output of devtools::session_info("rstan")
Session info -----------------------------------------------------------------------------------------------------------------------------------------------------------------------
setting value
version R version 3.4.2 (2017-09-28)
system x86_64, linux-gnu
ui RStudio (99.9.9)
language (EN)
collate en_US.UTF-8
tz America/New_York
date 2018-10-27

Packages ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------
package * version date source
assertthat 0.2.0 2017-04-11 CRAN (R 3.4.2)
backports 1.1.0 2017-05-22 CRAN (R 3.4.2)
base64enc 0.1-3 2015-07-28 CRAN (R 3.4.2)
BH 1.66.0-1 2018-02-13 CRAN (R 3.4.2)
callr 3.0.0 2018-08-24 CRAN (R 3.4.2)
cli 1.0.1 2018-09-25 CRAN (R 3.4.2)
colorspace 1.3-2 2016-12-14 CRAN (R 3.4.2)
compiler 3.4.2 2018-03-09 local
crayon 1.3.4 2017-09-16 CRAN (R 3.4.2)
desc 1.2.0 2018-05-01 CRAN (R 3.4.2)
digest 0.6.12 2017-01-27 CRAN (R 3.4.2)
fansi 0.3.0 2018-08-13 CRAN (R 3.4.2)
ggplot2 * 3.0.0 2018-07-03 CRAN (R 3.4.2)
graphics * 3.4.2 2018-03-09 local
grDevices * 3.4.2 2018-03-09 local
grid 3.4.2 2018-03-09 local
gridExtra 2.2.1 2016-02-29 CRAN (R 3.4.2)
gtable 0.2.0 2016-02-26 CRAN (R 3.4.2)
inline 0.3.14 2015-04-13 CRAN (R 3.4.2)
labeling 0.3 2014-08-23 CRAN (R 3.4.2)
lattice 0.20-35 2017-03-25 CRAN (R 3.4.2)
lazyeval 0.2.0 2016-06-12 CRAN (R 3.4.2)
loo 2.0.0 2018-04-11 CRAN (R 3.4.2)
magrittr 1.5 2014-11-22 CRAN (R 3.4.2)
MASS 7.3-47 2017-04-21 CRAN (R 3.4.2)
Matrix 1.2-14 2018-04-09 CRAN (R 3.4.2)
matrixStats 0.54.0 2018-07-23 CRAN (R 3.4.2)
methods * 3.4.2 2018-03-09 local
mgcv 1.8-24 2018-06-18 CRAN (R 3.4.2)
munsell 0.5.0 2018-06-12 CRAN (R 3.4.2)
nlme 3.1-131 2017-02-06 CRAN (R 3.4.2)
parallel * 3.4.2 2018-03-09 local
pillar 1.3.0 2018-07-14 CRAN (R 3.4.2)
pkgbuild 1.0.2 2018-10-16 CRAN (R 3.4.2)
plyr * 1.8.4 2016-06-08 CRAN (R 3.4.2)
prettyunits 1.0.2 2015-07-13 CRAN (R 3.4.2)
processx 3.2.0 2018-08-16 CRAN (R 3.4.2)
ps 1.1.0 2018-08-10 CRAN (R 3.4.2)
R6 2.2.2 2017-06-17 CRAN (R 3.4.2)
RColorBrewer 1.1-2 2014-12-07 CRAN (R 3.4.2)
Rcpp 0.12.19 2018-10-01 CRAN (R 3.4.2)
RcppEigen 0.3.3.4.0 2018-02-07 CRAN (R 3.4.2)
reshape2 1.4.3 2017-12-11 CRAN (R 3.4.2)
rlang 0.2.2 2018-08-16 CRAN (R 3.4.2)
rprojroot 1.2 2017-01-16 CRAN (R 3.4.2)
rstan * 2.18.1 2018-10-16 CRAN (R 3.4.2)
scales 1.0.0 2018-08-09 CRAN (R 3.4.2)
StanHeaders * 2.18.0 2018-10-07 CRAN (R 3.4.2)
stats * 3.4.2 2018-03-09 local
stats4 3.4.2 2018-03-09 local
stringi 1.1.6 2018-03-16 local
stringr 1.2.0 2017-02-18 CRAN (R 3.4.2)
tibble 1.4.2 2018-01-22 CRAN (R 3.4.2)
tools 3.4.2 2018-03-09 local
utf8 1.1.4 2018-05-24 CRAN (R 3.4.2)
utils * 3.4.2 2018-03-09 local
viridisLite 0.2.0 2017-03-24 CRAN (R 3.4.2)
withr 2.1.2 2018-03-15 CRAN (R 3.4.2)

From rooting around on this forum, my impression is that this error occurs when the compiled model segfaults while allocating memory for either the input data or the posterior samples. In my case, it happens before the first sample (after the SAMPLING message). I can watch it start copying the input data to the chains via watch -n 1 -d free, but memory use never reaches 4 times the object size before the process dies with the error in the title.

I have 512 GB of RAM and am running 4 chains with a 75MB input data object in R.

I save ~550,000 reals (35,000 parameters, 515,000 generated quantities) per iteration. That is a lot of values, but the count isn’t much smaller in my models that use slightly less input data and run fine. I suspect the problem is related to my input data, though maybe not its size. I did some tests (below) to largely rule out the size of the posterior samples and to find the threshold at which the model starts having trouble.

Tests:

  • 75MB input, 5 warmup, 10 iter, 4 chains : Title’s error
  • 75MB input, 5 warmup, 10 iter, 4 chains, save_warmup=F : Title’s error
  • 75MB input, 5 warmup, 10 iter, 1 chain : Crashes RStudio Server with an anonymous error in rstudio/src/cpp/session/SessionMain.cpp:1859
  • 66MB input, 250 warmup, 1250 iter, 4 chains : No error
  • 70MB input, 250 warmup, 1250 iter, 4 chains : Title’s error
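To put rough numbers on the tests above, here is a back-of-the-envelope sketch of how much memory the stored draws alone would need. It assumes 8 bytes per real, ignores any R or Rcpp overhead, and treats iter as including warmup (as rstan does). The helper name draws_gb is made up for illustration.

```r
# Approximate size (in GB) of the stored draws, assuming 8-byte doubles
# and no container overhead. `draws_gb` is a hypothetical helper.
draws_gb <- function(n_values, iter, warmup, chains, save_warmup = TRUE) {
  kept <- if (save_warmup) iter else iter - warmup
  n_values * 8 * kept * chains / 1e9
}

# ~550,000 reals per iteration, as in the post
small_run <- draws_gb(550000, iter = 10,   warmup = 5,   chains = 4)
full_run  <- draws_gb(550000, iter = 1250, warmup = 250, chains = 4)
no_warmup <- draws_gb(550000, iter = 1250, warmup = 250, chains = 4,
                      save_warmup = FALSE)
print(c(small_run = small_run, full_run = full_run, no_warmup = no_warmup))
```

Under these assumptions the 10-iteration run would hold well under 1 GB of draws, which is consistent with the posterior size not being the trigger on its own.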

I checked the 70MB input and it contains no NAs, NaNs, or Infs.
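For anyone wanting to repeat that check, a minimal sketch in base R follows. It assumes the input is a named list of numeric vectors, matrices, or arrays, as passed to rstan. The names count_nonfinite and stan_data are placeholders, and the toy list stands in for the real data.

```r
# Count non-finite entries (NA, NaN, Inf, -Inf) in each numeric element of a
# Stan data list; non-numeric elements are reported as NA.
count_nonfinite <- function(data) {
  vapply(data, function(x) {
    if (is.numeric(x)) sum(!is.finite(x)) else NA_integer_
  }, integer(1))
}

# Toy stand-in for the real input data (hypothetical values)
stan_data <- list(
  N = 3L,
  y = c(1.5, NA, Inf),
  X = matrix(c(1, 2, NaN, 4), nrow = 2)
)
bad <- count_nonfinite(stan_data)
print(bad)
```

Any element with a nonzero count would be worth inspecting before looking further.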

I am down to try and compare the data that throws the error vs. doesn’t, but does anyone have an idea what I should be looking for? Diving further into 70MB of data without a plan seems unlikely to yield positive results.


#2

It isn’t exactly the size of the data but the size of the autodiff tree that may cause problems. Also, there is a long-standing Rcpp issue with giant lists needed to hold that many values that only @Krzysztof_Sakrejda has any understanding of.

Does it generally work if you exclude the generated quantities by specifying include = FALSE and pars = whatever the symbol names at the top level of generated quantities are? Does it work if you do pars = "" but specify the sample_file argument to write the values to the disk?


#3

With include = FALSE, pars = c('O','P','D'), which are the generated quantities, the Title error still occurs.

With pars = "", sample_file = "~/samples.csv", I get:

starting worker pid=37735 on localhost:11084 at 19:49:07.622
starting worker pid=37747 on localhost:11084 at 19:49:08.287
starting worker pid=37758 on localhost:11084 at 19:49:08.920
starting worker pid=37768 on localhost:11084 at 19:49:09.546
no parameter ; sampling not done
no parameter ; sampling not done
no parameter ; sampling not done
no parameter ; sampling not done
here are whatever error messages were returned
[[1]]
Stan model ‘tb1’ does not contain samples.

[[2]]
Stan model ‘tb1’ does not contain samples.

[[3]]
Stan model ‘tb1’ does not contain samples.

[[4]]
Stan model ‘tb1’ does not contain samples.

It asks me to rerun with 1 chain for debugging purposes, which produces a single instance of "no parameter ; sampling not done".

Thanks for the quick reply.


#4

I’m sorry. I meant to say pars = NA, include = FALSE, sample_file = "~/samples.csv" in order to exclude everything (except lp__). If that works, then you would have to do read_stan_csv() on the samples#.csv files in your home directory in order to access O, P, and D.

Also, with big models, it is almost always wise to specify save_warmup = FALSE in the call to stan() or sampling().
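Put together, the suggested call might look like the sketch below. The function name fit_to_disk, the model and stan_data arguments, and the output directory are all placeholders. As I read the rstan documentation, with several chains an underscore and the chain number are appended to the sample_file name, and tempdir() is used if the requested folder is not writable. The sketch therefore searches for the CSV files instead of assuming their exact names.

```r
# Sketch only: `model` is a compiled stanmodel and `stan_data` the data list.
# Both are placeholders, as is the default output directory.
fit_to_disk <- function(model, stan_data, out_dir = path.expand("~")) {
  fit <- rstan::sampling(
    model,
    data        = stan_data,
    chains      = 4,
    pars        = NA,       # together with include = FALSE: keep only lp__ in memory
    include     = FALSE,
    save_warmup = FALSE,    # do not store warmup draws
    sample_file = file.path(out_dir, "samples.csv")
  )
  # Pick up whichever per-chain CSV files were actually written.
  csvs <- list.files(out_dir, pattern = "^samples.*\\.csv$", full.names = TRUE)
  draws <- if (length(csvs) > 0) rstan::read_stan_csv(csvs) else NULL
  list(fit = fit, draws = draws, csv_files = csvs)
}
```

If csv_files comes back empty, checking tempdir() for the output would be the next thing to try.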


#5

That does work for me! Thanks

Let me know if there is anything else you would like me to test.


#6

The model finished fitting but did not generate a samples.csv file. The stanfit does have lp__, as expected. Any idea what’s up?

I am trying include=TRUE, pars=c('O','P','D') now and it seems to have gotten past the normal unserialize error time, but hasn’t started sampling yet.


#7

Does anyone have an idea about this? I am stumped and @Krzysztof_Sakrejda last logged on in March.


#8

Not sure, but perhaps see: Error in unserialize, rstan 2.18, problems with native mtune and march


#9

Thanks for the suggestion and I do agree the symptoms are similar.

Unfortunately, the suggested fix (using -march=core2) does not seem to fix my issue.

CPU info

Intel® Xeon® CPU E7-8890 v4 @ 2.20GHz
192 processors
24 cores/processor

setdiff(native,basic)
"#define __tune_haswell__ 1"                    "#define __SSE4_1__ 1"                          "#define __core_avx2__ 1"                      
"#define __POPCNT__ 1"                          "#define __ABM__ 1"                             "#define __F16C__ 1"                           
"#define __XSAVEOPT__ 1"                        "#define __BIGGEST_ALIGNMENT__ 32"              "#define __PRFCHW__ 1"                         
"#define __SSE4_2__ 1"                          "#define __AVX__ 1"                             "#define __LZCNT__ 1"                          
"#define __RTM__ 1"                             "#define __FP_FAST_FMAF 1"                      "#define __PCLMUL__ 1"                         
"#define __XSAVE__ 1"                           "#define __FMA__ 1"                             "#define __AVX2__ 1"                           
"#define __haswell 1"                           "#define __tune_core_avx2__ 1"                  "#define __SSSE3__ 1"                          
"#define __RDRND__ 1"                           "#define __core_avx2 1"                         "#define __FP_FAST_FMA 1"                      
"#define __FSGSBASE__ 1"                        "#define __RDSEED__ 1"                          "#define __BMI2__ 1"                           
"#define __AES__ 1"                             "#define __haswell__ 1"                         "#define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_16 1"
"#define __ADX__ 1"                             "#define __BMI__ 1"                             "#define __SSE3__ 1"       
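
A listing like the one above can be produced by dumping the compiler's predefined macros under two -march settings and diffing them. This is a minimal sketch, assuming g++ is on the PATH and that the baseline is core2 (the post does not say exactly which baseline "basic" refers to). The helper names predefined_macros and compare_march are hypothetical.

```r
# Dump the macros g++ predefines for a given -march value.
# -dM -E prints the macro definitions after preprocessing an empty input.
predefined_macros <- function(march, compiler = "g++") {
  system2(compiler,
          c(paste0("-march=", march), "-dM", "-E", "-x", "c++", "/dev/null"),
          stdout = TRUE)
}

# Macros defined (or defined differently) for `target` but not for `baseline`.
compare_march <- function(target = "native", baseline = "core2") {
  setdiff(predefined_macros(target), predefined_macros(baseline))
}
```

Calling compare_march() on the affected machine should show which instruction-set extensions -march=native turns on relative to the baseline.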

#10

I set up CmdStan and it compiled / appears to be running with a case that fails in rstan, which again implies that this problem is related to the Rcpp issue that @bgoodri mentioned earlier.

I will just make a CmdStan shim that I use instead of rstan::stan().
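
For reference, a rough sketch of what such a shim could look like. It assumes the model has already been compiled with CmdStan into an executable at the path given by exe, writes the data with rstan::stan_rdump(), runs the chains one after another, and reads the CSV output back with rstan::read_stan_csv(). The names cmdstan_args and cmdstan_shim, the file names, and the defaults are all made up for illustration.

```r
# Build the CmdStan command-line arguments for one chain.
cmdstan_args <- function(chain, data_file, out_dir, iter = 1250, warmup = 250) {
  c("sample",
    sprintf("num_samples=%d", as.integer(iter - warmup)),
    sprintf("num_warmup=%d", as.integer(warmup)),
    "save_warmup=0",
    sprintf("id=%d", as.integer(chain)),
    "data", paste0("file=", data_file),
    "output",
    paste0("file=", file.path(out_dir, sprintf("samples_%d.csv", chain))))
}

# Hypothetical stand-in for rstan::stan(): `exe` is the compiled CmdStan model.
cmdstan_shim <- function(exe, stan_data, chains = 4, out_dir = tempdir(), ...) {
  data_file <- file.path(out_dir, "data.R")
  # stan_rdump() takes object names, so expose the list as an environment.
  rstan::stan_rdump(names(stan_data), file = data_file,
                    envir = list2env(stan_data))
  for (chain in seq_len(chains)) {   # sequential; parallelise if needed
    status <- system2(exe, cmdstan_args(chain, data_file, out_dir, ...))
    if (status != 0) stop("CmdStan chain ", chain, " exited with status ", status)
  }
  csvs <- file.path(out_dir, sprintf("samples_%d.csv", seq_len(chains)))
  rstan::read_stan_csv(csvs)
}

example_args <- cmdstan_args(2, "data.R", "out")
print(example_args)
```

Reading everything back with read_stan_csv() still pulls all draws into R, so for very large outputs it may be preferable to process the CSV files chain by chain.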