STAN on multiple cores occasionally crashing Linux without overwhelming memory

mathesong · September 2, 2020, 8:41am

I’m trying to fit a nonlinear model on not particularly many observations, and it occasionally crashes my computer forcing me to reset. Running on 1 core seems to always work, 2 cores sometimes crashes, and 3 cores crashes fairly regularly: it’s stochastic, which is most annoying of all. I would have expected I was overwhelming RAM, but I’m running a pretty powerful machine: 16 virtual cores and 64GB RAM, so I don’t think it’s that. Any help would be very much appreciated. I’m fairly new to working more on Linux, so I might be doing something stupid somewhere.

I’m running Pop!_OS 20.04 LTS, and using brms_2.13.9 through RStudio, though I encounter the issue using either the rstan_2.19.2 or the cmdstanr_0.1.3 (cmdstan-2.24.1) backend. The model I’m fitting isn’t really that large: it’s a nonlinear model fitting 1000 observations from 50 groups, with 5 parameters per group plus their associated SDs etc., which comes to 822 parameters in total.

I opened htop to watch RAM usage while fitting yesterday, and had the computer crash while it was open. Had to screenshot with my phone as the computer had hung.

… so, not even 5% memory usage, and only 3 cores active.
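Since the machine hangs before anything can be saved, one option is to log memory figures to disk while the model runs and read the file after the reboot. Below is a minimal Python sketch, assuming a Linux system that exposes /proc/meminfo. The function names, the log file name, and the one-second interval are illustrative choices and not part of any Stan tooling:

```python
import os
import time


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integers (most values are in kB)."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def log_memory(path="meminfo_log.csv", interval=1.0, samples=None):
    """Append timestamped MemAvailable and SwapFree readings to a CSV file.

    With samples=None the loop runs until the process is interrupted.
    """
    count = 0
    with open(path, "a") as log:
        while samples is None or count < samples:
            with open("/proc/meminfo") as src:
                info = parse_meminfo(src.read())
            row = (int(time.time()), info.get("MemAvailable", -1), info.get("SwapFree", -1))
            log.write("%d,%d,%d\n" % row)
            log.flush()
            os.fsync(log.fileno())  # push the row to disk so it survives a hard reset
            count += 1
            time.sleep(interval)
```

Calling log_memory() from a separate terminal before starting the sampler leaves a record of the last readings before the hang, which can confirm or rule out a sudden memory spike that htop did not have time to display.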

I’m a bit stumped. If anyone has any suggestions, I’d be thrilled.

Thanks so much in advance!

rok_cesnovar · September 2, 2020, 9:05am

Can you paste any snippet on how you call your model? That might help us debug this.

mathesong · September 2, 2020, 9:18am

I can definitely do that! Maybe I should have done so sooner. Sorry about that.

logtwotcm_prior <- c(
  set_prior("normal(-3, 0.2)", nlpar = "logk1"),
  set_prior("normal(-1.5, 0.2)", nlpar = "logvnd"),
  set_prior("normal(1, 0.2)", nlpar = "logbpnd"),
  set_prior("normal(-4, 0.2)", nlpar = "logk4"),
  set_prior("normal(-2, 1)", nlpar = "logvb"),
  set_prior("normal(0, 0.2)", nlpar = "logk1", class = "sd"),
  set_prior("normal(0, 0.2)", nlpar = "logvnd", class = "sd"),
  set_prior("normal(0, 0.2)", nlpar = "logbpnd", class = "sd"),
  set_prior("normal(0, 0.2)", nlpar = "logk4", class = "sd"),
  set_prior("normal(0, 0.2)", nlpar = "logvb", class = "sd"),
  set_prior("normal(0, 0.005)", class = "sigma"))


logtwotcm_fit_formula <- bf(
  meas_tac ~ twotcm_log_stan(logk1, logvnd, logbpnd, logk4, logvb, MidTime,
                             b_pfc,
                             lambda1_pfc, lambda2_pfc, lambda3_pfc,
                             A1_pfc, A2_pfc, A3_pfc, tstar_pfc,
                             b_tot,
                             lambda1_tot, lambda2_tot, lambda3_tot,
                             A1_tot, A2_tot, A3_tot,
                             tstar_tot, indicator),
  # Nonlinear variables
  logk1 + logvnd + logbpnd + logk4 + logvb ~ 1 + (1|m|ID),
  # Nonlinear fit
  nl = TRUE)

logtwotcm_fit <- brm(
  logtwotcm_fit_formula,
  family = gaussian(),
  data = modeldat,
  prior = logtwotcm_prior,
  stanvars = stanvar(scode = two_compartment_log_stan,
                     block = "functions"),
  chains = 3,
  cores = 2,
  backend = "cmdstanr")

I’m not sure if I can share the model function definition itself just yet, but it’s pretty straightforward. It defines the real variables, exponentiates the log variables, and then runs a very long, analytical solution to a pharmacokinetic model. It’s just one “line”, over about 20-30 lines on the screen.

As said, the crashes are stochastic. It works some of the time, and fails other times. And with more cores, it fails more regularly.

Thanks in advance for any help. I’m very happy to run any kinds of checks which might be useful - I just don’t know what these could be.

rok_cesnovar · September 2, 2020, 9:32am

And if you run with

chains = 3,
cores = 1

it runs fine?

mathesong · September 2, 2020, 9:38am

Yup. I’ve yet to have a crash with cores=1. It occasionally crashes with cores=2, and with cores=3 it crashes quite frequently (probably >50% of the time, though I actually can’t remember if it worked at least once).

This is just a prototype implementation for now, run on small simulated datasets. The plan is to scale this model up to bigger datasets with a more complicated hierarchical structure, so then I would worry about cores=1 failing too. Otherwise, I might just have run these as single-chain models and stuck them together.

rok_cesnovar · September 2, 2020, 9:53am

Ok, thanks. Let’s first check whether the issue is at the Stan level or the brms level.

In order to do that, you have to make the stancode and standata and run cmdstanr separately. You can use make_stancode (see How to convert "standata" to "json"? for a snippet on how to quickly transform the data for cmdstanr).

If that still crashes, then it’s something weird going on in the Stan core; otherwise it’s something that happens after Stan runs.

mathesong · September 2, 2020, 10:42am

SUMMARY: still crashes out.

I tried it using cmdstanr as follows:

# Code
stanc <- make_stancode(logtwotcm_fit_formula,
                       family = gaussian(),
                       data = modeldat,
                       prior = logtwotcm_prior,
                       stan_funs = two_compartment_log_stan)

# Data
stand <- make_standata(logtwotcm_fit_formula,
                       family = gaussian(),
                       data = modeldat,
                       prior = logtwotcm_prior,
                       stanvars = stanvar(scode = two_compartment_log_stan,
                                          block = "functions"),
                       chains = 4,
                       cores = 4,
                       backend = "cmdstanr")

# Data list
stand_list <- list()
for (t in names(stand)) {
  stand_list[[t]] <- stand[[t]]
}

# Saving code
stanc_f <- cmdstanr::write_stan_file(stanc, basename = "cmdstanr_test.stan")

# Model
mod <- cmdstanr::cmdstan_model(stanc_f)

# Sample
fit <- mod$sample(
  data = stand_list,
  chains = 4,
  parallel_chains = 4,
  refresh = 500
)

I used 4 cores just to be sure to get it to actually freeze if it was going to, and it did. htop says I had <7.5GB RAM used out of 62.6GB.

So then it’s something in the STAN code probably. But my Linux-fu is not good enough to try to diagnose what’s going on.

One potential thing to try: this is simulated data. I could send the data and full code privately to someone to try it out on a different version of Linux. I could also try to get it working on my Windows machine, but that’ll take some time and fiddling as I wasn’t able to get cmdstanr working on there when I last tried.

(also, thanks so much for all the help thus far!)

ahartikainen · September 2, 2020, 10:50am

What if you inject some std::cout statements into the sampling library to see where it crashes?

mathesong · September 2, 2020, 10:56am

I’d be happy to try this, but I don’t really know where to start. If you could give me some pointers, I could try. I don’t have any experience with C++, so I’m pretty clueless on how to go about doing this.

rok_cesnovar · September 2, 2020, 11:00am

Yeah, that or a bug in the Stan backend (math most likely).

Feel free to DM me and I can try this on Linux easily.

mathesong · September 2, 2020, 12:02pm

Awesome - thank you! I’ve emailed everything over. Fingers crossed that you can reproduce the issue.

rok_cesnovar · September 2, 2020, 12:19pm

How quickly did it fail for you? Instantly, in warmup or in sampling?

Also, can you post the outputs of

make --version
g++ --version
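When two machines need to be compared, the same version information can be gathered in one go and pasted into a reply. Below is a small Python sketch using only the standard library. The helper names tool_version and toolchain_report are hypothetical, and the default tool list (make, g++, clang++) is an assumption about which tools matter here:

```python
import shutil
import subprocess


def tool_version(tool):
    """Return the first line of `<tool> --version`, or a note if the tool is not on PATH."""
    if shutil.which(tool) is None:
        return "not found on PATH"
    result = subprocess.run(
        [tool, "--version"], capture_output=True, text=True, timeout=10
    )
    # Some tools print their version to stderr, so fall back to it when stdout is empty.
    output = (result.stdout or result.stderr).strip()
    return output.splitlines()[0] if output else "no version output"


def toolchain_report(tools=("make", "g++", "clang++")):
    """Map each tool name to its reported version line."""
    return {tool: tool_version(tool) for tool in tools}
```

Printing toolchain_report() on both machines gives a compact side-by-side view of the toolchain, and makes it obvious when a compiler is missing rather than merely different.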

mathesong · September 2, 2020, 12:44pm

Re warmup vs sampling, I’m not sure with using cmdstanr directly, but I didn’t see any progress bars after it started sampling after the first sample. When using cmdstanr through brms with 3 chains, the crash sometimes happened early, sometimes later, so I think it didn’t matter. But I’ll try it again a few times later today after a meeting.

Re make and g++ (and clang++) :

GNU Make 4.2.1
Built for x86_64-pc-linux-gnu
Copyright (C) 1988-2016 Free Software Foundation, Inc.
Licence GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

g++ (Ubuntu 9.3.0-10ubuntu2) 9.3.0
Copyright (C) 2019 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

clang version 10.0.0-4ubuntu1 
Target: x86_64-pc-linux-gnu
Thread model: posix
InstalledDir: /usr/bin

And my .R/Makevars is as follows:


#CXX14FLAGS=-O3 -march=native -mtune=native
#CXX14FLAGS += -fPIC

#CXX14 = g++ -fPIC
CXX14 = clang++ -fPIC

CXXFLAGS=-O3 -std=c++1y -mtune=native -march=native -Wno-unused-variable -Wno-unused-function
CXXFLAGS += -DBOOST_MPL_CFG_NO_PREPROCESSED_HEADERS -DBOOST_MPL_LIMIT_LIST_SIZE=30

CXX14FLAGS=-O3 -std=c++1y -mtune=native -march=native -Wno-unused-variable -Wno-unused-function
CXX14FLAGS += -DBOOST_MPL_CFG_NO_PREPROCESSED_HEADERS -DBOOST_MPL_LIMIT_LIST_SIZE=30

Actually, that’s something to try: using g++ instead of clang. I’ll give that a shot too!

mathesong · September 2, 2020, 12:59pm

After changing the Makevars to run g++, I just tried running 6 chains on 6 cores, and my computer froze again after between 50 and 150 samples on all the chains. So it was warmup this time in any case. Though I don’t know if cmdstanr uses the compiler described in the Makevars, or if you change it some other way?

rok_cesnovar · September 2, 2020, 1:22pm

No, cmdstanr does not use Makevars, and if you didn’t touch anything it probably picked up g++ (which I believe is the default on Linux).

You can switch to a different compiler for cmdstan with

cmdstan_make_local(cpp_options = list("CXX"="clang++"))
rebuild_cmdstan(cores = 4)

This seems to run fine on my Ubuntu machine though… All chains ran just fine, so not sure what to make of all this (finished between 3500 - 4100 seconds).

I am using the exact same compiler, make, and cmdstan version. Argh… Will give this a few more thoughts.

rok_cesnovar · September 2, 2020, 1:31pm

I ran this with the model/data you sent.

library(cmdstanr)

file_path <- file.path("cmdstanr_test.stan")
mod <- cmdstan_model(file_path, compile = TRUE)

fit <- mod$sample(data = "cmdstanr_test_data.json", 
                  chains = 4,
                  parallel_chains = 4,
                  refresh = 100)

Maybe try fixing the seed so we can see if this pops up with a specific seed for you?

mathesong · September 2, 2020, 7:00pm

Well, that’s awesome that you’re running all the same versions of everything, but rather frustrating that we’re getting different results.

I tried different seeds and got different results. With seed=42, my PC hung very early - about 20 seconds after it started sampling.

fit <- mod$sample(data = "cmdstanr_test_data.json", 
                  chains = 4,
                  parallel_chains = 4,
                  refresh = 100, seed=42)

With seed=12345, it ran for 15 minutes or so, and then started spamming the console with

*** recursive gc invocation
*** recursive gc invocation
*** recursive gc invocation

Is that garbage collector invocation?

stevebronder · September 2, 2020, 10:26pm

It looks like someone using prophet hit this in the past. You may need to reinstall some dependencies or try the devtools::clean_dll() call that was mentioned in the issue below:

github.com/facebook/prophet

Caught segfault error

opened 06:06PM - 27 Jul 20 UTC

closed 01:08AM - 06 Nov 20 UTC

jroberayalas

I'm running a code that relies on the Prophet library, but when I want to predict using a trained model, I get a "caught segfault error". Any help on how to solve this? I've remove/reinstalled the prophet library already without any success.

```
 *** caught segfault ***
address 0x12cf90ffc, cause 'memory not mapped'
*** recursive gc invocation
*** recursive gc invocation
*** recursive gc invocation
[... the same line repeats many more times ...]

Traceback:
 1: pmax(abs(as.integer(width)), if (format == "fg" || format == "f") { xEx <- as.integer(floor(log10(abs(x + (x == 0))))) as.integer(x < 0 | flag != "") + digits + if (format == "f") { 2L + pmax(xEx, 0L) } else { 1L + pmax(xEx, digits, digits + (-xEx) + 1L) + length(nf) }} else rep.int(digits + 8L, n))
 2: formatC(x, format = "fg", width = 1, digits = digits)
 3: paste0(if (use.fC) formatC(x, format = "fg", width = 1, digits = digits) else format(x, trim = TRUE, digits = digits, ...), "%")
 4: format_perc(probs)
 5: quantile.default(newX[, i], ...)
 6: FUN(newX[, i], ...)
 7: apply(comp, 1, stats::quantile, lower.p, na.rm = TRUE)
 8: predict_seasonal_components(object, df)
 9: predict.prophet(m, all_data)
10: predict(m, all_data)
11: eval(statements[[idx]], envir = sourceEnv)
12: eval(statements[[idx]], envir = sourceEnv)
13: sourceWithProgress(script = "/Users/josolare/Documents/workspace/src/LRpR/backcaster.R", encoding = "UTF-8", con = stdout(), importRdata = NULL, exportRdata = NULL)
An irrecoverable exception occurred. R is aborting now ...
```

The `sessionInfo` is shown below:

```
> sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Mojave 10.14.6

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats graphics grDevices utils datasets methods base

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.5 urca_1.3-0 pillar_1.4.6 compiler_4.0.2
 [5] tseries_0.10-47 tools_4.0.2 xts_0.12-0 nlme_3.1-148
 [9] lubridate_1.7.9 lifecycle_0.2.0 tibble_3.0.3 gtable_0.3.0
[13] lattice_0.20-41 pkgconfig_2.0.3 rlang_0.4.7 DBI_1.1.0
[17] rstudioapi_0.11 curl_4.3 parallel_4.0.2 dplyr_1.0.0
[21] generics_0.0.2 vctrs_0.3.2 imputeTS_3.0 lmtest_0.9-37
[25] grid_4.0.2 nnet_7.3-14 forecast_8.12 tidyselect_1.1.0
[29] glue_1.4.1 R6_2.4.1 tidyr_1.1.0 TTR_0.23-6
[33] ggplot2_3.3.2 purrr_0.3.4 magrittr_1.5 scales_1.1.1
[37] ellipsis_0.3.1 quantmod_0.4.17 timeDate_3043.102 colorspace_1.4-1
[41] fracdiff_1.5-1 quadprog_1.5-8 stinepack_1.4 munsell_0.5.0
[45] crayon_1.3.4 zoo_1.8-8
```

mathesong · September 3, 2020, 11:42am

Thanks so much for the suggestion!

I tried devtools::clean_dll(), and it doesn’t do much. Just gives me an error about looking for the root directory with a DESCRIPTION file. The command is for deleting compiled files when writing an R package. In this case, I’m calling cmdstanr from RStudio, and not within a package, so there’s nothing to clean out. Thanks for finding the old prophet issue though!

Since the error is occurring when running cmdstanr directly, I presume that it’s something that cmdstan depends on that’s causing the issues. Regarding dependencies then, I tried changing from g++ to clang++ in cmdstan. I got slightly different results: I ran 6 chains, 6 cores and seed=42, and during warmup, while some chains were steadily progressing, I got back “Chain 1 finished unexpectedly”, then “Chain 3 finished unexpectedly”, and then my computer hung again. I never saw this with g++.

So something is causing some of these chains to fail, but changing the C++ compiler doesn’t necessarily fix it. Though it might make it a little bit more resilient, as now some of the chains fail in a visible way. Are there other dependencies that I might cycle through reinstalling? I guess I could try reinstalling make, but are there others you might recommend?

Another possibility could be that my model is specified badly (I’m having issues with convergence simultaneously, and busy drafting another question). Could it be that improving my model definition is enough to prevent STAN from freezing my machine?

ahartikainen · September 3, 2020, 11:45am

Do you know any Python? Maybe try the same model with CmdStanPy?

instructions

This assumes that python --version reports 3.x; if it does not, use python3 instead.

python -m pip install cmdstanpy
python -m cmdstanpy.install_cmdstan
# create a myfile.py (follow example in https://github.com/stan-dev/cmdstanpy)
# create myfile.stan
# use rstan function to create datafile myfile.rdata
python myfile.py

If you want to access the csv files later, add this to your python file

fit.save_csvfiles(".") # save to local folder
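For reference, a myfile.py along these lines might look like the sketch below. It is not an official template: it assumes CmdStanPy is installed, reuses the cmdstanr_test.stan and cmdstanr_test_data.json file names from earlier in this thread as defaults, and the helper names sampler_settings and run are made up for illustration:

```python
def sampler_settings(seed, chains=4):
    """Collect the keyword arguments passed on to CmdStanModel.sample()."""
    return {
        "chains": chains,
        "parallel_chains": chains,  # one core per chain, as in the cmdstanr runs
        "seed": seed,
        "refresh": 100,
    }


def run(stan_file="cmdstanr_test.stan", data_file="cmdstanr_test_data.json", seed=42):
    """Compile the model, sample, and keep the CSV output in the working directory."""
    # Imported here so that sampler_settings() can be used without CmdStanPy installed.
    from cmdstanpy import CmdStanModel

    model = CmdStanModel(stan_file=stan_file)
    fit = model.sample(data=data_file, **sampler_settings(seed))
    fit.save_csvfiles(".")  # save to local folder
    return fit
```

Calling run(seed=42) and then run(seed=12345) repeats the two seeds tried above with cmdstanr. If the machine still hangs when the sampler is driven from Python, that would point away from R and RStudio and towards CmdStan itself or the system.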
