Note: I posted an earlier version of this with a much more complex model, but I realised this issue is completely independent of the model, so here’s a much simpler reproducible example (I deleted the previous complicated version).
The problem
cmdstanr crashes the R-session after successfully sampling from a model with many parameters. I think it’s something to do with how cmdstanr summarises or assesses the sampling. A clean R-session will also crash when trying to read the stored csv files using cmdstanr::read_cmdstan_csv() or cmdstanr::as_cmdstan_fit(). However, the same stored csv files can be read successfully with rstan::read_stan_csv(), and so I’m confident that the model-fitting was successful.
This has come up in a project working with a large database of bird observations from the last 56 years: The North American Breeding Bird Survey. The models from that project work fine for bird species with ~50-60K observations, but this R-crash occurs for the more data-rich species with ~100K observations (which result in ~250K parameters). I’d like to be able to apply my model to all of the species in the database, and to stick with cmdstanr for my entire workflow, and of course I’d also like it if the R-session didn’t crash after fitting a model.
Reproducible Example
Here’s a simple reproducible example that suggests there’s something about the number of parameters that causes the crash.
A simple linear regression model with 250K observations.
library(cmdstanr)
N = 250000
x = rnorm(N)
y = x+rnorm(N,0,0.3)
stan_data <- list(N = N,
                  y = y,
                  x = x)
mod <- "models/simple_regression.stan"
model <- cmdstan_model(mod)
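As a quick reference for what a successful fit should recover, the simulation above implies an intercept near 0, a slope near 1, and a residual standard deviation near 0.3. The sketch below checks that with ordinary least squares in base R. It is only a sanity check on the simulated data, not a substitute for the Stan fit, and the seed is an arbitrary choice that is not in the original example.

```r
set.seed(1)  # arbitrary seed (an assumption), only so the sketch is repeatable

# Same simulation as in the example above
N <- 250000
x <- rnorm(N)
y <- x + rnorm(N, 0, 0.3)

# Ordinary least squares fit of the same linear model
ols <- lm(y ~ x)
ols_estimates <- c(a = unname(coef(ols)[1]),
                   b = unname(coef(ols)[2]),
                   sigma = summary(ols)$sigma)

# Should be close to a = 0, b = 1, sigma = 0.3
print(round(ols_estimates, 3))
```

With this much data the priors in the Stan model should have very little influence, so the posterior means for a, b and sigma would be expected to sit close to these values.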
The model
data {
  int<lower=1> N;
  vector[N] x;
  vector[N] y;
}
parameters {
  real a;
  real b;
  real<lower=0> sigma;
}
model {
  sigma ~ student_t(3,0,1);
  b ~ std_normal();
  a ~ std_normal();
  y ~ normal(a+b*x,sigma);
}
generated quantities {
  vector[N] log_lik;
  for(i in 1:N){
    log_lik[i] = normal_lpdf(y[i] | a+b*x[i], sigma);
  }
}
Crashes after fitting
This call to model$sample crashes the R-session after sampling is complete. The csv output files are stored. It takes ~10 minutes to sample and write the files; then, with no errors or warnings, the R-session crashes. The crash happens both in a stand-alone R-session and in RStudio.
stanfit <- model$sample(
  data=stan_data,
  refresh=200,
  chains=4,
  iter_sampling=1000,
  iter_warmup=1000,
  parallel_chains = 4,
  output_dir = "output",
  output_basename = "simple_regression_fit")
The csv files can be read with rstan
This rstan::read_stan_csv call works, although it takes a long time to read in the files.
csv_files <- paste0("output/simple_regression_fit-",1:4,".csv")
stanfit <- rstan::read_stan_csv(csv_files, col_major = TRUE) ## successful reading of csv files with rstan
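Before loading anything, it can help to confirm how wide the stored files actually are. The sketch below counts the columns in a Stan CSV file by reading only up to its header row, so it never loads the draws. count_stan_csv_columns is a hypothetical helper written for this post, not a cmdstanr or rstan function, and the small temporary file is a stand-in so that the code runs without the real output.

```r
# Hypothetical helper (not part of cmdstanr or rstan): count the columns of a
# Stan CSV file by reading line by line until the header row is found.
count_stan_csv_columns <- function(path) {
  con <- file(path, open = "r")
  on.exit(close(con))
  repeat {
    line <- readLines(con, n = 1L, warn = FALSE)
    if (length(line) == 0L) stop("No header row found in ", path)
    # Stan CSV files start with "#" comment lines; the first non-empty,
    # non-comment line holds the comma-separated column names.
    if (nzchar(line) && !startsWith(line, "#")) break
  }
  length(strsplit(line, ",", fixed = TRUE)[[1]])
}

# Tiny stand-in file so the sketch is runnable on its own
demo_file <- tempfile(fileext = ".csv")
writeLines(c("# model = simple_regression_model",
             "lp__,accept_stat__,a,b,sigma,log_lik.1,log_lik.2",
             "-1.2,0.9,0.1,1.0,0.3,-0.5,-0.6"),
           demo_file)

n_cols <- count_stan_csv_columns(demo_file)
print(n_cols)
```

On the real output, sapply(csv_files, count_stan_csv_columns) should return the same count for all four chains, presumably a little over 250,000 for this model.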
But trying to read or load the files with cmdstanr causes an R crash
Trying to read in the csv files with cmdstanr causes the R-session to crash. The crash happens quickly (within a few seconds), the operating system gives no indication of a memory issue or any other problem, and no error is reported. The session crashes both within a stand-alone R-session and in RStudio.
### this as_cmdstan_fit call crashes the R-session
stanfit <- as_cmdstan_fit(files = csv_files)
### similarly, this read_cmdstan_csv call crashes the R-session
stanfit <- read_cmdstan_csv(
  files = csv_files, # csv_files already includes the "output/" prefix
  variables = "", # empty string: read no model variables
  sampler_diagnostics = NULL, # NULL: read all sampler diagnostics
  format = "draws_list") # following note about efficiency in ?cmdstanr::draws
Session info
Running on a Windows computer with 16 cores and 128GB of RAM (so it’s not a question of memory, I don’t think)
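As a rough check on the memory question, here is a back-of-envelope estimate of how large the post-warmup draws would be if held as one dense array of doubles. It is a sketch under stated assumptions, not a diagnosis of the crash: the count of seven internal columns (lp__ plus six sampler diagnostics) is my assumption about the CSV layout, and the readers may hold several intermediate copies or use a different in-memory layout.

```r
# Back-of-envelope size of the post-warmup draws as a dense double array
n_log_lik  <- 250000  # one log_lik element per observation
n_params   <- 3       # a, b, sigma
n_internal <- 7       # assumed: lp__ plus six sampler diagnostic columns
n_chains   <- 4
n_iter     <- 1000    # iter_sampling per chain

n_columns <- n_log_lik + n_params + n_internal
n_values  <- n_columns * n_chains * n_iter
size_gib  <- n_values * 8 / 1024^3  # 8 bytes per double

cat(sprintf("%.0f columns, %.0f values, about %.2f GiB per copy\n",
            n_columns, n_values, size_gib))
```

Even allowing for a few temporary copies, that stays well below 128GB, which is consistent with the operating system not reporting memory pressure.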
utils::sessionInfo()
R version 4.2.0 (2022-04-22 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18363)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.utf8 LC_CTYPE=English_United States.utf8 LC_MONETARY=English_United States.utf8 LC_NUMERIC=C
[5] LC_TIME=English_United States.utf8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] rstan_2.21.5 ggplot2_3.3.6 StanHeaders_2.21.0-7 cmdstanr_0.5.2
loaded via a namespace (and not attached):
[1] Rcpp_1.0.8.3 pillar_1.7.0 compiler_4.2.0 prettyunits_1.1.1 tools_4.2.0 pkgbuild_1.3.1 jsonlite_1.8.0 lifecycle_1.0.1
[9] tibble_3.1.7 gtable_0.3.0 checkmate_2.1.0 pkgconfig_2.0.3 rlang_1.0.2 cli_3.3.0 DBI_1.1.3 parallel_4.2.0
[17] xfun_0.31 loo_2.5.1 gridExtra_2.3 withr_2.5.0 dplyr_1.0.9 knitr_1.39 generics_0.1.2 vctrs_0.4.1
[25] stats4_4.2.0 grid_4.2.0 tidyselect_1.1.2 inline_0.3.19 glue_1.6.2 R6_2.5.1 processx_3.6.1 fansi_1.0.3
[33] distributional_0.3.0 tensorA_0.36.2 callr_3.7.0 farver_2.1.0 purrr_0.3.4 posterior_1.2.2 magrittr_2.0.3 codetools_0.2-18
[41] matrixStats_0.62.0 ps_1.7.1 backports_1.4.1 scales_1.2.0 ellipsis_0.3.2 abind_1.4-5 assertthat_0.2.1 colorspace_2.0-3
[49] utf8_1.2.2 RcppParallel_5.1.5 munsell_0.5.0 crayon_1.5.1