Sampling fails after warmup

I have a model that seems to be failing as soon as it stops the warmup phase and begins the sampling phase.

Here’s a demo of some of the output where I was using 100 iterations for warmup. The same thing happens with 250 and 1k warmup iterations. It’ll stop the warmup, try to do one normal sample, and fail silently.

Chain 3 Iteration:   98 / 1100 [  8%]  (Warmup) 
Chain 1 Iteration:  101 / 1100 [  9%]  (Sampling) 
Chain 4 Iteration:   99 / 1100 [  9%]  (Warmup) 
Chain 2 Iteration:   95 / 1100 [  8%]  (Warmup) 
Chain 2 Iteration:   96 / 1100 [  8%]  (Warmup) 
Chain 3 Iteration:   99 / 1100 [  9%]  (Warmup) 
Chain 4 Iteration:  100 / 1100 [  9%]  (Warmup) 
Chain 4 Iteration:  101 / 1100 [  9%]  (Sampling) 
Chain 2 Iteration:   97 / 1100 [  8%]  (Warmup) 
Warning: Chain 1 finished unexpectedly!

Chain 3 Iteration:  100 / 1100 [  9%]  (Warmup) 
Chain 2 Iteration:   98 / 1100 [  8%]  (Warmup) 
Chain 3 Iteration:  101 / 1100 [  9%]  (Sampling) 
Warning: Chain 4 finished unexpectedly!

Chain 2 Iteration:   99 / 1100 [  9%]  (Warmup) 
Chain 2 Iteration:  100 / 1100 [  9%]  (Warmup) 
Warning: Chain 3 finished unexpectedly!

Chain 2 Iteration:  101 / 1100 [  9%]  (Sampling) 
Warning: Chain 2 finished unexpectedly!

Warning: Use read_cmdstan_csv() to read the results of the failed chains.
Warning messages:
1: All chains finished unexpectedly! Use the $output(chain_id) method for more information.
2: No chains finished successfully. Unable to retrieve the fit. 

Anyone seen anything like this before? I’m on cmdstanr with cmdstan version 2.31.0.

Normally I would accompany this with a MWE, but it’s a relatively large model for ongoing non-shareable research. I’m hoping people might have a sense of places I could start looking.

Hopefully somebody knows what’s up, but one suggestion is to run with save_warmup = TRUE and then inspect the output CSVs for anything super weird, and also to check, for example, whether the failure happens before or after the inverse metric is written out. Edit: this can also tell you whether for some reason your cmdstan is unable to write any iterations to CSV, or whether the problem is somehow specific to the transition from warmup to sampling.
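A minimal base R sketch of that check, in case it helps. It reads one Stan CSV file and reports whether the column header, any draws, and the adaptation block made it to disk. The function name summarise_stan_csv is made up for illustration. The matching assumes the comment lines CmdStan normally writes after warmup ("# Adaptation terminated" and "# Diagonal elements of inverse mass matrix:"), so adjust it if your version formats these differently.

```r
# Summarise how far a single Stan CSV file got before the chain died.
# Hypothetical helper for illustration; it is not part of cmdstanr.
summarise_stan_csv <- function(path) {
  lines <- readLines(path, warn = FALSE)
  is_comment <- startsWith(lines, "#")
  # Non-comment, non-empty lines: the column header followed by the draws.
  body <- lines[!is_comment & nzchar(lines)]
  list(
    header_written     = length(body) >= 1L,
    draws_written      = max(length(body) - 1L, 0L),
    # Assumed comment text written by CmdStan at the end of adaptation.
    adaptation_written = any(grepl("Adaptation terminated", lines, fixed = TRUE)),
    inv_metric_written = any(grepl("inverse mass matrix", lines, fixed = TRUE))
  )
}
```

With cmdstanr, I believe the file paths for failed chains can be obtained with fit$output_files(include_failed = TRUE), after which something like lapply(paths, summarise_stan_csv) gives one summary per chain.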


Okay, good suggestion. This is interesting: it now fails immediately. Inspecting the output CSV reveals that the inv_metric never makes it into the output, and warmup_draws is all NAs.

Can you fit other models just fine?

Yeah, my machine works for a ton of other models, which suggests there’s something weird about this one. Not sure what I should do about this though, since an MWE will be real hard to come by here. Wish I had better diagnostic tools here, or something.

Print debugging seems to suggest that the failure happens after the model block runs. I sprinkled a bunch of print statements throughout the model block that print out the log joint to see if I could track down where it failed. I also set the max_treedepth=1 here to cut down on the amount of messages (the outcome is the same regardless of the treedepth I use).

Here’s the log:

Chain 1 Iteration:    1 / 1100 [  0%]  (Warmup) 
Chain 1 ======beginning evaluation====== 
Chain 1 target priors: -104.444 
Chain 1 target characteristcs: -128.444 
Chain 1 target characteristic likelihood: -3362.07 
Chain 1 target poisson likelihood: -3374.07 
Chain 1 target arrival likelihood: -1.57851e+07 
Chain 1 target at model block end: -1.58903e+07 
Chain 1 ======beginning evaluation====== 
Chain 1 target priors: -104.681 
Chain 1 target characteristcs: -128.681 
Chain 1 target characteristic likelihood: -3360.62 
Chain 1 target poisson likelihood: -3372.43 
Chain 1 target arrival likelihood: -6.47385e+06 
Chain 1 target at model block end: -6.57905e+06 
Warning: Chain 1 finished unexpectedly!

Warning messages:
1: In model$sample(data = file.path(input_location, "data.json"), refresh = 1,  :
  'num_chains' is deprecated. Please use 'chains' instead.
2: No chains finished successfully. Unable to retrieve the fit. 

Curiously it initializes just fine – just seems to fail when it needs to offload something to disk.
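For anyone wanting to try the same print-debugging approach, here is a rough sketch on a toy model. This is not the model from this thread: the data, parameters, and priors are invented purely for illustration. The Stan program is held in an R string and written to a temporary file. The compile and sample calls are left commented out, since they assume a working cmdstanr and CmdStan install.

```r
# Toy Stan program (NOT the real model) with print statements between groups
# of sampling statements, so the log shows how far each evaluation got.
stan_code <- '
data {
  int<lower=0> N;
  vector[N] y;
}
parameters {
  real mu;
  real<lower=0> sigma;
}
model {
  print("======beginning evaluation======");
  mu ~ normal(0, 5);
  sigma ~ exponential(1);
  print("target priors: ", target());
  y ~ normal(mu, sigma);
  print("target at model block end: ", target());
}
'

# Write the program to a temporary .stan file that cmdstanr could compile.
stan_file <- tempfile(fileext = ".stan")
writeLines(stan_code, stan_file)

# Assumed next steps (need cmdstanr and CmdStan, so left commented out):
# model <- cmdstanr::cmdstan_model(stan_file)
# fit <- model$sample(data = list(N = 3, y = c(0.1, -0.4, 1.2)),
#                     chains = 1, refresh = 1, max_treedepth = 1)
```

If the last line printed before the crash is always the "model block end" message, that points away from the model block itself and towards whatever runs afterwards, which matches what the log above shows.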

Do you have a generated quantities block? It’s not executed during the warmup, so failing there could explain this.

No, I don’t have any generated quantities yet.

More info! After enabling some flags by placing the following in my make/local file:

CXXFLAGS+= -fsanitize=undefined
CXXFLAGS+= -fsanitize=address

I got these flags from here.

The sanitizer output suggests a memory issue. I have the following output:

==2792681==ERROR: AddressSanitizer: attempting free on address which was not malloc()-ed: 0x7fa3cec11810 in thread T0
    #0 0x7fa3d04be672 in __interceptor_free /usr/src/debug/gcc/libsanitizer/asan/asan_malloc_linux.cpp:52
    #1 0x5579cccc4623 in stan::services::util::create_unit_e_diag_inv_metric(unsigned long) (/home/cameron/research/option-value-of-news/generative-model/stan/combined-model+0x158e623)
    #2 0x5579cccd54b4 in int stan::services::sample::hmc_nuts_diag_e_adapt<stan::model::model_base, std::shared_ptr<stan::io::var_context>, stan::callbacks::writer, stan::callbacks::unique_stream_writer<std::ostream>, stan::callbacks::unique_stream_writer<std::ostream> >(stan::model::model_base&, unsigned long, std::vector<std::shared_ptr<stan::io::var_context>, std::allocator<std::shared_ptr<stan::io::var_context> > > const&, unsigned int, unsigned int, double, int, int, int, bool, int, double, double, int, double, double, double, double, unsigned int, unsigned int, unsigned int, stan::callbacks::interrupt&, stan::callbacks::logger&, std::vector<stan::callbacks::writer, std::allocator<stan::callbacks::writer> >&, std::vector<stan::callbacks::unique_stream_writer<std::ostream>, std::allocator<stan::callbacks::unique_stream_writer<std::ostream> > >&, std::vector<stan::callbacks::unique_stream_writer<std::ostream>, std::allocator<stan::callbacks::unique_stream_writer<std::ostream> > >&) (/home/cameron/research/option-value-of-news/generative-model/stan/combined-model+0x159f4b4)
    #3 0x5579ccc75e6e in cmdstan::command(int, char const**) (/home/cameron/research/option-value-of-news/generative-model/stan/combined-model+0x153fe6e)
    #4 0x5579cc2ef245 in main (/home/cameron/research/option-value-of-news/generative-model/stan/combined-model+0xbb9245)
    #5 0x7fa3cf63c28f  (/usr/lib/
    #6 0x7fa3cf63c349 in __libc_start_main (/usr/lib/
    #7 0x5579cc2efa14 in _start ../sysdeps/x86_64/start.S:115

0x7fa3cec11810 is located 16 bytes inside of 170944-byte region [0x7fa3cec11800,0x7fa3cec3b3c0)
allocated by thread T0 here:
    #0 0x7fa3d04bfa89 in __interceptor_malloc /usr/src/debug/gcc/libsanitizer/asan/asan_malloc_linux.cpp:69
    #1 0x5579cc3fd013 in Eigen::internal::aligned_malloc(unsigned long) (/home/cameron/research/option-value-of-news/generative-model/stan/combined-model+0xcc7013)

SUMMARY: AddressSanitizer: bad-free /usr/src/debug/gcc/libsanitizer/asan/asan_malloc_linux.cpp:52 in __interceptor_free

My interpretation of this is that I may have a variable which is undefined when the metric is created? Would that be a reasonable interpretation?

@stevebronder @WardBrian

How many parameters are in this model?

Boy, is this weird as hell. I started just commenting out vast swathes of the model to see if I could get the alloc error to go away, and lo and behold, I have the most amazing MWE.

Model code, saved to mwe.stan:

data {
  int D;
}

parameters {
  vector<lower=0>[D] epsilon_var;
}

model {
  epsilon_var ~ inv_gamma(2, 3);
}

R code:

model = cmdstanr::cmdstan_model("mwe.stan")

chain = model$sample(data=list(D=12))

Session info:

> sessionInfo()
R version 4.2.2 (2022-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Manjaro Linux

Matrix products: default
BLAS:   /usr/lib/
LAPACK: /usr/lib/

 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] knitr_1.41           magrittr_2.0.3       tidyselect_1.2.0    
 [4] munsell_0.5.0        colorspace_2.0-3     R6_2.5.1            
 [7] rlang_1.0.6          fansi_1.0.3          dplyr_1.0.10        
[10] tools_4.2.2          grid_4.2.2           checkmate_2.1.0     
[13] gtable_0.3.1         xfun_0.36            utf8_1.2.2          
[16] cli_3.6.0            DBI_1.1.3            withr_2.5.0         
[19] cmdstanr_0.5.3       posterior_1.3.1      assertthat_0.2.1    
[22] abind_1.4-5          tibble_3.1.8         lifecycle_1.0.3     
[25] processx_3.8.0       tensorA_0.36.2       farver_2.1.1        
[28] ggplot2_3.4.0        ps_1.7.2             vctrs_0.5.1         
[31] glue_1.6.2           compiler_4.2.2       pillar_1.8.1        
[34] generics_0.1.3       scales_1.2.1         backports_1.4.1     
[37] distributional_0.3.1 jsonlite_1.8.4       pkgconfig_2.0.3   

I’m running on cmdstan version 2.31.0.

Huh, that example runs to completion for me (Ubuntu 22.04/gcc 11.3.0). It seems like it might then be something version/compiler specific; sometimes these memory errors/UB issues only show up with newer versions. If you’re using Manjaro I assume you have gcc 12?

Edit: I just tried with gcc 12.2.0 and it still sampled fine, so there must be something else going on here.

Yep, I’m currently on gcc 12.2.0. Super weird!

Perhaps I should roll back to an earlier cmdstan release, let me give it a shot.

Does it still fail if you initialize far away from 0? What if you put lower=0.0001?
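For reference, a sketch of how the first suggestion might be tried on the MWE above. It assumes cmdstanr's init argument accepts a function returning a named list of parameter values, which is my understanding of its interface. The function name init_away_from_zero and the range 1 to 2 are arbitrary choices for illustration.

```r
D <- 12  # same dimension as in the MWE data

# Start every element of epsilon_var well away from the lower bound at 0.
# The range [1, 2] is an arbitrary illustrative choice.
init_away_from_zero <- function() {
  list(epsilon_var = runif(D, min = 1, max = 2))
}

# Inspect one set of initial values.
str(init_away_from_zero())

# Assumed usage (requires cmdstanr and the compiled MWE, so left commented out):
# fit <- model$sample(data = list(D = D), init = init_away_from_zero)
```

The second suggestion needs no R changes: it amounts to declaring the parameter as vector<lower=0.0001>[D] epsilon_var; in the Stan program and recompiling.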


Okay, I figured it out by (a) rolling back to cmdstan 2.29.2 (though this was not the issue) and (b) commenting out all my code and re-introducing it a few lines at a time.

Essentially, the issue arose from a fairly minor bug where I had a transformed parameter array with dimensions

array[x1,x2,x3] real something;

and a function that calculated the value of something:

array[,,] real something_function(x1, x2, x3);

Unfortunately, I was calling something_function(x1, x2, x4); for some other value of x4 < x3. This had the effect of leaving a large block of something uninitialized. At least, this is my sense of the problem: correcting the dimensions passed to something_function seems to have made it estimable. My chains are now running!

Thanks for the tips, folks.


FWIW 2.29.2 made the MWE I presented above work fine.


Given this explanation, why would the MWE fail though?
