Sampling fails after warmup

I have a model that seems to be failing as soon as it finishes the warmup phase and begins the sampling phase.

Here’s a demo of some of the output where I was using 100 iterations for warmup. The same thing happens with 250 and 1k warmup iterations. Each chain finishes warmup, tries to take one normal sample, and fails silently.

Chain 3 Iteration:   98 / 1100 [  8%]  (Warmup) 
Chain 1 Iteration:  101 / 1100 [  9%]  (Sampling) 
Chain 4 Iteration:   99 / 1100 [  9%]  (Warmup) 
Chain 2 Iteration:   95 / 1100 [  8%]  (Warmup) 
Chain 2 Iteration:   96 / 1100 [  8%]  (Warmup) 
Chain 3 Iteration:   99 / 1100 [  9%]  (Warmup) 
Chain 4 Iteration:  100 / 1100 [  9%]  (Warmup) 
Chain 4 Iteration:  101 / 1100 [  9%]  (Sampling) 
Chain 2 Iteration:   97 / 1100 [  8%]  (Warmup) 
Warning: Chain 1 finished unexpectedly!

Chain 3 Iteration:  100 / 1100 [  9%]  (Warmup) 
Chain 2 Iteration:   98 / 1100 [  8%]  (Warmup) 
Chain 3 Iteration:  101 / 1100 [  9%]  (Sampling) 
Warning: Chain 4 finished unexpectedly!

Chain 2 Iteration:   99 / 1100 [  9%]  (Warmup) 
Chain 2 Iteration:  100 / 1100 [  9%]  (Warmup) 
Warning: Chain 3 finished unexpectedly!

Chain 2 Iteration:  101 / 1100 [  9%]  (Sampling) 
Warning: Chain 2 finished unexpectedly!

Warning: Use read_cmdstan_csv() to read the results of the failed chains.
Warning messages:
1: All chains finished unexpectedly! Use the $output(chain_id) method for more information.
 
2: No chains finished successfully. Unable to retrieve the fit. 

Anyone seen anything like this before? I’m on cmdstanr with cmdstan version 2.31.0.

Normally I would accompany this with an MWE, but it’s a relatively large model for ongoing non-shareable research. I’m hoping people might have a sense of where I could start looking.

Hopefully somebody knows what’s up, but one suggestion: run with save_warmup = TRUE, then inspect the output CSVs for anything super weird – for example, whether the failure happens before or after writing out the inverse metric. Edit: this can also tell you whether your cmdstan is for some reason unable to write any iterations to CSV at all, or whether the problem is somehow specific to the transition from warmup to sampling.
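Concretely, something along these lines (a sketch – `model` and `stan_data` stand in for your actual model object and data list):

```r
# Fit with warmup draws saved, so the CSVs contain something inspectable
# even when the chains die early:
fit <- model$sample(data = stan_data, save_warmup = TRUE)

# The partially written CSVs stay on disk; read them back directly:
csvs <- fit$output_files()
raw <- cmdstanr::read_cmdstan_csv(csvs)

# Were any warmup iterations written? Did the inverse metric make it out?
str(raw$warmup_draws)
raw$inv_metric
```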

Okay, good suggestion. This is interesting – with save_warmup = TRUE it now fails immediately. Inspecting the output CSV reveals that the inv_metric never makes it into the output, and warmup_draws is all NAs.

Can you fit other models just fine?

Yeah, my machine works for a ton of other models, which suggests there’s something weird about this one. Not sure what I should do about it, though – an MWE will be really hard to come by here. I wish I had better diagnostic tools, or something.

Print debugging suggests that the failure happens after the model block runs. I sprinkled print statements throughout the model block that print out the log joint, to see if I could track down where it failed. I also set max_treedepth = 1 here to cut down on the number of messages (the outcome is the same regardless of the treedepth I use).
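For reference, the print statements were along these lines (a sketch with abbreviated names, not the actual model):

```stan
model {
  print("======beginning evaluation======");
  // ... priors ...
  print("target priors: ", target());
  // ... characteristics block ...
  print("target characteristic likelihood: ", target());
  // ... remaining likelihood terms ...
  print("target at model block end: ", target());
}
```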

Here’s the log:

Chain 1 Iteration:    1 / 1100 [  0%]  (Warmup) 
Chain 1 ======beginning evaluation====== 
Chain 1 target priors: -104.444 
Chain 1 target characteristcs: -128.444 
Chain 1 target characteristic likelihood: -3362.07 
Chain 1 target poisson likelihood: -3374.07 
Chain 1 target arrival likelihood: -1.57851e+07 
Chain 1 target at model block end: -1.58903e+07 
Chain 1 ======beginning evaluation====== 
Chain 1 target priors: -104.681 
Chain 1 target characteristcs: -128.681 
Chain 1 target characteristic likelihood: -3360.62 
Chain 1 target poisson likelihood: -3372.43 
Chain 1 target arrival likelihood: -6.47385e+06 
Chain 1 target at model block end: -6.57905e+06 
Warning: Chain 1 finished unexpectedly!

Warning messages:
1: In model$sample(data = file.path(input_location, "data.json"), refresh = 1,  :
  'num_chains' is deprecated. Please use 'chains' instead.
2: No chains finished successfully. Unable to retrieve the fit. 

Curiously, it initializes just fine – it just seems to fail when it needs to offload something to disk.

Do you have a generated quantities block? It’s not executed during warmup, so a failure there could explain this.

No, I don’t have any generated quantities yet.

More info! After enabling some flags by placing the following in my make/local file:

CXXFLAGS+= -fsanitize=undefined
CXXFLAGS+= -fsanitize=address

I got these flags from here.
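(Note for anyone replicating this: cmdstan has to be rebuilt after editing make/local for the flags to take effect – with cmdstanr that’s roughly:)

```r
# Clean rebuild of cmdstan so the sanitizer flags are picked up;
# the model binary is then recompiled on the next cmdstan_model() call.
cmdstanr::rebuild_cmdstan(cores = 4)
```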

The sanitizer output suggests a memory issue. Running the model gives the following:

=================================================================
==2792681==ERROR: AddressSanitizer: attempting free on address which was not malloc()-ed: 0x7fa3cec11810 in thread T0
    #0 0x7fa3d04be672 in __interceptor_free /usr/src/debug/gcc/libsanitizer/asan/asan_malloc_linux.cpp:52
    #1 0x5579cccc4623 in stan::services::util::create_unit_e_diag_inv_metric(unsigned long) (/home/cameron/research/option-value-of-news/generative-model/stan/combined-model+0x158e623)
    #2 0x5579cccd54b4 in int stan::services::sample::hmc_nuts_diag_e_adapt<stan::model::model_base, std::shared_ptr<stan::io::var_context>, stan::callbacks::writer, stan::callbacks::unique_stream_writer<std::ostream>, stan::callbacks::unique_stream_writer<std::ostream> >(stan::model::model_base&, unsigned long, std::vector<std::shared_ptr<stan::io::var_context>, std::allocator<std::shared_ptr<stan::io::var_context> > > const&, unsigned int, unsigned int, double, int, int, int, bool, int, double, double, int, double, double, double, double, unsigned int, unsigned int, unsigned int, stan::callbacks::interrupt&, stan::callbacks::logger&, std::vector<stan::callbacks::writer, std::allocator<stan::callbacks::writer> >&, std::vector<stan::callbacks::unique_stream_writer<std::ostream>, std::allocator<stan::callbacks::unique_stream_writer<std::ostream> > >&, std::vector<stan::callbacks::unique_stream_writer<std::ostream>, std::allocator<stan::callbacks::unique_stream_writer<std::ostream> > >&) (/home/cameron/research/option-value-of-news/generative-model/stan/combined-model+0x159f4b4)
    #3 0x5579ccc75e6e in cmdstan::command(int, char const**) (/home/cameron/research/option-value-of-news/generative-model/stan/combined-model+0x153fe6e)
    #4 0x5579cc2ef245 in main (/home/cameron/research/option-value-of-news/generative-model/stan/combined-model+0xbb9245)
    #5 0x7fa3cf63c28f  (/usr/lib/libc.so.6+0x2328f)
    #6 0x7fa3cf63c349 in __libc_start_main (/usr/lib/libc.so.6+0x23349)
    #7 0x5579cc2efa14 in _start ../sysdeps/x86_64/start.S:115

0x7fa3cec11810 is located 16 bytes inside of 170944-byte region [0x7fa3cec11800,0x7fa3cec3b3c0)
allocated by thread T0 here:
    #0 0x7fa3d04bfa89 in __interceptor_malloc /usr/src/debug/gcc/libsanitizer/asan/asan_malloc_linux.cpp:69
    #1 0x5579cc3fd013 in Eigen::internal::aligned_malloc(unsigned long) (/home/cameron/research/option-value-of-news/generative-model/stan/combined-model+0xcc7013)

SUMMARY: AddressSanitizer: bad-free /usr/src/debug/gcc/libsanitizer/asan/asan_malloc_linux.cpp:52 in __interceptor_free
==2792681==ABORTING

My interpretation of this is that I may have a variable which is undefined when the metric is created? Would that be a reasonable interpretation?

@stevebronder @WardBrian

How many parameters are in this model?

Boy, is this weird as hell. I started commenting out vast swathes of the model to see if I could get the alloc error to go away, and lo and behold, I have the most amazing MWE.

Model code, saved to mwe.stan:

data{
  int D;
}

parameters {
  vector<lower=0>[D] epsilon_var;
}

model {
  epsilon_var ~ inv_gamma(2, 3);
}

R code:

model = cmdstanr::cmdstan_model(
  "generative-model/stan/mwe.stan",
  force_recompile = TRUE
)

chain = model$sample(data=list(D=12))

Session info:

> sessionInfo()
R version 4.2.2 (2022-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Manjaro Linux

Matrix products: default
BLAS:   /usr/lib/libblas.so.3.11.0
LAPACK: /usr/lib/liblapack.so.3.11.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] knitr_1.41           magrittr_2.0.3       tidyselect_1.2.0    
 [4] munsell_0.5.0        colorspace_2.0-3     R6_2.5.1            
 [7] rlang_1.0.6          fansi_1.0.3          dplyr_1.0.10        
[10] tools_4.2.2          grid_4.2.2           checkmate_2.1.0     
[13] gtable_0.3.1         xfun_0.36            utf8_1.2.2          
[16] cli_3.6.0            DBI_1.1.3            withr_2.5.0         
[19] cmdstanr_0.5.3       posterior_1.3.1      assertthat_0.2.1    
[22] abind_1.4-5          tibble_3.1.8         lifecycle_1.0.3     
[25] processx_3.8.0       tensorA_0.36.2       farver_2.1.1        
[28] ggplot2_3.4.0        ps_1.7.2             vctrs_0.5.1         
[31] glue_1.6.2           compiler_4.2.2       pillar_1.8.1        
[34] generics_0.1.3       scales_1.2.1         backports_1.4.1     
[37] distributional_0.3.1 jsonlite_1.8.4       pkgconfig_2.0.3   

I’m running on cmdstan version 2.31.0.

Huh, that example runs to completion for me (Ubuntu 22.04 / gcc 11.3.0). It seems like it might be something version- or compiler-specific; sometimes these memory errors/UB issues only show up with newer compiler versions. If you’re using Manjaro, I assume you have gcc 12?

Edit: I just tried with gcc 12.2.0 and it still sampled fine, so there must be something else going on here.

Yep, I’m currently on gcc 12.2.0. Super weird!

Perhaps I should roll back to an earlier cmdstan release, let me give it a shot.

Does it still fail if you initialize far away from 0? What if you put lower=0.0001?

Okay, I figured it out by (a) rolling back to cmdstan 2.29.2 (though this turned out not to be the issue) and (b) commenting out all my code and re-introducing it a few lines at a time.

Essentially, the issue arose from a fairly minor bug where I had a transformed parameter array with dimensions

array[x1,x2,x3] real something;

and a function that calculated the value of something:

array[,,] real something_function(int x1, int x2, int x3);

Unfortunately, I was calling something_function(x1, x2, x4) for some other value x4 < x3. This had the effect of leaving a large block of something uninitialized. At least, this is my sense of the problem – correcting the dimensions in the call to something_function seems to have made the model estimable. My chains are now running!
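The effect is easy to sketch in plain R, with hypothetical dimensions standing in for x1 through x4:

```r
# Hypothetical dimensions; the only requirement is x4 < x3.
x1 <- 2; x2 <- 3; x3 <- 5; x4 <- 3

# `something` plays the role of the transformed parameter array.
something <- array(NA_real_, dim = c(x1, x2, x3))

# The buggy call only ever fills the first x4 slices of the last dimension.
something[, , 1:x4] <- 1

any(is.na(something))  # TRUE: slices (x4 + 1):x3 were never assigned
```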

Thanks for the tips, folks.

FWIW, rolling back to cmdstan 2.29.2 made the MWE I presented above work fine.

Given this explanation, why would the MWE fail, though?
