I’ve seen some odd behaviour with a model I’ve been running. The model includes a system of ODEs that is solved for the steady state of chemical species. The first few samples appear to run correctly, but then the chains are terminated by signal -11, and the same code shows up in the return codes output. Previous posts indicate that this is a segmentation fault, but I can’t see why it would only occur after running a few samples with a few thousand steps within each sample. The code is quite long, so I’ll just link to the Stan code on GitHub: Model Code | Function Block Code
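For readers wondering how -11 maps to a segmentation fault: a minimal sketch, assuming the return codes come from Python's subprocess machinery, where a negative code -N means the child process was killed by signal N. The helper name describe_return_code is hypothetical and is not part of any Stan interface.

```python
import signal


def describe_return_code(code: int) -> str:
    """Translate a subprocess-style return code into a readable description.

    Hypothetical helper for illustration only.
    """
    if code < 0:
        # On POSIX, a negative return code -N means "killed by signal N".
        try:
            name = signal.Signals(-code).name
        except ValueError:
            name = "unknown signal"
        return f"killed by signal {-code} ({name})"
    return f"exited with status {code}"


# On Linux, signal 11 is SIGSEGV, i.e. a segmentation fault.
print(describe_return_code(-11))
```

This only decodes the number. It says nothing about where in the model or the solver the invalid memory access happened.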
If the model only has to run for a relatively short time before it crashes with a segmentation fault, you could run it under valgrind memcheck, which may be able to diagnose the location of the fault.
For example, on a Linux distribution such as Debian you can install valgrind with apt install valgrind, and then run your program under valgrind memcheck by following The Valgrind Quick Start Guide.
A good place to start might be to find a reproducible way of triggering the crash that only involves building and running a Stan model binary with CmdStan, without any Python layer. For example: a fixed input data file, a fixed Stan model, and a fixed sampler command that always or often triggers the crash. The crash could be nondeterministic and depend on the whims of the memory allocator even if the inputs don’t change. If it is caused by an out-of-bounds array access, it could also be data dependent, if some array dimensions or array indices are defined by the data.
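As a rough sketch of such a fixed command: the snippet below only assembles and prints a command line, so the printed result can be pasted into a shell and run with no Python involved. The executable name ./my_model, the data file my_data.json, and the seed value are placeholders, and build_repro_command is a hypothetical helper. Fixing the random seed is an extra assumption that helps keep runs comparable.

```python
import shlex


def build_repro_command(exe, data_file, seed=12345, use_valgrind=False):
    """Assemble a fixed CmdStan sampling command (hypothetical helper).

    exe and data_file are placeholder paths; seed is an arbitrary fixed value.
    """
    # CmdStan model binaries accept: sample data file=... random seed=...
    cmd = [exe, "sample", "data", f"file={data_file}", "random", f"seed={seed}"]
    if use_valgrind:
        # Prefix with valgrind's memcheck tool; --track-origins=yes asks it to
        # report where uninitialised values came from (slower, more detail).
        cmd = ["valgrind", "--tool=memcheck", "--track-origins=yes"] + cmd
    return cmd


cmd = build_repro_command("./my_model", "my_data.json", use_valgrind=True)
# Print a shell-ready line that can be run directly, without Python.
print(shlex.join(cmd))
```

Running the same printed command several times shows whether the crash is deterministic for fixed inputs or varies from run to run.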
I attempted to use valgrind by setting my optimisation to -O:g in the cmdstan/make/local file. I then ran it using just the executable pointed at the data file. It failed almost immediately; however, the pointers in the output didn’t make much sense to me. I’ll try running it again tomorrow and attach a screenshot. After switching to the bdf solver, though, the error no longer appears to occur. Furthermore, running the program in fixed_param=True mode with a known input/output for the adjoint solver gave the correct result.
Cool, at least it seems easy to reproduce the crash.
As well as sharing a screenshot or copy of the valgrind output, please also share a copy of the entire C++ file that Stan has generated for your model. Some of the developers may be able to correlate the issues that valgrind is reporting (especially if there are source file names and line numbers in the valgrind output) with parts of the model code or the depths of Stan’s library code.