Segfault with specific subset of data

I have data that was collected over multiple years and I want to get the model fit that I would have been able to get each year, so I filter it like this:

df[df.year < 2018]
df[df.year < 2019]
df[df.year < 2020]
df[df.year < 2021]
df[df.year < 2022]
df[df.year < 2023]

Each of those dataframes is then converted to a dictionary and passed to cmdstanpy for sampling.

The model works correctly for most of the subsets, but for year < 2020 I get a segfault (retcode=-11) on the first iteration. I’m not sure where to start debugging this. Any thoughts, pointers, or links?

Are you able to share your Stan code?

I can’t share the full model, and I’m not sure how to create a minimal example that segfaults, given that both supersets and subsets of the failing data work as expected. (That probably means something unexpected is happening and I’m missing it, right?)

Yeah, most likely something is off with indexing or similar, especially if you’re using an older version of Stan, where that kind of mistake could lead to a segfault. In the latest versions it should produce a more direct error message instead.
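One way to act on that suggestion before sampling is to check every index array in the data dictionary against the size it is supposed to index into. The sketch below is only an illustration: the key names (`n_group`, `n_site`, `group_idx`, `site_idx`) and both helper functions are hypothetical, and it assumes the indexes are 1-based, as they are in Stan.

```python
def find_out_of_range(indices, size, base=1):
    """Return (position, value) pairs for indices outside base..base+size-1."""
    lo, hi = base, base + size - 1
    return [(pos, idx) for pos, idx in enumerate(indices) if not lo <= idx <= hi]


def check_stan_indices(data, index_to_size):
    """Map each index-array name to its out-of-range entries (empty dict if clean)."""
    problems = {}
    for index_name, size_name in index_to_size.items():
        bad = find_out_of_range(data[index_name], data[size_name])
        if bad:
            problems[index_name] = bad
    return problems


# Hypothetical data dictionary; the key names and values are made up.
data = {
    "n_group": 3,
    "n_site": 2,
    "group_idx": [1, 2, 3, 3],
    "site_idx": [1, 2, 3, 1],  # 3 exceeds n_site = 2
}

problems = check_stan_indices(data, {"group_idx": "n_group", "site_idx": "n_site"})
print(problems)  # {'site_idx': [(2, 3)]}
```

Running a check like this on each yearly subset would show whether only the failing one has an index that points outside its vector. It cannot catch a vector that is declared with the wrong size inside the model itself.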

2020 is the only year in that data set which was a leap year. Does your data have per-day resolution?

It turned out to be an out-of-bounds indexing problem: one set of indexes was assigning values to parts of a vector it shouldn’t have been touching. This only showed up in the special case where the size of the first factor was bigger than the size of the second (the values in the vector had different likelihoods applied to them).
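For anyone trying to picture this kind of failure, here is a toy reconstruction in Python. It is not the actual model: the function names and the numbers are invented, and it assumes the simplest version of the bug, where a vector sized by the second factor is written to using the first factor's indexes. Python raises an IndexError where an unchecked write could segfault.

```python
def apply_effects(n_a, n_b, a_idx):
    """Buggy on purpose: the vector is sized by n_b but indexed by factor A."""
    effects = [0.0] * n_b        # should have been sized by n_a
    for idx in a_idx:            # 1-based indexes, as in Stan
        effects[idx - 1] += 1.0  # out of range once idx > n_b
    return effects


def first_failing_case(cases):
    """Return the label of the first case with an out-of-range write, else None."""
    for label, (n_a, n_b, a_idx) in cases.items():
        try:
            apply_effects(n_a, n_b, a_idx)
        except IndexError:
            return label
    return None


# Made-up subsets: (size of factor A, size of factor B, indexes into factor A).
cases = {
    "smaller subset": (2, 3, [1, 2, 2]),
    "failing subset": (4, 3, [1, 4, 2]),  # factor A is bigger than factor B
    "larger subset": (4, 5, [1, 4, 2]),
}

print(first_failing_case(cases))  # failing subset
```

The smaller and larger subsets run without complaint even though the vector has the wrong size in all three cases, which is why supersets and subsets of the failing data can look fine.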

My problem was really specific to me, but my advice to anyone looking for answers here is not to assume that your model is correct just because it compiles, runs, and passes basic diagnostic tests. My segfault just helped me locate the error.