Warning: Chain 1 finished unexpectedly

Hello,
I’m seeing some weird behavior when running the exact same code on mac vs. linux.
Mac: Latest M3 chip. Stan version 2.34.1
Linux: Recent Intel chip, Fedora 39. Stan version: 2.34.1

On the mac, I get some early warnings about rejected proposals, but then things settle down, and it samples fine. No divergences, Rhat around 1.01, etc. Looks fine

On the Linux box, I get the same warnings, but then some chains halt right away, some chains sample for a while and then halt. The error message is always: Warning: Chain 1 finished unexpectedly!

My guess is that maybe some internal library in Linux is more sensitive about some floating point issue, but I have no way to resolve this. Any ideas would be greatly appreciated.

For reference, below is the stan file I used:

data{
    int<lower=1>  N;
    vector<lower=0>[N] y; 
    vector<lower=0>[N] size;
}
parameters{
    real<lower=0> a0;
    real<lower=0> sg;

}
transformed parameters{

    eta = exp(a0); // to ensure positive value

    for (i in 1:N){

        mu[i] = size[i] / eta;

        alpha[i] = square(mu[i] / sg);
        beta[i]  = mu[i]  / square(sg);

    }
}
model{
    y ~ gamma(alpha, beta);
    
    // priors
    a0 ~ normal(16,1);
    sg ~ normal(0, 1);

}

Can you provide some (possibly simulated) data you can recreate the issue with?

Can you also share your compiler versions and any extra flags you may be using?

I turned on debugging -g flag when compiling. I’m just a beginner with gdb, it looks like the crash is always in the same function:

#0  0x000055b2f113d92d in stan::math::operator/(stan::math::var_value<double, void> const&, stan::math::var_value<double, void> const&)::{lambda(auto:1&&)#1}::operator()<stan::math::internal::callback_vari<double, {lambda(auto:1&&)#1}>&>(stan::math::internal::callback_vari<double, {lambda(auto:1&&)#1}>&) const (vi=...,
    __closure=0x7fa6668a88f0) at stan/lib/stan_math/stan/math/rev/core/operator_division.hpp:64

I then tried to find the arguments sent to that function:

(gdb) info args
vi = warning: RTTI symbol not found for class 'stan::math::internal::callback_vari<double, stan::math::operator/(stan::math::var_value<double, void> const&, stan::math::var_value<double, void> const&)::{lambda(auto:1&&)#1}>'
@0x7fa6668a88d8: {<stan::math::vari_value<double, void>> = {<stan::math::vari_base> = {
      _vptr.vari_base = 0x55b2f131f108 <vtable for stan::math::internal::callback_vari<double, stan::math::operator/(stan::math::var_value<double, void> const&, stan::math::var_value<double, void> const&)::{lambda(auto:1&&)#1}>+16>}, val_ = 118.04911367747745, adj_ = 0.11970678026202108}, rev_functor_ = {__dividend = {
      vi_ = 0x7fa6668a8848}, __divisor = {vi_ = 0x7fa6668a88b8}}}
__closure = 0x7fa6668a88f0

Trying to find out more, I was able to see the values of divisor and dividend:

(gdb) info locals
divisor = {vi_ = 0x7fa6668a88b8}
dividend = {vi_ = 0x7fa6668a8848}

(gdb) print 0x7fa6668a88b8
$2 = 140352661653688

(gdb) print 0x7fa6668a8848
$4 = 140352661653576

Those are both huge numbers, but dividing them should be reasonable on a modern i7 chip.

Do any of these clues help?

Not sure if that helps. But, it looks like some kind of numerical issue with division.

Digging a bit deeper into gdb (I’m still beginner in there), I can see that the segmentation fault always happens around

at stan/lib/stan_math/stan/math/rev/core/operator_division.hpp:64
dividend.adj() += vi.adj() / divisor.val();

If I am using gdb correctly, the values are:

val_ = 68.707819887118788,
adj_ = -1.4094198127133239

Those seem reasonable, and shouldn’t be generating any errors. The bigger question is why? I’m just using cmdstan with a fairly simple model file. Stan runs for a while, gets through a few thousands iterations, then crashes with this error.

Any ideas?

Digging even more, those values I shared were incorrect. I managed to get actual values from GDB when the program crashed.

The numerator is a reasonable 22824.305312917772
The denominator is a massive 4.6355706526156105e-310

This is clearly causing the numerical overflow that leads to a segmentation fault.

The bigger question is: How to resolve this?