Segmentation Faults

Hi,

I'm running a fairly simple model, with one random effect, and it always crashes with a segmentation fault. I've tried different parameterizations, different priors, etc. It always ends in a segmentation fault after running for a few hours.

I’m happy to also share some example data, but I don’t know how to attach it on Discourse.

I’ve tried many variants of this model, and can always produce a similar crash, always from somewhere in the stan_math library (sometimes “/”, sometimes “+”).

Interesting note: I asked cmdstan to save_warmup. The actual values written to output.csv look reasonable, and are in the range expected.

Second interesting note: running the same model on a MacBook (M3 chip) does not produce a crash, but runs VERY slowly.

This is the latest model file I ran:

data{
    int<lower=1>  N;
    int<lower=1>  N_group;
    array[N] int  group;
    vector<lower=0>[N] y;    
}
parameters{
    real<lower=0> a0;
    real<lower=0> sg;
    vector[N_group] group_eta;
    real<lower=0>  group_scale;
}
transformed parameters{
    vector[N_group] a_group; 
    a_group = group_scale * group_eta;
} 
model{ 
    
    for (i in 1:N){
        real mu = a0 + a_group[group[i]];
        y[i] ~ normal(mu, sg); 

    } 

    // priors
    a0          ~ normal(0, 0.1);
    sg          ~ normal(0, 0.1);
    group_eta   ~ normal(0, 1);
    group_scale ~ normal(0, 1);
}

I compiled the cmdstan code with debugging on, so we can see the error. The resulting crash in gdb is:

Informational Message: The current Metropolis proposal is about to be rejected because of the following issue:
Exception: normal_lpdf: Location parameter is inf, but must be finite! (in 'model_8.stan', line 21, column 8 to column 34)
If this warning occurs sporadically, such as for highly constrained variable types like covariance matrices, then the sampler is fine,
but if this warning occurs often then your model may be either severely ill-conditioned or misspecified.

Iteration:  100 / 2000 [  5%]  (Warmup)

Program received signal SIGSEGV, Segmentation fault.
0x0000555555587ee4 in stan::math::operator+(stan::math::var_value<double, void> const&, stan::math::var_value<double, void> const&)::{lambda(auto:1 const&)#1}::operator()<stan::math::internal::callback_vari<double, {lambda(auto:1 const&)#1}> >(stan::math::internal::callback_vari<double, {lambda(auto:1 const&)#1}> const&) (vi=warning: RTTI symbol not found for class 'stan::math::internal::callback_vari<double, stan::math::operator+(stan::math::var_value<double, void> const&, stan::math::var_value<double, void> const&)::{lambda(auto:1 const&)#1}>'
...,
    __closure=0x7ffff755b858) at stan/lib/stan_math/stan/math/rev/core/operator_addition.hpp:56
56                                    avi->adj_ += vi.adj_;

I then asked gdb for the arguments passed to that function.

(gdb) info args
vi = warning: RTTI symbol not found for class 'stan::math::internal::callback_vari<double, stan::math::operator+(stan::math::var_value<double, void> const&, stan::math::var_value<double, void> const&)::{lambda(auto:1 const&)#1}>'
@0x7ffff755b840: {<stan::math::vari_value<double, void>> = {<stan::math::vari_base> = {
      _vptr.vari_base = 0x555555756080 <vtable for stan::math::internal::callback_vari<double, stan::math::operator+(stan::math::var_value<double, void> const&, stan::math::var_value<double, void> const&)::{lambda(auto:1 const&)#1}>+16>}, val_ = 0.059424749080934487, adj_ = 52.870598166439621}, rev_functor_ = {
    __avi = 0x800555555770b70, __bvi = 0x555555771a88}}
__closure = 0x7ffff755b858

Environment:

  • cmdstan 2.34.1
  • New install of Debian 12: Linux bsc 6.1.0-18-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.76-1 (2024-02-01) x86_64 GNU/Linux
  • g++ version: g++ (Debian 12.2.0-14) 12.2.0

Hardware

System:
  Host: bsc Kernel: 6.1.0-18-amd64 arch: x86_64 bits: 64 Console: pty pts/3 Distro: Debian
    GNU/Linux 12 (bookworm)
Machine:
  Type: Desktop System: ASUS product: N/A v: N/A serial: N/A
  Mobo: ASUSTeK model: PRIME B560M-A v: Rev 1.xx serial: 210585046001202
    UEFI: American Megatrends v: 0820 date: 04/27/2021
Memory:
  RAM: total: 125.57 GiB used: 1.7 GiB (1.4%)
  Array-1: capacity: 128 GiB note: est. slots: 4 EC: None
  Device-1: Controller0-ChannelA-DIMM0 type: DDR4 size: 32 GiB speed: 3200 MT/s
  Device-2: Controller0-ChannelA-DIMM1 type: DDR4 size: 32 GiB speed: 3200 MT/s
  Device-3: Controller0-ChannelB-DIMM0 type: DDR4 size: 32 GiB speed: 3200 MT/s
  Device-4: Controller0-ChannelB-DIMM1 type: DDR4 size: 32 GiB speed: 3200 MT/s
CPU:
  Info: 8-core model: 11th Gen Intel Core i7-11700 bits: 64 type: MT MCP cache: L2: 4 MiB
  Speed (MHz): avg: 800 min/max: 800/4800:4900 cores: 1: 800 2: 800 3: 800 4: 800 5: 800 6: 800
    7: 800 8: 800 9: 800 10: 800 11: 800 12: 800 13: 800 14: 800 15: 800 16: 800

Replace

array[N] int group;

with

array[N] int<lower=1, upper=N_group> group;

Compile and run Stan again. Is Stan still producing a crash?

  • With VERY tight priors → it did not crash
  • With generic priors normal(0, 1) → it crashed after 1600 steps

Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1"
Core was generated by `./model_8 sample num_warmup=1000 num_samples=1000 save_warmup=1 data file=data.'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x0000563b59c84264 in stan::math::internal::callback_vari<double, stan::math::operator+(stan::math::var_value<double, void> const&, stan::math::var_value<double, void> const&)::{lambda(auto:1 const&)#1}>::chain() ()
  1. vectorize, vectorize, vectorize!!!
  2. if a0 is the intercept term, why must it be positive?
  3. categorical predictors can be difficult to pin down; a sum-to-zero constraint helps a lot.

it looks like somehow this is blowing out memory - you’ve got a loop inside which you declare a variable over and over.

I think that perhaps this is the model you want - you’ve got a single intercept and 1 group-level predictor, correct?

data {
  int<lower=1> N;
  int<lower=2> N_group;
  array[N] int<lower=1, upper=N_group> group;
  vector[N] y;
}
parameters {
  real alpha;
  real<lower=0> sigma;
  real<lower=0> sigma_group;
  vector<multiplier=sigma_group>[N_group - 1] beta_group_raw;
}
transformed parameters {
  // sum to zero 
  vector[N_group] beta_group = append_row(beta_group_raw, -sum(beta_group_raw));
}
model {
  y ~ normal(alpha + beta_group[group], sigma);
  // priors on alpha, sigma, sigma_group,  and beta_group.
}

See the Stan User’s Guide section on Efficiency Tuning.

Also see the Prior Choice Recommendations wiki in the stan-dev/stan GitHub repository.

I haven’t tried to compile this - there may well be typos or worse.

This is a bug. We shouldn’t be segfaulting no matter what happens.

I’m not sure if there is something specific that prompted this response from @mitzimorris, but the fact that this happens later in sampling points to a memory issue. What are the sizes you’re working with?

This is nice (but it’s missing the y declaration in the data block). It uses the multiplier transform to carry out the non-centered parameterization. If it’s not clear, the prior should go on beta_group—it doesn’t need a Jacobian because the negative sum is a linear transform.

It will also help with memory in that the autodiff trees for the y sampling statement will be smaller.

Adding bounds to a Stan data variable only causes the ranges to be validated at the end of the block. It doesn’t change any runtime behavior once the model starts sampling.


agreed - definitely a bug somewhere in Stan.

also added declaration for y to data block.


Thanks @mitzimorris and @Bob_Carpenter

The data has around 140,000 rows, and there are around 70 unique groups. (In a future version of the model, I plan to add a few more random-effect intercepts.)

I’m trying @mitzimorris’s vectorized version now. I’d never seen this parameterization before (especially the <multiplier=...> syntax). Where can I learn more about that? And why [N_group - 1] instead of N_group?

I think Bob is correct about some memory bug. A few notes:

  • The model only crashes after 700+ iterations
  • The model runs, without any problems, on a newer Macbook
  • The same crash behaviour happened with fresh installs of both Fedora and Debian

I really, really appreciate all the help. Please let me know if I can provide any debug details from core dumps, or any other information.

Thank You!!!

This is the affine transform that gives you the non-centered parameterization; see the Data Types and Declarations section of the Stan Reference Manual.

This is explained in the Stan User’s Guide chapter on Regression Models.
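
For anyone else new to `multiplier`, here is a minimal sketch of the equivalence (illustrative only, untested; `K` stands in for the number of groups, which would be declared in the data block). Writing

```stan
parameters {
  real<lower=0> sigma_group;
  // the sampler works on an unscaled version of beta_group;
  // Stan handles the rescaling and the Jacobian internally
  vector<multiplier=sigma_group>[K] beta_group;
}
model {
  beta_group ~ normal(0, sigma_group);
}
```

behaves like the manual non-centered parameterization:

```stan
parameters {
  real<lower=0> sigma_group;
  vector[K] beta_group_raw;  // unit-scale parameter
}
transformed parameters {
  vector[K] beta_group = sigma_group * beta_group_raw;
}
model {
  beta_group_raw ~ std_normal();
}
```

In both cases the sampler explores a unit-scale space, which avoids the funnel geometry of the centered version.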

However, if you have a large number of groups, a soft sum-to-zero constraint will be easier to fit -

sum(beta_group) ~ normal(0, 0.001 * N_group); // equivalent to mean(beta_group) ~ normal(0, 0.001)
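
The scaling in that comment is just the behavior of a normal variate under division by a constant; writing $K$ for the length of the vector,

$$
\sum_{k=1}^{K} \beta_k \sim \mathcal{N}(0,\; 0.001\,K)
\quad\Longrightarrow\quad
\frac{1}{K}\sum_{k=1}^{K} \beta_k \sim \mathcal{N}(0,\; 0.001),
$$

since dividing a normal variate by $K$ divides its standard deviation by $K$.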

With this, and going with the convention of using single capital letters for the sizes of arrays, we have:

data {
  int<lower=1> N;   // num observations
  int<lower=2> K;   // num groups
  array[N] int<lower=1, upper=K> group;
  vector[N] y;
}
parameters {
  real alpha;
  real<lower=0> sigma;
  real<lower=0> sigma_group;
  vector<multiplier=sigma_group>[K] beta_group;
}
model {
  y ~ normal(alpha + beta_group[group], sigma);
  sum(beta_group) ~ normal(0, 0.001 * K);  // mean of beta_group is normal(0, 0.001)
  // priors alpha, sigma, sigma_group
}

Thanks, that makes sense.

I’m still getting random segmentation faults.

It was suggested that I use clang instead of gcc to build cmdstan. That makes the segfaults happen less often, but they still happen.

Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `./model_4 sample num_warmup=1000 num_samples=1000 data file=data.json'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00005641058f5b6d in void stan::math::gradient<stan::model::model_functional<stan::model::model_base> >(stan::model::model_functional<stan::model::model_base> const&, Eigen::Matrix<double, -1, 1, 0, -1, 1> const&, double&, Eigen::Matrix<double, -1, 1, 0, -1, 1>&) ()

Any ideas on how I can set up Linux + Stan to be stable? I don’t care which distribution, version, etc.

Can you share some data to allow us to run your model on our own machines?

Hello,

It turns out that all this mess was caused by bad hardware. At the suggestion of a friend, I ran a low-level memory hardware test (booting memtest from USB). Two of my four DDR4 DIMMs reported a ton of errors.

This explains why the crashes were a bit random (no pun intended).

I’ve upgraded the hardware, and all the models run fine.

Apologies for wasting the group’s time on a hardware issue.

Thank You,


Well, you learn something new every day - maybe memtest will need to become part of our debugging checklist :)
