Segmentation fault with cmdstan

Hi everyone,

I am running several tasks on a Linux high-performance cluster. For each task I created its own folder, so I didn’t expect any interference between the tasks. However, I am getting a segmentation fault in all the runs.

Because the models use MPI and I had seen some error messages before, I run the commands
make clean-all and
make build
in each batch script.

Here is the structure of each of the 4 batch scripts. In each script I only adjust the folder name (lognormal1, lognormal2, lognormal3, lognormal4). The model and dataset names are the same; I just made simple adjustments to the model files and the data. It’s basically the same model running on different samples.

Any ideas what’s going on?

module load gnu/7.4.0

cd $HOME/cmdstan-2.25.0/

make clean-all
make build
make $HOME/cmdstan-2.25.0/examples/lognormal1/lognormal


cd $HOME/cmdstan-2.25.0/examples/lognormal1/
for i in {1..2}
do
./lognormal sample algorithm=hmc engine=nuts max_depth=10 num_samples=2000 data file=sample_50users_data.R output file=output_${i}.csv &
done

wait

cd $HOME/cmdstan-2.25.0/

./bin/stansummary ./examples/lognormal1/output_*.csv
./bin/diagnose ./examples/lognormal1/output_*.csv

And here is the error message I get in the output file:

/var/log/slurm/spool_slurmd//job14613154/slurm_script: line 28: 18927 Segmentation fault      (core dumped) ./lognormal sample algorithm=hmc engine=nuts max_depth=10 num_samples=2000 data file=sample_50users_data.R output file=output_${i}.csv
/var/log/slurm/spool_slurmd//job14613154/slurm_script: line 28: 18928 Segmentation fault      (core dumped) ./lognormal sample algorithm=hmc engine=nuts max_depth=10 num_samples=2000 data file=sample_50users_data.R output file=output_${i}.csv
Input files: ./examples/lognormal1/output_1.csv, ./examples/lognormal1/output_2.csv
Warning: non-fatal error reading samples
Warning: non-fatal error reading samples

Can you narrow down an input file/model that causes the segfault on your own computer? Is it happening randomly, or deterministically with certain inputs?


Sure, I can further specify the issue:

I am running a hurdle model with a hidden Markov process. I think the lognormal in the hurdle is what’s causing the segmentation fault. Below are the model parts that seem relevant to this question.

data{
  int<lower = 1> N; // number of observations
  int<lower = 0> S; // number of states
  int<lower = 1> H; // number of individuals
  int<lower = 0> K1; // number of covariates inside delta in Bernoulli part
  int<lower = 0> K2; // number of covariates inside delta in Lognormal part
  matrix[N, K1] C1; // matrix of covariates for Bernoulli part
  matrix[N, K2] C2; // matrix of covariates for Lognormal part
  int<lower = 0, upper = 1> y[N]; // binary decision
  real q[N]; // Hurdle: Dependent variable we want to model conditional on y
  int<lower = 1> id[N]; // identifier of individuals
}

parameters {
  ordered[S] mu; // state-dependent intercepts in Bernoulli part
  vector[S] nu; // state-dependent intercepts in Lognormal part
  real alphaj[H]; // individual-specific intercept in Bernoulli part
  real alphai[H]; // individual-specific intercept in Lognormal part
  real<lower = 0> sigma_alphai;
  real<lower = 0> sigma_alphaj;
  real<lower = 0> sigma_q;
  vector[K1] delta1;
  vector[K2] delta2;
}
model {
...
  for (t in 2:N) {
      target += log_sum_exp(gamma);
      for (k in 1:S) {
        gamma_prev[k] = bernoulli_logit_lpmf(y[t] | alphaj[id[t]] + mu[k] + C1[t]*delta1);
        if (y[t] == 1) {
          gamma_prev[k] += lognormal_lpdf(q[t] | alphai[id[t]] + nu[k] + C2[t]*delta2, sigma_q);
...
  • C1 and C2 are design matrices of categorical predictors converted into dummy variables and then stored in a design matrix, e.g.
    C1 <- structure(c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, ...), .Dim = c(7192, 17))

  • The variable q we are modeling in the lognormal part has a range between 0 and 1, as I converted real q to q = real_q/max(real_q) before giving the data to Stan, e.g.

q <- c(0.372852812394746, 0.175479959582351, 0.143482654092287, 0.121926574604244, 0.122600202088245, 0.112495789828225, 0.16503873358033, 0.0919501515661839,
0.0963287302121927, 0.100370495116201, 0.11485348602223, 0.107106769956214, 0.106433142472213, 0.149208487706298, 0.108790838666218, 0.102728191310205,
0.0929605927921859, 0.0929605927921859, 0.105422701246211, 0.105422701246211, 0.047490737622095, 0.101717750084203, 0.120916133378242, 0.106096328730212, 0,
0.1000336813742, 0.113169417312226, 0.112495789828225, 0.117884809700236, 0.0983496126641967, 0.0848770629841697, 0.0771303469181543, 0.11485348602223,
0.0788144156281576, 0.0747726507241495, 0.0191983832940384, 0.113169417312226, 0.0936342202761873, 0.0717413270461435, 0.082519366790165, 0.107443583698215,
0.0771303469181543, 0.0811721118221623, 0.0892556416301785, 0.0269450993600539, 0.0855506904681711, 0.11519029976423, 0.0932974065341866, 0.0848770629841697,
0.10474907376221, 0.0781407881441563, 0, 0.017514314584035, 0.0757830919501516, 0.125631525766251, 0.0862243179521724, 0.11013809363422, 0.0939710340181879, 0, 0,
0, 0, 0, 0.0181879420680364, 0, 0, 0, 0, 0, 0.0101044122600202, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)

  • mu and nu are state-dependent intercepts (one for each state in each equation)
  • alphaj and alphai are individual-specific intercepts
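Since the lognormal density is only defined for strictly positive values, and q contains exact zeros (which the hurdle should route to the Bernoulli branch via y == 0), a defensive check can make a data problem fail loudly instead of silently. A hedged sketch of a guard that could be added to the model — the transformed data block and the reject() call are my addition, not part of the original model:

```stan
transformed data {
  // Hypothetical guard: the lognormal branch is only evaluated when
  // y[n] == 1, so any non-positive q paired with y == 1 is a data error.
  // reject() aborts with a readable message instead of producing -inf
  // log densities deep inside sampling.
  for (n in 1:N) {
    if (y[n] == 1 && q[n] <= 0) {
      reject("q must be > 0 when y == 1; found q[", n, "] = ", q[n]);
    }
  }
}
```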

When I run the model on a Linux machine with CmdStan and a lognormal in the hurdle, it throws the following error message six times per chain at the start of sampling, and only during warmup. The segmentation fault occurs at ~42% of the estimation.

Informational Message: The current Metropolis proposal is about to be rejected because of the following issue:

Exception: lognormal_lpdf: Scale parameter is 0, but must be > 0! (in '/home/user1/cmdstan-2.25.0/examples/noncp50/lognormal_2state.stan', line 80, column 4 to line 81, column 79)

If this warning occurs sporadically, such as for highly constrained variable types like covariance matrices, then the sampler is fine,
but if this warning occurs often then your model may be either severely ill-conditioned or misspecified.

When I run the same model with a normal instead of a lognormal in the hurdle, it throws this message only once within a chain. Most importantly, it does not stop the estimation with a segmentation fault.

Informational Message: The current Metropolis proposal is about to be rejected because of the following issue:
Exception: normal_lpdf: Scale parameter is 0, but must be > 0! (in '/home/user1/cmdstan-2.25.0/examples/lognormal2502/lognormal_2state.stan', line 50, column 2 to column 35)
If this warning occurs sporadically, such as for highly constrained variable types like covariance matrices, then the sampler is fine,
but if this warning occurs often then your model may be either severely ill-conditioned or misspecified.

The segmentation fault does not occur if I sample data of 50 individuals, but it does when I run the model with data of 250 or 500 individuals.

Any ideas what’s happening here?

There is a known indexing bug that could cause a segfault with complicated indexing: Indexing out-of-bounds not always handled properly · Issue #776 · stan-dev/stanc3 · GitHub (it should give an error about out-of-bounds accesses, but instead it segfaults). So it could definitely be popping up here.

If you want to post data and the full model I’ll have a look. Otherwise I think the way to debug this is figure out a set of conditions that produces the segfault (with seed and stuff), and then comment out sections of the model until the segfault goes away.

You can try adding extra checks to make sure things are in bound. Like:

int<lower = 1, upper = H> id[N];
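Applied to the data block posted above, the tightened declarations might look like this (a sketch; the added bounds are assumptions based on how each variable is used in the model):

```stan
data {
  int<lower = 1> N;
  int<lower = 0> S;
  int<lower = 1> H;
  // id indexes alphaj/alphai, so any value outside 1..H is an
  // out-of-bounds access; with the bound, CmdStan rejects such data
  // at load time instead of segfaulting during sampling
  int<lower = 1, upper = H> id[N];
  int<lower = 0, upper = 1> y[N];
  // q must be non-negative; the lognormal branch should only ever
  // see the strictly positive entries (those with y == 1)
  real<lower = 0> q[N];
}
```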

Thanks, Ben. Unfortunately I cannot post the full model and data here for confidentiality reasons.

One thing that would help me with debugging is to know whether an indexing issue is plausible if the same model works on the same data with a normal distribution in the hurdle, but not with a lognormal. One piece of information I haven’t mentioned so far: I noticed a few -Inf values in the generated CSV output files. Together with the "Error evaluating the log probability at the initial value" message, I thought the root of the error might lie in the values the lognormal is sampling.

If all you’re doing is changing a normal into a lognormal, it seems unlikely to me that there’s an error (though there could be one).

Whittle this down into a test case that breaks deterministically, and comment stuff out until you find the offending section of code. It’s possible there’s a problem with lognormal vs. normal, but I suspect something else (and this will let us find it either way).
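For the commenting-out step, one option is to neutralize the suspect term while keeping the surrounding control flow intact. A debugging sketch based on the loop posted above (not model code to keep):

```stan
for (k in 1:S) {
  gamma_prev[k] = bernoulli_logit_lpmf(y[t] | alphaj[id[t]] + mu[k] + C1[t]*delta1);
  if (y[t] == 1) {
    // suspect line commented out for bisection:
    // gamma_prev[k] += lognormal_lpdf(q[t] | alphai[id[t]] + nu[k] + C2[t]*delta2, sigma_q);
    gamma_prev[k] += 0;  // placeholder keeps the branch structure unchanged
  }
}
```

If the segfault persists with the lognormal term removed, the culprit is likely elsewhere (e.g. indexing); if it disappears, reintroduce pieces of the removed expression one at a time to narrow it down.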
