STAN on multiple cores occasionally crashing Linux without overwhelming memory

rok_cesnovar · September 3, 2020, 11:56am

One thing that might be worth trying is running the model directly in CmdStan (on the command line) or using CmdStanPy, as @ahartikainen mentioned. That would help narrow down where the problem lies (R or CmdStan).

mathesong · September 3, 2020, 12:41pm

Thanks so much for the help. I’m pretty rusty on my Python, but managed to get it running (and did it in Jupyter before I saw your edit with the code). I got everything working with the Stan file and the JSON that I prepared for Rok yesterday. It ran for about 3 or 4 minutes before my system froze up again, without any changes to the progress bars. I’ve attached a screenshot of how the notebook looked.

Lemme see if I can try it in cmdstan directly too…

mathesong · September 3, 2020, 1:49pm

Same outcome with cmdstan.

On the command line, I’m compiling, and then sampling using the following:

for i in {1..6}; do
    # Start chain ${i} in the background, writing its draws to output_${i}.csv
    ./cmdstanr_test sample data file=cmdstanr_test_data- \
        output file=output_${i}.csv &
done

I get the same bunch of warnings as I get in cmdstanr and cmdstanpy (like the following, but maybe 50 of them). I’m still trying to figure out which part of the model is ill-specified, but that’s another story for another thread.

Informational Message: The current Metropolis proposal is about to be rejected because of the following issue:
Exception: normal_lpdf: Location parameter[1] is -nan, but must be finite! (in '/home/granville/Repositories/Multilevel_TCM/R/Sim_data_model/Rok_Testing/cmdstanr_test.stan', line 197, column 4 to column 41)
If this warning occurs sporadically, such as for highly constrained variable types like covariance matrices, then the sampler is fine,
but if this warning occurs often then your model may be either severely ill-conditioned or misspecified.

Informational Message: The current Metropolis proposal is about to be rejected because of the following issue:
Exception: normal_lpdf: Location parameter[1] is -nan, but must be finite! (in '/home/granville/Repositories/Multilevel_TCM/R/Sim_data_model/Rok_Testing/cmdstanr_test.stan', line 197, column 4 to column 41)
If this warning occurs sporadically, such as for highly constrained variable types like covariance matrices, then the sampler is fine,
but if this warning occurs often then your model may be either severely ill-conditioned or misspecified.

Informational Message: The current Metropolis proposal is about to be rejected because of the following issue:
Exception: normal_lpdf: Location parameter[1] is -nan, but must be finite! (in '/home/granville/Repositories/Multilevel_TCM/R/Sim_data_model/Rok_Testing/cmdstanr_test.stan', line 197, column 4 to column 41)
If this warning occurs sporadically, such as for highly constrained variable types like covariance matrices, then the sampler is fine,
but if this warning occurs often then your model may be either severely ill-conditioned or misspecified.

Then it starts warmup, and then the system hangs, this time after just under 7 minutes.
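
To put a number on "maybe 50 of them", each chain's console output could be saved to a log and the rejection messages counted. The following is only a sketch: the function name `count_rejections` and the log file name are hypothetical, and it assumes the message text matches the one quoted above.

```shell
# Count how many times the Metropolis rejection message appears in a saved log.
# Usage: count_rejections chain_1.log   (chain_1.log is a hypothetical file name)
count_rejections() {
    # grep -c prints the number of matching lines. It exits non-zero when the
    # count is 0, so "|| true" keeps a clean log from looking like a failure.
    grep -c 'The current Metropolis proposal is about to be rejected' "$1" || true
}
```

One way to produce such a log is to append `> chain_${i}.log 2>&1` to each sampling command. Redirecting both streams avoids having to assume which one the messages are written to.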

ahartikainen · September 3, 2020, 2:11pm

@rok_cesnovar I bet this is something in Stan that is misbehaving. Do we have an example repo where we could inject couts to see where it gets stuck?

edit. wait remove macOS, haha

ahartikainen · September 3, 2020, 2:12pm

Great, did you use the same CmdStan installation or did you create a new one?

andrjohns · September 3, 2020, 2:30pm

Do you get the same behaviour for models other than your own? A quick test would be to run the Bernoulli example provided with cmdstan

stevebronder · September 3, 2020, 4:04pm

I wonder if we are doing something dumb in Stan when the user has a ton of NaN/infinite values? @rok_cesnovar @mathesong it may be helpful to run this with valgrind and heaptrack like

valgrind --leak-check=full --show-leak-kinds=all  ./cmdstanr_test sample data file=cmdstanr_test_data- output file=output_1.csv

heaptrack ./cmdstanr_test sample data file=cmdstanr_test_data- output file=output_1.csv

to see if we have leaks or bad memory reads/writes somewhere. I’m really not sure what would cause a full freeze without also taking up all of the user’s memory. I think we’d have to be writing memory to somewhere that leads to an unrecoverable state for the OS.
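
Since the open question is whether memory really stays flat until the freeze, it may also help to record the resident memory of a running chain over time. This is a minimal, Linux-only sketch. The function name `sample_rss` and its arguments are hypothetical, and it assumes `/proc/PID/status` exposes a `VmRSS` line (in kB), as it does on mainstream Linux kernels.

```shell
# Print "epoch_seconds pid rss_kib" for one process at a fixed interval.
# Usage: sample_rss PID [INTERVAL_SECONDS] [COUNT]
# The body runs in a subshell ( ... ) so its variables do not leak to the caller.
sample_rss() (
    pid=$1
    interval=${2:-5}
    count=${3:-12}
    n=0
    while [ "$n" -lt "$count" ]; do
        status_file="/proc/$pid/status"
        # Stop quietly once the process has exited.
        [ -r "$status_file" ] || break
        rss=""
        # Lines look like "VmRSS:     1234 kB"; keep the numeric field.
        while read -r key value rest; do
            if [ "$key" = "VmRSS:" ]; then rss=$value; fi
        done < "$status_file"
        echo "$(date +%s) $pid ${rss:-0}"
        n=$((n + 1))
        if [ "$n" -lt "$count" ]; then sleep "$interval"; fi
    done
    exit 0
)
```

For example, after starting one chain in the background, `sample_rss $! 5 200 >> rss.log` would log its memory every 5 seconds. If the last readings before a freeze are still small, that supports the idea that memory exhaustion is not the cause.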

rok_cesnovar · September 3, 2020, 4:53pm

The weirdest thing here is that we are not able to reproduce it. I am not familiar with Pop!_OS.

Going to try to install it as a virtual machine if that would help.

mathesong · September 3, 2020, 8:52pm

Can I just re-iterate how incredibly thankful I am for all this help!

Same installation

I tried the cmdstan bernoulli example, and it runs perfectly without crashing out on 6 parallel chains. However, it samples so quickly that I wonder if there isn’t time for anything to go wrong. I usually crash out after several minutes. In this case, the whole sampling is completed in less than 0.1 seconds for 10000 warmup and 10000 samples. But this is really promising at least that something works in parallel!

Pop!_OS is basically Ubuntu at its core, with a few tweaks here and there.

Ok, great! I’m running the valgrind now, and I’ll run the heaptrack afterwards and report back.

Thanks again everyone!!

ahartikainen · September 3, 2020, 9:23pm

Ok, then the error is probably in the Stan part.

Could you try to install a new cmdstan with CmdStanPy?

python -m cmdstanpy.install_cmdstan

And then run the model again. (You can remove the cmdstan from ~/.cmdstanpy folder after the test)

mathesong · September 3, 2020, 10:21pm

I installed a new cmdstanpy version, set cmdstan_path to the new directory, and then tried it again. Same result: crashes out after about 4 minutes. :(

Still waiting on the valgrind (it seems to be going really slowly? I’ll leave it on overnight). But I got the heaptrack. I did the suggested analysis (let me know if it would be better to share the whole zst file). I’ve attached it here: heaptrack_out.txt (403.5 KB)

I’ll update tomorrow with the valgrind if it’s finished.

mathesong · September 4, 2020, 8:12am

And the valgrind is now finished after 10 hours. Attached here: valgrid_out.txt (49.0 KB)

ahartikainen · September 5, 2020, 7:57am

I’m no C++ expert, but the first warnings are normal (?) and, given that your RAM hasn’t filled, are probably not the reason for the behaviour.

As for the second error, is there a possibility that the {Eigen, Matrix, resize} block gets collected at some point but our program still tries to use it? Would this cause a segfault?

Could you share the .hpp file?

mathesong · September 5, 2020, 7:41pm

Sure - it’s here: cmdstanr_test.hpp (134.8 KB)

maedoc · September 5, 2020, 8:38pm

You don’t mention the hardware, or whether this is a virtual machine. I would suggest leaving dmesg -w running in a terminal when you crash the machine: there are only so many ways you can crash Linux with a process like Stan, and the kernel will usually complain about something.
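
Because a hard freeze can take the terminal down with it, the kernel messages could also be written to disk as they arrive. This is a sketch under stated assumptions: `persist_lines` and `kernel.log` are hypothetical names, and `dmesg -w` may need `sudo` on systems that restrict access to the kernel log.

```shell
# Append every line read from stdin to a file and flush it to disk each time,
# so the last messages before a hard freeze survive the reboot.
# Usage: dmesg -w | persist_lines kernel.log   (kernel.log is a hypothetical name)
persist_lines() {
    while IFS= read -r line; do
        printf '%s\n' "$line" >> "$1"
        # Ask the kernel to write buffered data out to disk right away.
        sync
    done
}
```

If the systemd journal is configured to persist across reboots, `journalctl -k -b -1` should show the kernel messages from the previous boot as an alternative.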

mathesong · September 5, 2020, 9:07pm

Thanks for the suggestion!

Here’s my inxi -Fxz

System:
  Kernel: 5.4.0-7642-generic x86_64 bits: 64 compiler: gcc v: 9.3.0 
  Desktop: Gnome 3.36.4 Distro: Pop!_OS 20.04 LTS 
  base: Ubuntu 20.04 LTS Focal 
Machine:
  Type: Desktop System: ASUS product: N/A v: N/A serial: <filter> 
  Mobo: ASUSTeK model: ROG STRIX Z490-F GAMING v: Rev 1.xx serial: <filter> 
  UEFI: American Megatrends v: 0607 date: 05/29/2020 
CPU:
  Topology: 8-Core model: Intel Core i7-10700K bits: 64 type: MT MCP 
  arch: N/A L2 cache: 16.0 MiB 
  flags: avx avx2 lm nx pae sse sse2 sse3 sse4_1 sse4_2 ssse3 vmx 
  bogomips: 121596 
  Speed: 800 MHz min/max: 800/5100 MHz Core speeds (MHz): 1: 800 2: 800 
  3: 800 4: 800 5: 800 6: 800 7: 800 8: 801 9: 800 10: 800 11: 800 12: 800 
  13: 800 14: 800 15: 800 16: 800 
Graphics:
  Device-1: Intel vendor: ASUSTeK driver: i915 v: kernel bus ID: 00:02.0 
  Device-2: NVIDIA TU104 [GeForce RTX 2070 SUPER] driver: nvidia v: 440.100 
  bus ID: 01:00.0 
  Display: x11 server: X.Org 1.20.8 driver: modesetting,nvidia 
  unloaded: fbdev,nouveau,vesa resolution: 2560x1440~60Hz 
  OpenGL: renderer: GeForce RTX 2070 SUPER/PCIe/SSE2 v: 4.6.0 NVIDIA 440.100 
  direct render: Yes 
Audio:
  Device-1: Intel Comet Lake PCH cAVS vendor: ASUSTeK driver: snd_hda_intel 
  v: kernel bus ID: 00:1f.3 
  Device-2: NVIDIA TU104 HD Audio driver: snd_hda_intel v: kernel 
  bus ID: 01:00.1 
  Sound Server: ALSA v: k5.4.0-7642-generic 
Network:
  Device-1: Intel vendor: ASUSTeK driver: igc v: 0.0.1-k port: 3000 
  bus ID: 04:00.0 
  IF: enp4s0 state: down mac: <filter> 
  Device-2: Broadcom and subsidiaries BCM4352 802.11ac Wireless Network 
  Adapter 
  vendor: ASUSTeK driver: wl v: kernel port: 3000 bus ID: 05:00.0 
  IF: wlp5s0 state: up mac: <filter> 
Drives:
  Local Storage: total: 931.51 GiB used: 265.21 GiB (28.5%) 
  ID-1: /dev/nvme0n1 vendor: Samsung model: SSD 970 EVO Plus 1TB 
  size: 931.51 GiB 
Partition:
  ID-1: / size: 907.53 GiB used: 262.59 GiB (28.9%) fs: ext4 
  dev: /dev/nvme0n1p3 
  ID-2: swap-1 size: 4.00 GiB used: 0 KiB (0.0%) fs: swap dev: /dev/dm-0 
Sensors:
  System Temperatures: cpu: 27.8 C mobo: N/A gpu: nvidia temp: 39 C 
  Fan Speeds (RPM): N/A gpu: nvidia fan: 0% 
Info:
  Processes: 395 Uptime: 6m Memory: 62.65 GiB used: 3.69 GiB (5.9%) 
  Init: systemd runlevel: 5 Compilers: gcc: 9.3.0 clang: 10.0.0-4ubuntu1 
  Shell: bash v: 5.0.17 inxi: 3.0.38

I’ll run the dmesg -w now, and will update in the morning!

EDIT: also, forgot to say: not running it within a VM.

maedoc · September 5, 2020, 9:15pm

Thanks, as I thought, all of that is about the newest stuff that exists (not my boring GCC 4.8.5 on CentOS 7 on Westmere Xeon), so you might be running into a bug under the Stan layer of the stack. Will be interesting to see what the kernel says :popcorn:

mathesong · September 5, 2020, 9:45pm

Couldn’t resist following this until it crashed out. And we got something! “general protection fault”. I don’t know what it means, but I hope this means something to someone!

Of potential importance: I ran it at first, before realising I’d forgotten to set the seed. So I stopped it, pressed a few enters to space the new output from the old, and started it again. But it seems I got a similar output both times, and it hadn’t yet crashed my computer the first time (though it appeared just before I killed the process).

maedoc · September 5, 2020, 9:52pm

This can be caused by faulty memory sticks; you might run memtest to check. You can also run various stress test utilities to see whether they trigger the crash (to confirm it’s not Stan). Lastly, you might try using more generic compiler flags to avoid e.g. AVX instructions. These are guesses, since this is a generic error message.

maedoc · September 5, 2020, 10:06pm

It might, but a process will just exit on SIGSEGV if a handler isn’t set (hopefully not the case with CmdStan or R), not crash the machine. A segmentation fault should also be deterministic in the number of processes running.
