Stan on multiple cores occasionally crashing Linux without overwhelming memory

The weirdest thing here is that we are not able to reproduce it. I am not familiar with Pop!_OS.

I’m going to try installing it in a virtual machine, if that would help.

Can I just re-iterate how incredibly thankful I am for all this help!

  • Same installation
  • I tried the cmdstan bernoulli example, and it runs perfectly without crashing out on 6 parallel chains. However, it samples so quickly that I wonder if there isn’t time for anything to go wrong. I usually crash out after several minutes. In this case, the whole sampling is completed in less than 0.1 seconds for 10000 warmup and 10000 samples. But this is really promising at least that something works in parallel!

Pop!_OS is basically Ubuntu at its core, with a few tweaks here and there.

Ok, great! I’m running the valgrind now, and I’ll run the heaptrack afterwards and report back.

Thanks again everyone!!

OK, then the error is probably in the Stan part.

Could you try to install a new cmdstan with CmdStanPy?

python -m cmdstanpy.install_cmdstan

And then run the model again. (You can remove the cmdstan from ~/.cmdstanpy folder after the test)
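A rough sketch of that test as a Python function, assuming a recent CmdStanPy in which `install_cmdstan`, `set_cmdstan_path`, `cmdstan_path`, and the `parallel_chains` argument are all available (older releases name some of these differently). The function name, its arguments, and the default seed are placeholders for illustration, not something taken from this thread.

```python
def run_with_fresh_cmdstan(stan_file, data_file, cmdstan_dir=None, chains=6, seed=1234):
    """Sample a model in parallel against a freshly installed CmdStan.

    Hypothetical helper: stan_file, data_file and cmdstan_dir are placeholder
    paths. The default of 6 chains mirrors the 6 parallel chains used in
    this thread; the seed is arbitrary but fixed so runs are comparable.
    """
    # Imported lazily so this file can be loaded on a machine without CmdStanPy.
    import cmdstanpy

    if cmdstan_dir is None:
        # Download and build a new CmdStan in CmdStanPy's default location.
        if not cmdstanpy.install_cmdstan():
            raise RuntimeError("CmdStan installation failed")
    else:
        # Point CmdStanPy at an installation you already unpacked and built.
        cmdstanpy.set_cmdstan_path(str(cmdstan_dir))

    # Confirm which installation will actually be used.
    print("Using CmdStan at:", cmdstanpy.cmdstan_path())

    model = cmdstanpy.CmdStanModel(stan_file=str(stan_file))
    return model.sample(
        data=str(data_file),
        chains=chains,
        parallel_chains=chains,  # run all chains at the same time
        seed=seed,
    )
```

Printing the path before sampling makes it easy to check that the run is not silently using the old installation.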

I installed a new cmdstanpy version, set cmdstan_path to the new directory, and then tried it again. Same result: crashes out after about 4 minutes. :(

Still waiting on the valgrind (it seems to be going really slowly? I’ll leave it on overnight). But I got the heaptrack. I did the suggested analysis (let me know if it would be better to share the whole zst file). I’ve attached it here: heaptrack_out.txt (403.5 KB)

I’ll update tomorrow with the valgrind if it’s finished.

And the valgrind is now finished after 10 hours. Attached here: valgrid_out.txt (49.0 KB)

I’m no C++ expert, but the first warnings look normal (?) and, given that your RAM hasn’t filled up, are probably not the reason for the behaviour.

As for the second error: is there a possibility that the {Eigen, Matrix, resize} block gets freed at some point while our program still tries to use it? Would this cause a segfault?

Could you share the .hpp file?

Sure - it’s here: cmdstanr_test.hpp (134.8 KB)

You don’t mention the hardware, or whether this is a virtual machine. I would suggest leaving dmesg -w running in a terminal when you crash the machine: there are only so many ways you can crash Linux with a process like Stan, and the kernel will usually complain about something.
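If the kernel log gets noisy, something like the following could pull out the interesting lines from saved dmesg output. This is only a sketch: the function name and the list of patterns are my own assumptions about what a fatal kernel message tends to contain, not an exhaustive or official catalogue.

```python
import re

# Substrings that commonly accompany fatal kernel events. This list is an
# illustrative assumption; extend it to suit your own logs.
FAULT_PATTERNS = [
    r"general protection fault",
    r"machine check",
    r"mce:",
    r"segfault",
    r"kernel panic",
    r"\boops\b",
    r"out of memory",
]
_FAULT_RE = re.compile("|".join(FAULT_PATTERNS), re.IGNORECASE)


def find_kernel_faults(lines):
    """Return the log lines (without trailing newline) that match a fault pattern."""
    return [line.rstrip("\n") for line in lines if _FAULT_RE.search(line)]
```

It accepts any iterable of lines, so an open file containing a saved copy of the dmesg output works directly.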

Thanks for the suggestion!

Here’s my inxi -Fxz

System:
  Kernel: 5.4.0-7642-generic x86_64 bits: 64 compiler: gcc v: 9.3.0 
  Desktop: Gnome 3.36.4 Distro: Pop!_OS 20.04 LTS 
  base: Ubuntu 20.04 LTS Focal 
Machine:
  Type: Desktop System: ASUS product: N/A v: N/A serial: <filter> 
  Mobo: ASUSTeK model: ROG STRIX Z490-F GAMING v: Rev 1.xx serial: <filter> 
  UEFI: American Megatrends v: 0607 date: 05/29/2020 
CPU:
  Topology: 8-Core model: Intel Core i7-10700K bits: 64 type: MT MCP 
  arch: N/A L2 cache: 16.0 MiB 
  flags: avx avx2 lm nx pae sse sse2 sse3 sse4_1 sse4_2 ssse3 vmx 
  bogomips: 121596 
  Speed: 800 MHz min/max: 800/5100 MHz Core speeds (MHz): 1: 800 2: 800 
  3: 800 4: 800 5: 800 6: 800 7: 800 8: 801 9: 800 10: 800 11: 800 12: 800 
  13: 800 14: 800 15: 800 16: 800 
Graphics:
  Device-1: Intel vendor: ASUSTeK driver: i915 v: kernel bus ID: 00:02.0 
  Device-2: NVIDIA TU104 [GeForce RTX 2070 SUPER] driver: nvidia v: 440.100 
  bus ID: 01:00.0 
  Display: x11 server: X.Org 1.20.8 driver: modesetting,nvidia 
  unloaded: fbdev,nouveau,vesa resolution: 2560x1440~60Hz 
  OpenGL: renderer: GeForce RTX 2070 SUPER/PCIe/SSE2 v: 4.6.0 NVIDIA 440.100 
  direct render: Yes 
Audio:
  Device-1: Intel Comet Lake PCH cAVS vendor: ASUSTeK driver: snd_hda_intel 
  v: kernel bus ID: 00:1f.3 
  Device-2: NVIDIA TU104 HD Audio driver: snd_hda_intel v: kernel 
  bus ID: 01:00.1 
  Sound Server: ALSA v: k5.4.0-7642-generic 
Network:
  Device-1: Intel vendor: ASUSTeK driver: igc v: 0.0.1-k port: 3000 
  bus ID: 04:00.0 
  IF: enp4s0 state: down mac: <filter> 
  Device-2: Broadcom and subsidiaries BCM4352 802.11ac Wireless Network 
  Adapter 
  vendor: ASUSTeK driver: wl v: kernel port: 3000 bus ID: 05:00.0 
  IF: wlp5s0 state: up mac: <filter> 
Drives:
  Local Storage: total: 931.51 GiB used: 265.21 GiB (28.5%) 
  ID-1: /dev/nvme0n1 vendor: Samsung model: SSD 970 EVO Plus 1TB 
  size: 931.51 GiB 
Partition:
  ID-1: / size: 907.53 GiB used: 262.59 GiB (28.9%) fs: ext4 
  dev: /dev/nvme0n1p3 
  ID-2: swap-1 size: 4.00 GiB used: 0 KiB (0.0%) fs: swap dev: /dev/dm-0 
Sensors:
  System Temperatures: cpu: 27.8 C mobo: N/A gpu: nvidia temp: 39 C 
  Fan Speeds (RPM): N/A gpu: nvidia fan: 0% 
Info:
  Processes: 395 Uptime: 6m Memory: 62.65 GiB used: 3.69 GiB (5.9%) 
  Init: systemd runlevel: 5 Compilers: gcc: 9.3.0 clang: 10.0.0-4ubuntu1 
  Shell: bash v: 5.0.17 inxi: 3.0.38 

I’ll run the dmesg -w now, and will update in the morning!

EDIT: also, forgot to say: not running it within a VM.

Thanks, as I thought, all of that is about the newest stuff that exists (not my boring GCC 4.8.5 on CentOS 7 on Westmere Xeon), so you might be running into a bug under the Stan layer of the stack. Will be interesting to see what the kernel says :popcorn:

Couldn’t resist following this until it crashed out. And we got something! “general protection fault”. I don’t know what it means, but I hope this means something to someone!

Of potential importance: I ran it at first, before realising I’d forgotten to set the seed. So I stopped it, pressed a few enters to space the new output from the old, and started it again. But it seems I got a similar output both times, and it hadn’t yet crashed my computer the first time (though it appeared just before I killed the process).

This can be caused by faulty memory sticks; you might run memtest to check. You can also run various stress test utilities to see whether they trigger the crash (to confirm it’s not Stan). Lastly, you might try more generic compiler flags to avoid e.g. AVX instructions. These are guesses, since this is a generic error message.

It might, but a process will just exit on SIGSEGV if a handler isn’t set (hopefully not the case with CmdStan or R), not crash the machine. A segmentation fault should also be deterministic in the number of processes running.

Also, the 10th gen Intel chips added some new multi-core speed boost stuff. If you are doing any overclocking on the CPU or memory (or even if you aren’t), you may want to head into the BIOS and turn the CPU and memory speed down 10% or so to ensure stability.

Didn’t @bbbales2 do something like this with campfire?

Edit: Whups sorry wrong thread!

This solves it! Thank you so much!!

What was wrong

For those who might run into this issue in future, the issue seems to be with my 10th gen Intel Processor. Changing out some of the default settings in the BIOS to undo some of the optimisation seems to do the trick.

What I did

As you suggested, I tried running sudo memtester 1024 5 to test the memory, and got perfect results: no issues. I also ran 2 hours of stress testing using GtkStressTesting to see if it might freeze it, and it cleared it without a problem. So still no easy way to diagnose the issue.

This was the suggestion that solved it. I haven’t overclocked at all, and am running default settings on everything. I headed into the BIOS regardless. My BIOS, for anyone else who might have this problem in future, is American Megatrends Version 2.20.1276. I couldn’t figure out how to directly turn the CPU speed down, but I tried several things. What finally resolved the issue was under Advanced Mode > Ai Tweaker > Ai Overclock Tuner, the default setting was XMP I, which I changed to Auto. This automatically changed several other settings too:

  • DRAM Frequency [DDR4-3200MHz] -> [Auto]
  • DRAM CAS# Latency [16] -> [Auto]
  • DRAM RAS# to CAS# Delay [20] -> [Auto]
  • DRAM RAS# ACT Time [38] -> [Auto]
  • DRAM Voltage [1.35000] -> [Auto]
  • RC6(Render Standby) [Disabled] -> [Enabled]
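To put the old XMP I values in perspective, the timings can be converted from clock cycles to nanoseconds. This is a small sketch using the standard DDR relationship (two transfers per clock); the function name is made up, and the thread does not say what the Auto setting resolved to, so only the XMP values from the list above are shown.

```python
def timing_ns(transfer_rate_mts, cycles):
    """Convert a DRAM timing given in clock cycles to nanoseconds.

    DDR memory makes two transfers per clock, so the clock period in
    nanoseconds is 2000 / (transfer rate in MT/s).
    """
    return cycles * 2000.0 / transfer_rate_mts


# The XMP I values listed above, at DDR4-3200.
xmp_timings = {"CAS latency": 16, "RAS to CAS delay": 20, "RAS ACT time": 38}

for name, cycles in xmp_timings.items():
    print(f"{name}: {cycles} cycles = {timing_ns(3200, cycles):.2f} ns")
# CAS latency: 16 cycles = 10.00 ns
# RAS to CAS delay: 20 cycles = 12.50 ns
# RAS ACT time: 38 cycles = 23.75 ns
```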

I could then run the cmdstanpy model without a problem, on 6 cores simultaneously. I’ve done it twice now to make sure, and it worked both times (compared to 0 out of about 15-20 times before). I’ll return here if I start encountering this problem again on more than this specific example. But it sure feels like it’s working now.

What’s still unresolved

  • I’m still a bit mystified that it’s only Stan that could manifest the crashes. I would have imagined that dedicated memory tests or CPU stress tests should have failed too, but they found nothing… I’m very happy to try things out to resolve this for others in future, or to test any future modifications to Stan to evaluate if they protect against this.

  • We also don’t know if it’s for all 10th Gen Intel processors. I would imagine there are a few of them out there.

Lastly

Thank you so much to everyone for all your help! I really appreciate you all spending your time and energy on helping me to resolve this issue! It means a lot :).

Great to hear that solved the problem. It’s perhaps unsurprising that Intel is playing games these days, given the pressure from AMD.

A bit of a guess, but unless you install a stress tester from source, the generic binary from your distribution isn’t going to warm up those transistor-heavy AVX2 arithmetic units, since it will only stress the more widely supported SSE4.2. You could probably have triggered the crash with multicore linpack benchmarks or similar.
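As a quick check of which SIMD extensions the CPU advertises, something like this sketch works on a space-separated flags string. The function name is hypothetical, and the result only says what the CPU supports, not which instructions a given stress-test binary actually uses.

```python
def simd_support(flags_line):
    """Report which SIMD extensions appear in a space-separated CPU flags string."""
    flags = set(flags_line.split())
    wanted = ["sse4_2", "avx", "avx2", "avx512f"]
    return {name: name in flags for name in wanted}


# Flags as reported by the inxi output earlier in this thread. inxi prints
# only a selection of flags, so a missing entry here is not conclusive.
inxi_flags = "avx avx2 lm nx pae sse sse2 sse3 sse4_1 sse4_2 ssse3 vmx"
print(simd_support(inxi_flags))
# {'sse4_2': True, 'avx': True, 'avx2': True, 'avx512f': False}
```

On Linux the complete list is on the `flags` line of /proc/cpuinfo, which can be passed to the same function.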

The 10700K seems to be the one targeting people who like to overclock and are willing to play the lottery. It is probably a segment to avoid, unless you want to run into similar problems. If Stan is your big use case, it can be better to look at AMD; the Ryzens et al. are a better deal.

Woof, glad to hear this was resolved! Random question: in your BIOS, do you also have physical RAM scrambling turned off? It’s usually on by default. It’s a weird feature that I’ve never really seen fully documented (though this stackoverflow gives a good rundown about it). It’s confusing, but essentially it scrambles physical RAM to allow higher frequencies through the RAM sticks. Turning that off could be another source of problems. Talking about the literal flipping of bits is a bit out of my domain knowledge though.

If you like messing with hardware, you could compile a stress tester from source and flip around your CAS, RAS, voltage etc. till it breaks. Though tbh I’ve done stuff like that before, and you’ll probably get more from flipping off Linux safety flags (if you trust the software you’re running you can flip off Spectre mitigation) and using some fancy malloc implementation like mimalloc. @wds15 and I have both seen 20% speed gains from using alternative mallocs.

@maedoc Pop!_OS looks like it pitches itself as an ML-focused OS. Maybe they do something that, mixed with the 10th gen, causes this. I’m on Ryzen so can’t check.

The microcode updates appear to negate this, and the impact of the mitigations is apparently minimal, cf

(though Haswell fares better than older architectures) so using a better allocator is probably way better