STAN on multiple cores occasionally crashing Linux without overwhelming memory

Also, 10th gen Intel added some new multi-core speed boost features. If you are overclocking the CPU or memory (or even if you aren’t), you may want to head into the BIOS and turn the CPU and memory speeds down by 10% or so to ensure stability.


Didn’t @bbbales2 do something like this with campfire?

Edit: Whups sorry wrong thread!

This solves it! Thank you so much!!

What was wrong

For those who might run into this in future, the issue seems to be with my 10th gen Intel processor. Changing some of the default BIOS settings to undo some of the optimisation seems to do the trick.

What I did

As you suggested, I ran sudo memtester 1024 5 to test the memory and got perfect results: no issues. I also ran 2 hours of stress testing with GtkStressTesting to see whether it would freeze the machine, and it passed without a problem. So there is still no easy way to diagnose the issue.

This was the suggestion that solved it. I haven’t overclocked at all, and am running default settings on everything. I headed into the BIOS regardless. My BIOS, for anyone else who might have this problem in future, is American Megatrends Version 2.20.1276. I couldn’t figure out how to directly turn the CPU speed down, but I tried several things. What finally resolved the issue was under Advanced Mode > Ai Tweaker > Ai Overclock Tuner, the default setting was XMP I, which I changed to Auto. This automatically changed several other settings too:

  • DRAM Frequency [DDR4-3200MHz] → [Auto]
  • DRAM CAS# Latency [16] → [Auto]
  • DRAM RAS# to CAS# Delay [20] → [Auto]
  • DRAM RAS# ACT Time [38] → [Auto]
  • DRAM Voltage [1.35000] → [Auto]
  • RC6(Render Standby) [Disabled] → [Enabled]
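
To confirm from within Linux that a change like this took effect, one option is to compare the rated and configured memory speeds that dmidecode reports. The sketch below is only an illustration: it assumes the output of `sudo dmidecode --type memory` contains `Speed:` and `Configured Memory Speed:` lines (older dmidecode versions label the latter `Configured Clock Speed:`), and the sample text is made up rather than taken from the machine in this thread.

```python
import re

# Hypothetical excerpt in the style of `sudo dmidecode --type memory` output.
# The numbers are illustrative, not measured on the machine discussed here.
SAMPLE = """\
Memory Device
\tLocator: DIMM_A1
\tSpeed: 3200 MT/s
\tConfigured Memory Speed: 2133 MT/s
Memory Device
\tLocator: DIMM_B1
\tSpeed: 3200 MT/s
\tConfigured Memory Speed: 2133 MT/s
"""

# Match only the three fields of interest at the start of a line, so that
# e.g. "Bank Locator:" is not mistaken for "Locator:".
_FIELD = re.compile(
    r"^\s*(Locator|Speed|Configured (?:Memory|Clock) Speed):\s*(.+?)\s*$"
)


def memory_speeds(dmidecode_text):
    """Return a list of (locator, rated, configured) tuples, one per DIMM."""
    dimms = []
    current = None
    for line in dmidecode_text.splitlines():
        if line.strip() == "Memory Device":
            current = {}
            dimms.append(current)
            continue
        match = _FIELD.match(line)
        if match and current is not None:
            key, value = match.groups()
            if key.startswith("Configured"):
                key = "Configured"  # normalise old and new field names
            current[key] = value
    return [(d.get("Locator"), d.get("Speed"), d.get("Configured")) for d in dimms]


if __name__ == "__main__":
    for locator, rated, configured in memory_speeds(SAMPLE):
        print(f"{locator}: rated {rated}, running at {configured}")
```

On a real system you would pass the captured dmidecode output to `memory_speeds` instead of `SAMPLE`. A configured speed below the rated speed would suggest the XMP profile is no longer applied.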

I could then run the cmdstanpy model without a problem, on 6 cores simultaneously. I’ve done it twice now to make sure, and it worked both times (compared to 0 out of about 15-20 times before). I’ll return here if I start encountering this problem again on more than this specific example. But it sure feels like it’s working now.

What’s still unresolved

  • I’m still a bit mystified that only STAN triggers the crashes. I would have imagined that dedicated memory tests or CPU stress tests would have failed too, but they found nothing… I’m very happy to try things out to resolve this for others in future, or to test any future modifications to STAN to evaluate whether they protect against this.

  • We also don’t know if it’s for all 10th Gen Intel processors. I would imagine there are a few of them out there.

Lastly

Thank you so much to everyone for all your help! I really appreciate you all spending your time and energy on helping me to resolve this issue! It means a lot :).


Great to hear that solved the problem. It is perhaps unsurprising that Intel is playing games these days, given the pressure from AMD.

A bit of a guess, but unless you build a stress tester from source, the generic binary from your distribution isn’t going to warm up those transistor-heavy AVX2 arithmetic units, since it will only stress the more widely supported SSE4.2. You could probably have triggered the crash with multicore linpack benchmarks or similar.
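
One quick way to see which vector extensions are in play is to read the `flags` line in `/proc/cpuinfo`. A minimal sketch, assuming a Linux x86 system. It only reports what the CPU advertises, not which instructions a particular stress-test binary actually uses.

```python
from pathlib import Path


def cpu_flags(cpuinfo_text):
    """Return the set of CPU feature flags from /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        name, _, value = line.partition(":")
        if name.strip() == "flags":
            # All cores normally report the same flags, so the first hit is enough.
            return set(value.split())
    return set()


def vector_support(flags):
    """Map a few SIMD extensions to whether the CPU advertises them."""
    wanted = ("sse4_2", "avx", "avx2", "fma", "avx512f")
    return {name: name in flags for name in wanted}


if __name__ == "__main__":
    path = Path("/proc/cpuinfo")
    text = path.read_text() if path.exists() else ""
    for name, present in vector_support(cpu_flags(text)).items():
        print(f"{name}: {'yes' if present else 'no'}")
```

If `avx2` shows up here, a binary built for a generic x86-64 baseline may still never touch those units, which would fit the guess above.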

The 10700K seems to be the one targeting people who like to overclock and are willing to play the lottery. It is probably a segment to avoid, unless you want to run into similar problems. If Stan is your big use case, it may be better to look at AMD; the Ryzens et al. are a better deal.

Woof, glad to hear this was resolved! Random question: in your BIOS, do you also have physical RAM scrambling turned off? It’s usually on by default. It’s a weird feature that I’ve never really seen fully documented (though this Stack Overflow post gives a good rundown of it). It’s confusing, but essentially it scrambles physical RAM to allow higher frequencies through the RAM sticks. Turning that off could be another source of problems. Talking about the literal flipping of bits is a bit outside my domain knowledge though.

If you like messing with hardware you could compile a stress tester from source and flip around your CAS, RAS, voltage etc. till it breaks. Though tbh I’ve done stuff like that before, and you’ll probably get more from flipping off Linux safety flags (if you trust the software you’re running you can flip off Spectre mitigation) and using some fancy malloc implementation like mimalloc. @wds15 and I have both seen 20% speed gains from using alternative mallocs.
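
For anyone who wants to try the allocator swap, one common approach on Linux is to preload the allocator into the process that launches the model via `LD_PRELOAD`. A minimal sketch: the library path below is a placeholder that depends on how mimalloc was installed, and `env_with_preload` is a hypothetical helper, not part of Stan or mimalloc.

```python
import os

# Placeholder path: where libmimalloc.so lives depends on your distribution
# or on where you built it.
MIMALLOC_PATH = "/usr/local/lib/libmimalloc.so"


def env_with_preload(library_path, base_env=None):
    """Return a copy of the environment with library_path first in LD_PRELOAD."""
    env = dict(os.environ if base_env is None else base_env)
    existing = env.get("LD_PRELOAD", "")
    # Keep anything that was already preloaded, but put the allocator first.
    env["LD_PRELOAD"] = f"{library_path}:{existing}" if existing else library_path
    return env


if __name__ == "__main__":
    env = env_with_preload(MIMALLOC_PATH)
    print(env["LD_PRELOAD"])
    # The resulting mapping can be passed as env= to subprocess.run(...) when
    # launching a compiled model executable.
```

Child processes inherit the environment, so setting the variable in the shell before starting Python should have the same effect. Whether it helps for a given model is something to measure rather than assume.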

@maedoc PopOS looks like it pitches itself as an ML-focused OS. Maybe they do something that, mixed with the 10th gen, causes this. I’m on Ryzen so can’t check.

The microcode updates appear to negate this, and the impact of the mitigations is apparently minimal, cf

(though Haswell fares better than older architectures) so using a better allocator is probably way better
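
To see which mitigations the kernel currently has active before deciding whether they matter, Linux exposes one status file per known vulnerability under `/sys/devices/system/cpu/vulnerabilities`. A small sketch that reads them, assuming a reasonably recent Linux kernel; the function name is made up for illustration.

```python
from pathlib import Path

VULN_DIR = Path("/sys/devices/system/cpu/vulnerabilities")


def mitigation_status(directory=VULN_DIR):
    """Return {vulnerability_name: kernel status string} for each file found."""
    directory = Path(directory)
    if not directory.is_dir():
        return {}  # not Linux, or a kernel too old to expose this directory
    status = {}
    for entry in sorted(directory.iterdir()):
        if entry.is_file():
            status[entry.name] = entry.read_text().strip()
    return status


if __name__ == "__main__":
    for name, state in mitigation_status().items():
        print(f"{name}: {state}")
```

Each line reports whether the CPU is affected and, if so, which mitigation is in use, which gives a baseline before benchmarking with and without them.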
