One thing that might be worth trying is running the model directly in cmdstan (on the command line) or using cmdstanpy, as @ahartikainen mentioned. That would help narrow down where the problem lies (R or cmdstan).
Thanks so much for the help. I’m pretty rusty on my Python, but managed to get it running (and did it in Jupyter before I saw your edit with the code). I got everything working with the stan file and the json that I prepared for Rok yesterday. It went on for about 3 to 4 minutes before my system froze up again, without any changes to the progress bars. I’ve attached a screenshot of how the notebook looked.
Lemme see if I can try it in cmdstan directly too…
Same outcome with cmdstan.
On the command line, I’m compiling, and then sampling using the following:
for i in {1..6}
do
./cmdstanr_test sample data file=cmdstanr_test_data- \
output file=output_${i}.csv &
done
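For anyone who prefers to drive the same loop from Python, here is a minimal sketch that keeps each chain's console output in its own log file, which makes it easier to see which chain printed what before a freeze. It is not what was run in this thread: `build_chain_command`, `launch_chains`, `wait_for_chains`, and the `chain_N.log` names are hypothetical, and `data_file` is left as a parameter because the data file name in the post above is truncated.

```python
import subprocess
from pathlib import Path


def build_chain_command(exe, data_file, chain):
    """Return the argument list for one chain, mirroring the shell loop."""
    return [
        str(exe), "sample",
        "data", f"file={data_file}",
        "output", f"file=output_{chain}.csv",
    ]


def launch_chains(exe, data_file, n_chains=6, log_dir="."):
    """Start every chain in the background; return (process, log handle) pairs."""
    running = []
    for chain in range(1, n_chains + 1):
        # One log per chain (hypothetical naming), so output is not interleaved.
        log = open(Path(log_dir) / f"chain_{chain}.log", "w")
        proc = subprocess.Popen(
            build_chain_command(exe, data_file, chain),
            stdout=log, stderr=subprocess.STDOUT,
        )
        running.append((proc, log))
    return running


def wait_for_chains(running):
    """Block until all chains exit and return their exit codes in order."""
    codes = []
    for proc, log in running:
        codes.append(proc.wait())
        log.close()
    return codes
```

Calling `wait_for_chains(launch_chains("./cmdstanr_test", data_file))` would then correspond to the six background jobs started by the shell loop.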
I get the same bunch of warnings as in cmdstanr and cmdstanpy (like the following, but maybe 50 of them). I’m still trying to figure out which part of the model is ill-specified, but that’s another story for another thread.
Informational Message: The current Metropolis proposal is about to be rejected because of the following issue:
Exception: normal_lpdf: Location parameter[1] is -nan, but must be finite! (in '/home/granville/Repositories/Multilevel_TCM/R/Sim_data_model/Rok_Testing/cmdstanr_test.stan', line 197, column 4 to column 41)
If this warning occurs sporadically, such as for highly constrained variable types like covariance matrices, then the sampler is fine,
but if this warning occurs often then your model may be either severely ill-conditioned or misspecified.
Informational Message: The current Metropolis proposal is about to be rejected because of the following issue:
Exception: normal_lpdf: Location parameter[1] is -nan, but must be finite! (in '/home/granville/Repositories/Multilevel_TCM/R/Sim_data_model/Rok_Testing/cmdstanr_test.stan', line 197, column 4 to column 41)
If this warning occurs sporadically, such as for highly constrained variable types like covariance matrices, then the sampler is fine,
but if this warning occurs often then your model may be either severely ill-conditioned or misspecified.
Informational Message: The current Metropolis proposal is about to be rejected because of the following issue:
Exception: normal_lpdf: Location parameter[1] is -nan, but must be finite! (in '/home/granville/Repositories/Multilevel_TCM/R/Sim_data_model/Rok_Testing/cmdstanr_test.stan', line 197, column 4 to column 41)
If this warning occurs sporadically, such as for highly constrained variable types like covariance matrices, then the sampler is fine,
but if this warning occurs often then your model may be either severely ill-conditioned or misspecified.
and then it starts warmup, and then the system hangs. This time after just less than 7 minutes.
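To judge whether warnings like these occur "sporadically" or "often", it can help to count them per Stan source line from a saved log. The following is only a sketch: `count_rejections` is a hypothetical helper, the regular expression assumes the message layout shown in the log above, and the sample text uses a placeholder path.

```python
import re
from collections import Counter

# Matches the "Exception: ..." line printed for a rejected proposal, capturing
# the message and the Stan source line it points at. The layout is assumed
# from the log excerpt in this thread.
EXCEPTION_RE = re.compile(
    r"Exception: (?P<message>.+?) \(in '(?P<path>[^']+)', line (?P<line>\d+),"
)


def count_rejections(log_text):
    """Count rejected-proposal exceptions, keyed by (message, Stan line number)."""
    counts = Counter()
    for match in EXCEPTION_RE.finditer(log_text):
        counts[(match["message"], int(match["line"]))] += 1
    return counts


# Placeholder log text in the same shape as the warnings above.
sample = """\
Informational Message: The current Metropolis proposal is about to be rejected because of the following issue:
Exception: normal_lpdf: Location parameter[1] is -nan, but must be finite! (in '/path/to/model.stan', line 197, column 4 to column 41)
Informational Message: The current Metropolis proposal is about to be rejected because of the following issue:
Exception: normal_lpdf: Location parameter[1] is -nan, but must be finite! (in '/path/to/model.stan', line 197, column 4 to column 41)
"""

for (message, line), n in count_rejections(sample).items():
    print(f"line {line}: {n} x {message}")
```

If every count points at the same line (here line 197), that is at least a hint about which statement of the model to inspect first.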
@rok_cesnovar I bet this is something in Stan that is misbehaving. Do we have some example repo where we could inject couts to see where it gets stuck?
edit. wait remove macOS, haha
great, did you use the same cmdstan installation or did you create a new one?
Do you get the same behaviour for models other than your own? A quick test would be to run the Bernoulli example provided with cmdstan
I wonder if we are doing something dumb in Stan when the user has a ton of NaN/infinite values? @rok_cesnovar @mathesong it may be helpful to run this with valgrind and heaptrack, like
valgrind --leak-check=full --show-leak-kinds=all ./cmdstanr_test sample data file=cmdstanr_test_data- output file=output_1.csv
heaptrack ./cmdstanr_test sample data file=cmdstanr_test_data- output file=output_1.csv
to see if we have leaks or bad memory reads/writes somewhere. I’m really not sure what would cause a full freeze without also taking up all of the user’s memory. I think we’d have to be writing memory somewhere that leads to an unrecoverable state for the OS.
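One cheap way to check the "without taking up all of the user's memory" part is to log available memory in a second terminal while the chains run. This is a rough sketch that assumes a Linux system exposing `/proc/meminfo`; `parse_meminfo` and `watch_memory` are hypothetical helpers, not part of any Stan tooling.

```python
import time


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integers (kB for most fields)."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def watch_memory(n_samples=5, interval=1.0, source="/proc/meminfo"):
    """Print MemAvailable and SwapFree (in MiB) every `interval` seconds."""
    for _ in range(n_samples):
        with open(source) as handle:
            info = parse_meminfo(handle.read())
        available = info.get("MemAvailable", 0) // 1024
        swap_free = info.get("SwapFree", 0) // 1024
        stamp = time.strftime("%H:%M:%S")
        # flush=True so the last reading is visible even if the machine hangs.
        print(f"{stamp} available={available} MiB swap_free={swap_free} MiB", flush=True)
        time.sleep(interval)
```

Running `watch_memory(n_samples=600)` alongside the sampler would show whether available memory is actually dropping in the minutes before the freeze.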
The thing that is the most weird here is that we are not able to reproduce. I am not familiar with Pop!OS.
Going to try to install it as a virtual machine if that would help.
Can I just re-iterate how incredibly thankful I am for all this help!
- Same installation
- I tried the cmdstan bernoulli example, and it runs perfectly without crashing out on 6 parallel chains. However, it samples so quickly that I wonder if there isn’t time for anything to go wrong. I usually crash out after several minutes. In this case, the whole sampling is completed in less than 0.1 seconds for 10000 warmup and 10000 samples. But this is really promising at least that something works in parallel!
Pop!OS is just basically Ubuntu at its core, with a few tweaks here and there.
Ok, great! I’m running the valgrind now, and I’ll run the heaptrack afterwards and report back.
Thanks again everyone!!
Ok, then the error is probably in the Stan part.
Could you try to install a new cmdstan with CmdStanPy?
python -m cmdstanpy.install_cmdstan
And then run the model again. (You can remove the cmdstan from the ~/.cmdstanpy folder after the test.)
I installed a new cmdstanpy version, set cmdstan_path to the new directory, and then tried it again. Same result: crashes out after about 4 minutes. :(
Still waiting on the valgrind (it seems to be going really slowly? I’ll leave it on overnight). But I got the heaptrack. I did the suggested analysis (let me know if it would be better to share the whole zst file). I’ve attached it here: heaptrack_out.txt (403.5 KB)
I’ll update tomorrow with the valgrind if it’s finished.
And the valgrind is now finished after 10 hours. Attached here: valgrid_out.txt (49.0 KB)
I’m no C++ expert, but the first warnings look normal (?) and, given that your RAM hasn’t filled, are probably not the reason for the behaviour.
As for the second error: is there a possibility that the {Eigen, Matrix, resize} block gets collected at some point while our program still tries to use it? Would this cause a segfault?
Could you share the .hpp file?
Sure - it’s here: cmdstanr_test.hpp (134.8 KB)
You don’t mention the hardware or whether this is a virtual machine. I would suggest leaving dmesg -w running in a terminal when you crash the machine, since there are only so many ways you can crash Linux with a process like Stan, and the kernel will usually complain about something.
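If the machine freezes so hard that the terminal can no longer be read, the kernel messages can also be copied to a file and synced to disk line by line. The sketch below is one way to do that under a few assumptions: `append_durably` and `follow_kernel_log` are hypothetical helpers, `kernel.log` is an arbitrary file name, and reading `dmesg -w` may need elevated privileges on some systems.

```python
import os
import subprocess


def append_durably(handle, line):
    """Write one line and force it to disk before returning."""
    handle.write(line if line.endswith("\n") else line + "\n")
    handle.flush()             # push Python's buffer to the OS
    os.fsync(handle.fileno())  # ask the OS to push it to the disk


def follow_kernel_log(path="kernel.log", command=("dmesg", "-w")):
    """Copy each line printed by `command` to `path`, syncing after every line."""
    with open(path, "a") as out, subprocess.Popen(
        list(command), stdout=subprocess.PIPE, text=True
    ) as proc:
        for line in proc.stdout:
            append_durably(out, line)
```

With `dmesg -w` the call to `follow_kernel_log()` runs until it is interrupted, so it belongs in its own terminal; after a reboot, the tail of `kernel.log` should hold the last messages that reached the disk.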
Thanks for the suggestion!
Here’s my inxi -Fxz output:
System:
Kernel: 5.4.0-7642-generic x86_64 bits: 64 compiler: gcc v: 9.3.0
Desktop: Gnome 3.36.4 Distro: Pop!_OS 20.04 LTS
base: Ubuntu 20.04 LTS Focal
Machine:
Type: Desktop System: ASUS product: N/A v: N/A serial: <filter>
Mobo: ASUSTeK model: ROG STRIX Z490-F GAMING v: Rev 1.xx serial: <filter>
UEFI: American Megatrends v: 0607 date: 05/29/2020
CPU:
Topology: 8-Core model: Intel Core i7-10700K bits: 64 type: MT MCP
arch: N/A L2 cache: 16.0 MiB
flags: avx avx2 lm nx pae sse sse2 sse3 sse4_1 sse4_2 ssse3 vmx
bogomips: 121596
Speed: 800 MHz min/max: 800/5100 MHz Core speeds (MHz): 1: 800 2: 800
3: 800 4: 800 5: 800 6: 800 7: 800 8: 801 9: 800 10: 800 11: 800 12: 800
13: 800 14: 800 15: 800 16: 800
Graphics:
Device-1: Intel vendor: ASUSTeK driver: i915 v: kernel bus ID: 00:02.0
Device-2: NVIDIA TU104 [GeForce RTX 2070 SUPER] driver: nvidia v: 440.100
bus ID: 01:00.0
Display: x11 server: X.Org 1.20.8 driver: modesetting,nvidia
unloaded: fbdev,nouveau,vesa resolution: 2560x1440~60Hz
OpenGL: renderer: GeForce RTX 2070 SUPER/PCIe/SSE2 v: 4.6.0 NVIDIA 440.100
direct render: Yes
Audio:
Device-1: Intel Comet Lake PCH cAVS vendor: ASUSTeK driver: snd_hda_intel
v: kernel bus ID: 00:1f.3
Device-2: NVIDIA TU104 HD Audio driver: snd_hda_intel v: kernel
bus ID: 01:00.1
Sound Server: ALSA v: k5.4.0-7642-generic
Network:
Device-1: Intel vendor: ASUSTeK driver: igc v: 0.0.1-k port: 3000
bus ID: 04:00.0
IF: enp4s0 state: down mac: <filter>
Device-2: Broadcom and subsidiaries BCM4352 802.11ac Wireless Network
Adapter
vendor: ASUSTeK driver: wl v: kernel port: 3000 bus ID: 05:00.0
IF: wlp5s0 state: up mac: <filter>
Drives:
Local Storage: total: 931.51 GiB used: 265.21 GiB (28.5%)
ID-1: /dev/nvme0n1 vendor: Samsung model: SSD 970 EVO Plus 1TB
size: 931.51 GiB
Partition:
ID-1: / size: 907.53 GiB used: 262.59 GiB (28.9%) fs: ext4
dev: /dev/nvme0n1p3
ID-2: swap-1 size: 4.00 GiB used: 0 KiB (0.0%) fs: swap dev: /dev/dm-0
Sensors:
System Temperatures: cpu: 27.8 C mobo: N/A gpu: nvidia temp: 39 C
Fan Speeds (RPM): N/A gpu: nvidia fan: 0%
Info:
Processes: 395 Uptime: 6m Memory: 62.65 GiB used: 3.69 GiB (5.9%)
Init: systemd runlevel: 5 Compilers: gcc: 9.3.0 clang: 10.0.0-4ubuntu1
Shell: bash v: 5.0.17 inxi: 3.0.38
I’ll run dmesg -w now, and will update in the morning!
EDIT: also, forgot to say: not running it within a VM.
Thanks, as I thought, all of that is about the newest stuff that exists (not my boring GCC 4.8.5 on CentOS 7 on Westmere Xeon), so you might be running into a bug under the Stan layer of the stack. Will be interesting to see what the kernel says :popcorn:
Couldn’t resist following this until it crashed out. And we got something! “general protection fault”. I don’t know what it means, but I hope this means something to someone!
Of potential importance: I ran it at first, before realising I’d forgotten to set the seed. So I stopped it, pressed a few enters to space the new output from the old, and started it again. But it seems I got a similar output both times, and it hadn’t yet crashed my computer the first time (though it appeared just before I killed the process).
This can be caused by faulty memory sticks; you might run memtest to check. You can also run various stress-test utilities to see whether they trigger the crash (to confirm it’s not Stan). Lastly, you might try more generic compiler flags to avoid e.g. AVX instructions. These are guesses, since this is a generic error message.
It might, but a process will just exit on SIGSEGV if a handler isn’t set (hopefully not the case with CmdStan or R), not crash the machine. A segmentation fault should also be deterministic in the number of processes running.
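The point that an unhandled SIGSEGV only ends the offending process can be illustrated with a small, harmless experiment. This sketch assumes a POSIX system such as Linux; `run_crashing_child` is a hypothetical helper, and the child here is a Python one-liner that signals itself, not CmdStan.

```python
import signal
import subprocess
import sys

# Child program: deliver SIGSEGV to itself with no handler installed.
CHILD = "import os, signal; os.kill(os.getpid(), signal.SIGSEGV)"


def run_crashing_child():
    """Run the child and return its exit status as seen by the parent."""
    return subprocess.run([sys.executable, "-c", CHILD]).returncode


status = run_crashing_child()
# On POSIX, a negative return code -N means "terminated by signal N".
print("child terminated by", signal.Signals(-status).name)
```

The parent keeps running and simply observes the signal in the return code, which is why a plain segfault in one chain would not be expected to freeze the whole machine.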