I will soon need to run Stan inside a Virtual Machine, and I was wondering whether I should expect the same (or similar) performance that I would get when running directly on the host operating system.
My question is rather general, but I’m particularly interested in VirtualBox (Linux guest, Windows 10 host) on Intel.
I know that “in theory”, since Stan is CPU-intensive and RAM-intensive, being inside a VM should not matter that much since the machine instructions are directly executed on the CPU anyway, and syscalls should be (relatively) rare for Stan. But “in practice”, reality does love to be different …
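One way to check the "in theory" part empirically is to time the same purely CPU-bound loop on the host and inside the guest. The snippet below is only a rough sketch: `cpu_bound_work` and `best_wall_time` are hypothetical helper names, and the loop is an arbitrary stand-in, not anything Stan actually does. It performs no I/O, so any host/guest gap it shows should come from CPU virtualization rather than from syscalls.

```python
import time


def cpu_bound_work(n: int) -> float:
    """Pure arithmetic loop: no I/O, so it should trigger almost no syscalls."""
    total = 0.0
    for i in range(1, n + 1):
        total += (i % 7) * 0.5 / i
    return total


def best_wall_time(n: int, repeats: int = 5) -> float:
    """Return the fastest of several runs, to reduce scheduling noise."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        cpu_bound_work(n)
        timings.append(time.perf_counter() - start)
    return min(timings)


if __name__ == "__main__":
    # Run this once on the host and once in the guest, then compare.
    print(f"best wall time: {best_wall_time(2_000_000):.3f} s")
```

Taking the minimum over several repeats is a deliberate choice: it filters out runs that were disturbed by other processes on either the host or the guest.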
The thread you quoted is very interesting, thanks!
It seems that there is a general consensus against running Stan on Windows; many people report severe degradation in performance, mostly blaming either the inefficient Windows compiler(s) or that Stan is developed for Linux first and then ported to Windows. So, for the sake of a general discussion about Virtualized Environments, let’s put Stan-on-Windows out of the equation, and consider Stan-on-Linux only (virtualized or not).
Stan should (as far as I know, correct me if I’m wrong) spend 99.9% of the time on the CPU cores (crunching numbers and accessing the RAM), doing very few syscalls (I/O mostly). Syscalls are the place where performance issues usually arise in virtualized environments.
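The "99.9% on the CPU" guess can be checked by comparing the user-mode and kernel-mode CPU time of a finished process. Here is a minimal sketch for Linux, assuming the standard-library `resource` module is available (it is Unix-only). `cpu_time_split` is a hypothetical helper, and the child command in the demo is just a Python one-liner standing in for whatever command runs your compiled model.

```python
import resource
import subprocess
import sys


def cpu_time_split(cmd):
    """Run cmd and return (user_s, system_s, system_fraction) for the child.

    System time is CPU time spent in the kernel, i.e. mostly inside syscalls.
    """
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL)
    after = resource.getrusage(resource.RUSAGE_CHILDREN)
    user = after.ru_utime - before.ru_utime
    system = after.ru_stime - before.ru_stime
    total = user + system
    fraction = system / total if total > 0 else 0.0
    return user, system, fraction


if __name__ == "__main__":
    # Stand-in workload (an assumption): replace with your model's command.
    demo_cmd = [sys.executable, "-c", "sum(i * i for i in range(3_000_000))"]
    user, system, fraction = cpu_time_split(demo_cmd)
    print(f"user={user:.3f}s system={system:.3f}s kernel share={fraction:.1%}")
```

If the kernel share is small on bare metal and grows noticeably inside the VM, that would point at syscall or I/O overhead rather than at raw computation.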
Even during model compilation, when the C++ compiler accesses many files, I would expect CPU and RAM to be the bottleneck, since the compiler spends most of its time applying expensive optimizations (loop unrolling, function inlining, etc.).
During MCMC sampling, after the initial data loading from the filesystem, I would expect syscalls to be rare and made only:
- to get more memory from the OS (e.g. when appending new MCMC samples to the chain output buffer)
- to get the system time for measuring elapsed times
- to output debug messages
I’m wondering if the above description is accurate: it is what I would naively expect in theory, but I’m not a specialist in numerical programming, and I’m far from being an expert in Stan or virtualization. There are probably other important aspects that I have overlooked and that arise in practice.
I am just looking into running Stan on VMs as well. As this thread did not continue (unfortunately), did you gain any additional insights in the process that are worth sharing?