CmdStan: Disk Speed Optimisation?

I notice a substantial amount of time spent writing to disk during model fitting, especially when using standalone generate_quantities. I presume draws are being written to disk at each sampling step?

Does anybody see a big speedup from increasing disk write speed? I’m wondering if this may be a common bottleneck for those working in the cloud or on clusters with NFS-mounted drives.


Out of curiosity, how are you measuring the proportion of time spent writing to disk?

If you’re writing over NFS from a machine with some local storage, could you do a small experiment where you repoint Stan’s output directory to the local storage and measure what difference it makes?

If you are running on Linux, you could consider creating a tmpfs filesystem backed by RAM and see what happens if you tell Stan to write to that (provided you don’t write so much data that you run out of RAM for doing anything else).
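A generic way to run that comparison, without touching Stan at all, is to time fsynced writes of CSV-like rows to each candidate directory. This is a minimal sketch, not Stan's actual writer; the paths at the bottom are placeholders you'd point at your NFS mount and a tmpfs such as /dev/shm (commonly tmpfs-backed on Linux):

```python
import os
import time

def time_csv_writes(dirpath, n_rows=2000, n_cols=100):
    """Write n_rows CSV rows of n_cols formatted floats to a file in
    dirpath, fsync it, and return elapsed wall-clock seconds."""
    row = ",".join(f"{i * 0.123456:.6f}" for i in range(n_cols)) + "\n"
    path = os.path.join(dirpath, "stan_io_bench.csv")
    start = time.perf_counter()
    with open(path, "w") as f:
        for _ in range(n_rows):
            f.write(row)
        f.flush()
        os.fsync(f.fileno())  # force the data past the page cache
    elapsed = time.perf_counter() - start
    os.remove(path)
    return elapsed

# Placeholder paths -- substitute your own mounts:
# print(time_csv_writes("/path/to/nfs/dir"))
# print(time_csv_writes("/dev/shm"))
```

A large gap between the two timings would support the NFS-bottleneck hypothesis before you go to the trouble of rerouting Stan's output.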

I’m not familiar with the internals of generate_quantities, but I have seen similar issues in other systems, either with NFS or with databases: migrating the NFS or database server to a new hosting environment with increased latency to the client can cause a client app that wants to do frequent tiny-data network round trips to grind to a halt. For client applications whose code can be modified, you can sometimes get very large speedups by rewriting the code to batch the I/O and do more work per network round trip (cf. the classic “n+1 select antipattern” in enterprisey software using ORMs to query a SQL database over the network).

Also - how complicated a model are you running when you notice this issue?

I recall a thread last year looking at a performance issue with an incredibly trivial model – if the model is a toy “sample from a single 1-d normal distribution” model, then it is possible that the bottleneck ends up being the CPU-intensive work of formatting floating-point numbers into a human-readable character representation before they are written to disk.

For more complex models it is much less likely that formatting the output will be the performance bottleneck!
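To illustrate the formatting cost in isolation: this rough sketch (not Stan's actual writer) times formatting a batch of draws as CSV text versus packing the same doubles as raw binary. The sizes and digit count are arbitrary choices for the illustration:

```python
import random
import struct
import time

draws = [random.random() for _ in range(200_000)]

# Text path: format every value to 6 significant digits, roughly
# what a CSV draw writer does.
start = time.perf_counter()
text = (",".join(f"{v:.6g}" for v in draws)).encode()
text_seconds = time.perf_counter() - start

# Binary path: pack the same values as raw 8-byte doubles, no
# decimal conversion at all.
start = time.perf_counter()
blob = struct.pack(f"{len(draws)}d", *draws)
binary_seconds = time.perf_counter() - start

print(f"text:   {text_seconds:.4f} s, {len(text)} bytes")
print(f"binary: {binary_seconds:.4f} s, {len(blob)} bytes")
```

On typical hardware the decimal-formatting path is markedly slower than the raw pack, which is why a trivial model can end up CPU-bound on output formatting rather than on sampling.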

Thanks for the thoughts on this. To be clear, I’m not complaining about the time taken - just interested.

Experiment - I’d already considered this (I wasn’t trying to dump work on others), but it’s not obvious how to do it well: /tmp on the workers is a tmpfs but isn’t shared, and the other writeable paths are NFS mounts. Hence my question, to sense-check before putting in the work.

Proportion of time - I was far too imprecise. I meant that I noticed the file grows throughout model fitting rather than being dumped from RAM at the end, so I wondered how ~8000 writes to 4 × 30 MB–3 GB files on NFS, and all the necessary NFS syncs, might affect model fitting time, especially with the worker’s CPU maxed out. I’d guess NFS speed would decline further with many workers all sharing the same mount, though a difference in model fitting speed with 1 worker vs. 250 isn’t something I can say I’ve noticed or considered before.

If this were a problem, it might be solved by pointing output_dir at the tmpfs /tmp and doing a single copy to the NFS mount at the end of the run, though there’s no guarantee the combined time would be faster than repeatedly writing to an NFS output_dir. That could also be a ‘way in’ to sense-check this. I’ll try it later on.
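That write-locally-then-copy workflow can be sketched in a few lines. This assumes hypothetical paths (a local tmpfs directory that Stan writes into during fitting, e.g. via CmdStanPy's output_dir argument, and the NFS results directory) and a hypothetical helper name:

```python
import pathlib
import shutil

def publish_results(local_dir, nfs_dir):
    """Copy all CSV outputs from a local (e.g. tmpfs) directory to the
    shared NFS directory in one batch after fitting completes."""
    dest = pathlib.Path(nfs_dir)
    dest.mkdir(parents=True, exist_ok=True)
    for csv in sorted(pathlib.Path(local_dir).glob("*.csv")):
        shutil.copy2(csv, dest / csv.name)
    return sorted(p.name for p in dest.glob("*.csv"))
```

The idea is to turn ~8000 small NFS writes per chain into one large sequential copy, which NFS handles far better; timing the fit with and without it would answer the original question.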


Last time I profiled a Stan model I did not see a big cost for reads/writes to disk.


There will definitely be a bigger proportion of I/O for standalone generated quantities because it doesn’t do any sampling or autodiff: everything is double-based and computed from previously saved draws.

I’m not sure if there’s anything to be gained from our being more careful about buffering. It might not matter so much on my M2 notebook, but it’s probably going to matter for a cluster’s slow distributed drives.

Also, we’re designing a new command-line interface (not strictly replacing CmdStan, but an alternative foundation for CmdStanPy and CmdStanR) that will support binary I/O, which avoids the conversion between ASCII and floating point and is much more compact for floating-point values at high precision.
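To see the compactness point concretely: round-tripping a double losslessly through text takes up to 17 significant digits plus separators, while the binary form is always 8 bytes. A small illustration (generic, not the new interface's actual format):

```python
import struct

v = 0.12345678901234567

text = f"{v:.17g}"             # lossless decimal representation
binary = struct.pack("<d", v)  # raw little-endian IEEE 754 double

print(len(text), "text characters vs", len(binary), "binary bytes")
# Both representations recover the exact same double:
assert struct.unpack("<d", binary)[0] == v
assert float(text) == v
```

At lower precision text gets shorter, but then the CSV no longer reproduces the draws bit-for-bit, which is the trade-off binary output sidesteps.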
