I am in the process of attempting to run some Stan models on a research computing cluster. I am quite new to cluster computing, so apologies if I’m unable to provide the right information. To test things, I initially ran an R benchmarking script that computes things like a large matrix inverse/determinant, drawing a single sample from a high-dimensional multivariate normal, etc. These run about 10x faster on a single cluster core than on my local machine, so no surprises there.
I was able to build the latest cmdstan release and compile/run my model. I deliberately chose a moderately large problem (a latent-variable Gaussian process with ~500 data points) which takes ~20 min to run locally.
What is surprising is that when I run the cmdstan model on the cluster, sampling appears to be slightly slower than on my local machine (per the total time to run the chains and per the gradient evaluation timing “adjust your expectations accordingly” message). I am able to request enough memory that memory utilization stays relatively low.
My ultimate goal is to run a model that takes about 6 hours on my current M1 laptop with 8 chains, 500 warmup iterations, and 125 post-warmup iterations per chain. This is about the minimum to get acceptable convergence. I understand that I can request a large number of cores to run even more chains with fewer iterations each, but it seems to me that this cannot actually make the model run any quicker than on my personal computer; I can only potentially generate more samples. Is this correct?
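To make the chains-vs-iterations trade-off concrete, here is a back-of-envelope sketch (the 1000-draw target is a hypothetical number, not from the thread): every chain pays the full warmup cost, and only the post-warmup draws can be divided among parallel chains, so warmup sets a floor on wall time.

```shell
# Wall time scales roughly with per-chain iterations:
#   warmup + (total post-warmup draws) / (number of parallel chains)
# Warmup is not shared between chains, so it is a fixed floor.
warmup=500
total_draws=1000   # hypothetical target for total post-warmup draws
for chains in 1 4 8; do
  echo "$chains chains: $((warmup + total_draws / chains)) iterations per chain"
done
```

With these assumed numbers, going from 4 to 8 chains only shrinks per-chain work from 750 to 625 iterations, because the 500 warmup iterations cannot be parallelized away.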
In general, does this suggest a potential issue with model compilation/execution that I can try to troubleshoot? Or is it just an inherent limitation of Stan programs that don’t use reduce_sum or other manual parallelization?
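For reference, the general shape of a reduce_sum model looks something like the sketch below. This is a generic independent-likelihood example, not your GP model (the variable names and the simple normal likelihood are placeholders); for a latent GP, note that the dominant cost is often a Cholesky factorization, which reduce_sum does not parallelize.

```stan
functions {
  // partial log-likelihood over a slice of the data
  real partial_sum(array[] real y_slice, int start, int end,
                   vector mu, real sigma) {
    return normal_lupdf(y_slice | mu[start:end], sigma);
  }
}
data {
  int<lower=1> N;
  array[N] real y;
}
parameters {
  vector[N] mu;
  real<lower=0> sigma;
}
model {
  int grainsize = 1; // let the scheduler choose slice sizes
  mu ~ normal(0, 1);
  sigma ~ normal(0, 1);
  target += reduce_sum(partial_sum, y, grainsize, mu, sigma);
}
```

To get any benefit, the model has to be compiled with STAN_THREADS=true and run with the num_threads argument set.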
The C++ toolchain on the cluster is:
g++ --version
g++ (GCC) 14.2.0
Copyright (C) 2024 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
make -version
GNU Make 4.2.1
Built for x86_64-redhat-linux-gnu
Copyright (C) 1988-2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
I would argue this actually is a bit surprising: many traditional clusters feature CPUs that can actually be slightly slower than something like an M1 on single-core performance, but make up for it by having 64 (or 128, or more) cores available.
Based on the speedup, I would speculate that operations like the large matrix inverse were using a BLAS on the cluster that really is multi-threaded under the hood (if so, and if your model is linear-algebra heavy, you might want to see some of Aki’s advice for compilation).
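One way to test this speculation: pin the common BLAS/OpenMP threading knobs to a single thread and rerun the R benchmark on the cluster. If the 10x speedup disappears, the benchmark was measuring multi-threaded BLAS rather than single-core speed. (`benchmark.R` below is a stand-in name for the original benchmarking script, and which of these environment variables matters depends on which BLAS the cluster’s R is linked against.)

```shell
# Pin BLAS/OpenMP libraries to one thread before rerunning the benchmark.
export OPENBLAS_NUM_THREADS=1
export OMP_NUM_THREADS=1
export MKL_NUM_THREADS=1
# benchmark.R is a placeholder for the original R benchmarking script
if command -v Rscript >/dev/null 2>&1; then
  Rscript benchmark.R
fi
```

Checking `sessionInfo()` in R also reports which BLAS/LAPACK libraries are actually in use.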
More or less, yes. Without manual parallelization work, Stan probably won’t go much faster than the single-threaded performance of the core each chain ends up scheduled on.
I believe this is the case; however, the available cores claim to have only a single thread each, so I’m still not sure why it is occurring. Maybe it’s actually an issue with my local install of R or its vector math libraries.