Buying a new computer - best hardware for fast Stan performance

Hi folks,
my department is going to buy a new computer that will be mainly used for Stan (running in R). What is your recommendation for the hardware (considering the trade-off between money and performance/computation time)?

Thanks in advance!
Alexander


I’ve been wondering the same, as I’m soon replacing my home PC. Are Alienware machines well-suited to running Stan (also in R)? I’m eyeing a new Coffee Lake model for the hex cores, with room to expand to dual GPUs later. Will such power be useful in future Stan?

My advice is not to worry about raw CPU power too much, because runtime is driven by 1) algorithmic issues such as proper scaling of parameters, numerical problems caused by poor fit, whatever posterior geometry drives the stepsize to be too small, etc.; and 2) model optimization to trim the auto-diff tree size and share calculations. Everything else is at least an order of magnitude down the line.

One thing that does slow you down is not being able to run multiple models with multiple chains at once, so I’d get a dual-CPU motherboard with slightly dated hex-core processors before I tried to get the latest, fastest CPU. SSDs are nice for getting things to compile faster.


Nice!

Stan runs MCMC chains in parallel across your processor’s cores, so if it were me I’d try to get a processor with as many cores as possible. I think the new Intel and AMD ones that came out this past summer have up to 16 cores. More RAM is always cheap and gives you good bang for your buck. I think @Bob_Carpenter also mentioned in another thread that GPU support is coming soon in Stan, so you probably want an NVIDIA GPU for that, maybe even in addition to the GPU that drives your monitors.
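For reference, the usual rstan pattern for one chain per core looks something like this (a minimal sketch; the model file name and data list are placeholders):

```r
library(rstan)

# Use one core per chain so chains run in parallel.
options(mc.cores = parallel::detectCores())

# Placeholder model and data; swap in your own .stan file and data list.
fit <- stan(
  file   = "my_model.stan",
  data   = list(N = 10L, y = rnorm(10)),
  chains = 4,
  iter   = 2000
)
```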

I’m tagging @bbbales2 because he’d have good advice on this. He taught me everything I know about hardware.


Thanks for the advice, guys. I should explain where I’m coming from on this. Currently I mostly use JAGS on my laptop or an old Xeon hex-core PC. But I’ve taken note of the developments within Stan of MPI-based within-chain parallelisation and GPU support, so since I’m thinking of upgrading the PC anyhow, I was wondering whether something like an Alienware machine would be good for taking advantage of these new algorithms. Unfortunately 16 cores etc. are out of my reach, but a new hex-core machine I could stuff a GPU or two into later on is doable - and the Alienwares seem easy to upgrade - but I have no experience with them. I do want the single-core speed too for other reasons, so buying older chips is not attractive @sakrejda.
Apologies @Alexander, I do not mean to hijack your thread!

A GPU should be good for large Cholesky decompositions. If you can break your problem down into suitable chunks, then MPI will be amazing. I think MPI is more general in terms of when it speeds up your model, while GPU is really specific (but if a GPU can speed up your case, then you will probably want it). For MPI: the more CPUs the better, and what hasn’t been mentioned yet is fast RAM.

… but MPI will be a bit of a pain for the user to program up his model…


I looked at this a while ago (disclaimer, I have not looked at it since v2.16), and here is what I found using a fairly complicated model on large data:

For my model with some large matrix operations, the biggest speedup came from getting onto a system with lots of L3 cache on the CPU(s). I ran the same model on a bunch of different hardware, and the slowest runs were on the system with the highest clock speed, a consumer-grade workstation (i7 @ 3.8 GHz, 8 MB L3 cache - 4 days). Middling times were on some compute servers (Google Cloud, campus clusters) with decent clock speeds, though I was sharing with other jobs (~3.2 GHz - 2 days). The fastest times were on a system I cobbled together from used server parts off eBay. That system has the lowest clock (and oldest CPUs) but the highest L3 cache (dual Xeon @ 2.6 GHz, 25 MB L3 per chip - 14 hrs). Edit: just checked, and the run times were lower than I remembered.

Hardly authoritative, but from my experience running other models since then, Stan benefits greatly from large CPU cache and high memory bandwidth. My low-clock, high-memory-bandwidth “Frankenstein” workstation is my go-to for running Stan. I would guess that the newer AMD Threadripper CPUs in a workstation could be a great value, as they have gobs of cache. I will get a chance to play with one in the coming months and will report back.


@wds15 - These days I’m doing a lot of multi-level/hierarchical models of different types for modelling longitudinal, multiple-outcome datasets. Will MPI help for those?

@mespe - really interesting thanks. I was eyeing up a model with this chip: https://ark.intel.com/products/126686/Intel-Core-i7-8700-Processor-12M-Cache-up-to-4_60-GHz

Yes, for hierarchical models I expect the speedups from MPI to be quite nice. MPI has a communication cost which needs to be balanced against the computational cost per unit. Hierarchical models are usually good in this regard: you have few parameters per unit and lots of data, so there is little to communicate but a lot to compute per parameter set - which is what you need for MPI to work well.


For reference, I built my dual Xeon system for $500 USD. That said, I have been playing around with computer hardware for a long while and have lots of experience building systems, so your mileage might vary.

The other consideration is what kind of models you are running - it really only makes sense (in my mind) to fret about hardware if you are running models which are hitting some kind of realistic limit. For most of my day-to-day work, a laptop is more than enough. Even with that larger model, I developed the model and tested it on my laptop with a subset of the data, and only needed the computation for the final stages with the full data set.

That line, for me, was when, after every optimization I could think of, the model was still taking long enough that I could not run it overnight and look at the results the next morning.


That’s going to be true for most mathematical operations in decently optimized software. Getting data from memory to the CPU registers is really really expensive.

I don’t know how much the interfaces parallelize compilation, but you can do a lot in parallel on the build side from CmdStan.

For parallelizing chains, that’s useful up to four or so—enough to diagnose multimodalities and non-convergence, but beyond that, you’re mainly getting proportionately higher effective sample sizes for the same runs. You still have to pay the price of warmup.

When we get MPI done, the way to speed things up will be parallelizing within chains. That’ll be able to use a lot of cores. So 16, 32, or a cluster full of cores would be good in that situation if you really need to scale. Personally, I think I’m going to get one of the new iMacs because I won’t have to learn a new OS and I want a nice 5K screen.

If it’s even a question, good solid-state disks are a necessity, since compiling Stan brings in an awful lot of little files.
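(As an aside, rstan can also cache the compiled model so that re-running the same .stan file skips recompilation; a minimal sketch, with a placeholder model file and data:)

```r
library(rstan)

# Save the compiled model next to the .stan file and reuse it on later
# calls instead of recompiling from scratch.
rstan_options(auto_write = TRUE)

fit <- stan(file = "my_model.stan", data = my_data)  # placeholders
```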


Maybe I’m just being contrarian, but I have an old server with crap HDDs and I don’t really notice the slower compilation speed… but then again I usually do way more model validation than model writing, so I just go do something else while a model compiles.

@mespe - once upon a time I’d have been all for such a build, but my knowledge is way out of date for such shenanigans. The Xeon chip in my current desktop is not suitable for a dual-chip board and has slow RAM in any case, so there is no point in harvesting it (my 4-year-old MBP is much faster running JAGS models than the desktop, probably due to faster RAM). My motivation here is that I’m building a lot of these models on different datasets, and waiting hours or overnight for results really screws with my workflow, so even marginal gains per model would add up for me. The potential gains offered by MPI would be amazing, but I need to transition from JAGS to Stan first anyhow. I’m simply thinking ahead, since I was planning to upgrade the PC anyway.

Thanks all for your responses. Most helpful and informative!


Hey @mespe (or anyone else who has thoughts), I was curious if you’d since had any new insights on desirable specs for Stan. My laptop just died and needs to be replaced, so I unfortunately have to figure out what sort of model to buy.
I often run somewhat complex models (e.g. >25k observations with thousands of parameters), though ironically I’ve only ever hit the limits of my RAM when using the survey package in R, which I had understood to be less computationally intensive than Stan.
Thank you!

Buy a CPU with as much cache as you can get. The next thing to maximize is the CPU count… though if your model is amenable to map_rect parallelization, then you may weight more CPUs a bit more heavily. The key is to get the AD tape of the Stan program to fit into the innermost caches as much as possible.
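(For illustration, once threaded map_rect is available in rstan 2.18+, enabling it looks roughly like this; a hedged sketch: the model file and data are placeholders, and the model itself must use map_rect() and be compiled with threading enabled:)

```r
# Hedged sketch: within-chain parallelism via map_rect() in rstan 2.18+.
# The model must be compiled with threading enabled, e.g. by adding
#   CXX14FLAGS += -DSTAN_THREADS -pthread
# to ~/.R/Makevars before compiling.
library(rstan)

Sys.setenv(STAN_NUM_THREADS = 4)  # threads available to map_rect per chain

fit <- stan(
  file   = "my_map_rect_model.stan",  # placeholder: a model that uses map_rect()
  data   = my_shard_data,             # placeholder: data arranged into shards
  chains = 4
)
```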


I just picked up a Threadripper 1950X (4 GHz, 16 cores/32 threads, 32 MB of L3 cache). Is there a Stan benchmark test somewhere I could run and post results from? Could be interesting to make a generic benchmark people can run and post their specs with.


This one, for example:

but that one does not include map_rect stuff. I have written a Stan program which is suitable for map_rect testing (and there are a few scripts posted on the forum).

@peopletrees - I also rarely have issues with RAM and Stan (cmdstan or Rstan). But then most of my systems have 16GB minimum.

One thing to be aware of re: CPU count - running on hyperthreaded cores can be much slower than running on physical CPU cores. @Bob_Carpenter has mentioned this a few times in other threads, but if you have a CPU with 2 physical cores + 2 hyperthreads, you are often better off running only 2 chains in parallel rather than 4. I have experienced 4 chains on 2 physical cores (i.e., 2 chains at a time, run sequentially) finishing sooner than 4 chains in parallel so often that I now just default to using only physical cores in parallel with Stan.

FYI - parallel::detectCores() in R includes hyperthreaded cores, so the rstan advice of “For execution on a local, multicore CPU with excess RAM we recommend calling options(mc.cores = parallel::detectCores()).” should probably have an asterisk or be amended to “parallel::detectCores(logical = FALSE)”. I would guess this advice currently actually slows down the analysis on many systems. @bgoodri - thoughts?
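For illustration, on a typical 4-core/8-thread CPU the two calls differ like this (a minimal sketch):

```r
library(parallel)

detectCores()                 # logical CPUs, e.g. 8 on a 4-core/8-thread chip
detectCores(logical = FALSE)  # physical cores only, e.g. 4

# Arguably a safer default for parallel chains on hyperthreaded CPUs:
options(mc.cores = parallel::detectCores(logical = FALSE))
```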


Whether utilizing hyperthreads makes things better or worse can depend on the posterior. I haven’t looked at it in great detail or after Spectre.

I would be really curious to know what features of a model/posterior work well with hyperthreading. I personally have found it rarely helps in my own work, but then I tend to use a small set of model types.