Multicore speedups differ between models


#1

Hi there,

I’m running a bunch of Stan models with different numbers of CPU cores. Ideally, for the same number of chains and iterations, if there is no coordination between chains, the speedup (defined as the runtime on 1 core divided by the runtime on multiple cores) should equal the number of cores. However, I find that some Stan models cannot achieve this ideal speedup.

For example, let’s compare the runtimes on 1 core and 4 cores. Among the models from StanCon (https://github.com/stan-dev/stancon_talks), the first one (Twelve Cities: Does lowering speed limits save pedestrian lives?) has a speedup of 4. The third one (Advertising Attribution Modeling in the Movie Industry, with model ‘stancon_data.rda’) has a speedup of only 1.8. In both models, the only sampling call is to the stan() function. I have tested the two models on 3 CPUs from 3 very different microarchitecture families, and the speedups are consistent.

My question is: does the multichain NUTS algorithm require some coordination between chains (I can’t find any in the manual or the paper), like exchanging samples once in a while? What may cause the speedup difference between models?

Thanks in advance!


#2

Wild, but slightly informed speculation follows:

At a guess, the difference would be in the adaptation and step-size selection. Stan starts the chains at diverse starting points, so for difficult models it’s possible that the adaptation takes longer depending on the starting point. Because the chains find their own diagonal mass matrix during the warmup phase, they also won’t really be running the same Markov chain (the scaling will be different), so it’s possible that one chain will be scaled worse than the others and will therefore take longer for the modified NUTS criterion to tell it to stop exploring along the trajectory.


#3

No


#4

There is no coordination between chains. If you are using rstan (what I am most familiar with), you can run each chain independently, and then combine the results at the end - if you pay for compute resources this is going to be the most efficient way to do it.

I would check the total running time of the chains to see if it is similar between the parallel and serial cases. Some chains will finish earlier, but the function only returns when the longest-running chain finishes.

Another possibility, less likely given what you said about microarchitectures, is that the CPU cores are not truly independent. Some CPU architectures have virtual cores that can run some instructions in parallel but have bottlenecks for others.


#5

Yes, I find some chains finish earlier than others. Comparing the parallel and serial runs (of advertisement attribution), the total runtimes of the chains are different. In general, every chain takes longer in the parallel run than in the serial run.

Let me paste the outputs below. It does look like it’s actually using virtual cores, which share one physical core. If that is really the case, I don’t understand why the sharing of physical cores is consistent across the 3 CPUs yet differs between models (why do some models, like twelve cities, not share physical cores?).

serial outputs:

SAMPLING FOR MODEL ‘ad_attribution’ NOW (CHAIN 1).

Chain 1, Iteration: 1 / 750 [ 0%] (Warmup)
Chain 1, Iteration: 75 / 750 [ 10%] (Warmup)
Chain 1, Iteration: 150 / 750 [ 20%] (Warmup)
Chain 1, Iteration: 225 / 750 [ 30%] (Warmup)
Chain 1, Iteration: 300 / 750 [ 40%] (Warmup)
Chain 1, Iteration: 375 / 750 [ 50%] (Warmup)
Chain 1, Iteration: 376 / 750 [ 50%] (Sampling)
Chain 1, Iteration: 450 / 750 [ 60%] (Sampling)
Chain 1, Iteration: 525 / 750 [ 70%] (Sampling)
Chain 1, Iteration: 600 / 750 [ 80%] (Sampling)
Chain 1, Iteration: 675 / 750 [ 90%] (Sampling)
Chain 1, Iteration: 750 / 750 [100%] (Sampling)
Elapsed Time: 338.999 seconds (Warm-up)
301.73 seconds (Sampling)
640.729 seconds (Total)

SAMPLING FOR MODEL ‘ad_attribution’ NOW (CHAIN 2).

Chain 2, Iteration: 1 / 750 [ 0%] (Warmup)
Chain 2, Iteration: 75 / 750 [ 10%] (Warmup)
Chain 2, Iteration: 150 / 750 [ 20%] (Warmup)
Chain 2, Iteration: 225 / 750 [ 30%] (Warmup)
Chain 2, Iteration: 300 / 750 [ 40%] (Warmup)
Chain 2, Iteration: 375 / 750 [ 50%] (Warmup)
Chain 2, Iteration: 376 / 750 [ 50%] (Sampling)
Chain 2, Iteration: 450 / 750 [ 60%] (Sampling)
Chain 2, Iteration: 525 / 750 [ 70%] (Sampling)
Chain 2, Iteration: 600 / 750 [ 80%] (Sampling)
Chain 2, Iteration: 675 / 750 [ 90%] (Sampling)
Chain 2, Iteration: 750 / 750 [100%] (Sampling)
Elapsed Time: 308.289 seconds (Warm-up)
185.655 seconds (Sampling)
493.944 seconds (Total)

SAMPLING FOR MODEL ‘ad_attribution’ NOW (CHAIN 3).

Chain 3, Iteration: 1 / 750 [ 0%] (Warmup)
Chain 3, Iteration: 75 / 750 [ 10%] (Warmup)
Chain 3, Iteration: 150 / 750 [ 20%] (Warmup)
Chain 3, Iteration: 225 / 750 [ 30%] (Warmup)
Chain 3, Iteration: 300 / 750 [ 40%] (Warmup)
Chain 3, Iteration: 375 / 750 [ 50%] (Warmup)
Chain 3, Iteration: 376 / 750 [ 50%] (Sampling)
Chain 3, Iteration: 450 / 750 [ 60%] (Sampling)
Chain 3, Iteration: 525 / 750 [ 70%] (Sampling)
Chain 3, Iteration: 600 / 750 [ 80%] (Sampling)
Chain 3, Iteration: 675 / 750 [ 90%] (Sampling)
Chain 3, Iteration: 750 / 750 [100%] (Sampling)
Elapsed Time: 344.956 seconds (Warm-up)
181.927 seconds (Sampling)
526.883 seconds (Total)

SAMPLING FOR MODEL ‘ad_attribution’ NOW (CHAIN 4).

Chain 4, Iteration: 1 / 750 [ 0%] (Warmup)
Chain 4, Iteration: 75 / 750 [ 10%] (Warmup)
Chain 4, Iteration: 150 / 750 [ 20%] (Warmup)
Chain 4, Iteration: 225 / 750 [ 30%] (Warmup)
Chain 4, Iteration: 300 / 750 [ 40%] (Warmup)
Chain 4, Iteration: 375 / 750 [ 50%] (Warmup)
Chain 4, Iteration: 376 / 750 [ 50%] (Sampling)
Chain 4, Iteration: 450 / 750 [ 60%] (Sampling)
Chain 4, Iteration: 525 / 750 [ 70%] (Sampling)
Chain 4, Iteration: 600 / 750 [ 80%] (Sampling)
Chain 4, Iteration: 675 / 750 [ 90%] (Sampling)
Chain 4, Iteration: 750 / 750 [100%] (Sampling)
Elapsed Time: 333.21 seconds (Warm-up)
182.352 seconds (Sampling)
515.562 seconds (Total)

Inference for Stan model: ad_attribution.
4 chains, each with iter=750; warmup=375; thin=1;
post-warmup draws per chain=375, total post-warmup draws=1500.

     mean se_mean    sd     2.5%      25%      50%      75%    97.5% n_eff Rhat
lp__ -1721.08 0.51 10.92 -1743.65 -1728.15 -1720.99 -1714.18 -1700.02 462 1.01

Samples were drawn using NUTS(diag_e) at Mon Jul 10 05:17:32 2017.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).

parallel outputs:

SAMPLING FOR MODEL ‘ad_attribution’ NOW (CHAIN 1).

SAMPLING FOR MODEL ‘ad_attribution’ NOW (CHAIN 2).

SAMPLING FOR MODEL ‘ad_attribution’ NOW (CHAIN 3).

Chain 1, Iteration: 1 / 750 [ 0%] (Warmup)
SAMPLING FOR MODEL ‘ad_attribution’ NOW (CHAIN 4).

Chain 2, Iteration: 1 / 750 [ 0%] (Warmup)
Chain 3, Iteration: 1 / 750 [ 0%] (Warmup)
Chain 4, Iteration: 1 / 750 [ 0%] (Warmup)
Chain 3, Iteration: 75 / 750 [ 10%] (Warmup)
Chain 4, Iteration: 75 / 750 [ 10%] (Warmup)
Chain 1, Iteration: 75 / 750 [ 10%] (Warmup)
Chain 2, Iteration: 75 / 750 [ 10%] (Warmup)
Chain 3, Iteration: 150 / 750 [ 20%] (Warmup)
Chain 4, Iteration: 150 / 750 [ 20%] (Warmup)
Chain 2, Iteration: 150 / 750 [ 20%] (Warmup)
Chain 1, Iteration: 150 / 750 [ 20%] (Warmup)
Chain 3, Iteration: 225 / 750 [ 30%] (Warmup)
Chain 4, Iteration: 225 / 750 [ 30%] (Warmup)
Chain 4, Iteration: 300 / 750 [ 40%] (Warmup)
Chain 3, Iteration: 300 / 750 [ 40%] (Warmup)
Chain 2, Iteration: 225 / 750 [ 30%] (Warmup)
Chain 1, Iteration: 225 / 750 [ 30%] (Warmup)
Chain 4, Iteration: 375 / 750 [ 50%] (Warmup)
Chain 4, Iteration: 376 / 750 [ 50%] (Sampling)
Chain 3, Iteration: 375 / 750 [ 50%] (Warmup)
Chain 3, Iteration: 376 / 750 [ 50%] (Sampling)
Chain 2, Iteration: 300 / 750 [ 40%] (Warmup)
Chain 1, Iteration: 300 / 750 [ 40%] (Warmup)
Chain 4, Iteration: 450 / 750 [ 60%] (Sampling)
Chain 2, Iteration: 375 / 750 [ 50%] (Warmup)
Chain 2, Iteration: 376 / 750 [ 50%] (Sampling)
Chain 1, Iteration: 375 / 750 [ 50%] (Warmup)
Chain 1, Iteration: 376 / 750 [ 50%] (Sampling)
Chain 4, Iteration: 525 / 750 [ 70%] (Sampling)
Chain 3, Iteration: 450 / 750 [ 60%] (Sampling)
Chain 2, Iteration: 450 / 750 [ 60%] (Sampling)
Chain 1, Iteration: 450 / 750 [ 60%] (Sampling)
Chain 4, Iteration: 600 / 750 [ 80%] (Sampling)
Chain 2, Iteration: 525 / 750 [ 70%] (Sampling)
Chain 3, Iteration: 525 / 750 [ 70%] (Sampling)
Chain 1, Iteration: 525 / 750 [ 70%] (Sampling)
Chain 4, Iteration: 675 / 750 [ 90%] (Sampling)
Chain 2, Iteration: 600 / 750 [ 80%] (Sampling)
Chain 1, Iteration: 600 / 750 [ 80%] (Sampling)
Chain 4, Iteration: 750 / 750 [100%] (Sampling)
Elapsed Time: 595.036 seconds (Warm-up)
387.19 seconds (Sampling)
982.226 seconds (Total)

Chain 3, Iteration: 600 / 750 [ 80%] (Sampling)
Chain 2, Iteration: 675 / 750 [ 90%] (Sampling)
Chain 1, Iteration: 675 / 750 [ 90%] (Sampling)
Chain 2, Iteration: 750 / 750 [100%] (Sampling)
Elapsed Time: 727.999 seconds (Warm-up)
328.42 seconds (Sampling)
1056.42 seconds (Total)

Chain 1, Iteration: 750 / 750 [100%] (Sampling)
Elapsed Time: 738.686 seconds (Warm-up)
323.963 seconds (Sampling)
1062.65 seconds (Total)

Chain 3, Iteration: 675 / 750 [ 90%] (Sampling)
Chain 3, Iteration: 750 / 750 [100%] (Sampling)
Elapsed Time: 614.926 seconds (Warm-up)
530.843 seconds (Sampling)
1145.77 seconds (Total)

Inference for Stan model: ad_attribution.
4 chains, each with iter=750; warmup=375; thin=1;
post-warmup draws per chain=375, total post-warmup draws=1500.

     mean se_mean    sd     2.5%     25%      50%      75%    97.5% n_eff Rhat
lp__ -1722.65 0.53 11.58 -1745.98 -1730.4 -1722.15 -1714.47 -1701.15 485 1

Samples were drawn using NUTS(diag_e) at Fri Jul 7 13:58:40 2017.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
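Summarizing the per-chain totals above with a quick Python sketch (the numbers are copied from the logs; serial wall time is the sum over chains, parallel wall time is the slowest chain):

```python
# Per-chain total times in seconds, copied from the logs above.
serial = [640.729, 493.944, 526.883, 515.562]
parallel = [982.226, 1056.42, 1062.65, 1145.77]

serial_wall = sum(serial)       # chains run back to back
parallel_wall = max(parallel)   # stan() returns when the slowest chain finishes

speedup = serial_wall / parallel_wall
print(f"serial {serial_wall:.0f}s, parallel {parallel_wall:.0f}s, "
      f"speedup {speedup:.2f}x")
```

This comes out to about 1.9x, close to the 1.8 I reported earlier.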


#6

Compared to advertisement attribution, where parallel chains take longer per chain than serial chains, the parallel and serial runs of twelve cities take about the same amount of time. That explains why twelve cities gets a better speedup.

I’m wondering how Stan invokes “cores” (via fork or something?), and whether it is the OS that decides the scheduling of the processes or threads.


#7

RStan does sockets on Windows and forks on everything else. The wall time is random, although you won’t get anything close to a linear speedup unless each chain is on a different physical core.
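As a rough illustration of the fork side (this is not RStan’s actual code, just a minimal POSIX sketch in Python): each chain becomes an independent OS process, and the kernel alone decides which core each one lands on.

```python
import os

N_CHAINS = 4  # stand-in for RStan's `chains` argument

pids = []
for chain_id in range(N_CHAINS):
    pid = os.fork()                 # POSIX-only; Windows needs sockets/spawn
    if pid == 0:
        # Child process: a real chain would run the sampler here.
        os._exit(chain_id)          # exit code stands in for the chain's result
    pids.append(pid)

# Parent: like stan(), we can only finish when the slowest chain finishes.
exit_codes = []
for pid in pids:
    _, status = os.waitpid(pid, 0)
    exit_codes.append(os.WEXITSTATUS(status))
print(sorted(exit_codes))
```

Nothing in this picture synchronizes the children with each other, which matches the “no coordination between chains” answers above.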


#8

This can happen, but if the sampler is well-behaved then the differences from chain to chain should be small.

You can do this only if each of the individual chains are well-behaved.

When you run MCMC you have to be very careful that you’re getting an accurate answer, which is hard because there’s no way to prove that a given sampler will be accurate for your specific model. Instead all we have are conditions that we know should not happen for well-behaved chains (i.e. we have necessary conditions but not sufficient ones).

One of the common ways that these pathologies manifest is chains behaving differently depending on where they are in parameter space. This is why we run many chains and compare them with R-hat – if the behavior of even one chain deviates from the behavior of the others then we should doubt the validity of all of the chains.

So if you run multiple chains and you see drastically different speeds then it’s likely your sampler can’t handle your model. To verify this check out the R-hats as well as all of the other diagnostics (especially divergences). If the diagnostics look fine and the speed differences aren’t huge (between 1 and 2) then the variation is likely due to small differences in adaptation or core performance, in which case the variations can be ignored.
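To make the R-hat comparison concrete, here is a rough split-R-hat sketch in Python/NumPy (a simplified version of the textbook formula; Stan’s actual implementation differs in details such as rank normalization in newer versions):

```python
import numpy as np

def split_rhat(chains):
    # chains: array of shape (n_chains, n_draws). Split each chain in half,
    # then compute the classic potential scale reduction factor on the halves.
    n = chains.shape[1] // 2
    halves = chains[:, :2 * n].reshape(-1, n)     # 2 * n_chains half-chains
    W = halves.var(axis=1, ddof=1).mean()         # within-chain variance
    B = n * halves.mean(axis=1).var(ddof=1)       # between-chain variance
    var_plus = (n - 1) / n * W + B / n            # pooled variance estimate
    return float(np.sqrt(var_plus / W))

rng = np.random.default_rng(0)
good = rng.normal(size=(4, 500))                   # four well-mixed chains
bad = good + np.array([[0.0], [0.0], [0.0], [3.0]])  # one chain stuck elsewhere
print(split_rhat(good), split_rhat(bad))
```

With the deviating chain, R-hat jumps well above 1, which is exactly the signal described above.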


#9

Whether to put one process on one physical core is decided by the OS. I do use the same OS across the 3 CPU platforms. If that is the case, it sounds like switching the OS/scheduler should give us different speedup numbers.
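One way to take the scheduler out of the picture is to pin each chain’s process to a core and re-measure. A Linux-only sketch in Python (the CPU id here is an arbitrary example):

```python
import os

# Query which logical CPUs the current process is allowed to run on.
allowed = os.sched_getaffinity(0)
print("may run on logical CPUs:", sorted(allowed))

# Pin this process to logical CPU 0. Doing this for each chain's process,
# one physical core apiece, forces the placement instead of trusting the OS.
os.sched_setaffinity(0, {0})
print("pinned to:", sorted(os.sched_getaffinity(0)))
```

If the speedup changes under explicit pinning, the scheduler was at least part of the story.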


#10

Unless the computer has something else intensive going on, at least when the computer is not running on battery, the OS should be smart enough to not put four chains on one physical core when there are other physical cores idle. I don’t think it is worth worrying about.


#11

I’m working on computer systems, benchmarking RStan specifically. I’m trying to understand the performance of the models as I scale up the number of cores. When I observed the consistently low speedup of models like advertisement attribution across 3 CPUs, I thought it was some intrinsic property of the model. I’ll do some experiments to test whether it’s because of the OS, though.

Thanks a lot!


#12

I thought that in RStan, running multiple independent chains and then combining them with the sflist2stanfit function does exactly the same thing as setting the chains argument of sampling()? Is there a difference when the chains are poorly behaved?


#13

Would you point me to the part of the code where Stan forks new processes?

Thanks!


#14

Also, my reply starting with “I’m working on computer systems” was meant to be a reply to you. I forgot to click the reply button…


#16

Operationally they do the same thing, but in either case you should merge samples from separate chains only when the samplers have not encountered any pathologies, for similar reasons to why you can’t just blindly trust MCMC estimators.


#17

Update: the speedups are actually different across the 3 CPU platforms; there was a plotting bug. It’s probably because the system schedules the processes differently. Now it makes sense.

Thanks a lot everyone!


#18

There can also be problems with inits, with some inits out in the tail making it very hard to find the typical set through our adaptation scheme, even if sampling would be OK if you could find it.


#19

It may just be variation from run to run. Also, if you have only two physical cores, running three chains is going to slow you down relative to running two chains in parallel (many CPUs report double their number of physical cores because of Intel’s “hyperthreading” marketing, which made its way down into the chip’s responses to queries about the number of cores). A third confounding factor is memory bandwidth. If you have relatively large models and relatively slow memory and/or memory buses, you can see slowdowns due to memory contention as well as processor contention.
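One quick way to check whether the reported core count includes hyperthreads, sketched in Python for Linux/x86 (the `/proc/cpuinfo` field names vary on other platforms):

```python
import os

logical = os.cpu_count()            # logical CPUs, hyperthreads included

# Physical cores = distinct (physical id, core id) pairs in /proc/cpuinfo.
physical = set()
phys_id = core_id = None
with open("/proc/cpuinfo") as f:
    for line in f:
        if line.startswith("physical id"):
            phys_id = line.split(":", 1)[1].strip()
        elif line.startswith("core id"):
            core_id = line.split(":", 1)[1].strip()
        elif not line.strip():      # blank line ends one CPU entry
            if phys_id is not None and core_id is not None:
                physical.add((phys_id, core_id))
            phys_id = core_id = None
if phys_id is not None and core_id is not None:
    physical.add((phys_id, core_id))  # flush last entry if no trailing blank

print(f"{logical} logical CPUs, {len(physical)} physical cores")
```

If the logical count is twice the physical count, running more chains than physical cores will not scale linearly.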


#20

Thanks for replying! The variation from run to run for the advertisement attribution model is not very large (a standard deviation of 29 seconds against a mean runtime of 1200 seconds with 4 cores). The CPU (i7-6700K) has 4 physical cores, which is why twelve cities can reach a speedup of 4.

The measured memory bandwidth of the model is 12.6 GB/s, compared to the peak bandwidth of my processor, which is 34 GB/s. So advertisement attribution does have a very high bandwidth demand. It’s possible the processor’s peak is actually read + write combined, which would mean reads and writes may each get 10+ GB/s. The high bandwidth demand arises because the processor has a relatively small last-level cache. Once I switched to a processor with a much larger last-level cache, advertisement attribution achieved a speedup close to 4.

I originally thought the OS somehow kept scheduling advertisement attribution onto 2 physical cores, but the bandwidth seems like a more reasonable explanation.
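The back-of-the-envelope version of that argument, assuming the 12.6 GB/s figure is per chain:

```python
per_chain_bw = 12.6   # GB/s, measured for one chain of advertisement attribution
peak_bw = 34.0        # GB/s, peak memory bandwidth of the processor
n_chains = 4

demand = per_chain_bw * n_chains           # 50.4 GB/s wanted vs. 34 available
ceiling = min(n_chains, peak_bw / per_chain_bw)  # best case once bw saturates
print(f"demand {demand:.1f} GB/s, speedup ceiling ~{ceiling:.1f}x")
```

A bandwidth ceiling of roughly 2.7x is an upper bound, not a prediction, but it is at least consistent with the observed 1.8x staying well below 4.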