Chains don't progress in parallel + model does nothing for a while before a quick sampling

Hi,

I am working with an ODE model in Stan. I have used map_rect() so that each person’s likelihood is computed in parallel, to make the model faster.

When I tested 4 short chains before running longer ones (to make sure no chain would get stuck somewhere and make the model fitting extremely slow), I noticed that the 4 chains often don’t progress in parallel (I have cores = min(nChains, parallel::detectCores())).

Even stranger, sometimes the model does nothing (it doesn’t even consume CPU) for quite a while before it starts sampling (and the sampling itself can be quick…)! Below is an example from last night (100 people + 4 chains + 40 iterations for testing). I was surprised to see that the chains ran more or less sequentially rather than in parallel. I am not sure whether this is due to a memory issue (I occasionally had stack memory problems), to the model specification (e.g. too wide a parameter space), or to map_rect(). How can I make the chains run in parallel so that fitting is faster?

#####################################

Click the Refresh button to see progress of the chains
starting worker pid=22288 on localhost:11798 at 20:33:14.974
starting worker pid=20188 on localhost:11798 at 20:33:15.315
starting worker pid=22884 on localhost:11798 at 20:33:15.629
starting worker pid=25336 on localhost:11798 at 20:33:15.930

SAMPLING FOR MODEL ‘VK_Drug_Parallel’ NOW (CHAIN 1).
Chain 1:
Chain 1: Gradient evaluation took 0.031 seconds
Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 310 seconds.
Chain 1: Adjust your expectations accordingly!
Chain 1:
Chain 1:
Chain 1: WARNING: There aren’t enough warmup iterations to fit the
Chain 1: three stages of adaptation as currently configured.
Chain 1: Reducing each adaptation stage to 15%/75%/10% of
Chain 1: the given number of warmup iterations:
Chain 1: init_buffer = 3
Chain 1: adapt_window = 15
Chain 1: term_buffer = 2
Chain 1:
Chain 1: Iteration: 1 / 40 [ 2%] (Warmup)

SAMPLING FOR MODEL ‘VK_Drug_Parallel’ NOW (CHAIN 2).
Chain 2:
Chain 2: Gradient evaluation took 0.046 seconds
Chain 2: 1000 transitions using 10 leapfrog steps per transition would take 460 seconds.
Chain 2: Adjust your expectations accordingly!
Chain 2:
Chain 2:
Chain 2: WARNING: There aren’t enough warmup iterations to fit the
Chain 2: three stages of adaptation as currently configured.
Chain 2: Reducing each adaptation stage to 15%/75%/10% of
Chain 2: the given number of warmup iterations:
Chain 2: init_buffer = 3
Chain 2: adapt_window = 15
Chain 2: term_buffer = 2
Chain 2:

SAMPLING FOR MODEL ‘VK_Drug_Parallel’ NOW (CHAIN 3).
Chain 3:
Chain 3: Gradient evaluation took 0.031 seconds
Chain 3: 1000 transitions using 10 leapfrog steps per transition would take 310 seconds.
Chain 3: Adjust your expectations accordingly!
Chain 3:
Chain 3:
Chain 3: WARNING: There aren’t enough warmup iterations to fit the
Chain 3: three stages of adaptation as currently configured.
Chain 3: Reducing each adaptation stage to 15%/75%/10% of
Chain 3: the given number of warmup iterations:
Chain 3: init_buffer = 3
Chain 3: adapt_window = 15
Chain 3: term_buffer = 2
Chain 3:
Chain 3: Iteration: 1 / 40 [ 2%] (Warmup)
Chain 3: Iteration: 5 / 40 [ 12%] (Warmup)

SAMPLING FOR MODEL ‘VK_Drug_Parallel’ NOW (CHAIN 4).
Chain 4:
Chain 4: Gradient evaluation took 0.062 seconds
Chain 4: 1000 transitions using 10 leapfrog steps per transition would take 620 seconds.
Chain 4: Adjust your expectations accordingly!
Chain 4:
Chain 4:
Chain 4: WARNING: There aren’t enough warmup iterations to fit the
Chain 4: three stages of adaptation as currently configured.
Chain 4: Reducing each adaptation stage to 15%/75%/10% of
Chain 4: the given number of warmup iterations:
Chain 4: init_buffer = 3
Chain 4: adapt_window = 15
Chain 4: term_buffer = 2
Chain 4:
Chain 4: Iteration: 1 / 40 [ 2%] (Warmup)
Chain 3: Iteration: 10 / 40 [ 25%] (Warmup)
Chain 3: Iteration: 15 / 40 [ 37%] (Warmup)
Chain 3: Iteration: 20 / 40 [ 50%] (Warmup)
Chain 3: Iteration: 21 / 40 [ 52%] (Sampling)
Chain 3: Iteration: 25 / 40 [ 62%] (Sampling)
Chain 3: Iteration: 30 / 40 [ 75%] (Sampling)
Chain 3: Iteration: 35 / 40 [ 87%] (Sampling)
Chain 3: Iteration: 40 / 40 [100%] (Sampling)
Chain 3:
Chain 3: Elapsed Time: 814.961 seconds (Warm-up)
Chain 3: 482.029 seconds (Sampling)
Chain 3: 1296.99 seconds (Total)
Chain 3:
Chain 1: Iteration: 5 / 40 [ 12%] (Warmup)
Chain 1: Iteration: 10 / 40 [ 25%] (Warmup)
Chain 1: Iteration: 15 / 40 [ 37%] (Warmup)
Chain 1: Iteration: 20 / 40 [ 50%] (Warmup)
Chain 1: Iteration: 21 / 40 [ 52%] (Sampling)
Chain 1: Iteration: 25 / 40 [ 62%] (Sampling)
Chain 1: Iteration: 30 / 40 [ 75%] (Sampling)
Chain 1: Iteration: 35 / 40 [ 87%] (Sampling)
Chain 1: Iteration: 40 / 40 [100%] (Sampling)
Chain 1:
Chain 1: Elapsed Time: 7081.22 seconds (Warm-up)
Chain 1: 277.698 seconds (Sampling)
Chain 1: 7358.92 seconds (Total)
Chain 1:
Chain 4: Iteration: 5 / 40 [ 12%] (Warmup)
Chain 4: Iteration: 10 / 40 [ 25%] (Warmup)
Chain 4: Iteration: 15 / 40 [ 37%] (Warmup)
Chain 4: Iteration: 20 / 40 [ 50%] (Warmup)
Chain 4: Iteration: 21 / 40 [ 52%] (Sampling)
Chain 4: Iteration: 25 / 40 [ 62%] (Sampling)
Chain 4: Iteration: 30 / 40 [ 75%] (Sampling)
Chain 2: Iteration: 1 / 40 [ 2%] (Warmup)
Chain 2: Iteration: 5 / 40 [ 12%] (Warmup)
Chain 4: Iteration: 35 / 40 [ 87%] (Sampling)
Chain 4: Iteration: 40 / 40 [100%] (Sampling)
Chain 4:
Chain 4: Elapsed Time: 9534.48 seconds (Warm-up)
Chain 4: 256.129 seconds (Sampling)
Chain 4: 9790.61 seconds (Total)
Chain 4:
Chain 2: Iteration: 10 / 40 [ 25%] (Warmup)
Chain 2: Iteration: 15 / 40 [ 37%] (Warmup)
Chain 2: Iteration: 20 / 40 [ 50%] (Warmup)
Chain 2: Iteration: 21 / 40 [ 52%] (Sampling)
Chain 2: Iteration: 25 / 40 [ 62%] (Sampling)
Chain 2: Iteration: 30 / 40 [ 75%] (Sampling)
Chain 2: Iteration: 35 / 40 [ 87%] (Sampling)
Chain 2: Iteration: 40 / 40 [100%] (Sampling)
Chain 2:
Chain 2: Elapsed Time: 365.391 seconds (Warm-up)
Chain 2: 206.01 seconds (Sampling)
Chain 2: 571.401 seconds (Total)
Chain 2:
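
To make the imbalance in the log above easier to see, the elapsed times can be tabulated with a few lines of R. This is only a sketch: the numbers are copied by hand from the "Elapsed Time" lines, and the column names (`warmup_share`, `slowdown`) are illustrative choices, not anything produced by rstan.

```r
# Elapsed times in seconds, copied by hand from the log above.
timing <- data.frame(
  chain    = c(1, 2, 3, 4),
  warmup   = c(7081.22, 365.391, 814.961, 9534.48),
  sampling = c(277.698, 206.01, 482.029, 256.129)
)

# Total time per chain and the fraction of it spent in warmup.
timing$total        <- timing$warmup + timing$sampling
timing$warmup_share <- round(timing$warmup / timing$total, 3)

# Ratio of the slowest chain's total time to the fastest chain's.
slowdown <- max(timing$total) / min(timing$total)

print(timing)
print(round(slowdown, 1))
```

In this run the two slowest chains (1 and 4) spend almost all of their time in warmup, while their sampling times are comparable to those of the fast chains.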

If you are running 4 chains on 4 cores, you shouldn’t expect speedup from within-chain parallelization. I don’t know how your scheduler handles requests for 400 processes (100 shards x 4 chains) on 4 cores, but maybe that’s responsible for the behavior you’re seeing. If you have 4 physical cores available and want to run 4 chains quickly, don’t turn on within-chain parallelization.
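
As a rough illustration of the core budget, here is a minimal R sketch. It assumes the model was compiled with threading enabled (`-DSTAN_THREADS`) and that the number of threads per chain is controlled through the `STAN_NUM_THREADS` environment variable; `threads_per_chain` is a hypothetical helper written for this example, not an rstan function.

```r
library(parallel)

# Hypothetical helper (not part of rstan): split the available cores
# evenly across chains, giving every chain at least one thread.
threads_per_chain <- function(n_cores, n_chains) {
  stopifnot(n_cores >= 1, n_chains >= 1)
  max(1L, as.integer(floor(n_cores / n_chains)))
}

n_chains <- 4

# Physical cores if the platform reports them; fall back to logical cores.
n_cores <- detectCores(logical = FALSE)
if (is.na(n_cores)) n_cores <- detectCores()
if (is.na(n_cores)) n_cores <- 1L

n_threads <- threads_per_chain(n_cores, n_chains)

# Assumption: the chains pick this value up from the environment, so it
# has to be set before sampling starts.
Sys.setenv(STAN_NUM_THREADS = n_threads)

# With 4 cores and 4 chains there is one thread per chain, i.e. no room
# for within-chain parallelism.
print(threads_per_chain(4, 4))
```

If the product of chains and threads per chain exceeds the number of physical cores, the chains compete for the same cores instead of running side by side.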

Thank you. I have 62 physical cores + 100 GB RAM. I must use within-chain parallelization; otherwise it is not possible to run my ODE model on 4500+ people.

I am not clear on how the resources are distributed among the chains, but I can see in the Task Manager that the 4 chains use all CPUs (I don’t know why there are 5 Rscript.exe processes).

I also noticed that the model gets faster (more chains progress in parallel) when I store fewer parameters. So is there a memory issue as well? But only 11% of the 100 GB RAM is used when I run the model.

Do you have any idea how to accelerate the model? Thank you.

Yeah, that should be plenty to see speedup, but I was confused by this: "I have cores = min(nChains, parallel::detectCores())".

In terms of getting more speedup, when you say that the model gets faster when you store fewer parameters:

That could maybe be due to the speed of writing the output, or it could definitely arise if, when you store fewer parameters, you end up writing code that packs more of the computation inside the map_rect call.