I’d like to consider what we’re evaluating when we parallelize evaluation of the log density or MCMC algorithms. I was writing a comment for Stan issue #2818, but thought the discussion should be generalized. That issue and its associated branch illustrate a 35% speedup of a single chain by using two cores to evaluate the Hamiltonian forward and backward in time simultaneously, but the percentage speedup isn’t as important as when we need it.
There are two primitive evaluations.
- Task 1: speed to convergence
- Task 2: n_eff / sec after convergence
For Task 1, parallelizing a single chain dominates. For Task 2, running multiple chains dominates because MCMC is embarrassingly parallel. To summarize,
- After convergence, we’re better off running multiple chains than parallelizing a single chain. Before convergence, we’re better off parallelizing log density evaluations and the MCMC algorithm.
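To make the multiple-chains case concrete, here’s a minimal sketch of embarrassingly parallel chains using Python’s multiprocessing. `run_chain` is a hypothetical stand-in for a full warmup-plus-sampling run, not Stan’s interface.

```python
from multiprocessing import Pool

import numpy as np


def run_chain(seed):
    """Hypothetical stand-in for one full MCMC run (warmup plus
    sampling); here it just returns draws from a toy sampler."""
    rng = np.random.default_rng(seed)
    return rng.normal(size=1000)


if __name__ == "__main__":
    # Chains share nothing, so after convergence the n_eff / sec of
    # four chains on four cores is roughly 4x that of one chain on
    # one core, which no within-chain parallelization can match.
    with Pool(processes=4) as pool:
        draws = pool.map(run_chain, [1, 2, 3, 4])
    combined = np.concatenate(draws)
```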
What I think we do in practice is a combination of Tasks 1 and 2.
- Task 3: speed to a given n_eff
We might target n_eff = 1 for debugging and rough model exploration. We’ll target n_eff = 100 or 1000 for inference, though our editors and users may demand more.
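As a reminder of what’s being targeted, here’s a rough single-chain version of the standard autocorrelation-based n_eff estimate. It’s a simplification: Stan’s estimator combines multiple chains and truncates the autocorrelation sum more carefully (Geyer’s initial sequence method).

```python
import numpy as np


def n_eff(draws):
    """Rough effective sample size for a single chain:
    N / (1 + 2 * sum of autocorrelations), truncating the sum at
    the first nonpositive autocorrelation."""
    x = np.asarray(draws, dtype=float)
    n = len(x)
    x = x - x.mean()
    acov = np.correlate(x, x, mode="full")[n - 1:] / n
    rho = acov / acov[0]
    tau = 1.0
    for t in range(1, n):
        if rho[t] <= 0:
            break
        tau += 2.0 * rho[t]
    return n / tau
```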
The smaller the n_eff target in Task 3, the larger the relative performance gain we’ll get from parallelizing log density evaluations and the HMC algorithm.
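To see why, assume a toy cost model: total time is warmup time plus the n_eff target divided by the post-convergence n_eff / sec rate. With two cores we can either run one chain sped up within-chain (say by the ~35% from issue #2818) or two independent chains that each pay full warmup but double the combined sampling rate. All the constants below are made up; only the crossover pattern matters.

```python
def one_parallel_chain(n_eff_target, warmup_sec=60.0,
                       n_eff_per_sec=10.0, speedup=1.35):
    # Within-chain parallelism speeds up warmup and sampling alike.
    return (warmup_sec + n_eff_target / n_eff_per_sec) / speedup


def two_chains(n_eff_target, warmup_sec=60.0, n_eff_per_sec=10.0):
    # Each chain pays full warmup, but post-warmup n_eff / sec doubles.
    return warmup_sec + n_eff_target / (2 * n_eff_per_sec)


for target in (1, 100, 1000):
    print(f"n_eff = {target:4d}: one parallel chain "
          f"{one_parallel_chain(target):6.1f}s, "
          f"two chains {two_chains(target):6.1f}s")
```

With these hypothetical numbers, the single parallelized chain wins at n_eff = 1 and 100, and the two chains win at n_eff = 1000.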
What I’d like to think about going forward is how to
- massively parallelize adaptation, and
- monitor adaptation for convergence so we don’t do more of it than we have to.
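One candidate monitor, offered as an assumption rather than anything Stan currently does during warmup: compute split R-hat across chains over a trailing window of warmup draws and stop adapting once it falls below a threshold. A minimal split-R-hat sketch:

```python
import numpy as np


def split_rhat(chains):
    """Split R-hat for an (m, n) array of m chains of n draws each:
    split each chain in half and compare within-half and between-half
    variances (the potential scale reduction factor)."""
    x = np.asarray(chains, dtype=float)
    m, n = x.shape
    half = n // 2
    halves = x[:, : 2 * half].reshape(2 * m, half)
    w = halves.var(axis=1, ddof=1).mean()         # within-half variance
    b = half * halves.mean(axis=1).var(ddof=1)    # between-half variance
    var_plus = (half - 1) / half * w + b / half
    return np.sqrt(var_plus / w)
```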
During adaptation of static Euclidean HMC, our goal is to
- find the typical set, and then
- explore the typical set thoroughly enough to
  - estimate covariance (for the metric), and
  - adapt the step size to match the acceptance target.
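For the covariance piece, Stan accumulates draws in a Welford-style running estimator during warmup windows; here’s a simplified diagonal version (variances only, with the regularization toward the identity that Stan applies omitted).

```python
import numpy as np


class WelfordVar:
    """Running mean/variance accumulator, one variance per dimension,
    as used for diagonal-metric adaptation (simplified sketch)."""

    def __init__(self, dim):
        self.n = 0
        self.mean = np.zeros(dim)
        self.m2 = np.zeros(dim)

    def add(self, draw):
        """Fold one warmup draw into the running moments."""
        self.n += 1
        delta = draw - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (draw - self.mean)

    def variance(self):
        # Sample variance of the draws seen so far; the inverse metric
        # is set from a regularized version of this in practice.
        return self.m2 / (self.n - 1)
```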