I’d like to consider what we’re optimizing when we parallelize evaluation of the log density or the MCMC algorithm itself. I was writing a comment for Stan issue #2818, but thought the discussion was worth generalizing. That issue and its associated branch illustrate a 35% speedup of a single chain by using two cores to evaluate the Hamiltonian forward and backward in time simultaneously, but the exact percentage matters less than when we need the speedup.
There are two primitive tasks to evaluate.

Task 1: speed to convergence

Task 2: n_eff / sec after convergence
For task 1, parallelizing a single chain dominates. For task 2, running multiple chains dominates because MCMC is embarrassingly parallel. To summarize,
1. After convergence, we’re better off running multiple chains than parallelizing a single chain.
2. Before convergence, we’re better off parallelizing log density evaluations and the MCMC algorithm.
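The tradeoff can be put in back-of-the-envelope terms. All the numbers below are assumptions for illustration (only the 35% figure comes from issue #2818): with K cores, K independent chains multiply post-convergence n_eff / sec by K, whereas within-chain parallelism gives a fixed, smaller speedup.

```python
# Hypothetical comparison of the two strategies on 2 cores.
# Numbers are assumed, not measured.

def neff_per_sec_multi_chain(neff_per_sec_single, n_cores):
    """Post-convergence throughput of n_cores independent chains."""
    return neff_per_sec_single * n_cores

def neff_per_sec_parallel_chain(neff_per_sec_single, speedup):
    """Throughput of one chain sped up by within-chain parallelism,
    e.g. the ~1.35x single-chain speedup reported in issue #2818."""
    return neff_per_sec_single * speedup

base = 10.0  # assumed n_eff/sec for one chain on one core
print(neff_per_sec_multi_chain(base, 2))        # 2 chains:  20.0
print(neff_per_sec_parallel_chain(base, 1.35))  # 1 chain:   13.5
```

Once chains have converged, the multi-chain column wins on the same hardware; the within-chain speedup pays off only where extra chains don’t help, i.e. before convergence.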
What I think we do in practice is a combination of tasks 1 and 2.

Task 3: speed to a given n_eff
We might target n_eff = 1 for debugging and rough model exploration. We’ll target n_eff = 100 or 1000 for inference, though our editors and users may demand more.
The smaller the n_eff target in task 3, the larger the relative performance gain we’ll get from parallelizing the log density and HMC algorithms.
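A small worked example makes the point. The warmup time and sampling rate below are assumed numbers, not measurements: total time to a target is warmup time plus target / (n_eff / sec), and within-chain parallelism mostly shrinks the warmup term.

```python
# Sketch (all numbers assumed) of time to reach a target n_eff.

def time_to_target(warmup_sec, neff_per_sec, target_neff):
    """Seconds to reach target_neff, including adaptation time."""
    return warmup_sec + target_neff / neff_per_sec

warmup, rate = 100.0, 10.0  # assumed: 100 s warmup, 10 n_eff/s after
for target in (1, 100, 10_000):
    total = time_to_target(warmup, rate, target)
    print(target, total, round(warmup / total, 3))  # warmup fraction
```

At n_eff = 1 essentially all the time is warmup, so speeding up adaptation nearly halves wall time per extra core it can use; at n_eff = 10,000 warmup is under a tenth of the total and extra chains dominate.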
What I’d like to think about going forward is how to
1. massively parallelize adaptation, and
2. monitor adaptation for convergence so we don’t do more of it than we have to.
During adaptation of static Euclidean HMC, our goal is to
1. find the typical set, and then
2. explore the typical set thoroughly enough to
   a. estimate covariance (for the metric), and
   b. adapt the step size to match the acceptance target.
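The two quantities estimated in step 2 can be sketched in a few lines. This is not Stan’s actual adaptation code (Stan uses windowed estimators and dual averaging); it is a minimal stand-in with assumed draws, a diagonal metric, and a crude multiplicative step-size rule.

```python
import numpy as np

# Assumed warmup draws standing in for typical-set exploration.
rng = np.random.default_rng(0)
warmup_draws = rng.normal(size=(500, 3))  # 500 draws, 3 parameters

# Diagonal metric: per-parameter posterior variance estimates.
metric = np.var(warmup_draws, axis=0, ddof=1)

def adapt_step_size(step_size, accept_rate, target=0.8, lr=0.1):
    """Crude multiplicative update toward a target acceptance rate
    (a stand-in for Stan's dual-averaging adaptation)."""
    return step_size * np.exp(lr * (accept_rate - target))

# Accepting too often means the step size can grow.
eps = adapt_step_size(1.0, 0.95)
print(metric, eps)
```

Both estimates improve with more exploration of the typical set, which is why monitoring when they have stabilized tells us when adaptation can stop.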