I have to respectfully disagree with @sree_d: if anything this paper demonstrates many common mistakes that are made when trying to compare Markov chain Monte Carlo methods.
Every probabilistic computation method approximates expectation values with respect to a given target distribution; in the Bayesian context that target distribution is a posterior distribution of interest. The only well-defined performance metric across all possible methods is the cost needed to achieve a given approximation error for a given expectation value.
Markov chain Monte Carlo does not, in general, come with any guaranteed error quantification, and so there’s no way to quantify that performance without considering functions whose expectation values are already known exactly. In some cases, however, a Markov chain Monte Carlo central limit theorem holds which quantifies the approximation error probabilistically. When the central limit theorem holds, the scale of the error for the expectation value of the function f is given by the Markov chain Monte Carlo standard error,
\text{MCMC-SE}[f] = \sqrt{ \frac{ \text{Var}_{\pi}[f] }{ \text{ESS}_{\pi}[f] } },
where \text{ESS}_{\pi}[f] is the effective sample size for that function. For much more see Markov Chain Monte Carlo in Practice.
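To make that formula concrete, here is a minimal sketch of how one might estimate \text{ESS}_{\pi}[f] and \text{MCMC-SE}[f] from a single chain of draws of f. The autocorrelation-based ESS estimator below is deliberately crude (it truncates the autocorrelation sum at the first non-positive lag); in practice you would want a robust estimator like the ones implemented in Stan or ArviZ.

```python
import numpy as np

def ess(x):
    """Crude effective sample size estimate for a single chain of draws,
    truncating the autocorrelation sum at the first non-positive lag.
    A simplified stand-in for proper Geyer-style estimators."""
    x = np.asarray(x, dtype=float)
    n = x.size
    x = x - x.mean()
    # Empirical autocovariance at all lags, normalized so rho[0] = 1.
    acf = np.correlate(x, x, mode="full")[n - 1:]
    rho = acf / acf[0]
    tau = 1.0
    for t in range(1, n):
        if rho[t] <= 0:
            break
        tau += 2.0 * rho[t]
    return n / tau

def mcmc_se(f_draws):
    """MCMC standard error for E_pi[f]: sqrt(Var[f] / ESS[f])."""
    f_draws = np.asarray(f_draws, dtype=float)
    return np.sqrt(f_draws.var(ddof=1) / ess(f_draws))

# Hypothetical draws of f from some sampler targeting pi; here just iid noise.
draws = np.random.default_rng(1).normal(size=5000)
print(ess(draws), mcmc_se(draws))
```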
If a Markov chain Monte Carlo central limit theorem holds for all of the samplers being considered then sampler performance can be ranked by \text{MCMC-SE}[f] per unit computational cost. The (wall clock) run time is often used as a proxy for computational cost. If only the ranking is of interest, and not the actual distance between performance quantifications, then the same ranking is given by comparing \text{ESS}_{\pi}[f] per unit computational cost. Be careful, however: because \text{ESS}_{\pi}[f] and \text{MCMC-SE}[f] are nonlinearly related, the differences in performance will not scale in the same way.
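To illustrate with hypothetical numbers: if sampler A achieves \text{ESS}_{\pi}[f] = 100 per second of run time and sampler B achieves \text{ESS}_{\pi}[f] = 400 per second, then B looks four times better on the ESS-per-cost scale, but because \text{MCMC-SE}[f] \propto 1 / \sqrt{\text{ESS}_{\pi}[f]} its standard error is only \sqrt{4} = 2 times smaller for the same run time.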
If a Markov chain Monte Carlo central limit theorem doesn’t hold then the effective sample size \text{ESS}_{\pi}[f] is meaningless. An empirical effective sample size can still be computed, but it won’t correspond to any meaningful quantity, let alone quantify performance.
In the ideal circumstance where all of the central limit theorems hold, \text{ESS}_{\pi}[f] / \text{Cost} can be further decomposed into the product of \text{ESS}_{\pi}[f] / N_{\text{iterations}} and N_{\text{iterations}} / \text{Cost}, in other words the incremental effective sample size per iteration and the number of iterations that can be run per unit cost. Care has to be taken, however, because neither number means much by itself; it’s only the product that matters for quantifying sampler performance.
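As a hypothetical illustration of why only the product matters: a carefully tuned sampler might achieve \text{ESS}_{\pi}[f] / N_{\text{iterations}} = 0.9 but manage only 10 iterations per second, giving 9 effective samples per second, while a cruder sampler with \text{ESS}_{\pi}[f] / N_{\text{iterations}} = 0.01 that runs 2000 iterations per second delivers 20 effective samples per second and wins despite its far worse incremental effective sample size.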
With all of that said, in order to properly compare performance between Markov chain Monte Carlo algorithms we would have to:
1. Fix a target distribution \pi.
2. Verify that all samplers satisfy Markov chain Monte Carlo central limit theorems for \pi.
3. Fix a well-behaved function of interest f.
4. Run each sampler and compute \text{MCMC-SE}[f] / \text{Cost} or \text{ESS}_{\pi}[f] / \text{Cost} (a sketch of this workflow follows below).
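As a rough sketch of what that workflow might look like in code, reusing the ess and mcmc_se helpers from the earlier snippet. The samplers here are hypothetical stand-ins, not any particular library’s API.

```python
import numpy as np

def compare_samplers(samplers, n_iterations):
    """Run each sampler on the same fixed target and function f, then report
    the estimate, MCMC-SE[f], and ESS[f] per unit (wall clock) cost.

    `samplers` maps a name to a callable returning (draws of f, seconds);
    these callables are hypothetical wrappers around real samplers."""
    results = {}
    for name, run_sampler in samplers.items():
        f_draws, seconds = run_sampler(n_iterations)
        results[name] = {
            "estimate": float(np.mean(f_draws)),
            "se": mcmc_se(f_draws),                  # MCMC-SE[f]
            "ess_per_cost": ess(f_draws) / seconds,  # ESS[f] / Cost
        }
    return results
```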
Step 2 is very difficult for many samplers. In practice often the best we can do is verify it for one sampler and then use the estimated expectation value from that sampler as a ground truth when quantifying the errors for the other samplers. At the very least the sampler estimates have to be compared to ensure that they’re all compatible with each other. This isn’t perfect, but it’s often all that’s feasible.
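A minimal version of that compatibility check, building on the hypothetical results dictionary sketched above; the three-standard-error threshold is just an illustrative choice.

```python
def check_consistency(results, n_se=3.0):
    """Flag pairs of samplers whose estimates differ by more than n_se
    combined standard errors; a large discrepancy suggests that at least
    one of them is not faithfully sampling the target."""
    names = list(results)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            gap = abs(results[a]["estimate"] - results[b]["estimate"])
            combined_se = np.hypot(results[a]["se"], results[b]["se"])
            if gap > n_se * combined_se:
                print(f"{a} vs {b}: |difference| = {gap:.3g} "
                      f"exceeds {n_se} x {combined_se:.3g}")
```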
The NIMBLE comparison paper doesn’t verify Markov chain Monte Carlo central limit theorems, it doesn’t compare the actual estimates for consistency, and it doesn’t make comparisons in the context of a single, fixed function. There are multiple other misunderstandings throughout the paper. For example, with automatic differentiation the cost of evaluating the gradient is not much higher than the cost of evaluating the target density itself, and, as @jsocolar mentioned, the cost per gradient evaluation is an incomplete number by itself because it doesn’t take into account the effective sample size per gradient evaluation. Both of these numbers are affected by the choice of prior model, but in very different ways.
Markov chain Monte Carlo, and probabilistic computation in general, is both conceptually subtle and mathematically sophisticated. It’s complex and difficult, and misunderstandings are inevitable. But sloppy comparisons like this propagate those misunderstandings, which then confuse users who just want to quantify their posterior distributions.