This has been discussed before as part of other topics, most recently in
Blog: A gentle Stan to INLA comparison, but I think it would be good to have it all in one place.
Let’s say we have different types of algorithms for Bayesian inference: MCMC-based (HMC, MH, Gibbs…), structural approximations (INLA…) and hardware-accelerated (multicore/GPU) variants.
We want to compare them on a specific model and fixed input size, primarily in terms of computation speed, although we can’t do that without also looking at “accuracy”. We are also interested in how this changes as input size grows, but I assume that won’t be the difficult part.
It’s a “soft” question and there is no single “best” solution, so the goal is to come up with a procedure that most people wouldn’t have any major objections to if they saw it used in a scientific publication.
For now, just a few observations/questions to start the discussion:

Assuming that you have a good procedure for comparing algorithms, hardware-accelerated variants are in most cases trivial to deal with: the only difference is that they (hopefully) take less time to compute. Running independent MCMC chains and pooling the samples is an exception, because the results will be different (and possibly much worse than running a single chain). In this context I see it as a different type of MCMC-based inference, and it should be compared as such.
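To make the pooling concern concrete, here is a toy sketch of my own (not from the linked discussion): a minimal random-walk Metropolis sampler on a standard normal target, deliberately initialized far in the tail with no warmup. Pooling ten short chains pays the initialization transient ten times, while a single long chain with the same total number of draws pays it once, so the pooled estimate of the mean is noticeably worse.

```python
import numpy as np

def rw_metropolis(logp, x0, n, step, rng):
    """Minimal random-walk Metropolis sampler (illustrative only)."""
    x = x0
    out = np.empty(n)
    for i in range(n):
        prop = x + step * rng.normal()
        if np.log(rng.uniform()) < logp(prop) - logp(x):
            x = prop
        out[i] = x
    return out

logp = lambda x: -0.5 * x ** 2          # standard normal target, true mean 0
rng = np.random.default_rng(0)

# One long chain: 10,000 draws, one initialization transient.
long_chain = rw_metropolis(logp, x0=50.0, n=10_000, step=0.5, rng=rng)

# Ten pooled short chains: same total draws, but each chain pays the
# transient again because no warmup draws are discarded.
pooled = np.concatenate([
    rw_metropolis(logp, x0=50.0, n=1_000, step=0.5, rng=rng)
    for _ in range(10)
])

print(abs(long_chain.mean()), abs(pooled.mean()))
```

With warmup handled properly per chain, the pooled result would of course be fine; the point is only that “run k chains on k cores” is not automatically a free k-fold speedup of the same inference.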

For comparing different MCMC-based algorithms we seem to be closest to reaching a consensus: run them for the same amount of time and compare effective sample size (ESS), or run them until they reach a particular ESS and compare the times.
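As a rough illustration of the ESS half of that procedure, here is a crude single-chain ESS estimator (a Geyer-style truncation of the autocorrelation sum at the first negative pair; not Stan’s actual rank-normalized, multi-chain estimator), applied to an “ideal” iid sampler and a “sticky” AR(1) sampler of the same length:

```python
import numpy as np

def ess(chain):
    """Crude ESS: truncate the autocorrelation sum at the first negative
    pair of adjacent lags (Geyer-style). Illustrative only."""
    x = np.asarray(chain, dtype=float)
    n = len(x)
    x = x - x.mean()
    acov = np.correlate(x, x, mode="full")[n - 1:] / n
    rho = acov / acov[0]
    tau = 1.0                           # integrated autocorrelation time
    for t in range(1, n - 1, 2):
        pair = rho[t] + rho[t + 1]      # sum of two adjacent autocorrelations
        if pair < 0:                    # stop at the first negative pair
            break
        tau += 2.0 * pair
    return n / tau

rng = np.random.default_rng(1)
iid = rng.normal(size=5_000)            # "ideal" sampler: independent draws
ar = np.zeros(5_000)                    # "sticky" sampler: AR(1), phi = 0.9
for i in range(1, len(ar)):
    ar[i] = 0.9 * ar[i - 1] + rng.normal()

print(ess(iid))  # near n = 5000
print(ess(ar))   # far smaller; theory gives about n*(1-0.9)/(1+0.9) ≈ 263
```

The comparison metric would then be something like `ess(draws) / wall_time` for each algorithm, at the same model and input size.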

How do we deal with tuneable parameters such as the proportion of warmup samples? Technically, we could find the settings that maximize ESS (or minimize time), but I don’t think that’s how people do it in practice (= run it with defaults until something goes wrong), and it would probably take even longer than just running the initial chain for a long time.
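One way to see why tuning the warmup fraction is awkward: with a toy “chain” whose transient we control, we can grid over the fraction and score the post-warmup mean, but only because the true stationary mean (0 here) is known, which is exactly what we lack in a real benchmark:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 5_000
# Toy "chain": a slowly decaying initialization transient plus N(0, 1)
# stationary noise; the true stationary mean is 0 by construction.
transient = 8.0 * 0.995 ** np.arange(n)
chain = transient + rng.normal(size=n)

# Grid over the tuneable warmup fraction: discard the first f*n draws,
# then score the remaining sample mean against the known truth.
for f in (0.0, 0.1, 0.25, 0.5):
    est = chain[int(f * n):].mean()
    print(f"warmup fraction {f:.2f}: |error| = {abs(est):.3f}")
```

Too little warmup leaves initialization bias in; too much throws draws away for nothing. Tuning this per algorithm per model is exactly the extra cost mentioned above.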

How reliable is ESS for this purpose? There are different ways of estimating ESS, but they are all based on the assumption that the chain has converged. Intuitively (and from practical experience), I’d say it can fail miserably in some cases.
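A standard failure mode, sketched with synthetic “chains”: on a symmetric bimodal target, a chain stuck in one mode looks perfectly mixed on its own (so a within-chain ESS estimate would look excellent), and only a multi-chain diagnostic such as R-hat exposes the problem. The classic non-split R-hat is used below for brevity:

```python
import numpy as np

rng = np.random.default_rng(2)
# Target: symmetric mixture of N(-3, 1) and N(+3, 1); the true mean is 0.
# Each "chain" is stuck in one mode, where its draws look like perfectly
# mixed iid noise, so a within-chain ESS would look excellent.
chain_a = rng.normal(+3.0, 1.0, size=2_000)
chain_b = rng.normal(-3.0, 1.0, size=2_000)

def lag1_autocorr(x):
    x = x - x.mean()
    return float(np.dot(x[:-1], x[1:]) / np.dot(x, x))

def rhat(chains):
    """Classic (non-split) potential scale reduction factor."""
    n = len(chains[0])
    means = np.array([c.mean() for c in chains])
    w = np.mean([c.var(ddof=1) for c in chains])  # within-chain variance
    b = n * means.var(ddof=1)                     # between-chain variance
    var_plus = (n - 1) / n * w + b / n
    return float(np.sqrt(var_plus / w))

print(lag1_autocorr(chain_a))    # near 0: chain looks well mixed on its own
print(rhat([chain_a, chain_b]))  # far above 1.01: the chains never met
```

So a time-vs-ESS comparison is only meaningful alongside convergence diagnostics; an algorithm can “win” on ESS per second while silently exploring half the posterior.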

ESS can’t be used if something like INLA is included in the comparison. As some have done, we can replace ESS with how well the model estimates the parameters (the mean squared error of the estimated means, for example), but that can only be done if the true values of the parameters are known. Alternatively, we can measure overall goodness-of-fit with (approximate) cross-validation or test data (though this becomes a problem with temporal and spatial data).