I took the liberty of moving this to a new topic, as it is IMHO quite a separate discussion from the original post.
@louis-mandel: Really glad you joined the forums to give us clarifications! Thanks! I just want to reiterate the concern that it is unclear whether the wall-clock time comparison is sensible. The paper doesn’t state what exactly was compared - was it “both backends at their default settings”, “both backends computing N iterations”, or something else? As mentioned above, we currently believe a fair comparison should involve the effective sample size (ESS), e.g. by computing ESS/second. If the ESS/iteration ratio differs between the two backends, then one could be slower to compute the same number of iterations yet faster to reach a result of the same accuracy. I understand this is less of a concern if both compared algorithms are NUTS variants and thus the ESS/iteration is likely quite similar, but it would still be great to see that reported!
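To make the suggestion concrete, here is a rough sketch of what I mean by ESS/second. In practice one would use a proper implementation (e.g. `arviz.ess`, or the ESS that Stan itself reports); the crude autocorrelation-based estimator below is just for illustration, and the chain, runtime, and truncation rule are all made up for the example:

```python
import numpy as np

def crude_ess(chain):
    """Crude effective sample size via truncated autocorrelation sums.

    Illustrative only - real diagnostics (ArviZ, Stan) use rank-normalized,
    multi-chain estimators and are what one should actually report.
    """
    n = len(chain)
    x = np.asarray(chain, dtype=float) - np.mean(chain)
    # Autocovariance at all lags via FFT (zero-padded to avoid wraparound).
    f = np.fft.rfft(x, 2 * n)
    acov = np.fft.irfft(f * np.conj(f))[:n] / n
    rho = acov / acov[0]
    # Sum autocorrelations in consecutive pairs, stopping at the first
    # non-positive pair (a simple initial-positive-sequence truncation).
    tau = 1.0
    for k in range(1, n - 1, 2):
        pair = rho[k] + rho[k + 1]
        if pair <= 0:
            break
        tau += 2.0 * pair
    return n / tau

# Hypothetical comparison: a nearly independent chain vs. a sticky AR(1)
# chain of the same length, with made-up wall-clock times.
rng = np.random.default_rng(0)
n = 20_000
fast_chain = rng.standard_normal(n)          # low autocorrelation
slow_chain = np.empty(n)
slow_chain[0] = 0.0
eps = rng.standard_normal(n)
for t in range(1, n):                        # AR(1) with phi = 0.9
    slow_chain[t] = 0.9 * slow_chain[t - 1] + eps[t]

runtime_fast, runtime_slow = 12.0, 8.0       # invented seconds
print("fast backend: ESS/s =", crude_ess(fast_chain) / runtime_fast)
print("slow backend: ESS/s =", crude_ess(slow_chain) / runtime_slow)
```

The point of the example: the second chain finishes its iterations in less wall-clock time, but its ESS/second can still be far lower, which is why iterations/second alone can be misleading.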
Thanks a lot for the work you have put into this.