This is something I’ve been thinking about too, but with limited time I never got around to doing anything about it, so it’s really cool that you have done this and have a framework to test it with many, many models.
I agree, and would add that with a small sample size near 100 it would be useful to get more accurate tail quantiles or probabilities. However, tail quantiles are not direct expectations, and probabilities are expectations of a non-differentiable step function.
Based on the results, additional constraints are that the posterior is not too high dimensional and is relatively close to Gaussian. We often see slow mixing and small effective sample size when the posterior is far from Gaussian, e.g. multimodal or banana-shaped, so it would be useful to have more examples of that kind. The best setting would probably be a posterior that is not far from Gaussian with good mixing, but very expensive log density evaluations, e.g. with ODE models. If the posterior is relatively close to Gaussian, there would be possibilities to estimate tail properties with some smooth functions of x.
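To illustrate what I mean by a smooth function of x: a tail probability P(x > c) is the expectation of a step indicator, but one can replace the indicator with a sigmoid of some width h, which is differentiable. This is just a toy sketch on standard normal draws; the sigmoid form and the bandwidth h are my own illustrative choices, not anything from the framework above.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)  # stand-in for posterior draws
c = 1.64                  # tail threshold of interest

# Plain Monte Carlo estimate of P(x > c): expectation of a step function
p_step = np.mean(x > c)

# Smoothed version: sigmoid of width h instead of the indicator.
# The smoothed integrand is differentiable, so gradient-based
# control variates could in principle be applied to it.
h = 0.1
p_smooth = np.mean(1.0 / (1.0 + np.exp(-(x - c) / h)))
```

The smoothing introduces a bias that depends on h, so there is a bias–variance trade-off to study; the point is only that the non-differentiability of the step function is not necessarily a dead end.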
I agree with 1-3, but regarding (4), in the pre-convergence setting the expectation of the control variate is not 0.
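A tiny illustration of that point, using a score-function control variate for a standard normal target (the shifted "pre-convergence" draws are simulated directly here, not taken from an actual chain):

```python
import numpy as np

rng = np.random.default_rng(1)

# Score function of a standard normal target: grad log p(x) = -x.
# Under the target itself, E[-x] = 0, so it is a valid control variate.
converged = rng.normal(loc=0.0, size=10_000)
print(np.mean(-converged))   # close to 0

# "Pre-convergence" draws: the chain is still biased away from the
# target, so the control variate's sample mean is far from 0 and
# using it would bias the corrected estimator.
biased = rng.normal(loc=1.0, size=10_000)
print(np.mean(-biased))      # close to -1, not 0
```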
So in the linear case we need more draws than the number of parameters? Or a higher effective sample size than the number of parameters? This would limit the usefulness to low dimensional cases with very expensive log density evaluations, unless a low-rank or structured covariance matrix is used?
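The counting argument behind my question can be sketched with a toy standard-normal "posterior" (again my own illustration, not the actual framework): fitting the d linear control-variate coefficients by least squares needs at least d effectively independent draws.

```python
import numpy as np

rng = np.random.default_rng(2)
d, n = 50, 40  # more parameters than draws

# Linear (ZV-style) control variates for a standard normal: grad log p(x) = -x
x = rng.normal(size=(n, d))
z = -x
f = np.sum(x**2, axis=1)  # some scalar function of interest

# With n < d the design matrix z is rank-deficient (rank at most n),
# so the coefficients are not identified and the in-sample fit is
# driven to near zero residual, i.e. overfitting rather than real
# variance reduction.
coef, res, rank, _ = np.linalg.lstsq(z, f - f.mean(), rcond=None)
print(rank)  # at most n, which is less than d here
```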
This is also one restriction I had in mind.
Oh, when I started writing all of the above I didn’t notice @LeahSouth’s excellent reply, which provides additional ideas that might help!