@Bob_Carpenter I’ve since done a fair bit of work on these comparisons. I now compare models on two datasets and a synthetic dataset, using Stan’s ESS calculations for the comparison software as well. I had been running Stan with multiple chains, but the comparison software only runs one chain at a time, so I switched to a single-chain comparison. I have a conference paper submitted with the comparisons (noting that with multiple chains Stan would give roughly (number of cores) times the number of samples in the same wall time).
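To make the single-chain versus multi-chain point concrete, here is a minimal sketch of the arithmetic. The function names and all numbers are hypothetical (they are not from my paper), and the scaling assumes one chain per core with chains that run independently and cost about the same wall time each.

```python
def ess_per_second(ess: float, wall_seconds: float) -> float:
    """Sampling efficiency of one chain: effective draws per second of wall time."""
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    return ess / wall_seconds


def projected_multichain_ess(single_chain_ess: float, n_cores: int) -> float:
    """Rough projection: ESS from n_cores chains run in parallel.

    Assumption: chains are independent and finish in about the same wall
    time as the single chain, so ESS adds up across chains.
    """
    if n_cores < 1:
        raise ValueError("n_cores must be at least 1")
    return single_chain_ess * n_cores


# Hypothetical single-chain results, for illustration only.
stan_rate = ess_per_second(ess=800.0, wall_seconds=40.0)
other_rate = ess_per_second(ess=500.0, wall_seconds=100.0)
print("Stan ESS/s:", stan_rate)
print("Other ESS/s:", other_rate)
print("Stan ESS on 4 cores, same wall time:", projected_multichain_ess(800.0, 4))
```

The single-chain rates are the like-for-like comparison; the projection is only the side note about what extra cores would buy.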
I also compare VB in Stan against a custom-coded model, using the synthetic dataset from the authors I referenced in an earlier post. I can’t reproduce the reported results with their Python code. Stan’s VB does very well against their reported results, though. They use RMSE against the synthetic data as their measure (similar to some others in my field). The issue is that they run their MCMC for 100,000 draws, whereas NUTS reproduces the results long before that point. In fact, I can very quickly reproduce the synthetic data in Stan by running mean-field VB (which matches the mean parameters well) and then using the result as starting values for 2,000 NUTS samples (which improves the match on the covariance parameters).
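For anyone unfamiliar with the measure, this is roughly what RMSE against the synthetic parameters looks like. It is a minimal sketch in plain Python, not the authors’ code; the function name and the example values are hypothetical.

```python
import math
from typing import Sequence


def rmse(true_values: Sequence[float], estimates: Sequence[float]) -> float:
    """Root mean squared error between known (synthetic) values and estimates."""
    if len(true_values) != len(estimates):
        raise ValueError("inputs must have the same length")
    if len(true_values) == 0:
        raise ValueError("inputs must not be empty")
    squared_errors = [(t - e) ** 2 for t, e in zip(true_values, estimates)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical values: parameters used to generate the synthetic data,
# and posterior means recovered by a fitted model.
true_params = [1.0, 2.0, 3.0]
posterior_means = [1.0, 2.0, 5.0]
print("RMSE:", rmse(true_params, posterior_means))
```

The same function can be applied separately to the mean parameters and the covariance parameters, which is how I see the VB fit doing well on the former and the NUTS follow-up improving the latter.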
I got help from @mike-lawrence here and my code is posted here.