Thanks @betanalpha for the useful comments.
Can you make this suggestion more concrete? We’re still in the process of adding tools for comparing new results to the “gold standard”, and suggestions with as many details as possible are welcome. The idea of storing “gold standard” draws was that the user is not limited to a pre-defined set of expectations, but we are happy to get recommendations for any default set.
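To illustrate the kind of check we have in mind, here is a minimal sketch in Python with hypothetical arrays (this is not the posteriordb API; a real check would use an ESS-based Monte Carlo error estimate):

```python
# Sketch: compare an expectation from new draws against stored
# "gold standard" draws, within combined Monte Carlo error.
# Hypothetical arrays; not the posteriordb API.
import numpy as np

rng = np.random.default_rng(1)
gold = rng.normal(0.0, 1.0, size=10_000)   # stored gold-standard draws
new = rng.normal(0.02, 1.0, size=4_000)    # draws from the method under test

def mcse_mean(x):
    # Crude Monte Carlo standard error assuming near-independent draws;
    # with MCMC output, replace x.size with an effective sample size.
    return x.std(ddof=1) / np.sqrt(x.size)

diff = new.mean() - gold.mean()
tol = 3 * np.hypot(mcse_mean(new), mcse_mean(gold))
print(f"mean diff = {diff:.4f}, tolerance = {tol:.4f}, ok = {abs(diff) <= tol}")
```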
Preferably we would want to have more than 100k, but the current choice is a compromise. For some posteriors we could sample even more than 100k draws and store only a set of pre-defined expectations computed from them, to save the database users’ time and memory. Those using Stan could also get more draws at any time if there is doubt that 10k is not enough. As the database is meant for many things, I’m not sure we would want 100k for all.
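For intuition on the draws-versus-precision trade-off, a back-of-the-envelope sketch (assuming roughly independent draws; MCMC draws would need an effective-sample-size correction):

```python
# Monte Carlo standard error of a posterior mean shrinks like 1/sqrt(n),
# so 100k draws buy only about 3x the precision of 10k draws.
import numpy as np

for n in (10_000, 100_000, 1_000_000):
    draws = np.random.default_rng(0).normal(size=n)  # stand-in posterior draws
    print(f"n = {n:>9,}  MCSE of mean ~ {draws.std(ddof=1) / np.sqrt(n):.5f}")
```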
Sure, and we don’t currently have a “gold standard” for LDA; it’s set to NULL. We think it’s useful to also have models and posteriors which we know to be difficult. Using the keywords and checking whether a “gold standard” exists, it is possible to run tests only for the cases you trust.
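For example, that filtering could look something like the following sketch (a hypothetical record layout, not the actual database schema):

```python
# Sketch: select only posteriors that have gold-standard draws and avoid
# an untrusted keyword. Hypothetical records, not the real posteriordb schema.
posteriors = [
    {"name": "eight_schools", "keywords": ["hierarchical"], "gold_standard": "draws_10k"},
    {"name": "lda", "keywords": ["mixture", "multimodal"], "gold_standard": None},
]

trusted = [
    p["name"]
    for p in posteriors
    if p["gold_standard"] is not None and "multimodal" not in p["keywords"]
]
print(trusted)  # -> ['eight_schools']
```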
The database stores information on how each “gold standard” was obtained, but it might also be useful to have this kind of tier system as shorthand notation for the most common choices.
I agree.
We know about stat_comp_benchmark, and that is one reason why we have been working on this.
We hope this will be a community effort and that people will submit pull requests for their favorite models and posteriors. Even without making the effort of a complete pull request, it would help to simply recommend models with best-practices Stan code and a description of why the model/posterior is interesting.