evaluation for new algorithms (much like @Rayleigh_L has done with ADVI, just easier)
better indexing
perhaps better adoption of Stan
My suggestions
This is where I could use your help, because these are just suggestions and I don’t know how to actually make them work. In my mind, here’s what I think would be great:
some sort of common organization for examples that we know to work
different levels of examples (or perhaps tags?)
bare minimum: contains the Stan program, data, and a script to run it.
has documentation: describes the data and the math?
has verification: it’s been run at length and we know that inference converges?
Is there some way of marking which models have these things and which don’t? Is there a way to trigger Travis on new pull requests to check which of these things a submission has (of course, it’ll still need eyes on it)? Is there a way to tell which ones produce 0 warnings?
Is there some way we can set up regression tests? I don’t think the dogs example (in the BUGS examples) works anymore; it used to in earlier versions of Stan.
@thel, any ideas? If you’re looking at static analysis, this would be the right place to target it.
After 2.13, I want to review all the example models.
I think the most productive way to do this is to start a new repo for reproducible models, so we can start clean and get a couple of models going before trying to get all 300+ models we have in order.
I think that’ll have to be separate from our overall example-models repo anyway, because for the BUGS examples I think we want to code their model using a parameterization that works and is relatively efficient, then code the model with priors we like for the same data (presumably not monkeying with the likelihood). Presumably only the latter would go into the new testing/good-example repo.
Or maybe we can separate out a good/bad split at the top of the repo? Or have a way to mark individual models?
This is a great idea! We urgently need this. I have been doing tons of performance tests for ODE models, and these have always helped me figure out quickly what works and what doesn’t (in terms of speed).
I think the stan_demo command is a good place to start, but only a start, since you’re asking for more: stan_demo is about running the models, and we want a bit more here, right? Maybe the new rstantools R package is a good starting point?
We should start with a small handful of models and then think about a conversion script if we like.
One thing we do that I think causes tremendous pain in the end is reading variables out of the R environment implicitly. It’s a great convenience for users, but then people don’t know exactly what they fed to Stan, and it makes the models hard to read. I’m of course not suggesting that we take that functionality away, but it would be nice to enforce a little bit of explicitness in the examples and make sure they are runnable from CmdStan.
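For example, something like this (just a sketch; the Bernoulli model and the variable names are only placeholders) keeps the data explicit and writes a dump file that CmdStan can read directly:

    library(rstan)

    # Build the data explicitly instead of letting rstan pull variables
    # out of the environment.
    N <- 10
    y <- c(0, 1, 0, 0, 0, 0, 0, 0, 0, 1)

    # Write the data to an Rdump file so the exact same inputs can be used
    # from CmdStan (and checked into the example's folder).
    stan_rdump(c("N", "y"), file = "bernoulli.data.R")

    # Fit with the data passed explicitly as a named list.
    fit <- stan("bernoulli.stan", data = list(N = N, y = y))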
One organization I can imagine is that every example would look something like this:
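(Just a sketch: the name.stan / name.data.R / name.init.R names match the convention suggested below, and README.md / run.R are only placeholders for "some doc" and "a script to run it".)

    example-name/
        example-name.stan      # the Stan program
        example-name.data.R    # data in Rdump format
        example-name.init.R    # inits in Rdump format (optional)
        README.md              # what the model is, where the data come from
        run.R                  # script that fits the model and prints diagnostics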
I don’t think we’ll be able to get any CI to accurately judge the robustness/accuracy of a model, but it would already be a huge improvement if we could have CI that tests whether or not a model can run. By having inits and data in the Rdump format with matching file names (i.e. name.stan, name.init.R, name.data.R), the CI could be automated without any need for makefiles, as Thel suggested in the email.
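A rough sketch of what that "does it run?" check could look like in R (the staging directory, iteration counts, and so on are just placeholders):

    library(rstan)

    for (stan_file in list.files("staging", pattern = "\\.stan$", full.names = TRUE)) {
      base      <- sub("\\.stan$", "", stan_file)
      data_file <- paste0(base, ".data.R")
      init_file <- paste0(base, ".init.R")

      model <- stan_model(stan_file)   # CI fails here if the model doesn't compile
      data  <- if (file.exists(data_file)) read_rdump(data_file) else list()
      init  <- if (file.exists(init_file)) list(read_rdump(init_file)) else "random"

      # A short run only checks that the model can be fit at all;
      # it says nothing about whether the fit is any good.
      fit <- sampling(model, data = data, init = init,
                      chains = 1, iter = 200, refresh = 0)
    }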
Because of the subtleties of validating fits (if we could automate it, then we wouldn’t be running so many models ourselves!), I think the inclusion of an example model has to be done by hand.
My recommendation would be a new repo with three main folders (or even three separate repos, if people don’t think that would clog things up too much). One would be a staging area for submissions, where CI would verify that doc exists and the model runs. Then moderators (i.e., us) would check the model and classify it into either the good-models repo or the bad-models repo (having examples of models that don’t work can be just as educational).
Given by-hand verification, the good-models repo could also then be subject to more intense CI upon changes to the underlying Stan code, serving as a basis for regression tests. For example, we could check for divergences, E-BFMI, R-hats, etc., or we could even set target values for various expectations and validate them when the models are moved into the good-models repo.
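For instance, the diagnostic checks might look something like this in rstan (the thresholds at the end are just the usual heuristics, not anything we’ve agreed on):

    library(rstan)

    # Pull the standard diagnostics out of a stanfit object.
    check_fit <- function(fit) {
      sp  <- get_sampler_params(fit, inc_warmup = FALSE)
      div <- sum(sapply(sp, function(x) sum(x[, "divergent__"])))
      ebfmi <- sapply(sp, function(x) {
        e <- x[, "energy__"]
        sum(diff(e)^2) / length(e) / var(e)
      })
      rhat <- summary(fit)$summary[, "Rhat"]

      c(divergences = div,
        min_ebfmi   = min(ebfmi),
        max_rhat    = max(rhat, na.rm = TRUE))
    }

    # e.g., flag the model if divergences > 0, min_ebfmi < 0.2, or max_rhat > 1.1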
Yeah, I think at a minimum we need a way to turn this “feature” of rstan off. Maybe an argument “inherits” (idiomatic in R) that can be set to FALSE. If it’s FALSE and the list of data provided to stan() is missing a variable then it should error even if something with that name and structure is in the global (or local) environment.
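Something like this is what I have in mind (the inherits argument is hypothetical, and bernoulli.stan is just a placeholder model that declares N and y as data):

    library(rstan)

    N <- 10                     # sitting in the global environment
    y <- rbinom(N, 1, 0.3)

    # Today: N is missing from the data list, but rstan quietly picks it up
    # from the calling environment and the fit runs anyway.
    fit <- stan("bernoulli.stan", data = list(y = y))

    # Proposed: with inherits = FALSE this call would error instead of
    # silently reading N from the environment.
    # fit <- stan("bernoulli.stan", data = list(y = y), inherits = FALSE)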
@betanalpha, I wasn’t really targeting a CI (I’m assuming you meant continuous integration). I just think it’s about time we had some proper regression tests that we ran before shipping code.
@syclik I don’t think example models will be great for regression – the performance results are just too noisy and can lead to too many false positives. For rigorous regression tests I think we need the tests I built in the stat_valid branch.
I was advocating for continuous integration just to automatically validate example models submitted by users, allowing for automated staging before a human has to get involved, reducing the needed person-power.
Here, I was using “regression testing” as in software regression testing. As in things that worked continue to work. Not as a test for making sure the samplers were all valid.
Sure – the problem is that model fits are statistical and occasionally fit values will fall outside of the expected values even if nothing changes in the underlying sampler code. I’m just saying that naively checking fit results won’t make a great regression test because of too many false positives when nothing changes. We’d really need to follow the pattern in the stat_valid_test branch, and even that requires knowing exact quantiles.