@thel, I could use your help thinking about this and how to make this sustainable. I’m sure @bgoodri has some ideas that are key here too.
Right now, there’s no structure to our example-models repo: https://github.com/stan-dev/example-models
What could we do if we had structure?
- regression tests for Stan version
- speed tests across Stan interfaces
- evaluation for new algorithms (much like @Rayleigh_L has done with ADVI, just easier)
- better indexing
- perhaps better adoption of Stan
This is where I could use your help because these are suggestions and I don’t know how to actually make this work. In my mind, here’s what I think would be great:
- some sort of common organization for examples that we know to work
- different levels of examples (or perhaps tags?)
  - bare minimum: contains a Stan program, data, and a script to run it
  - has documentation: describes the data and the math?
  - has verification: it’s been run for a long time and we know that inference converges?
Is there some way of marking the models that have these things and the models that don’t? Is there a way to trigger Travis on new pull requests to see which of these things a model has (of course, it’ll still need eyes on it)? Is there a way to tell which ones have 0 warnings?
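One way to make those levels machine-checkable (the file names and the `VERIFIED` marker below are purely illustrative assumptions, not an existing convention): a small script that inspects a model directory and reports the highest level it qualifies for. A Travis job could run something like this on every pull request.

```shell
# Hypothetical per-model layout: model.stan + model.data.R + run.sh is the
# bare minimum; a doc/ directory and a VERIFIED record upgrade the level.
dir=$(mktemp -d)
touch "$dir/model.stan" "$dir/model.data.R" "$dir/run.sh"

level="incomplete"
if [ -f "$dir/model.stan" ] && [ -f "$dir/model.data.R" ] && [ -f "$dir/run.sh" ]; then
  level="bare-minimum"
  if [ -d "$dir/doc" ]; then
    level="documented"
    if [ -f "$dir/VERIFIED" ]; then
      level="verified"
    fi
  fi
fi
echo "level: $level"
rm -rf "$dir"
```

The human review still happens; the script only gates the mechanical part.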
Is there some way we can set up regression tests? I don’t think the dogs example (in the BUGS examples) works anymore. It used to in earlier versions of Stan.
@thel, any ideas? If you’re looking at static analysis, this would be the right place to target it.
After 2.13, I want to review all the example models.
I think the most productive way to do this is to start a new repo for reproducible models, so we can start clean and get a couple of models going before trying to get all 300+ models we have in order.
I think that’ll have to be separate from our overall example-models anyway, because for the BUGS examples, I think we want to code their model using a parameterization that works and is relatively efficient, then code the model with priors we like for the same data (presumably not monkeying with the likelihood). Presumably only the latter would go into the new testing/good-example repo.
Or maybe we can separate out a good/bad split at the top level of the repo? Or have a way to mark individual models?
This is a great idea! We urgently need this. I have been doing tons of performance tests for ODE models, and these have always helped me find out quickly what works and what doesn’t (in terms of speed).
I think the stan_demo command is a good place to start, but it is only a start, since you’re asking for more: stan_demo is about running the models, but we want a bit more here, right? Maybe the new rstantools R package is a good starting point?
We should start with a small set of a handful of models and then think about a conversion script if we like.
One thing we do that I think causes tremendous pain in the end is reading variables out of the R environment implicitly. It’s a great convenience for users, but then I think people don’t know exactly what they fed to Stan, and it makes the models hard to read. I’m of course not suggesting that we take that functionality away, but it would be nice to enforce a little explicitness in the examples and make sure they are runnable from CmdStan.
One organization I can imagine is that every example would look like 4 directories, 4 files, with something pleasant in doc/ like a Jupyter notebook or knitr.
The makefile would have one target that prepares the data file and another to run stanc and go from there.
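A sketch of what such a makefile might look like; the target names, paths, and the CMDSTAN variable are illustrative assumptions, not an agreed layout:

```makefile
# Hypothetical example layout: model.stan at the top, data prep in scripts/,
# prepared data in data/. All names here are placeholders.
CMDSTAN ?= $(HOME)/cmdstan

# One target prepares the data file from a source script.
data/model.data.R: scripts/prepare-data.R
	Rscript scripts/prepare-data.R

# Another builds the model through CmdStan's own makefiles (stanc + compile).
model: model.stan
	$(MAKE) -C $(CMDSTAN) $(CURDIR)/model

# And one runs the sampler on the prepared data.
run: model data/model.data.R
	./model sample data file=data/model.data.R
```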
This is a very Unix-first way of doing things, which suits me but might make it a little harder for contributors?
I don’t think we’ll be able to get any CI to accurately judge the robustness/accuracy of a model, but it would already be a huge improvement if we could have CI that tests whether or not a model can run. If inits and data are in the Rdump format with matching file names (i.e., name.stan, name.init.R, name.data.R), then the CI could be automated without any need for makefiles, as Thel suggested in the email.
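The matching-name convention makes discovery trivial to script. A minimal sketch, assuming that convention (the dogs/rats model names are placeholders): for every foo.stan, look for foo.data.R and foo.init.R alongside it, and flag the ones missing a piece.

```shell
# Build a toy example directory so the sketch is self-contained.
dir=$(mktemp -d)
touch "$dir/dogs.stan" "$dir/dogs.data.R" "$dir/dogs.init.R"
touch "$dir/rats.stan" "$dir/rats.data.R"    # no inits file

complete=""
incomplete=""
for m in "$dir"/*.stan; do
  name=$(basename "$m" .stan)
  # A model is CI-runnable only if both companion files exist.
  if [ -f "$dir/$name.data.R" ] && [ -f "$dir/$name.init.R" ]; then
    complete="$complete $name"
  else
    incomplete="$incomplete $name"
  fi
done
echo "runnable:$complete"
echo "needs attention:$incomplete"
rm -rf "$dir"
```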
Because of the subtleties in validating fits (if we could automate that, we wouldn’t be running so many models ourselves!), I think the inclusion of an example model has to be done by hand.
My recommendation would be a new repo with three main folders (or even three separate repos, if people don’t think that would clog things up too much). One would be a staging area for submissions, where CI would verify that doc exists and the model runs. Then moderators (i.e., us) would check the model and classify it into either the good-models repo or the bad-models repo (having examples of models that don’t work can be just as educational).
Given by-hand verification, the good models repo could also then be subject to more intense CI upon changes to the underlying Stan code, serving as a basis for regression tests. For example, we could check for divergences, E-BFMI, R-hats, etc. Or even have target values for various expectations set and validated when the models are moved into the good models repo.
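One piece of that regression CI is easy to sketch: counting divergences in a CmdStan sample output file. CmdStan’s CSV output does include a divergent__ column; the three-row CSV below is fabricated just to keep the example self-contained (E-BFMI and R-hat checks would want a real tool such as CmdStan’s diagnose utility rather than a one-liner).

```shell
# Count divergent transitions in a (fabricated, minimal) CmdStan output CSV.
csv=$(mktemp)
cat > "$csv" <<'EOF'
lp__,divergent__,theta
-7.3,0,0.25
-7.1,1,0.31
-7.4,0,0.28
EOF

# Locate the divergent__ column in the header, then sum it over the draws,
# skipping CmdStan's '#' comment lines.
divergences=$(awk -F, '
  /^#/ { next }
  !col { for (i = 1; i <= NF; i++) if ($i == "divergent__") col = i; next }
  { sum += $col }
  END { print sum + 0 }
' "$csv")
echo "divergences: $divergences"
rm -f "$csv"
```

A CI job could fail the build whenever this count exceeds an agreed threshold for a good-repo model.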
Yeah, I think at a minimum we need a way to turn this “feature” of rstan off. Maybe an inherits argument (idiomatic in R) that can be set to FALSE. If it’s FALSE and the data list provided to stan() is missing a variable, it should error even if something with that name and structure exists in the global (or local) environment.
I agree that we shouldn’t be reading variables implicitly with our examples.
I aggregated suggestions ages ago at the bottom of the wiki. The key thing we wanted is a data simulator and fitter.
I agree that settling on a directory structure will be necessary, and I think unix-like is the only way to go.
As long as “model-name-data.R” isn’t R specific (that is, it restricts to the syntax of R dumps that CmdStan can read), it should be fine.
@betanalpha, I wasn’t really targeting a CI (I’m assuming you meant continuous integration). I just think it’s about time we had some proper regression tests that we ran before shipping code.
@syclik I don’t think example models will be great for regression – the performance results are just too noisy and can lead to too many false positives. For rigorous regressions tests I think we need the tests I built in the stat_valid branch.
I was advocating for continuous integration just to automatically validate example models submitted by users, allowing for automated staging before someone human has to get involved, reducing the needed person-power.
[sorry – didn’t realize I didn’t respond]
Here, I was using “regression testing” as in software regression testing. As in things that worked continue to work. Not as a test for making sure the samplers were all valid.
Sure – the problem is that model fits are statistical and occasionally fit values will fall outside of the expected values even if nothing changes in the underlying sampler code. I’m just saying that naively checking fit results won’t make a great regression test because of too many false positives when nothing changes. We’d really need to follow the pattern in the stat_valid_test branch, and even that requires knowing exact quantiles.
Yup. I’m for both, but I’m also thinking we could have caught 2.10 if we had something simple kicking around.
The stat_valid_test would have definitely caught it. The key is that it can also control false positives.