I was wondering if there are any obvious static analysis tasks and other programming language tasks that Stan could really benefit from. I am asking because there might be MSc students here in Oxford who could enjoy contributing in that way.
Some examples of things I had in mind:
extensions to the type system, like tuples, function types;
new datatypes like associative arrays (dictionaries);
for Stan 3, dependency analysis which automatically determines in which program block a variable belongs;
static analysis to detect scenarios in which inference is likely to go wrong (do we think this could be useful/doable?).
Does anyone have examples of such things they’d like to see implemented? The more concrete the better! It’s especially good if it’s also academically interesting.
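To make the dependency-analysis item above a bit more concrete, here is a minimal Python sketch of the idea. It is only a toy, not Stan compiler code: the program representation (a dict of variable definitions) and the function name `infer_blocks` are hypothetical, and it ignores generated quantities, which would also need to know whether a variable feeds into the log density.

```python
# Toy dependency analysis: infer a Stan-like block for each derived variable.
# The input representation and all names here are hypothetical.

def infer_blocks(data_vars, param_vars, definitions):
    """Assign each variable to a block.

    data_vars:   names declared as data
    param_vars:  names declared as parameters
    definitions: dict mapping each derived variable to the set of
                 variables its defining expression reads
    Returns a dict mapping variable name -> block name.
    """
    block = {v: "data" for v in data_vars}
    block.update({v: "parameters" for v in param_vars})

    def visit(var, stack=()):
        if var in block:
            return block[var]
        if var in stack:
            raise ValueError(f"cyclic definition involving {var!r}")
        if var not in definitions:
            raise KeyError(f"undefined variable {var!r}")
        deps = [visit(d, stack + (var,)) for d in definitions[var]]
        # Anything that transitively reads a parameter must be recomputed
        # on every log density evaluation; everything else is computed once.
        reads_params = any(
            b in ("parameters", "transformed parameters") for b in deps
        )
        block[var] = (
            "transformed parameters" if reads_params else "transformed data"
        )
        return block[var]

    for v in definitions:
        visit(v)
    return block


if __name__ == "__main__":
    blocks = infer_blocks(
        data_vars={"N", "x"},
        param_vars={"beta"},
        definitions={"x_std": {"x"}, "mu": {"x_std", "beta"}},
    )
    for name in sorted(blocks):
        print(name, "->", blocks[name])
```

The real problem is harder (loops, conditionals, partial assignment to containers), which is probably where it gets academically interesting.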
I’d honestly really like the sparse stuff Dan and Aki have been talking about. There’s some threads around here somewhere on that. I dunno if they ever found a path to get that implemented. But that might be more drudgery with tests in the Math library than anything else. Dan has a thread around here with quite a few details.
I’ve also been thinking about some (probably quite simple) metaprogramming system. There is a rudimentary #include mechanism already in place, but it would IMHO be cool to
a) Make including more structured to disallow the worst anti-patterns found in C (e.g. only let users include at the top level and include only whole model blocks). Even better, introduce proper scoping mechanisms. There is some discussion of this at https://github.com/stan-dev/stan/issues/2224
b) Allow compile-time parametrization of the models (resulting in conditional compilation of certain parts of the model). E.g. choosing a link function within a complex model by an enum parameter - this may alter the number and type of parameters to optimize and so is not easily implemented in Stan now. Once again I think it would be good to be very restrictive and allow only conditional compilation of whole blocks, force users to specify all params and do some type-checking etc. (Rust’s #[cfg] feature is IMHO a good starting point). A nice static analysis task would be to make the compiler check that the program is compilable under all allowed parameter assignments.
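As a rough illustration of the "compilable under all allowed parameter assignments" check, here is a minimal Python sketch. It assumes a made-up model representation in which every declaration and every use of a variable carries a condition on the compile-time parameters. The name `check_all_configs` and the data layout are hypothetical, and brute-force enumeration is exponential in the number of parameters, so a real compiler would want something smarter.

```python
from itertools import product

# Toy check: under every allowed compile-time configuration, each variable
# that is used must also be declared. All names here are hypothetical.

def check_all_configs(options, declarations, uses):
    """Return a list of (config, variable) pairs that would fail to compile.

    options:      dict mapping compile-time parameter name -> allowed values
    declarations: list of (variable, condition) pairs
    uses:         list of (variable, condition) pairs
    A condition is a function taking the config dict and returning a bool.
    """
    names = sorted(options)
    problems = []
    for values in product(*(options[n] for n in names)):
        cfg = dict(zip(names, values))
        declared = {var for var, cond in declarations if cond(cfg)}
        for var, cond in uses:
            if cond(cfg) and var not in declared:
                problems.append((cfg, var))
    return problems


if __name__ == "__main__":
    always = lambda cfg: True
    sigmoid = lambda cfg: cfg["link"] == "sigmoid"

    options = {"link": ["sigmoid", "linear"]}
    declarations = [("w", always), ("b", sigmoid)]
    # Bug on purpose: "b" is used unconditionally but declared conditionally.
    uses = [("w", always), ("b", always)]

    for cfg, var in check_all_configs(options, declarations, uses):
        print(f"{var!r} is used but not declared when {cfg}")
```

Restricting conditions to whole blocks and typed, enumerable parameters (as suggested above) is what keeps this kind of check tractable.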
Sure, I’m not saying we shouldn’t do what you’re suggesting, I’m pointing out an edge case we need to deal with. Stan’s limited ability to encapsulate makes it difficult to create a useful #include that also respects some basic rules as you are suggesting. Personally I think it might be useful to allow blocks to appear multiple times and be concatenated so that the #include mechanism could remain simple.
We rebuild it every log density evaluation, so static analysis on it isn’t really static. Some other autodiff systems reuse their “tapes” and sometimes optimize them; CppAD is an example.
I think that’s all waiting on @mitzimorris finishing the underlying type refactor. I think it’s going well, but it’s not ready yet. She’ll know more.
This is the big one we want to tackle. Andrew calls it something like “pedantic mode” and we’ve also called it “lint”. There is the beginning of a wiki page with some things we know we can easily detect statically: Home · stan-dev/stan Wiki · GitHub
That’d be cool. There are includes right now, but that’s about it.
I don’t quite see what that’d give you. Is this part of a workflow with multiple models? The advantage would be that they’re all in one file even if they all have to be recompiled?
Scoping for includes? I thought that’s what Krzysztof said would then break RStanArm.
I’ll give an example. I personally work with biological models, where you have a Gaussian process for a regulator molecule and measurements for some target molecules. There are multiple ways the derivative of the target molecule concentration could depend on the regulator, for example target' = k1 * (1 / exp(-w * regulator - b)) - k2 * target and target' = w * regulator - target. Except for target, all the inputs are Stan parameters to be determined. Those two “link functions” differ both in the number of parameters and in the priors required for the shared parameter w. But there is a lot of code around this, including numerical integration, the GP stuff etc., that is exactly the same, so I would like to have a compile-time parameter to choose between those two types of models.
Currently the alternatives are
a) put everything in functions (possibly with lots of parameters) and have two different main model files that only call those functions and differ in which “link” they use. Since I cannot pass functions as parameters, this would still result in a lot of duplicated code.
b) include parameters for both links in the model and pass the desired link type as an int data variable, then branch with if on this data variable in the model and transformed parameters blocks. The parameters for the unused links still exist in the model but are not involved in any statements. They still consume memory, though.
My intention is to have code that basically follows b), but makes it explicit that at runtime the if conditions always evaluate the same way, and gets rid of the unnecessary variables. Alternatively, I can imagine being allowed to use some kind of if in the parameters block when the condition is based only on data, but that seems hacky and I am not sure it would be easy to implement.
Also, you can easily imagine Stan caching the compiled model separately for different compile-time parameter values, so recompilation would not be that frequent. This would probably require all compile-time parameters to be declared and have a type, but I believe strong typing is a good idea anyway.
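The caching idea could look roughly like the following Python sketch. It is purely illustrative: `ModelCache` and the `compile_fn` callback are hypothetical and do not correspond to any existing Stan interface. The cache key is a hash of the model source together with the compile-time parameter values, so each distinct assignment is compiled at most once.

```python
import hashlib
import json

# Toy cache of compiled models keyed by source text plus compile-time
# parameter values. ModelCache and compile_fn are hypothetical names.

class ModelCache:
    def __init__(self, compile_fn):
        # compile_fn(source, params) stands in for the expensive compile step.
        self._compile_fn = compile_fn
        self._cache = {}

    @staticmethod
    def key(source, params):
        # sort_keys makes the key independent of dict insertion order;
        # params must be JSON-serializable (strings, numbers, booleans).
        payload = json.dumps(
            {"source": source, "params": params}, sort_keys=True
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get(self, source, params):
        k = self.key(source, params)
        if k not in self._cache:
            self._cache[k] = self._compile_fn(source, params)
        return self._cache[k]


if __name__ == "__main__":
    def fake_compile(source, params):
        print("compiling with", params)
        return (source, tuple(sorted(params.items())))

    cache = ModelCache(fake_compile)
    cache.get("model { }", {"link": "sigmoid"})  # compiles
    cache.get("model { }", {"link": "sigmoid"})  # served from the cache
    cache.get("model { }", {"link": "linear"})   # compiles again
```

Requiring the parameters to be declared and typed is what makes a stable key like this possible in the first place.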
True if we changed the way current #include syntax works, but I think Krzysztof agrees with me that having backwards compatibility on this is easy (e.g. introduce new syntax and make #include deprecated).