I’m currently writing an R package implementing some rstan models. One option is to write a new stan-file for every variation of the model, but the compiled model files are large, which makes the package large. A current draft of the package is only moderately smaller than rstanarm, although I believe rstanarm offers a wider range of models and options. I’ve looked at rstanarm’s code for plain regression models, which is quite general and contains a lot of if-statements.
My questions are:
Can I reduce the size of my R package by collapsing several models into a single stan-file, using if-statements (and other potential solutions)?
Will using a lot of if-statements and writing a long stan-file make the sampling algorithm run noticeably slower?
Would you (stan-developers or package developers) recommend doing this?
If I try to do this, do you have any advice? Things to avoid? Any useful tricks to accommodate different data scenarios and model specifications?
Collapsing several models will most likely reduce file sizes, though it depends on how similar the models really are and how much code is shared between the branches. I believe rstanarm does it because of CRAN restrictions, but it might be a good idea even if you aren't worried about CRAN specifically.
This will be slower, but it's possible the slowdown will be so slight you don't care. It partly depends on how "high up" the if-statements are. If they are all in the transformed data block (for example), it will likely make no difference; if instead they are deep inside loops in the model block, expect it to matter. So it depends on your model.
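For instance, a branch that only affects data preparation can live in transformed data, where it runs once when the data is read in rather than once per leapfrog step. A minimal sketch (center_x is a made-up flag, not anything from an existing package):

data {
  int<lower=0> N;
  vector[N] x;
  int<lower=0, upper=1> center_x;  // hypothetical flag: 1 = center the predictor
}
transformed data {
  vector[N] x_use;
  // this branch is evaluated once per fit, so it adds
  // nothing to the per-iteration cost of sampling
  if (center_x) {
    x_use = x - mean(x);
  } else {
    x_use = x;
  }
}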
rstanarm uses a lot of the useful tricks that are out there. The main one is declaring arrays (or vectors) of size 1 or size 0 to "enable" or "disable" a parameter from existing (there's a sketch of this below, after the loop example). The other tips are general software-efficiency ones, such as hoisting conditionals as far up as you can to avoid re-evaluating them in the hot parts of the code.
For example, even though the code is slightly less obvious, the second option will generally perform better:
// option one: branch inside the loop
for (i in 1:N) {
  if (some_data_constant) {
    a();
  } else {
    b();
  }
}

// option two: branch hoisted out of the loop
if (some_data_constant) {
  for (i in 1:N) {
    a();
  }
} else {
  for (i in 1:N) {
    b();
  }
}
Luckily, if your conditionals are predictable (and these should be, since the flags are data), the branch predictor will generally do a very good job, and the slowdown will be minimal in most cases. It's still worth being careful and measuring the time each version of the model takes, though.
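To make the size-0/size-1 trick concrete, here's a minimal sketch (has_intercept is a made-up data flag; this isn't rstanarm's actual code):

data {
  int<lower=0> N;
  vector[N] y;
  int<lower=0, upper=1> has_intercept;  // hypothetical flag
}
parameters {
  vector[has_intercept] alpha;  // length 1 when enabled, length 0 when not
  real<lower=0> sigma;
}
model {
  real mu = 0;
  if (has_intercept) {
    mu = alpha[1];
  }
  alpha ~ normal(0, 5);  // vectorized, so a no-op when alpha is empty
  y ~ normal(mu, sigma);
}

When the flag is 0, the parameter simply doesn't exist, so it costs nothing during sampling and never appears in the output.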
I think 1MB is the max package size. CRAN's size restrictions, coupled with its lack of package dependency management, are why we don't have an up-to-date RStan.
Probably not noticeably slower, even if they're deeply nested. If Stan code were better at memory locality and branch prediction elsewhere it might matter, but there's already a ton of conditional code down at the low levels, so I doubt you'll notice this. It all gets translated to C++.
For example, if you want to allow two different forms of posterior, it shouldn't add noticeable overhead to do something like this (a minimal sketch; use_t_likelihood is just an illustrative flag, not anything from an existing package):
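data {
  int<lower=0> N;
  vector[N] y;
  int<lower=0, upper=1> use_t_likelihood;  // 0 = normal, 1 = Student-t
}
parameters {
  real mu;
  real<lower=0> sigma;
}
model {
  mu ~ normal(0, 5);
  sigma ~ normal(0, 2);
  if (use_t_likelihood) {
    y ~ student_t(4, mu, sigma);
  } else {
    y ~ normal(mu, sigma);
  }
}

Since use_t_likelihood is data, the branch is fixed for the whole run and trivially predictable.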
Ok, thanks! I guess I will give it a go. I have another Stan-based package already on CRAN, which according to CRAN checks has an installed size of about 10MB on Windows and 170MB on Mac. This results in a ‘note’ in their checks, but they’ve accepted the package anyhow. I just want to avoid my next package becoming larger than this (and ideally I hope to make it smaller).
Thanks! This is very useful. CRAN requirements are part of it, but I have a preference for limiting the size anyway. I am considering which models/features to offer users, and may end up dropping some features to avoid the package getting too large, unless I manage to write more general code without losing too much sampling speed. I think I will try to put more stuff into fewer Stan files and see how it works out.