Background
Because model compilation and MCMC are inevitably time-consuming, I find it extremely helpful to write entire data analysis projects as formal end-to-end Make-like declarative workflows. Here is a sketch of an actual Makefile that might do this.
all: samples.fst
samples.fst: run_model.R model.rds data.fst
	Rscript run_model.R
model.rds: compile_model.R model.stan
	Rscript compile_model.R # Sets auto_write = TRUE and save_dso = TRUE.
data.fst: simulate_data.R
	Rscript simulate_data.R
clean:
	rm -f *.fst *.rds
In practice, because I use R, I usually opt for a Make-like R package like drake or targets instead of GNU Make itself. (Full disclosure: I am the creator and maintainer of both of these R packages.) In fact, I am trying to develop best practices for using rstan with targets, and my first attempt is here.
Issue
I am not sure if I am writing the model compilation target correctly in the Makefile above (or here with targets). That target assumes the input file is model.stan and the output file is model.rds, so the model will recompile and downstream targets may rerun if either file changes. But I am not sure it is sufficient to track the RDS file. I feel as though a step like this should also include the actual DSO file and any other binaries that get created. How do I find the names of these files? In general, what is the best way to reproducibly track the compilation step and completely guarantee that downstream targets do not recompile the model?
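To make the file-discovery question concrete, here is a rough shell sketch of one generic approach: record checksums of every file in a directory before and after a step, then report what is new or changed. The names snapshot, changed_files, and fake_compile are hypothetical, and fake_compile (with its made-up model.so output) is only a stand-in for Rscript compile_model.R. The sketch only sees files under the directory it scans, so it would miss anything written elsewhere, such as a temporary directory.

```shell
#!/bin/sh
# Sketch only: report files under a directory that a command created or changed.
# Usage: changed_files DIR COMMAND [ARGS...]

# Print one "checksum size path" line per regular file under DIR.
snapshot() {
  find "$1" -type f -exec cksum {} + | LC_ALL=C sort
}

changed_files() {
  dir=$1
  shift
  before=$(mktemp)
  after=$(mktemp)
  snapshot "$dir" > "$before"
  "$@" > /dev/null            # run the step being inspected
  snapshot "$dir" > "$after"
  # Lines that appear only in the second snapshot are new or modified files.
  LC_ALL=C comm -13 "$before" "$after" | cut -d ' ' -f 3- | LC_ALL=C sort
  rm -f "$before" "$after"
}

# Demo in a scratch directory with a hypothetical stand-in for the compile step.
workdir=$(mktemp -d)
echo "parameters { real mu; }" > "$workdir/model.stan"
fake_compile() {
  echo "compiled model" > "$workdir/model.rds"   # assumed output name
  echo "binary" > "$workdir/model.so"            # assumed output name
}
outputs=$(changed_files "$workdir" fake_compile)
echo "$outputs"   # lists model.rds and model.so, but not the untouched model.stan
```

In a real project the last lines would become something like changed_files . Rscript compile_model.R, and the resulting list is what a pipeline tool would need to track.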
I realize the DSO file name might not be known in advance, and there may be multiple output files from compilation. This presents a challenge for GNU Make, but not for drake or targets because both can easily handle multiple dynamic input/output files per target. For the model compilation target in the targets example, the compile_model() function just needs to return the names of all the files that the model depends on.
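For GNU Make, one workaround I can imagine for the multiple-output problem is a single manifest file that stands in for all the compilation outputs: downstream rules depend on the manifest, and the manifest is only rewritten when a checksum actually changes, so its timestamp stays put otherwise. The sketch below is an assumption-laden illustration, not an established rstan workflow; update_manifest, model.manifest, and model.so are hypothetical names.

```shell
#!/bin/sh
# Sketch only: keep a checksum manifest for a set of output files.
# Usage: update_manifest MANIFEST FILE...
# Prints "updated" if the manifest was (re)written, "unchanged" otherwise.
update_manifest() {
  manifest=$1
  shift
  tmp=$(mktemp)
  cksum "$@" > "$tmp"
  if [ -f "$manifest" ] && cmp -s "$tmp" "$manifest"; then
    rm -f "$tmp"              # same checksums: leave the old timestamp alone
    echo "unchanged"
  else
    mv "$tmp" "$manifest"     # new or different: downstream targets should rerun
    echo "updated"
  fi
}

# Demo with made-up output files in a scratch directory.
workdir=$(mktemp -d)
echo "model object" > "$workdir/model.rds"
echo "binary v1" > "$workdir/model.so"
first=$(update_manifest "$workdir/model.manifest" "$workdir/model.rds" "$workdir/model.so")
second=$(update_manifest "$workdir/model.manifest" "$workdir/model.rds" "$workdir/model.so")
echo "binary v2" > "$workdir/model.so"
third=$(update_manifest "$workdir/model.manifest" "$workdir/model.rds" "$workdir/model.so")
echo "$first $second $third"   # prints "updated unchanged updated"
```

A Makefile rule could then list model.manifest as the prerequisite of samples.fst instead of model.rds, which sidesteps the need to name every binary in advance.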