Because model compilation and MCMC are inevitably time-consuming, I find it extremely helpful to write entire data analysis projects as formal end-to-end Make-like declarative workflows. Here is a sketch of a Makefile that might do this.
    all: samples.fst

    samples.fst: run_model.R model.rds
    	Rscript run_model.R

    model.rds: compile_model.R model.stan
    	Rscript compile_model.R # Sets auto_write = TRUE and save_dso = TRUE.

    data.fst: simulate_data.R
    	Rscript simulate_data.R

    clean:
    	rm -f *.fst *.rds
In practice, because I use R, I usually opt for a Make-like R package such as targets instead of GNU Make itself. (Full disclosure: I am the creator and maintainer of this R package.) In fact, I am trying to develop best practices for using rstan with targets, and my first attempt is here.
I am not sure whether I am writing the model compilation target correctly in the Makefile above (or here with targets). That target assumes the input file is model.stan and the output file is model.rds, so the model will recompile and downstream targets may rerun if either file changes. But I am not sure it is sufficient to track only the RDS file. I feel as though a step like this should also track the actual DSO file and any other binaries that get created. How do I find the names of these files? In general, what is the best way to reproducibly track the compilation step and guarantee that downstream targets do not recompile the model?
I realize the DSO file name might not be known in advance, and there may be multiple output files from compilation. This presents a challenge for GNU Make, but not for targets, because it can easily handle multiple dynamic input/output files per target. For the model compilation target in the targets example, the compile_model() function just needs to return the names of all the files that the model depends on.
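
To make the idea concrete, here is a minimal sketch of what such a compile_model() function could look like. It is not a definitive implementation: the compiler argument is a hypothetical addition for illustration (it lets the function be exercised without a Stan toolchain), and the code assumes that rstan, with auto_write = TRUE, writes the compiled model to an RDS file with the same base name next to the Stan file. It tracks only the Stan file and the RDS file, so the question of any separate DSO or other binaries remains open.

```r
# Hypothetical helper: compile a Stan model and return the paths of the
# files the compiled model depends on, so a pipeline tool can hash them.
compile_model <- function(stan_file = "model.stan",
                          compiler = rstan::stan_model) {
  # auto_write = TRUE asks rstan to cache the compiled model on disk;
  # save_dso = TRUE asks it to keep the DSO with the model object.
  compiler(stan_file, auto_write = TRUE, save_dso = TRUE)

  # Assumption: the cache file is <base name>.rds next to the Stan file.
  rds_file <- paste0(tools::file_path_sans_ext(stan_file), ".rds")
  if (!file.exists(rds_file)) {
    stop("Expected compiled model at ", rds_file, call. = FALSE)
  }

  # Return every tracked file: the input and the compiled output.
  c(stan_file, rds_file)
}

# Possible wiring in _targets.R (requires the targets package):
# list(
#   tar_target(stan_file, "model.stan", format = "file"),
#   tar_target(model_files, compile_model(stan_file), format = "file")
# )
```

With format = "file", targets treats the returned character vector as file paths and reruns downstream targets when the contents of any of those files change. If compilation turns out to produce additional files, their paths could be appended to the returned vector.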