Best practices for Make-like declarative workflows with Stan

Background


Because model compilation and MCMC are inevitably time-consuming, I find it extremely helpful to write entire data analysis projects as formal end-to-end Make-like declarative workflows. Here is a sketch of an actual Makefile that might do this.

.PHONY: all clean

all: samples.fst

samples.fst: run_model.R model.rds data.fst
	Rscript run_model.R

model.rds: compile_model.R model.stan
	Rscript compile_model.R # Sets auto_write = TRUE and save_dso = TRUE.

data.fst: simulate_data.R
	Rscript simulate_data.R

clean:
	rm -f *.fst *.rds

In practice, because I use R, I usually opt for a Make-like R package like drake or targets instead of GNU Make itself. (Full disclosure: I am the creator and maintainer of both these R packages.) In fact, I am trying to develop best practices for using rstan with targets, and my first attempt is here.

Issue


I am not sure if I am writing the model compilation target correctly in the Makefile above (or here with targets). That target assumes the input file is model.stan and the output file is model.rds, so the model will recompile and downstream targets may rerun if either file changes. But I am not sure it is sufficient to track the RDS file. I feel as though a step like this should also include the actual DSO file and any other binaries that get created. How do I find the names of these files? In general, what is the best way to reproducibly track the compilation step and completely guarantee that downstream targets do not recompile the model?

I realize the DSO file name might not be known in advance, and there may be multiple output files from compilation. This presents a challenge for GNU Make, but not for drake or targets, because both can easily handle multiple dynamic input/output files per target. For the model compilation target in the targets example, the compile_model() function just needs to return the names of all the files that the model depends on.
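To make that concrete, here is a minimal sketch of what such a `_targets.R` file could look like. The body of `compile_model()` is hypothetical, and it assumes that `auto_write = TRUE` writes `model.rds` next to `model.stan`. Because I do not know the DSO file name, this sketch only returns the RDS file, which is exactly the gap I am asking about.

```r
# _targets.R (sketch, not a vetted implementation)
library(targets)

# Hypothetical helper: compile the model and return the paths of the
# files that the compiled model depends on, so targets can hash them.
compile_model <- function(stan_file) {
  rstan::stan_model(file = stan_file, auto_write = TRUE, save_dso = TRUE)
  # Assumption: the serialized model lands next to the .stan file.
  rds_file <- sub("\\.stan$", ".rds", stan_file)
  rds_file
}

list(
  # Track the Stan source so edits invalidate the compilation target.
  tar_target(stan_file, "model.stan", format = "file"),
  # format = "file" tells targets to treat the returned paths as output files.
  tar_target(model_files, compile_model(stan_file), format = "file")
)
```

Any additional binaries would simply be appended to the character vector that `compile_model()` returns.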

RStan should already be checking whether the mtime of the .stan file is later than the mtime of the .rds file. Their names are the same except for the suffix.
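As a rough illustration of that timestamp comparison (this is not RStan's actual implementation, and `needs_recompile()` is a hypothetical helper name), the check amounts to something like:

```r
# Hypothetical helper: TRUE if the .rds file is missing or older than
# the .stan file with the same base name.
needs_recompile <- function(stan_file) {
  rds_file <- sub("\\.stan$", ".rds", stan_file)
  !file.exists(rds_file) || file.mtime(stan_file) > file.mtime(rds_file)
}
```

Calling `needs_recompile("model.stan")` would then compare `model.stan` against `model.rds` in the same directory.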

I haven’t used it, but the drake R package could be helpful.

@wds15: @wlandau is the developer of drake :)

Oops… thanks for the hint…

Drake looks super cool… I just have not had the time to use it yet…

This is neat. I use drake in my paid work (where it has been an amazing tool to track/accelerate our pipeline development), and Stan in my volunteer work, so having @wlandau show up here seeking to unite the two is exciting! Unfortunately I’m not knowledgeable enough in this particular realm to help. Just wanted to express enthusiasm!

(And by reading the post more carefully I have now learned of the drake successor targets!)

Update: I am in the process of switching my Bayesian work over to cmdstanr, which appears to work much more seamlessly than rstan for drake and targets. I am feeling pretty satisfied about the transition so far.
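For anyone curious, here is a sketch of how the compilation target could look with cmdstanr. The target names `stan_file` and `exe_file` are just illustrative, and this is one possible arrangement rather than a settled best practice.

```r
# _targets.R (sketch)
library(targets)

list(
  # Track the Stan source file so edits invalidate the compilation target.
  tar_target(stan_file, "model.stan", format = "file"),
  # Compile with CmdStan and return the path to the executable,
  # which targets then tracks as an output file.
  tar_target(
    exe_file,
    cmdstanr::cmdstan_model(stan_file)$exe_file(),
    format = "file"
  )
)
```

Downstream targets that fit the model can then depend on `exe_file`, so they rerun only when the tracked executable changes.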
