Choosing the new Stan compiler's implementation language

Hey all,

@sakrejda asked for a write-up of how we thought about which language to choose. Here are the desiderata for a language to write a compiler in, some of which are specific to Stan’s needs given that we essentially ship desktop apps (as opposed to cloud services).

  1. Pattern matching - this makes tree transformations first-class in a language, enabling you to write compiler phases much more easily (see the sketch just after this list).
  2. Fun - we want something that will inspire programmers to volunteer and help us build the compiler.
  3. Amenable to research - it would be great if the language enabled easy collaboration with the broader academic Programming Language Theory community.
  4. Solid, modern tooling - testing frameworks, package and dependency management, code formatting, build systems… these little things count for a lot.
  5. Distribution - must be capable of producing statically-linked native binaries containing all of their own dependencies.
  6. Community - it’d be great if there was a good community we could turn to for help and programmers.
  7. Production use - we want something that will be battle-tested and maintained by other people because it’s important to their business.
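
To make point 1 concrete, here’s a minimal sketch of what a compiler pass looks like with pattern matching in OCaml. The AST below is a toy invented just for illustration, not anything from the actual stanc3 design:

```ocaml
(* Toy expression AST, purely for illustration. *)
type expr =
  | Lit of float
  | Var of string
  | Add of expr * expr
  | Mul of expr * expr

(* A tiny "optimization pass": fold constant subexpressions.
   Each tree shape is one pattern, and the compiler warns us if we
   forget a case. *)
let rec constant_fold (e : expr) : expr =
  match e with
  | Lit _ | Var _ -> e
  | Add (a, b) -> (
      match (constant_fold a, constant_fold b) with
      | Lit x, Lit y -> Lit (x +. y)
      | a', b' -> Add (a', b'))
  | Mul (a, b) -> (
      match (constant_fold a, constant_fold b) with
      | Lit x, Lit y -> Lit (x *. y)
      | a', b' -> Mul (a', b'))
```

The equivalent pass in C++ means visitor classes and a fair amount of boilerplate; here it’s a dozen lines and the exhaustiveness checker keeps it honest.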

C++ does well on the last three, but it mostly fails the first four. OCaml and Rust both do a good job on all of these. Choosing between those two, it so far mostly comes down to three things:

  1. Parsing libraries - Rust doesn’t really have good parsing libraries for real PL development, while OCaml has the awesome Menhir parser generator (a grammar sketch follows this list).
  2. Community and inertia - Rust and OCaml are both smaller than C++, but Rust has a lot of inertia and OCaml has a long history and solid base with Jane Street and INRIA still actively maintaining it.
  3. Pattern matching - in Rust, pattern matching is hampered a bit by the borrow checker, and programming in Rust generally carries some overhead from it.
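
And to give a feel for the Menhir point in item 1 above, here’s roughly what a grammar fragment for the toy expression language might look like. Again, this is a sketch with made-up names, not the actual Stan grammar:

```ocaml
/* toy_parser.mly -- hypothetical Menhir grammar fragment */
%{
  open Ast  (* assuming the toy expr type above lives in ast.ml *)
%}

%token <float> NUMBER
%token <string> IDENT
%token PLUS TIMES LPAREN RPAREN EOF

%left PLUS
%left TIMES

%start <Ast.expr> program
%%

program:
  | e = expr EOF { e }

expr:
  | x = NUMBER                 { Lit x }
  | id = IDENT                 { Var id }
  | e1 = expr PLUS e2 = expr   { Add (e1, e2) }
  | e1 = expr TIMES e2 = expr  { Mul (e1, e2) }
  | LPAREN e = expr RPAREN     { e }
```

Menhir checks the grammar for conflicts at build time and gives decent error-reporting hooks; that’s the kind of machinery we’d have to hand-roll on the Rust side.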

This decision is not taken lightly. We’ve implemented a simple arithmetic expression interpreter end-to-end here in both OCaml and Rust, and you can (and should, if you have time!) compare the relative complexity for yourself. Going on gut and experience (along with Matthijs’s enthusiasm for Menhir), the parsing library situation and the lack of a borrow checker point us in favor of OCaml. Not having to write our own recursive-descent parser (as we would need to in Rust) will save us a lot of time and minimize the time spent maintaining two systems.

3 Likes

What is the planned output of the compiler? Are you going to generate C++ code or generate binaries directly or something else?

First we’ll generate C++ the way the current stanc does. Longer term, I’d like to compile as much ahead of time as we can, and perhaps add an interpreter backend as well so we don’t need a C++ compiler at runtime.

I’m not that familiar with either OCaml or Rust, but it appears as if the OCaml compiler can generate compilable C code:
http://caml.inria.fr/pub/docs/manual-ocaml/intfc.html#s%3Aembedded-code
So, if it were the case that only the interface maintainers needed a machine with the OCaml dependencies and could include the C code for stanc in their sources, that would be easier than trying to get the OCaml dependencies everywhere someone might try to build stanc. Although the C code may have to link against some additional libraries, which could be hard.

I’d be surprised if OCaml generated code didn’t need some runtime library.

Yeah, but I was assuming I could just use gcc to compile that runtime library. At least C functions can be called from R. As far as I can tell, there is only one (now defunct) project that tried to integrate OCaml and R.

Thanks, Sean—that’s a fantastic summary of the huge amount of work that’s gone into this decision. It’s like writing a grant—you have to do a ton of the work to convince yourself it’s even possible.

We’ll also be outputting human-readable, serialized ASTs, which should make it easy for someone in another language to pick them up and do what they want with them.
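
To give a flavor of what that could look like (the output format isn’t settled, and this hand-rolls an s-expression printer over a toy AST rather than the real Stan one):

```ocaml
(* Hypothetical serializer over a toy AST; the real stanc3 AST and its
   serialized format are still to be decided. *)
type expr = Lit of float | Var of string | Add of expr * expr | Mul of expr * expr

let rec to_sexp (e : expr) : string =
  match e with
  | Lit x -> Printf.sprintf "(Lit %g)" x
  | Var name -> Printf.sprintf "(Var %s)" name
  | Add (a, b) -> Printf.sprintf "(Add %s %s)" (to_sexp a) (to_sexp b)
  | Mul (a, b) -> Printf.sprintf "(Mul %s %s)" (to_sexp a) (to_sexp b)

(* to_sexp (Add (Var "x", Lit 2.)) gives "(Add (Var x) (Lit 2))",
   which another tool can parse without knowing any OCaml. *)
```

In practice we’d probably generate this with something like a sexp or JSON deriving library rather than by hand, but the idea is the same.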

An interpreted Stan, even running at a fraction of the speed of the compiled code, would be transformative. Game-changing? Sorry—can’t help myself—it’s grant season and North American baseball finals time (what we quaintly call the “World Series” in these parts).

Did you have something in mind you wanted to call? Is it OK to just exec it the way we have to exec g++ or clang++? The output model class will be a C++ object that can be called. More on that design ASAP.

2 Likes

I forgot to add that we can provide an output mode to produce a machine-readable/human-readable AST, the full Stan program, and the C++ model class. Is there something else you’re thinking about, @bgoodri?

Having stanc output whatever C++ is fine. I am worried about getting stanc built and called if it is written in OCaml. If I dump out a C representation with ocamlopt -output-obj -o stanc.c stanc.ml, then I can include stanc.c in the rstan sources. But I may have to build much of the OCaml runtime library in the rstan sources in order to link stanc.c to it.

I’d be happy to look into compiling the new stanc OCaml code into C if that makes the transition easier when it comes out in 6 months or so. But by then the new installers will hopefully exist and this won’t be an issue, since they will let us easily distribute statically-linked native binary versions of stanc.

This was what got me most excited! But @Matthijs eventually convinced me that we should start by targeting the same C++ interface the current stanc targets, as that lets us skip figuring out ahead-of-time compilation of the math library and release stanc3 sooner (we want to minimize the time we spend maintaining two distinct compilers). But I still very much think it’s feasible. Even if we don’t build a true interpreter, we might link against LLVM and libclang so that we don’t need to distribute a C++ toolchain anymore and can compile our own emitted C or C++ at runtime. That, plus compiling as much of the math library ahead of time as possible, should ease the distribution and compilation pain points.

I think it is safe to assume that CRAN is not going to install a statically-linked native binary version of Stan on its servers. They may or may not install the OCaml libraries. So, no package like rstanarm that parses a Stan file is going to be able to build on CRAN unless rstan is self-contained.

Good point. So obviously the rstanarm CRAN package could use the same install() approach, but that’s not as cool as the current one step install. Let me think about fixing that…

When building on CRAN, can you make web requests? You could download the stanc binary and run it and use the normal CRAN build process to generate the rstanarm binary. Worst case, you could check in the compiled stanc binary into the repo used to build rstanarm. Or you could have non-CRAN continuous integration run stanc on your models and check in the generated C++ model code, and then compile as normal.

The CRAN process does check whether URLs in documentation are valid, but no, I don’t think that CRAN is going to go along with downloading binary stuff from the Stan website. Packages such as rstanarm do include the C++ in their sources now, but they also create stanmodels and call stanc to associate the Stan code with them.

I think building the necessary OCaml libraries in the rstan sources in order to link against them is a relatively easy task, because that appears to just be some gcc stuff. More difficult would be to convince CRAN to install the OCaml libraries on their servers from source; that is a non-small undertaking given all their servers, and it may have to be updated from time to time. Harder still would be to convince CRAN to build a Stan installer from source on CRAN and install from it.
Most difficult, and probably impossible, would be to convince CRAN to download binary Stan stuff.

Has existed for a few years now
https://groups.google.com/d/topic/stan-dev/wiUMmmDsrHw/discussion

1 Like

Wait, sorry, let’s break these out into different options and talk about them. Let’s ignore downloading things on CRAN.

  1. Checking in the stanc binary into rstanarm’s git repo
  2. Checking in all of the generated C++ code required so that CRAN can build without stanc
  3. OCaml -> C - Just going off my priors from similar projects, I doubt it’s a trivial undertaking to get OCaml to spit out good C that CRAN can just compile. For example, the feature you’re talking about appears to emit a C file containing mostly OCaml bytecode, which must be compiled and linked against the full OCaml bytecode interpreter. CRAN can’t install OCaml, and OCaml itself is written in OCaml, so we can’t build it from source on CRAN either. So that one seems like a no-go. There are other attempts, but they get increasingly janky.

For (3), my understanding from the link above, which points to an old version of the manual section I was linking to, is that it is possible to compile the bytecode with gcc and then link against the OCaml runtime (which I would probably have to build).

For (2), the C++ already gets generated when a package like rstanarm is uploaded to CRAN, but it needs a stanmodel to put the compiled Module into.

I was slightly wrong. Actually, the C++ is being generated from the Stan code during installation. But it still needs a stanmodel to hold the Module.

That’s the same link I posted, but the link to the manual is broken. The first hit for a Google search on that is some GitHub project from 2013 with 100 commits. It seems like (1) and (2) from above should be much, much cheaper and more robust. What does it mean that it “needs a stanmodel to put the compiled Module into”? Sorry, I’m not that familiar with the R side of things. But I’m assuming we can check that in, too.

What’s left to ship cling as a backend with RStan? Has anyone successfully run a model on cling yet?