Choosing the new Stan compiler's implementation language


#21

cling doesn’t have anything to do with R. I just fire cling up at the command line. Years ago I remember running a transition or two for a model but I don’t think I ever bothered to run a whole chain.


#22

But the -output-obj -o foo.c option to ocamlopt is in the current OCaml doc (which I guess moved to a new link). Anything that can, in principle, be compiled with gcc (or Fortran, but that is not relevant here) is, in principle, workable with R even if it means compiling a lot of libraries. And C functions can be called from R in memory. OCaml binaries probably can’t be called from memory, so you would have to make a shell call. Currently, I don’t even know how I would get an OCaml-based stanc onto CRAN to call if I didn’t build it myself in an R package.

Currently, packages like rstanarm are calling stanc in three places. We could perhaps think about ways to bypass those. But there are 64 Stan-related packages on CRAN currently with more to come. I don’t think a workflow that requires either an OCaml install or running the Stan code through some external thing every time you change the Stan code is a good one for those developers.


#23

Hm, seems like a point in Rust’s favor. Isn’t Rust easier to integrate into R than OCaml?


#24

I don’t have any direct experience with either OCaml or Rust. I am just assuming that anything that can be compiled with gcc is manageable no matter how burdensome.


#25

For R reference:

rustr

It looks like someone forked the R Ocaml library and has been actively developing on it

Just including this for info, I have no preference as this is wayyy out of my domain


#26

I think this is also worth noting
https://blog.s-m.ac/ocaml-rust-ffi/
Although the title says “Natively run OCaml from Rust”, which is not what we want, the blog post actually claims to “see how we can make OCaml functions available to either C or Rust, in two different ways” using the -output-complete-obj flag, which my local ocamlopt documents as follows:

ocamlopt --help | grep -F "output"
  -o <file>  Set output file name to <file>
  -output-obj  Output an object file instead of an executable
  -output-complete-obj  Output an object file, including runtime, instead of an executable

#27

I’m a bit confused. Wasn’t one of the things we wanted to do to move away from our dependency on CRAN’s C++ compilers?

Is it now obvious that a compiler in Ocaml is not going to fly? If so, then I’ll stop working on my prototype.


#28

Don’t stop working on the Ocaml prototype. I’m pretty sure if it came down to it, I could compile something one way or another.

I think basically everything that was said in the draft thread about compilers was wrong but I was waiting to talk to Bob about it rather than having another flame-war on Discourse.


#29

OK! I’ll keep working on it then. If it turns out we need to switch to Rust, hopefully the code could be ported without too much effort (except for the parser).


#30

Full disclosure I just lurk in the Stan community, but I am a member of one of the Rust teams.

There are a lot of parser libraries in the Rust community: nom, combine, pest and lalrpop off the top of my head. I can’t speak to their advantages over a handwritten one, as parsers are not my area of expertise. The Rust community is very welcoming, so it is not hard to get a hold of the maintainers if you have questions.


#31

Thanks! I looked at nom and combine but they’re parser combinator libraries aimed at parsing binary formats. I made prototypes in both pest and lalrpop - pest is not fully documented yet and I wasn’t able to parse something fairly simple with it. lalrpop is a bit more mature but when used to parse a programming language you have to write your own lexer and it seems like it’s not that far from a hand-written parser. My current opinion is that if we were to go with Rust we would write our own parser.

@Bob_Carpenter wants to take CRAN out of the critical path of a Stan release and avoid spending valuable maintainer time dealing with their weird CI system. So even before a new stanc is made we hope to release something similar to install_tensorflow() that first just downloads the correct C++ toolchain and eventually removes any dependency on CRAN compile servers by also downloading correct stanc binaries. Packages that use stanc at runtime will need to do the same thing that packages that use tensorflow do now, as far as I understand. And if they use it only at build time like rstanarm there are solutions 1 and 2 above (and maybe 3 if @bgoodri gets that working satisfactorily).


#33

Just to back up a bit, we don’t distribute a complete RStan on CRAN right now. The user still has to configure a C++ toolchain, which has been where most of our pain in installation has come in—getting that installed and talking to R and Python properly.

The goal for Stan 3 is to make it easier to install, not harder.

The current working draft proposal (see below) is aimed at trying to get consolidated, simple installers for users that encapsulate Stan, dependent libraries (Eigen, Boost, etc.), and CmdStan (including stanc, which is most likely going to be coded in OCaml). This would lighten the CRAN distribution, not complicate it.

I’m trying to douse the flames here, not accelerate them. I’m trying to toss this stuff out for discussion precisely to uncover potential problems and refine the design. So please comment.

LLVM would be all sorts of useful if we could get everything done that way. @seantalts has also been suggesting we do more with it.


#34

TL;DR: I think the language in the Stan3 draft proposal on installation

  1. Mischaracterizes the nature (largely toolchain configuration) and scope (largely Windows) of the problem
  2. Proposes a solution (“fully bundled RStudio installation using their C++ tools”) that does not exist and RStudio is not planning
  3. Ignores work toward a solution that is ongoing (RTools40) or that has already been implemented (the pkgbuild R package)
  4. Fails to consider what are likely the biggest installation challenges we will face in 2019 (having R that is built with g++-8 in April and what to do about stanc being in OCaml or Rust or whatever)
  5. Limits what we can do with ideas for Stan3 that Andrew and I have discussed at the meetings to work with Stan output in a way that is more consistent with probabilistic programming
  6. Does not live up to standards expected of / by people in the free and open source software (FOSS) community

The nature and scope of the problem

Calling install.packages("rstan", dependencies = TRUE) has rarely thrown an error and then doing library(rstan) allows it to sample from models that have already been compiled. Nor do the toolchain installers fail to install a toolchain. The primary problem has been the configuration step to use an installed toolchain among people who have never compiled code before, and even that is largely confined to Windows.

Up to Stan 2.17, we have had few installation / configuration problems on Linux and until very recently (when Apple moved its C++ headers out of /usr/include for Mojave) we have had almost no installation / configuration problems on Mac for many months prior to that, thanks to the installer put together by @coatless that uses the FOSS version of clang (that supports OpenMP) rather than Apple’s clang (that doesn’t).

But instead of saying “Windows has a difficult configuration step for RStan users so let’s work with other FOSS developers to address it”, the Stan3 draft proposal envisions changing the entire Stan installation process across all OS to be essentially a script that downloads a bundle of dependent libraries, toolchain stuff, and binary stanc. Whatever might be the benefits of that approach, it is repudiated by Linux distributions and would cause the removal of
https://tracker.debian.org/pkg/r-cran-rstan
from Debian free repositories and those of Debian-derived distros (like Ubuntu), which in turn would cause the removal of anything that depended on it.

“fully bundled RStudio installation using their C++ tools”

RStudio v1.2 does not come with a C++ compiler and I have not heard anything from them to suggest they are thinking about including one. I think the confusion stems from the fact that RStudio does use libclang for autocompletion and syntax checking in its editor
https://blog.rstudio.com/2018/10/11/rstudio-1-2-preview-cpp/
This is a big improvement but it does not help with RStan installation / configuration problems.

Also, I would note that RStudio is doing the right thing by not bundling libclang with RStudio on Linux, which allows users to download / install / utilize a more recent version of libclang from their distribution’s repositories.

Ignoring ongoing or already implemented progress

On Windows, the RTools installer does not, by default, edit the PATH variable because that would be irresponsible. If every piece of Windows software modified the PATH so that it came first, then a lot of other Windows software would break. Thus, RTools allows the user to “opt-in” to changing the PATH but they do not always do that and often do it incorrectly, particularly if the user has or has had previous versions of R or RTools.

So, an RStudio employee (Jim Hester) has recently released an R package called pkgbuild that essentially scans the Windows registry and does a lot of other things to find the appropriate version of RTools on the disk (and offers to download it from within RStudio if it cannot be found) and then temporarily modifies the PATH so that the appropriate version of RTools comes first for the duration of the expression passed to the pkgbuild::with_build_tools() function. In addition, you can execute pkgbuild::has_build_tools(debug = TRUE) to see what the problem is if there is one. RStan 2.18.x has been using pkgbuild::with_build_tools() to compile Stan programs at runtime, and it seems very promising overall, although it broke prophet on CRAN and one Stan user hit a rare bug where it was expecting a character vector for the CRAN mirror rather than a list of character vectors. But these are to be expected for a package that has only had a couple of releases.
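To make the "temporarily modifies the PATH" idea concrete, here is a minimal sketch of the same pattern in Python. It is only an illustration of the technique, not how pkgbuild is implemented; the function name tools_first_on_path and the directory used below are made up for the example.

```python
import os
from contextlib import contextmanager


@contextmanager
def tools_first_on_path(tools_dir):
    """Put tools_dir first on PATH for the duration of a with-block.

    Hypothetical helper: it mimics the idea of a scoped PATH change,
    it is not a port of pkgbuild::with_build_tools().
    """
    saved = os.environ.get("PATH")
    parts = [tools_dir] + ([saved] if saved else [])
    os.environ["PATH"] = os.pathsep.join(parts)
    try:
        yield
    finally:
        # Restore the caller's PATH even if the block raised.
        if saved is None:
            os.environ.pop("PATH", None)
        else:
            os.environ["PATH"] = saved


# Usage: the directory is an assumed install location, not a real default.
with tools_first_on_path("/hypothetical/rtools/bin"):
    pass  # invoke the compiler here; it is found in tools_dir first
```

The point of scoping the change is that nothing else on the machine sees the modified PATH once the block exits, which is why this is safer than having an installer edit PATH permanently.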

The pkgbuild package does not address the issue that Stan says that the C++14 standard is required but the released versions of R for Windows do not know anything about C++14. Thus, for RStan 2.18.1, a Windows user has to specify CXX14 = g++ in the ~/.R/Makevars file. However, this was only intended to be a temporary situation in order to get RStan 2.18.1 onto CRAN before the expiration of the eviction notices that were given to packages like rstanarm which come with Stan programs that did not build against StanHeaders 2.18.0 anymore due to their using the pre-C++11 compiler flags.

For RStan 2.18.2, the plan is to call pkgbuild::with_build_tools() in order to compile Stan programs at runtime but request that it be compiled with the C++11 standard (which the released versions of R for Windows do know about) while adding -std=c++1y to the end of the compilation flags, which will override the -std=gnu++11 flag added by R. So, it should be the case that a Windows user can run the Rtools35 installer without changing the PATH, rely on pkgbuild::with_build_tools() when building a Stan program at runtime to temporarily set the PATH to wherever RTools35 was installed, allow R to think it is using g++-4.9 with the C++11 standard, and actually use what g++-4.9 implements from the C++14 standard, irrespective of whether the Windows user has a ~/.R/Makevars file. In addition, it now seems to be the case that I can pass -march=native without crashing newer Windows machines so there is no performance penalty from not specifying it in a ~/.R/Makevars file.
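The override relies on g++ honoring the last -std= option it sees on the command line. A small sketch of that rule in Python follows; the helper effective_std is hypothetical and the -O2 flag is only there to make the list look like a realistic command line.

```python
def effective_std(flags):
    """Return the -std= value that takes effect: the last one given."""
    std = None
    for flag in flags:
        if flag.startswith("-std="):
            std = flag[len("-std="):]
    return std


# R contributes -std=gnu++11; -std=c++1y is appended at the very end.
flags = ["-std=gnu++11", "-O2", "-std=c++1y"]
print(effective_std(flags))  # prints "c++1y"
```

So R can believe it asked for C++11 while the compiler actually runs with whatever it implements of C++14.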

The implementation of this plan has been delayed for a variety of reasons, including three different CRAN administrators insisting that it won’t work due to the incomplete C++14 implementation in g++-4.9 and the fact that the Windows version of R that people download does not recognize the existence of C++14. My plan does actually work, but CRAN has a legitimate point that Stan 2.18 has caused a major disruption for many of the 64 Stan-related R packages on CRAN by insisting on the -std=c++1y flag while apparently not even using any of the C++14 features (I know we do in the unit tests because I wrote some polymorphic lambdas to test integrate_1d, but apparently nothing from C++14 is in the main part of Stan Math or Stan itself yet). This is an embarrassment, and requiring library versions or flags that are more strict than what is actually needed by a piece of software is frowned upon in the FOSS community and actually prohibited by the Debian Free Software Guidelines.

At the same time, the person who maintains RTools has put together a beta version of RTools40 to be used with the next major version of R (that is usually released in April). This can be used today with RStan and I have put instructions for doing so on a wiki page that is currently hard to find if you did not already know it was there because I first want people to get a RStan 2.18.2 on their released versions of R. RTools40 has g++-8.2 which has a complete C++14 implementation (and defaults to it), supports STAN_THREADS (and probably MPI / GPU stuff), and has a port of ArchLinux’s package manager that can install Windows versions for like 1000 libraries (and potentially stanc someday). Toolchains, libraries, and other things used by multiple clients should always be the responsibility of package managers but due to Microsoft’s inability to develop one for Windows, CRAN has had to play more of that role.

During Stan’s history, we have contributed zero dollars, opened zero bug reports, and made zero pull requests to RTools despite it being essential for Stan reaching as far as it has on Windows. This is even more deadbeat than companies with billion dollar market caps who use Stan but have contributed zero dollars, opened zero bug reports, and made zero pull requests to Stan. And now the draft is proposing a Stan Installer that comes with a binary C++ toolchain that would look pretty much exactly like RTools40 except it would sit under whatever directory Stan was installed in, take up GB of space, and not be easily used to compile any other R package or FOSS project that the user wants. But none of the paid Stan developers has much experience with, time, or inclination to develop C++ toolchain installers, so we would probably end up paying a mercenary to do it instead of funding the sole developer / maintainer for RTools:

Big challenges in 2019

When R 3.6 is released (likely in April) it will be built with RTools40 and g++-8.2, in which case we will face the challenge that R 3.4 and 3.5 will be built with RTools35 and g++-4.9.3. With g++-8.2 we could use all of the C++14 standard and even all of the C++17 standard. But a lot of companies and Linux servers won’t be ready to upgrade to R 3.6 on the day / week / month / year it is released. Technically, using a C++17 compliant compiler would be fine on Mac / Linux with R 3.4 and 3.5, but CRAN does not seem to allow R packages to have a different R version requirement for Windows than other operating systems.

The Stan-related R packages on CRAN are going to have to be compilable with both g++-4.9 and g++-8.2 until we force a move to R 3.6. We should be trying to make it easier for other FOSS developers to write packages like rstanarm that come with pre-compiled Stan programs. The rate of people that have the technical skills to write and validate good Stan programs is not accelerating, so Stan is going to be more reliant on such packages (and things like brms and tmbstan) to keep accelerating the rate at which people use Stan for better statistical analysis. The Stan3 draft says that it is committed to supporting things like rstanarm but doesn’t say how that is actually going to be accomplished during all these changes.

Another big challenge is how to build and call a stanc on CRAN that is written in OCaml, or Rust, or some other language that is not C, Fortran or C++. There has been some brainstorming upthread about this but there doesn’t seem to have been any before the draft was written. I think there are some routes that could work (like dumping out the C code and building it with gcc along with the OCaml runtime or getting CRAN to install ocamlopt that can dump out a shared object that can be processed by gcc) but downloading a Stan Installer onto CRAN as part of the RStan build / installation process likely isn’t one of them. It seems at a minimum that would be something CRAN would expect a user to affirmatively opt-in to, which makes it incompatible with the automated processes used by CRAN. For example, how would a package like brms build its vignettes, run its unit tests, etc. on the CRAN servers without being able to invoke stanc unless a person presses the Y key to install it?
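To illustrate why an opt-in prompt and an automated build server do not mix, here is a minimal sketch in Python. Everything in it is hypothetical (the function name may_install_stanc, the prompt text, the assume_yes switch); it only shows the general pattern of refusing to proceed when no human is attached to the session.

```python
import sys


def may_install_stanc(assume_yes=False, stream=None):
    """Decide whether a (hypothetical) stanc download may proceed.

    assume_yes: an explicit opt-in given ahead of time.
    stream: input stream to consult; defaults to sys.stdin.
    """
    if assume_yes:
        return True
    stream = sys.stdin if stream is None else stream
    # On a build server there is no terminal, so nobody can press Y.
    if not stream.isatty():
        return False
    print("Download and install stanc? [y/N] ", end="", flush=True)
    return stream.readline().strip().lower() == "y"
```

On a check farm the function returns False unless the opt-in was given in advance, so vignettes and unit tests that need stanc could not run there.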

Probabilistic Programming

The idea of Stan3 has meant different things to different people at different times over the past four years. For @ariddell, it was mostly about putting things that were implemented separately in each of the interfaces into a Stan API and unifying naming conventions. For @Matthijs, it is mostly about redesigning the Stan language (even if that is not technically incompatible with the previous language, it will be a big break in users’ minds). The actual draft for Stan3 on Discourse has a lot of bullet points about GitHub that could be done at essentially any time with no adverse effect on backwards compatibility.

For me, RStan3 was mostly about moving to ReferenceClasses or R6 classes to adopt more Pythonic calling conventions and be able to do more probabilistic programming things with the output, along the lines of some goals @andrewgelman had. To do that effectively in R, you need the instantiated Stan program to be in RAM so that you can query information about the symbols, access the things in the data and transformed data block that were conditioned on, invoke transformations that were applied to the parameters, know what the state of the chains is so that you can advance it more, etc.

All those things are much harder or impossible to do if the Stan model is not in RAM. And it is not at all clear how I would get the Stan model into RAM if it were compiled by a compiler that comes with the Stan Installer and differs from the compiler (flags) that were used to build R (particularly on Windows). It seems that the draft is primarily concerned with just drawing from the posterior distribution with a shell call and getting the output in a JSON or binary format (which is fine) rather than anything that has been done (e.g. expose_stan_models, log_prob, etc.) or could be done dynamically.

FOSS Responsibilities

It would be a great April Fool’s day prank to troll the Boost mailing list with a proposal to bundle a C++ compiler with the next Boost release that worked with all the newer Boost libraries. After all, what good are all these C++ header files if you don’t have a C++ compiler to build stuff with them? Hopefully, one of the Boost developers would fall for it and go off on what a bad idea it would be technically and how irresponsible it would be to even hint that Boost was not FOSS-compiler agnostic or willing to fix reproducible issues that arose with any FOSS compiler.

The notion that a piece of software should come bundled with everything it needs to execute is a corporate idea that tends not to be thought of highly by the FOSS community. Instead of trying to get one piece of software to work without regard for how it interacts with other software, FOSS developers usually take the position that FOSS should intensively utilize other FOSS libraries, compilers, etc. and if there is a problem then you work with the other FOSS developers to address it so that the improved FOSS libraries, compilers, etc. work better for all FOSS projects that utilize them. Some things Stan could do along these lines are to help @coatless make the Mac compiler installer work on Mojave, or help him make it so pkgbuild will offer to install it on a Mac if it is not there, or help him make a version available with clang-6 to use with R when it is built with clang-6. Or help Jeroen figure out how to get link-time optimization to work on Windows so that Stan models (and everything else) compile faster while using less RAM and running faster. Each of those things would benefit both Stan users and non-Stan users who are building R packages or even non-R packages. That way Stan isn’t just taking from the efforts of the FOSS developers but actually giving something back to the FOSS community.

Finally, I wish you would stop lumping dependent libraries into the discussion of installation issues. BH and RcppEigen have never caused a Stan program to not compile on Windows nor presented any additional installation / configuration obstacle, but they do effectively trigger an update to the Boost and Eigen sources under Stan Math (that hasn’t been problematic in a long time). Just say that you favor a more corporate distribution model that bundles a particular version of the libraries and the compiler instead of exploiting the difficulties Windows users have historically had configuring Rtools to justify developing an installer that just-so-happens to also pin the versions of the dependent libraries across all OS because you do not want Stan developers to have to deal with the effects of changes in the dependent libraries. You could end up winning that argument but don’t win it with red herring arguments and by ignoring the counterarguments of Linux users and others who identify strongly with the FOSS community that projects like Stan are expected to at least be testing with the beta versions of the dependent libraries in order to hit as many problems as possible and get them fixed before their final releases.


#35

The INLA team’s experience is that CRAN is not very pro packages that rely on externally hosted binaries. Although they haven’t yet, they have recently threatened to boot packages off CRAN that depend critically on INLA because of this. CRAN is the main way most R users interact with R packages so this would be very bad.

It would be worth trying to pin down CRAN on their policy before taking this route.


#36

Hi all. I don’t have much if anything to add on the technical issues but if anyone wants a user’s perspective I’m happy to share.
A


#37

If OCaml is really worth the effort their build looks straightforward and relies only on things already in rtools (GNU make, GCC, etc…) so it might be reasonable to package ocaml as an R package and provide it to Stan that way. I can check how resource intensive the build is.


#38

I’m fairly sure PyPI (Python’s CRAN) will allow external binaries.

That said, I’m in broad agreement with what Ben said. Much of it applies
to Python. In particular, I think Python’s equivalent of pkgbuild
(setuptools or conda) is likely to be far more reliable than a suite of
bash/powershell/whatever scripts.

But if there are developers interested in working on such a solution, I
wouldn’t want to discourage them. There’s some chance it could work for
some users. Experimentation and alternative implementations are
typically a sign of health in FLOSS.


#39

That’s awesome, I’ve been using Arch for a while and I was about to start using it for managing random scientific packages without root privileges.


#40

@bgoodri thank you for the detailed write-up and mention.

My plans going forward are to modify r-macos-rtools such that:

  • support for clang-6 exists.
  • Mojave is supported.
  • compiler paths are set via .Renviron instead of Makevars

I should have a new update later today or tomorrow.


With this being said, the main issue I’m running into is my Apple developer program credentials have lapsed. As a result, I can’t easily sign the installer and, thus, when the installer is made available there will be a Gatekeeper popup.

Ask: Is anyone on the Stan team enrolled in the Apple developer program and would be willing to sign the installer?

I’m reluctant to fork out $100 for the 1-year enrollment as I am rarely building software packages needing an installer. (Note: This amount was previously donated by Prof. Timothy Bates to address the Gatekeeper popup).


#41

Thanks @coatless . We can take up the $100 at the Stan developers meeting Thursday.