Choosing the new Stan compiler's implementation language

Rust strikes me as a good target language for the compiler

I don’t think anybody was proposing Rust as a target language. I’m skeptical that Rust would be a good choice because I’ve read that the type system gets in the way of easy manipulation of multidimensional arrays.

Just commenting where I think there are misunderstandings and leaving my value and moral judgements out of it:

Anything you can expose through a C API and DSO you can expose with an arbitrary RPC (remote procedure call) system over e.g. a socket. I don’t see why you think there are fundamentally different capabilities with a DSO - the one difference is that with a DSO you can share memory, which was previously thought to be important for performance but I am happy to lay that idea to rest as well if need be.
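To make the comparison concrete, here is a minimal sketch of the RPC side in Python, using only the standard library. Everything in it is made up for illustration: `log_density` is a stand-in for a compiled model function, and the one-JSON-object-per-line protocol is an assumption, not anything Stan or gRPC actually uses.

```python
import json
import socket
import socketserver
import threading


def log_density(theta):
    """Hypothetical stand-in for a compiled model function:
    standard-normal log density up to an additive constant."""
    return -0.5 * sum(x * x for x in theta)


# Functions the server is willing to expose, looked up by name.
FUNCTIONS = {"log_density": log_density}


class Handler(socketserver.StreamRequestHandler):
    def handle(self):
        # One JSON request per line: {"fn": name, "args": [...]}.
        for line in self.rfile:
            request = json.loads(line)
            result = FUNCTIONS[request["fn"]](*request["args"])
            reply = json.dumps({"result": result}) + "\n"
            self.wfile.write(reply.encode())


def start_server():
    # Port 0 asks the OS for any free port on the loopback interface.
    server = socketserver.ThreadingTCPServer(("127.0.0.1", 0), Handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def call(address, fn, *args):
    """Send one request and block until the reply arrives."""
    with socket.create_connection(address) as sock, sock.makefile("rw") as f:
        f.write(json.dumps({"fn": fn, "args": list(args)}) + "\n")
        f.flush()
        return json.loads(f.readline())["result"]


if __name__ == "__main__":
    server = start_server()
    print(call(server.server_address, "log_density", [1.0, 2.0]))
    server.shutdown()
    server.server_close()
```

Note that every call serialises its arguments and copies bytes through the socket. That copying is the shared-memory difference mentioned above: a DSO call would pass a pointer instead.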

Non-admins cannot install to this location.

Kevin, if you or another Haskell fan were willing to help us with this diff with either research into OCaml or a prototype we can compare, that would really help us to more rigorously reconsider Haskell.

Wait, didn’t you just say you weren’t that familiar with OCaml? :P

It’s not obvious to me we need the strictest possible language (that wouldn’t be Haskell anyway, that’d probably be Idris or Coq or Ada or …) - we’re not sending spaceships to the moon, we’re translating from Stan to C++. So clearly the different spots on this purity continuum all have some merit and it’s a bit of a judgement call based almost entirely on things that aren’t language features - things like the above desiderata, and the people we have involved with and likely to contribute to the Stan compiler going forward.

I think all of the things you’re saying about Haskell I have heard said about multiple other languages with static typing or XYZ language feature that fixes all your code real good.

I’m so so so so sorry to have started a language war here. I blame @sakrejda :P

This is a very legitimate concern, perhaps the one most dear to my heart. On your link, Haskell is down near 0.24% or something, so it seems the fidelity is kind of low. I looked around and found this huge StackOverflow survey: Stack Overflow Developer Survey 2018

Looks like Haskell is at 53.6% “loved”[0] with OCaml at 41.5%. Notably, Rust is #1 at 78.9% (hopefully this helps answer some other questions raised in the thread above). So according to this, we might have a 20% easier time finding open source contributors with Haskell over OCaml, and 33% easier with Rust over Haskell. If you or other Haskell people want to help contribute a proof-of-concept using the currently accepted best libraries for e.g. parsing (and help us figure out which language extensions we should be using, etc) I think Haskell could be in the running. It would need to show that the symbol table monad is not unduly burdensome, a common subexpression elimination pass is easy to read, parsing is relatively painless and can generate good error messages for annoying parsing tasks (like missing semicolons), and that kind of thing.

[0]% of developers who are developing with the language or technology and have expressed interest in continuing to develop with it
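A quick way to check the comparison above, sketched in Python. The percentages are the ones quoted from the survey; the helper names are invented here, and treating the "loved" share as a proxy for how easy it is to find contributors is only the rough assumption made in the post.

```python
# "Loved" percentages as quoted above from the 2018 survey.
loved = {"Rust": 78.9, "Haskell": 53.6, "OCaml": 41.5}


def relative_gain(a, b):
    """How much larger a's share is than b's, as a percentage of b's share."""
    return 100.0 * (loved[a] / loved[b] - 1.0)


def relative_shortfall(a, b):
    """How much smaller b's share is than a's, as a percentage of a's share."""
    return 100.0 * (1.0 - loved[b] / loved[a])


for a, b in [("Haskell", "OCaml"), ("Rust", "Haskell")]:
    print(
        f"{a} vs {b}: "
        f"{loved[a] - loved[b]:.1f} points, "
        f"+{relative_gain(a, b):.1f}% relative to {b}, "
        f"-{relative_shortfall(a, b):.1f}% relative to {a}"
    )
```

The rough 20% and 33% figures are closest to the shortfall measure (about 23% and 32%). Measured as a gain over the lower language, they come out nearer 29% and 47%.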

I’ve played around a tiny bit with OCaml, but haven’t written anything non-trivial in it. My guess is that it would be a good choice for the compiler, as the ML family of languages grew out of the meta-language of the LCF theorem prover (the ancestor of the HOL proof assistant), which involves lots of compiler-ish stuff, including sophisticated parsing and manipulation of symbolic expressions. I was just reacting to the complete absence of Haskell from the list, not insisting that it is the One True Solution. I do think that Haskell is a better-designed language, but it does have a steep learning curve.

A difference is that expose_stan_functions calls Rcpp::sourceCpp, which creates the C API and calls dyn.load to put the functions in R’s memory. You can call args on them. It is a pretty well-worn workflow. I am not familiar with anything similar using RPC stuff to make compiled functions available inside an R session, but we could look into it. We do stuff with sockets for the parallel chains already, but that is background.

If that is a concern, we can make a fork that tweaks the paths on

to some place writable or create a menu option to let the user choose a path. The main point is that this is the most complete, modern, etc. compiler for Windows available right now so CmdStan should use it.

I’m not sure if something amazing exists already, but it would look something like https://grpc.io/ which doesn’t have official R bindings. So we might need to make our own bindings for that (or choose something similar) - and to be totally clear, it’s fine if this is a separate (FOSS) module!

Likewise with the rtools40 toolchain - if we can use 95% of something that already exists that’s awesome. And of course everything we write will be FOSS. And good software design splits these things up modularly if possible, of course. So a totally valid way of achieving the high level goal (paraphrased as something like “we need to really take control of our install process and provide a path for Windows users without admin rights with no GUI clicks”) might be to contribute PRs to an existing toolchain installer like this. I hope it’s okay with you if a mercenary works on that :P The higher level goals here are what’s important - for this one, attempting to reduce the number of potential Stan users we lose due to toolchain issues. And for the DSO thing, attempting mostly to allow us to spend much less time on code maintenance, RStudio crash bug-hunting, releases with CRAN, with a nice-to-have of using C++17 sooner.

Yes. I should say my previous post about mercenaries could easily have been interpreted as saying I have something against people who do freelance contract work, so I should have phrased it more explicitly. There isn’t anything wrong with being a software mercenary, but there is something wrong with a project like Stan contributing no dollars, no bug reports, and no pull requests to other FOSS projects that we have relied on for seven years and then paying someone else to build something Stan-specific that overcomes some of the challenges we have had with those FOSS projects, instead of first trying to work with them to build something that benefits all sides.

Obligatory https://www.joelonsoftware.com/2000/04/06/things-you-should-never-do-part-i/

One quibble: Ben said “The Rtools installer does not require admin rights to install.” I need admin rights to install it on my work computer.

FYI: httpstan is essentially the RPC system described here. The only real
difference is that httpstan uses HTTP 1.1 and gRPC requires HTTP 2.
PyStan 3 is just a thin client making “RPC” calls. It works fine.
Supporting expose_stan_functions will be tricky but I think it’s doable.

@jmh530 Is that the case if you are installing Rtools to a writable directory and do not check the box to edit the PATH?

Dear Kevin,
have you seen the dependencies of haskell-stack in ArchLinux (AL)? It is required by no package, but has over 60 (!) dependencies on various haskell packages.

You just ruled out Rust and OCaml…

You are right that I forgot to mention also Fortran, Java, PHP and perhaps some other languages. But this already shows what I intended to say at the very end of my previous comment: the diversity of languages is increasing, and with it the number of dependencies (and if you continue this chain, also the number of threads). Although diversity is expected to provide stability according to ecological sciences, I do not know (and honestly have doubts) if this works out in operating systems. I guess many users want to know and control what is going on in “their house”. That is why I try to opt for the minimal and at the same time the best solution. Anyway, the devs are great experts in this field and they will go for the best approach. BTW, I like the discussion in this forum a lot!

This is a very nice feature to have. Especially for the audience rstanarm is aimed at.

I’m wondering how much it factors into people’s decisions to use brms vs. rstanarm.

And I’m wondering how much the advantage of not requiring a compiler will change if having a compiler is easy.

In the end, I think for the users who want to use something like brms or rstanarm, we want to make installation as easy as possible. One touch on CRAN is about as easy as it gets for most R users, so rstanarm is great there.

This is helpful to know.

I’m not sure if you were serious or not, but I very much do not want to try to filter our user base by making installation an obstacle.

I don’t think we know the answer to the mostly part. The problem with mailing lists is that we largely hear about problems people have. Nobody opens an issue or mails the list when installs go smoothly.

I think it might be helpful for us to grab some typical users (let’s say first-year stats students who’ve never used Stan) and see if they can get it installed on their Windows or Mac machine. It would be useful to see where they fail now no matter what we do.

I wouldn’t say this is easy. Then we have to get people installing servers, which can also be problematic on locked down machines and brings its own headaches. Not insurmountable, but I don’t think calling this “easy” is fair.

And it won’t be as efficient, but @ahartikainen and @ariddell are going with this model for PyStan3 (they’re using http, but I’m not sure what protocol is being passed).

My point isn’t that we can replace the efficiency of the log density and gradient and transform function, but that maybe it’s not that important that we have them.

The future’s more relevant than the past here. Even Python (1990) is well past the very start of the most common OS’s.

I didn’t realize that was even an option with something like Rtools. This leaves me confused—why didn’t the R developers on Stan try to work with Rtools if that was an option?

I’m also considering the following as major, systematic problems:

  • the time spent wrestling with CRAN, which slows down our release cycles dramatically

  • the upper bound it places on us for development

    • C++ compilers
    • Eigen, Boost, etc.
  • the lack of ability to synchronize releases on CRAN

  • whatever Apple does next

The problem is that these are unknown unknowns that we get tied to by assuming we run on system-installed versions of everything.

I agree and that’s why we’re at the proposal stage trying to evaluate consequences of various decisions. We each have our own end of the elephant and I’m trying to get a higher-level perspective so that we can speed up development across the project.

The problem we’re trying to fix is that Stan’s hard to install. The goal is to make it easier to install. I’m not sure what you mean by distribute—I think of that as just shipping bytes around.

Is there a design doc for any of this anywhere? I’m not sure what you mean by PPL stuff here. You mean representations of random variables based on draws and posterior processing?

Why is it marked beta? I’m always reluctant as a user to start downloading beta versions of things that have a long history of making stable releases (though in RTools, I believe some of their stable releases are based on betas of other tools, but I could be misremembering here).

I’m a bit confused about what you mean by Stan here. You and I and everyone reading this are Stan. It’s not a resource we can apply to a problem. We can ask people to help. We can create issues. You can do it or I can do it. But there’s no Stan out there to do things! This is one of the things that makes this project very hard to manage. Another option is requesting support through NumFOCUS, which is the closest thing the Stan project has to a budget.

That’s what people say about measure theory and differentiable manifolds, too. What I mean is that they aren’t particularly difficult as instances of advanced mathematics, but as concepts for applied and computational statisticians, they’re often insurmountable hurdles.

Whoever wrote that didn’t write numerical algorithms. Of course, we’re not considering something like OCaml for numerical algorithms.

All you need is continuation-passing style in purely functional programming to recreate the full power of goto. Being purely functional isn’t an end in itself in realistic settings, because too much winds up getting packed into state that’s passed around and you have the same problem only now it’s at the object level rather than the meta level.
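As an illustration of the continuation-passing point, here is a small sketch in Python (chosen only for readability; the names `trampoline` and `sum_to` are made up). Each tail call plays the role of a jump to a label, and all the state has to travel along as arguments.

```python
def trampoline(thunk):
    """Run a chain of "jumps".

    Python has no tail-call elimination, so each jump is returned as a
    zero-argument function and executed here instead of growing the stack.
    """
    while callable(thunk):
        thunk = thunk()
    return thunk


def sum_to(n, k):
    """Sum 1..n in continuation-passing style; `k` receives the result."""

    def loop(i, acc):  # acts like the label "loop"
        if i > n:
            return lambda: k(acc)  # "goto" the continuation (the exit)
        return lambda: loop(i + 1, acc + i)  # "goto loop" with new state

    return lambda: loop(1, 0)


if __name__ == "__main__":
    print(trampoline(sum_to(10, lambda total: total)))
```

Nothing is mutated, yet the control flow is an arbitrary jump graph, and `i` and `acc` are exactly the kind of state that ends up being passed around by hand.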

I haven’t used either Haskell or OCaml before, by the way. But I have used denotational semantics.

Indeed. Though for parsers, the memory semantics is usually trivial. Right now, everything’s being done in C++, for example. And there’s no worry at all about memory in the framework if you code in standard C++ idioms (that is you don’t malloc in your semantic actions!).

Best quote of the thread.

It hasn’t devolved to that yet, but the bait’s right there in the title.

Are we talking Rtools here? Did you find bugs that you just didn’t report? I didn’t know we had any gripe with it other than the versions of C++ they chose, which I didn’t think were negotiable.

We’ve certainly filed bug reports and even suggested fixes to Eigen and Boost.

I’m a little confused on the OSS ethics thing and what you think our obligation as a responsible OSS project is. I sort of figured we were ethics neutral if we were producing a ton of code as a project. Are we supposed to be contributing to Linux, the C++ compilers gcc and clang, the intermediate languages Python, R, and Julia, all the R packages we use like Rcpp and ggplot2 and similarly for other languages, the core C++ libraries like Sundials, Eigen and Boost, the installers like homebrew and Rtools, and basic unix tools like make, etc.? It’s a big list of things we use and it would be interesting to lay that out. But we can barely afford to keep our own operation going, so I’m not sure where you think the donations should come from. I don’t think we can take the grant money we take in and donate it, so it’d probably have to be out of NumFOCUS or we could just donate out of pocket ourselves.

Wow! I remember when Joel was being quoted everywhere.

I think it’s dangerous to make unqualified recommendations as one-sided as this.

The most extreme case of this approach was at McDonnell Douglas. A friend of mine from grad school worked there rewriting the 1961 IBM operating system on every new piece of hardware that came out so they didn’t have to rewrite their giant CAD-CAM program. Now this thing was written on cards and was never going to change. So it was just a matter of keeping it working. That’s about where we’re at with our C++ parser.

If we ever want to modify the house of cards we have, we need to change it. Nobody can modify the existing code because nobody can make a single change and get Spirit Qi to compile.

It’s not a question of whether we rewrite it, it’s when. I’ve been wanting to do this for a few years now. We had to refactor the major guts of the current system to make any progress at all on the type system, and that took something like six person months. We can entirely rewrite the whole thing from scratch in a more modern language in less time than that.

Now let’s look at JAGS. A complete rewrite of BUGS that’s way better than the original BUGS both in terms of performance and in terms of software readability.

Oh, and how about clang++? A completely new C++ compiler that’s blowing away gcc.

Should I go on? How about Facebook ripping out their non-performant front ends and replacing them? Or how about Hadoop rewriting map-reduce because Google’s wasn’t open source? All those things worked out well.

I was just referring to repositories of linux distributions and CRAN as distribution channels

It is ABI-incompatible with the current and past releases of R. It works with the beta version of the next major release of R (and hard-codes some paths) and they are working through stuff with the other 13K packages. So, it isn’t suitable for a lot of situations, but it is totally suitable for CmdStan on Windows (modulo the hard-coded path thing) and arguably worthwhile for using RStan (but you do have to co-install a different version of Windows and there are no binary R packages yet).

The main development components of Rtools40 are not beta, including make, sed, grep, and the latest stable version of g++. GCC has had two point releases on this branch, so there shouldn’t be any critical bugs that haven’t been hit by somebody yet. Moreover, if I had to guess whether the latest stable g++ or the unsupported, partially C++14 compliant g++-4.9 had more bugs, I would guess the latter. Moreover, STAN_THREADS works with the former and not the latter. And it has ArchLinux’s package manager. So, anyone who wants to use CmdStan on Windows and is allowed to install to C:\rtools40 should already be doing so. The case for using it to run RStan is a closer call but a lot of individuals should. Stan (i.e., us) should be using it to test on Windows (although we still have to test with g++-4.9 too) and report any issues before they hit R users in April.

It’s on GitHub. Jonah and I don’t know a lot about toolchain development or Windows, but I do regret not having tried to do more to make the PATH thing less rough. And I wish I had thought of pkgbuild five years ago because that seems like the right approach. And I am a bit worried that Rtools is being developed and maintained by one (fantastic) postdoc who needs to get 13K packages building with a new compiler by April.

I think these are important points for discussion. Granted, I have been trying to get all the Stan-related packages on CRAN to be on the same page since September 12th (and that was only because CRAN was being upgraded in the early part of September and I didn’t want to start a migration during or shortly before StanCon Helsinki, so it has been more like mid-August since I was ready to start) and we’re still not there yet. But I wouldn’t consider the 2.18 issues to be systematic since most of them were related to jumping the -std flag from pre-C++11 to an incomplete C++14. Historically, I have had StanHeaders on CRAN

https://cran.r-project.org/src/contrib/Archive/StanHeaders/?C=M;O=A

within a few days of them being tagged

and rstan binaries were built about a week after, although 2.17 was also an aberration because we were dealing with the fallout of the char* vs. string slowdown and the fact that C++11 stuff had gotten introduced to develop in the meantime.

As far as being limited by the C++ compiler(s), yes this has been true due to Windows. But I don’t think it is binding on us that much currently. We don’t have any C++14 syntax in the main part of Stan yet but it looks as if we’ll continue to be able to set the -std=c++1y flag if we want to use polymorphic lambdas, auto return, etc. It is unfortunate that we can’t use C++14 constexpr but we can utilize the C++11 version. Since the target date on the Stan3 draft is fourth quarter 2019 and by April 2020 R will have gone through a whole annual release cycle with g++-8.x, I think we could get to C++17 then. Maybe earlier if we are willing to say to people who can’t upgrade R yet that they just have to live with a C++14-ish StanHeaders / RStan.

With regard to Boost and Eigen, I strongly believe we have not been bounded. I was always willing to put Boost fixes or new libraries into RStan if they were not in BH, and it amounted to like 5 commits

over three years, all of which got dropped for 2.18.x because they were no longer necessary with C++11/4. I would do the same with RcppEigen, although I have never had to and Eigen does not release often.

If Apple can delete /usr/include then they could do anything. But it looks like we’re getting a fix into the Mac toolchain installer within 10 days of the issue being brought to our attention. And the Mac toolchain uses upstream clang, so we are insulated from changes to Xcode’s clang.

All-in-all, I think those issues (and some others) are orders of magnitude less problematic than the difficulty of configuring Rtools by hand.

I love all languages equally, but Stan most equally of all.

Folks + @bgoodri & @Bob_Carpenter,

As I hinted in a now deleted post, I do have a prototype for an all-in-one installer build for macOS that combines the pre-existing installer packages for:

  • R (Official CRAN Binary)
  • RStudio IDE (Desktop-version)
  • Developer Toolchain (r-macos-rtools), and
  • an assortment of R packages (Rcpp, RcppArmadillo, RcppEigen, Stan, tidyverse, ??).

I’m aiming to release this mid-November.

Before that, I need to have final approval from RStudio to embed their IDE as I’ve only been given preliminary permission today and I’d like to have a short beta test period to ensure all the kinks are worked out. If you have any suggestions or comments, please let me know.

That sounds great! Have you talked with Hester about having pkgbuild pop something up asking if the user wants to install it when it is not found on the disk?

I certainly do not encourage making Stan installs hard! Stan is already pretty easy to install from my perspective given that it does depend on a few hefty things. The compiler thing is probably the worst dependency to manage, but I don’t quite see a way around that… and I have to say that RTools limiting the C++ standard we use to some extent is something I consider a good thing given that corporate use of Stan has to happen on systems which are much slower to upgrade. Turning Stan into a deployable thing which brings its own compiler could be a solution. Ideally we make this installer thing optional if possible (and even with an installer being there it would be great to still keep things on the ground in terms of required minimal compiler versions).

(BTW… @betanalpha… have you considered to tell people to install rstan from an MRAN mirror which keeps snapshots of CRAN at a given time-point (look for CRAN time-machine)? This should solve your installation issues during the course as you can then select a state of CRAN which has a consistent rstan eco-system. It is just a matter of telling R the right MRAN mirror to use and that’s it.)

I also wanted to echo the utility of expose_stan_functions which we have in rstan thanks to @bgoodri! This facility is really amazing. The key benefit of it for me is that I can run posterior simulations using the same code that I used for the inference. In addition it allows nice and easy debugging and unit-testing of my Stan models. I don’t mind if these applications can be achieved with something different (gq services/whatever), but those are really nice side-products of this magic glue between Stan and R.

BTW, isn’t the discussed service API a way to make all the interfaces more coherent? I thought this is the solution to a currently cluttered stan/interfaces situation - and in a way this would be the solution to a coherent stan<->interface.

@jmh530 Is that the case if you are installing Rtools to a writable directory and do not check the box to edit the PATH?

@bgoodri I can’t remember the exact issue now. I just sort of threw up my hands and just had the IT people install it for me (recently switched groups at work and no longer have admin rights). Have no problems with installing Stan at home.

I’m certainly NOT against change. That’s not the point I wanted to make.

I suppose the point I wanted to make was that there is a difference between completely re-writing code and re-factoring it. If Facebook is ripping out stuff and replacing it, presumably they have unit tests that pass when they start and when they finish. That’s re-factoring. They didn’t throw out all Facebook code and start fresh. Hadoop’s map-reduce is something else entirely and doesn’t seem relevant. Your clang/gcc and BUGS/JAGS examples are better, more in line with my thinking. Note that clang isn’t gcc 2.0. JAGS isn’t BUGS 2.0. These are completely separate projects. When I see Stan 3.0 making all of these changes, I think of the Python 2.0 to Python 3.0 transition…probably a good thing but with some pain.