Choosing the new Stan compiler's implementation language

First, some short responses before responding to @bgoodri’s longer post.

$100 is below the level that’s worth discussing at our meetings. We can pay for it out of the dev budget at NumFOCUS which doesn’t require pre-approval, just dealing with their reimbursement paperwork. Or if you give us something we can enroll in, we can probably do it with the NumFOCUS credit card.

Yes, please do, @andrewgelman.

Will this be a problem because RStan requires an externally installed C++ compiler? If they start enforcing a no-external dependencies policy, they’re going to lose a lot of packages.

My experience is that when you try to pin someone down on policy, they sensibly go with the most restrictive one they can imagine, figuring they can always loosen it later without pissing people off.

Let me preface this by saying I haven’t personally had any problems installing RStan on Mac OS X ever. I’m also skipping the quoting because Discourse ate my first draft, so I’m doing it offline.

My main goal is to make life easier for our devs and for our users. So I really do want comments.

Changing the language for the parser and AST from C++ to OCaml is motivated by entirely different reasons than the install problems with C++. It would be nice to have a language in which we can develop new language features, decent error messages, and semantics much more quickly and easily than we can in C++. It would also be nice to transform intermediate representations for efficiency; this has been on our to-do list for years, but it is very, very painful in C++. This is going to be critical if anyone wants new features in the language.

I’m trying to come up with a plan to address what I see as issues around RStan and PyStan development, release, and install that will be cost effective and robust going forward.

Specifically, I would like to:

  • cut down on the time between Stan being done and a workable R solution for accessing Stan being released,

  • remove the possibility of CRAN being in an inconsistent state,

  • make it easier for users to install on all platforms,

  • take measures to ensure we don’t break important downstream packages like Prophet,

  • allow us to move forward with C++ compilers, Boost, and Eigen libraries independently of CRAN/R requirements,

  • reduce the amount of installation issues and forum traffic so that Stan doesn’t look so wobbly around installs.

This may not all be possible.

The installation issues for RStan 2.18 have been across platforms and have come up at different stages.

I had thought RStudio was going to release devtools as part of their new releases as soon as Rtools was up to a workable version. I thought JJ said that’d be about a year from when we met at Columbia. I probably misunderstood, though, if that’s not @bgoodri’s impression. Should I follow up to check, or won’t that matter?

Is there any way to check whether adding things like pkgbuild will have negative effects on downstream packages like Prophet?

I wasn’t clear on the C++14 recommendations. Were you recommending we stick to C++11? We can recommend C++11 until we move to C++1y features. There are a lot of things we’d like to use beyond C++11, especially around things like polymorphic closures. @bgoodri, if you know about these things ahead of time, please bring them up so we can address them. This is the first I’ve heard (that I recall, at least) of C++1y being an issue.

I didn’t realize that whoever maintained Rtools was asking for donations. If you want us to make donations, you’ll have to request donations. I’d rather not be called a deadbeat over some debt I didn’t even know I had. Are there other things like this?

Please don’t call consultants mercenaries. But yes, the plan would be to get help on installers for platforms, not just Windows, so that we could (a) be up to date with recent compilers, (b) be up to date with dependent packages, and (c) include tools not part of Rtools. The other thing we could do with our own installer is build a better wizard that wouldn’t confuse all of our users and would do the right thing for Stan in terms of modifying makevars.

Those “Big Challenges in 2019” are the main reason we’d like to decouple the C++ compiler from R and ideally run out of process. I know it’s a huge change, but it seems like we don’t have the person power to keep up with what we need to do for R now.

I disagree that the rate of new Stan developers coming on line is constant. I think it’s growing with the Stan population, which isn’t exactly linear growth.

Figuring out how to support things like rstanarm going forward is why I’ve been emailing @bgoodri and @jonah to try to set up some meetings and why there are no entries under those things. Do you guys have something like a roadmap for the R ecosystem around Stan somewhere?

Before this draft was written, Sean and Matthijs built example proofs of concept and made sure we could launch executables on all of our major platforms. There are two ways that systems do this now that I know of. RStan gives people instructions that they have to run outside of CRAN to install system tools before RStan will work. The tensorflow package builds a script into their CRAN package to go and download things from the web. I like this latter approach, as it worked really well for me when installing the tensorflow package.

Why do we need packages like brms to run unit tests on CRAN? I think we should just assume that the external install works and provide tests for it the way we test everything else.

I intentionally broke out the Stan 3 language here as that’s not going to break backward compatibility. If (and it’s a big if) we move to something blockless, it still won’t break backward compatibility. But the first plan is to move the implementation so we can nail down all the other things we need to do, like add tuples, ragged structures, closures, etc. We’ll never get those done in C++. We’ll be able to rebuild the entire parser and code generator in OCaml and implement one or two of those features in the time it would take to do just one of them in C++.

We don’t need everything to be in RAM to access transforms, etc., but it will be necessary to do it efficiently. I know we have that all in there now, but I have no idea who’s using it and for what. I’ve never gotten any examples of anyone using RStan or PyStan to develop algorithms through those exposures. I do know some people like to write Stan functions to make faster R functions, but that’s not our core use, so I don’t think it’d be terrible not to support that, or to move it to something like an RcppStan package.

I guess I wasn’t clear enough that I wasn’t proposing getting rid of the existing RStan or PyStan interfaces. I’m proposing adding out-of-process versions that address problems with the current installations. These would be simple to write and would provide all the current RStan functionality, but some of it would be a lot less efficient.

Please don’t characterize other people’s proposals as making good April Fool’s Day fodder, no matter how appropriate you think the analogy is. We want to keep these forums polite.

If you want to call having a single easy-to-use, completely bundled installer “corporate”, go right ahead. I think corporations are doing things right in some ways, which is why people will continue to pay them for things. This is why Windows still exists: it’s the only platform that’s serious about backward compatibility.

I’m not actually proposing that we get rid of the existing capabilities of building everything from source with a custom C++ compiler; I’m just proposing that we also encapsulate a bunch of stuff that we know our users need. Our users are for the most part not like our devs: they’re not managing multiple C++ environments; they usually don’t even have one.

Indeed, I’m bundling the library (Rcpp, BH, RcppEigen, StanHeaders, etc.) dependencies and compiler dependencies together. From the core Stan C++ developer (i.e., my) perspective, they are all the same—they’re restrictions on what we can use in our code. BH and RcppEigen and StanHeaders all provide dependencies in RStan releases.

I thought the inability to synchronize releases and version dependencies on CRAN led to StanHeaders and RStan getting out of synch. To not get out of synch with Boost and Eigen, we have to support both their existing libs and their next libs until BH/RcppEigen switch over, and only then can we remove the old support. This is more of an issue for developers than users. I’ll try to keep the issues more separate in the future, since I seem to be confusing people here, which was not my intent.

The FOSS standard of everything working with everything else is nice, but I don’t see how to make it work in practice given the resources and tools we have. As an aside, do FOSS purists shun Docker containers for the same reason, that they are overly corporate in their approach to bundling?

I can’t quite reconcile R and RStan and how well they live up to those FOSS principles. At the very least, we should be testing more. Doesn’t RStan require the C++ to stick to the latest BH and RcppEigen and something ABI-compatible with whatever C++ compiler R was compiled with? If Python did the same thing, we might be in a place where we had to support two entirely different versions of Boost and Eigen and C++ (actually, I think we are there). (I think Python may do the same thing and we’re just letting PyStan break in most places; I think that’s part of the motivation for PyStan3, but I don’t see how having a standalone http server is going to help with that.)

The problem for us isn’t so much that we don’t want to do the right thing but that we don’t have the support staff to pull it off.

I think it’s very unfair to say that we’re taking from the efforts of FOSS but not giving back. We’re giving back Stan! For the core Stan, we have been filing issues with Boost and with Eigen when they come up and are clearly bugs other than design decisions that make our life hard. Should we be donating dollars to all the tools we use? From the top down, that’s R, Python and Julia, g++ and clang++ and whatever’s going on in Windows, Boost, Eigen, Sundials, Rcpp, knitr, ggplot2, …? How much?

How is it that Haskell isn’t on that list? It’s a best-in-class language when it comes to writing compilers.

  1. Pattern matching: very strong here (see the sketch after this list).
  2. Fun: yes.
  3. Amenable to research: yes.
  4. Solid, modern tooling: stack, hspec/hunit/quickcheck.
  5. Distribution: yes.
  6. Community: yes.
  7. Production use: yes, especially in financial analysis. See also FP Complete (https://www.fpcomplete.com/about-us)
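
To make point 1 concrete, here is a minimal sketch of the kind of rewrite pass where pattern matching earns its keep. The mini-AST and the `simplify` pass are hypothetical stand-ins, not the actual Stan AST; the point is the nested and literal patterns, plus the warning GHC gives you if a case is missed:

```haskell
-- Hypothetical mini-AST, purely for illustration.
data Expr
  = Var String
  | Lit Double
  | Plus Expr Expr
  | Times Expr Expr
  deriving Show

-- Peephole rewrites via nested and literal patterns:
-- x + 0 => x and x * 1 => x. GHC warns if a constructor is unhandled.
simplify :: Expr -> Expr
simplify (Plus e (Lit 0))  = simplify e
simplify (Plus (Lit 0) e)  = simplify e
simplify (Times e (Lit 1)) = simplify e
simplify (Times (Lit 1) e) = simplify e
simplify (Plus a b)        = Plus (simplify a) (simplify b)
simplify (Times a b)       = Times (simplify a) (simplify b)
simplify e                 = e

main :: IO ()
main = print (simplify (Plus (Var "x") (Lit 0)))  -- prints: Var "x"
```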

Can you contrast Haskell with OCaml for us in those fields?

For the sake of transparency, here is an attempt at representing my thought process here:
It was on the list but got eliminated somewhat early mostly because @Matthijs (a true PL theorist) really hates it, haha. I also found it very frustrating when I was trying to build a database in it a couple of years ago, or do even basic algorithms if they involved mundane data types like… strings. Managing state is difficult and I don’t think I’m nearly clever enough by half for monad transformers. They haven’t picked an alternative prelude yet and that seems troublesome. My experience is also colored by the fact that a friend of mine has extremely lucrative Haskell work he’s trying to subcontract out, but he can’t find anyone…

PS This quote on Haskell strings (from a Haskell supporter) is too funny not to share:

The String type is very naive, it’s defined as a linked-list of Char pointers. This is not only a bad representation, it’s quite possibly the least efficient (non-contrived) representation of text data possible and has horrible performance in both time and space. And it’s used everywhere in Haskell. Even posterchild libraries for Haskell (Pandoc, etc) use it extensively and have horrible performance because of it.

(emphasis mine to highlight the funny)

Perhaps it’s worth stepping back and answering a more fundamental question before discussing a new compiler written in something like OCaml – is CRAN a non-negotiable requirement of the Stan project?

I understand that CRAN is integral to the R experience and most R users are not comfortable downloading packages otherwise, but at the same time CRAN was absolutely not designed to handle packages like Stan and consequently we will always be hacking around if we want to maintain compatibility. I am not set one way or another, but some points that have not been made or made only tangentially that I think are really important here:

  • Stan as a project transcends R, but maintaining CRAN compatibility has strongly influenced, and at times limited, the development of the entire project, and that coupling can have awkward effects on the evolution of the project. There are significant costs to maintaining CRAN compatibility.

  • The current build system for RStan is convoluted and difficult for users without much programming experience. Yes the limitations of R, CRAN and Windows force us into that system, but it is a significant obstruction to new users especially when users have to cut and paste instructions that span multiple pages without any R scripts to check for correctly installed requirements and the like. Many Windows users don’t know how to access the Makevars file to modify it, let alone what the Path does! I teach about one course a month and not one has gone by where I haven’t had to waste valuable teaching time helping people out with install issues.

  • The separation of Stan into multiple packages and their staged updates leaves the project vulnerable for large periods of time. I have had multiple courses compromised because of RStan updates only being partially completed when the courses have started. The fact that there is no clear “RStan on CRAN will be compromised for the next X days” warning means that these problems are nasty surprises instead of expected issues that can be worked around. The recent RStan 2.17.4 issue was particularly bad.

  • The Stan community does not come entirely from the R community, which means potential new users aren’t as bound to CRAN as the general R community is.

  • There are many benefits to the F(L)OSS approach of many packages with collaborative development, but the resulting tools have a habit of being optimized for those already in the F(L)OSS community and not those we are often interested in targeting. People buy Macs and Windows because they want the “corporate” user experience that limits the amount of thought they need to put into the process. Utilizing existing tools is awesome (we already do that with Boost and Eigen) but sometimes existing tools just aren’t designed for what we need and we have to develop our own (the math library).

In my opinion there are strong arguments on both sides of the “is CRAN a requirement” question, but it’s hard to develop a long-term roadmap without settling it.

Given my courses and my other outreach activities, I think the absolute necessities moving forward, in addition to the core Stan library, are interfaces in R and Python (and the command line for rapid development), plus common toolkit installers or instructions for Mac, Linux, and Windows. There are many ways of achieving that goal, each with its own tradeoffs.

Good question. Everything’s up for negotiation. I am curious as to what most of our users would think if we said “from now on, it’s GitHub installs, not CRAN” (assuming we could deliver a working stanc, which is the memory bottleneck in compilation).

What I was suggesting in the roadmap post was that we could take the tensorflow package approach of having a CRAN package but requiring other things to be downloaded. We already do that for C++, so I don’t see the big deal in doing it for more things.

This is one of the things I’d like to address. The main restriction is C++ compiler version now, but it’d also be nice not to be tied down on Eigen and Boost versions.

This has also been my experience and a major source of frustration.

There’s been much more effort on the R side because @andrewgelman is funding people to work on R, not to work on Python, because he uses R. Makes sense to me, but it leaves the project very R oriented.

I would like to build out Python more, but that seems like even more of a mess than R in terms of C++ compatibility. Which is why I’d like to get simple CmdStan wrappers working first. I think having those would solve a lot of our install problems for a lot of users.

Please, NO Haskell. Despite some of its potential advantages, it creates a rat-tail of dependencies (e.g., see this ArchLinux discussion). My own experience matches the linked discussion: while installing pandoc on ArchLinux, it pulled in several Haskell packages that no other program on my laptop required, and the number keeps growing! The size may matter less, but it means updating several Haskell packages every time when running a rolling-release distribution. Additionally, and perhaps more importantly, I consider more dependencies a likely source of more problems.

I think that the (continuous) use of languages present from the very start on the most common OSs (Windows, Mac, Linux, BSD) and used in core applications - i.e., C (Probabilistic C), C++, Python - prevents cluttering the OS and retains clarity and structure. This very likely facilitates understanding and the search for errors. However, I admit that I do not know how this may ease further development, extension, and use of Stan.

I wouldn’t say I hate Haskell, but I did voice a preference against it for this particular project.

I use monads all the time to structure denotational semantics. I did a PhD in category theory so I don’t think I’m biased against monads. I just think they can add too much cognitive burden when doing simple things, particularly when you start mixing computational effects and need to use monad transformers. I’m not sure they’re the best programming abstraction for all purposes and I am hoping that they will be replaced by a better (particularly, more composable) mechanism for managing effects using the type system. In our particular case, I think they would create a much higher barrier to entry to working on the compiler.

Sure, you can write a lot of a compiler in a purely functional style, but ultimately it’s convenient to have some state around, for instance for the symbol table (convenient though not essential here: you can write everything in a pure state-passing-style with Maps rather than HashTables) and for the various compiler optimisations (really very convenient here, I think). This means our code would inevitably involve a bunch of monads.

The lazy evaluation is something I’m really not enthusiastic about in general. It results in unpredictable computational complexity - which for a compiler might not be super important - and can be very confusing for novice users. I think lazy evaluation is useful in some particular cases, but I’d rather not have it as a default.
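
The textbook illustration of that unpredictability is the lazy left fold: nothing in the code looks expensive, but `foldl` quietly builds millions of unevaluated thunks before forcing a single addition. A minimal example:

```haskell
import Data.List (foldl')

main :: IO ()
main = do
  -- Strict fold: runs in constant space.
  print (foldl' (+) 0 [1 .. 10000000 :: Integer])
  -- Lazy fold: same result, but builds ~10 million unevaluated
  -- thunks first and can exhaust memory on larger inputs.
  print (foldl (+) 0 [1 .. 10000000 :: Integer])
```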

Ultimately, my preference of OCaml over Haskell came from the fact that I believe that most people could understand a compiler written in OCaml in a couple of days, even if they’ve never seen an ML dialect before, while a compiler in Haskell would scare away most people who aren’t functional programming or category theory enthusiasts.

That said, there are a bunch of things I like a lot about Haskell: the type classes, the idea of having some mechanism in the type system for making users be explicit about the computational effects they want to use, the higher order polymorphism and generally the design of the syntax. Probably, the higher order polymorphism would be the only thing I’d really like from Haskell while writing a compiler, as I’ve experienced first hand that it can be useful for structuring some of the intermediate representations. Then again, even there I’d have mixed feelings as it really makes the learning curve for the project steeper for new developers.

@Kevin_Van_Horn, I’d love to understand, though, what you’ve found useful about Haskell over, say, OCaml, when writing compilers. I’m far from experienced when it comes to writing compilers myself so I could well be overlooking important things here.

Smaller clarifying points:

That wasn’t my impression, and they could have done that already with Rtools3* if they wanted to (lack of a complete C++14 implementation isn’t holding many projects back). Maybe part of the confusion is that the devtools package already exists but it does not actually come with development tools for compiling code? It is mostly geared toward documentation, unit testing, etc.

I checked on the Windows machine in my office and it was fine. Somehow, the CRAN Windows server that builds and tests binary packages was configured in such a way that pkgbuild could not find its Rtools when building the binaries for prophet. I am sure that was a surprise to everyone involved and it will get fixed. Until then, I’ll just set required = FALSE when calling pkgbuild::with_build_tools() in that context and it will go ahead and try to compile (which presumably will succeed since it did before we started using pkgbuild to build prophet).

I probably should have done more homework here. But since we had been unit testing with the -std=c++1y flag since August of 2017, I assumed that meant there was now some part of Stan Math or stanc or the algorithms that was using something from C++14 that was not in C++11 but was in g++-4.9. Except I guess we don’t use any yet, although I think one of Rob’s PRs does. Anyway, I was saying to CRAN “everything works perfectly if all these packages just set -std=c++1y everywhere” and CRAN was saying “that is because you are not using anything beyond C++11”. But at this point, we are actually in good shape for when Stan starts using polymorphic lambdas, auto as a return type, etc., if I can only convince CRAN that we are only planning to utilize what g++-4.9 can handle.

CRAN runs them for all packages, not just brms, for a bunch of operating systems. This is a good thing, even if it means your tests break on Solaris.

These are fair points. But up until July 2015, rstan wasn’t on CRAN and things were a lot worse from an installation standpoint, particularly on Windows. And with rstan being on CRAN, we have enabled rstanarm and a bunch of other packages to be on CRAN that utilize Stan but don’t require the user to have a compiler.

Going to C++14 has been rough for anticipated and unanticipated reasons, but like I said, I am optimistic that for rstan 2.18.2 on Windows:

  1. People can just install Rtools without changing the PATH or selecting any non-default options or having a ~/.R/Makevars file
  2. The pkgbuild::with_build_tools() will find the Rtools (or offer to install it if the user skipped the previous step) and temporarily alter the PATH until it is done compiling
  3. R will pass -std=c++11 when it compiles a Stan model at runtime but the rstan plugin will trump that by passing -std=c++1y later in the list of flags, so we can utilize some C++14
  4. With C++14, rstan can pass -march=native without crashing R

So, if a Windows user does not have a ~/.R/Makevars file, the Stan model will compile at runtime with -O2 -g -march=native -std=c++1y. If they have a ~/.R/Makevars file, they could improve to -O3 and preclude the debug symbols, but it is not a big deal if they lack the ~/.R/Makevars file.

If the only thing Windows users have to do is click through an Rtools installer, I think everyone could live with that. On Macs, we were in really good shape until Mojave became prevalent and the toolchain installer is being updated for Mojave and clang-6 this week.

I’m definitely out of my depth on both software development and programming language theory here, but from a user perspective I wonder if a choice like Haskell and an implementation in (mostly) functional style couldn’t allow the whole thing to be more modular and extensible (for instance, feeding an analytically calculated Jacobian to the sampler, or implementing a different one altogether, could be as straightforward as writing a single function).
Granted, Haskell may not be the most widespread or user-friendly language, but maybe it’s more so than OCaml, I’m not sure.

Alternatively, were languages that already have interfaces to Stan, like Scala or Julia, considered? They should allow a functional-style implementation without enforcing things like monads, and probably allow some state to be exploited, although I’m not sure they cover the whole list of desired features. Also, maybe something like that could be a step towards an interpreted version of Stan.

Like I said, I can only speak from a user perspective, and to me it would be useful to prototype models and other functions interactively directly in the implementation language, like in a Python or Julia shell, so maybe something like this would make the core more accessible to people like me that rarely dare to go into the guts of Object-Oriented implementations.

I don’t agree with this and was under the impression most of us disagreed with this - people do not reliably click through the RTools installer, and many people (half of Windows users I meet?) do not have admin rights in the first place. This is the status quo and the exact point of contention - it must change.

I think this illustrates the extremely pernicious nature of “But next version, everything will just work.” The current architecture has way too many failure points and modes and every release something breaks. From the outside, it seems like you spend most of your time fighting with the direct results of two choices: linking against Stan models at runtime from R, and releasing a C++ shared library through CRAN.

The Rtools installer does not require admin rights to install. You often need admin rights to change the PATH globally. But that is the point of having the pkgbuild package so that changing the PATH globally is not necessary.

With the Mac thing, we are currently fighting the fact that Mojave moved the C++ standard headers out of /usr/include which has been the typical place for headers on a UNIX-like system going back decades.

The shared libraries in StanHeaders (that has CVODES) and rstan haven’t been big problems because most people on Windows / Mac either install the binaries from CRAN or have the capability to build them from source locally.

But loading a dynamic shared object containing a model that is compiled at runtime is a pretty central design choice. If a user can get it compiled, it is mostly non-problematic. Two potential problems are that it could be compiled with a different ABI than R and / or Rcpp and that there is a limit to how many dynamic shared objects you can load, so people can get themselves into trouble if they are reloading them from inside a loop. The first hasn’t been a problem in practice with Windows (yet) because g++-4.x has been the only choice for like a decade. It hasn’t been a problem on a Mac because clang always seems to be ABI compatible with itself even across different versions. It was an occasional problem on Linux when g++ transitioned from 4.9 to 5.0 and people got a mismatch one way or the other when they tried to dynamically load the shared object, but that hasn’t been at all prevalent for a few years.

Certainly, you could have a design that avoided dynamically loading a shared object entirely and did shell calls. With that, you give up things like expose_stan_functions and it is harder or less efficient to do things like log_prob. You couldn’t use grad_log_prob with some other optimization routine in some other R package. It’s not as clear yet what it would mean for packages that come with Stan models that create Rcpp Modules at installation time or how to do the ReferenceClass / PPL stuff I had been planning for RStan3.

That is not a small amount of stuff to give up, and while we can talk about how valuable that is, I don’t think the historical problems Windows users have had setting the PATH right for Rtools justify changing the way Stan is distributed, installed, and used on all platforms, particularly when there are people working on making the Rtools thing more seamless.

There is no need to change the global PATH variable… I have a user-specific one on Windows and it works just fine.

Much has been said in this thread… so I’ll keep it brief. I was at first very skeptical about adding another huge dependency like OCaml/Rust/… The idea of solving this by distributing a binary of stanc is still scary to me. It may actually work after all, and being able to set up RTools and go from there always felt like a good thing. However, hearing that moving to OCaml will likely accelerate the development of the language (and its parser) is a great thing to trade all of this for (given we find good ways of distributing things). Maybe not relying on the latest and greatest OCaml will ensure that this beast can be installed for the major OSes using macports/homebrew/Ubuntu/Redhat easily.

Wrt. CRAN: I do think CRAN is a great platform to distribute rstan! By all means - this makes life a lot simpler for users, I think. Of course, messing with Makevars is not good, but well… maybe this is a good filter of our user base anyway (there is still rstanarm+brms+…). How to make CRAN work with the OCaml approach is not obvious to me yet.

The thought of having an rcmdstan package is a great one from my perspective: it makes working with Stan a lot easier on clusters!

The idea of running Stan as an interpreted script does not sound really useful to me other than for educational purposes. While this is nice, I am not sure it is the primary objective for Stan.

Yes, this Mojave thing has some specific cause that is unique to this particular time our release was held up. The point here is the larger pattern across Windows and Mac, Eigen headers, etc.

This is the problem: they mostly cannot. Also, users like Andrew are still experiencing crashes that bring down RStudio. That coupling doesn’t have to happen for performance reasons.

There is an easy route to being able to efficiently call log_prob etc. with far less maintenance burden and less coupling: a cmdstan package with a server mode that R or any other language can connect to over a socket using protobuf or whatever format is convenient.

The point is that we have historically spent a ton of resources because of this architecture, the end cannot reasonably be said to be in sight, and with some effort we can change to a more robust architecture and lose very little. I think Bob is advocating keeping around a version that links in to R, but even with the most extreme proposal I think the only thing that might change from a user’s perspective is that everyone, including downstream packages that use rstan, must call something like install_rstan() once before use, which is again also how tensorflow works.

I’m not sufficiently familiar with OCaml to really contrast the two.

As for strings, there are two responses:

  1. The efficiency issue only matters if you’re processing a very large amount of text. At the scale of a single Stan source file it’s a non-issue. My time-series-model compiler uses Haskell and String for strings, and I haven’t noticed any performance problems.

  2. Anyone who uses Haskell to process a large amount of text uses the Data.Text and/or Data.Text.Lazy modules together with the OverloadedStrings language pragma. It’s pretty much a drop-in replacement for the default String type (see the sketch below).
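
A minimal sketch of that workaround (the model string here is just a placeholder, not real Stan input handling):

```haskell
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.Text as T
import qualified Data.Text.IO as TIO

-- With OverloadedStrings, string literals become Text values directly;
-- Text is a packed array underneath rather than a linked list of Char.
model :: T.Text
model = "model { y ~ normal(mu, sigma); }"

main :: IO ()
main = TIO.putStrLn (T.replace "normal" "student_t" model)  -- toy rewrite
```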

ernest, are you using stack? It’s the standard Haskell tool for building and managing package dependencies, and does a very good job of it. It avoids the problems you mention like cluttering the OS.

I think that the (continuous) use of any language present from the very start on the most common OS’s… and used in core applications… prevents cluttering the OS and retains clarity and structure.

You just ruled out Rust and OCaml…

This is what I dislike most about this discussion of the draft. We have had one widespread, major, systematic problem, namely getting users to configure things so that make can find the (correct) Rtools on Windows (although it is not true that they mostly cannot). The first dozen things that a FOSS project should do in a situation like that are to work with Rtools and other FOSS projects that have that same configuration problem to try to lessen it. The Stan project has never done that, but fortunately for us, other FOSS projects that have that same Rtools configuration problem have been making good progress on it recently.

To get around an ASCII configuration problem on Windows, Stan changes are being proposed to the distribution, installation, and calling process that would apply to each of the operating systems. Those are very broad changes that could avoid the Rtools configuration problem, but that is something that now seems solvable with very narrow internal changes (utilize `pkgbuild::with_build_tools()` to compile Stan programs at runtime, which temporarily changes the local PATH so that the proper version of Rtools is first and then sets it back after it is compiled).

If we work with the developers of Rtools, pkgbuild, the Mac toolchain installer, etc. we can make progress that helps not just us and users of Stan but anyone who is compiling stuff for R packages or even not R. Developing our own Stan installer that comes with a compiler is a large undertaking and one that wouldn’t be that helpful for anything but compiling Stan programs.

Proposed changes that make it harder to distribute Stan, or that curtail current features, or are otherwise disruptive to people’s workflows need to meet a high burden before we go off implementing them. An R package that runs a Stan installer script to pull in a compiler, dependent libraries, a stanc binary, etc. is a non-starter for Linux distributions, so putting that into RStan would get r-cran-rstan removed from Debian, Ubuntu, etc. CRAN operates much like a Linux distribution for R packages and is unlikely to agree to run a Stan installer on its servers when there is already a compiler and dependent libraries available. That is going to be a problem, perhaps a big problem, for anyone who has or wants to have a CRAN package that depends on RStan.

Giving up the process of dynamically loading a shared object into RAM has costs too. We certainly can add more stuff to CmdStan, but I don’t see a way to do expose_stan_functions without a DSO. And expose_stan_functions has been awesome; it allows people to utilize Stan in a different way from within R, is important for unit testing packages that come with a bunch of Stan functions that their programs are calling, etc. I haven’t gotten that far along with it yet, but without a DSO, it looks as if it would be hard to do a lot of the PPL stuff that I had been planning for RStan3, where the C++ object would have a pointer to the draws in R’s memory and the R object would have pointers to the symbols from the data and transformed data that are in C++'s memory.

Since I accidentally hijacked the thread before, I want to re-emphasize that I think the effort to try rewriting the parser in OCaml or Rust or Haskell should keep going with that. I think (and it seems @Krzysztof_Sakrejda would agree) that RStan and CRAN could accommodate whatever that turns out to be, although I am not completely sure how yet.

I also think that CmdStan 2.19 for Windows should come with (a script to download?) the RTools40 beta, which has make, sed, grep, etc. in addition to g++-8.2, plus ArchLinux’s package manager if the user wants to install some stuff that is ancillary to Stan. g++-8.2 is already way better than g++-4.9 and supports STAN_THREADS. It installs to C:\rtools40 without changing the global PATH, so we could (have the script) use that prefix and things are ready to go. If we set out to build a Stan installer that came with a compiler on Windows, we would be very hard-pressed to do a better job than what Jeroen has already done with Rtools40. And it couldn’t hurt if Stan helped his effort with some CmdStan testing. If there is enough interest in a PyCmdStan and/or RCmdStan that shell-calls CmdStan, that’s fine too.

Matthijs: I’m baffled as to why you think you need state in order to have a symbol table. Data.Map does the job just fine. I think you’re engaging in premature optimization by insisting on HashTables. I’m also baffled as to why you think compiler optimizations require state. I’ve written the initial version of an optimizing state-space-model compiler in Haskell; the only thing state-like is using a monad to help with generating unique symbols for transformations that involve introducing new local variables.
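
To spell that out, here is a minimal sketch of both pieces; the `Ty` type and the names are hypothetical stand-ins, not anything from an actual Stan compiler:

```haskell
import qualified Data.Map.Strict as Map
import Control.Monad.State (State, evalState, get, put)

-- Hypothetical stand-in for the compiler's notion of a type.
data Ty = TReal | TInt deriving Show

-- A purely functional symbol table: no mutation, just pass the
-- updated Map along as you walk the program.
type SymTab = Map.Map String Ty

declare :: String -> Ty -> SymTab -> SymTab
declare = Map.insert

lookupVar :: String -> SymTab -> Maybe Ty
lookupVar = Map.lookup

-- The one state-like piece: a counter for minting fresh names when
-- a transformation introduces new local variables.
freshName :: String -> State Int String
freshName prefix = do
  n <- get
  put (n + 1)
  return (prefix ++ "_" ++ show n)

main :: IO ()
main = do
  print (lookupVar "mu" (declare "mu" TReal Map.empty))
  print (evalState (mapM freshName ["tmp", "tmp"]) 0)  -- ["tmp_0","tmp_1"]
```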

This means our code would inevitably involve a bunch of monads.

What’s the problem? Monads and do notation aren’t that hard.

Ultimately, my preference of OCaml over Haskell came from the fact that I believe that most people could understand a compiler written in OCaml in a couple of days… while a compiler in Haskell would scare away most people who aren’t functional programming or category theory enthusiasts.

Point taken. But there is a large and growing community of Haskell programmers. In the TIOBE Index for October 2018, Haskell ranked as the 39th most popular programming language, while OCaml didn’t make it into the top 50.

https://www.tiobe.com/tiobe-index/

As to why I like Haskell over OCaml… I’ve been a software developer for decades, and care a lot about correctness of my software. I’ve been writing statistical software for 9-1/2 years now, and one of the challenges of statistical software is that it is difficult to unit test – you don’t know what the answer should be for any but the most trivial of cases – and easy to have non-obvious bugs where you get the wrong answer but don’t realize that anything is wrong. Haskell excels in environments where correctness is very important. Haskell’s rich type system helps a lot here; it is often said that “once your code compiles it usually works” (Why Haskell just works - HaskellWiki). In addition, having everything be purely functional is a big win when it comes to writing independent unit tests, and fosters innovations such as the QuickCheck library for property testing. Finally, Haskell excels at powerful abstractions in a way I haven’t seen in any other programming language.
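
As a concrete taste of that last point, here is a hypothetical QuickCheck property for a toy constant-folding pass; nothing here is from an actual Stan compiler, but it is the kind of test that becomes easy when compiler passes are pure functions:

```haskell
import Test.QuickCheck

-- Toy expression language with exact Integer arithmetic.
data Expr = Lit Integer | Add Expr Expr | Mul Expr Expr
  deriving Show

eval :: Expr -> Integer
eval (Lit n)   = n
eval (Add a b) = eval a + eval b
eval (Mul a b) = eval a * eval b

-- A constant-folding pass: collapse operations on known literals.
cfold :: Expr -> Expr
cfold (Add a b) = case (cfold a, cfold b) of
  (Lit x, Lit y) -> Lit (x + y)
  (a', b')       -> Add a' b'
cfold (Mul a b) = case (cfold a, cfold b) of
  (Lit x, Lit y) -> Lit (x * y)
  (a', b')       -> Mul a' b'
cfold e = e

-- Random expressions, kept small via the size parameter.
instance Arbitrary Expr where
  arbitrary = sized gen
    where
      gen 0 = Lit <$> arbitrary
      gen n = oneof
        [ Lit <$> arbitrary
        , Add <$> gen (n `div` 2) <*> gen (n `div` 2)
        , Mul <$> gen (n `div` 2) <*> gen (n `div` 2)
        ]

-- The property: optimization must not change a program's meaning.
prop_cfoldPreservesMeaning :: Expr -> Bool
prop_cfoldPreservesMeaning e = eval (cfold e) == eval e

main :: IO ()
main = quickCheck prop_cfoldPreservesMeaning
```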

Rust strikes me as a good target language for the compiler… but why is it being considered for implementing the compiler itself? The lack of garbage collection is a big disadvantage when you’re doing lots of symbolic expression manipulation.