Boost Build into stan-math?

Are people OK with having Boost Build in stan-math eventually?

This would be very useful for the planned integration of the Boost MPI and Boost Serialization libraries.

Pro: Hopefully easy deployment of those two libs
Con: Lots of extra stuff in stan-math

I am not sure yet what options are best, but before embarking on boost build, I wanted to run this by the team.

Is there really another alternative?

The different ways we could do this:

  1. (simplest) just include all of Boost in the Math distribution
  2. put the parts that Stan / Math don’t use by default, but that we need for conditional builds (MPI / GPU), into a separate folder
  3. expect the user to install Boost separately; it can be installed with tools like brew or apt (see the commands after this list)
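
For reference, the package-manager route in option 3 is typically a one-liner (package names vary by platform):

```sh
# Option 3: install Boost via a system package manager
brew install boost                      # macOS (Homebrew)
sudo apt-get install libboost-all-dev   # Debian / Ubuntu
```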

My preference is to include everything from Boost in the Math library. I know that means RStan and PyStan have to prune it, but that step already happens now. It’ll make it easier for developers if it’s all sitting there.

How do other, similar software packages handle this issue?

My sense is that bundling external libraries is generally regarded as undesirable. It’s certainly not needed on Linux, since you can install Boost using the system’s package manager.

That’s not necessarily true. On a desktop that one administers oneself, it usually is. However, the reason a Boost build is being considered is so that Boost.MPI can be used in Stan, which means there’s a good chance that someone will want to run Stan on a Linux HPC cluster. Users on clusters would generally not be using the system’s package manager. They would either use whatever Boost installation the cluster has available (if any), or compile Boost.MPI from source in their home directories.

Furthermore, in my experience, even HPC clusters that have Boost available often don’t have Boost.MPI available. (This is probably because a cluster usually has several compilers and MPI implementations, and it would be a pain to build Boost.MPI for every compiler/MPI combination, and only mildly useful to build it for a subset of them.)

I would stay away from relying on the system package manager, since Linux systems are still bad at allowing multiple versions of Boost. I have certainly compiled Stan against my system Boost, but I wouldn’t foist that mess on anybody… and we don’t have the resources to package it properly. That’s not to say we couldn’t be agnostic about where Boost comes from, and we should definitely not make it harder to use an independently installed version.

TL;DR: I think we should distribute the source. There are incompatibilities across different Boost versions.

That’s a good question, and I think what we do is highly non-standard. I was looking through the “Trending C++ repositories on GitHub”; most projects don’t use Boost at all.

These are some that include the source:

These are some that require installation:


The real issue we have is that as we push further into MPL, not all versions of Boost work. That’s actually true for Stan’s stanc too. It is much easier to rely on a single version of Boost.

There are (at least) two audiences we care about:

  1. developers
  2. users of RStan, PyStan, CmdStan (and everyone else that depends on CmdStan)

For the most part, the two audiences are aligned. It’s much better to have an easy install than to require a more complicated process to download the right libraries.

To ease development, I’d really suggest we include the whole source.

For release in things like PyStan, I’d suggest we run bcp or manually remove unused source if that helps. If it’s really a concern (and I don’t want to do this unless there’s a very, very good reason), we could break the third-party dependencies out into a separate submodule. That would mean Math gets tagged with a version of the dependencies, but the dependencies aren’t included in GitHub’s downloads, and it’s easy to use the tagged version of the repo without any of the libraries. That adds a lot of overhead for the rest of the process, so I’m really against it, but it is possible.
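
To sketch what the bcp route looks like: bcp copies a named subset of Boost, plus its transitive dependencies, into a target directory. The module list below is illustrative, not what RStan/PyStan actually prune to:

```sh
# Illustrative bcp invocation: copy only the Boost modules we need
# (module names here are a guess, not a vetted list)
mkdir -p lib/boost_subset
bcp math serialization mpi lib/boost_subset
```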

@jjramsey: Thanks, your post confirms our current plan, which is to expect that some MPI installation already exists on the target system while everything else is contained in stan-math. An MPI installation will already be part of any cluster where MPI is used, but the rest cannot be expected to be available (let alone in adequate versions). For Mac users, a MacPorts/Homebrew installation of OpenMPI or MPICH should be straightforward, and on Linux installing MPI should also be easy; I have no clue whether this is doable on Windows.
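
For example, getting the MPI toolchain itself in place is usually straightforward on Mac and Linux (formula/package names as of this writing):

```sh
# Install an MPI implementation along with its compiler wrappers
brew install open-mpi                             # macOS; or: brew install mpich
sudo apt-get install openmpi-bin libopenmpi-dev   # Debian / Ubuntu
```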

Now, I also think that minimizing what gets distributed is a good thing. On the other hand, we have already run into Boost bugs that forced us to make small changes to Boost, so having the source there is good. How about we include only what we really need? In addition to what we have now, we would include

  • Boost Build
  • the MPI & Serialization libraries

Does that make sense?

I haven’t checked lately what the difference is; do you have some idea? All I know is that the recursive download of the math repo is slow.


One thing I wonder about integrating Boost.MPI into stan-math is how you plan to deal with the various quirks of MPI installations on different clusters. Sometimes the available MPI C++ wrapper is simply mpic++ or mpicxx, and IIRC the Boost.MPI build process handles those almost automatically. However, sometimes there are multiple MPI wrappers; for example, on some clusters the wrapper for g++ is mpicxx, while the wrapper for the Intel C++ compiler is mpiicpc. In some cases there is no explicit MPI wrapper at all: on Crays, there is usually a wrapper script for the C++ compiler called CC, which often handles both serial and parallel compilation. These issues may in turn smoke out bugs and issues in the build system currently used in stan-math and CmdStan.

Well, these quirks are the reason to go for Boost Build. As far as I know, you can tell Boost Build which MPI installation to pick up on a given system. All one needs to do is tell Boost Build which mpicxx or mpiWhatever to use. If that is not sufficient, one even has the option of telling Boost Build exactly where to find the headers and what to link against (the very hard way).

So as long as the user can supply their project-config.jam or user-config.jam (not sure yet which), we should be good, right?
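
For concreteness, here is a minimal sketch of the kind of jam stanza I mean; the wrapper path is made up and would come from the user’s system:

```sh
# Write a minimal user-config.jam telling Boost Build which MPI to use
cat > user-config.jam <<'EOF'
using mpi ;                                 # auto-detect the mpicxx wrapper on PATH
# using mpi : /opt/intel/impi/bin/mpiicpc ; # or name a specific wrapper explicitly
EOF
```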

Hmm… I recall some old tweaks for the Intel compiler; I am not sure if these are still in there. My point was that it can be convenient to have full control over the sources… cvodes is the same story: for cvodes we have to ifdef out all those stupid printf statements to make CRAN happy.

Are you planning on the user supplying the .jam file directly, or do you expect to generate the .jam file based on user input (for example, variables in make/local)?

I have not yet started to think about it in detail. The more we can automate, the better. I do not think that make/local should be translated to a jam file, but a make/mpi could be used for that. I hope we can make the “easy” configurations work out of the box automagically, but also leave things flexible enough for the non-standard cases.
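
Purely as a hypothetical sketch (none of these names exist anywhere yet), that translation step could be as small as:

```sh
# Hypothetical: derive a user-config.jam from a make/mpi-style setting.
# MPI_CXX is a made-up variable name; nothing in stan-math defines it today.
MPI_CXX="${MPI_CXX:-mpicxx}"
echo "using mpi : ${MPI_CXX} ;" > user-config.jam
```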

… if you’d like to contribute to this MPI build problem, you are more than welcome to draft something…

You got me all wrong: I’m all for including the sources.

I really appreciate all the time you’ve put into this, but unfortunately it puts you in the position of knowing better than most people what needs to be done. Are there specific tasks you want help with? For example, if there’s a branch that builds by following some instructions, I could do some comparisons of how difficult it would be to provide Boost in different ways, but I don’t want to get into figuring out how to build a branch/example from scratch.

Hi!

Help would be great and much appreciated. Since Boost MPI in 1.64 is broken, the first thing anyone could do is upgrade Boost to 1.65.1.

On stan-math I am working on the branch feature/concept-mpi-2, which includes a top-level MPI_NOTES file that details the steps to get a working system. The final tweaks to get the MPI tests compiling require a make/local file; the one I have set up for myself is in the cmdstan branch feature/proto-mpi.
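
Roughly, that make/local boils down to routing compilation through the MPI wrapper. This is a from-memory sketch; the authoritative file is the one in feature/proto-mpi:

```sh
# Sketch only -- NOT the actual make/local from feature/proto-mpi.
# stan-math's makefiles use the CC variable for the C++ compiler.
cat > make/local <<'EOF'
CC = mpicxx
EOF
```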

The MPI tests in stan-math are under prim / mat / functor.

If you want to try to get those tests up and running that would be great. In case of any issues, I am happy to help out.

Alternatively, I could wipe my system and see if the steps detailed in the file are sufficient. I think they contain everything that’s needed, but I would only know for sure once I start clean.

I can set it up locally; I’ll just ask separately if there’s an issue.

It’s a tradeoff. Bundling is easier for users, as they don’t have to manage dependencies; you can see what a pain that is from something like the TensorFlow install instructions. Bundling is wasteful in that it can lead to multiple installs of the same software.

Given that our users aren’t Unix sysadmins for the most part, I tried to err on the side of making it as easy as possible for them to install things. So we bundle the open-source libraries on which we depend into Stan. You don’t have to use them for PyStan if it’s easier to do it some other way.

For CmdStan: bundled, but you can use a command-line directive to use a different version than the bundled one.
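
Concretely, this comes down to overriding a make path variable. Something like the following (the BOOST variable name may differ across versions, and the path is illustrative) points the build at an external Boost instead of the bundled copy:

```sh
# Use an external Boost instead of the bundled lib/boost_* copy
make build BOOST=/usr/local/include              # one-off, on the command line
echo "BOOST = /usr/local/include" >> make/local  # or persist it in make/local
```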

For RStan: bundled (it has to be for CRAN), but it also has the option of pointing to a different library version.

Because CRAN forces RStan to remain consistent with the CRAN BH package (Boost headers), it really restricts what we can do. This is the main drawback with forced external linkage. But it’s not an intrinsic property of external package management.

Lots of packages on GitHub include other packages as submodules because Git makes that easy.

Obviously there’s a trend toward bundling with things like Docker (but I believe you can have a partial Docker image that links externally).

TensorFlow uses external dependencies, so their instructions on how to deal with all the management that entails are rather long: https://www.tensorflow.org/versions/r0.12/get_started/os_setup
We’ll be going down that route for MPI and GPUs, I’m pretty sure.

Hmm, looks like TensorFlow only works with Nvidia GPUs, just like MXNet.