Tips for using cmdstanpy as a dependency in a Python package

Hi! I'm keen to get thoughts on the best approach for having cmdstanpy as a dependency in a Python package. For the prophet package we currently have cmdstanpy as an optional backend (see the facebook/prophet repository on GitHub), and end users need to install cmdstan themselves before installing prophet. Is there a recommended way to make the installation a single step? Since prophet requires a build step anyway, I was thinking we could call install_cmdstan() somewhere inside setup.py.

And a follow-up question: is there any guidance on how we might include a Stan model in a package without requiring the end user to build the package? Installs can take a while (building cmdstan plus compiling the Stan model included in the package) and can fail for different reasons on different machines, so I'm wondering if there's a way to package everything into a wheel. I guess I'm looking for an equivalent of how RStan does this, as described in the rstantools guide "Guidelines for developers of R packages interfacing with Stan".


The latest CmdStanPy now has a conda installer that will also install CmdStan; see the "Installation" page of the CmdStanPy 1.2.0 documentation.

include a stan model in a package without requiring the end-user to build the package?

is this something you could do with Conda?

What if the models are compiled and needed files are packed together with the wheel?

→ execution files + tbb

It seems like prophet is on conda-forge already, so this should be very easy to do. (Edit: to be 100% clear, I mean including cmdstan[py] as a dependency should be very easy. I haven’t tried to see if conda could ship a compiled stan model, but maybe!)

To support both this and non-conda setups, I think that checking cmdstanpy.cmdstan_path() would work. This should throw if cmdstan isn’t found, and then you can call install_cmdstan internally.
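A minimal sketch of that check, for illustration only. The function name ensure_cmdstan and its two optional arguments are hypothetical (they exist so the logic can be exercised without a real CmdStan installation), and the sketch assumes cmdstanpy.cmdstan_path() signals a missing installation by raising ValueError:

```python
def ensure_cmdstan(find_path=None, install=None) -> str:
    """Return the CmdStan path, installing CmdStan first if none is found.

    find_path and install are hypothetical hooks; by default they are
    cmdstanpy.cmdstan_path and cmdstanpy.install_cmdstan.
    """
    if find_path is None or install is None:
        import cmdstanpy  # imported lazily so the hooks can replace it

        find_path = find_path or cmdstanpy.cmdstan_path
        install = install or cmdstanpy.install_cmdstan
    try:
        return find_path()
    except ValueError:
        # Assumption: a missing installation is reported as ValueError.
        if not install():
            raise RuntimeError("CmdStan installation failed")
        return find_path()
```

Calling ensure_cmdstan() once before compiling or loading a model would cover both the conda case (CmdStan already present) and the plain pip case.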


Thanks for your responses!! @ahartikainen yep I ended up trying to package everything together (actually looking through past Issues I think you suggested the same thing a while back!) just using setup.py and a packaging tool.

I included the built cmdstan package in the wheel as well, but that bloated it quite a bit (the wheel ended up being ~170 MB). I think that's the only way to ensure that the model executable is run using the same cmdstan version that was used to compile it. It did work very nicely out of the box, but I'm not sure that's worth the large wheel size.

Thanks again for the help :)

I don’t think cmdstan is needed after compilation. Only tbb.

Ahhh, do you mean that we only need to keep some of the files in the cmdstan-2.26.1 folder? Sorry I’m not sure which files are the tbb files.

Oh sorry, I think I see what you're saying: we don't actually need to keep the cmdstan files at all? E.g. in the notebook you made (02_sample_without_cmdstan.ipynb in the ahartikainen/Precompile_cmdstan repository on GitHub) we could still instantiate and fit the Model object even without a link to a cmdstan installation. Curious: is this a bug or a feature? haha

I ended up pruning the stan/ folder and the downloaded binaries macos-stanc, linux-stanc, windows-stanc, etc. in order to get the package size way down, and everything still seems to work. Thanks for the tip @ahartikainen !

I think my original questions have been largely answered, so thanks everyone :) In case anyone reading this is interested, here is the work-in-progress PR we have on Prophet at the moment: "[Draft] Python Wheels for PyPi" by tcuongd, pull request #2010 on facebook/prophet. There are still some issues with Linux we need to figure out, and other optimisations we can do.


Once a Stan model has been compiled to an executable, it is standalone and doesn't need anything else from cmdstan. The only thing the executable needs is the tbb library, which has to be on the $PATH. The library can be found in the cmdstan folder, but at least on Windows you could just copy it into the same folder as your executables and it would still work. On macOS and Linux I think you need some hack for the executable so that it works correctly.
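For the Windows case, putting the folder that holds the tbb library on the PATH before running the model could look roughly like the sketch below. The helper name prepend_to_path is hypothetical, and this does not address the macOS/Linux situation mentioned above, where shared libraries are not looked up via PATH:

```python
import os
from pathlib import Path


def prepend_to_path(lib_dir, env=None) -> str:
    """Put lib_dir at the front of PATH unless it is already listed.

    env defaults to os.environ; passing a plain dict is handy for testing.
    """
    env = os.environ if env is None else env
    lib_dir = str(Path(lib_dir).resolve())
    parts = [p for p in env.get("PATH", "").split(os.pathsep) if p]
    if lib_dir not in parts:
        parts.insert(0, lib_dir)
    env["PATH"] = os.pathsep.join(parts)
    return env["PATH"]
```

A package could call this once at import time with the directory where it ships the tbb files, so that child processes started afterwards inherit the updated PATH.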


We should think about the tooling required for development mode vs. the tooling required for production mode. There's a brief discussion in the docs (https://mc-stan.org/cmdstanpy/workflow.html):

The statistical modeling enterprise has two principal modalities: development and production. The focus of development is model building, comparison, and validation. Many models are written and fitted to many kinds of data. The focus of production is using a trusted model on real-world data to obtain estimates for decision-making. In both modalities, the essential workflow remains the same: compile a Stan model, assemble input data, do inference on the model conditioned on the data, and validate, access, and export the results.

It sounds like the tooling for production mode would be just the model executable, the tbb lib, the CmdStan utilities stansummary and diagnose, plus CmdStanPy. (We can probably skip the CmdStan utilities.)
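In production mode, loading a model could then come down to pointing CmdStanPy at the shipped executable, roughly as sketched here. The folder name stan_model and both function names are hypothetical; CmdStanModel(exe_file=...) is the CmdStanPy constructor argument for an already compiled model:

```python
from pathlib import Path

MODEL_SUBDIR = "stan_model"       # hypothetical data folder inside the package
MODEL_NAME = "prophet_model.bin"  # executable name used later in this thread


def packaged_model_path(package_dir) -> Path:
    """Return the path of the precompiled model shipped with the package."""
    path = Path(package_dir) / MODEL_SUBDIR / MODEL_NAME
    if not path.is_file():
        raise FileNotFoundError(f"No precompiled model at {path}")
    return path


def load_packaged_model(package_dir):
    """Create a CmdStanModel from the shipped executable, without a .stan file."""
    import cmdstanpy  # imported lazily; only needed when a model is loaded

    return cmdstanpy.CmdStanModel(exe_file=str(packaged_model_path(package_dir)))
```

The returned object can be used with sample() or optimize() as usual, provided the tbb library can be found at run time.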


Thanks again @ahartikainen and @mitzimorris for the clarity!! Our code for the packaging step currently looks like this (not reviewed yet):

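import os
from pathlib import Path
from shutil import copy, copytree, rmtree

# Note: get_cmdstan_cache, clean_all_cmdstan, build_cmdstan and MODEL_DIR are
# assumed to be defined elsewhere in the setup script (not shown in this post).
CMDSTAN_VERSION = "2.26.1"
BINARIES_DIR = "bin"
BINARIES = ["diagnose", "print", "stanc", "stansummary"]
TBB_PARENT = "stan/lib/stan_math/lib"
TBB_DIRS = ["tbb", "tbb_2019_U8"]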

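def prune_cmdstan(cmdstan_dir: str) -> None:
    """Shrink a built cmdstan folder to just its binaries and the TBB folders."""
    original_dir = Path(cmdstan_dir).resolve()
    parent_dir = original_dir.parent
    temp_dir = parent_dir / "temp"
    # Start from an empty staging folder next to the cmdstan folder.
    if temp_dir.is_dir():
        rmtree(temp_dir)
    temp_dir.mkdir()

    # Copy bin/ and keep only the files whose stem is listed in BINARIES.
    copytree(original_dir / BINARIES_DIR, temp_dir / BINARIES_DIR)
    for f in (temp_dir / BINARIES_DIR).iterdir():
        if f.is_dir():
            rmtree(f)
        elif f.is_file() and f.stem not in BINARIES:
            os.remove(f)
    # Copy the TBB folders, preserving their location inside cmdstan.
    for tbb_dir in TBB_DIRS:
        copytree(original_dir / TBB_PARENT / tbb_dir, temp_dir / TBB_PARENT / tbb_dir)

    # Replace the full cmdstan folder with the pruned copy.
    rmtree(original_dir)
    temp_dir.rename(original_dir)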

def build_cmdstan_model(target_dir):
    """Build cmdstan inside target_dir, compile the model, then prune cmdstan."""
    import cmdstanpy

    cmdstan_cache = get_cmdstan_cache()
    cmdstan_dir = os.path.join(target_dir, f"cmdstan-{CMDSTAN_VERSION}")

    # Download cmdstan into the cache only if it is not already there.
    if os.path.isdir(cmdstan_cache):
        print(f"Found existing cmdstan library at {cmdstan_cache}")
    else:
        cmdstanpy.install_cmdstan(version=CMDSTAN_VERSION, dir=cmdstan_cache)

    # Work on a fresh copy of the cached cmdstan and rebuild it in place.
    if os.path.isdir(cmdstan_dir):
        rmtree(cmdstan_dir)
    copytree(cmdstan_cache, cmdstan_dir)
    with cmdstanpy.utils.pushd(cmdstan_dir):
        clean_all_cmdstan()
        build_cmdstan()
    cmdstanpy.set_cmdstan_path(cmdstan_dir)

    # Compile the Stan model and copy the executable next to the package.
    model_name = "prophet.stan"
    target_name = "prophet_model.bin"
    sm = cmdstanpy.CmdStanModel(stan_file=os.path.join(MODEL_DIR, model_name))
    sm.compile()
    copy(sm.exe_file, os.path.join(target_dir, target_name))
    # Clean up: remove compilation by-products, keeping only the .stan source.
    for f in Path(MODEL_DIR).iterdir():
        if f.is_file() and f.name != model_name:
            os.remove(f)
    prune_cmdstan(cmdstan_dir)

Keen to get thoughts!

Is this code run in CI or on the user's computer?

I don't quite follow this remove-copy idea. Is this to make sure that cmdstan is built with default settings?

I think this is not needed

Why is this needed? You pack the whole cmdstan folder structure + executables for summary etc?

I think all you should need to do is compile the executable files (.bin) and pack them into the wheel as a data folder in your package, together with the compiled tbb files. Then your users don't need a cmdstan installation.
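As a rough illustration of that packing step, a build script could gather the compiled executable and the tbb files into one data folder before the wheel is built. The function name collect_runtime_files, the file-name filter, and the folder layout are all assumptions made for this sketch:

```python
import shutil
from pathlib import Path


def collect_runtime_files(exe_file, tbb_dir, data_dir):
    """Copy the model executable and the tbb library files into data_dir.

    Assumption: tbb library files have names starting with "tbb" or "libtbb".
    Returns the list of copied paths.
    """
    data_dir = Path(data_dir)
    data_dir.mkdir(parents=True, exist_ok=True)
    exe_file = Path(exe_file)
    copied = [Path(shutil.copy2(exe_file, data_dir / exe_file.name))]
    for lib in sorted(Path(tbb_dir).iterdir()):
        if lib.is_file() and lib.name.startswith(("tbb", "libtbb")):
            copied.append(Path(shutil.copy2(lib, data_dir / lib.name)))
    return copied
```

The data folder would then be declared as package data (for example through the package_data argument of setuptools' setup()) so that it ends up inside the wheel.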

Thanks @ahartikainen! Responses to your questions below:

Is this code run in CI or on the user's computer?

The main purpose would be to run it on the facebook/prophet CI on package release, in order to generate the .whl files for PyPI. Ideally it'd work on users' computers too if they want to build the package locally (but I guess we can think of this as a secondary use case for now).

I don't quite follow this remove-copy idea. Is this to make sure that cmdstan is built with default settings?

I don’t think the remove is necessary on CI actually, you’re right. I guess this step is pretty defensive; it’d only be relevant if the end user had previously built the package in-place (i.e. in the same folder as the source code).

I think this is not needed

You’re right, thank you!

Why is this needed? You pack the whole cmdstan folder structure + executables for summary etc?

Yeah exactly.

I think all you should need to do is compile the executable files (.bin) and pack them into the wheel as a data folder in your package, together with the compiled tbb files.

Hmm, we still use cmdstanpy code in our package, particularly the .fit(), .sample(), .draws() methods. The CmdStanMLE and CmdStanMCMC objects are also saved to the fitted Prophet object. I get your point though that the end user probably doesn't care about the MCMC diagnostics at this point, and it'd probably simplify the setup script if we just removed the cmdstan installation altogether. I'm happy to include those executables in the wheel for now though (the total wheel size is still reasonable, under 20 MB).
