Experiment Frameworks for Running/Comparing Lots of Models/Configs/Results

Running large numbers of models, configs, etc. to explore a problem can easily get overwhelming. This problem is shared with creating and tuning machine learning systems, and I am aware of at least one tool that attempts to help with it: MLFlow (https://mlflow.org/)

Some questions:

  • Are there other packages that people particularly like or don't like?
  • Does anyone have experience with MLFlow? Any feedback is appreciated; my initial experiments in a Databricks environment seem OK.



I think @rybern is working on related issues of organizing a family of models for his dissertation.

I can never figure out what these web products do from their home pages. Their doc page is better, and I think their use of “lifecycle” corresponds to our use of “workflow”. It looks like some kind of web app for managing and sharing results. It says it’s application agnostic, so I’m curious how hard it was to integrate Stan into it and what you used it for.


Not sure, but I think stantargets, maintained by @wlandau, is somewhere along the lines of what you are looking for: GitHub - ropensci/stantargets: Reproducible Bayesian data analysis pipelines with targets and cmdstanr


Thanks @Bob_Carpenter, yes, I’m exploring an abstraction that should be helpful for organizing and automating model exploration. Andrew posted a video of it here (it’s a very bad video, I’ll replace it soon!)


You might also check out the infrastructure used by the SBC package. I also made an SBC framework of my own, using targets directly (after finding stantargets too limited at the time).


Funny timing, I just gave a talk on Stan + MLFlow at StanConnect Ecology part 1 – we’ve used this extensively over the past few months to great effect. I find it works well once you get things set up, and fits neatly in a Bayesian workflow/MLOps pipeline.

Edit: The compelling use case for me is that a Bayesian workflow involves a lot of experiments. Tracking experiments helps organize your work and more systematically see whether your development effort on a model is resulting in improvements. That said, diligent tracking of experiments is hard when it must be done manually. The value proposition of tools like MLFlow is it automates this tracking, which makes it easier to navigate the Bayesian workflow. As a nice side effect, MLFlow also provides a way to share results and deploy models more easily.
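To make the value proposition concrete, here is a minimal pure-Python sketch of the kind of per-run tracking that tools like MLFlow automate (this illustrates the pattern only, not MLFlow's actual API; `ExperimentTracker` and its methods are made-up names):

```python
import json
import tempfile
import time
from pathlib import Path

class ExperimentTracker:
    """Toy stand-in for an experiment tracker: one JSON record per run."""

    def __init__(self, root):
        self.root = Path(root)
        self.root.mkdir(exist_ok=True)

    def log_run(self, params, metrics):
        """Record the configuration and resulting metrics for one model fit."""
        run = {"timestamp": time.time(), "params": params, "metrics": metrics}
        run_file = self.root / f"run_{len(list(self.root.glob('run_*.json')))}.json"
        run_file.write_text(json.dumps(run, indent=2))
        return run_file

    def best_run(self, metric, maximize=True):
        """Scan all recorded runs and return the one with the best metric."""
        runs = [json.loads(p.read_text()) for p in self.root.glob("run_*.json")]
        key = lambda r: r["metrics"][metric]
        return max(runs, key=key) if maximize else min(runs, key=key)

# Two hypothetical model fits with different priors, compared on elpd_loo
tracker = ExperimentTracker(tempfile.mkdtemp())
tracker.log_run({"prior_sd": 1.0, "chains": 4}, {"elpd_loo": -123.4})
tracker.log_run({"prior_sd": 0.5, "chains": 4}, {"elpd_loo": -118.9})
print(tracker.best_run("elpd_loo")["params"])  # → {'prior_sd': 0.5, 'chains': 4}
```

MLFlow does essentially this (plus artifact storage, a web UI, and a model registry) without you having to maintain the bookkeeping code yourself.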

Some very minimal examples with cmdstanr, brms, and lm here, along with slides: GitHub - mbjoseph/mlflow-stan: MLFlow with cmdstanr


You should check out GitHub - d6t/d6tflow: Python library for building highly effective data science workflows


I was just about to post the same question. I’m mainly interested in tools for keeping track of different parameter settings over time.

One that caught my attention: GitHub - google/gin-config: Gin provides a lightweight configuration framework for Python

Gin provides a lightweight configuration framework for Python, based on dependency injection. Functions or classes can be decorated with @gin.configurable, allowing default parameter values to be supplied from a config file (or passed via the command line) using a simple but powerful syntax. This removes the need to define and maintain configuration objects (e.g. protos), or write boilerplate parameter plumbing and factory code, while often dramatically expanding a project’s flexibility and configurability.

Gin is particularly well suited for machine learning experiments (e.g. using TensorFlow), which tend to have many parameters, often nested in complex ways.
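The dependency-injection idea behind gin can be illustrated with a tiny pure-Python decorator (a sketch of the pattern only; gin's real implementation parses config files and supports a much richer syntax, and `configurable`/`CONFIG` here are made-up names):

```python
import functools

# Global config store; gin populates its equivalent from .gin files.
CONFIG = {}

def configurable(fn):
    """Inject default keyword arguments for `fn` from CONFIG at call time."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        overrides = CONFIG.get(fn.__name__, {})
        # Explicit call-site arguments win over config-supplied values.
        merged = {**overrides, **kwargs}
        return fn(*args, **merged)
    return wrapper

@configurable
def fit_model(data, chains=4, iter_sampling=1000):
    return {"data": data, "chains": chains, "iter_sampling": iter_sampling}

# Roughly analogous to config file lines like:
#   fit_model.chains = 2
#   fit_model.iter_sampling = 500
CONFIG["fit_model"] = {"chains": 2, "iter_sampling": 500}

print(fit_model([1, 2, 3]))            # picks up chains=2, iter_sampling=500
print(fit_model([1, 2, 3], chains=8))  # explicit argument overrides the config
```

The appeal for experiment tracking is that the full parameterization of a run lives in one config file you can version and compare, rather than being scattered across call sites.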


To add some more thoughts based on our experience so far with MLFlow + Stan:

Pros:

  • MLFlow is lightweight and doesn’t require major changes to existing code
  • MLFlow has both an R and Python API
  • Easily share model results by pointing people to a run’s URL
  • Integration with Azure Databricks is easy, but MLFlow does not lock you into using Azure
  • MLFlow works with any modeling framework (Stan, brms, lm, pytorch, random forests, scikit-learn, whatever)
  • Model registry is nice to have, as a way to promote an experimental model to production
  • Open source, and actively maintained
  • MLFlow has been around for a while, is on version 1+, and seems to be in a sweet spot of maturity, community, and active development/maintenance

Cons:

  • Installing the mlflow R package can be challenging, depending on your Python environment hygiene (it requires reticulate and an mlflow conda environment)
  • Because MLFlow is so generic, it lacks some features that would be nice to have in a Bayesian setting (e.g., there’s no built-in support for comparing distributions from one experiment to the next).
  • I have not found the built-in visualizations in the web interface to be particularly useful
  • The R and Python APIs are not unified (e.g., it’s not like Earth Engine’s one-to-one mapping from javascript to python), which can be confusing and sometimes requires reading both language docs to figure out how to do things
  • I initially found the R documentation to be hard to follow
  • R API seems to be somewhat of a second-tier priority relative to the Python API (though this is probably a fair prioritization given the composition of MLFlow users)

I would love to see more built-in support for Stan in particular and PPLs more generally - taking a look at the examples in their GitHub repo, there’s definitely a focus on languages/frameworks that are used more in a machine learning context (but see the Prophet example): mlflow/examples at master · mlflow/mlflow · GitHub


Also check out Weights & Biases: Weights and Biases · GitHub