ScalaStan talk at Strange Loop, 26–28 Sep 2018

Joe Wingbermuehle, the architect and developer of ScalaStan, is giving a talk at Strange Loop in September.

Here’s the abstract:


Stan is a probabilistic programming language for statistical modeling, data analysis, and prediction, with interfaces in R, Python, and other languages. By implementing a statistical model in Stan, one can perform Bayesian inference using Markov chain Monte Carlo (MCMC) as well as optimization and variational inference.

ScalaStan is a fundamentally new kind of interface to Stan. Not only does ScalaStan allow one to interface with Stan from Scala, but, unlike the other Stan interfaces, ScalaStan also supports the type-safe programmatic manipulation and generation of Stan programs via an embedded domain-specific language (DSL). Thus, ScalaStan allows one to fully specify a Stan program in Scala, marshal data to and from the program in a type-safe way, and cache Stan models for fast iteration.

In this talk, we show how the Scala type system allows us to enforce type safety in the Stan model and prevent the generation of invalid Stan code. Next, we show how the ScalaStan DSL can be used to generate higher-level Stan models. Finally, we dive into the details of several specific techniques ScalaStan employs to enforce type safety and prevent invalid code in an embedded DSL.
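As a rough illustration of the kind of type-level enforcement the abstract describes, here is a sketch of type-safe code generation via phantom types. This is an invented toy, not the actual ScalaStan implementation: the names `Expr`, `IntLit`, `RealLit`, and `Add` are assumptions for illustration only.

```scala
// Sketch (not the real ScalaStan internals): expressions carry their Stan
// type as a Scala type parameter, so mixing them incorrectly is a Scala
// compile error rather than a Stan compile error discovered later.
sealed trait StanInt
sealed trait StanReal

sealed trait Expr[T] { def emit: String } // invariant in T on purpose
final case class IntLit(v: Int) extends Expr[StanInt] { def emit: String = v.toString }
final case class RealLit(v: Double) extends Expr[StanReal] { def emit: String = v.toString }
final case class Add[T](a: Expr[T], b: Expr[T]) extends Expr[T] {
  def emit: String = s"(${a.emit} + ${b.emit})"
}

val ok = Add(RealLit(1.5), RealLit(2.0)) // emits "(1.5 + 2.0)"
// Add(RealLit(1.5), IntLit(3))          // rejected by scalac: no T fits both
```

Because `Expr` is invariant in `T`, there is no type the compiler can pick that makes the commented-out line typecheck, which is exactly the "invalid Stan code becomes a host-language compile error" property.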


I love Scala and type safety. I just wonder: is it really such a big gain in the case of Stan to have type-safe models?

I mean, Stan code is normally less than 100 lines of code; for such a small code size, why is type safety needed?

I love type safety, but is that the only advantage of ScalaStan?

For my frequent usage, there are several benefits which may be non-obvious from the previous description or discussion:

  • sounds silly, but I love using ScalaStan from within an IDE with all the auto-complete, parameter help and instant red-underlining / type checking! Maybe we can crowdsource adding ScalaDoc to all the Stan functions one day…
  • Full integration into the backend of an analytical application. If you are making/solving/predicting from models as part of a server-side application, which for me is already on the JVM, I can do everything from right inside the same runtime.
  • raising the level of abstraction raises the limit on how much complexity you can handle. When I used Stan directly I did have very short models, as you mention; now I have models that run to many hundreds of complex lines, because I compose Stan models out of higher-order Scala primitives we’ve built that abstract over things like adding fancy prior hierarchies, error models, variable transforms, and more complex sorts of predictors.
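The kind of composition described in the last bullet can be sketched like this. It is an illustrative toy, not the actual ScalaStan primitives: `StanProgram` and `withHierarchicalPrior` are invented here.

```scala
// Toy sketch: represent a Stan program as plain data so that ordinary Scala
// functions can layer features onto it (here, a hypothetical helper that
// adds a hierarchical prior to an existing parameter).
final case class StanProgram(parameters: List[String], model: List[String]) {
  def emit: String =
    s"parameters {\n${parameters.map("  " + _).mkString("\n")}\n}\n" +
      s"model {\n${model.map("  " + _).mkString("\n")}\n}"
}

def withHierarchicalPrior(p: StanProgram, param: String): StanProgram =
  p.copy(
    parameters = p.parameters ++
      List(s"real ${param}_mu;", s"real<lower=0> ${param}_sigma;"),
    model = s"$param ~ normal(${param}_mu, ${param}_sigma);" :: p.model
  )

val base = StanProgram(
  parameters = List("real beta;"),
  model = List("y ~ normal(beta * x, 1);")
)
val hier = withHierarchicalPrior(base, "beta") // beta now gets its own prior
```

The real ScalaStan works on a typed representation rather than strings, but the idea is the same: once the program is a value, "add a prior hierarchy" or "add an error model" becomes an ordinary, reusable Scala function.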

So I think that’s the real value: being able to abstract over the composition of Stan programs. ScalaStan gives you that, with type checking that enforces Stan semantics, as opposed to the alternative of building up a program by “string concatenation” (with no host-language compile-time verification that your Stan program will even compile).

Nice, I think I get your point.

May I ask what your stack is? Is it full Scala, or do you “have to” use Python for something as well?

Ideally I would like to use Scala for everything that is simple data manipulation/transformation etc., and some ML lib (Stan, for example) for the actual “brain” of the pipeline. Ideally, I’d write some simple wrapper around whatever ML lib I want to use, call it from Scala, do the “ML”, and then do the rest in Scala again.

So basically Scala is the “glue” between the “something else ML lib”s, which can be in whatever language, since the interface between the Scala world and the “some ML lib world” is very narrow: a few function calls with a few parameters. For example, the interface to this kind of “ML algorithm” is very narrow and can be called from Scala easily, wrapped into something type-safe. So the Scala side only sees something type-safe, and then life is easy.
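A narrow, type-safe wrapper of that sort might look like the sketch below. Everything here is hypothetical (`FitParams`, `FitResult`, `runExternalFit`, and the use of `echo` as a stand-in for the real external tool); the point is only that the untyped boundary is confined to one function.

```scala
import scala.sys.process._

// Hypothetical narrow boundary to an external "ML lib": typed parameters in,
// a typed result out, with the untyped world confined to one function.
final case class FitParams(iterations: Int, learningRate: Double)
final case class FitResult(loss: Double)

def runExternalFit(p: FitParams): FitResult = {
  // Stand-in for launching the real tool; `echo` just plays back a number
  // so the example stays self-contained.
  val out = Seq("echo", (p.learningRate / p.iterations).toString).!!.trim
  FitResult(out.toDouble)
}

val result = runExternalFit(FitParams(iterations = 10, learningRate = 0.5))
```

In practice the `Seq("echo", …)` line would become whatever invocation the chosen ML tool needs (a subprocess, an HTTP call, a JNI binding), but the Scala side never sees anything untyped.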

A big part of the story is that there is a lot of book-keeping around the actual “ML brain”.

If it is more than 1000 lines of code, it is much better to write that domain-specific book-keeping in Scala than in Python or R or whatever. The “ML brain” is usually 15 lines of code and the rest is a few thousand lines, so the more type safety one has, the better.

So, ideally, I want to use Scala for everything other than the “core ML lib”: validating/transforming input data, specifying input parameters for the “core ML lib”, plotting output data, and transforming/storing the output data.
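For the validation step, that can be as simple as parsing raw rows into a typed record before anything reaches the ML core. A sketch, with `Observation` and `parseRow` invented for illustration:

```scala
// Sketch: keep the book-keeping typed. Raw CSV-ish rows are validated into
// a case class, and bad rows are collected as errors instead of failing
// deep inside the ML step.
final case class Observation(x: Double, y: Double)

def parseRow(row: String): Either[String, Observation] =
  row.split(",").map(_.trim) match {
    case Array(x, y) =>
      try Right(Observation(x.toDouble, y.toDouble))
      catch { case _: NumberFormatException => Left(s"non-numeric row: $row") }
    case _ => Left(s"malformed row: $row")
  }

val rows = List("1.0, 2.0", "3.0, oops", "4.0")
val (errors, observations) = rows.map(parseRow).partitionMap(identity)
// observations holds only the valid, typed records; errors explains the rest
```

Everything downstream of this point works with `Observation`, so the compiler enforces what the ML core is actually fed.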

Ideally I do not want to write any Python/R/whatever code other than the wrapper needed to call the “ML core”. I wonder, is that feasible nowadays with Scala? How easy is it, in your experience?

I guess it should not be so hard, because a stack is usually ETL + ML + plotting. In the worst case I would also need to wrap some Python plotting libs, but other than that I see Scala, as the “framework language”, as a very good alternative (to Python, for example). It also seems to me that Spark is very good nowadays, offering type-safe data structures (tables).

“Maybe we can crowdsource adding ScalaDoc to all the Stan functions one day…” But I guess it is possible to just read the ScalaStan code. I mean, the types should tell everything. What more documentation is needed? The documentation for Stan and the ScalaStan code + types are enough, right?

There’s a standoff file generated for RStudio that may be useful for that.

This is what I’m really hoping we can add to Stan itself. We’ve had lots of discussion around how to add modules. Functions aren’t enough because they can’t add parameters, etc.