Stan++/Stan3 Preliminary Design

Thanks for the response. I understand that this is hard and I am in the comfortable position of having an opinion but not having any responsibility for the decision :-)

However, I think my main point still got lost in translation, so I’m giving it another try :-)

But how does the compiler determine that times in the main model maps to gp.times in the submodel? That is the magic that worries me. I’d rather have to write something like

model gp = gaussian_process(N, times, length, sigma, y);

or even with named parameters to avoid mismatches in long parameter lists:

model gp = gaussian_process(
             N = N, 
             times = times, 
             length = length, 
             sigma = sigma, 
             y = y
           );

This way it is obvious what gets passed where and I can be precise what is shared and what is different in each submodel. For example there is now a clear distinction between:

model gp1 = gaussian_process(N, times, length, sigma, y1);
model gp2 = gaussian_process(N, times, length, sigma, y2);

and

model gp1 = gaussian_process(N, times1, length1, sigma, y1);
model gp2 = gaussian_process(M, times2, length2, sigma, y2);

Hope that clarifies my concern :-)

I don’t think the proposal aids in this, but a new compiler implementation will. I tried to convince everyone over the week that we could do a lot in this realm, but I don’t think I managed. The idea is basically the same as in this type of CRDT: Conflict-free replicated data type - Wikipedia

Summarized, each AST subtree ending with a target += can be analyzed for information dependencies (both in and out) and if we can conservatively partition the model into subtrees that don’t interact, we can parallelize and then do all of the target +=s in whatever order we want.
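
For concreteness, here is a minimal sketch (the names y1, y2, mu1, mu2 are purely illustrative) of the kind of program such an analysis could partition: the two target += subtrees read disjoint variable sets, so their increments commute and could be accumulated in parallel.

data {
  int<lower = 0> N;
  vector[N] y1;
  vector[N] y2;
}
parameters {
  real mu1;
  real mu2;
}
model {
  // this subtree touches only {y1, mu1}
  target += normal_lpdf(y1 | mu1, 1);
  // this subtree touches only {y2, mu2}
  target += normal_lpdf(y2 | mu2, 1);
}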

Actually R users should be quite familiar with something like

gp = gaussian_process()

without having to pass everything via argument. For an R user this is similar to what happens when variables exist in the global environment or in the environment of a wrapper function around gaussian_process(). However, with a few exceptions, that’s considered bad practice in R, so they will be familiar with it but used to being discouraged from doing it.

That example I had didn’t make sense. I didn’t mean to cut all the declarations out, it should’ve been this:

model gaussian_process() {
   data int<lower = 0> N;
   data vector[N] times;
   param real length;
   param real sigma;   
   data vector[N] y ~ multi_normal(0, cov_exp_quad(times, length, sigma));
}

model gp = gaussian_process();

or

model gaussian_process(int N, vector times, real length, real sigma) {
   data vector[N] y ~ multi_normal(0, cov_exp_quad(times, length, sigma));
}

data int<lower = 0> N;
data vector[N] times;
param real length;
param real sigma;   
model gp = gaussian_process(N, times, length, sigma);

I know this kind of thing is possible, but I have no idea how easy it is to do or how general the heuristics we can come up with will be.

That wasn’t my intent! I’m not a fan of R’s dynamic lexical scope!

For the most part me neither!

I prefer the version where arguments are used to pass this stuff in. It does mean sometimes things have to get passed down deep but that’s a problem every language deals with.

We need something like this urgently for Stan, in my view! Stan is amazing and HMC is capable of attacking really large problems, but only if you manage to use the raw computing power of many cores. With more knowledge of the model structure we should be able to figure out much more about which computations are independent, and that can be used to automate parallelism. My thinking was along very similar lines. Thus the target += statements in any Stan program are an obvious target for these optimizations.

If we cannot do a fully automatic thing, then maybe a syntax like

target += ... some clever syntax to specify independent contributions...

could be translated to a parallel execution facility. How that would look exactly is not yet clear to me.
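
For reference, one existing facility in this direction is map_rect; here is a minimal sketch, assuming Stan 2.18+, in which the per-shard likelihood contributions are independent by construction (the names shard_lp, y_shard, x_i, etc. are illustrative):

functions {
  // log likelihood of one shard of observations; theta and x_i are unused here
  vector shard_lp(vector phi, vector theta, real[] y_shard, int[] x_i) {
    return rep_vector(normal_lpdf(y_shard | phi[1], phi[2]), 1);
  }
}
data {
  int<lower = 1> S;     // number of shards
  int<lower = 1> n;     // observations per shard
  real y[S, n];         // data split into shards
}
transformed data {
  vector[0] theta[S];   // no shard-specific parameters
  int x_i[S, 0];        // no shard-specific integer data
}
parameters {
  real mu;
  real<lower = 0> sigma;
}
model {
  vector[2] phi;
  phi[1] = mu;
  phi[2] = sigma;
  mu ~ normal(0, 1);
  sigma ~ normal(0, 1);
  // each shard's contribution can be evaluated in parallel and the
  // results summed in any order
  target += sum(map_rect(shard_lp, phi, theta, y, x_i));
}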

Sebastian

Maybe the first thing would be to move the AD graph into some network library object and then do the analyses on that? Conditional statements are probably going to be tricky unless one just marks them as one node.

Would this help to speed up recursive algorithms, e.g. state-space models (Kalman filter stuff by Simo Särkkä et al.)?

I think we will need ways to allow users to manually override / specify parallelism-related ideas; it might help that I have been proposing adding a general metadata annotations system to the Stan 3 language to allow these sorts of AST decorations (e.g. something like @independent (or whatever word) on the target += statement; syntax TBD).
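
Purely as a hypothetical sketch of what such an annotation might look like (the name @independent and its placement are, as noted, TBD):

model {
  @independent
  target += normal_lpdf(y1 | mu1, sigma1);

  @independent
  target += normal_lpdf(y2 | mu2, sigma2);
}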

I have a bunch of other thoughts on parallelism that I should probably save for another thread, but I’m a fan of the kind of declarative SQL/Spark DataSet/DataFrame style for specifying what you want and then having the query engine figure out how to spread it out across machines. They’re basically just parallel collections of rows with operations like map and groupBy.

I think this will be much much easier to do at the Stan AST level than at the AD graph level, especially in our AD system. What is a network library object?

I just meant that we could have a subprogram / external library doing the analysis part (connected graphs etc)

That would make sense if we did static expression graph expansion like TensorFlow (before their eager mode) or symbolic differentiation like Theano. As is, it’s not worth optimizing the expression graph compared to just evaluating it.

To be able to expand Stan statically that way through autodiff, we’d need to forbid any conditioning on parameters and also remove any underlying functions that do this internally.

On the other hand, expanding the Stan program (the thing represented by the AST) is something that’s definitely worth doing.

I think that should be the priority. Then we can think about how to automate it. I want to keep thinking of the Stan language itself more as a programming language.

I think that matches our use-case well.

We probably don’t want to start writing our own graph algorithms, but most of them aren’t that difficult to first order.

For parallelizing it would just make sense to treat everything downstream of a parameter-based conditional as serial. Otherwise we’d be giving up the ability to auto-diff numerical algorithms which is too much of a cost to pay.

I think it’s fair that we move the discussion of parallelization to a new thread, as it’s not directly related to the current language proposal.

Same for prior/likelihood separation which is not part of the current proposal and worthy of its own thread.

People have argued that the blocks are obstructive because they don’t allow statements like real x ~ normal(0, 1), but let me step back and ask what is that statement even supposed to mean? To the left of the ~ we introduce a variable x along with its type, cool, but what is the right hand side specifying?

It doesn’t specify the final distribution for x because we eventually condition on the observed variables. Okay, so let’s ignore the conditioning that happens later and focus just on the joint distribution we have before we separate out variables that are uncertain and variables that have specific values. In this case the statement might read “x is marginally distributed as a unit normal”, i.e. if we integrate out all of the other variables then x has the specified distribution. Except we can’t always say that – if we instead had real x ~ normal(mu, 1) then we can’t marginalize out mu and maintain the same interpretation. So we’d have to say something like “conditioned on the values of any variables on the right hand side, x marginally has the given distribution”.

But we can’t even say that because there’s no restriction on having terms later on like x ~ normal(5, 2) which modify the distribution for x. So at the very best real x ~ normal(mu, 1) would read as “conditioned on a value for mu, and assuming x doesn’t show up anywhere else in the program on the left-hand side of a ~ or a |, integrating out all of the other variables but x and mu will leave x with the marginal distribution normal(mu, 1)”. It’s a bit of a mouthful.

The problem here is that the intuition for something like real x ~ normal(mu, 1) comes from thinking about graphical models where the statement about the marginal behavior conditioned on the child variables cannot be modified later on. In the context of purely graphical models, which everyone has been using for their examples of why real x ~ normal(mu, 1) is so great, the notation is solid but outside of that context the intuition falls apart and can be more dangerous than useful.

I personally don’t see how the notation can be reconciled with a language more expressive than graphical models. If people want to drop back to graphical models then let’s have that discussion, but without that restriction real x ~ normal(mu, 1) is nowhere near as clean as has been presented.

Ultimately I think the subtle tension here is due to the fact that we’re specifying a function, and parameters and data are not intermediate variables and hence not amenable to the programming practices for intermediate variables such as “define near where you declare”. Serious question – is there any language that defines functions implicitly through annotated variables? Especially as languages go more functional and the theoretical perspectives more categorical, the signature of declared functions seems to be becoming more prominent in language designs (correct me if I’m wrong) as reasoning about the inputs and outputs before worrying about the implementation is an extremely helpful abstraction.

A recurring complaint about the current Stan language is that it’s ungainly for large programs. To be fair I disagree here (and I often write programs with hundreds of lines), but in any case if one agrees that the current language is ungainly then compositionality would be a natural next step from the functional/categorical perspective. The problem is that to maintain the same expressiveness as the current language those submodels wouldn’t be self-contained – multiple submodels could modify the same variables – so the compositional structure just wouldn’t be there. We’d have hidden state, side effects, leaks, whatever you want to call it, unless we restricted the language to graphical models. How do we maintain all of the desired features of compositionality without changing the scope of the modeling language? Am I missing something here?

Then there’s the added subtlety that the compositionality is natural for specifying a joint distribution through a graphical model, but that same compositional structure no longer holds once we start conditioning on variables. I think this is the reason for some of the questions/concerns about the global scope of the data variables – I’m not sure compositionality of the joint is compatible with declaring what variables we condition the joint on afterwards.

Again, all of these issues have clean resolutions within the scope of graphical models (reducing to the BUGS model, where you don’t specify conditioning variables in the program but rather pass in values when you run that program), but I don’t think there is anything but limited heuristics outside of that scope, and that worries me in terms of language design.

I can’t figure out any way of having compositionality without having two stages of parsing. The first stage compiles component models together into a joint model amenable to global evaluation, along with the list of all input variables. After this the global model is available for viewing (i.e. we generate a model block) and those input variables are separated out into bound (data) and unbound (parameters) variables. Then we run the second stage of the parser that transpiles into C++. This might also offer the opportunity to have a user-focused UX (where they just build the second program directly) versus a developer-focused UX (where the compositional model is specified).

Even then we’d have the odious task of figuring out a good UX for specifying the bound variables (as previously discussed, relying on what’s defined in the input data file is dangerous) and dealing with conflicts between the specification and what variables were defined in the previous model. Requiring that the bound variables be declared in the program itself is extraordinarily powerful.

Wouldn’t this be just as easily resolved by changing data to something like external data and transformed data to something like internal data?

I’m re-sorting this in order of importance (to me, of course).

Therein lies the main tension we’re wrestling with, in my opinion.

It’s powerful, but its power comes from imposing limits. And when you write bigger programs, you run up against those limits.

Computer programs wind up being made up of gazillions of little blocks. What we’re doing now is like forcing all the I/O to be fed through a single big main() function and imposing a no-I/O policy on the subunits.

I’m thinking about this all very operationally. There’s a data context (mapping of variables to values [sorry, I’m pre-repeating myself as you’ll see later]) and a parameter context (ditto), and the data declarations read from the data context and the parameter declarations read from the parameter context. That’s how it’s implemented under the hood with readers in the constructor for data and in the log density function for the parameters.

That’s why I’m particularly concerned about separating the declarations of various data variables and parameter variables so that it’s not so clear any more what the function is. The way things are now, the signature of the data → parameters → log density function is laid out very clearly (modulo the implicit unconstraining transform and log Jacobian).

But I don’t see how that follows.

I’ve been looking at other probabilistic programming languages recently and I find they present the languages operationally, as if pyro.sample or pyro.observe are simple Python functions. But then they perform variational inference, which clearly isn’t doing anything like naive sampling. You can find this thinking encapsulated in a section title in the Pyro tutorials, “Inference in Pyro: From Stochastic Functions to Marginal Distributions.”

I have the same objections you (@betanalpha) do to this confusion. In Stan, we just say the program’s directly calculating a log density up to a constant. Looked at that way, it looks less like a probabilistic programming language and more like a function. And as I keep emphasizing, that’s still how I think of a Stan program—as a function from data to parameters to a log density.

The original motivation for using ~ came from BUGS, which used it to declare a graphical model. I’m not sure where the original motivation came from in stats, where people use it to lay out components of generative models. I borrowed it from BUGS with a different interpretation as an increment to the log density.

For the new compound declare-distribute syntax, the motivation for me comes not from BUGS, but from the compound declare-define statements in languages like C++. It’s just a convenient way of combining the declaration and use of a variable. Philosophically, I’m not trying to distinguish between

real x;
...
x ~ normal(5, 2);

and

real x ~ normal(5, 2);

other than that in the latter I don’t have to scan so far to find the use of the variable.

That’s what the spec will lay out. How it consistently computes a log density function.

Closures behave something like this. They grab their current context and produce a function. They’re essentially mappings from contexts (variable to value maps) to arguments to values.

In logical languages, you usually think about expressions or statements or other syntactic units having free variables, then things like quantifiers like for-all and there-exists bind them off. That intermediate structure with free variables can be thought of denotationally as a function from contexts (mappings from variables to values) into values. That’s essentially what a closure does, too.

I’m not sure what compositionality you’re talking about here. For the Stan 3 design, I think the issue is that the compositionality is one of variable contexts and target value, not of the joint density in any statistical sense. In order to compose program fragments, those fragments must be interpreted as relations between the contexts before and after the statement executes. This can all be compositional with declaring new variables—it’s the usual thing in programming languages to do that. Stan’s unusual in blocking off all the declarations at the top—that’s just because I was lazy in writing the parser, not out of any philosophical commitment to declarations-first.

But you’re missing an order of operations – we define a joint density \pi(y, \theta), which defines a conditional density, \pi(\theta \mid y), only after the conditioning variates have been bound to values, which itself happens only after the joint density has been defined. We cannot in any self-consistent way define “submodels” that define “data” because the conditioning happens only after the submodels have been aggregated into a joint.

Or, put another way, the data context and target in each submodel are meaningless on their own and don’t even have to interact – they just get aggregated independently to define the global collection of bound variables and the total target. The “submodel” concept that everyone is leaning on only makes sense in the context of graphical models and otherwise doesn’t offer any particularly helpful abstraction.

Again, I don’t even think it’s that straightforward. The data and parameter blocks define the inputs to a joint log density function and then it’s the separation of the two blocks that defines what variables get bound and hence how the joint density turns into a conditional density.

In other words, I think you have to be very careful with what you mean by “log density function”. Are you referring to the joint or the conditional?

But what happens when you have

real x ~ normal(5, 2);
x ~ normal(1, 2);

The order of the ~ here is completely arbitrary but the language highlights the first as being different than the second. ~ doesn’t act like an initialization or an assignment operator – it’s a wacky implicit += and hence doesn’t follow the same patterns as initialization or assignment would.

Indeed neither ~ is doing anything to x here so why treat it like initialization? Introducing a declare and define only makes the purpose of ~ more opaque than it already is.
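
For concreteness, in current Stan both of the ~ statements above reduce, up to an additive constant, to plain increments of the target; a fragment mirroring the two lines above:

target += normal_lpdf(x | 5, 2);   // what x ~ normal(5, 2) contributes
target += normal_lpdf(x | 1, 2);   // what x ~ normal(1, 2) contributes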

This is exactly my point! In particular, using “fragments” instead of “submodels” would, in my opinion, make this discussion much clearer. The fragments do not necessarily have any statistical interpretation in isolation, so how exactly do they facilitate building statistical models? Even wrapping the variables-to-be-bound with target modifications in a “model fragment” is tricky because they don’t interact at all until the full joint has been specified by aggregating the target modifications together and the variables-to-be-bound together independently.

No matter what the language looks like, the output will be a log density function for the joint, the bound variables, and the unbound variables. In this generality there is no natural decomposition of the body of the log density nor of the bound and unbound variables. Can we decompose the body into arbitrary fragments? Sure. Can we decompose the bound and unbound variables into fragments? Sure. Will those decompositions have any useful interpretation that facilitates model building? I don’t see how.

Again, this is all much simpler for graphical models, but once we talk about a scope as large as the current language all of the nice interpretations become heuristics with as many counterexamples as examples, and I am terrified of how much pedagogical damage that will do to current and future users.

+1 for using “fragment” instead of “submodel”, already many of the examples provided here are not submodels in any sense (e.g. lr_predict).

I understand why this may seem cumbersome, but it also makes the program easier to debug and more amenable to static analysis. Also note that functional programming languages do force all I/O through a single entry point and they seem to be doing well even for large programs. However, for this to be easy, you might need a good record/structure data type, which would complicate the language further and introduce additional development costs.

Also I personally heavily dislike the way closures are implemented in R, it has created many hard-to-debug bugs when what I thought was a bound parameter was actually taken from the parent environment/context thanks to a typo. So I don’t think this is a good example to learn from :-)

Haskell, which I think is the only language that operates the way you’re proposing, has extremely nuanced systems designed to handle the complexity that idea introduces, and I would love to see examples of large programs that deal with IO and never use unsafePerformIO. Then let’s try to get an R programmer to use monad transformers for the sake of purity.

I have extensive experience only with Elm, which is delightful – and it has no monads (or any higher-kinded types). Judging by the forums, most users are former JavaScript devs who are not very advanced, yet they seem to be able to learn the language just fine. Elm works by expressing the whole web app as two large functions. So I believe it is an existence proof of a system that has strong typing and clean scoping, where everything is passed as a parameter multiple levels down the hierarchy, and yet it is not annoying for the developer. The reason it works nicely is that Elm has very cleverly designed record types and a great type inference engine.

One more thing I find inspiring about Elm is that they decided to be beginner-friendly not by making the language/standard library do magic that “just works”; instead, the language is very strict but provides extremely high-quality error messages, e.g. (paraphrasing from memory):

Line XY:
callToFunc 5.1 
           ^
Type mismatch - expecting Int, found Float. 
Automatic conversions between numeric types are not performed, use Math.round for 
explicit conversion. More info can be found at <link to relevant section of the manual>

Aside: if you wonder how to develop webapps as pure functions without monads, the idea is that you have one function for evolution over time, State \times PossibleInputs \rightarrow State \times Action, where the elements of Action represent requests for side effects to be performed (web request, RNG, play sound, …), and then you have another function State \rightarrow HTML that determines what the user sees, plus some clever magic to make this work efficiently. But indeed the whole state of the app is passed as a parameter and it works even for quite large webapps.

Disclaimer: I probably have unhealthily positive relationship with Elm and I think it is the best language I’ve ever worked with, so I may be biased :-)

Great! I’m a big fan of Elm and Evan (its creator). I haven’t used it in a few years and I haven’t seen it used on any larger projects, but I probably have a different idea in my head of “large.” FRP is a really great paradigm - if you were to try to apply it to Stan, you would see that we currently have an interface about as similar as I think makes sense (though I’m happy to hear if you have other ideas here and in general). We have a stateless model block that takes in changing world state in the form of parameters and outputs a density, which HMC uses to compute the new world state.

Here’s a small example of how Elm does what you’re talking about - it creates a world record and passes it to all functions. All functions then return a modified version of the world.

To use modules in the way Elm does, you have to have many more complex ideas floating around (which could be good for advanced users, but likely we don’t want to incur that overhead for all of our users). And please take a look at the incantation required to actually pipe the “Signal” and “Element” through everything:

main : Signal Element
main =
  Signal.map2 view Window.dimensions (Signal.foldp update mario input)

input : Signal (Float, Keys)
input =
  let
    delta = Signal.map (\t -> t/20) (fps 30)
  in
    Signal.sampleOn delta (Signal.map2 (,) delta Keyboard.arrows)

I disagree with all of your points about the types of people and skillsets of the users of Elm and R - I think the above is beyond at least some of our existing users, whom we would like to keep. But if we can find inspiration in Elm, I’m all for it!

A reasonable way to think about submodels (name TBD) is that they define a record or struct of which you may create an instance. The magic is that when you create an instance of a struct that contains a parameter, HMC needs to know about that at the top level, because we cannot sample submodels independently (for other samplers one could do this). We want to be able to import other models or fragments of models and use them with minimal modifications to our original code - to change back and forth between centered and non-centered parameterizations seamlessly, for example. I think this requires some form of magic. Luckily, you will always know where a parameter comes from because of the naming scheme, and you will by default see them in the output. Magic is worst when the effects are subtle or hidden, and bad when the behavior needs to be modified in some way. I believe we avoid both of those pitfalls here.
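
To make the record analogy concrete, a purely hypothetical sketch using the no-argument gaussian_process() submodel from earlier in the thread (the dot-notation naming scheme follows the proposal sketches above and is not settled syntax):

model gp = gaussian_process();

// the instance behaves like a record of its declarations: its parameters are
// addressable by qualified names such as gp.length and gp.sigma, and by
// default they appear under those names in the output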

I think it would be really helpful to reframe some of these discussions in terms of proposed alternatives instead of commentary whenever possible, as having something concrete to look at helps with communication. Would you mind outlining what you’d propose with some example syntax?