Stan++/Stan3 Preliminary Design

I think I probably should stop getting into this discussion, as it seems I am unable to convey my points clearly (or I am overconcerned about something that is not a big deal and everyone understands it). Also others should have a say without being swamped by my activity. So just one last try :-) :

Do I understand correctly, that, with the same fragment/submodel definition you would allow both of these two usages?

A:

data int shared_N;
model splines = spline_regression(shared_N);
model linear = linear_regression(shared_N);

B:

model splines = spline_regression();
model linear = linear_regression();

Where in case A, the main model has shared_N as data, but does not have splines.N and linear.N as data. In case B, though, the main model has both splines.N and linear.N as data? This feels too hacky/magical/unpredictable for my personal taste, but I guess you might be OK with it. It certainly saves writing time, but I am afraid it could noticeably increase debugging time and make it harder to understand what data should be supplied to a model.

The other interpretation I can make of your proposal is that only one of the variants A and B is allowed, depending on the number of parameters declared for the submodel/fragment constructor. In other words, the submodel/fragment developer decides whether they will force you to pass some data explicitly or whether the data will be implicitly read from the main model data reader. This feels not very useful as it would unnecessarily limit reusability.

On philosophical grounds, you could argue that globals are bad, including global reader streams. All languages I know move away from recommending globals. But I understand that this is a very particular use case other languages don’t really have.

Don’t give up. It’s hard. And we need to figure out how to explain this to concerned users. A lot of the points you’re bringing up are valid.

No. Here’s what they’d look like in full:

A.

model spline_regression(int shared_N) {
  ... no definition of shared_N in body
  ... shared_N immutable
}
model linear_regression(int shared_N) {
  ... ditto ...
}

data int shared_N;
model splines = spline_regression(shared_N);
model linear = linear_regression(shared_N);

After running this, there is a single integer data variable shared_N in the top-level scope, and it has the same value as splines.shared_N and linear.shared_N.

The top-level data variable could be modified later, but that would get confusing with this naming convention, because submodels keep the value they have at construction. Inside the submodel, the variable is immutable—it can’t be changed in the body of the model function.
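A minimal C++ sketch of that point, assuming the construction-time copy semantics described above (the struct name is illustrative, not generated code):

```cpp
#include <cassert>

// Illustrative sketch: the submodel copies shared_N at construction, so a
// later change to the top-level variable does not propagate, and the
// member itself cannot be reassigned in the body.
struct spline_regression {
  const int shared_N;  // fixed at construction, immutable afterwards
  explicit spline_regression(int n) : shared_N(n) {}
};
```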

B.

model spline_regression() {
  data int shared_N;
  ...
}
model linear_regression() {
  data int shared_N;
  ...
}

model splines = spline_regression();
model linear = linear_regression();

After running this, there are two integer data variables, splines.shared_N and linear.shared_N, and they do not need to have the same value.

Languages all have to come to grips with this over logging, I/O, socket communication, etc. Wrap it in a monad and call it a singleton if that helps the medicine go down.

But we’re really talking about two different things. One is like I/O, and reads data and parameters from a “global” stream. The second is variable scoping. We’re proposing very traditional variable scoping and if you can think of the read and write as like reading from stdin (std::cin in C++), then we’re not proposing anything that extraordinary in the way of access or scope.

To be concrete, let’s look at a C++ program by way of analogy:

struct foo {
  int a;
  foo() { }
};

int main() {
  int b;
  foo f1 = foo();
  foo f2 = foo();
  ...
}

After the constructor for f2 is executed (marked ...), three separate variables are in scope, b, f1.a, and f2.a.

I would like to propose that the scoping here is exactly the same as in the Stan 3 case (B) above. The following would be equal to (A) as far as scope goes:

struct foo {
  const int a;
  foo(int b) : a(b) { }
};

int main() {
  int c = 3;
  foo f1 = foo(c);
  foo f2 = foo(c);
  ...
}

After the constructor for f2 executes (marked ...), there are three variables in scope, c, f1.a, and f2.a, all with value 3.

Now let’s make that more concrete in terms of the I/O analogy, first for case (B),

struct foo {
  int a;
  foo() { std::cin >> a; }
};

int main() {
  int b;
  std::cin >> b;
  foo f1 = foo();
  foo f2 = foo();
  ...
}

Now you see that b, f1.a, and f2.a each get separately read values from the standard input stream, and each instance of foo holds its own value.

And now for case (A) with sharing:

struct foo {
  int a;
  foo(int b) : a(b) { }
};

int main() {
  int c;
  std::cin >> c;
  foo f1 = foo(c);
  foo f2 = foo(c);
  ...
}

Now it’s clear there is only a single read, for the c in the main() function, and it is passed to the instances of foo.


Thanks for clarifying. This, however, means that as a submodel/fragment developer I have to choose which data the main model will have to set explicitly and which will automatically be taken from the global reader. In particular, A and B require different submodel/fragment code, even though what I want is just to change the mapping from main model inputs to submodel/fragment inputs.

I would guess that users will be frustrated when they have to rewrite a submodel just to avoid data being taken from the global reader, and quickly a “best practice” will emerge (at least for the submodel/fragment libraries) that submodels/fragments should expose all possible variables in their constructors to increase reusability. If you are OK with that, then I have nothing to add :-)

Also I can imagine submodels/fragments having a lot of constructor parameters so named parameters to submodel/fragment constructors should IMHO be supported or even enforced.


+1 named parameters being supported

I’m not sure I have totally given up on optionally passing in things declared as data within a submodel. This makes importing entire Stan models much easier and addresses the usability concerns around rewriting that @martinmodrak brought up. I think we could establish clear rules here for submodel developers without getting too far into something like ridiculous C++ template resolution order territory.

@Bob_Carpenter asked what I meant by importing a Stan model. Imagine that you have just written a classifier on two covariates and want to use that classification as input to a new model (for some reason, maybe speed). You can then say #import "classifier" (or whatever syntax) and import a whole Stan model that does not have a submodel classifier() {} signature around it. If the rules for main models and submodels are the same, then we know what the signature is allowed to be to construct classifier-as-submodel. If we allow optional named params for non-private data and params, we can choose which things to share and which to allow to be passed in as classifier.covariate1 etc.

Supported throughout or just for submodels? With default values?

Would the idea be that every data variable in the submodel may optionally be provided as an argument, which would override reading from data reader? I think we need to start writing examples of what this would look like.

Another alternative would be to keep a simple notion of #include as a preprocessor directive, then require every model to have a top-level model declaration. Then we wouldn’t need anything special for including a whole model. I generally prefer to keep the number of primitives down, but this is all fairly new territory for me to think about in terms of inclusion.

I found this paper on modules by Bob Harper useful for thinking about modularity. I didn’t go through any of the math—just the overview and motivations.


By default I would probably just steal Python’s syntax for this, so yeah supported and you can specify a default value in the function signature if you want?

Yeah, that’s what I’m thinking of at the moment. Thoughts about this? Not sure the best way to write this out since it’s a combination of something in a model file and a change of parameters passed in to the stan fit call.

I have a strong intuitive bias against the preprocessor style includes - would prefer at least namespacing the imports a la most modern library imports.

+1

This is an interesting thought, but I don’t think it actually helps you import something that was standalone into a new model if you wanted to share any data or parameters, right? Like if you force submodel main() {...} then you still have no mechanism to pass in data or params. If you force submodel main(int N, data vector[N] x) {...} or something then 1) The scoping on N is interesting though possible and 2) how do you run the model standalone? Do you assume that stuff in the signature should be passed in as data as well? Can the stuff in the signature be both params and data? When fitting, can you pass in data for variables in the signature as well as in the body?

I’ll take a look at that paper.

I haven’t seen any of those, so you’ll have to show me what it looks like. You’re right that just raw includes are clunky.

I’m still not sure, which is why I’d like to work through some examples.

I’m OK with that as long as all the arguments are used—I don’t want to get into interfaces like R’s dgamma that lets you specify two out of three parameters to specify the distribution. I’m OK with things being optional, but I don’t want argument interactions for optionality if they can be avoided.

[data arguments overriding reading from data] example:

submodel foo() {
  int N;
  real mu;
  real<lower=0> sigma;
  vector[N] y ~ normal(mu, sigma);
}

int N;
vector[N] y;
f = foo(N=N, y = y);
f.mu ~ normal(0, 1);
f.sigma ~ normal(0, 1);

f2 = foo(); // must pass in f2.N and f2.y to fit call; could also pass in f2.mu and f2.sigma
...

Also thinking about submodel lpdfs and what that looks like…

I seem to be the only person who feels this way, but I don’t think that non-centered parameterizations are a pain to implement. In fact I think that trying to abstract them away as being equivalent to a centered-parameterization is extremely dangerous as they define different densities. If we want a modeling language that specifies a probability distribution up to equivalent isomorphism then we’re talking about a very different language.

But more importantly I don’t see why we are still using “submodel” when everyone seems to admit that there’s nothing “model”-based about them. There is no guaranteed statistical interpretation of these fragments! I know I’m always going on about the importance of nomenclature but I think that this is a particularly important one!

If we don’t have a graphical model then there is no guarantee that the fragments will actually be well-defined incrementally (later fragments can modify the distributions of previous parameters and data), so I disagree that this is in any way “natural” within the current scope of the language.

I get why people want to decompose models, although I disagree on many of those reasons, but I don’t get how these decomposed contexts are going to be anything but confusing to people who are not experts in the current language. Immediately there are the issues with the scope of the data, as has been pointed out by @martinmodrak multiple times, but I think that this also continues to the parameters as well. As I understand it, in order to maintain the same breadth of the language the new proposal would have to allow multiple fragments to use the same parameters – how does that not immediately break the encapsulation of the fragments for all intents and purposes?

I am inclined towards @mgorinova’s suggestion that avoids declaring data in the fragments at all, instead propagating them as arguments. That goes a long way to reducing confusion about scope while also requiring that the user specify all data expected to be input through the interface in the top-level/global/final Stan program. That leaves the issue with the overlapping parameters in fragments but at least moves, in my opinion, in the right direction.

Perhaps part of the confusion stems from the fact that in the existing Stan language, model is the most poorly named block. I guess we originally tried to mimic BUGS here, but model does not make any more sense in that context. Users tend to either think that “their model” is characterized by the likelihood function only or that the whole file is “the model” (which Andrew refers to as a Stan “program” but that has not really caught on). So, calling something a submodel at best perpetuates the existing confusion and maybe adds to it if the model is not necessarily the composition of all submodels and / or a submodel does not itself possess the defining characteristics of a model.

The model block in the existing Stan language is really a function of the parameters and transformed parameters with additional arguments being the data and transformed data. Except (confusingly to people who do not look at the generated C++), this function does not have a return statement. So, its signature is something like

void model(const tuple& parameters, const tuple& transformed_parameters, 
           const tuple& data, const tuple& transformed_data, 
           real& target /* in and out */);

Whatever the merits and demerits of the naming, a submodel seems like a step forward in that it looks or could look more like a function looks in R, Python, etc., especially if either the arguments have to be passed in or the usual scoping rules apply or we adopt the C++11 semantics for lambda functions. Actually, these submodels are a bit like named anonymous functions, which seems like we have made a wrong turn somewhere. Anyhoo, a construct like

submodel foo() {
  // declarations bubble up
  int N;
  real mu;
  real<lower=0> sigma;
  vector[N] y ~ normal(mu, sigma);
  // no return statement
}

contradicts the behavior of a function in R or Python. I think I would prefer

local foo(data int N, real mu, real<lower=0> sigma, data vector[N] y) {
  return normal_lpdf(y | mu, sigma);
}

or if you want to get more C++11-ish with a capture list

local foo[data int& N, data vector[N]& y](real mu, real sigma) {
  return normal_lpdf(y | mu, sigma);
}

I suppose the union of arguments to all local functions can get mapped to one of the blocks in the existing Stan language if we use the pre-C++11 style. Or maybe the union of things declared in capture lists are considered data and the union of arguments are considered non-data if we use the post-C++11 style? Anyway, the user calls

target += foo(N, mu, sigma, y);
target += foo(); // equivalent to above

in pre-C++11 style or if we do C++11-style then

target += foo(mu, sigma);
target += foo();
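A hedged C++11 rendering of the capture-list idea; normal_lpdf here is a small hand-written helper for illustration, not the Stan math library function:

```cpp
#include <cmath>
#include <vector>

// Hand-written helper, not the Stan math API: log density of
// y_i ~ normal(mu, sigma), summed over the vector.
inline double normal_lpdf(const std::vector<double>& y, double mu, double sigma) {
  double lp = 0.0;
  for (double yi : y)
    lp += -0.5 * std::log(2.0 * M_PI) - std::log(sigma)
          - 0.5 * ((yi - mu) / sigma) * ((yi - mu) / sigma);
  return lp;
}

// The capture-list proposal as a C++11 lambda: data y is captured by
// reference (fixed), while the unknowns mu and sigma are ordinary arguments.
inline double demo_target(const std::vector<double>& y, double mu, double sigma) {
  auto foo = [&y](double mu_, double sigma_) {
    return normal_lpdf(y, mu_, sigma_);
  };
  double target = 0.0;
  target += foo(mu, sigma);  // mirrors "target += foo(mu, sigma);" above
  return target;
}
```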

Another thing I am worried about is that we are repeating the mistakes we made with the “~ vs target +=” debate of 2015. The compromise we ended up with of adding the target += syntax but not adding the option to ignore the constants and not doing much to push the people who were using the ~ syntax to change their ways has probably resulted in more confusion than if we had either not added the target += syntax or added it and deprecated the ~ syntax. It seems like we are headed down the same path if we add a new syntax but the old one remains much more prevalent.

That’s a good point. They can be models, but they don’t have to be. So they shouldn’t be called model. This is the same objection Andrew had to the name model for the block in the Stan program—it was only part of the model. @bgoodri brings up the same points below.

I originally wanted module. How do you feel about that? I don’t think our users will have noise from the (OCa)ML notions.

I meant natural from a computational point of view, not a statistical one. If you think of the computational operations as (a) reading data from the data reader, (b) reading parameters from the parameter reader, (c) adding terms to the log density, then these are all naturally composable.

Rather than thinking of adding a second term as modifying a previous log density, I think of it as adding a term to the final log density. This can be done in any way, some of which will lead to more natural and reusable modules than others.

I wouldn’t say that adding a second term to the log density function modifies previous distributions. They can just be written down in any order. For example, if I have

target += -0.5 * ((y - mu) / sigma)^2;
target += -log(sigma);

I don’t think of the second increment modifying the distributions on y, mu, and sigma—it just adds another term into the log density of y given mu and sigma. Same as if I for some reason decide to write

y ~ normal(0, 1);
y ~ normal(0, 1);

The first doesn’t define the distribution of y, nor does the second. Only the whole program together does.
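To make that concrete with a small numerical sketch (plain C++ with hand-rolled kernels, names illustrative): two standard-normal increments on y add up to, within a constant, a single normal(0, 1/sqrt(2)) kernel, so neither statement alone pins down the distribution.

```cpp
#include <cmath>

// Two identical standard-normal log-kernel increments on y.
inline double two_increments(double y) {
  double target = 0.0;
  target += -0.5 * y * y;  // y ~ normal(0, 1);
  target += -0.5 * y * y;  // y ~ normal(0, 1);  (again)
  return target;
}

// A single normal(0, 1/sqrt(2)) log kernel; equal to the above
// up to an additive constant (here exactly equal, since both drop
// their normalizing constants).
inline double single_kernel(double y) {
  double s = 1.0 / std::sqrt(2.0);
  return -0.5 * (y / s) * (y / s);
}
```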

I’ve been trying to explain. Yes, multiple fragments would have to use the same parameters, but they wouldn’t introduce parameters. The mechanism is taking them as arguments.

That would be a possibility. It’s not like the data arguments are fully encapsulated because they’ll have to be named in a sensible way from the outside. This is less an issue for parameters, since they don’t get named by users from the outside other than in inits.

That’s the basic idea, but to be a bit more precise, here is how it works:

  • data and transformed data variables are member variables in the model class
  • the constructor takes the data variables as arguments (these are the arguments)
  • the constructor defines the transformed data variables (so the transformed data isn’t an argument)
  • the log density method takes the parameters as arguments, defines the transformed parameters and returns the log density
  • the write array method takes the parameters as arguments, defines the transformed parameters, and computes the generated quantities

I did it this way to follow p(\theta \mid y), with data y specified and fixed to produce a function f(\theta) = log p(\theta \mid y) because that’s what we need for sampling.
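A minimal, hand-written C++ sketch of that class layout (the names are illustrative; the actual generated code is templated and more involved):

```cpp
#include <cmath>
#include <vector>

// Sketch of the generated-class pattern: data and transformed data are
// members fixed by the constructor; log_prob is then a function of the
// parameters alone, i.e. f(theta) = log p(theta | y) up to a constant.
class model_sketch {
  std::vector<double> y_;  // data, fixed at construction
  double sum_y_;           // transformed data, defined by the constructor
 public:
  explicit model_sketch(std::vector<double> y)
      : y_(std::move(y)), sum_y_(0.0) {
    for (double yi : y_) sum_y_ += yi;
  }
  double log_prob(double mu) const {
    double lp = 0.0;
    for (double yi : y_)
      lp += -0.5 * (yi - mu) * (yi - mu);  // y ~ normal(mu, 1), kernel only
    return lp;
  }
};
```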

That is the plan. I don’t think it’ll be too complicated, either. But it won’t quite solve all of our problems with diff eq solvers because we need to know which system parameters to differentiate.

Any suggestion as to how to get back on track?

This is my biggest overall concern—that it’s going to confuse users. It’s hard to see how it won’t, as it’s adding a lot of complex machinery.

We needed the target += (or increment_log_prob() as it used to be) for flexibility. The only reason we don’t have the ability to control proportionality on target += is that it hasn’t been a high enough priority for anyone. If we want to escalate it, I don’t think it’d be that hard.
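The proportionality point can be sketched in plain C++ (hand-written helpers, not the Stan math API): the full lpdf and the kernel-only version differ by a term that does not depend on the parameters, so gradients, and hence HMC, are unchanged.

```cpp
#include <cmath>

// Full log density of y ~ normal(mu, sigma), constants included.
inline double normal_lpdf_full(double y, double mu, double sigma) {
  return -0.5 * std::log(2.0 * M_PI) - std::log(sigma)
         - 0.5 * ((y - mu) / sigma) * ((y - mu) / sigma);
}

// Kernel only: drops -0.5 * log(2 * pi), but keeps -log(sigma), which
// must stay whenever sigma is a parameter.
inline double normal_kernel(double y, double mu, double sigma) {
  return -std::log(sigma)
         - 0.5 * ((y - mu) / sigma) * ((y - mu) / sigma);
}
```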

I think I am suggesting:

  • Treat what we have been calling submodels as local functions with return statements that return the contribution to the log kernel
  • Things mentioned in the capture list are fixed, while arguments are for unknowns
  • Have the parser infer from the capture lists and argument signatures what needs to go where in the generated C++, rather than inferring from the local function bodies
  • Things declared inside a local function have local scope

How would you unify this syntactically and operationally with the supermodel? Should it also return a contribution to a log kernel?

Yes. I mean the local function would do something like

  ...
  return normal_lpdf(y | mu, sigma);
}

and then on the outside you target += that.

I don’t like anything that even hints at something self-contained. To me module is still too close to that. I still like fragment, which has the explicit connotation of being incomplete.

Exactly! And decomposing models into potentially uninterpretable components is going to be a communication disaster. Users will try to reason about the components without the context of a larger model; users will try to communicate their models via these fragments leaving anyone who wants to figure out the full scope to piece it together themselves; etc, etc.

This will be a nightmare, and one I do not think is worth the, in my opinion small, developer benefit which most people seem to be presuming uncritically.

And what’s the UX for handling fragments with the same parameters? How are the fragments assembled and fed into the compiler? Which fragment takes precedence?

And in functions that want to call diagnostics or visualizations or summaries by name.

Anyways, I don’t see any way of resolving all of these issues and having a language that continues to facilitate effective communication of statistical models (it is a probabilistic programming language, after all) without introducing another layer of abstraction. I mentioned this before, but I wouldn’t have a problem if there were a second parser that takes fragments and compiles them into a valid Stan program which could then be fed into the main compiler. This initial stage would be used solely by developers or power users while most everyone else would continue to use a single monolithic Stan program (or not quite monolithic with proper header support and extensive use of regular user-defined functions).

I don’t think you’re noticing when your opinions leave the realm of your expertise (math, stats, modeling, etc) and enter ours (languages and their design, software engineering). 😜😜😜 Maybe expertise is more relevant to the future projections about ROI than the naming decisions.

If you review the very long thread (we are actually working on a review/summary, out very soon!) you will see that submodels are namespaced so they won’t have name collisions. Here is an example:

submodel foo() {
  real alpha;
}
f1 = foo();
f2 = foo();

If nothing is passed in as data, there will be two parameters visible to HMC and to the interfaces named f1.alpha and f2.alpha respectively. To use the same parameter,

submodel foo(real alpha) {
  ...
}
real alpha;
f1 = foo(alpha);
f2 = foo(alpha);

We claim this will always be unambiguous.

This discussion is by its very nature interdisciplinary because of the combination of applied statistics, computational statistics, and programming language efforts involved.

The current construction is perhaps best framed as being object-like, as in a struct in C or a class in C++ or Java. We could just call it class or struct or object. The only sense in which it’s self-contained is that it can manage member variables. The construct in some sense works by setting up those member objects (which can involve reading parameters and/or data) and then executing statements for their side effects on the target—nothing else scopes out.

From a programming perspective, the only way to scale effort is through modularity. It might be argued that these pieces aren’t modular enough, but I think it’s a step in the right direction toward eliminating a bunch of error prone and repetitive code.