Stan++/Stan3 Preliminary Design


#61

Good point, let’s be more specific. The discussion I think I am having (hope others agree :-) ) is on the points from my previous post here: Stan++/Stan3 Preliminary Design

I’ll try to rephrase it and introduce some terminology to let us stay focused.

Basically the design as proposed (if I understand it correctly) features what I’d call implicit binding: unbound data variables from a fragment/submodel somehow magically bubble up to the top level and are treated as data to the main model, while everything inside the fragment is visible to the outside world e.g.:

fragment gaussian_process {
  data int N;
  data vector[N] x;
  ...
  vector[N] y_raw ~ normal(0,1);
  vector[N] y = <<cholesky and GP stuff>>
}

fragment gp = gaussian_process();
//now both gp.y and gp.y_raw can referenced

There were two variants floated around, one in which the main model automatically gains data gp.x and gp.N, and another where the main model gets data x and N. In both cases this is automagic.

I am trying to argue for explicit binding, where fragments/submodels have explicit interface, both in (as arguments to instantiation) and out (clearly stated which internal parameters/transformed parameters/gen quants are visible to the main model), e.g.:

data int whatever_I_call_it;
data vector[whatever_I_call_it] my_x;

fragment gaussian_process(int N, vector x) {
  //Introducing new data variables here is syntax error, only params/gen quants allowed
  vector[N] y_raw ~ normal(0,1);
  public vector[N] y = <<cholesky and GP stuff>>
}

fragment gp = gaussian_process(whatever_I_call_it, my_x);
//only gp.y can be referenced

Obviously the question about explicit/implicit input is mostly orthogonal to explicit/implicit inner member access.

The argument for implicit binding seems to be - at leat in part - that it is less verbose and less demanding of beginners. I took Elm as a counterexample where everything is explicit, but still it is terse and relatively easy to learn. I also like focusing more on debugging experience than on code writing experience as I believe debugging is more often the bottleneck.

I don’t want to steal a lot from Elm (Stan definitely should not switch to functional), but as for the size - I’ve certainly written programs one or two orders of magnitude larger in Elm than in Stan and I’ve seldom felt it is too verbose.


#62

Gotcha. This is a tough one for data - my opinion is mostly that we need to be able to create nuisance parameters within submodels for modularity’s sake. For data I had been imagining it being explicit as in your proposal, though I know Bob is a fan of allowing Stan programs where what is data and what is parameters is unspecified until compile time (rather than encoded in the Stan model itself). So in some cases, you could end up with the magic extra data parameters you describe, or worse - with them being inferred to be parameters when they aren’t passed in.

@Bob_Carpenter, do you remember what we were going to do about this issue? I forget if we had this particular issue figured out. One solution would be to force submodels to explicitly specify all data they plan to use, which I think was what I had been unconsciously assuming. But this breaks the symmetry between submodels and the larger model that I think is important.


#63

Sorry for jumping in so late. Hopefully I add to the conversation, specifically about the design considerations of the next iteration of the language. Inevitably, I’m going to add a bit of non-relevant comments, so feel free to ignore them.


Before I get into it, I wanted to make sure I’m on the same page in how I think about the Stan language. If we have a Stan program, the inputs are instances of data and parameters (on the unconstrained scale). These are all values that can be represented as either a floating point number or an integer (I think this is what @Bob_Carpenter calls “value maps”). I think there are two types of output:

  1. Computation of the log joint probability distribution function with the data and parameters as input. The main output of the Stan program is the function value, which is a single number. We also can compute the gradients of the function at the given inputs with respect to each of the parameters. Once again, the output is a single value (with the ability to get the gradients at that point).
  2. Lots of outputs that are functions of data, parameters, and a random number generator. By lots, I mean it can be 0 to very many values. The only thing we care about here are the values themselves.

The Stan program itself is not responsible for generating the parameter values. This is done outside of the Stan program by algorithms that are designed to navigate the parameter space and produce parameter values that are passed to the Stan program to compute the log joint probability distribution function.

That’s how I view the role of the Stan program. I think the language is there to make it easy to specify a joint model (and I think submodels fit within that framework) and a way to specify how to generate different quantities using the values of what’s already passed and a random number generator.


Now, onto language design:

  • was there any talk about separating data into modeled and unmodeled data?
    There are really two flavors of data: modeled and unmodeled. Why is this important? If we knew which things were modeled data, we could swap what we know about the parameters and modeled data and be able to generate quantities naturally. I’m just thinking about design and ignoring the computational aspect to it; couldn’t we just run NUTS if we knew which pieces of data were the observations and conditioned on parameters? (things that aren’t modeled typically include number of observations and observation times). I haven’t thought this through enough to know whether that’s enough information to do it generally, but if it does, wouldn’t it solve the generated quantities problem in the original post? As a benefit, it would also help with implementing things like mini-batches too. I think we need some sort of marking of data to make that happen.
  • Currently, one of the messier things we deal with is having to deal with non-centered parameterizations. Does the current submodel proposal make that easy? If not, is there a way to incorporate this use case? (I’m looking at the original post… how do we swap hier_normal_ncp() with hier_normal() (that doesn’t exist). Not looking for something automatic, but just a way to swap out a few lines of code that would enable the different parameterization.)
  • I know libraries were talked about a bit on the thread. Any thoughts on how that looks from the language? I don’t think I saw any examples on how that looks.
  • My current pain with the existing Stan language is the transformed parameters and the model block and having to jump back and forth to do computations in transformed parameters that affect the model block. This proposal kills that issue, so awesome!
  • This isn’t the language, but how we present results. I’d really like to separate out the reporting of the target value from the absolute of the log of the determinant of the Jacobian. I think it would really help debug constructing the model / target term if we could just plug in the parameter values and compute the term that’s up to an additive constant to what’s written directly in the Stan model block. (Please feel free to ignore this point for this discussion; I can put it on a different thread.)

#64

I think explicit binding is compatible with inferring what is data. Explicit binding just requires all data to be declared (with or without the data specifier) I agree that declaring new parameters in fragments/submodels is a necessity.

The assymetry between fragment/submodel and the whole model bothers me, but I think that you could have a dual syntax for this, e.g.:

fragment linear_regression(int N, int M, matrix[N,M] x) { <<model_code>> }

would be equivalent to:

fragment linear_regression { 
     inputs {  //Using "inputs", because those could be both params or data
         int N;
         int M; 
         matrix[N,M] x;
     }
     code {
         <<model_code>>
     }
 }

Or something like this. Or maybe allow only the latter syntax, because it is more general and it is not cool to have two syntaxes for the same thing. But now I am really getting ahead of myself :-)

What I don’t like about inferring what is data at compile time is that the model ceases to be specified in a single file - part of the specification lies in the call to the compiler. I’d rather specify a general model as a fragment and then have multiple top-level models that include the fragment and declare and pass different data/params. Especially since not all combinations of data/params might make sense (e.g. putting x for the GP as params is disastrous).


#65

What you’re describing is how this is usually presented in a Bayesian textbook and how it’s currently coded in the model class. It will also be how it’ll be coded in Stan 3.

But, in Stan 3, we’re thinking of the natural operaitonal semantics in more traditional computational terms, where a statement (or program) is interpreted as a relation between the context before it executes and the context after. Part of the context is a global data reader and a parameter reader. Data declarations read from the data reader (advancing its state) and parameter declarations from the parameter reader. Local variable declarations add to a local function stack context (as in usual programming languages), and operations to set variables (local, transformed data, transformed parameters) affect the local context. The target can also be modified.

So the submodels don’t have to be anything other than a sequence of executable code.

They interact through their side effects on the readers, target, and local variables.

Right now, Stan doesn’t enforce anything about joints, posteriors, etc. You just write a function down that implements the log density from which you want to sample. So I can play tricks like exploiting conjugacy to avoid implementing the joint and instead move directly to implementing the posterior.

The thing to do is think of this as syntactic sugar for:

real x;
x ~ normal(5, 2);
x ~ normal(1, 2);

The declaration will read the variable from the arguments as it’s currently implemented. The second two statements increment the target log density. Because of the rules of arithmetic, the second and third statements may be reordered (not that they’ll provide exactly the same answers with floating point under reordering, but close enough).

I do worry about this issue.

Correct. They only have operational interpretations.

They facilitate modeling by encapsulating transformations, distirbutions, etc. Like usual functions in programming languages. As you may have noticed, it’s a pain in the ass right now to implement a non-centered parameterization and this will let us encapsulate this notion in a fragment of code with a name that’s reusable and controls naming of its component in an orderly way for reuse.

It will be a log density, but not necessarily joint. There will be two different kinds of “free” variables, data and parameters.

The natural decomposition is compositional along the incremental reading of the data and parameters and computation of target log density.


#66

This is not what we’re proposing. We’re proposing standard objects with member variables, just like in C++ or Java. There’s no implicit binding. Everything is declared, named, and bound explicitly by standard scoping rules.

This would be a danger.

To mitigate this risk and lock in usage, a user could always declare a variable as data or parameters or whatever. I’m not suggesting get rid of the declarations, just allowing them to be inferred automatically.


#67

I think we’re on the same page here, other than that as I keep pointing out, nothing requires the log density to be a joint density over data and parameters—it’s just interpreted as a density over parameters up to a proportion.

No, but that’d be useful to know what kind of data we could impute as missing data—only modeled data. It might protect against accidental missing data leading to sampling improper densities.

If we have minibatches, it’d cross modeled (the data) and unmodeled (sizes of minibatches).

Yes. You can implement them both as submodels. You probably wouldn’t implement the simple case that way, since it’s also implementable as beta ~ normal(mu, sigma);. So that’s what’d get swapped out with non_centered_normal(beta, mu, sigma); if you wanted to write it in parallel. This is the case Maria’s thesis was largely designed to address, so there are examples there, too.

The math library doesn’t need to change. We could start distributing libraries of submodels as well as libraries of functions. Even better if we could compile and link independently. I’ve been reading some stuff by Bob Harper et al. recently only submodules in ML and principles of submodules in general that’s very compelling, and it includes discussion of independent compilation of submodules (which we could do—there really isn’t anything magic about scope going on despite some lingering confusion due to the ill- formed first example I presented).

This should be a separate issue, but I can’t resist any bait (I’d be a short-lived fish).

As you know, the hooks are there in the code to make this easy—all the Jacobians are done in one step in the log density function and could be stored in a separate value.

I’ve wanted to have jacobian += in the transformed parameters block from the get-go, but there is a lot of opposition to that both because it’d be confusing and because we’d never use it.

What’s the usue case?


#68

Matthijs drove us to keep working until there is no asymmetry. The operational semantics is exactly the same.

I think some of these discussions may need to be postponed until Matthijs and Maria work out the formal operational semantics so we can resolve some of these confusions more concretely.

If by “inputs” you mean function arguments, then this is exactly how we’re thinking of this. Particularly as arguments to an object constructor.


#69

Minor point of order concern - I think the narrative presentation here and in the original post is both not super helpful for presentation and a bit divisive - for example I feel like I was another major proponent of that symmetry, which is relevant here in this later discussion. Saying this instead as “We collectively placed a high value on that symmetry and kept working to eliminate it” or something could be more helpful. And in the original post, I feel like we could have written a summary aimed at justification instead of a play-by-play. I credit Michael with teaching me that emphasis for paper writing :P


#70

Thanks. I’ll try to ease up on the storytelling or confine it to blog posts rather than serious discussions. It’s obviously very subjective, and I didn’t mean to offend anyone by misattributing ideas.

Do you think it would’ve been better if we’d waited to roll out a complete design? Do you want to have a go at starting a new thread rewriting this the way you’d like and we can close this one? Should a unified design go to a wiki at this point?

I’m open to suggestions.

The only thing I absolutely want to avoid is delaying a broader discussion of what the language is going to look like and how it will behave w.r.t. interfaces, etc. I want to get feedback as early as possible in the design process. Pretty much everyone involved in the project is a stakeholder here.


#71

Interfaces like PyStan and RStan? Did we talk about the major proposed implementation changes there yet? I am a little afraid to name them in this thread for fear of hijacking.

I think this thread suffices for airing the language design we came up with and soliciting alternatives, but I think if you have specific concerns or people to check with on specific issues that haven’t come up (implementation and interface changes?), new threads for each (or grouped) could be helpful? This thread has been very active but mostly focused on the blockless and submodels ideas. We can summarize the design with a wiki page or something when it’s time to coordinate work.


#72

Bob - Re: @martinmodrak’s concerns about asymmetry between submodels and models, he was proposing forcing submodels to always declare data explicitly, and to force them to be passed in with “explicit binding.”

I think I’m leaning towards the original design proposal on this one, where we allow both data and params to be declared inside submodels, and then both are implicitly added to any containing model’s data and param signature and threaded through implicitly.

To try to make this concrete this for myself, I’m thinking of cases where users have a couple of existing models they already use for pieces of analyses and they want to import them into a larger model - it’s nice that importing one requires code changes in just a single place if there are no interactions between models.

What happens when someone imports a model from a file? I’m not sure if we even specified, but maybe in the current design that isn’t possible and you need to create a modified version that declares the model to be a submodel and specifies the signature before you are able to import it. Another possibility here would be to allow you to specify at import time what names will be passed in to the submodel. A third possibility is that we allow (or even require, if we can’t infer something to be a parameter) binding by name in the constructor signature of any variables declared in the body,

// Nonsense
submodel foo() {
  int N;
  vector[N] a ~ normal(0, 1);
  real sigma ~ normal(0, 2);
  vector[N] z ~ normal(a, sigma);
}

f = foo(N=27);
real z = foo.z + 2;

This is perhaps the logical extension of the way we allow access after construction to anything declared inside a submodel with . field and method access - but on the way in instead of the way out.

@Bob_Carpenter @Matthijs @mgorinova any thoughts on what to do about importing standalone models?


#73

In the example above with submodel foo, why isn’t N in the signature of foo. Coming down R/python/c++ that seems clearer since it uses the signature to declare the interface with the submodel.


#74

This seems pretty explicit to me. When a submodel is declared and defined, it comes with parameter and data variables.

model foo() {
  data real mu; 
  parameter real y ~ normal(mu, 1);
}

model f = foo();  

// f.mu and f.y now available

This is similar to what happens in C++ (or Java) when an object is constructed. For example,

vector<string> names(20);

// name[0], ..., name[19] now available in scope

There’s no object-level concept of containing model—just a sequence of statements before and after the declaration and definition of f.

Standalone in what sense?


#75

Yes, that’s absolutely true. In the current version of the language, that’s really the thing that’s specified. But this limits our ability to generate data from the same Stan program. (I wasn’t thinking about this when I wrote it, but after your comment, I thought about it a bit more.)

A little more concretely, if we were actually able to define these things:

  • \theta, the parameters
  • y, the modeled data
  • f(\theta, y), the joint density

Then it’d be really natural to be go between treating \theta as our parameters and condition on y (as we currently do) or just treat y as our parameters and condition on \theta. We’d know which situation we were in based on the data that’s provided.

The nice thing is that this could be done in a backwards compatible way.

That’s what I was thinking about. If we’re talking about the next Stan language, it would be nice to have this sort of library mechanism thought about and how it looks in the language. It’d be great to have user-contributed or user-maintained libraries that extend capability. I think Torsten falls into that camp, so perhaps we can use that as a use case? Beyond the implementation details, I’d like to know that it’s not prohibitive to build something like that.

(context: separating out Jacobian and target)
Two use cases for me:

  1. teaching
  2. debugging

For debugging, it’d really help to have the target defined in the model block be accessible somewhere. I do things like use init files to set parameter values and then check the lp__ to be exactly what I expect. I know how to back out the Jacobian term, but cutting out that step would be great.

For teaching, it’d be nice to treat the model block as a function. It’s a function over the parameters. You change the parameters, and you get a value out. That value should match (up to an additive constant) to how you build it up using target +=. And then the Jacobian part is automatic transforms. You can change the model block, but if you leave the parameters block alone, the Jacobian term is still the same.


#76

What I don’t understand about your proposal is how do you give the main model control of this data? E.g. will I be able to do something like:

data int shared_N;
model splines = spline_regression(N = shared_N);
model linear = linear_regression(N = shared_N);

I believe this HAS to be allowed - I can’t expose both splines.N and linear.N as data of the main model and expect the user to fill in the same number for both (I might want to do some matrix algebra downstream that requires compatible dimensions or something).

If you agree this is necessary, I think implicit binding produces two behaviors that are IMHO weird (but maybe I am just missing something):

  1. If I explicitly bind a data variable, as in the example, the corresponding fully qualified variables splines.N and linear.N have to disappear from the data of the main model. So the interface of the main model depends not only on the fragments/submodels included, but also on the parameters of their constructors.
  2. Model fragment definitions as proposed in the initial post separate input arguments into two groups - the one in parentheses e.g. model linear_regression(vector beta) and the unbound variables within the model (eg. linear_regression::N). But I should be allowed to set all of them in the constructor, which is weird. I believe either everything should be declared explicitly in the interface (this is my proposal), or everything should be implicit.

[begin philosophical rant]
Yeah, but not exactly, as the member variables are also in a sense available before the fragment is declared (they become part of the data for the main model). You don’t get undeclared parameters of nested functions becoming parameters of the function calling them in most languages.
[end philosophical rant]

A slightly orthogonal, but still IMHO unaddressed concern is how will I, as a submodel/fragment library developer hide implementation details and avoid some internal variables that might be subject to change in the future from being referenced. Will there be a private modifier if everything is exposed by default?


#77

I’ve been thinking about the submodel more and more as literally a “plug-in” to a bigger model. Some of the variables in this sub-part are internal and hidden from the rest of the model. And some of them are visible, and “available for attachment”. For example, imagine we’ve got the following (nonsense) program:

submodel foo_model(int N){
   real x ~ normal(0, 1);
   private vector[N] y ~ normal(x, 1);
   vector[N] z ~ normal(y, 1);
}

data int N;
submodel foo = foo_model(N);
vector[N] a ~ normal(foo.z, 1);

Then one can visualise this as the graph on the left below:

The green variables of foo are public parameters that can be used to “attach” the submodel foo to different parts of the main model by referencing them with foo.x or foo.z.

It seems to me that it makes sense for unmodelled data, such as N in this case, to be passed down as an argument to the constructor of a submodel, as with the code above. The public parameters of the model can be used as data depending on the situation, by allowing optional arguments in the constructor. For example, we might instead have what’s visualised on the right in the picture above:

submodel foo_model(int N){
   real x ~ normal(0, 1);
   private vector[N] y ~ normal(x, 1);
   vector[N] z ~ normal(y, 1);
}

data int N;
data real v;
submodel foo = foo_model(N, x=v);
vector[N] a ~ normal(foo.z, 1);

If I understand correctly, this is similar to what @martinmodrak is suggesting. That is: data needs to be passed as arguments to submodels, rather than supplied externally. Not sure if this is the best way forwards, but it seems clean to me, as it solves the problem with asymmetry: a model is just a constructor for a submodel, where any data item is an argument to the constructor. So the model above is the submodel:

submodel main(int N, real v){
   submodel foo = foo_model(N, x=v);
   vector[N] a ~ normal(foo.z, 1);
}

This also makes sense in terms of being flexible with what’s data and what’s a parameter. Anything that’s declared as data in the program is definitely data and needs to be supplied — a non-optional constructor argument. Other variables can be data or not, depending on the usage — they are optional constructor arguments. For example, we can call main(N=10, v=0), or main(N=10, v=0, a=[0,1,2,3,4,5,6,7,8,9]).

(I used a private variable in this example, but we were planning to ignore public/private for now, and only add this functionality in version 3.1. Also, there might be a better keyword than private. Maybe latent? That’s just details, I suppose…)


#78

The use case we have now is SBC (formerly Cook-Gelman-Rubin diagnostistics). There, we generate both the parameters and the data from the hyperpriors.

I think we can even be flexible enough to specify at run time which variables are observed a la BUGS if we have a joint log density.

Torsten will have Stan models, but I’m guessing a lot of it is based on the R interfaces like RStanArm.

Those are both compelling use cases. Neither of them involves manipulating a Jacobian. Where do you want user-defined Jacobians to fall?

I take it you mean minus the Jacobian adjustments for constraints?

The whole implicit unconstraining is very confusing for users, especially given that sampling is happened on the unconstrained scale.

Many of our users are surprised about the constants dropping out, too.

The Jacobian varies with the parameters, though, so it’s not just a constant on top of the model block.


#79

Think of the data input as providing a reader and the parameter input as a providing a reader. Given data and parameters, we get a log density. Think of these being plumbed through to all the object calls or think of them as globals.

The functionality of sharing data among models, yes. The way I’m imagining this going is as I outlined above, with arguments to the both models. Slightly modifying the example above, I’m suggesting something like this:

data int shared_N;
model splines = spline_regression(shared_N);
model linear = linear_regression(shared_N);

The variable shared_N is only defined in the top-level model, then passed as an argument to the two submodels. Those submodels can’t declare it as data—they will take it as an argument. And it all will act lazily like a closure because the shared variables might be parameters. We wait to run time to actually read them.

OK, now I think I see what you’re worried about.

I’m not thinking about it this way, but rather as having data and parameter streams available everywhere.

Then we have to prove theorems that we can rearrange the code to produce the function you’re thinking about as an optimization.

I’m not sure what you mean by “implicit binding”.

In Java and C++, member variables (as opposed to say, static class variables), become available only once an instance is constructed. The local names for the members are defined in the code, so that’s available. But there aren’t any values until an instance is constructed.

Probably. We don’t have a design for that yet and are open to suggestions.

Being member variables, we won’t get external name conflicts at least.

The data and parameter names have to be part of the “public” API. They’re not mutable anyway.


#80

I think so, too. I’d still very much like to avoid this and just have every variable declared exactly once. Then if v is to be declared externally, the submodel and invocation would look as follows.

submodel foo_model(int N, real x) {
   x ~ normal(0, 1);  // no longer declares x
   private vector[N] y ~ normal(x, 1);
   vector[N] z ~ normal(y, 1);
}

data int N;
data real v;  // single declaration of v
submodel foo = foo_model(N, v);
// now v and foo.x are aliases for the same variable
vector[N] a ~ normal(foo.z, 1);

That is, the arguments to the “constructor” are all saved as “members”.