Stan++/Stan3 Preliminary Design


#61

Good point, let’s be more specific. The discussion I think I am having (hope others agree :-) ) is on the points from my previous post here: Stan++/Stan3 Preliminary Design

I’ll try to rephrase it and introduce some terminology to let us stay focused.

Basically, the design as proposed (if I understand it correctly) features what I'd call implicit binding: unbound data variables from a fragment/submodel somehow magically bubble up to the top level and are treated as data of the main model, while everything inside the fragment is visible to the outside world, e.g.:

fragment gaussian_process {
  data int N;
  data vector[N] x;
  ...
  vector[N] y_raw ~ normal(0,1);
  vector[N] y = <<cholesky and GP stuff>>
}

fragment gp = gaussian_process();
//now both gp.y and gp.y_raw can be referenced

There were two variants floated around, one in which the main model automatically gains data gp.x and gp.N, and another where the main model gets data x and N. In both cases this is automagic.
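
Spelled out from the caller's side (this is just my reading of the two variants, not actual proposed syntax), the main model would effectively gain one of these sets of data declarations:

// variant 1: data namespaced under the fragment instance
data int gp.N;
data vector[gp.N] gp.x;

// variant 2: flat names at the top level
data int N;
data vector[N] x;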

I am trying to argue for explicit binding, where fragments/submodels have an explicit interface, both in (as arguments at instantiation) and out (it is clearly stated which internal parameters/transformed parameters/generated quantities are visible to the main model), e.g.:

data int whatever_I_call_it;
data vector[whatever_I_call_it] my_x;

fragment gaussian_process(int N, vector x) {
  //Introducing new data variables here is a syntax error; only params/gen quants are allowed
  vector[N] y_raw ~ normal(0,1);
  public vector[N] y = <<cholesky and GP stuff>>
}

fragment gp = gaussian_process(whatever_I_call_it, my_x);
//only gp.y can be referenced

Obviously the question about explicit/implicit input is mostly orthogonal to explicit/implicit inner member access.

The argument for implicit binding seems to be - at least in part - that it is less verbose and less demanding of beginners. I took Elm as a counterexample where everything is explicit, yet it is still terse and relatively easy to learn. I also like focusing more on the debugging experience than on the code-writing experience, as I believe debugging is more often the bottleneck.

I don’t want to steal a lot from Elm (Stan definitely should not switch to being functional), but as for size - I’ve certainly written programs one or two orders of magnitude larger in Elm than in Stan, and I’ve seldom felt it was too verbose.


#62

Gotcha. This is a tough one for data - my opinion is mostly that we need to be able to create nuisance parameters within submodels for modularity’s sake. For data I had been imagining it being explicit as in your proposal, though I know Bob is a fan of allowing Stan programs where what is data and what is parameters is unspecified until compile time (rather than encoded in the Stan model itself). So in some cases, you could end up with the magic extra data parameters you describe, or worse - with them being inferred to be parameters when they aren’t passed in.

@Bob_Carpenter, do you remember what we were going to do about this issue? I forget if we had this particular issue figured out. One solution would be to force submodels to explicitly specify all data they plan to use, which I think was what I had been unconsciously assuming. But this breaks the symmetry between submodels and the larger model that I think is important.


#63

Sorry for jumping in so late. Hopefully I can add to the conversation, specifically about the design considerations for the next iteration of the language. Inevitably, I’m going to add a few non-relevant comments, so feel free to ignore them.


Before I get into it, I wanted to make sure I’m on the same page in how I think about the Stan language. If we have a Stan program, the inputs are instances of data and parameters (on the unconstrained scale). These are all values that can be represented as either a floating point number or an integer (I think this is what @Bob_Carpenter calls “value maps”). I think there are two types of output:

  1. Computation of the log joint probability distribution function with the data and parameters as input. The main output of the Stan program is the function value, which is a single number. We can also compute the gradients of the function at the given inputs with respect to each of the parameters. Once again, the output is a single value (with the ability to get the gradients at that point).
  2. Lots of outputs that are functions of data, parameters, and a random number generator. By lots, I mean anywhere from zero to very many values. The only thing we care about here are the values themselves.
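
For concreteness, a minimal program in the current (Stan 2) syntax that produces both kinds of output:

data {
  int<lower=0> N;
  vector[N] y;
}
parameters {
  real mu;
  real<lower=0> sigma;
}
model {
  // output 1: contributions to the log joint density (a single value, plus gradients)
  mu ~ normal(0, 5);
  sigma ~ normal(0, 1);
  y ~ normal(mu, sigma);
}
generated quantities {
  // output 2: any number of values computed from data, parameters, and an RNG
  real y_rep = normal_rng(mu, sigma);
}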

The Stan program itself is not responsible for generating the parameter values. This is done outside of the Stan program by algorithms that are designed to navigate the parameter space and produce parameter values that are passed to the Stan program to compute the log joint probability distribution function.

That’s how I view the role of the Stan program. I think the language is there to make it easy to specify a joint model (and I think submodels fit within that framework) and a way to specify how to generate different quantities using the values of what’s already passed and a random number generator.


Now, onto language design:

  • was there any talk about separating data into modeled and unmodeled data?
    There are really two flavors of data: modeled and unmodeled. Why is this important? If we knew which data were modeled, we could swap the roles of the parameters and the modeled data and generate quantities naturally. I’m just thinking about the design and ignoring the computational aspect: couldn’t we just run NUTS over the modeled data, conditioned on the parameters, if we knew which pieces of data were the observations? (Things that aren’t modeled typically include the number of observations and the observation times.) I haven’t thought this through enough to know whether that’s enough information to do it generally, but if it is, wouldn’t it solve the generated quantities problem in the original post? As a benefit, it would also help with implementing things like mini-batches. I think we need some sort of marking of data to make that happen.
  • Currently, one of the messier things we deal with is non-centered parameterizations. Does the current submodel proposal make that easy? If not, is there a way to incorporate this use case? Looking at the original post: how would we swap hier_normal_ncp() for a hier_normal() that doesn’t exist? I’m not looking for something automatic, just a way to swap out a few lines of code that would enable the different parameterization (see the sketch after this list for the kind of swap I mean).
  • I know libraries were talked about a bit on the thread. Any thoughts on how that looks from the language? I don’t think I saw any examples on how that looks.
  • My current pain with the existing Stan language is having to jump back and forth between the transformed parameters and model blocks to do computations in transformed parameters that affect the model block. This proposal kills that issue, so awesome!
  • This isn’t about the language, but about how we present results. I’d really like to separate out the reporting of the target value from the log of the absolute value of the determinant of the Jacobian. I think it would really help in debugging the construction of the model / target term if we could just plug in the parameter values and compute the term that matches, up to an additive constant, what’s written directly in the Stan model block. (Please feel free to ignore this point for this discussion; I can put it on a different thread.)
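
For reference, the kind of swap the non-centered bullet above is asking about, in current Stan syntax (schematic fragments, not complete programs; mu and sigma are assumed to be declared elsewhere):

// centered parameterization
parameters {
  vector[N] beta;
}
model {
  beta ~ normal(mu, sigma);
}

// non-centered parameterization: the same model, spread over three blocks
parameters {
  vector[N] beta_raw;
}
transformed parameters {
  vector[N] beta = mu + sigma * beta_raw;
}
model {
  beta_raw ~ normal(0, 1);
}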

#64

I think explicit binding is compatible with inferring what is data. Explicit binding just requires all data to be declared (with or without the data specifier). I agree that declaring new parameters in fragments/submodels is a necessity.

The asymmetry between fragment/submodel and the whole model bothers me, but I think that you could have a dual syntax for this, e.g.:

fragment linear_regression(int N, int M, matrix[N,M] x) { <<model_code>> }

would be equivalent to:

fragment linear_regression { 
     inputs {  //Using "inputs", because those could be both params or data
         int N;
         int M; 
         matrix[N,M] x;
     }
     code {
         <<model_code>>
     }
 }

Or something like this. Or maybe allow only the latter syntax, because it is more general and it is not cool to have two syntaxes for the same thing. But now I am really getting ahead of myself :-)

What I don’t like about inferring what is data at compile time is that the model ceases to be specified in a single file - part of the specification lies in the call to the compiler. I’d rather specify a general model as a fragment and then have multiple top-level models that include the fragment and declare and pass different data/params. Especially since not all combinations of data/params might make sense (e.g. putting x for the GP as params is disastrous).
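
For instance (again, purely illustrative syntax), I would like to write one fragment and several top-level models that bind its inputs differently:

// shared fragment, reused across models
fragment gaussian_process(int N, vector x) { <<model_code>> }

// top-level model 1: x is observed data
data int N;
data vector[N] x;
fragment gp = gaussian_process(N, x);

// top-level model 2: x is computed from other data
data int N;
data vector[N] t;
vector[N] x = log(t);
fragment gp = gaussian_process(N, x);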


#65

What you’re describing is how this is usually presented in a Bayesian textbook and how it’s currently coded in the model class. It will also be how it’ll be coded in Stan 3.

But, in Stan 3, we’re thinking of the natural operational semantics in more traditional computational terms, where a statement (or program) is interpreted as a relation between the context before it executes and the context after. Part of the context is a global data reader and a parameter reader. Data declarations read from the data reader (advancing its state) and parameter declarations read from the parameter reader. Local variable declarations add to a local function stack context (as in ordinary programming languages), and operations that set variables (local, transformed data, transformed parameters) affect the local context. The target can also be modified.

So the submodels don’t have to be anything other than a sequence of executable code.

They interact through their side effects on the readers, target, and local variables.

Right now, Stan doesn’t enforce anything about joints, posteriors, etc. You just write a function down that implements the log density from which you want to sample. So I can play tricks like exploiting conjugacy to avoid implementing the joint and instead move directly to implementing the posterior.

The thing to do is think of this as syntactic sugar for:

real x;
x ~ normal(5, 2);
x ~ normal(1, 2);

The declaration will read the variable from the arguments as it’s currently implemented. The two sampling statements each increment the target log density. Because of the rules of arithmetic, the second and third statements may be reordered (not that they’ll give exactly the same answers under floating point when reordered, but close enough).
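
In the current syntax, those two sampling statements are just increments of the target log density, which is why they commute:

target += normal_lpdf(x | 5, 2);
target += normal_lpdf(x | 1, 2);
// addition commutes, so the order of the two increments does not matter
// (up to floating-point rounding)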

I do worry about this issue.

Correct. They only have operational interpretations.

They facilitate modeling by encapsulating transformations, distributions, etc., like ordinary functions in programming languages. As you may have noticed, it’s a pain in the ass right now to implement a non-centered parameterization, and this will let us encapsulate that notion in a fragment of code with a name that’s reusable and that controls the naming of its components in an orderly way for reuse.

It will be a log density, but not necessarily joint. There will be two different kinds of “free” variables, data and parameters.

The natural decomposition is compositional along the incremental reading of the data and parameters and computation of target log density.


#66

This is not what we’re proposing. We’re proposing standard objects with member variables, just like in C++ or Java. There’s no implicit binding. Everything is declared, named, and bound explicitly by standard scoping rules.

This would be a danger.

To mitigate this risk and lock in usage, a user could always declare a variable as data or parameters or whatever. I’m not suggesting we get rid of the declarations, just that we allow them to be inferred automatically.


#67

I think we’re on the same page here, other than that, as I keep pointing out, nothing requires the log density to be a joint density over data and parameters; it’s just interpreted as a density over the parameters, up to proportionality.

No, but it’d be useful to know what kind of data we could impute as missing data: only modeled data. It might protect against accidental missing data leading to sampling improper densities.

If we had minibatches, they’d cross modeled data (the observations) and unmodeled data (the sizes of the minibatches).

Yes. You can implement them both as submodels. You probably wouldn’t implement the simple case that way, since it’s also implementable as beta ~ normal(mu, sigma);. So that’s what’d get swapped out with non_centered_normal(beta, mu, sigma); if you wanted to write it in parallel. This is the case Maria’s thesis was largely designed to address, so there are examples there, too.
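
For concreteness, one possible shape such a non_centered_normal submodel could take under the proposed syntax. This is only a sketch, not the design from Maria's thesis, and the calling convention in particular isn't settled (the post above writes the call as non_centered_normal(beta, mu, sigma)):

submodel non_centered_normal(real mu, real sigma) {
  real beta_raw ~ normal(0, 1);        // nuisance parameter private to the submodel
  real beta = mu + sigma * beta_raw;   // so beta ~ normal(mu, sigma) implicitly
}

// centered, written directly:
//   beta ~ normal(mu, sigma);
// non-centered, via the submodel (one possible calling convention):
//   ncp = non_centered_normal(mu, sigma);   // then use ncp.beta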

The math library doesn’t need to change. We could start distributing libraries of submodels as well as libraries of functions. Even better if we could compile and link them independently. I’ve been reading some stuff by Bob Harper et al. recently on submodules in ML and the principles of submodules in general that’s very compelling, and it includes discussion of independent compilation of submodules (which we could do; there really isn’t anything magic about scope going on, despite some lingering confusion due to the ill-formed first example I presented).

This should be a separate issue, but I can’t resist any bait (I’d be a short-lived fish).

As you know, the hooks are there in the code to make this easy—all the Jacobians are done in one step in the log density function and could be stored in a separate value.

I’ve wanted to have jacobian += in the transformed parameters block from the get-go, but there is a lot of opposition to that both because it’d be confusing and because we’d never use it.

What’s the use case?


#68

Matthijs drove us to keep working until there is no asymmetry. The operational semantics is exactly the same.

I think some of these discussions may need to be postponed until Matthijs and Maria work out the formal operational semantics so we can resolve some of these confusions more concretely.

If by “inputs” you mean function arguments, then this is exactly how we’re thinking of this. Particularly as arguments to an object constructor.


#69

Minor point of order concern - I think the narrative presentation here and in the original post is both not super helpful and a bit divisive - for example, I feel like I was another major proponent of that symmetry, which is relevant here in this later discussion. Saying this instead as “We collectively placed a high value on that symmetry and kept working until we had eliminated the asymmetry” or something could be more helpful. And in the original post, I feel like we could have written a summary aimed at justification instead of a play-by-play. I credit Michael with teaching me that emphasis for paper writing :P


#70

Thanks. I’ll try to ease up on the storytelling or confine it to blog posts rather than serious discussions. It’s obviously very subjective, and I didn’t mean to offend anyone by misattributing ideas.

Do you think it would’ve been better if we’d waited to roll out a complete design? Do you want to have a go at starting a new thread rewriting this the way you’d like and we can close this one? Should a unified design go to a wiki at this point?

I’m open to suggestions.

The only thing I absolutely want to avoid is delaying a broader discussion of what the language is going to look like and how it will behave w.r.t. interfaces, etc. I want to get feedback as early as possible in the design process. Pretty much everyone involved in the project is a stakeholder here.


#71

Interfaces like PyStan and RStan? Did we talk about the major proposed implementation changes there yet? I am a little afraid to name them in this thread for fear of hijacking.

I think this thread suffices for airing the language design we came up with and soliciting alternatives, but I think if you have specific concerns or people to check with on specific issues that haven’t come up (implementation and interface changes?), new threads for each (or grouped) could be helpful? This thread has been very active but mostly focused on the blockless and submodels ideas. We can summarize the design with a wiki page or something when it’s time to coordinate work.


#72

Bob - Re: @martinmodrak’s concerns about asymmetry between submodels and models, he was proposing forcing submodels to always declare data explicitly, and to force them to be passed in with “explicit binding.”

I think I’m leaning towards the original design proposal on this one, where we allow both data and params to be declared inside submodels, and then both are automatically added to any containing model’s data and parameter signature and threaded through implicitly.

To try to make this concrete for myself, I’m thinking of cases where users have a couple of existing models they already use for pieces of analyses and they want to import them into a larger model - it’s nice that importing one requires code changes in just a single place if there are no interactions between models.
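
For example (all model and variable names below are made up, and the syntax is only a sketch of the implicit-threading idea):

growth = growth_curve();        // existing model, imported as a submodel, unchanged
survival = survival_model();    // a second existing model, also unchanged
// each submodel's data and parameters are added to the containing model's
// signature automatically; the only new code is whatever ties them together, e.g.
survival.hazard ~ normal(growth.rate, tau);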

What happens when someone imports a model from a file? I’m not sure we even specified this, but maybe in the current design it isn’t possible, and you need to create a modified version that declares the model to be a submodel and specifies the signature before you are able to import it. Another possibility would be to allow you to specify at import time what names will be passed in to the submodel. A third possibility is that we allow (or even require, if we can’t infer something to be a parameter) binding by name in the constructor signature of any variables declared in the body:

// Nonsense
submodel foo() {
  int N;
  vector[N] a ~ normal(0, 1);
  real sigma ~ normal(0, 2);
  vector[N] z ~ normal(a, sigma);
}

f = foo(N=27);
vector[27] z2 = f.z + 2;  // access via the instance f; z has length N = 27

This is perhaps the logical extension of the way we allow access after construction to anything declared inside a submodel with . field and method access - but on the way in instead of the way out.

@Bob_Carpenter @Matthijs @mgorinova any thoughts on what to do about importing standalone models?