Stan++/Stan3 Preliminary Design

I agree, I think. Our trade-off was that this would mean you potentially need to write separate models which only differ in what is labelled data. Then, the question is, do we specify what is data in the model file or do we just pass it as arguments to the compiler?

I guess the following style might work quite well for the model above:

model coeffs = hier_normal_ncp();
model lr = linear_regression(coeffs.beta);
model hat = lr_predict(coeffs.beta, lr.sigma);

data {
  int<lower = 0> coeffs.K;
  int<lower = 0> lr.N;
  matrix[lr.N, coeffs.K] lr.x;
  int<lower = 0> hat.N;
  matrix[hat.N, coeffs.K] hat.x;
  vector[lr.N] lr.y;
}

I would have a mild preference for specifying what is data at the end of the file as the variables would then actually be in scope, if we think of the submodel calls as constructors.

Note that we only need to specify what is data. Maria’s trick in SlicStan then lets us automatically infer, for the remaining global variables, whether they are parameters or generated quantities.
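
To make that concrete, here is a hypothetical sketch in the proposed blockless style (variable names invented for illustration, not from the proposal itself): only y is marked as data, and the roles of the other globals are inferred.

data int N;
real mu ~ normal(0, 1);          // the target depends on mu through y, so mu is inferred to be a parameter
data vector[N] y ~ normal(mu, 1);
real mu_sq = square(mu);         // the target never reads mu_sq, so it can be inferred to be a generated quantity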

Sure. We’d have to think about how hard it is to implement. It shouldn’t be too terrible, especially if we only allow user-defined derivatives of order 1. The higher-order stuff I don’t really see how to do yet.

That is a problem we should be able to address immediately in Stan 3 as we will allow users to declare variables (and function arguments) of type data. Would that solve the problem?

I agree. The higher order stuff can be done later if at all.

YES, that would solve it! A much needed feature. Cool.

I’m not sure I understand the concern. Maybe @betanalpha could explain a bit more about what he meant?

I mean, of course, no one is suggesting we perform inference for the submodels separately and then combine, just as you would not separately perform computation on subroutines in a larger program if those subroutines access the same state. All we were suggesting is that for the purpose of writing your code it can be useful to have the option to develop libraries of reusable models which you can combine to build bigger models.

I am relatively confident it is possible to infer what is a generated quantity, as long as you still distinguish between ~ (or target +=) and _rngs. I am basing this on Maria’s thesis and her upcoming paper with Charles Sutton and Andy Gordon, combined with our discussions in the past week.
What I am less sure about is whether we can decide in full generality when to optimise a ~ at the model level into an _rng at the generated quantities level. This would depend on detecting certain conditional independence properties of the joint density. We should be able to do this sort of optimisation in easy cases, hopefully.

The problem is that the joint model formulation will no longer be integral to the Stan program, so there will be the additional step of downloading a program and running the pretty printing facility in order to get the more holistic view of the program. This additional step will be a massive pain and I do think that it needs to be considered.

But let me again emphasize that there are compromises all over the place here that I don’t think are being considered. Centering/non-centering is not an unnecessary implementation detail in my opinion, nor are QR decompositions or other modified implementations that can, but don’t necessarily always, improve the latent geometry. Moreover, modularization and high-level abstractions are often in tension with computational considerations and the readability of the joint model. Again, these are all useful concepts in specific contexts, but they are not universally without cost, and the relative benefits and costs need to be discussed, especially when they fall into “good for developers” versus “good for users”.

The difference is not in how the calculations are done but rather when they are done. Other languages are focused on interweaving computation and model specification (either explicitly or implicitly through partial traces and the like), whereas Stan doesn’t do anything until the full joint model has been specified.

I don’t see how a top-level declaration like this wouldn’t compromise many of the stated features of this proposal. It’s not that I dislike the idea of specifying the full signature this way, but rather that it would obstruct many of the modularity features that motivated the design.

Sure, but at the moment the Stan language is too expressive to be able to infer these conditional independence structures. The language is more expressive than probabilistic graphical models!

Right, but this augmented view breaks all of the declared abstractions, does it not? Computation of pushforwards is fundamentally different from computation of the conditional posterior, hence the natural separation into different specifications where different computational tools, like PRNGs, are exposed.

I would argue for a joint distribution over data and parameters, but regardless, the only reasonable representation of such a model right now is a density over the reals. In particular, the entirety of the Stan backend presumes that the language provides such a density. Making languages capable of expressing more general distributions is straightforward – finding algorithms capable of accurately computing with those distributions is, I argue, an open problem.

I don’t disagree with this, but I would argue that we haven’t yet fleshed out a model building workflow well enough to be able to make this decision properly, so discussing its incorporation into the language seems a bit premature to me. In particular, how do we deal with the ambiguity in what is prior and what is likelihood, either in the workflow or the language?

Definitely an important conversation to be having in general but its relevance to the language depends on the relative timelines here.

This may actually be an RStan bug, as I’ve encountered it a few times in the last few weeks. Running in CmdStan and PyStan you get a short warning that a variable matching the constraints was not found in the data file – you used to get this in RStan, but at the moment that warning seems to be getting swallowed.

Because compositionality does not imply a graphical model – there’s no guarantee that two submodels won’t augment the same parameters akin to

mu ~ normal(0, 10);
mu ~ normal(5, 5);

in the current language. Consequently Markov blankets, and conditional sampling or the like, cannot be defined until all of the components have been aggregated anyways.

Preventing the language from having redundant components like this would move Stan toward specifying graphical models, but it would limit the expressiveness of the language.

I don’t think that’s an issue – the issue to which I was referring is that the language features available to f([[S_1]], [[S_2]]) will be strictly richer than the features available to [[S_1]] and [[S_2]] (PRNGs, etc.). In other words, the computation is leaking across the compositionality here and we end up with components using different languages.

Technically this is all fine if we have some directive in the language that tells us when we condition the joint model and hence when the more expressive language can be employed. Currently we do this with blocks; I’m not sure how it would be done in the proposal.

This is fantastic! Having something like this would have saved me many weeks of work creating my own tooling and debugging some very complex models.

It looks like this will be getting rid of the distinction between defining a parameter in the transformed parameters block and in the model block. It would be nice to maintain some way to tell CmdStan that it doesn’t need to save some parameters. This is the only way I avoid generating GBs of CSV files and blowing through my disk quotas.
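
For reference, the current-language workaround being described is that anything you want computed but not saved has to be a local variable in the model block rather than a transformed parameter. A minimal sketch (the model and names are just illustrative):

data {
  int<lower = 0> N;
  int<lower = 1, upper = 8> group[N];
  vector[N] y;
}
parameters {
  real mu;
  real<lower = 0> tau;
  real<lower = 0> sigma;
  vector[8] alpha_raw;
}
model {
  // local variable: used to build the log density but never written to the CSV output;
  // declaring alpha in transformed parameters instead would save it every iteration
  vector[8] alpha = mu + tau * alpha_raw;
  mu ~ normal(0, 5);
  tau ~ normal(0, 2);
  sigma ~ normal(0, 2);
  alpha_raw ~ normal(0, 1);
  y ~ normal(alpha[group], sigma);
}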

Hi all. Bob just informed me of this discourse thread, and I wanted to share a couple of thoughts:

  1. From what I saw of this new design (which Bob explained to Aki and me on blackboard), it looks great: I think it will make models easier to write, to share, and to develop. In particular, currently our Stan models start to get hard to follow when they’re longer than 60 lines or so of code; these new language improvements should help a lot for building complicated hierarchical models.

  2. There seems to be some discussion above regarding the following statement: “‘Statistical model’ does not mean ‘density.’ It means ‘joint distribution over data.’” I think this statement is basically correct; actually I’d say “joint distribution over data and parameters.” It’s something that comes up in chapters 6-8 of BDA3 and we’ve discussed it in the context of various examples over the years. The simplest way of putting it is that we are working with generative models: p(y,theta|M) = p(theta|M)p(y|theta,M). A generative model for parameters and data is more than a posterior density. To put it another way, you can have different data-generating processes (sometimes called “sampling distributions” in theoretical statistics) that give you the same likelihood function. For posterior inference, you just need the likelihood & prior, but for model checking and fake-data simulation, you need the data-generating process.
    From a Stan user’s perspective, we can consider two use cases:
    (a) I just want to supply an objective function and optimize it, or just supply a log-posterior density and draw from it.
    (b) I want to specify a generative model, a joint distribution, so I can draw from the posterior conditional on data and also simulate fake data and predictive checks.
    Both these are important.
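
A standard illustration of the point about data-generating processes (the stopping-rule example from the BDA chapters cited above, spelled out here for concreteness): observing y successes in n Bernoulli(theta) trials under a fixed-n design has sampling distribution p(y | theta, n) = C(n, y) theta^y (1 - theta)^(n - y), while sampling until the y-th success gives p(n | theta, y) = C(n - 1, y - 1) theta^y (1 - theta)^(n - y). As functions of theta, both are proportional to theta^y (1 - theta)^(n - y), so posterior inference is identical, but simulating fake data under the two designs is not.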

I like the idea about using a single model file for both data simulation and inference, but I think it’s very important to know what exactly needs to be supplied as data. Perhaps the data block could contain conditional statements so that a user could switch between the two modes?

model coeffs = hier_normal_ncp();
model lr = linear_regression(coeffs.beta);
model hat = lr_predict(coeffs.beta, lr.sigma);

data {
  int<lower = 0> coeffs.K;
  int<lower = 0> lr.N;
  matrix[lr.N, coeffs.K] lr.x;
  int<lower = 0> hat.N;
  matrix[hat.N, coeffs.K] hat.x;
  int<lower=0,upper=1> y_is_data;
  if(y_is_data) {
    vector[lr.N] lr.y;
  }
}

When y_is_data = 0, lr.y is treated as a parameter (or generated quantity, possibly).
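
For comparison, one way to get a similar effect in the current language is the zero-size trick for optional data; a sketch along those lines (simplified to a single outcome vector, names illustrative):

data {
  int<lower = 0> N;
  int<lower = 0, upper = 1> y_is_data;
  vector[N * y_is_data] y_obs;          // length 0 when y is not supplied
}
parameters {
  real mu;
  real<lower = 0> sigma;
  vector[N * (1 - y_is_data)] y_mis;    // modeled in place of the data when y_is_data == 0
}
model {
  mu ~ normal(0, 5);
  sigma ~ normal(0, 2);
  if (y_is_data)
    y_obs ~ normal(mu, sigma);
  else
    y_mis ~ normal(mu, sigma);
}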

I’m not a developer; I’m still learning Stan and statistics every day. Every working day Stan makes my day. So I’m happy to know that all of us users will have an even better Stan language in the future, and that there is an effervescent community of developers!

To brainstorm a bit, I think there are maybe two things we can do that help address concerns:

  1. Throw an error when someone declares a variable and doesn’t use it (this should help with defining x_tilde). Maybe we can extend this to also flag variables we think are parameters that are never involved in a target += statement.
  2. Allow block specifications both at the top-level and within submodels:
submodel ncp(...) {
  parameters {
    real alpha;
  }
...
}

Thanks for your thoughtful reply.

Again, I think that this is a misleading view of other probabilistic programming languages. Instead, I would say that they focus on specifying the model via the description of a computation.

I don’t see the problem here. For example, you could have two types of Stan files: “programs”, which require a data block and a model block, and “submodel files”, which contain model {…} and must be imported into a program file to be used.

I’m not sure if this is a good design or not, but I’m just saying that it’s possible.

There are two levels of abstractions here:

  • Statistical modelling level: Data, parameters, nuisance parameters.

  • Computational level: State space of the Markov chain, local variables in generated quantities

My understanding is that “generated quantities” is a concept at the computational level that has no analogue at the statistical level. Would you agree with that? If so, it seems a shame to expose that concept at the level of the model description if it can be avoided.

My bet is that this is a fixable limitation of current inference technology rather than an inherent feature of the Stan language. Of course I could be wrong.

Yes, I think I would agree. This is me being “corrupted” by my ML training (we have a gazillion parameters and none of them matter).

I’d maybe revise that to something like “the only practical representation for current statistical software”.

FWIW, I’ve never understood why Stan allowed double-twiddling mu like that. It makes sense when guarded by a conditional, so that only one twiddle is “active” as it were. Is that the only use case or are there use cases where it’s important to have that?
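
For the record, the guarded pattern looks something like this (a sketch with a made-up data flag):

data {
  int<lower = 0, upper = 1> use_wide_prior;
}
parameters {
  real mu;
}
model {
  // only one of the two twiddles is active for any given data set
  if (use_wide_prior)
    mu ~ normal(0, 10);
  else
    mu ~ normal(0, 1);
}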

That was my original motivation. I found BUGS programs too hard to read, so I wanted something that made everything explicit. I thought all along as if the program was going to compute a function from data to parameters to log density. The data block specified the signature of the data argument, the parameter block the signature of the parameters, and the model block provided the body of the function.

I’m talking about compositionality at the level of computing the joint density, not in terms of sampling. The density computation involves only reading data, reading parameters, and adding to the target. These operations can all be composed.
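
In the current language that composition is already visible when the target is built up from user-defined _lpdf functions; a sketch (the function and variable names are made up):

functions {
  // one reusable piece of the joint log density
  real hier_normal_lpdf(vector beta, real mu, real tau) {
    return normal_lpdf(beta | mu, tau);
  }
}
data {
  int<lower = 0> N;
  int<lower = 0> K;
  matrix[N, K] x;
  vector[N] y;
}
parameters {
  real mu;
  real<lower = 0> tau;
  vector[K] beta;
  real<lower = 0> sigma;
}
model {
  // reading data, reading parameters, and adding to the target; the pieces just sum
  target += normal_lpdf(mu | 0, 5);
  target += normal_lpdf(tau | 0, 2);
  target += normal_lpdf(sigma | 0, 2);
  target += hier_normal_lpdf(beta | mu, tau);
  target += normal_lpdf(y | x * beta, sigma);
}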

This is not at all clear to me from looking at them, where they mix I/O, sampling algorithms and the model. In fact, this is a major point of contention that’s led most of the probabilistic programming community to dismiss Stan as just a “log density specification language”. They seem to like that they have “observe/sample” or whatever defined somehow operationally. And they seem to also like that they can mix all the sampling and blocking and everything else together in one language.

Let’s take a simple Pyro example:

def model(...):
    z_prev = self.z_0

    # sample the latents z and observed x's one time step at a time
    for t in range(1, T_max + 1):
        # the next two lines of code sample z_t ~ p(z_t | z_{t-1}).
        # first compute the parameters of the diagonal gaussian
        # distribution p(z_t | z_{t-1})
        z_loc, z_scale = self.trans(z_prev)
        # then sample z_t according to dist.Normal(z_loc, z_scale)
        z_t = pyro.sample("z_%d" % t, dist.Normal(z_loc, z_scale))

        # compute the probabilities that parameterize the bernoulli likelihood
        emission_probs_t = self.emitter(z_t)
        # the next statement instructs pyro to observe x_t according to the
        # bernoulli distribution p(x_t|z_t)
        pyro.sample("obs_x_%d" % t,
                    dist.Bernoulli(emission_probs_t),
                    obs=mini_batch[:, t - 1, :])
        # the latent sampled at this time step will be conditioned upon
        # in the next time step so keep track of it
        z_prev = z_t

I don’t understand the underlying computation, but the doc is written as if the pyro.sample statement really does sampling. I assume it doesn’t, as they are fitting with variational inference. You can also see explicit minibatching in the model itself (as opposed to in some algorithm on top of the model—I guess you could think of minibatching as part of the model).

That’s how I originally designed it. We had a C++ implementation with a class that was constructed with data and implemented a log density function. The data block provided the constructor signature in some sense and the parameters block the signature of the log density function (as well as the necessary constraints to transform to unconstrained parameter space).

Technically that’s right and we could compute the generated quantities using MCMC. But we don’t. The actual Markov chain simulated is just over params, then there’s a separate computation for generated quantities. No information feeds back from generated quantities to the model and it’s all computed with double values and no derivatives. And largely with pure Monte Carlo (as opposed to MCMC) with (P)RNGs.

We can only do that inference in general for directed graphical models. Stan being Turing equivalent on the computation side means we can’t even tell what gets executed, much less what can be effectively marginalized. We might be able to develop some heuristics.

We tend to think of them as joint distributions over parameters and observations conditioned on some constants (like sizes, or maybe unmodeled predictors).

I wouldn’t reserve all your worry in this space for Stan 3. There are plenty of ways the current version fails silently or weirdly that users have a really hard time debugging. Here’s a common one:

vector[N] y;
...
for (n in 1:N)
  y ~ normal(mu, sigma);

This will work just fine and wind up giving you much narrower posterior intervals than you expect because each data point is repeated N times.
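
The intended version is presumably either the indexed statement inside the loop or, better, the vectorized statement with no loop at all:

for (n in 1:N)
  y[n] ~ normal(mu, sigma);

// or simply

y ~ normal(mu, sigma);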

It will for continuous data. What we won’t be able to do easily is discrete data forward sampling.

We were thinking a bit about explicit parallelization of loops and what not. I don’t think we’ll be able to automatically parallelize.

We’ll get this whether we go with a blockless version of Stan or not. This was just waiting for C++11 closures to become available for us. I’d say less than a year to functional types and closures (assuming we don’t get hung up on tuples en route).

In general, we’d like users to be able to specify derivatives for general user-defined functions. This should be easy for the ODE functor, but in general it is tricky because functions can take multiple arguments and we need a Jacobian component for each component of the output and each component of each argument of the input. Do you have a suggestion for how the user could specify such a thing (assume it’ll be easy to implement—I just want to know what you’d like the definitions to look like)?

I’m not sure what you mean by implicit or automatic passing of data. Only in-scope variables are visible anywhere.

What you wrote [data block plus submodel invocations] would also be legal—just a different programming style.

We considered this. To make this fly, we’d have to keep track of types on the submodels a little more closely.

So am I as everyone may have noticed from the first language design! And we’re keeping the strong static typing (which is still painful for lots of non-programmers as we can see by traffic on the forums).

It’s the clarity versus flexibility we’re trying to trade off.

Just to be clear for other readers here, the function arguments themselves can be data, but any intermediate variables will be raised to the join type of all the arguments (still data if all arguments are data, var otherwise). This is something we can fix with better code generation on the inside and isn’t related to this language redesign.

I don’t think that’s going to be legal. If there is a K in coeffs, then it must have been declared there, so it shouldn’t be declared again. I assumed the previous example was meant to remove the data variable declarations from the submodels.

That’s pretty easy. We convert their gradients into template code that can run either with reverse or forward mode and then we nest. This shouldn’t be a problem, in other words.

Do you mean developers of packages on top of Stan or developers of Stan?

This cuts both ways. When you have half a dozen non-centered parameterizations written out by hand (aka cut-and-pasted) in your code, it gets hard to read. It’d be much easier to read and understand if it could be broken into a subroutine of some kind.
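
For instance, the hier_normal_ncp submodel invoked earlier in this thread might look roughly like this in the proposed syntax (purely a hypothetical sketch; the priors and names are invented for illustration):

model hier_normal_ncp() {
  data int<lower = 0> K;
  param real mu ~ normal(0, 5);
  param real<lower = 0> tau ~ normal(0, 2);
  param vector[K] beta_raw ~ normal(0, 1);
  vector[K] beta = mu + tau * beta_raw;   // callers refer to this as coeffs.beta
}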

I’m not sure they do a lot of this kind of interweaving. It seems like it, but then I just looked for examples, and for the most part, the model’s specified independently, just wrapped in a ton of things that look computational before and after the model specification in all these examples. I’m sort of surprised given how much grief we’ve gotten from the PPL people over not allowing interleaved model specification and inference!

This is one of the things that’d be up to the programmer. You could specify all the data up front and that would indeed remove some of the flexibility in the model to be used in multiple directions or for missing data. But you wouldn’t have to.

Sort of. It certainly is different now in the computation. But we can always frame it as just more posterior. The reason I broke it out the way I did in the beginning was purely for efficiency (and because we can’t use HMC on discrete parameters). The generated quantities block is only evaluated once per iteration, and then only on double values with no derivatives. So it’s super efficient compared to sampling.

Not sure how else we’re going to solve the problem. This design came up to address very real problems we’re facing in building real models. In that sense, it doesn’t seem premature.

I don’t understand the question about likelihood/prior, as we don’t make any distinction between these in the current framework, either.

In case people want a real example, we use exactly this kind of “double sampling” for soft centering priors.

Correct. This has been a huge pain as I built it originally in terms of only saving the transformed parameters. It means if there’s an intermediate quantity used to compute a transformed parameter that’s needed in the model, then it needs to be saved.

I didn’t mention this in the body of the post, but indeed the plan is to have the declaration of what to save be independent of what gets defined.

This is a very important point. Shorter code’s usually much easier to read. As is code where uses of variables are near where they are declared. Separating declarations and uses of variables is a known bad practice, and we pretty much enforce it with our block structure. This is one of the things Andrew talked about at StanCon 2017 that inspired Maria’s MS thesis and my own thinking about where the language should go that I wrote about a while ago on these forums.

We’ve toyed with these conditional declarations, but in the end decided they’d be more trouble than they’re worth.

And we’re never going to allow variable declarations with periods in them (this came up in a few posts)—they’re just used to refer to member variables of submodels.

Ever since our meeting, I’ve been toying with making something like this work with the current Stan language. No reason we couldn’t add a submodel with components spread across blocks and then merge it in.

I’ve always found this very confusing within the PPL community, given that they don’t actually run the computations as they are written (instead they do rejection sampling or SMC or something over program states rather than just executing as written). I had thought there was actually more computation in their models, but now I can’t find examples—it’s all jumbled together in their examples, but it’s not mixed in the examples I looked through for Pyro and Edward.

Statements and declarations can only move to generated quantities if they are conditionally independent of the parameters.

It can’t be avoided in general in a language as flexible as Stan (most things aren’t going to be decidable in general). But I want to stress that we’re dealing with a programming language and I want to keep thinking of this as a programming language design. We’re not trying to write pure model specification—we want users to be able to control computation.

It’s an intrinsic feature of our Turing equivalence. Given that you can’t tell what statements get executed, you won’t be able to tell which are conditionally independent in general.

Soft-centering of an otherwise undefined parameter is an example:

y ~ normal(0, 5);
sum(y) ~ normal(0, 0.001);

The first is the prior, the second provides soft centering of the sum. Technically, it’s providing a double prior on y.

Other cases would be things like reparameterizations in terms of differences, where it’s going to be very hard to tell that we’re not double counting:

x - y ~ foo(...);
x + y ~ bar(...);

This is one of the reasons why I think people who have never personally written any compiled code before are not in a good position to learn the current Stan language, although this proposal perhaps makes this somewhat better. If you know some C++, then in retrospect it kind of makes sense that the things in the data block are what the constructor needs and that the log_prob method of the class takes in parameters and evaluates the log-kernel of the density function. But from the perspective of someone who only knows R, Python, etc., a Stan program does not look like a function that would be written in those languages:

foo <- function(arg1, arg2, ...) {
  # do stuff
  return(output)
}

It isn’t simply “Why does Stan require types, sizes, and semicolons?”, it is stuff like why

data {
  int N;
  int K;
}

isn’t written as

data {
  int N = 1000;
  int K = 5;
}

and the answer that N and K are things the constructor is expecting to be passed to it is not really satisfying for someone who does not know what a constructor is.

Anyway, having a submodel may help a bit but it is still going to be weird to them in some ways, i.e. seeing local data objects being used that are not passed to the function. That is not really an argument for or against the proposal, but a reminder that some of our workshops that go

  1. Install a C++ toolchain
  2. Do the 8 schools example
  3. ???
  4. Profit

make no sense to the people in the workshop.

Thanks Bob for the thoughtful reply, I’ll try to clarify a bit:

The implicit passing of data I refer to is that when you have a model like:

model gaussian_process() {
   int<lower = 0> N;
   vector[N] times;
   real length;
   real sigma;   
   vector[N] y ~ multi_normal(0, cov_exp_quad(times, length, sigma));
}

gp = gaussian_process();
y ~ gp.y;

Your specification seems to assume that this will automatically let me specify gp.times as data for the main model. This IMHO has multiple issues:

  1. To understand what data/params/… a submodel provides you need to read its full code
  2. You have to be careful when updating a submodel that’s in a library, as basically everything in the submodel is visible to the user (and by Hyrum’s law someone will depend on it)
  3. If you use multiple copies of a submodel it is impossible to say which data is shared, e.g.:
gp1 = gaussian_process();
gp2 = gaussian_process();

requires me to specify both gp1.times and gp2.times as data for the main model; there is no way to have a single data item times and pass it to both gp1.times and gp2.times. I would prefer that the data passing be made explicit.

I believe that allowing two quite different programming styles has a huge cost as people need to understand both to be able to use the forums and other online advice effectively. I’d rather have only one way of writing things, even if it is the one I personally like less.

And one more philosophical point: I don’t think it is necessarily a bad sign when people come to forums with simple questions about basic misunderstandings (e.g. “Why do you have to declare data separately”, typing issues etc.). Consider two scenarios:

  1. User tries to write a simple model, but struggles as their mental model of how Stan works is incorrect. User asks a simple question on the forums which is easy to understand and gets quickly answered. User is now better equipped to write complex models.
  2. User tries to write a simple model, their mental model of Stan is incorrect, but they succeed because Stan performs a bit of magic behind the scenes. User proceeds to a more complex model and fails. They post long, complicated model code on the forums. The question is hard to understand and takes more time to answer.

The R language is IMHO a typical example, where users succeed with simple things and then post impenetrable code on Stack Overflow asking for help.

I am not sure that this particular proposal moves us much towards the second scenario, but wanted to say that optimizing the language for “Hello world” examples is IMHO not very reasonable. It is IMHO more important that worst practices are hard than that best practices are easy.

Stan also has the advantage that for simple use cases, you have rstanarm and brms, so there isn’t really a reason to strongly cater to users that will only ever write simple models.

Well, that’s just like my opinion. You do you :-)

Yes, that’s the point. You write the program as if the pyro.sample statement really does sampling, but it doesn’t. In reality, complicated algorithms under the hood are approximating posterior distributions, but the details of that computation are abstracted away from the specification of the model.

You can think of Stan programs that way as well, at least in the common case in which the program describes a generative model. You can pretend that ~ means “sample” as a way of understanding the generative model, as long as you’re aware that Stan isn’t really sampling from those distributions under the hood.

I don’t think this can be correct. E.g., in the example at the top, you could report a precision tau_y = 1 / sigma_y^2 in the generated quantities. (Maybe I am misunderstanding your point?)

Ah, but almost every optimization in an optimizing compiler depends on the solution to a Turing-complete problem. The way they handle it is to come up with a conservative analysis. It might miss opportunities for optimization, but in theory it never changes the semantics of a program.

At least for the data, that’s how you define object members in OO languages like C++ and Java. But I agree that the current Stan language is challenging for a lot of people, both programmers expecting one thing (something more composable, like a programming language) and statisticians expecting another (though I’m not quite sure what that is—I think something without types like R would be a start, but we’re not quite to where we can do full automatic type inference because of overloading). We might be able to work it out.

This isn’t part of the proposal.

There’s always something in the ??? bit. The classes we’ve done where statisticians have shown up have gone well. The ones done to more general audiences have been a disaster in my experience. I want to cut out trying to do those until we’ve figured out a more general intro to probability and stats and maybe some programming.

@martinmodrak Thanks for all the thoughtful comments! My reply may be even longer than your post as I’m going through several ways of coding things.

Let me rewrite your model a bit to make it well formed according to your model spec.

model gaussian_process() {
   int<lower = 0> N;
   vector[N] times;
   real length;
   real sigma;   
   vector[N] y ~ multi_normal(0, cov_exp_quad(times, length, sigma));
}

model gp = gaussian_process();

I’m not sure what you intended with the y ~ gp.y as that wouldn’t be a thing. Instead, just constructing the GP itself as above would introduce five variables, gp.N, …, gp.y. I wouldn’t say it’s automatic so much as I’d say when you create a submodel, the submodel’s elements are also created. These created elements are then accessed as gp.N, etc. But it’s the call to gaussian_process() that creates them.

You only need to read the declarations. You already have to do that now. So I don’t think the issue is so much the number of things you have to read as that it’s not clear what is data and what is a parameter. The plan is to allow that to be explicit, as I showed in the very first example. So if you wanted, you could do this:

model gaussian_process() {
   data int<lower = 0> N;
   data vector[N] times;
   param real length;
   param real sigma;   
   data vector[N] y ~ multi_normal(0, cov_exp_quad(times, length, sigma));
}

data int<lower = 0> N;
data vector[N] times;
param real length;
param real sigma;   
model gp = gaussian_process();

Now it looks almost like an existing model in terms of the order of declarations. And it makes it clear there are priors missing for length and sigma, assuming those are the free variables here. Those could either go on the outside, or on the inside of the submodel depending on how much you wanted to encapsulate in the submodel. The point is to start reasoning about the submodel as a unit. If the submodel is coherent, for example in defining a non-centered hierarchical prior or a full GP, then it can be easily used as a component in other models. That’s the main goal here.

It’s all encapsulated, so this is no more an issue than the name a developer used for a variable in an object like vector in C++. The user has named the submodel gp and all names that get introduced are relative to that object, and used as gp.X. This proposal actually makes this ability better. As is, all the names are completely global. So if you use four hierarchical models, you make yourself crazy writing down mu_beta_age and sigma_beta_age and the like for all the coefficients. This gives all the names structure.

Not quite. Nothing is shared between gp1 and gp2. In order to share something between these two, they’d need arguments and something in a higher scope passed down to them. In order to share sigma, it all has to be done explicitly, just like in any other programming language:

model gaussian_process(real sigma) {
   data int<lower = 0> N;
   data vector[N] times;
   param real length;
   data vector[N] y ~ multi_normal(0, cov_exp_quad(times, length, sigma));
}

real<lower = 0> sigma ~ normal(0, 1);  // give it a prior
model gp1 = gaussian_process(sigma);
model gp2 = gaussian_process(sigma);

This is a serious concern and one I worry about in trying to maintain backward compatibility. The question’s whether people are going to keep wanting to write things both ways, or if we can wean them onto new styles. This is essentially what happens with C++ — there are lots of ways to do things, but there are modern and antiquated ways to do them, with new code being written in the modern style.

If you think of (a) allowing data, parameter as qualifiers, as we’ve proposed, then all you need to do is think of data { ... } as a way to specify the data qualifier on a group of variables.

So I hope it’s not that different in terms of style. But it will be an ongoing concern.

To me, it depends how much effort they put in before they got there. The ones that get me are the two extremes: “I spent two days wrestling with X” for something we could’ve answered quickly, and “here, I just guessed and it didn’t work, help me write my code one debug line at a time”. I also worry that when people show up to the forums with simple blocking issues that stop them from proceeding (as opposed to, say, some misunderstandings about operational issues), that really represents many, many more people who got frustrated and walked away.

It’s very hard to build a tool that works for beginners and for experts. We’ve largely focused on the experts and are having some in-person discussions about how much we want to continue that focus versus try to address some more beginners issues.

It’s not just R. Anything involving code is like this. You can get a little way just by apprehending everything at once when debugging, then you hit a complexity barrier that needs to be avoided through more engineering. It’s hard.

That’s just in R, so it’s invisible to a subset of our users. And brms and rstanarm cater to some particular simple classes of models (and some quite complex ones), but not all of them.

Not to speak for @betanalpha, but he and I have always been in agreement that the basic Stan language is actually a good tool for teaching people about the components of a Bayesian model. Partly because it’s easy to read a Stan model. I worry that this is going to change if we start writing more code-like examples. But they may actually be easier. I sort of prefer the hello world example:

param
  real<lower = 0, upper = 1> theta ~ beta(1, 1);
data
  int<lower = 0> N;
data
  int<lower = 0, upper = 1> y[N] ~ bernoulli(theta);

to

data {
  int<lower = 0> N;
  int<lower = 0, upper = 1> y[N];
}
parameters {
  real<lower = 0, upper = 1> theta;
}
model {
  theta ~ beta(1, 1);
  y ~ bernoulli(theta);
}

but I see the appeal of both versions. The former is short, follows the generative story, and links the declarations with the uses of the parameters and modeled data variables. The latter very clearly circumscribes the variables and their roles.

If it’s not, you should cite references. :-)

Exactly how we do think about it. But it can be very confusing to users.

Given the parameter sigma, the target is independent of tau_y. So you can move it to generated quantities. If you used tau_y somewhere else in the model, say by writing y2 ~ normal(0, tau_y), then you wouldn’t be able to do that. But if you did it where y2 itself isn’t data, say to do posterior predictive replications, then you can again move it to generated quantities, because now the target doesn’t depend on y2 or tau_y given sigma. And when I say “depend” here, I mean in terms of whether it’s getting any useful information about the parameters. Let me write out a full example:

parameter real mu ~ normal(0, 1);
data int N;
data vector[N] y ~ normal(mu, 1);
data int N_sim;
parameter vector[N_sim] y_sim ~ normal(mu, 1);

In this case the parameters are mu and y_sim and the posterior factors as

p(\mu, y^{\mathrm{sim}} \mid y) = p(\mu \mid y) \cdot p(y^{\mathrm{sim}} \mid \mu) \propto p(\mu) \cdot p(y \mid \mu) \cdot p(y^{\mathrm{sim}} \mid \mu).

BUGS, for example, can do this automatically, based on analysis of the Markov blanket of variables.
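
In today’s block syntax, the optimized version of this example would compile the y_sim declaration down into generated quantities, along these lines (a sketch):

data {
  int<lower = 0> N;
  vector[N] y;
  int<lower = 0> N_sim;
}
parameters {
  real mu;
}
model {
  mu ~ normal(0, 1);
  y ~ normal(mu, 1);
}
generated quantities {
  vector[N_sim] y_sim;
  for (n in 1:N_sim)
    y_sim[n] = normal_rng(mu, 1);   // plain Monte Carlo; no feedback into the target
}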

Exactly. So the question is whether we can detect useful cases locally.

An alternative is to consider a directed graphical modeling sublanguage which is no longer Turing equivalent, but lets you calculate a bunch of this stuff. I think we might also usefully be able to do a lot of hotspot runtime unfolding based on the data that’s actually observed.