RStan3 and PyStan3 Interface

I think we agree—the GoF book is a bunch of UML detailing C++ implementations. I didn’t mean that, I just meant the basic patterns (like having singletons, facades, or adapters). The linguistic turn is that “pattern” has come to mean something a little more vague and high level than “algorithm”. So making things iterable or enforcing a singleton is a pattern (lower case), whereas quicksort is an algorithm.

There’s a Python book that lays things out this way (no idea if the book’s any good w.r.t. Python, but it’s the right idea of what patterns are):

http://python-3-patterns-idioms-test.readthedocs.io/en/latest/PatternConcept.html

I hadn’t heard the term “duck typing” before, but it sounds like a C++ concept and template instantiation, especially in the sense that there’s no interface being defined in the object language that these iterables have to extend or declare that they implement. Of course, no typing, so no need for template parameters—everything acts like a template parameter for free! The trick in the video was wrapping a rabbit in a duck outfit.

We should talk about returning raw tuples and what not. The advice I’ve seen is to prefer returning class instances initially implemented with direct exposure of member variables then later encapsulating these if need be with these special ___ methods.

I read that as “duck confit” and became both really confused and really hungry.

I’m hoping to work on PyStan 3 this week. It would be great if we could agree on R and Python user-facing APIs which are similar. We do all agree that it’s a good thing if people can switch between RStan and PyStan without too much effort, right?

Here’s a flattened table capturing some of the main proposals. Assume the user is working through the 8schools example.

Proposal A

RStan

program <- stan_compile("data { int<lower=0> J; ... }")
data <- list(J=8, ...)
compiled_program_with_data <- program$instantiate(data)
post <- compiled_program_with_data$sample(seed=1)
post$mu

PyStan

MyProgram = pystan.compile("data { int<lower=0> J; ... }")
data = {"J": 8, ...}
program = MyProgram(data)
post = program.sample(seed=1)
post['mu']

Comment

Original proposal from Stan Wiki’s User Interface Guidelines for Developers

API requires the fewest lines of code from the user.

Proposal B

RStan

config <- stan_config("data { int<lower=0> J; ...")
config$data(J, y, sigma)
module <- config$build(seed=1)
post <- module$sample() 

PyStan

config = pystan.make_config("data { int<lower=0> J; ...")
data = {"J": 8, ...}
config.data = data  # or config.set_data(data)
module = config.build(seed=1)
post = module.sample()
post['mu']

Comments

New RStan3 proposal. Introduction of build method seems like the most significant change.

Proposal C

RStan

posterior <- stan_compile("data { int<lower=0> J; ...", J, y, sigma, seed=1)
draws <- posterior.sample()
draws$mu

PyStan

posterior = pystan.compile("data { int<lower=0> J; ...", data, seed=1)
draws = posterior.sample(seed=1)
draws['mu']

Comments

Assumes we never need the compiled-but-uninstantiated object. Is this always true?

Proposal D

RStan

posterior <- stan_model("data { int<lower=0> J; ...", J, y, sigma, seed=1)
draws <- posterior.sample()
draws$mu

PyStan

posterior = pystan.StanModel("data { int<lower=0> J; ...", data, seed=1)
draws = posterior.sample(seed=1)
draws['mu']

Comments

Proposal C but with stan_model/StanModel


Any other proposals?

Edit: added Proposal C
Edit: fix proposal C and add proposal D

I don’t see the compiled-but-uninstantiated object as having a role from the user’s point of view, if the only thing that can be done with it is instantiate it, which the user probably doesn’t really understand anyway. So, I think compilation and instantiation should be combined into one step. In R, it could look like

posterior <- stan_compile("8schools.stan", seed = 12345, J, y, sigma)
draws <- posterior$sample()

but there are a lot of other viewpoints, ranging from the name of the first function to whether the algorithms should be methods of a different class than the low-level methods like $log_prob.

1 Like

Certainly the seed should go with sample() right? The user might want to
sample multiple times from the same posterior.

I like this proposal. I’ll add it as Proposal C above.

No. Even for 2.16.0 the seed is needed at instantiation time in order to do PRNG in the transformed data block (and because the Boost RNG thing is a template parameter). I think you will be able to call $sample() multiple times and it will continue from the previous PRNG state.

This isn’t exactly right. The seed is not needed at compilation time.
The C++ stan_model can be (and currently is) instantiated immediately
at sample time, so there’s no problem allowing the seed to be changed at
each sample call.

Although I don’t mind having the user pass a seed a compile time (as
PyStan 3 will cache all compilation results), it isn’t something we are
required to do.

Anyway, this is a detail which can be sorted out later. Making progress
on deciding among the proposals is what’s essential right now.

I like proposal C - simpler, cuts out intermediate steps the user won’t likely care about.

The compiled but uninstantiated object could know the name and types of all the variables and provide them statically, but I don’t even know that’s possible through Rcpp.

The compiled but uninstiated object is what you need to run the same program on different data. As is in RStan, this is all implicit in that it tries to act like make behind the scenes and figure out if it needs to recompile or not to get new data. As things are now, you need the whole fit object (post in Proposal A, draws in Proposal C) or the whole posterior object, both of which seem confusing to me.

You need the seed to instantiate the Stan program with data. Ideally, that seed will then be used for later calls so that they’re reproducible with a single seed. That’s why we changed the seed/advance behavior to treat the first chain as already advancing past what was used to randomly generate any random data.

I like Proposal A because it gives you the pieces you need. I absolutely don’t mind if there’s a single abstraction at the top like we have now:

fit <- hmc_diag_e_adapt(program = "foo.stan",
                        data = data_list,
                        ...);

as long as there’s a way to get the individual components as such out of the fit. That is, I want to be able to do

program <- fit$program
posterior <- fit$posterior
draws <- fit$draws

So that if you wnat to refit with new data after calling the general program, you’d pull the program out of the fit object and reinstantiate it.

This does make sense. So there are people who are going to want to call
log_prob without doing any sampling but do need some random sampling to
have occurred in the transformed data block. OK. I’ll update the proposals.

I’ve added Proposal D, which is Proposal C with the following change: instead of compile we use the old stan_model / StanModel. Any thoughts on this? My reasoning is that “compilation” isn’t really an essential concept here. Who knows, maybe in the future we’ll have some JIT compiler and nobody will be the wiser.

I’m open to other alternatives to stan_model/StanModel. Maybe ‘StanPosterior’?

For proposal D, I think RStan is going to wrap up the arguments as list(J=J, y=y, ...).

Still no place in proposal D to get the compiled program without the data, which is the chunk you’ll reuse if you want to run the same model over multiple data without recompiling.

I’d also rather encourage people to pass in files for their Stan programs rather than strings so that they’re easier to use in other Python or R programs and so that the error messages get appropriate line numbers.

Is there a strong functions are verbs, variables are nouns convention in R or Python? If so, “model” isn’t the best choice. It’s more like “condition” as I had it before if the data’s included. “sample” can work either way, as a noun or a verb, and I take “sample” to be roughly synonymous with “draws” in that a sample is made up of a number of draws.

posterior = pystan.posterior(program, data, seed)
sample = model.sample(seed)

It is, it’s just callling static methods on a c++ object, right?

This is true, but in my head, I was ignoring it because the compilation doesn’t actually take place the second time you give it data if your Stan program hasn’t changed on the disk.

Does PyStan have that same make-like functionality? If not, that’d be a great thing to add.

Does R have an object encapsulating the compiled but uninstantiated program?

I wasn’t suggesting that people would go through these steps—I’m guessing everyone will want the one-liner just like now.

I was just trying to outline what the basic building blocks were. I do think it makes sense to expose things rather than leave them implicit. And it helps the user understand what’s going on with the stan() function itself to understand the longer-form of what it goes through.

That is really the question, to create one or not.

I agree with Ben here, this is the question. Do we need to have an object that users interact with which holds information about the compiled but uninstantiated program? If so, what else does it do?

Re: one-liner stan(.........). I think we should deprecate this in favor of the proposed two-line sequence. We should have one recommended way to do things.

In favor of the two-line sequence. Separating model compilation (+ instantiation with data) from sampling/optimizing is useful. It makes writing documentation easier. It happens to be more in keeping with people’s expectations in Python too. (People familiar with scikit-learn will be very comfortable with splitting up the model setup from the model fitting.)

I think we all agree that what goes on under the hood is:

  1. translate Stan program to C++
  2. compile and link C++ program
  3. instantiate C++ class with data
  4. run sampling or optimization or whatever

The output of step (3) is what you need to evaluate densities, and I think we’re converging on exposing that to the user as its own object that’s not tied up witha fit object.

If you don’t expose the compiled and linked C++ program resulting from (2), then your first function call has to be a lot more complex:

A. a single function does all of (1)–(3). But before doing (1) and (2), it checks to see if the Stan program already has a compiled and linked C++ program and reuses it if so
B. run sampling or optimization

Are you saying that if you have A & B, you don’t want to have a single function that combines them?

In R, users are used to one liners that specify the model and the data. And I don’t think there’s the fussiness about not having multiple ways to do things that Python seems to have (but doesn’t follow, of course, because it’s a programming language, so what the Python people tell me is that what they really meant is that there’s usually only one suitably pythonic way to do things).