RStan3 and PyStan3 Interface


#1

We had an unusually productive discussion today on Stan 3, mostly from the perspective of RStan. You can install a somewhat working version of the provisional rstan3 package from GitHub with

devtools::install_github("stan-dev/rstan", ref = "develop", subdir = "rstan3")

which sits on top of rstan 2.15.1 from CRAN. Much later I will make a branch that can be merged into rstan and then rstan3 will be deleted.

I also updated the wiki but if you want to follow what rstan3 is doing, it is probably best to look at https://github.com/stan-dev/rstan/blob/develop/rstan3/R/rstan.R .

You can currently do

library(rstan3)
J <- 8
y <- c(28,  8, -3,  7, -1,  1, 18, 12)
sigma <- c(15, 10, 16, 11,  9, 11, 10, 18)
config <- stan_config("8schools.stan") # wait a long time for it to compile
config$data(J, y, sigma)
built <- config$build(seed = 12345) # or maybe <- stan_build(config)
post <- built$sample()

This reflects some but not all of the opinions that were expressed this morning. I don’t love stan_config because I don’t think users mentally associate the data the are passing to Stan as “configuration” and things that they do mentally associate with “configuration” (like max_treedepth) would not be specified until later. Anyway, config is pretty much a var_context, and there was a minor open question as to whether config should have a field for the PRNG seed that could be set like config$seed <- 12345 or if seed should be argument to the build method, which is instantiates the thing on the C++ side.

There were also differing opinions as to whether $sample should be a method of built or a method of some other class that only exposes methods for algorithms. In that case, it could go like

algo <- built$algo()
post <- algo$sample()

and built would only expose low-level methods like log_prob. I don’t think enough R users would call the low-level methods to justify making everyone the rest of them take an extra step.

One of the bigger open questions is whether to make the names of the estimating functions in the interface correspond exactly to the C++ functions or to make the names shorter and the argument list longer. For example, $hmc_nuts_diag_e_adapt() vs. $hmc() with defaults diag = TRUE, adapt = TRUE, metric = "Euclidean". With long names, it is possible to enumerate exactly the arguments that are accepted by the estimation function. With short names, some R arguments could render other R arguments moot, as in $hmc(adapt = FALSE, delta = 0.9). Either way, $sample() is basically an alias for default MCMC, which is what 99% of R users would be calling. Also, there is the question of what to do about degrees of adaptation. In some cases, the user just wants to pass an initial guess of the mass matrix but adapt it, other times fix the mass matrix but adapt the stepsize, and other times fix both the mass matrix and the stepsize.

There was also a lot of discussion about the output, but that can mostly be figured out later. In the interim, it would be good if there were

  • a virtual class that is inherited from rather than prob_grad and requires things like log_prob to be specified by the inherited class
  • a static method that could be called on the compiled but uninstantiated C++ object that would return the types of the things declared in the data block of a Stan program
  • a method that could be called on the instantiated C++ object that would return the types, sizes, and block of things declared in the parameters, transformed parameters, and generated quantities block
  • more progress on the var_context refactor

Ben


#2

Thanks for writing this up.

Would you mind producing issues for the things you want in stan-dev/stan? In particular, clarifying what you want in the way of virtual methods in the base class.

I think we’ll be able to knock out these model changes pretty quickly. I’ve been wanting to add things like this for a long time and all the info’s there to generate it from the AST.

There are four distinct steps that we need to get from a program to a density function:

  1. Stan program (string—a program in Stan language): supplied by user

  2. Transpiled Stan program (string–a class definition in C++): produced by running stanc on the program

  3. Object code: produced by the C++ compiler (results in Rcpp object)

  4. C++ object instantiated with data:

I don’t think a Stan user is ever going to want to inspect the C++ in the usual course of using Stan (though maybe for optimization).

I think from an R client perspective we want something like:

joint <- stan_compile("parameters{simplex[3]a;}")

posterior <- stan_condition(model, data = ..., seed = ...)

posterior_sample <- stan_sample(posterior, ...config...)

We should be able to get the program variable shapes and block definitions from the joint, and we should be able to use the posterior as a density, as in

ld <- posterior$log_density(parameters = ...)

For names, we could go with something based on the name of what’s being produced, e.g., stan_joint and stan_posterior and stan_fit as the names of the functions if we wanted to use the nouns for what they produce rather than the verb for the process of creating them (though sample has a joint appointment in the noun and verb department).

Ben defined things a bit differently with data being attached bit by bit, and then a builder-like pattern used to instantiate, all with methods:

joint <- stan_compile("...")
joint$data(...)
...
joint$data(...)
posterior <- joint$build(seed = ...)
sample <- posterior$sample(...)

I’m not keen on the incremental data attachment as I don’t know what to name the varaible that starts as the joint and transforms into the posterior. I could see how the OO approach in calling sample as a method on the posterior density would make sense in terms of discoverability—though it disagrees with Michael’s preference it does help with his concern about discoverability of alternatives (though a list of 20 might be a bit daunting if they’re not organized some way).


#3

This is mostly consistent with the interfaces API spec, User Interface Guidelines for Developers (Wiki), we worked out awhile back. That is the API I’m planning on using for PyStan 3.

For Python using an object-oriented API (e.g., something like “posterior.log_density(params)”) is what users will expect.


#4

We should use the C++ function names exactly and refer to/copy their documentation (perhaps automatically?). Having to keep 3 different sets of docs is unnecessary work. The verbosity of the function names should not bother us because >90% of users will be using generic sample and optimize functions.


#5

I am fine with that, although I think optimize should be called maximize. Other people were concerned about the long cryptic names.


#6

I don’t really know what to call the object either (in part because I have gone back and forth on what to name its constructor), but I think that is a pretty minor thing. In RStan 2.x, it is not uncommon (for packages) to do something like

sm <- stan_model("8schools.stan")
datlist <- list(J = 8, y = y)
datlist$sigma <- sigma
post <- sampling(sm, data = datlist)

where you make an empty or partially filled list whose names correspond to the objects in the data block and fill it in stages, particularly when there are a lot of objects to pass.

Under my proposal for RStan 3.x, it is pretty similar but slightly more structured. You could do

thing <- stan_compile("8schools.stan")
thing$data(J = 8, y)
thing$sigma <- as(sigma, "vec")

Since thing has named fields for J, y, and sigma, R at least knows what types those fields have to be. But this is all before any C++ gets called, so it is just up to the caller to ensure that the fields are set correctly before calling $build() to instantiate; otherwise you will get an error from C++. And you can change the fields of thing and call $build() again.


#7

OO isn’t what R users will expect (at least for a few more years), but I think we should do it for RStan 3.x anyway. And if we do it, it is better to go almost 100% rather than 50% OO and 50% functional. The main reasons to favor ReferenceClasses are:

  1. Less dissimilarity between RStan and PyStan, which is good when giving talks to audiences that are mostly R users but have a non-trivial proportion of Python users (or vice versa)
  2. Easier to figure out what (is possible) to do. In R, if you happen to know that you need to call the sapply function, then you can execute args(sapply) or help(sapply) to ascertain that you need to pass a vector and a function to be applied to each element of that vector. But what is far more common is that you have a vector and know that you want to apply some function to each of its elements, but you have almost no way of figuring out that you need to call a R function that happens to be named sapply. With Python or Pythonic R syntax, you just hit Tab after typing the object name and it shows you the universe of possibilities.
  3. Avoiding name clashes with exported functions from other packages. To avoid this problem now, you have to use really esoteric function names or prefix stan_ to a more generic name like sample in order to avoid clashing with the sample function that comes with R. If we have to prefix almost everything with stan_ there is more typing and less clarity as to which of many functions starting with stan_ needs to be used at the moment.
  4. Locking. When appropriate, you can enforce that fields can only be assigned once.

#8

I agree with what Ben says here about the UX in R. I would also advocate having the object call free functions in its methods for the operations that are more broadly applicable. For example you ought to be able to at least do rstan:::r_hat(x) (yes, three :), where x i.s an array (chain x parameter x sample) or something like that. It’s really irritating to have to build a whole object just to call one method.


#9

The real question is whether it’s worth inverting control for discoverability. That is, should it be

joint <- stan_compile(program = ..., ...);
posterior <- joint.condition(data = ..., seed = ...)
posterior_sample <- posterior.sample(...)

rather than

joint <- stan_compile(program = ...)
posterior <- stan_condition(joint, data = ..., seed = ...)
posterior_sample <- stan_sample(posterior, ...)

Ben makes good points that it’s shorter. And discoverability seems important to a lot of people (which will argue for not having 20 methods because those will scroll too long to be really useful). In fact, it’s the canonical use of OO in that it’s converting a standalone function (or constructor) into a (factory) method on its first argument. The reason I think Michael and I are balking a bit at this is that it seems strange to have the sampling options be methods on the posterior—it breaks a kind of encapsulation on the posterior where it can be viewed simply as a density. From an OO perspective, it seems more natural to have the sampler hang on to a reference to the density rather than having the density construct the sampler. I know I sound like a broken record on modularity/god objects and on immutability, but that’s because avoiding the former and embracing the latter is the cornerstone of sound and modular OO code.

P.S. There’s no “pure OO” because you need to construct objects before you can use them; so there’s always the tension between what should be a constructor for an object versus a method on one of the arguments to the constructor. That’s essentially the discussion above—should constructing a sampler be done through a sampler’s constructor or through a factory method on the density?


#10

Somewhat orthogonal to the rest of this thread, RStan 3.x is planning to have S4 types for all of the types of objects in the Stan language. Methods like $sample() would return an object that has fields for each parameter, transformed parameter, generated quantity, and diagnostic that are the corresponding types but having trailing dimensions of iterations x chains. I guess r_hat could be a S4 generic that could have a method for these types and for a plain array.


#11

That sounds graet. It’d bring us much closer to something like the PyMC3 interface, which lets you do lots of cool stuff like this with specific variables.


#12

I think it is fair to say that the Stan program (e.g. 8schools.stan) can be viewed as defining a density. But I don’t see why the instantiated version of

class model_thing : public prob_grad {
...
}

would be thought of as a density or even from a statistics perspective. Its log_prob method evaluates a density, but stuff like write_array is just I/O.


#13

OO in R is a bit unorthodox, but calling a method does not change the object into an object from another class. What you are calling joint (which also seems a bit weird to me) is like a C++ class where the public members are not const. They actually could be made const and / or input to the constructor, but that doesn’t gain us anything on the R side.

It is unusual to have a method of one class to be the constructor of another class, but what you are calling posterior isn’t really connected to joint, in the sense that it does not inherit from joint's class or embed joint as one of its fields. You can rm(joint) after creating posterior and R doesn’t even crash.

There’s something to be said for algorithmic methods like $sample, $maximize, etc. being separated from algorithm-agnostic methods like $log_prob but 99% of users would want to go directly to that rather than first instantiating the C++ object and then not doing anything with it except doing whatever needs to be done to make the algorithms available. Maybe

compiled <- stan_compile()
compiled$data()
instantiated <- compiled$instantiate() # usually skipped by users
posterior <- compiled$build() # calls .self$instantiate() internally but returns an object that has algorithms as methods
draws <- posterior$sample()

#14

For Python users, the first is far more natural. This is what I’m thinking it will be:

MyStanProgram = pystan.compile("parameters{simplex[3] alpha;}")  # returns a class
program = MyStanProgram(data)  # instance of MyStanProgram, has methods of C++ stan_model
program.log_density(params)
fit = program.sample()
fit.alpha  # gets params

Hope this useful for the discussion. I think it’s important that we do make it easy to move between RStan and PyStan for people who need/want to.


#15

I think I’m missing something fundamental here. Why do you have

program = MyStanProgram(data)

rather than

program = MyStanProgram.condition(data)

if methods are always preferable?

Does program.log_density(params) just return a value? If so, why not treat that as a functor too, as in program(params)?

I’m not trying to be a troll here, I just don’t get what the motivations are for doing things one way or another. When I look at model specifications in Edward or PyMC3,they don’t look at all object oriented.


#16

I wasn’t worried about that. And I know there’s no inheritance among any of the objects I laid out. I’m more thinking OO in terms of method dispatch (vs. functions) than in terms of inheritance.

Nice to know that the posterior doesn’t need to maintain a reference to the compiled model.

I completely agree that having some object with algorithms hanging off it will be clunky—the only advantage is that it enables that tab-based discovery—you only have to remember algorithms or whatever. I don’t know how serious a concern this is, but it keeps coming up.

The reason I used the variable name joint is that its value defines the joint density in most of our Stan programs. It really only needs to define a density that’s proportioanl to the posterior when instantiated with data. So you can think of a joint density as a function that maps a data set into a posterior density or you can think there’s some external function that maps a joint density and data set into a posterior—that’s what I think we’re talking about. The reason I called the result of combining the joint with data y the posterior is that we can use that object to calculate the posterior up to a constant. We wouldn’t need to use those names in examples, it’s just how I was trying to illustrate what each piece was functionally.


#17

One reason is that it aligns with what is happening underneath the hood in C++ land ( e.g., there’s no .condition method in stan_model.hpp).

The Python APIs with which most Python users will be familiar are numpy, pandas and scikit-learn. scikit-learn’s API is fairly close to what I sketched.

I’m happy with whatever we decide on. Small differences will be inevitable due to differences in Python/R idiom.


#18

Still confused. There’s no sample() method anywhere in the C++. There’s

  • a constructor for instantiating the code-generated model class with data, and
  • a bunch of standalone functions for sampling and optimization in the services namespace.

I don’t know scikit-learn and I’m not quite sure what pieces of what you propose you think will be familiar. I would’ve thought that something like PyMC3 might be familiar and what they’re doing doesn’t look anything like what we’re doing (I think that’s why they’re saying that whatever we do won’t be “Pythonic” because it’s a standalone domain-specific language rather than an embedded one).

How would scikit-learn set up a custom regression or something similar to what Stan does?


#19
  • sample() is an alias for the default sampler. Now it would call hmc_nuts_diag_e_adapt. I think we all agree on this strategy? maximize() will work the same way, calling the relevant default.

  • I can think of a few reasons to avoid the .condition method besides its being unpythonic. The strongest reason is that it’s needlessly verbose. If the user can’t do anything useful with MyStanProgram until they call MyStanProgram.condition(data) why not just eliminate the condition bit and have them use MyStanProgram(data). This will save 9 characters in every PyStan script.

  • To respond to your other question, here’s what fitting a Gaussian mixture looks like in scikit-learn:

    model = sklearn.mixture.GaussianMixture(n_components=2)
    model.fit(X)
    # params available after fitting:
    model.means_, model.covariances_, model.weights_
    

    I suppose fit in this case works like maximize() would for us, without the data.

We do seem to agree that the program object/instance needs to get the data somehow before calling sample or maximize. Seems like we don’t agree on the exact function/method name.

Also, I thought we were going to move towards calling things “programs” rather than “models”. Shouldn’t this move push against using method/function names like “condition” (and vars like “joint” and “posterior”)?


#20

The key thing is that where you have

program = MyStanProgram(data)
fit = program.sample()

scikit-learn seems to do this in one go as

model.fit(X)

I like what you have better, but not your name. The program is MyStanProgram, not MyStanProgram(data). The latter is the combination of the program with data. I don’t want to call it model, either, which is why I called it posterior. It is an object that computes the log posterior and gradients for the given data set.

Is there a style guide somewhere as to what counts as “pythonic”? I can’t make the generalizations from the examples.

It would help if you told me what’s non-pythonic with replacing your functor usage MyStanProgram(data) with a method usage MyStanProgram.condition(data) other than the extra lines. Do you think that sckit learn should’ve used model(X) to be more pythonic and save four characters?

P.S. Programs define models. I think of the model as being the joint density, as Andrew lays out on page one of BDA. Then when we add data to the model, we get an instantiated posterior density. Sorry if I’m not always consistent, but this is what I’m trying to do.