RStan3 and PyStan3 Interface

scikit-learn appears to have a model without data class and then a fit class, but not a model with data class.

Example 1

http://scikit-learn.org/stable/auto_examples/plot_cv_predict.html#sphx-glr-auto-examples-plot-cv-predict-py

lr = linear_model.LinearRegression()
boston = datasets.load_boston()
y = boston.target
predicted = cross_val_predict(lr, boston.data, y, cv=10)

So that’s broken down into loading the model, then calling a big function to do cross_val_predict is some mashup of multiple functionality.

Example 2

Then I looked for a GLM:

http://scikit-learn.org/stable/auto_examples/linear_model/plot_ridge_path.html#sphx-glr-auto-examples-linear-model-plot-ridge-path-py

alphas = np.logspace(-10, -2, n_alphas)
clf = linear_model.Ridge(fit_intercept=False)

coefs = []
for a in alphas:
    clf.set_params(alpha=a)
    clf.fit(X, y)
    coefs.append(clf.coef_)

This also has model without data (clf). But in this case, it creates a god object that holds not only the model, but then gets modified destructively to set parameter values to constants and another method to do a fit, which modifies it destructively, then fields that contain the fitted values.

I’d rather not follow the scikit-learn model if this is typical. This is very much like what R would let you do with glm(), but they’d just kick off everything in one line, then let you start doing the modifiers.

I wouldn’t be opposed to the “one function” approach provided there was
some modularity. So “sample” and “optimize” and “variational” could be
functions (but we would kill the monster “stan” function).

One thing I want to avoid, however, and which could be a concern with
the one-line approach, is this nesting of the compiled program inside
the fit object. This does strike me as completely unnecessary coupling.
The fit object should just hold the draws or the optimization result. It
does not need to hold on to the entire compiled model.

If we use the one-line approach and the fit result does not store a copy
of the compiled model, how would the user store a copy of the model if
they were interested in doing so? Would we also have a two-line API
for doing things, where one first compiles the model and then draws
samples from it?

My feeling is that in order to get natural return objects we need to have three stages, which I repeat:

  1. compile Stan program (translate to C++, compile and link C++): results in compiled model that can be reused with other data

  2. add data to program: results in density function that can be called for log density and derivatives and transformations

  3. sample: results in a sample of multiple chains consisting of multiple draws

(3) is made up of multiple calls to sample individual chains with different chain IDs that then get put back together, whereas (1) consists of translating the Stan program to C++, compiling the C++, dynamically linking the C++.

If I understand correctly, RStan currently provides (1) with stan_model(), (2+3) with sampling() and optimizing() by passing in the result of stan_model(), and (1+2+3) is combined with top-level call tostan()` (which only does sampling with HMC). I believe Allen is suggesting providing (1+2) and (3) in Python.

I agree with Allen that it’s bad to combine output. That’s why my own preference is to separate (1), (2), and (3). If you want to start combining them, then things like the compiled and linked code will be an implicit global.

At this point we’re going around in circles, so I think you guys need to make a decision. I don’t think either Python or R hew very strongly to any kind of programming idiom or notion of modularity across their popular packages. They both tend to jam smaller operations together into large composed calls, whereas I come from a more Unix, C/C++, Java religious background where modularity is king (because it’s easier to document and test and gives the user more flexibility to compose their own operations). If we look at particular Python packages,

  • PyMC3 cooks the data and model together
  • scikit-learn compiles the model with some data (in the form of fixed model parameters), but then leaves the data contributing to the likelihood for another call;
  • Edward has a concept of model with lazy data, so it can presumably customize just what data’s baked into the model and which is provided at runtime

I just don’t see consistent usage we can follow.

1 Like

Internally, we should/must separate 1, 2, and 3 (@Bob_Carpenter’s numbering) . I don’t think this is a problem.

It sounds like we have a rough consensus on Proposal C for the main user-interface.
Does this sound about right @bgoodri ?

I’m happy with this proposal (Proposal C) as the RStan/PyStan 3 primary interface. Where do all the arguments go in this setting. Do they all get attached to compile? So, like this:

posterior = pystan.compile(program_code_or_filename, data,
                           random_seed=1, num_samples=500, num_thin=2)
output = posterior.sample()

Or do we want .sample() etc to take all parameters other than data and random_seed?

I should think so.

Is it true that if the model were estimated twice with different data, then PyStan will skip the compilation the second time (as in RStan). If so, then I don’t know if compile is the right way to do the verbing.

Sounds good.

Is it an error if they provide random_seed in the methods like sample?

If we have a mechanism to change the seed after instantiation, I guess not. But I am not seeing a use case where that would be necessary.

num_samples and num_thin are control parameters for sample() —they shouldn’t be arguments to compile().

The only reason there’s a random seed in compile is that what you’re calling “compile” is doing both the compilation (which does not need a random seed or data) and the conditioning (which requires data and the random seed to initialize the data block of the class that was compiled).

Is there a reason you’re cutting out one intermediate step and not both? Then you could just use this:

output = pystan.sample(code, data, seed, num_samples, num_thin)

The real question is what happens to the 20 or so versions of sampling we have (unit, diagonal or dense metric, adaptation or not, NUTS or static HMC, and eventually Euclidean and Riemannian). Are you just going to configure that all through sample()?

I think we agreed that $sample() was just an alias for the default sampling algorithm, but there is a method for each of the 20 variants with the same name as the name of the C++ function.

That seems fine to me.

Could someone make some summary of the discussion? Wikipage with a proposal?

Aki

I don’t inkow how to link posts, but if you go about umpteen posts up, there’s some summary from Allen. Not much to summarize.

There is a wiki page here


but it is not completely up-to-date with what has been discussed / agreed to / disagreed on.

Click on the time stamp on the top right of the desired post.

What do people think about stan_setup for the name of the first function that the user would call, which would both compile and instantiate with data? It would look like

stan_setup <- function(file = file.choose(),
                       model_name = NA_character_,
                       model_code,
                       includes = character(),
                       seed = sample.int(.Machine$integer.max, 1),
                       ...) # to pass data objects

Will there still be an option to compile model without data and then run the model parallel (example different machines) for different datasets? E.g. Bayesian filter that is run for multiple measurements?

It that case, I think the user would end up calling stan_setup multiple times with different data, but it should only actually compile once and pick up the compiled version thereafter.