Replacement for permuted=TRUE (RStan 3 / PyStan 3)

ariddell · July 13, 2017, 9:47pm

What should be the method of accessing samples in RStan 3 / PyStan 3? I’m assuming we’re all happy with getting rid of permuted (see https://github.com/stan-dev/rstan/issues/345)

So we have a fit for eight schools and we are interested in getting samples of real eta[8];

We have our fit object, upon which we use extract.

extracted <- extract(fit)
draws <- extracted$eta

What shape is draws? Is it (num_chains, num_samples, 8)? Sounds right to me but this will be an adjustment for users.

bgoodri · July 13, 2017, 10:11pm

Yes, we should definitely eliminate the concept of permuted.

The function name extract is bad in R because it clashes with an extract function in one of the hadleyverse packages. Actually, stand-alone functions in general are bad these days because there are 10000 packages to clash with and increasing by a few hundred every month.

So, I am pretty strongly in favor of

eta_draws <- fit$eta

rather than something like

eta_draws <- fit$extract("eta")

If you wanted all of the parameters, then you would be doing something with a specialization of a common generic function like as.matrix, as.data.frame, as.list, etc.

Also, eta_draws would be a instance of a S4 type that mostly mirrors the types in the Stan language, so you can do things like transform it to the unconstrained space.

As to what shape eta_draws is, I think it should be the same shape as in the Stan declaration, plus one additional trailing dimension that is iterations x chains. For things like convergence diagnostics, we need the ability to apply functions to the chains separately but for most things, we want to lump the chains together. Both uses are easy to accomplish in R because R is just storing it as a long numeric vector anyway and the dimensions are just metadata (attributes). So, you can get either shape without copying the values.

betanalpha · July 14, 2017, 12:45am

And for all that is good in this world stop stripping away the sampler diagnostics. At this point I would also prefer having iteration and chain number added as extra columns so that we can completely avoid any ambiguity over the order of the samples.

ariddell · July 14, 2017, 12:49am

This all sounds good. Default shape is the question: num_chains x num_samples x param_dim or num_samples x param_dim x num_chains ? (just thinking of scalar and vector params for now) Is there a right answer here?

betanalpha · July 14, 2017, 12:51am

It doesn’t matter provided that there are functions to collapse the default output into parameter vs iteration.

ariddell · July 14, 2017, 12:54am

Any suggestions for a function name to do this? Do we want to do this by default, assuming the vast majority of users will not care about draws by chain?

bgoodri · July 14, 2017, 12:58am

I think param_dims x (iterations_chains). This is different from the way it has been done, but it gives the opportunity to the calling function (mostly in packages) to set rownames and / or colnames easily so that those identifiers show up in the output rather than beta[7,3].

betanalpha · July 14, 2017, 1:05am

Depends on the desired workflow. Would the R-hat or E-BFMI calculation be done on the fit object or the extracted samples object? If the former then chain id information can be hidden in the fit object and extracted with special functions if desired by the user.

bgoodri · July 14, 2017, 1:10am

Both, I think. But typically the user would invoke fit$Rhat() or whatever and that would call Rhat on each variable that was saved, which would calculated it on each margin. For R-hat, you usually want it on everything, but for something like mean() you might want it for theta but not rho.

bgoodri · July 14, 2017, 1:13am

But we need to define a Rhat, n_eff, etc. S4 methods for the StanTypes in case someone does something like

exp_theta <- exp(fit$theta)
n_eff(exp_theta)

ahartikainen · July 14, 2017, 5:29am

Are we going to store raw draws as an array or do we put them in to some subclass (python here). If we store them in a “draws” class we could in principle define function that need the chain info. This is something similar that PyMC3 does (MultiTrace object).

Some examples.

fit.theta.draws # raw array
fit.theta.rhat # value
fit.theta.chain_0 # array
fit.theta.mean() # value
fit.theta.mean(chains=True) # value array
fit.theta.ci(alpha=0.05)

ariddell · July 14, 2017, 12:37pm

I think we all agree that these fit objects should support all sorts of functionality and reshaping operations. Sounds like this will not be a problem in R or Python.

I do think people should get the same default result in RStan and PyStan, so the question is what do we want

# R
fit$eta

and

# Python
fit['eta']  # and/or fit.theta

to return?

Seems like there are a few options:

Option A: num_chains x num_samples x param_dims
Option B-I: rearrangements of the dimensions above (but A seems natural to me)
…
Option J: num_samples x param_dims # chains collapsed
Option K: param_dims x num_samples # cahins collapsed
Option L: no default. You must use something like the PyMC multitrace attributes to specify more precisely what shape you want. So fit['theta'].draws for raw access and fit['theta'].collapsed for

Option J is identical to current behavior in Python (except the draws are permuted).

betanalpha · July 14, 2017, 1:37pm

A significant utility of this kind of output is transportability to other plotting/analysis packages so we want to make the default output as simple as possible. If the user specifies by parameter name then it should be a single concatenated list/column of samples. But the user should also be able to specify multiple parameters, so I don’t like the fit$param_name pattern. Rather we should have something like fit$samples('name1', 'name2'). Then fit$samples with no named argument would return all of the samples including the sampler parameters and additional columns for chain number and iteration number (these will be especially useful if we ever end up with ragged chains). From this output R-hat and other multi chain diagnostics can all be calculated independently of Stan (for example, with bayesplot).

If people wanted more elaborate structures then there could be fit$samples_by_chain methods and so on.

In particular I think we should make the fitting as separate from the diagnostics/analysis as possible, moving away from the monolithic pattern of PyMC. Not only is this more modular it allows us to put more stuff in bayesplot hence get other MCMC tools to use the proper diagnostics.

ariddell · July 14, 2017, 1:46pm

See: https://github.com/stan-dev/stan/blob/develop/src/stan/mcmc/chains.hpp

ariddell · July 14, 2017, 1:57pm

Ok, all this sounds good. Re: multiple params, so with Python we can do

fit['mu', 'eta']

without difficulty. What would you like this to return?

betanalpha · July 14, 2017, 2:07pm

A dictionary with keys “mu”, “eta”, “chain”, “iter”. Or just a dictionary with keys “mu” and “eta” if people don’t want the chain and iteration info there by default. Actually I would be fine with “chain” and “iter” just being selectable parameter names in the call itself. So one could due fit['mu', 'eta', 'chain', 'tier']. And the sampler parameters would be available in the same way: fit['eta', 'divergent'].

bgoodri · July 14, 2017, 2:59pm

This orientation (and constructs like fit$eta) is usually optimal for people (like Andrew) who want to do probabilistic programming stuff with the output. Other orientations are more suited for convergence diagnostics and so forth, but those tasks will almost always be handled by our own functions, in which case the objects can be reshaped internally.

bgoodri · July 14, 2017, 3:02pm

For RStan, there are going to be S4 classes that mirror the possible types in the Stan language. These S4 classes contain an array slot to hold the raw draws. There hasn’t been much progress lately, but what progress there is is under rstan3/ on GitHub, for example
https://github.com/stan-dev/rstan/blob/develop/rstan3/R/AllClass.R#L255

syclik · July 15, 2017, 2:03am

Umm, don’t strip them, but perhaps don’t keep them in the same structure with the same methods. The sampler diagnostics aren’t draws, draws aren’t sampler diagnostics. We typically want quantiles, mean, sample variance, R-hat, estimated effective samples, and e-bfmi over the draws. We don’t want all of that over sample parameters.

This discussion overlaps with the discussion on breaking apart the writers.

If this were done object-oriented, then the return type of something like fit$eta would be something like a draws class that would have methods like r-hat, quantile, etc, whereas something like fit$tree_depth__ wouldn’t. I’m not suggesting these things be given separate classes in R or Python… I’m just trying to illustrate that they are different things.

+1.

syclik · July 15, 2017, 2:07am

@bgoodri and @ariddell, thoughts on how you want to deal with subsets of parameters? That’s currently done in RStan and PyStan, so I don’t want that to be an afterthought as this is being designed.

I think it makes sense to not keep all elements of a symmetric matrix. And there are mechanisms for letting users specify which parameters to keep / omit. Does the fit object keep both the full param dims and the subset around? Is there going to be a way to tell when a run is from a particular program? Two runs may have different param dims depending on runtime arguments.

Topic		Replies	Views
[Interface roadmap] fit objects and `extract` Developers	44	2683	September 17, 2019
Naming help on function in CmdStanR and CmdStanPy to get variable from sample Interfaces posterior-package	25	2241	June 9, 2020
Shredder R package for rstan General	22	2097	January 24, 2020
Various observations on rstan, cmdstanr, pystan and cmdstanpy after teaching with all in parallel Interfaces pystan , rstan , cmdstanr , cmdstanpy	26	4031	October 12, 2021
Interface roadmap - last draft before ratification vote Developers	120	5038	October 7, 2019

Replacement for permuted=TRUE (RStan 3 / PyStan 3)

Related topics