Naming help on function in CmdStanR and CmdStanPy to get variable from sample

Now that folks are using CmdStanR and CmdStanPy to run real models, it’s clear that we need an accessor function to retrieve the named variables in the sampler output structured according to the Stan program variable declaration.

There are issues filed for CmdStanR and CmdStanPy - https://github.com/stan-dev/cmdstanr/issues/183, https://github.com/stan-dev/cmdstanpy/issues/245

RStan provides the extract function, which has some confusing options and a bad name - (cf https://github.com/stan-dev/rstan/issues/118 - basically, extract has meaning in SQL and tidyverse and we should avoid it).

Possible names: get_var, get_variable, get_param - the latter is technically a misnomer - the sampler outputs draws for all variables declared in the parameters, transformed parameters, and generated quantities blocks, so variable is more correct than param.

4 Likes

I used to use extract, but then following Ben and Jonah’s advice, I switched to as.matrix. We use as.matrix in Regression and Other Stories.

I do not pretend that’s a good choice, I m writing this for information purpose.

In MathematicaStan I have used GetStanResult.

I also introduced GetStanResultMeta to get Stan computed quantities like “lp__”, “step_size__” as they depend more on the chosen solver than on the model.

Examples:

GetStanResultMeta[stanResult, "lp__"]  
GetStanResult[stanResult, "beta.2"]        <- the beta[2] scalar
GetStanResult[stanResult, "beta"]          <- the beta vector
1 Like

I just added a Concepts section to the CmdStan3 design doc - https://github.com/stan-dev/design-docs/blob/cmdstan3/designs/0005-cmdstan3.md#concepts

From the interface perspective, the Stan program data comes in the following flavors:

  • input data - the set of all variables declared in the data block
  • output data - the set of all variables declared in the parameters , transformed parameters , generated quantities blocks
  • output meta-data - information on the solver (e.g. output columns of sampler state lp__ , accept_stat__ )
  • runtime data - variables declared in the transformed data block, and in local blocks. Only available as print statement to the console.

I still think that get_variable is the most general way to talk about the output data, but I like Vincent’s suggestion of differentiating between meta-data and model variables.

1 Like

get_samples? Preferably with an option to stack the chains so instead I’d getting a 3 dimensional array you can get a 2D one

get_samples ? Preferably with an option to stack the chains so instead I’d getting a 3 dimensional array you can get a 2D one

agreed - this functionality is available in CmdStanPy - names, however need to be changed.

the CmdStanMCMC object has property sample:
https://cmdstanpy.readthedocs.io/en/latest/api.html#cmdstanpy.CmdStanMCMC.sample

A 3-D numpy ndarray which contains all draws across all chains arranged as (draws, chains, columns) stored column major so that the values for each parameter are stored contiguously in memory, likewise all draws from a chain are contiguous.

and also method get_drawset:
https://cmdstanpy.readthedocs.io/en/latest/api.html#cmdstanpy.CmdStanMCMC.get_drawset

Returns the assembled sample as a pandas DataFrame consisting of one column per parameter and one row per draw.

Parameters: params – list of model parameter names.

But these names don’t match up with CmdStanR, and get_drawset doesn’t do enough for container variables.

Also, this should work for all estimation outputs, not just the sampler, although there it just a single estimate per variable, still, we need to provide the output variable in the right structure, not as a flattened list.

Have you looked at posterior package https://github.com/jgabry/posterior which is already used by CmdStanR? See the readme for introduction and function names. The simplest accessor function is as_draws. posterior package is aware of meta variables and provides easy way to transform the posterior draws to different formats useful in R.

Pinging posterior developers @jonah and @paul.buerkner

1 Like

We have the fit$draws() method in CmdStanR currently, which returns a draws_array from the posterior package that can then be converted to other formats via posterior. However, the posterior package doesn’t have a format comparable to the list returned by rstan::extract() in which each variable has the structure that it does in the Stan program.

For example, if a Stan program has

parameters {
  real mu;
  real<lower=0> tau;
  vector[8] theta;
}

then the list returned by rstan::extract() has named elements mu (length 4000 vector), tau (length 4000 vector) and theta (4000 x 8 matrix). The posterior package doesn’t currently have any format like this. Even the draws_list format has separate entries for each element of theta (i.e. it treats theta[j] like it treats mu and tau and not as an element of a larger structure theta). There’s no way to treat theta as having the structure it does in the Stan program, at least not currently in the posterior package.

@paul.buerkner I know we talked at some point about this but I forget where we left things. On the one hand, we didn’t want the posterior package to have to know anything about Stan. On the other hand, vector/matrix/array parameters are not unique to Stan so the posterior package could in theory have a format for working with them (maybe this is related to the “random variable”-style format we previously discussed, although not sure if that encompassed multivariate RVs).

1 Like

I think the output should minimize the manual handling for post-sampling modelling.

if I have 100 parameters, how should I process my output from mcmc. Should I have a function and a way to unload values as function parameters, or should I create manually 100 variables?

Also, should we consider how “native” broadcasting rules work.

For example

dot(theta,  beta)

How should we think the out when

One is mcmc, other one is static
Both are mcmc

And dev work should have other interface (to select chains etc)

1 Like

to restate what Jonah said, because this is a Stan-specific, downstream analysis packages don’t have the functionality we need here - which is to take the flattened elements of a structured variable back to the corresponding structure. we want the functionality of extract, but a better name.

I think this is the right way to go - that said - get is implicit, result and meta too vague -
how about stan_var, stan_vars and mcmc_data?

Given a program with parameters block:

parameters {
  real alpha;
  real phi[3, 2];
  vector[2] nu;
}

and a resulting CmdStanMCMC object my_fit, then my_draws = my_fit$draws() is a draws_array with column names: 'alpha', 'phi.1.1', 'phi.1.2', 'phi.1.3', 'phi.1.4', 'phi.1.5',' phi.2.1', 'phi.2.2', 'phi.2.3', 'phi.2.4', 'phi.2.5', 'nu.1', 'nu.2'

  • my_fit$stan_var("phi") returns a multi-dim array dimensions ( rows X chains X 3 X 2 )

  • my_fit$stan_vars() returns a map of names to multi-dim arrays - for alpha this is ( rows X chains X 1),
    for phi, as above, and for nu (rows X chains X 2)

  • my_fit$mcmc_data returns a map from names to single column arrays (rows X chains X 1)

1 Like

suggestion from a very wise user: stan_param

this function returns the set of draws corresponding to the named parameter or generated quantities variable, with the corresponding structure.

Why the Stan prefix? That’s surely implied. Also generated quantities aren’t parameters.

1 Like

I agree with both points - I think that “var” (one var), “vars” (dict of vars), and “mcmc_data” (dict of mcmc info - or “mcmc_vars”) are relatively concise and correct.

here’s the reasoning behind the above proposal:

  • “params” because that’s the term used in RStan since forever - this is to help users coming to CmdStanR and CmdStanPy from RStan and PyStan.

  • “stan” because the variable has the same structure as in the Stan program.

maybe consensus on the right name is impossible, in which case, looking for consensus on names that are wrong.

For what it’s worth, I’m more interested in having a dataframe mechanism for handling things that are conceptually grouped, rather than deferring to the lists of whatever.

So I think what tidybayes (@mjskay) does is cool. Look at the spread_draws example on that page.

1 Like

When I find the time to finish the rvar stuff for posterior this summer (if such an interface is still desired) it such support something like this. In fact, the prototype already supports arbitrary-dimensional random variable arrays, I just need to revisit the most recent updates to vctrs to make it work inside dataframes. Then I was going to loop back around to you (@jonah) and @paul.buerkner to see what you’d be interested in for interfaces that use the rvar type (my plan for tidybayes was for dataframes of rvars, which will be very nice I think :) ).

6 Likes

I think the rvar feature in posterior is exactly what we need to represent the structure of Stan (or other parameters) with their corresponding dimension. With regard to the discussion of vars, variables, parms or parametes, in posterior we made the decision to go with variables, in short, because this does include quantities generated on the original model parameters (for more see: https://github.com/jgabry/posterior/issues/1)

3 Likes

glad there’s precedent for “variable” - here are some possibilities:

  • rvar - as above, r for random
  • svar - s for stan
  • mvar - m for model

or just plain ol’ var

as for sampler diagnostic output columns ‘lp__’, OK to go with mcmc_vars or meta_vars ?

To clarify, the term rvar is more an adjoc name that we have used to refer to a specific features temporarility (https://github.com/jgabry/posterior/issues/8). I am not sure with what name we will end up with in this regard.

In terms of the posterior naming conventions, we try to spell most of the things out. Hence, the full name “variable(s)”. This doesn’t mean, we need to use the same spelled out form for CmdStanX of course.

2 Likes

@mjskay sounds good, thanks!

1 Like

Actually @mjskay does tidybayes work with any of the cmdstanr/posterior outputs yet? I tried a little but was getting errors. I don’t see it listed on the models page, so I figured I’d ask before I tried too much.