[Interface roadmap] fit objects and `extract`

Hey all,

I wanted to break out this topic from the larger interface roadmap draft. We need to decide what the posteriors returned from the fit objects will look like. I asked Ben about replacing “extract” and he wrote the following:

For RStan3, I don’t think there should be a standalone extract function (or many standalone functions for that matter) because functions with names like extract conflict with functions that have the same name in other packages.

There could be an $extract() method for the Reference Class that holds the output, but I think it would be kind of unnecessary. I think users would be able to do foo$theta to get the draws of thetarather than foo$extract(pars = “theta”) or something like that. The user wanted all or most of the parameters, then as.matrix, as.data.frame, etc. would presumably be more useful.

So, could go either way on an $extract method, but the main points (which I think were agreed upon in 2014) were that

  • The permute argument should die
  • The order should be StanDimensions x Chains x MainIterations so that it is easy to add more iterations to the back and so that you don’t have to do anything if you want to analyze the thing with the distinction between chains and main iterations erased.

Do you all still agree on this direction? Here I’ll attempt to write it up declaratively as it would appear in the end result roadmap document, noting places where I’m filling in specifics that weren’t specified anywhere and could use comment with italics and question marks. I was unsure about most of these details, so please correct me as needed.

Returning results

The language interfaces will return an object representing the results of fitting a Stan model (henceforth called a “StanFit” object, though the name isn’t important) that allows attribute access to each of the named parameters, transformed parameters, and generated quantities returned by running a Stan inference algorithm on data. The iterations will be stored flat (?) such that the first dimension is the MainIteration number, the 2nd dimension is the chain number, and the rest of the dimensions are the StanDimensions (? is this right or backwards? see example:). For example, if we have a two dimensional array parameter real theta[2]; and the following fit result containing two iterations on two chains:

iteration chain theta.1 theta.2
1 1 11.1 11.2
1 2 12.1 12.2
2 1 21.1 21.2
2 2 22.1 22.2

The results would be a single dimensional (flat) array as the following: fit.theta = [11.1, 11.2, 12.1, 12.2, 21.1, 21.2, 22.1, 22.2].

Each interface will also provide methods for translating into a 2d matrix or dataframe style object with the appropriate columns for chain id, iteration id, and each of the Stan dimensions of each of the parameters, essentially just like the table posted above. There will be no options to change the ordering (i.e. no permuted or permute flag), and there will be a separate method to retrieve the warmup samples (?).

//cc @bgoodri @ariddell @jonah @paul.buerkner @ahartikainen @Bob_Carpenter @mitzimorris

Particularly curious if we want them to be flattened or to have the theta[iter][chain][standim1][standim2][...] structure. Thanks all!

1 Like

My sense is that User Interface Guidelines for Developers captures the results of previous discussions and describes how results will be returned. Perhaps we could discuss specific changes (if any) to that document instead of starting afresh?

2 Likes

I didn’t know that page existed! Sorry y’all. I wish I had known to ask for this before the roadmap meeting, haha. One thing that I don’t know how to interpret on that wiki page is what it means when someone writes [ params x chains x iterations ]. Is it as I specified above using the table to flattened structure conversion?

Backwards. For example,

parameters {
  vector[N] theta[K];
}

should by default come out N \times K \times 4 \times 1000 or equivalently N \times K \times 4000

I don’t like that we would start to remove dimensions if they are 1. That is going to be real pain in the long run (because we don’t have named dimensions).

For example, user takes a parameter and transforms it to array with shape of (20, 3, 4, 4000) and shares it with another user.

Was the original shape 20,3 or 20,3,4?

If a parameter has a dimension of 1 somewhere then there should be an array index for it. I was just saying that functions in the interface can keep the distinction between the last two dimensions when doing diagnostics and ignore the distinction between the last two dimensions when doing summaries, but that is only possible without reallocating memory when the chains and main iterations are the last two dimensions.

Oh yes, my mistake. You have 1000 vs 4000.

I’m still confused about what this N x K x 4 x 1000 notation corresponds to, especially since multiplication is commutative 😛. Since you said my example is backwards I’m guessing that might be corresponding to 11.1, 11.2, 21.1, 21.2, 12.1, 12.2, 22.1, 22.2 for the table above, is that right?

For a parameter declared as

real theta[2];

the resulting array would be 2 \times \mbox{chains} \times \mbox{iterations}. In R and Python, the first six iterations on two chains would look like

> array(LETTERS[1:24], dim = c(2, 2, 6), 
        dimnames = list(theta = 1:2, chain = 1:2, iteration = 1:6))
, , iteration = 1

     chain
theta 1   2  
    1 "A" "C"
    2 "B" "D"

, , iteration = 2

     chain
theta 1   2  
    1 "E" "G"
    2 "F" "H"

, , iteration = 3

     chain
theta 1   2  
    1 "I" "K"
    2 "J" "L"

, , iteration = 4

     chain
theta 1   2  
    1 "M" "O"
    2 "N" "P"

, , iteration = 5

     chain
theta 1   2  
    1 "Q" "S"
    2 "R" "T"

, , iteration = 6

     chain
theta 1   2  
    1 "U" "W"
    2 "V" "X"

where, for example, U & V are the realization of theta on the sixth main iteration of the first chain.

Okay, I’m not good enough at R to easily parse that, let me try popping back to just a table and a list to try to work with the example. So here you’re using letters instead of numbers and saying the list is just A-X in order, [A, B, C, D, E, F, G, H...]. We’d parse that list as the following table such that the thetas are nested within chain which are nested within iteration like so:

iteration chain theta.1 theta.2
1 1 A B
1 2 C D
2 1 E F
2 2 G H

Is that right?

Here is some viz converning this issue. It’s in row-major, but basically same as F, just in another direction.

https://eli.thegreenplace.net/2015/memory-layout-of-multi-dimensional-arrays

Thanks for posting the nice visualizations. I just want to make sure can explicitly write out our own example for the roadmap doc that is exactly what we want so we make sure we are all on the same page and we make things easier for implementers.

1 Like

Ah, there’s something I really like from the link that I’ll add to the doc - we can talk about which indices “change the fastest.” Here I think we are saying that StanDimension changes fastest, chain 2nd fastest, and iteration 3rd fastest.

Right now, as.array() on a stanfit gives an array of iterations x chains x parameters.

m <- rstanarm::stan_glm(Sepal.Length ~ 1, gaussian(), iris)
dimnames(as.array(m$stanfit))
#> $iterations
#> NULL
#> 
#> $chains
#> [1] "chain:1" "chain:2" "chain:3" "chain:4"
#> 
#> $parameters
#> [1] "(Intercept)"   "sigma"         "mean_PPD"      "log-posterior"

1 Like

Yes, what we are doing now is how it has been done since the 1990s and we want to change it for Stan3.

Cool! So reading between the lines a little here it sounds like this post got the ordering right. I’ll write something up with that and also describe it with that “which index changes fastest” idea.

Thanks all!

I would suggest asking @andrewgelman, who often has opinions on interface issues.

On the specific topic of fit$extract(...) as extract(fit, ...), they’re interchangeable to me.

On the other hand, fit$theta feels odd to me as it feels like the structure of the fit object changes.

I’m all trial and error and str() in R, so it doesn’t matter that much to me.

As for StanDimensions x Chains x MainIterations, yes, that makes most sense to me (assuming we mean C-style indexing here, where we first provide a Stan dimension (picking out a single scalar parameter) then get a collection of chains for that parameter). It lets you get the chain x iterations for each parameter dimension, which is what we need for the posterior.

I almost never look at my own chains, so it’d be nice to also have the collapsed option of getting StanDimensions x Draws. It doesn’t need to be permuted.

Is the Stan fit object in R going to be like the current one in that it includes input, model binary, etc.?

What’s the purpose of flattening theta[2] (a size 2 array or 2D vector)? I won’t ever have a use for that. What I think we need is be able to give a parameter (say in string form with indexes as it’s done now, say theta[1]) and get either

  1. N chains, each consisting of M draws of theta[1], or

  2. N * M draws of theta[1].

I’ve never needed (1).

What I do find clumsy is having to use paste to manipulate strings:

print(fit, pars = c("alpha", paste("theta[", 1:5, "]"))
1 Like

The theta element of fit would be read-only.

No, I don’t think so. I’m not entirely sure because I think we will want to support a workflow where the user asks for additional main iterations after the first batch of iterations has been obtained, but we may just require the user to pass back the first batch.

Not much. I think the interfaces would be much better if users could access things with the Stan dimensions preserved. It is more if the user wants all of the parameters flattened into a matrix.

I’d like to have fit$theta because this comes up a lot! I’d also like to do something like foo(fit$theta, median) or foo(fit, “theta”, median) that would grab the median over the simulation draws (or it could be foo(fit, “theta”, mean) or foo(fit, “theta”, quantile, c(0.25, 0.75)) etc. Currently I waste a lot of time in R grabbing simulations, then trying to figure out how to use “apply” or whatever to get summaries. This is a particular issue if theta is a vector or matrix or array. What I’d generally like is pointwise summaries; e.g., if theta is a 2 x 3 x 4 array, then fit$theta would return a n_sims x 2 x 3 x 4 array, and foo(fit, “theta”, median) would return a 2 x 3 x 4 array.

I’m also thinking the default would be to concatenate all chains together, but of course there should also be an option to grab one chain at a time, which could come up for diagnostics or whatever.

1 Like

I meant that the signature of the object varies based on which variables are in the Stan program. Not that it’s mutable. Though I’m happy it’s not mutable.

My background in Java and C++ makes this seem weird, as they have fixed signatures defined at compile time.