[Interface roadmap] fit objects and `extract`

Hey all,

I wanted to break out this topic from the larger interface roadmap draft. We need to decide what the posteriors returned from the fit objects will look like. I asked Ben about replacing “extract” and he wrote the following:

For RStan3, I don’t think there should be a standalone extract function (or many standalone functions for that matter) because functions with names like extract conflict with functions that have the same name in other packages.

There could be an $extract() method for the Reference Class that holds the output, but I think it would be kind of unnecessary. I think users would be able to do foo$theta to get the draws of thetarather than foo$extract(pars = “theta”) or something like that. The user wanted all or most of the parameters, then as.matrix, as.data.frame, etc. would presumably be more useful.

So, could go either way on an $extract method, but the main points (which I think were agreed upon in 2014) were that

  • The permute argument should die
  • The order should be StanDimensions x Chains x MainIterations so that it is easy to add more iterations to the back and so that you don’t have to do anything if you want to analyze the thing with the distinction between chains and main iterations erased.

Do you all still agree on this direction? Here I’ll attempt to write it up declaratively as it would appear in the end result roadmap document, noting places where I’m filling in specifics that weren’t specified anywhere and could use comment with italics and question marks. I was unsure about most of these details, so please correct me as needed.

Returning results

The language interfaces will return an object representing the results of fitting a Stan model (henceforth called a “StanFit” object, though the name isn’t important) that allows attribute access to each of the named parameters, transformed parameters, and generated quantities returned by running a Stan inference algorithm on data. The iterations will be stored flat (?) such that the first dimension is the MainIteration number, the 2nd dimension is the chain number, and the rest of the dimensions are the StanDimensions (? is this right or backwards? see example:). For example, if we have a two dimensional array parameter real theta[2]; and the following fit result containing two iterations on two chains:

iteration chain theta.1 theta.2
1 1 11.1 11.2
1 2 12.1 12.2
2 1 21.1 21.2
2 2 22.1 22.2

The results would be a single dimensional (flat) array as the following: fit.theta = [11.1, 11.2, 12.1, 12.2, 21.1, 21.2, 22.1, 22.2].

Each interface will also provide methods for translating into a 2d matrix or dataframe style object with the appropriate columns for chain id, iteration id, and each of the Stan dimensions of each of the parameters, essentially just like the table posted above. There will be no options to change the ordering (i.e. no permuted or permute flag), and there will be a separate method to retrieve the warmup samples (?).

//cc @bgoodri @ariddell @jonah @paul.buerkner @ahartikainen @Bob_Carpenter @mitzimorris

Particularly curious if we want them to be flattened or to have the theta[iter][chain][standim1][standim2][...] structure. Thanks all!

1 Like

My sense is that User Interface Guidelines for Developers captures the results of previous discussions and describes how results will be returned. Perhaps we could discuss specific changes (if any) to that document instead of starting afresh?

2 Likes

I didn’t know that page existed! Sorry y’all. I wish I had known to ask for this before the roadmap meeting, haha. One thing that I don’t know how to interpret on that wiki page is what it means when someone writes [ params x chains x iterations ]. Is it as I specified above using the table to flattened structure conversion?

Backwards. For example,

parameters {
  vector[N] theta[K];
}

should by default come out N \times K \times 4 \times 1000 or equivalently N \times K \times 4000

I don’t like that we would start to remove dimensions if they are 1. That is going to be real pain in the long run (because we don’t have named dimensions).

For example, user takes a parameter and transforms it to array with shape of (20, 3, 4, 4000) and shares it with another user.

Was the original shape 20,3 or 20,3,4?

If a parameter has a dimension of 1 somewhere then there should be an array index for it. I was just saying that functions in the interface can keep the distinction between the last two dimensions when doing diagnostics and ignore the distinction between the last two dimensions when doing summaries, but that is only possible without reallocating memory when the chains and main iterations are the last two dimensions.

Oh yes, my mistake. You have 1000 vs 4000.

I’m still confused about what this N x K x 4 x 1000 notation corresponds to, especially since multiplication is commutative 😛. Since you said my example is backwards I’m guessing that might be corresponding to 11.1, 11.2, 21.1, 21.2, 12.1, 12.2, 22.1, 22.2 for the table above, is that right?

For a parameter declared as

real theta[2];

the resulting array would be 2 \times \mbox{chains} \times \mbox{iterations}. In R and Python, the first six iterations on two chains would look like

> array(LETTERS[1:24], dim = c(2, 2, 6), 
        dimnames = list(theta = 1:2, chain = 1:2, iteration = 1:6))
, , iteration = 1

     chain
theta 1   2  
    1 "A" "C"
    2 "B" "D"

, , iteration = 2

     chain
theta 1   2  
    1 "E" "G"
    2 "F" "H"

, , iteration = 3

     chain
theta 1   2  
    1 "I" "K"
    2 "J" "L"

, , iteration = 4

     chain
theta 1   2  
    1 "M" "O"
    2 "N" "P"

, , iteration = 5

     chain
theta 1   2  
    1 "Q" "S"
    2 "R" "T"

, , iteration = 6

     chain
theta 1   2  
    1 "U" "W"
    2 "V" "X"

where, for example, U & V are the realization of theta on the sixth main iteration of the first chain.

Okay, I’m not good enough at R to easily parse that, let me try popping back to just a table and a list to try to work with the example. So here you’re using letters instead of numbers and saying the list is just A-X in order, [A, B, C, D, E, F, G, H...]. We’d parse that list as the following table such that the thetas are nested within chain which are nested within iteration like so:

iteration chain theta.1 theta.2
1 1 A B
1 2 C D
2 1 E F
2 2 G H

Is that right?

Here is some viz converning this issue. It’s in row-major, but basically same as F, just in another direction.

https://eli.thegreenplace.net/2015/memory-layout-of-multi-dimensional-arrays

Thanks for posting the nice visualizations. I just want to make sure can explicitly write out our own example for the roadmap doc that is exactly what we want so we make sure we are all on the same page and we make things easier for implementers.

1 Like

Ah, there’s something I really like from the link that I’ll add to the doc - we can talk about which indices “change the fastest.” Here I think we are saying that StanDimension changes fastest, chain 2nd fastest, and iteration 3rd fastest.

Right now, as.array() on a stanfit gives an array of iterations x chains x parameters.

m <- rstanarm::stan_glm(Sepal.Length ~ 1, gaussian(), iris)
dimnames(as.array(m$stanfit))
#> $iterations
#> NULL
#> 
#> $chains
#> [1] "chain:1" "chain:2" "chain:3" "chain:4"
#> 
#> $parameters
#> [1] "(Intercept)"   "sigma"         "mean_PPD"      "log-posterior"

1 Like

Yes, what we are doing now is how it has been done since the 1990s and we want to change it for Stan3.

Cool! So reading between the lines a little here it sounds like this post got the ordering right. I’ll write something up with that and also describe it with that “which index changes fastest” idea.

Thanks all!