Interface roadmap - last draft before ratification vote

foo is local to fit, so it is only a (self-inflicted) problem if they do foo <- fit$foo.

Commonly in Python, fit.foo is first reserved for methods, and if there is no method named foo, it can be used for the data. But fit["foo"] should return the data.
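To make that convention concrete, here is a minimal Python sketch. The class name `Fit`, the method `summary`, and the parameter `alpha` are all made up for illustration and are not PyStan's actual API. Attribute access finds methods first and falls back to the data only when no method has that name, while item access always returns the data.

```python
class Fit:
    """Hypothetical fit object for illustration; not PyStan's actual class."""

    def __init__(self, data):
        self._data = dict(data)

    def summary(self):
        # A real method: normal attribute lookup finds this before any data.
        return "method summary"

    def __getattr__(self, name):
        # Python calls this only when normal lookup (methods, attributes) fails.
        if name == "_data":
            # Guard against recursion if _data has not been set yet.
            raise AttributeError(name)
        try:
            return self._data[name]
        except KeyError:
            raise AttributeError(name) from None

    def __getitem__(self, name):
        # Item access always returns the data, never a method.
        return self._data[name]


fit = Fit({"alpha": 1.5, "summary": [0.1, 0.2]})
print(fit.alpha)        # data, because no method is named alpha
print(fit.summary())    # the method wins over the data named "summary"
print(fit["summary"])   # item access still reaches the data
```

The point of the last two lines is that a parameter whose name collides with a method stays reachable through fit["..."].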

My assumption was that extract_one_draw was a function that would extract from an existing fitted model (i.e., an existing set of draws), not a function that would take an additional draw. My use case for extract_one_draw is when I want to make a graph of 10 draws from the fitted model: I’d write a loop and, inside the loop, extract a draw and then use these parameter values to make a graph or simulate fake data or whatever. The point is that the extracted draw would be a list, each of whose elements is the appropriate size and shape to use in calculations.

Before diving into some of the specifics, I have the same confusion as @Bob_Carpenter and @ariddell in the sense that I don’t know what needs to be voted on and by whom. I think it’s great that a roadmap is being put together, but I don’t understand how things made this list. Some of these need a lot of clarity and could use real proposals to even get a rough consensus. Rather than fleshing these out, it seems like a few individuals were told to make decisions on behalf of the development community. Fundamentally, that seems like a problem. An alternative would have been to gather what’s been done and what’s being actively worked on.

Specific comments below.

  1. What are the “interfaces”?
  2. Is this supposed to be guidelines for the interfaces or are they requirements that interfaces will have to follow?
  3. Does the C++ need to align to this? What part of the C++ is that?
  4. I don’t think this discussion happened (online) prior to that meeting… is this really a focus of Stan in the short term?
  1. Is there one package stylized as *Stan? Is it pronounced “star stan”?
  2. If this is separate from RStan and PyStan, what happens to those?
  1. What are these “new service API calls”? This wasn’t discussed online prior to this meeting.
  2. Who’s making these decisions? It seems like some of the key people here didn’t even have a chance to discuss this.
  3. If you’re using fit objects, are we going back to a God object design?

I think that’s kind of what we’re defining here, something like what the basic interface should include. It’s not an exclusive list.

@bgoodri and @ariddell or @ahartikainen are you all on board with this extract_one_draw function? it seems really similar to something like fit$extract()[1,] if we’re going by iterations first, right?

The Stan electorate will be asked to ratify the roadmap as stated in the third paragraph of the top post. I don’t think Allen was confused about that, and I answered Bob’s questions regarding items not listed in this roadmap in an earlier post. It looks like you haven’t read the thread, so I’m going to skim your post and try to mine it for valuable feedback but otherwise not respond to questions like “What are the ‘interfaces’?”

This is a good question - I will annotate and mark which parts are lofty goals (“Interface Package Architecture”) and which ones are requirements. I already had this on my todo list from earlier comments, but I’m only updating this once every few days to let people have a chance to get their thoughts out.

No. The architecture is designed for the reasons Paul stated here. I’ll include his reasoning in the roadmap.

I’ll change the name. Would you find <Lang>Stan less confusing? It’s referring to standardization across PyStan, RStan, JuliaStan, etc.

I’ll add “listed below” so that when people are skimming, they can find the list of calls. We’ve been having a discussion online with all dev stakeholders for the past 2 months; I believe this is adequate time to fully read the proposal and comment on specifics.

Hi, see responses below.

seantalts
Stan Developer

    September 19

andrewgelman:
There’s another issue that comes up a lot, which is that there are features that we want in all the interfaces, including R-hat, n_eff, quantiles, extract_one_draw(), etc. Maybe this means we need a “universal interface” or “basic interface” which includes all the features that are common to all interfaces? I have no idea.

I think that’s kind of what we’re defining here, something like what the basic interface should include. It’s not an exclusive list.

@bgoodri and @ariddell or @ahartikainen are you all on board with this extract_one_draw function? it seems really similar to something like fit$extract()[1,] if we’re going by iterations first, right?

I’m not sure what fit$extract() does, but there are two issues: First, extract_one_draw would take the draw number as an argument; it wouldn’t just extract the first draw. Second, my big concern here is dealing with arrays of different sizes and shapes. Here’s what I want to avoid: if alpha is a scalar, beta is a vector of length 10, and theta is a 2 x 3 x 4 array, then I need to do something like sims$alpha[s], sims$beta[s,], sims$theta[s,,,]. This makes my code ugly, and having to keep track of these indexes is also against the spirit of probabilistic programming.
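A rough sketch of the behavior described above, written in Python rather than R and resting on several assumptions: the draws are stored as one flat row per draw with names like beta[2] and theta[1,2,3], a separate `dims` table records each parameter's shape, and draws are numbered from 0. The helper names `flat_name` and `fake_draws` are hypothetical, and this is not how any existing interface implements it.

```python
from itertools import product


def flat_name(name, idx):
    """Flat column name for one element, e.g. ("theta", (1, 2, 3)) -> "theta[1,2,3]"."""
    return name if not idx else f"{name}[{','.join(map(str, idx))}]"


def fake_draws(n_draws, dims):
    """Build deterministic stand-in draws: value = 1000 * draw number + sum of indexes."""
    draws = []
    for s in range(n_draws):
        row = {}
        for name, shape in dims.items():
            # product() over no ranges yields one empty index, which covers scalars.
            for idx in product(*(range(1, d + 1) for d in shape)):
                row[flat_name(name, idx)] = s * 1000.0 + sum(idx)
        draws.append(row)
    return draws


def extract_one_draw(draws, dims, s):
    """Return draw s as a dict whose values have each parameter's own shape."""
    row = draws[s]

    def build(name, shape, prefix):
        if not shape:
            return row[flat_name(name, prefix)]
        return [build(name, shape[1:], prefix + (i,)) for i in range(1, shape[0] + 1)]

    return {name: build(name, tuple(shape), ()) for name, shape in dims.items()}


dims = {"alpha": (), "beta": (10,), "theta": (2, 3, 4)}
draws = fake_draws(20, dims)
for s in range(10):
    draw = extract_one_draw(draws, dims, s)
    # draw["alpha"] is a number, draw["beta"] has length 10, and
    # draw["theta"] is nested 2 x 3 x 4; no per-parameter indexing by s is needed.
```

The index bookkeeping lives in one place, so the loop body only ever sees correctly shaped values.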

Also please don’t forget I’d like to be able to access summaries. For example, foo(fit, median) would return a list with three elements corresponding to the median of alpha, the pointwise median of beta (i.e., a vector of length 10), and the pointwise median of theta (i.e., a 2 x 3 x 4 array). In practice, extracting these summaries could be even more valuable for routine use than extract_one_draw.
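A possible shape for such a summary function, again as a Python sketch under assumptions: `summarize` is a hypothetical name standing in for foo, and the input is assumed to be a list of draws that are already shaped (one dict per draw, nested lists for vectors and arrays). The function is applied pointwise, so the result has the same shape as each parameter.

```python
from statistics import median


def pointwise(values, fn):
    """Apply fn across draws at every position of a nested-list structure."""
    first = values[0]
    if isinstance(first, list):
        return [pointwise([v[i] for v in values], fn) for i in range(len(first))]
    # Leaf level: values is a plain list of numbers, one per draw.
    return fn(values)


def summarize(shaped_draws, fn):
    """Return one entry per parameter, shaped like the parameter itself."""
    names = shaped_draws[0].keys()
    return {name: pointwise([d[name] for d in shaped_draws], fn) for name in names}


# Five made-up draws: alpha scalar, beta of length 10, theta 2 x 3 x 4.
shaped_draws = [
    {
        "alpha": float(s),
        "beta": [float(s + j) for j in range(10)],
        "theta": [[[float(s * k) for k in range(4)] for _ in range(3)] for _ in range(2)],
    }
    for s in range(5)
]

medians = summarize(shaped_draws, median)
# medians["alpha"] is a number, medians["beta"] has length 10,
# and medians["theta"] is nested 2 x 3 x 4.
```

Any function of a list of numbers (mean, a quantile, and so on) could be passed in place of median.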

I’ll change the name. Would you find <Lang>Stan less confusing? It’s referring to standardization across PyStan, RStan, JuliaStan, etc.

One thing that confuses me (and probably other users as well!) is the role of CmdStan. In some way, CmdStan is an interface like RStan etc. But in another way, CmdStan is special in that it’s a minimal interface. And “the command line” is not a statistical environment in the same way that Python, R, Julia, are. I don’t think this is a practical problem–I assume that CmdStan is useful to external developers who want to link to Stan from their software–I just find it confusing whenever CmdStan comes up in discussion.

I’m on board with the idea of getting one draw or looping to get a few draws. But there needs to be an act of Congress before defining a new stand-alone function. Doing fit[1,] as a no-copy shortcut for as.matrix(fit)[1,] is fine.

@seantalts, there really isn’t a need for this rudeness. I did read the thread; a lot of the wording here isn’t as clear as you may imagine.

I asked “What are the ‘interfaces’” because I’m not sure what you mean. Is that for the “New Lightweight Interfaces,” in which case you could have said “Lightweight Package Architecture” to make it clear, or is this for RStan, PyStan, CmdStan, or is it for that plus JuliaStan, MatlabStan, StataStan, etc. I am asking for clarity and there just isn’t a need to respond in that manner. Those questions don’t derail the conversation.

Thanks. That paragraph doesn’t say why this needs to be ratified by the whole Stan electorate. If you don’t mind, could you provide some reason for that?

Thanks. That would be really helpful.

[Minor point] Doesn’t @paul.buerkner’s logic lead directly to breaking down the C++ to align in this way?

Thanks for changing the name. I now understand what you mean; it would help if you were consistent in your usage of this. I think you’ve called this “*Stan” and “Interfaces” and “Stan Interfaces” all in the same thread.

To me, <Lang>Stan is not less confusing. Can we not call this “Stan Interfaces”? If someone adds their own interface, are they considered part of this collection? Is it a description of a class of interfaces or is it specific things that we can enumerate and list and needs blessing to be called part of this?

I think I understand you now. The confusion wasn’t the listed calls. The confusion was that the terminology used, “Stan Services,” had meant calls to the algorithms in C++ up until this use. It would be nice if you clarified that in the post. An easy fix would be to say “For the new diagnostic service API calls” (rather than “For the new service API calls”) to match the previous sentence where you separate out the idea of “diagnostic services” and “support services.”

Ben:

Doing fit[1,] as a no-copy shortcut for as.matrix(fit)[1,] would be cool, because I do think that as.matrix is annoying to have to go through.

But what is also super-important to me is being able to pull out draws and summaries that are the same size and shape as the objects in question, so that foo$alpha is a scalar, foo$beta is a vector of length 10, and foo$theta is a 2x3x4 array. I think it’s important to not have to do hashing and index fiddling.

I understand that. In fact, I wrote code to do things like that in June of 2015, and we have been waiting for the Stan3 transition since (actually before) then.

This is interesting in that I am mostly not sure what you are suggesting here. I agree that CmdStan is both an interface and that we would like to view it as a sort of “reference implementation” - a phrase that e.g. nvidia uses to describe a graphics card they produce, while their core chips can be used on any graphics card by other manufacturers. Would using some terminology like that help? I’m not sure I understand enough about your confusion here to help address it.

This may be wishful thinking, but it sorta sounds like you two agree on fit[i,] being the way to access a single draw, assuming that it returns something with the correct shape for the first parameter and the second and so on? Is that what fit[i,] would do, or would it have all of the flattened parameters (e.g. theta1.2.3 and so on)?

Sorry, I wasn’t trying to be rude - I assumed you were mischaracterizing Bob’s and Allen’s posts because you had skimmed them, not because it was intentional. So I wanted us to get on the same page about the level of effort we’d each be investing in this, as that’s been a point of contention in the past.

I think I will just remove this section from the document as no one seemed particularly keen on implementing it.

The ratification idea came out of a call with you, me, Breck, and James Vasille. James recommended ratification to us because the roadmap has more legitimacy at the end of the day if it’s shown to be the will of the majority of the electorate (which hopefully includes more than just developers soon). It’s important to find common ground and get commitments from folks - people will be more inclined to comment on something to improve it if they have to vote for it and get behind it if the vote passes. I am pretty sure you agreed with his reasoning and this path at the time because I summarized that conversation in an email to us all afterwards on August 1st.

Hmm, I’ll do a pass-through and see if I can fix this up. Thanks for pointing it out.

Great idea, will do.

Did we have a wikipage or something similar for wanted interface functionality (including input–>function description–>output formats)?

So we could also gather which parts should be independent functions and which should be glued to the fit object.

We have

although it hasn’t been updated in a while.

Yeah looks good.

Do we have a definition of how to continue the sampling and what kind of interface that should have? I guess this would mean there is some way to move the RNG state (a Stan Math thing, I guess).
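To illustrate what "moving the RNG state" could mean, here is a toy Python sketch. It is a plain random walk, not Stan's sampler, and the names `Sampler`, `state`, and `from_state` are hypothetical. It only shows that resuming exactly requires saving both the generator state and the current position; a real sampler would presumably need more (adaptation results, for example).

```python
import random


class Sampler:
    """Toy resumable random walk for illustration; not Stan's algorithm."""

    def __init__(self, seed):
        self._rng = random.Random(seed)
        self._position = 0.0

    def run(self, n):
        """Advance the chain n steps and return the new draws as a tuple."""
        out = []
        for _ in range(n):
            # Each draw depends on the current position and on the RNG state,
            # so both must be kept to continue exactly where we left off.
            self._position += self._rng.gauss(0.0, 1.0)
            out.append(self._position)
        return tuple(out)

    def state(self):
        """Everything needed to resume: generator state plus current position."""
        return (self._rng.getstate(), self._position)

    @classmethod
    def from_state(cls, state):
        sampler = cls(seed=0)  # seed is irrelevant; the state is overwritten below
        sampler._rng.setstate(state[0])
        sampler._position = state[1]
        return sampler


first_sampler = Sampler(seed=1)
first = first_sampler.run(5)
saved = first_sampler.state()

# Later, possibly in another session: continue for 5 more iterations.
second = Sampler.from_state(saved).run(5)

# Stopping and resuming gives the same draws as one uninterrupted run.
same = (first + second) == Sampler(seed=1).run(10)
```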

Also, what is the support for ragged arrays in R? I’m not totally sure how numpy can handle them.

I think in R a list would be the most natural way to store many matrices (or arrays, vectors, etc) each of which can have different dimensions.

seantalts
Stan Developer

    September 19

andrewgelman:
One thing that confuses me (and probably other users as well!) is the role of CmdStan. In some way, CmdStan is an interface like RStan etc. But in another way, CmdStan is special in that it’s a minimal interface. And “the command line” is not a statistical environment in the same way that Python, R, Julia, are. I don’t think this is a practical problem–I assume that CmdStan is useful to external developers who want to link to Stan from their software–I just find it confusing whenever CmdStan comes up in discussion.

This is interesting in that I am mostly not sure what you are suggesting here. I agree that CmdStan is both an interface and that we would like to view it as a sort of “reference implementation” - a phrase that e.g. nvidia uses to describe a graphics card they produce, while their core chips can be used on any graphics card by other manufacturers. Would using some terminology like that help? I’m not sure I understand enough about your confusion here to help address it.

I’m not making any suggestions, just registering my confusion. I’m not sure whether CmdStan represents a minimal interface, or whether it’s just something for developers who don’t want dependency on R or Python. Often CmdStan, RStan, and PyStan are discussed as being parallel entities, but they seem to have different sorts of users.

It’s no big deal, just my confusion.

andrewgelman:
Doing fit[1,] as a no-copy shortcut for as.matrix(fit)[1,] would be cool, because I do think that as.matrix is annoying to have to go through.

This may be wishful thinking, but it sorta sounds like you two agree on fit[i,] being the way to access a single draw, assuming that it returns something with the correct shape for the first parameter and the second and so on? Is that what fit[i,] would do, or would it have all of the flattened parameters (e.g. theta1.2.3 and so on)?

I’m not sure. I think that Ben was talking about fit[i,] pulling out the i-th draw for one scalar parameter, because as.matrix concatenates all the parameters into a single named one-dimensional array (e.g., alpha, beta[1], beta[2], …). This is not so useful to me because if there are vector or array parameters then I have to do some hashing to figure out how to grab them.

I was more interested in those other functions, but now I think Ben is saying they’re all ready in rstan and just waiting for Stan 3. Not sure what this implies for CmdStan, PyStan, etc. I’ve been thinking a lot about these functions because I’ve been doing these operations in the case studies I’ve been writing recently, and I feel that the postprocessing of simulation draws is not so transparent, especially for non-scalar parameters.

“Ready” would be a strong word. We had a bunch of prototype code for things in 2015 that got sidetracked. But we can have fit[1,] return a list of the first iteration on all main parameters. I was just saying that we are trying to avoid introducing new standalone functions to do things that can be accomplished via class methods or S3 methods of existing generic functions (which is what [ is).

I’m a fan of this as well - it aids discoverability in RStudio or notebooks to have everything you can do with a complex piece of data available as a method on that object. Someone earlier wondered if we were going with God objects here - God objects typically describe (often singleton) objects that just hold all the mutable state (and methods that operate on that state) in an entire program. I think we’ll have no mutable state and are legitimately trying to use the object as a namespace for functions that operate on that piece of data as a whole, and not for all possible methods. It just so happens that there are a lot of things you can do with a collection of parameter draws and that comprises a lot of what RStan and PyStan are set out to do.

I think what historically has been known as a stanfit object — which is what Andrew is talking about — has no state and otherwise is consistent with what you are saying. The thing that generates a stanfit object is (usually) a Markov Chain so we want to be able to keep its state and run another X main iterations or whatever. But in both cases, the more Pythonic approach of relying more on “class methods” is going to yield a better workflow and no clashes with standalone functions in other packages.

The current roadmap is much more focused on little details and much less on the big picture. Not being on the roadmap isn’t blocking for a feature (though I have no idea how they’re going to get evaluated going forward when someone wants to add a unique piece of functionality to RStan or PyStan rather than coding it in a C++ service).

I think it’s also staying away from specifying much about interface details and extract_one_draw() is not something that can exist in CmdStan. But then neither can a “stan fit” object, so I guess that’s OK.

@andrewgelman—can you make a more concrete proposal for what extract_one_draw() will look like in R? What’s the result of str(...) going to look like on the return result? Which draw is it going to take if there are K chains of M draws each?
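On the "which draw" question, any answer needs a fixed convention for mapping a single draw number onto K chains of M draws. The Python sketch below shows two candidate conventions; the function name `locate_draw` and the `chain_major` flag are hypothetical, and nothing in this thread says which ordering an interface should use.

```python
def locate_draw(s, n_chains, n_iter, chain_major=True):
    """Map a flat draw number s (0-based) to (chain, iteration), both 0-based."""
    if not 0 <= s < n_chains * n_iter:
        raise IndexError(s)
    if chain_major:
        # All of chain 0's iterations come first, then chain 1's, and so on.
        chain, iteration = divmod(s, n_iter)
    else:
        # Chains are interleaved: iteration 0 of every chain, then iteration 1, ...
        iteration, chain = divmod(s, n_chains)
    return chain, iteration


# With 4 chains of 1000 draws each:
a = locate_draw(1000, n_chains=4, n_iter=1000)                  # chain-major
b = locate_draw(5, n_chains=4, n_iter=1000, chain_major=False)  # interleaved
```

Whichever convention is picked, documenting it matters more than the choice, since the same draw number points at different draws under the two orderings.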

Thanks—that’s really useful (and I think a good decision).

Presumably, anything breaking backward compatibility would require a major version bump. The question’s always been how to do that all at once so we don’t get Stan 3, Stan 4, Stan 5, etc. in quick succession.