I’m moving the discussion from a CmdStanPy issue, which devolved into a meta-discussion on when something’s ready to be made an issue.
My understanding is that we were trying to avoid using GitHub for discussing what should be done and were instead going to insist that issues be concrete enough that someone could implement them.
The item I was objecting to was one of two bullet points in the issue:
* Something faster to parse and create than JSON (I know this would also require changes in CmdStan). This post compares a few options: https://yuhui-lin.github.io/blog/2017/08/01/serialization
I replied to the issue:
For the second bullet, it's not concrete enough to implement, so should probably be kicked to the forums for discussion before being promoted to an issue.
to which @seantalts replied
I think "not concrete enough to implement" is a fuzzy zone - it's possible someone could read up on a good 3rd data encoding scheme and just make a PR to add it, right? This isn't a case where we have to make one perfect choice - we could support a variety of formats.
This is definitely a fuzzy zone.
Rather than people reading this one-sentence feature request and coming up with a full pull request, I’d rather encourage them to propose a functional spec and get design feedback. It reduces the risk of a lot of wasted work and hurt feelings.
I think this makes sense for a variety of features, but probably not something where we’re just adding additional support for data formats. To me one distinguishing characteristic is whether or not there is a choice to be made - here, we don’t have to choose just 1 data format for CmdStan (we already have two), so to me most open source contributions adding additional (reasonably mature / popular) data formats would be uncontroversial and unlikely to benefit from a design discussion.
There are a gazillion decisions to be made about data formats. The most basic is whether to store column major (by parameter) or row major (by iteration). And then if we store row major, how much structure do we want in a row compared to just binary data? Each data format we bring up creates an issue of how we round trip out of CSV output (unless that changes).
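To make the column major versus row major choice concrete, here is a minimal sketch in plain Python. The parameter names and values are made up for illustration, and this is neither CmdStan's actual reader nor a proposed format; it only shows the same draws laid out both ways, starting from CSV-style text.

```python
import csv
import io

# Hypothetical CSV-style output: one row per iteration, one column per parameter.
csv_text = "lp__,mu,sigma\n-7.1,0.3,1.2\n-6.9,0.1,1.1\n-7.4,0.5,1.4\n"

# Row major (by iteration): a list of per-iteration records.
rows = [
    {name: float(value) for name, value in record.items()}
    for record in csv.DictReader(io.StringIO(csv_text))
]

# Column major (by parameter): one list of draws per parameter.
columns = {name: [row[name] for row in rows] for name in rows[0]}

# Convert back to row major to check that the two layouts hold the same draws.
names = list(columns)
round_trip = [dict(zip(names, draw)) for draw in zip(*columns.values())]
```

Row major makes it cheap to append one iteration at a time while sampling; column major makes it cheap to pull out all draws of a single parameter afterwards. Any new format would have to pick one, or define how to convert between them.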
The only cases I’ve ever seen of data I/O being a bottleneck are very large scale optimizations with very bad R dump format I/O readers.
That’s when designing a data format - in this case I think we could add support for some existing data formats without much design or choice going into it beyond implementation details.
I don’t care very strongly here, so we can just say that this should have more details in the issue for data formats. I just want to make sure it’s easy enough to have conversations about these ideas online, and that we leave open some possibilities for contributions that we haven’t necessarily spec’d out ourselves.
I don’t think it’s a good idea, but I’ll try to
* stop trying to get people to clarify issues to the point where we could judge whether a PR satisfies them or not, and
* stop trying to move meta-conversations out of the issue tracker and onto Discourse.
P.S. I do think it’s a good idea for us to be consistent, which is why I need to change my behavior.
I think both of those are judgement calls that we have to examine case by case. I think moving this discussion here was a good idea, for example, so we’re in agreement there. And I also think that in the particular issue, Marmaduke came up with a criterion for closing the issue, which was that I/O time was under 5% of total execution time even for some contrived example models. So maybe I should have tried to have the discussion on Discourse until we got to that point, and then summarized it on the GitHub issue - I just thought the maintainers of CmdStanPy seemed more active on GitHub than on Discourse. I’ll experiment with trying to have more of the CmdStanPy discussions on Discourse going forward.
I’m mostly just trying to calibrate so we don’t give contributors conflicting advice. So I’m just going to step back and let you manage where the discussions happen since I don’t understand your preferences and don’t want to get in the way.
@Bob_Carpenter, what you posted in the original post is how the Math library works. On Math, please continue to shift discussion to Discourse. Issues should be clearly specified, PRs should be implementations. GitHub is harder to search than Discourse.
I’d prefer if we were consistent across what we consider “Stan,” but if we aren’t, it’s not a big deal.
I’d also like to see more RFCs on Math (through a dedicated design doc repo). To me an RFC sits somewhere between Discourse and a formal PR, and allows a more organized layout of the plan.