Bayesplot guts

I don’t know if this is the right place to talk Bayesplot but here goes:

I’m writing some plotting functions for a package, and I was going to use bayesplot’s base classes to deal with the irritating reality of handling arrays that might have a variety of labels for each margin (you know, long-format data represented as a vector in Stan with implicit indexing specified elsewhere)… then I see that the base classes in bayesplot just treat samples as plain arrays. Is there any room to do more OOP with this?

Asking for a friend who wastes a lot of time on indexing mistakes…

I’m imagining something like:

  • consistency enforced by validation pre/post operations
  • a sample is an array
  • the first two dimensions are: 1) iteration; 2) chain;
  • all further dimensions are sample-specific
  • methods for, in order of implementation:
    0. constructor that takes an array and optionally a list of data frames for labels (rough sketch after this list)
    1. merging chains
    2. trimming warmup
    3. thinning
    4. labelling dimensions w/ a data.frame that functions like the attributes dplyr::group_by uses
    5. transformations to ggplot-friendly long-format data frames made by unrolling specific margins
    6. calculating diagnostics
    7. element-wise math
    8. broadcasting
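
To make that concrete, here is a minimal sketch of the constructor and the first two methods, assuming a plain S3 class; none of these names (new_mcmc_sample, merge_chains, trim_warmup) exist in bayesplot, they’re just made up for illustration, and the pre/post validation would live in a validator called from each method:

    # Hypothetical S3 class: an array whose first two margins are iteration and
    # chain, with an optional list of label data frames for the remaining margins.
    new_mcmc_sample <- function(draws, labels = NULL) {
      stopifnot(is.array(draws), length(dim(draws)) >= 2)
      if (!is.null(labels)) {
        stopifnot(is.list(labels), length(labels) == length(dim(draws)) - 2)
      }
      structure(draws, labels = labels, class = c("mcmc_sample", "array"))
    }

    # 1. merging chains: stack the chains so all draws sit in one chain margin
    # (R arrays are column-major, so collapsing the first two margins keeps
    # iterations grouped within each chain)
    merge_chains <- function(x) {
      d <- dim(x)
      draws <- unclass(x)
      dim(draws) <- c(d[1] * d[2], 1, d[-(1:2)])
      new_mcmc_sample(draws, labels = attr(x, "labels"))
    }

    # 2. trimming warmup: drop the first `warmup` iterations in every chain
    trim_warmup <- function(x, warmup) {
      keep <- seq.int(warmup + 1, dim(x)[1])
      idx <- c(list(keep), rep(list(TRUE), length(dim(x)) - 1))  # TRUE keeps a whole margin
      draws <- do.call(`[`, c(list(unclass(x)), idx, list(drop = FALSE)))
      new_mcmc_sample(draws, labels = attr(x, "labels"))
    }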

Anyway, I’ve implemented a bunch of stuff like this in a package as free functions that operate on a list, but my next step was to make the operations more reliable in terms of keeping the object in a consistent state, so I wanted to see if there’s room in one of the Stan R packages for that kind of thing.

Just to be clear I’m not thinking of a complete ‘fit’ object, but something that could reliably represent the labelled sample from a given multi-dimensional parameter. Maybe it would make more sense in that rstantools (?) package?

@jonah @bgoodri (?)

If it involves dplyr or ggplot2, then it should go in bayesplot, which already has those dependencies, rather than rstantools.

A lot of these ideas for labeling and post-processing are on the table for rstan3.

I did use dplyr when I initially wrote the code, but it would be good to avoid it and I think that’s plausible. If this is something that goes into rstantools, I think we could de-couple any release from rstan and it could catch up when rstan3 comes out.

Just to be clear I’m not thinking of a complete ‘fit’ object, but something that could reliably represent the labelled sample from a given multi-dimensional parameter.

What about a matrix of posterior predictions? That’s something we have to wrangle a lot.

I am curious to hear what @mjskay says because tidybayes does a lot of heroic work on reshaping MCMC samples.


Yeah, the idea is to reduce how heroic that effort has to be. There’s a huge number of ways of doing the reshaping, but we only need some of them. I think a sample of 2-D predictions (so an n-iterations x n-chains x n-dim-1 x n-dim-2 array) can be treated the same way a matrix of predictions is when it comes to reshaping. Did you mean something different? I’d love to hear more too and see if there’s a group that could contribute to this (so the development burden gets spread out a little more).
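
As a point of reference, the unrolling step (5 above) only takes a few lines of base R on an array like that; this is just a sketch with a fake array and invented column names:

    # Fake sample: 100 iterations, 4 chains, a 3 x 2 matrix-valued prediction
    draws <- array(rnorm(100 * 4 * 3 * 2), dim = c(100, 4, 3, 2))

    # expand.grid enumerates indices in the same column-major order that
    # as.vector() uses to flatten the array, so the rows line up correctly.
    long <- expand.grid(
      iteration = seq_len(dim(draws)[1]),
      chain     = seq_len(dim(draws)[2]),
      dim1      = seq_len(dim(draws)[3]),
      dim2      = seq_len(dim(draws)[4])
    )
    long$value <- as.vector(draws)

    # e.g. trace plots faceted by the parameter's own indices:
    # ggplot2::ggplot(long, ggplot2::aes(iteration, value, colour = factor(chain))) +
    #   ggplot2::geom_line() + ggplot2::facet_grid(dim1 ~ dim2)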

I would say that tidybayes currently concentrates on 4/5/7, with the assumption that 1/2/3 are already taken care of (though it has a function for doing 1 if you are already in long format and need to do that). The nice thing about 7 (element-wise math) is that once you solve 4 and 5, 7 is already taken care of for you.

At a conceptual level, something I have recently been thinking about is the boundary between when you want to think about and manipulate samples in a linear algebra sort of way, and when you want to think about and manipulate samples in a relational algebra sort of way. I think the latter is particularly useful the closer you get to wanting to visualize things, which is when unrolling into long format dataframes becomes an important operation. However, unrolling can make your life painful for things like matrix multiplication, when you want to be in linear algebra land.

To address this, one useful format I have been considering is a data frame with chain, iteration, and one or more list columns, which could contain things like samples from parameters that are matrices. Then you could do element-wise operations and matrix multiplication using a combination of dplyr::mutate and purrr::map, while staying somewhat closer to relational algebra land for summarization and visualization tasks (and easing the transition into it).
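
To sketch what that might look like (none of this is an existing tidybayes API; the parameter names beta and x are invented):

    library(dplyr)
    library(purrr)
    library(tibble)

    # One row per (chain, iteration); matrix/vector-valued parameters in list columns
    draws <- tibble(
      chain     = rep(1:2, each = 10),
      iteration = rep(1:10, times = 2),
      beta      = map(1:20, ~ matrix(rnorm(6), nrow = 2)),  # 2 x 3 matrices
      x         = map(1:20, ~ rnorm(3))                     # length-3 vectors
    )

    # Linear-algebra land: a matrix-vector product per draw via mutate + map2
    draws <- draws %>%
      mutate(fitted = map2(beta, x, ~ .x %*% .y))

    # Back toward relational-algebra land: an element-wise posterior mean of beta
    draws %>%
      summarise(beta_mean = list(reduce(beta, `+`) / n()))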

I agree with @sakrejda that there is likely some small set of useful reshaping operations, and would add that there’s a small set of useful formats, for most goals. The trick is figuring out what those are without exploding your vocabulary to an unmanageable size. I think I am a little less concerned with what the base input format into those reshaping operations is, because whatever it is, it is unlikely to work for every situation (put another way: the reshaping will always be necessary). The goal then is to articulate the useful output formats, which ideally would not be esoteric/custom data structures, but rather things like data frames, upon which one could apply the many existing functions for manipulating data frames to achieve one’s ends.

To be honest, I’m not sure how much tidybayes would change given a nicer format to get the samples from: once you have the code for extracting indices from names like a[i,j] and doing the couple of reshaping operations you need, you can apply it to any MCMC package (Stan or otherwise) that gives you output in that format, and there are several. So in some ways, writing code to work against a nicer format output by Stan paradoxically takes more work than writing it against the ugly format, for a package like tidybayes whose goal is compatibility across many MCMC packages. Unless the idea is to convince many others to also output nicer formats (which could be great!).
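
For what it’s worth, that index-extraction step is itself pretty small; here is a bare-bones base-R sketch of pulling variable names and indices out of flat names like a[1,2] (the names and the helper parse_name are invented):

    # Flat parameter names as they come out of many MCMC packages
    nms <- c("a[1,1]", "a[2,1]", "a[1,2]", "a[2,2]", "sigma")

    parse_name <- function(nm) {
      var <- sub("\\[.*$", "", nm)                       # "a[1,2]" -> "a"
      idx <- if (grepl("\\[", nm)) {
        as.integer(strsplit(sub("^.*\\[(.*)\\]$", "\\1", nm), ",")[[1]])
      } else {
        integer(0)                                       # scalar parameter, no indices
      }
      list(variable = var, index = idx)
    }

    parsed <- lapply(nms, parse_name)
    parsed[[3]]  # $variable is "a", $index is c(1, 2)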

In any case, if you are looking for folks I would be interested in being involved in efforts to identify / standardize sample formats and operations on them.


What’s the signature you’re envisioning here? Is an element a whole draw, or do you mean just one variable? Most of what I need to do is multivariate and involves multiple named variables.

This is the place where generated quantities lets you do it naturally, but you’d really want to be running generated quantities again.

What’s that?

Could you say a bit more about how? (4 is labeling, 5 is transformation to long format, whereas 7 is elementwise math).

(7) is exactly where the motivation for Jouni Kerman’s rv package arose. This is the one Andrew regularly mentions. It was also the motivation for me to refer to Stan as a probabilistic programming language (elementwise math on the draws works the right way for expectations and quantiles, which is all we ever compute).
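
A tiny illustration of that point with plain vectors of draws (nothing rv- or Stan-specific): transforming the draws elementwise and then summarising gives the Monte Carlo estimate you want, while transforming a summary does not.

    set.seed(1)
    a  <- rnorm(4000)        # pretend these are posterior draws of a parameter a
    ea <- exp(a)             # elementwise transform: draws of exp(a)

    mean(ea)                       # Monte Carlo estimate of E[exp(a)], ~ exp(1/2) here
    quantile(ea, c(0.025, 0.975))  # posterior interval for exp(a)
    exp(mean(a))                   # exp(E[a]) != E[exp(a)] in general (Jensen's inequality)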