Merging / grouping rvars

I’m trying to use rvars instead of draws in my workflow (since it makes the post-processing of posterior draws much faster), but I ran into an issue I was unable to solve: how to group/merge rvars after “drawing” them.

Let’s say you extract rvars from a cmdstanr model and merge them with the original dataset the model was fit on. Then, you want to plot those draws grouped by another variable (that was in the original dataset but not in the model). This requires you to combine multiple rvars into a single one (i.e. group_by() |> summarize(my_rvars)).

But I was not able to find how to combine multiple rvars into one.

I guess I could simply extract the draws from each rvar and plot those, but that defeats the purpose of using rvars in the first place.

Thanks !

Can you give an example of what you’re trying to do? Then I might be able to give more specific advice.

1 Like

I have an example online that will illustrate the problem well:

  • Open this RPubs post and head to the last section (4. Posterior Predictions).
  • Here, I extract samples from the posterior (var name route_difficulty) in two ways:
      1. As draws, which I group into a list for each route_idx (using data.table since there are 2M+ rows)
      1. As rvars (using your package), with one rvar per route_idx.

So, both data.frames are the same, except one has the draws in a list and the other in a rvar.

Right after that, using the data.table obtained from method 1, I plot the distribution of route_difficulty (variable estimated by the model) for each route_rating (the actual difficulty rating of the route, which is not in the model). Basically, I’m comparing the model’s route difficulty estimates with the actual route ratings.

I would like to do the same plot with the “rvar” data.frame instead of the “list” one, except I do not know how to go from having 4288 rvars (one per route_idx) to having 19 rvars (on per route_rating - there are 19 unique route ratings in total, many route_idx having the same route_rating).

Does that make sense ?

Thanks for your help @mjskay :)

1 Like

Okay, got it (I think).

rvar is a little bit stricter than manipulating draws directly, and this is deliberate — it tries to make things that are “dangerous” require breaking out of the abstraction. In your case, if I’m understanding correctly, you want to basically create mixtures of the posterior distributions of route_difficulty[...]. I’m not entirely sure what those mixtures correspond to in your model. I can tell you how to make them, but I’d double check they actually are what you want.

As an example, here is an rvar vector of length 10:

library(posterior)
x = rvar_rng(rnorm, 10, 1:10)
x
## rvar<4000>[10] mean ± sd:
##  [1]  0.99 ± 1.01   1.97 ± 1.00   3.00 ± 1.01   4.00 ± 1.00   4.99 ± 1.01 
##  [6]  6.01 ± 1.01   7.00 ± 1.01   8.00 ± 0.97   9.01 ± 1.00   9.98 ± 1.00

We can overplot all those distributions (kinda illegible)

library(ggdist)
library(ggplot2)

data.frame(x) |>
  ggplot(aes(xdist = x)) +
  stat_halfeye(alpha = 0.25)

You basically want an operation that merges all k draws in an rvar of length n returning a new rvar of length 1 with k*n draws. To do that, you can extract the draws from an rvar with draws_of(), remove the dimension information with as.vector(), and create a new rvar() from the result:

x_mix = rvar(as.vector(draws_of(x)))
x_mix
## rvar<40000>[1] mean ± sd:
## [1] 5.5 ± 3

Note it now has 40,000 draws, not 4,000. It is a mixture of all the above distributions:

data.frame(x_mix) |>
  ggplot(aes(xdist = x_mix)) +
  stat_halfeye()

If you want to do this inside a data frame, it looks something like this:

library(dplyr)

draws = tibble(
  g = rep(c("a", "b"), 5),
  x = rvar_rng(rnorm, 10, 1:10)
)

draws
## # A tibble: 10 × 2
##    g                 x
##    <chr>        <rvar>
##  1 a       0.97 ± 1.01
##  2 b       2.01 ± 1.02
##  3 a       3.00 ± 1.01
##  4 b       4.01 ± 0.99
##  5 a       4.98 ± 1.01
##  6 b       5.99 ± 0.99
##  7 a       7.00 ± 0.99
##  8 b       7.97 ± 1.01
##  9 a       9.00 ± 1.01
## 10 b      10.00 ± 1.01
draws_by_group = draws |>
  group_by(g) |>
  summarise(x = rvar(as.vector(draws_of(x))))

draws_by_group
## # A tibble: 2 × 2
##   g          x
##   <chr> <rvar>
## 1 a      5 ± 3
## 2 b      6 ± 3
draws_by_group |>
  ggplot(aes(y = g, xdist = x)) +
  stat_halfeye()

2 Likes

Thanks for your reply ! But I’m having an error while trying that:

group_by(df , route_rating) |> 
    summarize(rr = rvar(as.vector(draws_of(route_difficulty))))
Error in `summarize()`:
! Problem while computing `rr = rvar(as.vector(draws_of(route_difficulty)))`.
ℹ The error occurred in group 19: route_rating = 5.14a.
Caused by error:
! Cannot broadcast array of shape [1,13500] to array of shape [1,332500]:
All dimensions must be 1 or equal.

Could it be that the rvar format does not allow each cell of the column to have different dimensions ? In my case, there’s 4288 route_difficulty (each being a rvar of 500 draws), and I’m trying to summarize that on 19 route_rating, but each route_rating won’t correspond to an equal amount of rows in the original (4288 rows) dataframe, meaning that there are discrepancies in the total number of draws that correspond to each route_rating.

I’ve made a very hacky solution for that, but there must be a better way to accomplish this …

  • Get the number of total draws of the smallest group
  • Sample that number from every other group
min_ndraws <- pp_df |> group_by(route_rating) |> 
   summarize(ndraws = draws_of(route_difficulty) |> lengths() |> sum()) |> 
   with(min(ndraws))

pp_df |> group_by(route_rating) |> 
  summarize(route_difficulty = draws_of(route_difficulty) |> as.vector() |> sample(min_ndraws) |> rvar()) |> 
  ungroup()

PS: I’ve noticed that rvars don’t seem to work with data.table (getting a malformed data.table error when trying to convert a df with an rvar column to a dt). Is there a way around that ?

Ah yeah, rvar is really not designed for rvars with different numbers of draws in the same object. You can do something like that using distributional::dist_sample(); which also works with the geoms in ggdist and the median_qi() functions; e.g.

data.frame(x = rvar_rng(rnorm, 9, 1:9), g = c("a", "a", "b")) |>
  group_by(g) |>
  summarise(x = dist_sample(list(as.vector(draws_of(x))))) |>
  ggplot(aes(y = g, xdist = x)) +
  stat_halfeye()

PS: I’ve noticed that rvars don’t seem to work with data.table (getting a malformed data.table error when trying to convert a df with an rvar column to a dt). Is there a way around that ?

Yeah… this is on my radar but I don’t have a great solution yet. There’s an open issue here. I have a suspicion it might be fixable by moving the rvar implementation to S4, but that’s another can of worms and I haven’t had a chance to get around to investigating it.

1 Like

Thanks for the distributional tip !

Now, remains to find why the rvar (and rvar + distributional) solutions give me widely different plots than the “list” one.
I’ve published a “test” version of the same RPubs doc here with all 3 plots at the end (in 3 tabs), to illustrate the issue. Any idea why this happens? Did I somehow mess up my code? Or is this how it’s supposed to work?

Yeah… this is on my radar but I don’t have a great solution yet. There’s an open issue here. I have a suspicion it might be fixable by moving the rvar implementation to S4, but that’s another can of worms and I haven’t had a chance to get around to investigating it.

Thanks ! I hope you (or someone else) can find time to implement that.

Hmmm dunno. Could have to do with route indices not being correctly extracted somewhere? I have seen that be an issue where the join back to the original dataframe on incorrect indices causes a problem. Hard to debug without a more minimal example.

Fair point, sorry.
I’ll try to make a smaller reproducible example when I have some more free time (probably this weekend), and I’ll restart the thread then.

Thanks again :)

1 Like

No worries! Sorry there wasn’t an obvious problem I could point to.