Merging / grouping rvars

ma-riviere · May 21, 2022, 11:33am

I’m trying to use rvars instead of draws in my workflow (since it makes the post-processing of posterior draws much faster), but I ran into an issue I was unable to solve: how to group/merge rvars after “drawing” them.

Let’s say you extract rvars from a cmdstanr model and merge them with the original dataset the model was fit on. Then, you want to plot those draws grouped by another variable (that was in the original dataset but not in the model). This requires you to combine multiple rvars into a single one (i.e. group_by() |> summarize(my_rvars)).

But I was not able to find how to combine multiple rvars into one.

I guess I could simply extract the draws from each rvar and plot those, but that defeats the purpose of using rvars in the first place.

Thanks !

mjskay · June 9, 2022, 5:03am

Can you give an example of what you’re trying to do? Then I might be able to give more specific advice.

ma-riviere · June 13, 2022, 5:07pm

I have an example online that will illustrate the problem well:

Open this RPubs post and head to the last section (4. Posterior Predictions).
Here, I extract samples from the posterior (var name route_difficulty) in two ways:
- 1. As draws, which I group into a list for each route_idx (using data.table since there are 2M+ rows)
- 1. As rvars (using your package), with one rvar per route_idx.

So, both data.frames are the same, except one has the draws in a list and the other in a rvar.

Right after that, using the data.table obtained from method 1, I plot the distribution of route_difficulty (variable estimated by the model) for each route_rating (the actual difficulty rating of the route, which is not in the model). Basically, I’m comparing the model’s route difficulty estimates with the actual route ratings.

I would like to do the same plot with the “rvar” data.frame instead of the “list” one, except I do not know how to go from having 4288 rvars (one per route_idx) to having 19 rvars (on per route_rating - there are 19 unique route ratings in total, many route_idx having the same route_rating).

Does that make sense ?

Thanks for your help @mjskay :)

mjskay · June 14, 2022, 10:22pm

Okay, got it (I think).

rvar is a little bit stricter than manipulating draws directly, and this is deliberate — it tries to make things that are “dangerous” require breaking out of the abstraction. In your case, if I’m understanding correctly, you want to basically create mixtures of the posterior distributions of route_difficulty[...]. I’m not entirely sure what those mixtures correspond to in your model. I can tell you how to make them, but I’d double check they actually are what you want.

As an example, here is an rvar vector of length 10:

library(posterior)
x = rvar_rng(rnorm, 10, 1:10)
x

## rvar<4000>[10] mean ± sd:
##  [1]  0.99 ± 1.01   1.97 ± 1.00   3.00 ± 1.01   4.00 ± 1.00   4.99 ± 1.01 
##  [6]  6.01 ± 1.01   7.00 ± 1.01   8.00 ± 0.97   9.01 ± 1.00   9.98 ± 1.00

We can overplot all those distributions (kinda illegible)

library(ggdist)
library(ggplot2)

data.frame(x) |>
  ggplot(aes(xdist = x)) +
  stat_halfeye(alpha = 0.25)

You basically want an operation that merges all k draws in an rvar of length n returning a new rvar of length 1 with k*n draws. To do that, you can extract the draws from an rvar with draws_of(), remove the dimension information with as.vector(), and create a new rvar() from the result:

x_mix = rvar(as.vector(draws_of(x)))
x_mix

## rvar<40000>[1] mean ± sd:
## [1] 5.5 ± 3

Note it now has 40,000 draws, not 4,000. It is a mixture of all the above distributions:

data.frame(x_mix) |>
  ggplot(aes(xdist = x_mix)) +
  stat_halfeye()

If you want to do this inside a data frame, it looks something like this:

library(dplyr)

draws = tibble(
  g = rep(c("a", "b"), 5),
  x = rvar_rng(rnorm, 10, 1:10)
)

draws

## # A tibble: 10 × 2
##    g                 x
##    <chr>        <rvar>
##  1 a       0.97 ± 1.01
##  2 b       2.01 ± 1.02
##  3 a       3.00 ± 1.01
##  4 b       4.01 ± 0.99
##  5 a       4.98 ± 1.01
##  6 b       5.99 ± 0.99
##  7 a       7.00 ± 0.99
##  8 b       7.97 ± 1.01
##  9 a       9.00 ± 1.01
## 10 b      10.00 ± 1.01

draws_by_group = draws |>
  group_by(g) |>
  summarise(x = rvar(as.vector(draws_of(x))))

draws_by_group

## # A tibble: 2 × 2
##   g          x
##   <chr> <rvar>
## 1 a      5 ± 3
## 2 b      6 ± 3

draws_by_group |>
  ggplot(aes(y = g, xdist = x)) +
  stat_halfeye()

ma-riviere · June 15, 2022, 4:48pm

Thanks for your reply ! But I’m having an error while trying that:

group_by(df , route_rating) |> 
    summarize(rr = rvar(as.vector(draws_of(route_difficulty))))

Error in `summarize()`:
! Problem while computing `rr = rvar(as.vector(draws_of(route_difficulty)))`.
ℹ The error occurred in group 19: route_rating = 5.14a.
Caused by error:
! Cannot broadcast array of shape [1,13500] to array of shape [1,332500]:
All dimensions must be 1 or equal.

Could it be that the rvar format does not allow each cell of the column to have different dimensions ? In my case, there’s 4288 route_difficulty (each being a rvar of 500 draws), and I’m trying to summarize that on 19 route_rating, but each route_rating won’t correspond to an equal amount of rows in the original (4288 rows) dataframe, meaning that there are discrepancies in the total number of draws that correspond to each route_rating.

I’ve made a very hacky solution for that, but there must be a better way to accomplish this …

Get the number of total draws of the smallest group
Sample that number from every other group

min_ndraws <- pp_df |> group_by(route_rating) |> 
   summarize(ndraws = draws_of(route_difficulty) |> lengths() |> sum()) |> 
   with(min(ndraws))

pp_df |> group_by(route_rating) |> 
  summarize(route_difficulty = draws_of(route_difficulty) |> as.vector() |> sample(min_ndraws) |> rvar()) |> 
  ungroup()

PS: I’ve noticed that rvars don’t seem to work with data.table (getting a malformed data.table error when trying to convert a df with an rvar column to a dt). Is there a way around that ?

mjskay · June 15, 2022, 7:28pm

Ah yeah, rvar is really not designed for rvars with different numbers of draws in the same object. You can do something like that using distributional::dist_sample(); which also works with the geoms in ggdist and the median_qi() functions; e.g.

data.frame(x = rvar_rng(rnorm, 9, 1:9), g = c("a", "a", "b")) |>
  group_by(g) |>
  summarise(x = dist_sample(list(as.vector(draws_of(x))))) |>
  ggplot(aes(y = g, xdist = x)) +
  stat_halfeye()

PS: I’ve noticed that rvars don’t seem to work with data.table (getting a malformed data.table error when trying to convert a df with an rvar column to a dt). Is there a way around that ?

Yeah… this is on my radar but I don’t have a great solution yet. There’s an open issue here. I have a suspicion it might be fixable by moving the rvar implementation to S4, but that’s another can of worms and I haven’t had a chance to get around to investigating it.

ma-riviere · June 15, 2022, 7:53pm

Thanks for the distributional tip !

Now, remains to find why the rvar (and rvar + distributional) solutions give me widely different plots than the “list” one.
I’ve published a “test” version of the same RPubs doc here with all 3 plots at the end (in 3 tabs), to illustrate the issue. Any idea why this happens? Did I somehow mess up my code? Or is this how it’s supposed to work?

Yeah… this is on my radar but I don’t have a great solution yet. There’s an open issue here. I have a suspicion it might be fixable by moving the rvar implementation to S4, but that’s another can of worms and I haven’t had a chance to get around to investigating it.

Thanks ! I hope you (or someone else) can find time to implement that.

mjskay · June 16, 2022, 4:04pm

Hmmm dunno. Could have to do with route indices not being correctly extracted somewhere? I have seen that be an issue where the join back to the original dataframe on incorrect indices causes a problem. Hard to debug without a more minimal example.

ma-riviere · June 16, 2022, 4:19pm

Fair point, sorry.
I’ll try to make a smaller reproducible example when I have some more free time (probably this weekend), and I’ll restart the thread then.

Thanks again :)

mjskay · June 16, 2022, 4:29pm

No worries! Sorry there wasn’t an obvious problem I could point to.

Topic		Replies	Views
Tidybayes, posterior prediction, and cmdstanr General techniques	4	495	May 19, 2024
Rvars in data.frame - super efficient way of handling your posteriors brms	6	953	June 5, 2023
Extract parts of a matrix parameter General	2	465	September 16, 2021
How to combine posterior draws/samples from separate datasets/models? brms brms	4	1094	April 6, 2022
Combining posteriors rstanarm	1	558	February 18, 2019

Merging / grouping rvars

Related topics