Attaching labels estimating proportion of posterior density within a fixed interval in the ggdist package

I don’t know if this is the right forum for this question but I haven’t had much luck at stack overflow and I know the people on this site know their way around a distribution.

I am trying to create a graph plotting distributions of estimated difference between several groups using the ggdist() package in R. The help has been useful but it does not cover anywhere near the breadth of functionality possible with the package. Here is an example analysis using the iris dataset in R and brms, examining the association between Species (three-level categorical) and Sepal Length (also three-level categorical, created for this analysis from a continuous variable) on Sepal Width, a numeric variable.

library(tidyverse)
library(brms)
library(ggdist)
library(tidybayes)
library(marginaleffects)
library(modelr)

#  create dataset
iris %>%
  mutate(SepalLength_cat = factor(case_when(Sepal.Length >= 4.3 & Sepal.Length < 5 ~ "cat1",
                                            Sepal.Length >= 5.1 & Sepal.Length < 6.4 ~ "cat2",
                                            TRUE ~ "cat3"))) -> irisAlt

# run model
mod <- brm(formula = Sepal.Width ~ Species*SepalLength_cat,
           data = irisAlt)

# sample from the posterior distribution at each level of the two three-level variables (so nine cells)
data_grid(data = irisAlt,
          Species = c("setosa", "versicolor", "virginica"),
          SepalLength_cat = c("cat1", "cat2", "cat3")) %>%
  add_epred_draws(mod,
                  re_formula = NA,
                  seed = 1234) -> modPred

Now using the compare_levels() function from tidybayes we compare estimated difference in Sepal Width between levels of Sepal Length at each level of Species.

# compare levels
modPred %>%
  compare_levels(variable = .epred,
                 by = SepalLength_cat) %>%
    mutate(SepalLength_cat = factor(x = SepalLength_cat,
                                    levels = c("cat2 - cat1",
                                               "cat3 - cat1",
                                               "cat3 - cat2")),
           Species = factor(Species)) -> compCat

What I want to do is first plot the distribution, shading the 95% HDI, adding a point interval below also showing the HDI and adding spikes. What I also want to include is

(a) an interval with a ROPE between -0.25 and 0.25

(b) shading the distribution falling within that ROPE a different colour.

(c) Adding labels, in a convenient place, indicating the proportion of posterior density falling within that ROPE.

Here is my attempt

compCat %>%
  ggplot(mapping = aes(x = .epred,
                       y = SepalLength_cat)) +
         stat_histinterval(mapping = aes(fill = after_stat(level)),
                           point_interval = median_hdi,
                           .width = c(.95)) +
         stat_spike(mapping = aes(linetype = after_stat(at)), 
                    at = c("median",
                           "hdi")) +
         stat_spike(mapping = aes(linetype = after_stat(at)), 
                    at = c(-0.25, 0.25),
                    colour = "red") +
         scale_fill_manual(values = "skyblue",
                           na.value = "gray75") +
         scale_thickness_shared() + # makes sure spike same size as distribution
         facet_grid(.~Species) + 
         labs(y = "Sepal Length comparison",
              x = "Sepal Width (in mm)") +
         theme_minimal() +
         theme(legend.position = "none")

Which should come out looking like this

I used spikes for the ROPE instead of an interval, but I would really like to use an interval and especially to shade the area of the distribution that falls between -0.25 and 0.25, and annotate with labels indicating proportion of density within the ROPE. One of the problems is that stat_interval() and its related functions don’t seem to accept a fixed cartesian interval, instead calculating HDI, which moves around depending on the distribution of estimates.

I don’t know if ggdist contains the necessary functionality to do this, but if it does, or if anyone can suggest another way of achieving my aims, I would appreciate it.

I’m not sure how to do this within ggdist (or if it can be done), but @mjskay probably knows. For now, here’s a way to construct the plot by directly calculating the desired density values from the posterior draws and then creating a custom ggplot.

The overlapping shading of the ROPE and HDI might be confusing, but hopefully you can play around with the code to get something that meets your needs. (I’ve also added this answer to the Stackoverflow question.)

# Calculate HDIs
hdi = compCat %>% 
  group_by(Species, SepalLength_cat) %>% 
  summarise(as_tibble(hdi(.epred))) %>%
  rename(hdi.lwr=V1, hdi.upr=V2)

# Calculate densities
adj = 0.5
dens.hist = compCat %>% 
  group_by(Species, SepalLength_cat) %>% 
  group_split() %>% 
  map_df(~{
    dens = density_bounded(.x$.epred, adjust=adj)
    tibble(Species=.x$Species[1], SepalLength_cat=.x$SepalLength_cat[1],
           x=dens$x, .epred=dens$y)
  }) %>% 
  left_join(hdi)

# Calculate ROPE density
rope.rng = c(-0.25, 0.25)

rope.dens = dens.hist %>% 
  filter(x >= min(rope.rng), x <= max(rope.rng)) %>% 
  group_by(Species, SepalLength_cat) %>% 
  summarise(rope.dens = sum(mean(diff(x)) * .epred))
rope.color = hcl(0, 80, 30, alpha=c(1, 0.7))
hdi.color = hcl(240, 100, 30, alpha=c(1,0.5))
lwd = 0.3
theme_set(theme_minimal(base_size=14))

dens.hist %>% 
  ggplot(aes(x, .epred)) +
    geom_hline(yintercept=0, linewidth=0.1, colour="black") +
    # Overall densities
    geom_line(linewidth=lwd) +
    # Shade ROPE
    geom_area(data=. %>% 
                filter(x >= min(rope.rng), x <= max(rope.rng)),
              aes(fill="ROPE")) +
    # Shade HDI
    geom_area(data=. %>% 
                filter(x >= hdi.lwr, x <= hdi.upr),
              aes(fill="HDI")) +
    # HDI segment
    geom_segment(data=hdi, 
                 aes(x=hdi.lwr, xend=hdi.upr, y=-0.1, yend=-0.1,
                     colour="HDI")) +
    # ROPE segment
    geom_line(data=. %>% 
                group_by(Species, SepalLength_cat) %>% 
                reframe(x=rope.rng, .epred=-0.2),
              aes(colour="ROPE")) +
    # ROPE density value
    geom_text(data=rope.dens, 
              size=3.5, vjust=0.5,
              aes(label=paste0(round(rope.dens*100, 1),"%"), 
                  x = mean(rope.rng), 
                  y = -0.4,
                  colour="ROPE")) +
    facet_grid(cols=vars(Species), row=vars(fct_rev(SepalLength_cat)), switch="y") +
    theme(
      strip.text.y.left=element_text(angle=0, vjust=0.02),
      axis.text.y=element_blank(),
      axis.ticks.y=element_blank()
    ) +
    scale_y_continuous(expand=c(0.03, 0.03)) +
    scale_fill_manual(values=c(ROPE=rope.color[2], HDI=hdi.color[2])) +
    scale_colour_manual(values=c(ROPE=rope.color[1], HDI=hdi.color[1])) +
    guides(colour="none") +
    labs(y="Sepal Length comparison", x="Sepal Width", fill=NULL)

2 Likes

Thanks you so much @joels. I’ve awarded you the bounty over at SO. And thoroughly deserved. Thank you so much.

Just FYI you called an object called lwd that was undefined. I defined it as lwd <- 0.5 and it worked great. A lot of work clearly went into this. I very much appreciate that.

I’m glad it was helpful. Thanks for spotting the missing lwd. I forgot to include it when I copied the code into my answer. I’ve updated my answer to include it. I was originally going to use that linewidth for all of the line and segment geoms, but ended up using it in only one of them. If you need to set a custom linewidth multiple times, it’s easier to define it once, so you only have to update one line of code if you decide to change it later.

These are the Stan forums, so questions related to Stan are in scope. That’s a rather fuzzy definition. We’re not so adamant about this that we erase useful answers that are not related to Stan.

I don’t know who is officially in charge of our forums, so I’m pinging @SGB to see if they could create a new topic category we can use to tag questions that are not Stan related.

1 Like

This question involved summarizing and visualizing posterior draws from a model fit with Stan (via brms), so it seems like it ought to be in scope.

Sorry about that—my bad for being sloppy with language. Your post is definitely Stan-related. I was thinking about a topic for Stan-adjacent uses of packages that are not Stan packages, like ggdist (from Matthew Kay and Brenton Wiernik) and tidybayes (from Matthew Kay). We get a lot of questions about them, but we (the Stan project) are not the maintainers of these packages and might not be the best people to answer questions about them.

I don’t know if tidybayes or ggidst have a way to ask questions other than issues on GitHub. One option is to ping Matthew (@mjskay) here—it’s really hard to find people’s handles even if you know their names.

@joels could you explain , in the following code above

rope.dens = dens.hist %>% 
  filter(x >= min(rope.rng), x <= max(rope.rng)) %>% 
  group_by(Species, SepalLength_cat) %>% 
  summarise(rope.dens = sum(mean(diff(x)) * .epred)

What does the sum(mean(diff(x)) * .epred do?

That was just a quick-and-dirty numerical integration to get the area under the density curve subtended by the ROPE range. It’s equivalent to:

\int_{xmin}^{xmax} density(x)dx \approx \sum\limits_{xmin}^{xmax}density(x)\Delta x

Where xmin and xmax are the ROPE endpoints, density(x) is the posterior density of .epred, and \Delta x is mean(diff(x)), where I took the mean because I figured there might be small differences between each \Delta x interval due to numerical imprecision. But I probably should have done:

rope.dens = sum(c(NA, diff(x)) * .epred, na.rm=TRUE)

Or maybe I should have used the y-value at the midpoint between each x value. And of course there are much more accurate algorithms for more exacting situations.

Just as a check, we expect the total area under the density curve to be equal to 1.0 for the whole distribution:

options(digits=10)

dens.hist %>% 
  group_by(Species, SepalLength_cat) %>% 
  # Get midpoint density values
  mutate(dx = c(NA, diff(x)), 
         .epred.mid = zoo::rollmean(.epred, k=2, na.pad=TRUE, align="right")) %>% 
  summarise(total.density = sum(c(NA, diff(x)) * .epred, na.rm=TRUE),
            total.density2 = sum(mean(diff(x)) * .epred),
            total.density.midpt = sum(.epred.mid * dx, na.rm=TRUE)) %>% 
  as.data.frame()
     Species SepalLength_cat total.density total.density2 total.density.midpt
1     setosa     cat2 - cat1  0.9999992093   0.9999994093        0.9999992198
2     setosa     cat3 - cat1  0.9999993745   0.9999995802        0.9999993834
3     setosa     cat3 - cat2  0.9999988950   0.9999990747        0.9999988836
4 versicolor     cat2 - cat1  0.9999997022   0.9999998813        0.9999996892
5 versicolor     cat3 - cat1  0.9999994202   0.9999996072        0.9999994200
6 versicolor     cat3 - cat2  0.9999987074   0.9999989102        0.9999987074
7  virginica     cat2 - cat1  0.9999987483   0.9999989148        0.9999986161
8  virginica     cat3 - cat1  0.9999984518   0.9999989102        0.9999985982
9  virginica     cat3 - cat2  0.9999990026   0.9999992023        0.9999989849

When we limit the x-range to be within the ROPE, we get the fraction of the total density that’s within the ROPE:

dens.hist %>% 
  mutate(dx = c(NA, diff(x)), 
         .epred.mid = zoo::rollmean(.epred, k=2, na.pad=TRUE, align="right")) %>% 
  filter(x >= min(rope.rng), x <= max(rope.rng)) %>% 
  group_by(Species, SepalLength_cat) %>% 
  summarise(rope.density = sum(mean(diff(x)) * .epred),
            rope.density2 = sum(c(NA, diff(x)) * .epred, na.rm=TRUE),
            rope.density.midpt = sum(.epred.mid * dx, na.rm=TRUE)) 
  Species    SepalLength_cat rope.density rope.density2 rope.density.midpt
1 setosa     cat2 - cat1      0.000000512   0.000000312        0.000000256
2 setosa     cat3 - cat1      0.585         0.585              0.582      
3 setosa     cat3 - cat2      0.208         0.204              0.211      
4 versicolor cat2 - cat1      0.347         0.346              0.344      
5 versicolor cat3 - cat1      0.216         0.216              0.213      
6 versicolor cat3 - cat2      0.884         0.884              0.882      
7 virginica  cat2 - cat1      0.358         0.357              0.356      
8 virginica  cat3 - cat1      0.143         0.143              0.141      
9 virginica  cat3 - cat2      0.584         0.584              0.581 

You can see there are some small differences, so maybe the quickest fix is to just increase the number of points at which the density is calculated:

dens.hist2 = compCat %>% 
  group_by(Species, SepalLength_cat) %>% 
  group_split() %>% 
  map_df(~{
    # Use 4,096 density points instead of 512
    dens = density_unbounded(.x$.epred, adjust=adj, n=2^12)
    tibble(Species=.x$Species[1], SepalLength_cat=.x$SepalLength_cat[1],
           x=dens$x, .epred=dens$y)
  })
dens.hist2 %>% 
  mutate(dx = c(NA, diff(x)), 
         .epred.mid = zoo::rollmean(.epred, k=2, na.pad=TRUE, align="right")) %>% 
  filter(x >= min(rope.rng), x <= max(rope.rng)) %>% 
  group_by(Species, SepalLength_cat) %>% 
  summarise(rope.density = sum(mean(diff(x)) * .epred),
            rope.density2 = sum(c(NA, diff(x)) * .epred, na.rm=TRUE),
            rope.density.midpt = sum(.epred.mid * dx, na.rm=TRUE)) 
  Species    SepalLength_cat rope.density rope.density2 rope.density.midpt
1 setosa     cat2 - cat1      0.000000493   0.000000469        0.000000456
2 setosa     cat3 - cat1      0.587         0.587              0.587      
3 setosa     cat3 - cat2      0.207         0.207              0.207      
4 versicolor cat2 - cat1      0.348         0.348              0.348      
5 versicolor cat3 - cat1      0.214         0.214              0.214      
6 versicolor cat3 - cat2      0.883         0.883              0.883      
7 virginica  cat2 - cat1      0.360         0.360              0.360      
8 virginica  cat3 - cat1      0.143         0.143              0.143      
9 virginica  cat3 - cat2      0.582         0.582              0.581