Attaching labels estimating proportion of posterior density within a fixed interval in the ggdist package

I don’t know if this is the right forum for this question but I haven’t had much luck at stack overflow and I know the people on this site know their way around a distribution.

I am trying to create a graph plotting distributions of estimated difference between several groups using the ggdist() package in R. The help has been useful but it does not cover anywhere near the breadth of functionality possible with the package. Here is an example analysis using the iris dataset in R and brms, examining the association between Species (three-level categorical) and Sepal Length (also three-level categorical, created for this analysis from a continuous variable) on Sepal Width, a numeric variable.

library(tidyverse)
library(brms)
library(ggdist)
library(tidybayes)
library(marginaleffects)
library(modelr)

#  create dataset
iris %>%
  mutate(SepalLength_cat = factor(case_when(Sepal.Length >= 4.3 & Sepal.Length < 5 ~ "cat1",
                                            Sepal.Length >= 5.1 & Sepal.Length < 6.4 ~ "cat2",
                                            TRUE ~ "cat3"))) -> irisAlt

# run model
mod <- brm(formula = Sepal.Width ~ Species*SepalLength_cat,
           data = irisAlt)

# sample from the posterior distribution at each level of the two three-level variables (so nine cells)
data_grid(data = irisAlt,
          Species = c("setosa", "versicolor", "virginica"),
          SepalLength_cat = c("cat1", "cat2", "cat3")) %>%
  add_epred_draws(mod,
                  re_formula = NA,
                  seed = 1234) -> modPred

Now using the compare_levels() function from tidybayes we compare estimated difference in Sepal Width between levels of Sepal Length at each level of Species.

# compare levels
modPred %>%
  compare_levels(variable = .epred,
                 by = SepalLength_cat) %>%
    mutate(SepalLength_cat = factor(x = SepalLength_cat,
                                    levels = c("cat2 - cat1",
                                               "cat3 - cat1",
                                               "cat3 - cat2")),
           Species = factor(Species)) -> compCat

What I want to do is first plot the distribution, shading the 95% HDI, adding a point interval below also showing the HDI and adding spikes. What I also want to include is

(a) an interval with a ROPE between -0.25 and 0.25

(b) shading the distribution falling within that ROPE a different colour.

(c) Adding labels, in a convenient place, indicating the proportion of posterior density falling within that ROPE.

Here is my attempt

compCat %>%
  ggplot(mapping = aes(x = .epred,
                       y = SepalLength_cat)) +
         stat_histinterval(mapping = aes(fill = after_stat(level)),
                           point_interval = median_hdi,
                           .width = c(.95)) +
         stat_spike(mapping = aes(linetype = after_stat(at)), 
                    at = c("median",
                           "hdi")) +
         stat_spike(mapping = aes(linetype = after_stat(at)), 
                    at = c(-0.25, 0.25),
                    colour = "red") +
         scale_fill_manual(values = "skyblue",
                           na.value = "gray75") +
         scale_thickness_shared() + # makes sure spike same size as distribution
         facet_grid(.~Species) + 
         labs(y = "Sepal Length comparison",
              x = "Sepal Width (in mm)") +
         theme_minimal() +
         theme(legend.position = "none")

Which should come out looking like this

I used spikes for the ROPE instead of an interval, but I would really like to use an interval and especially to shade the area of the distribution that falls between -0.25 and 0.25, and annotate with labels indicating proportion of density within the ROPE. One of the problems is that stat_interval() and its related functions don’t seem to accept a fixed cartesian interval, instead calculating HDI, which moves around depending on the distribution of estimates.

I don’t know if ggdist contains the necessary functionality to do this, but if it does, or if anyone can suggest another way of achieving my aims, I would appreciate it.

I’m not sure how to do this within ggdist (or if it can be done), but @mjskay probably knows. For now, here’s a way to construct the plot by directly calculating the desired density values from the posterior draws and then creating a custom ggplot.

The overlapping shading of the ROPE and HDI might be confusing, but hopefully you can play around with the code to get something that meets your needs. (I’ve also added this answer to the Stackoverflow question.)

# Calculate HDIs
hdi = compCat %>% 
  group_by(Species, SepalLength_cat) %>% 
  summarise(as_tibble(hdi(.epred))) %>%
  rename(hdi.lwr=V1, hdi.upr=V2)

# Calculate densities
adj = 0.5
dens.hist = compCat %>% 
  group_by(Species, SepalLength_cat) %>% 
  group_split() %>% 
  map_df(~{
    dens = density_bounded(.x$.epred, adjust=adj)
    tibble(Species=.x$Species[1], SepalLength_cat=.x$SepalLength_cat[1],
           x=dens$x, .epred=dens$y)
  }) %>% 
  left_join(hdi)

# Calculate ROPE density
rope.rng = c(-0.25, 0.25)

rope.dens = dens.hist %>% 
  filter(x >= min(rope.rng), x <= max(rope.rng)) %>% 
  group_by(Species, SepalLength_cat) %>% 
  summarise(rope.dens = sum(mean(diff(x)) * .epred))
rope.color = hcl(0, 80, 30, alpha=c(1, 0.7))
hdi.color = hcl(240, 100, 30, alpha=c(1,0.5))
lwd = 0.3
theme_set(theme_minimal(base_size=14))

dens.hist %>% 
  ggplot(aes(x, .epred)) +
    geom_hline(yintercept=0, linewidth=0.1, colour="black") +
    # Overall densities
    geom_line(linewidth=lwd) +
    # Shade ROPE
    geom_area(data=. %>% 
                filter(x >= min(rope.rng), x <= max(rope.rng)),
              aes(fill="ROPE")) +
    # Shade HDI
    geom_area(data=. %>% 
                filter(x >= hdi.lwr, x <= hdi.upr),
              aes(fill="HDI")) +
    # HDI segment
    geom_segment(data=hdi, 
                 aes(x=hdi.lwr, xend=hdi.upr, y=-0.1, yend=-0.1,
                     colour="HDI")) +
    # ROPE segment
    geom_line(data=. %>% 
                group_by(Species, SepalLength_cat) %>% 
                reframe(x=rope.rng, .epred=-0.2),
              aes(colour="ROPE")) +
    # ROPE density value
    geom_text(data=rope.dens, 
              size=3.5, vjust=0.5,
              aes(label=paste0(round(rope.dens*100, 1),"%"), 
                  x = mean(rope.rng), 
                  y = -0.4,
                  colour="ROPE")) +
    facet_grid(cols=vars(Species), row=vars(fct_rev(SepalLength_cat)), switch="y") +
    theme(
      strip.text.y.left=element_text(angle=0, vjust=0.02),
      axis.text.y=element_blank(),
      axis.ticks.y=element_blank()
    ) +
    scale_y_continuous(expand=c(0.03, 0.03)) +
    scale_fill_manual(values=c(ROPE=rope.color[2], HDI=hdi.color[2])) +
    scale_colour_manual(values=c(ROPE=rope.color[1], HDI=hdi.color[1])) +
    guides(colour="none") +
    labs(y="Sepal Length comparison", x="Sepal Width", fill=NULL)

2 Likes

Thanks you so much @joels. I’ve awarded you the bounty over at SO. And thoroughly deserved. Thank you so much.

Just FYI you called an object called lwd that was undefined. I defined it as lwd <- 0.5 and it worked great. A lot of work clearly went into this. I very much appreciate that.

I’m glad it was helpful. Thanks for spotting the missing lwd. I forgot to include it when I copied the code into my answer. I’ve updated my answer to include it. I was originally going to use that linewidth for all of the line and segment geoms, but ended up using it in only one of them. If you need to set a custom linewidth multiple times, it’s easier to define it once, so you only have to update one line of code if you decide to change it later.

These are the Stan forums, so questions related to Stan are in scope. That’s a rather fuzzy definition. We’re not so adamant about this that we erase useful answers that are not related to Stan.

I don’t know who is officially in charge of our forums, so I’m pinging @SGB to see if they could create a new topic category we can use to tag questions that are not Stan related.

1 Like

This question involved summarizing and visualizing posterior draws from a model fit with Stan (via brms), so it seems like it ought to be in scope.