I don’t know if this is the right forum for this question but I haven’t had much luck at stack overflow and I know the people on this site know their way around a distribution.
I am trying to create a graph plotting distributions of estimated difference between several groups using the ggdist()
package in R. The help has been useful but it does not cover anywhere near the breadth of functionality possible with the package. Here is an example analysis using the iris dataset in R and brms, examining the association between Species (three-level categorical) and Sepal Length (also three-level categorical, created for this analysis from a continuous variable) on Sepal Width, a numeric variable.
library(tidyverse)
library(brms)
library(ggdist)
library(tidybayes)
library(marginaleffects)
library(modelr)
# create dataset
iris %>%
mutate(SepalLength_cat = factor(case_when(Sepal.Length >= 4.3 & Sepal.Length < 5 ~ "cat1",
Sepal.Length >= 5.1 & Sepal.Length < 6.4 ~ "cat2",
TRUE ~ "cat3"))) -> irisAlt
# run model
mod <- brm(formula = Sepal.Width ~ Species*SepalLength_cat,
data = irisAlt)
# sample from the posterior distribution at each level of the two three-level variables (so nine cells)
data_grid(data = irisAlt,
Species = c("setosa", "versicolor", "virginica"),
SepalLength_cat = c("cat1", "cat2", "cat3")) %>%
add_epred_draws(mod,
re_formula = NA,
seed = 1234) -> modPred
Now using the compare_levels()
function from tidybayes
we compare estimated difference in Sepal Width between levels of Sepal Length at each level of Species.
# compare levels
modPred %>%
compare_levels(variable = .epred,
by = SepalLength_cat) %>%
mutate(SepalLength_cat = factor(x = SepalLength_cat,
levels = c("cat2 - cat1",
"cat3 - cat1",
"cat3 - cat2")),
Species = factor(Species)) -> compCat
What I want to do is first plot the distribution, shading the 95% HDI, adding a point interval below also showing the HDI and adding spikes. What I also want to include is
(a) an interval with a ROPE between -0.25 and 0.25
(b) shading the distribution falling within that ROPE a different colour.
(c) Adding labels, in a convenient place, indicating the proportion of posterior density falling within that ROPE.
Here is my attempt
compCat %>%
ggplot(mapping = aes(x = .epred,
y = SepalLength_cat)) +
stat_histinterval(mapping = aes(fill = after_stat(level)),
point_interval = median_hdi,
.width = c(.95)) +
stat_spike(mapping = aes(linetype = after_stat(at)),
at = c("median",
"hdi")) +
stat_spike(mapping = aes(linetype = after_stat(at)),
at = c(-0.25, 0.25),
colour = "red") +
scale_fill_manual(values = "skyblue",
na.value = "gray75") +
scale_thickness_shared() + # makes sure spike same size as distribution
facet_grid(.~Species) +
labs(y = "Sepal Length comparison",
x = "Sepal Width (in mm)") +
theme_minimal() +
theme(legend.position = "none")
Which should come out looking like this
I used spikes for the ROPE instead of an interval, but I would really like to use an interval and especially to shade the area of the distribution that falls between -0.25 and 0.25, and annotate with labels indicating proportion of density within the ROPE. One of the problems is that stat_interval()
and its related functions don’t seem to accept a fixed cartesian interval, instead calculating HDI, which moves around depending on the distribution of estimates.
I don’t know if ggdist
contains the necessary functionality to do this, but if it does, or if anyone can suggest another way of achieving my aims, I would appreciate it.