How to summarise graphical Posterior predictive checks in one graph?

Background:
I am working with the dataset that consists of 10,000 instances or rows. I am currently using bayesplot package for graphical posterior predictive checking of location variable which consists of 5 location names dummy encoded as 1,2,3,4,5. I have generated the graphs using ppc_dens_overlay and ppc_hist shown below.

Question:
Is there a way where I can summarize the PPC graphs into one graph i.e. one plot with Means + error bars ?

(I don’t want to use density overlay to summarize the ppc for a categorical variable. To me using histogram is more intuitive for a categorical variable moreover, if I am running the ppc_hist it generates so many histograms. I know I can slice to just plot small number of graphs (shown in image attached) but having it summarised in one graph is I think would be better for my understanding)

I would really appreciate if anyone can please point me to resources to refer for solving this problem.

1 Like

@jonah do you know how to do this in Bayesplot?

1 Like

I might be misunderstanding, but does ppc_bars() do what you want? There are some examples in the Examples section here:

3 Likes

It looks like you’re treating your categories 1:5 as a histogram (which assumes an underlying continuous variable), not as a bar plot (which treats them as discrete).

One possible solution that comes to mind would be to make a ggplot that combined geom_bar() or geom_col() for your real data and geom_dotplot(), geom_violin() or something similar for the replicates.
For example:

library(tidyverse)
library(ggforce) # for geom_sina(), which creates the nice jitter plots
set.seed(143)
# define the real data
real_data = tibble(
  category = as.character(1:5),
  N = c(20, 17, 12, 31, 62)) %>% 
  mutate(y_obs = N/sum(N))

  total_N = sum(real_data$N)

# Simulate some y-reps that would come from a PP check
sim = rmultinom(100, total_N, real_data$y_obs)
rownames(sim) = c(1:5)
sim_data = as_tibble(sim, rownames = "category") %>% 
  gather(-category, key = "rep", value = "N") %>% 
  mutate(y_rep = N / total_N)

# Join data
full_data = left_join(real_data %>% select(-N), sim_data %>% select(-N),
                    by = c("category"))
# plot
ggplot(full_data, aes(x = category)) + 
  geom_col(aes(y = y_obs), position = "identity", 
           color = "black", fill = "white") +
  geom_sina(aes(y = y_rep),  alpha = .4, seed = 120) +
  theme_bw() + theme(panel.grid = element_blank())


5 Likes

thanks so much for sharing the code snippet. really appreciate the help.

thanks for outlining that making histogram means assumes an underlying continuous variable. I have location as a categorical variable so I think using bar plots and treating them as discrete would be more suitable