Background:
I am working with the dataset that consists of 10,000 instances or rows. I am currently using bayesplot package for graphical posterior predictive checking of location variable which consists of 5 location names dummy encoded as 1,2,3,4,5. I have generated the graphs using ppc_dens_overlay and ppc_hist shown below.
Question:
Is there a way where I can summarize the PPC graphs into one graph i.e. one plot with Means + error bars ?
(I don’t want to use density overlay to summarize the ppc for a categorical variable. To me using histogram is more intuitive for a categorical variable moreover, if I am running the ppc_hist it generates so many histograms. I know I can slice to just plot small number of graphs (shown in image attached) but having it summarised in one graph is I think would be better for my understanding)
I would really appreciate if anyone can please point me to resources to refer for solving this problem.
It looks like you’re treating your categories 1:5 as a histogram (which assumes an underlying continuous variable), not as a bar plot (which treats them as discrete).
One possible solution that comes to mind would be to make a ggplot that combined geom_bar() or geom_col() for your real data and geom_dotplot(), geom_violin() or something similar for the replicates.
For example:
library(tidyverse)
library(ggforce) # for geom_sina(), which creates the nice jitter plots
set.seed(143)
# define the real data
real_data = tibble(
category = as.character(1:5),
N = c(20, 17, 12, 31, 62)) %>%
mutate(y_obs = N/sum(N))
total_N = sum(real_data$N)
# Simulate some y-reps that would come from a PP check
sim = rmultinom(100, total_N, real_data$y_obs)
rownames(sim) = c(1:5)
sim_data = as_tibble(sim, rownames = "category") %>%
gather(-category, key = "rep", value = "N") %>%
mutate(y_rep = N / total_N)
# Join data
full_data = left_join(real_data %>% select(-N), sim_data %>% select(-N),
by = c("category"))
# plot
ggplot(full_data, aes(x = category)) +
geom_col(aes(y = y_obs), position = "identity",
color = "black", fill = "white") +
geom_sina(aes(y = y_rep), alpha = .4, seed = 120) +
theme_bw() + theme(panel.grid = element_blank())
thanks so much for sharing the code snippet. really appreciate the help.
thanks for outlining that making histogram means assumes an underlying continuous variable. I have location as a categorical variable so I think using bar plots and treating them as discrete would be more suitable