I’m trying to understand what is plotted in the violin based on yrep.
Say that my data looks like this:
data <- data.frame(n = 1:50, group = rep(1:10, 5), x = rnorm(50), y = rnorm(50))
I fit a linear model, and I get my y_rep. When I use ppc_dens_overlay I get one density plot for each simulated dataset, so if my data has 50 observations and I have 20 draws from the posterior predictive distribution, I will get 20 light blue density plots, each made from 50 observations, right? This makes total sense to me.
I want a separate plot for each group. I was expecting to find a ppc_dens_overlay_grouped where each group appears in one facet, so in each facet I would still have 20 light blue density plots, but based on 5 observations each. As you know, there is no such function, and the closest thing is ppc_violin_grouped.
ppc_violin_grouped(y, yrep, group = data$group)
What I don’t understand is how ppc_violin_grouped gives me one “predicted” violin for each group without my choosing any stats. Is it just making each violin out of the 20 (draws) * 5 (observations per group) y_rep values, ignoring which draw each y_rep value comes from? If so, does that make sense? Shouldn’t I get 20 violins overlaid for each group?
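For reference, here is roughly what I was hoping a ppc_dens_overlay_grouped would do, hand-rolled with ggplot2 (just a sketch, untested; it assumes yrep is an S x N matrix of posterior predictive draws and uses data$group and data$y from my example above):

```r
library(ggplot2)
library(tidyr)
library(dplyr)

# Reshape yrep (S draws x N observations) to long format:
# one row per (draw, observation), tagged with the observation's group.
yrep_long <- as.data.frame(t(yrep)) |>            # N rows x S draw columns
  mutate(obs = row_number(), group = data$group) |>
  pivot_longer(starts_with("V"), names_to = "draw", values_to = "value")

ggplot(yrep_long, aes(value, group = draw)) +
  geom_density(colour = "lightblue") +            # one curve per draw
  geom_density(data = data, aes(x = y, group = 1),
               colour = "darkblue") +             # the observed y
  facet_wrap(~ group)                             # one facet per group
```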
I hope my problem is clear, if not I’ll be happy to expand and give more examples.
EDIT: Removed possible snark.
Yes, I agree ppc_dens_overlay_grouped would sometimes be great. So I just wanted to encourage everybody reading this to consider trying to make it happen. But I digress:
The violin plot is usually used as a replacement for a boxplot, and if you look at it this way, it IMHO makes perfect sense: why would you overlay multiple boxplots? You would generally rather make a single boxplot with all the data. In other words, ppc_violin_grouped lets you check whether the full posterior predictive distribution fits the data, while ppc_dens_overlay shows how the distribution of outcomes for each draw separately jumps around the distribution of the data. Different purposes, different plots. Does that make sense?
Sure, I can implement ppc_dens_overlay_grouped. My usual complaint is that there is no documentation on how to contribute: some CONTRIBUTING.md that says what unit tests are required, which branch is relevant, whether I can just open a pull request or need to open an issue first, etc. See for example https://mne.tools/dev/install/contributing.html. Contributing to a project is a lot of work, and I don’t want to do it wrong.
Regarding ppc_violin_grouped (and also ppc_violin), I’m still unsure whether I understand it.
The posterior predictive distribution is p(yrep_1, yrep_2, ..., yrep_N | y), where N is the number of observations. So if I understand it right, I have a joint distribution over the N observations. In the overlay, we take S samples from this joint distribution; each sample is a possible dataset of N observations, and we plot the density of each of the S possible datasets. Then I can check whether my observed distribution looks like the S possible distributions.
But in ppc_violin, we take the same S samples of (yrep_1, yrep_2, ...) and then make a single violin plot with all these samples pooled together. Is that right? So how can I check whether my observed distribution fits this plot? My observed data will always look like it has more extreme values, because the violin plot of the observed data is made of N observations, while the violin plot of the posterior predictive distribution is made of N \cdot S.
Am I understanding correctly how the violin plot is constructed in ppc_violin? (Maybe I just need to see a case where the violin plot is actually useful for detecting a misfit; do you know of an example?)
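To make sure my question is concrete, here is my current understanding in code (assuming yrep is an S x N matrix, S = 20 draws, and groups of 5 observations as in my example above):

```r
# As I read it, the "yrep" violin for group g pools every draw and every
# observation in that group into one set of values:
g <- 1
pooled <- as.vector(yrep[, data$group == g])
length(pooled)  # 20 draws * 5 observations = 100 values in one violin
```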
Sorry, I just realized my initial answer might have come across as snarky, which was not my intention. I literally meant that someone reading it (possibly someone else entirely, in the far future) might consider doing it :-) You obviously have no obligation to do it, nor did I want to imply that you are unjustifiably complaining.
Yes, that is how it is done in ppc_violin. An (artificially made) example of a misfit detected by a violin plot is below:
In group “2” the model misestimates the average of the group and most of the observations are below the mean. In group “4” the mean is just fine, but the variance is larger than what the model can accommodate: there is one huge outlier and two points at the extremes of the tails.
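Roughly how such an artificial example can be constructed (my own sketch, not the exact code behind the plot above): simulate data where one group's mean and another group's variance deviate from what the model expects.

```r
set.seed(1)
group <- rep(1:4, each = 25)
y <- rnorm(100)                      # the model expects N(0, 1) everywhere
y[group == 2] <- y[group == 2] - 1   # group 2: mean shifted down
y[group == 4] <- y[group == 4] * 3   # group 4: variance much larger

# 20 posterior predictive draws from the (wrong) N(0, 1) model:
yrep <- matrix(rnorm(20 * 100), nrow = 20)

# bayesplot::ppc_violin_grouped(y, yrep, group = group)
```

Groups 2 and 4 should then visibly fall outside the predicted violins, while groups 1 and 3 look fine.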
Oh, no, I didn’t take it as snarky. And I’d be extremely happy to contribute more to the R packages of the Stan-verse if I knew what the guidelines for contributing were. I gave the example of MNE because I found a bug, and given the clear guidelines, I found it easier to fix it myself than to report it. As a side benefit, I’m named as one of the contributors of a version ([MNE_analysis] [ANN] MNE-Python 0.18).
I strongly encourage the Stan devs to write some contributing guidelines; it might seem like a lot of effort, but I think it will pay off quite quickly.
Good example. The thing is that if you plot the overlays, you might see that these two points are perfectly fine, because a few of the possible distributions of data accommodate them quite well.
While you might also see that the data is very skewed, while the possible distributions of data aren’t.
But OK, I guess you can deduce all this from the violin plots too. I’ll see if I can come up with an example where the violins are misleading because they aggregate all the samples together (which is my concern).