I am struggling with a posterior predictive check in brms that looks weird, no matter what I try. But first about my data: I conducted an experiment investigating the jumping-to-conclusions bias, i.e., collecting little information before making a decision. In each trial individuals could draw between 1 and 10 pieces of information. After each draw they could either request another draw (more information) or decide for option A or B. Fewer draws before deciding means a stronger jumping-to-conclusions bias…
Every individual did 24 trials in different scenarios. Furthermore, there were 3 groups.
The distribution of draws to decision (the dependent variable) is bimodal, with one peak in the middle and a lot of 10s at the right end.
First I tried a gaussian family. Classic approach. That yielded the following posterior predictive check:
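For reference, the model behind that check was roughly the following (just a sketch; `dat`, `draws`, `group`, `scenario`, and `subject` are placeholder names, not my actual columns):

```r
library(brms)

# Sketch of the gaussian baseline model (placeholder names)
fit_gauss <- brm(
  draws ~ group + (1 | subject) + (1 | scenario),
  data   = dat,
  family = gaussian()
)

# The posterior predictive checks discussed in this thread
pp_check(fit_gauss, type = "dens_overlay", ndraws = 100)
pp_check(fit_gauss, type = "stat_2d", stat = c("mean", "sd"))
```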
student and skew_normal looked pretty much the same.
I know I have count data, so I tried the poisson family, but it predicted tons of values above 10 (even values of 15), so the gaussian fit was actually better.
Well, my count data are bounded at 10, so I tried a poisson with truncation between 1 and 10 (i.e., between 0 and 9 after subtracting 1 from my dependent variable). That looked pretty good in the dens_overlay predictive check, but the stat_2d check was not that good. As I am interested in the means, I don't feel safe interpreting these results (or what do you think?). Furthermore, there were some divergent iterations and I received a warning message: "11% of all predicted values were invalid. Increasing argument 'ntrys' may help."
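The truncated model was set up roughly like this (a sketch with the same placeholder names as above):

```r
# Shift the outcome so it starts at 0, then truncate between 0 and 9
# (in the un-shifted version you would use trunc(lb = 1, ub = 10))
dat$draws0 <- dat$draws - 1

fit_tpois <- brm(
  draws0 | trunc(lb = 0, ub = 9) ~ group + (1 | subject),
  data   = dat,
  family = poisson()
)
```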
I tried some other models like the binomial family and some hurdle models (after transforming my data with abs(dependent_variable - 10) so that I had many 0s instead of 10s). The dens_overlay check was somewhere between horrible and okay, but the stat_2d check was never promising…
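The hurdle attempts looked roughly like this (a sketch with hurdle_poisson as one example; the reversal turns the pile-up at 10 into a pile-up at 0 that the hurdle part can absorb):

```r
# Reverse the outcome so 10 draws become 0 (placeholder names)
dat$draws_rev <- abs(dat$draws - 10)

fit_hurdle <- brm(
  draws_rev ~ group + (1 | subject),
  data   = dat,
  family = hurdle_poisson()
)
```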
I tried a gaussian mixture, but there were thousands of divergent transitions and the model took several hours, so that wasn't very satisfying either…
What else can I try? I am grateful for every suggestion! Thanks a lot… :-)
Another question would be: if I cannot find a solution to improve my posterior predictive check, can I still interpret the model? At least it is not worse than classic frequentist testing with lme4, right?
Thank you for the suggestion. I don't think there is another predictor accounting for the bimodality.
I have tried a negbinomial (with the dependent variable minus 1, so that I have some zeros; naturally it goes from 1 to 10…):
Same with zero_inflated negbinomial:
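Both attempts were set up roughly like this (a sketch; `draws0` is the dependent variable minus 1 as described above, and the other names are placeholders):

```r
dat$draws0 <- dat$draws - 1   # shift so the outcome contains zeros

fit_nb   <- brm(draws0 ~ group + (1 | subject), data = dat,
                family = negbinomial())
fit_zinb <- brm(draws0 ~ group + (1 | subject), data = dat,
                family = zero_inflated_negbinomial())
```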
I repeated this with abs(dependent_variable - 10), so 1 becomes 9 and 10 becomes 0. I hoped this would improve results, as I have far more 10s than 1s. Well, not bad, but not much better than the gaussian and much harder to interpret!
I also tried a zero-inflated poisson on abs(dependent_variable - 10)… same thing: not satisfying and very hard to interpret…
What do you think? Is there any other distribution that makes sense? And when in doubt, can I interpret my gaussian model?
Great that you are posting those detailed plots. If your data are integers and bounded on both sides, then I think ordinal models would be a good choice. If I understand your experiment correctly, the continuation ratio (cratio in brms) family might be particularly suitable. Also check out the ordinal model tutorial by Paul Bürkner for more details (currently on phone, so I don't have the link ready, sorry).
Here is my grouped-bars pp_check for the truncated poisson model (which is so far the best, but still unsatisfying and harder-to-interpret, alternative to the gaussian):
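The plot was produced roughly like this (a sketch, assuming `fit_tpois` is the truncated poisson fit from above and `group` is the grouping variable):

```r
# Grouped bar plot of observed vs. predicted counts, one panel per group
pp_check(fit_tpois, type = "bars_grouped", group = "group", ndraws = 100)
```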
It's not that bad. The shape of the distribution is captured quite well by the truncated poisson, but the mean and standard deviation are better reproduced by the gaussian model (which fails with respect to the dens_overlay check…). I am not 100% happy with either so far…
With regard to the faceting I sent you a PM. And thank you @martinmodrak, I will try the cratio as soon as I can!
OK, I just naively changed the brmsfamily argument to "cratio". The posterior predictive checks really look too good to be true!
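Concretely, something like this (a sketch; the formula and names are placeholders for my actual model):

```r
# Same formula as before, only the family swapped to the
# continuation-ratio (sequential) ordinal family
fit_cratio <- brm(
  draws ~ group + (1 | subject),
  data   = dat,
  family = brmsfamily("cratio")
)

pp_check(fit_cratio, type = "dens_overlay", ndraws = 100)
```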
I have to admit that I have no clue how to interpret the results. I have to read the vignette you mentioned and figure out whether I can use the cratio model to answer my hypotheses as well as with the gaussian model, but if I can, it is very promising!
Apart from that: given the pp_checks above, do you think my standard gaussian model is a valid choice at all? On the one hand, the dens_overlay is really not fine. On the other hand, it is basically no worse than a typical frequentist lme4 model that 99% of scientists would choose. I am new to Bayesian statistics and have no idea how serious this is!
I can't find a description of the cratio family there. In general, I cannot find information about how the cratio family works and how to interpret the results. Maybe you have a source or a link for me? Thank you very much for your support!
Yes, that's the one. It seems it is called "Sequential models" there (I forgot about the naming conflict).
Tough question. I'll go with Box: "Since all models are wrong the scientist must be alert to what is importantly wrong. It is inappropriate to be concerned about mice when there are tigers abroad." In other words, it depends on your problem and the inferences you need to make. Since it appears the Gaussian matches your mean and standard deviation OK, then if it also does this well in the subgroups you care about and you only make inferences about the mean, it might not really matter that much.
With that said, I personally think the cratio is overall a safer choice; IMHO it very closely matches your experiment and is thus highly interpretable (but make sure it really does match, I might have misunderstood things).
That's because the ordinal families are very flexible and generally have N-1 threshold parameters for N outcome categories, so it is not surprising that they fit the overall distribution. The fits in subgroups are of more interest.
Good question! First, there is a technical problem in this specific case: loo can't directly compare fits with continuous families to fits with discrete families (see the relevant entry in Aki's Cross-validation FAQ).
Second, there is the general consideration that loo, while useful, should (in my personal opinion) be of second-order interest; the most important input should be the question being asked. loo tries to estimate which model is more likely to be better at predicting out-of-sample data. For many questions, we may not want to optimize predictive capacity, but rather understanding, interpretability, or "correctness" in some sense. For example, in causal inference, a model that does not adjust for a confounder or that conditions on a collider (or both) might easily be preferred by loo despite not providing a good answer to our question. Similarly, when choosing a response distribution, it might sometimes be sensible to choose, say, a more interpretable option over the one favored by loo.
Thank you, I like your point of view! Tomorrow I will read the vignette you recommended… I am excited to see whether it solves my problem!
One question that comes to my mind: if the cratio model simply has more parameters and as a result my pp_check always looks fantastic, am I not overfitting my model? But maybe this will become clearer after reading the text tomorrow… I will come back here after that!
For today, big thanks for your support @JLC and @martinmodrak. Your answers helped me a lot!
To avoid overfitting, do cross-validation. You can use the kfold function. But depending on the size of your data, you may want to generate the folds yourself (e.g. 10% from each subject for each fold).
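A sketch of what that could look like (`fit_cratio` and the `subject` column are assumptions for illustration):

```r
# 10-fold cross-validation with brms' kfold()
kf <- kfold(fit_cratio, K = 10)

# Or build the fold assignment yourself, e.g. stratified by subject so that
# each fold contains roughly 10% of every subject's trials
folds <- loo::kfold_split_stratified(K = 10, x = dat$subject)
kf_strat <- kfold(fit_cratio, folds = folds)
```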
Hi Simon @Simon_Wilms, can you share your code for making the pp_check() graph with stat = "mean" that shows values on the y-axis? I used the default function and it does not show y-axis values and scales. Thanks!
The default theme used by pp_check() masks the y-axis for some reason. Just add, for example, + theme_classic() and you have the y-axis again.
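For example (a sketch, assuming `fit` is your fitted brms model):

```r
library(ggplot2)

# pp_check() returns a ggplot object, so a theme layer can simply be added;
# theme_classic() makes the y-axis text and ticks visible again
pp_check(fit, type = "stat", stat = "mean") +
  theme_classic()
```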