Pp_check looking wild...how bad are they?

Hi all,
I ran some models with brms and my pp_checks look wild… Just wondering how bad they are…


I think it is because the distribution is bimodal and I ran the model with the default Gaussian family. However, I’m facing a tricky situation (probably because I am really new to brms and don’t know what I am doing):
So I have a 2 (B: present vs. absent) × 2 (C: high vs. low) between-subjects design, and I care most about the group comparisons for B.
The code below is the model that produced the first pp_check…
library(brms)
library(emmeans)
library(tidybayes)  # provides gather_emmeans_draws()
F1.q23 <- brm(Q23 ~ B * C, data = F1, iter = 6000, sample_prior = "yes",
              prior = c(prior(normal(0, 6), class = "b")))
f1.q23 <- emmeans(F1.q23, ~ B * C)
cont <- contrast(f1.q23, "tukey", reverse = TRUE)
cont_p_f1 <- gather_emmeans_draws(cont)

I tried to set the family to a mixture, but then emmeans won’t run…
So will the wonky pp_check of the model Q23 ~ B*C be a serious problem if I’m mainly interested in the contrasts?

There are a few things that could explain posterior predictive plots like the ones you see here. One is that the outcome could be a mixture of different distributions (e.g. normals).
How such a mixture comes about is the actual question, and you hinted at one possible explanation: subject variance. It is a bit difficult to tell without seeing the input data and what exactly the plots refer to, but to me it looks as if something is clearly not captured by the model, which leads to different typical values (inside a group?) with error around each of them. The residual error you estimate does not capture this grouping and instead tries to account for all of the variance globally. So random effects should probably be included in your model.

Another explanation could be that your data are not truly continuous but have been artificially chopped into discrete steps.

In summary:

Not “great” but you can identify the issues step by step :)
You are on the right track with the pp_check. You should probably also look at the prior predictive check (what the model predicts from the priors alone, without using the data).
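A prior predictive check could be sketched roughly as below. This is only a sketch under assumptions: `F1_sim` is a made-up stand-in for the real data (which are not shown in this thread), the outcome values are invented, and `ndraws` assumes a reasonably recent brms version (older versions used `nsamples`).

```r
library(brms)

set.seed(1)
# Hypothetical stand-in for F1: 2 x 2 between-subjects design, 25 people per cell.
F1_sim <- expand.grid(B = factor(c("absent", "present")),
                      C = factor(c("low", "high")),
                      rep = 1:25)
F1_sim$Q23 <- rnorm(nrow(F1_sim), mean = 70, sd = 10)  # invented outcome values

# sample_prior = "only" drops the likelihood, so all draws come from the priors.
prior_fit <- brm(
  Q23 ~ B * C, data = F1_sim,
  prior = c(prior(normal(0, 6), class = "b")),
  sample_prior = "only",
  refresh = 0
)

# Same plot type as the usual pp_check, but the simulated outcomes
# now reflect the priors alone.
pp_check(prior_fit, ndraws = 50)
```

If the simulated curves cover wildly implausible outcome values, the priors are probably too wide; if they cannot reach the observed range at all, they are probably too narrow.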


I saw that there is no subject id in the data set. Do you have this information somewhere? It might be crucial, because baseline responses in an evaluation can be subject-dependent. If you have it, you can use it as a random effect, at least for the intercept. A Gaussian mixture as currently modeled probably does not reflect how the data were generated, right?
Also, the response variable is probably rounded, which explains some of the shape. One can account for that in Stan but not in brms directly (if I’m not mistaken).
But again, a subject id might make your life a lot easier.

Thanks again! The sample was not nested or anything, so the id is just 1, 2, 3, … So I tried a random effect, but it gave me a bunch of warnings:

FB1$id <- 1:nrow(FB1)
FB1.q71_1 <- brm(Q71_1 ~ B * C + (1 | id), data = FB1, iter = 6000)
Warning messages:
1: Rows containing NAs were excluded from the model.
2: There were 648 divergent transitions after warmup. See
Runtime warnings and convergence problems
to find out why this is a problem and how to eliminate them.
3: There were 4 chains where the estimated Bayesian Fraction of Missing Information was low. See
Runtime warnings and convergence problems
4: Examine the pairs() plot to diagnose sampling problems
5: The largest R-hat is 1.08, indicating chains have not mixed.
Running the chains for more iterations may help. See
Runtime warnings and convergence problems
6: Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable.
Running the chains for more iterations may help. See
Runtime warnings and convergence problems
7: Tail Effective Samples Size (ESS) is too low, indicating posterior variances and tail quantiles may be unreliable.
Running the chains for more iterations may help. See
Runtime warnings and convergence problems

I think the posterior predictive check may be misleading and it would be better to use one of the “grouped” variants: “dens_overlay_grouped” instead of “dens_overlay” (and not a mixture for the response).
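In brms this might look something like the sketch below. `FB1_sim` is a simulated stand-in for the real data, the outcome values are invented, and I am assuming `C` is stored as a factor (as far as I know, `group` has to name a categorical variable in the model, so a numeric 0/1 column may need `factor()` first).

```r
library(brms)

set.seed(1)
# Hypothetical stand-in for FB1: one row per person, 50 people per cell.
FB1_sim <- expand.grid(B = factor(c(0, 1)), C = factor(c(0, 1)), rep = 1:50)
# Invented outcome whose location depends on C only.
FB1_sim$Q71_1 <- rnorm(nrow(FB1_sim),
                       mean = ifelse(FB1_sim$C == "1", 85, 65),
                       sd = 8)

fit <- brm(Q71_1 ~ B * C, data = FB1_sim, refresh = 0)

# Pooled check: observed and simulated outcomes across all four cells at once.
pp_check(fit, type = "dens_overlay", ndraws = 50)

# Grouped check: one panel per level of C.
pp_check(fit, type = "dens_overlay_grouped", group = "C", ndraws = 50)
```

The grouped version compares the simulated and observed densities within each level of `C`, which is much closer to what the Gaussian model actually assumes.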

Both predictors B and C are binary. Here are histograms of the outcome Q71_1 for each of the four combinations of B and C. There is a clear difference between C = 0 and C = 1 (B doesn’t make a strikingly big difference), and the peaks at about 65 when C = 0 and 85 when C = 1 clearly correspond to the two peaks in the PPC plot.
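To see why a pooled density can look bimodal even when a plain Gaussian model is fine within each cell, here is a small base R simulation. Everything in it is made up for illustration: the cell size `n`, the standard deviation of 5, and the choice to give B no effect at all. Only the rough peak locations (65 and 85) are taken from the histograms described above.

```r
set.seed(42)
n <- 200  # hypothetical number of respondents per cell

# 2 x 2 design; the mean depends on C only, B has no effect in this toy example.
sim <- expand.grid(i = seq_len(n), B = c(0, 1), C = c(0, 1))
sim$y <- rnorm(nrow(sim), mean = ifelse(sim$C == 1, 85, 65), sd = 5)

# Each cell is a single Gaussian centred near 65 or 85.
cell_means <- tapply(sim$y, list(B = sim$B, C = sim$C), mean)
print(round(cell_means, 1))

# Pooled over cells there is a dip between the two peaks:
# fewer observations fall near 75 than near 65 or 85.
share_near <- function(x, centre, width = 3) mean(abs(x - centre) < width)
pooled_shares <- c(near_65 = share_near(sim$y, 65),
                   near_75 = share_near(sim$y, 75),
                   near_85 = share_near(sim$y, 85))
print(round(pooled_shares, 3))

# Histograms per cell, analogous to the four panels described above.
op <- par(mfrow = c(2, 2))
for (b in 0:1) for (cc in 0:1) {
  hist(sim$y[sim$B == b & sim$C == cc],
       main = sprintf("B = %d, C = %d", b, cc),
       xlab = "y", xlim = range(sim$y))
}
par(op)
```

So two peaks in the pooled outcome do not by themselves mean the Gaussian family is wrong; they can simply reflect a large difference between the group means.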


Thank you so much!! I didn’t know pp_check had this variant, and I believe my problem has been solved. Thanks everyone, hooray!