Discrepancy between y and yrep in pp_check: how bad is it?

Hi all,

I am trying to fit a zero-one-inflated beta model to proportional data. See the code I ran below:

brm(bf(
  Bias | weights(Weight) ~ 0 + Intercept + attract_AOI1_cen * attract_AOI2_cen * Gender + (1 | Subject),
  phi ~ attract_AOI1_cen * attract_AOI2_cen * Gender + (1 | Subject),
  zoi ~ attract_AOI1_cen * attract_AOI2_cen * Gender + (1 | Subject),
  coi ~ attract_AOI1_cen * attract_AOI2_cen * Gender + (1 | Subject)),
  data = et_attr,
  chains = 4, iter = 5000, warmup = 1000, cores = 2,
  save_pars = save_pars(all = TRUE),
  sample_prior = TRUE,
  family = zero_one_inflated_beta(),
  prior = prior(normal(0, 0.25), class = "b"))

The results of the model make a lot of sense based on our expectations and the existing literature, and the model runs smoothly, without any issues. However, when looking at the posterior predictive check (using pp_check()), I noticed a large discrepancy between y and yrep, with the yrep draws looking much “smoother” than the original data (see below).
[pp_check density overlay: observed y with a tall, spiky density; yrep draws much smoother]

My main question is: how much of an issue is this? I am specifically interested in the predictors that are in the model now. I assume that this discrepancy could indicate that I am missing a specific predictor, but I don’t have any idea which predictor(s). How much of an issue is this with regard to inferences based on the current model?

Thank you in advance!

R version 4.1.3 (2022-03-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19044)

other attached packages:
[1] brms_2.16.3


It’s not great, but I’ve seen a lot worse. If your data are appropriate for the ZOIB (as it appears they are), you are fitting the full model with your predictors of interest (as it appears you are), and there are no other signs of problems (as you have indicated in your question), then I think you’re good to go. To my eye, what the pp_check() plot suggests is that there might be other interesting things going on with your data, which might push you in a good direction for the next study.

There’s also the possible issue of sample size. Even synthetic data might look wacky with few cases.
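
Here’s a quick illustration of that point (a toy sketch in base R, independent of any model):

#even draws straight from a beta distribution can look lumpy in a density plot when n is small
set.seed(1)
par(mfrow = c(1, 2))
plot(density(rbeta(30, 2, 2)), main = "n = 30")
plot(density(rbeta(3000, 2, 2)), main = "n = 3000")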


Hi Solomon,

Thank you for your thoughts! That’s really useful. Also, thanks for all your blogs and tutorials. They have helped me a lot.


cheers!

An additional thought - I have seen this type of posterior predictive check before for zero-one-inflated beta models when the outcome was a proportion converted from some sort of scale score, like a slider scale. For example, with a slider that goes from 0 to 100, I have seen people plop the slider down at multiples of 10, or even cruder multiples of 25, so that values like 0, 25, 50, 75, and 100 are inflated. The density of the data will have peaks at those values, but the beta model will sort of average across them.
Your posterior predictive check reminded me of that, so thought I would mention it as additional information.
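
Here’s a toy simulation of that kind of heaping (all numbers made up, just to show the shape of the density):

#simulate slider-style responses where some respondents snap to the nearest 25
set.seed(1)
n <- 1000
latent <- rbeta(n, 2, 2)                        #true underlying proportion
heaped <- ifelse(runif(n) < 0.4,                #40% of responses heap...
                 round(latent * 4) / 4, latent) #...to 0, 0.25, 0.5, 0.75, or 1
hist(heaped, breaks = 50)                       #spikes at the heaped values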

I actually have a data set where the DV is a proportion that was converted from a slider scale, which shows clear 50% inflation. Here’s the pp_check() plot from the full distributional model (analogous to @tsroth’s original model, above):

We did okay, but that 50% inflation is an eyesore. If you or anyone else knows how to fit a zero-midway-one-inflated beta model in brms, I’d love to see the code. At the moment, it’s out of my skill set.


Maybe something like this? I cobbled this together this morning… someone should definitely check it before using it, as this is a bit over my head as well. I post it here as an idea. It does OK on the simple simulated data, but it seems to overestimate the zero inflation and the mid-point inflation.

#create data. 
#proportion of zero-one inflation (zoi) = 200/800 = 0.25 
#proportion of zoi that are ones (coi) = 100/200 = 0.5 
#mid-point inflation (mdi) = 100/800 = 0.125
#mean of the beta distribution is 0.5

library(brms)

#define data
y <- c(rep(0, 100), rbeta(500, 1, 1), rep(0.5, 100), rep(1, 100))
hist(y, breaks = 50)
d1 <- data.frame(y)
str(d1)

#custom brms family
zero_one_mid_inflated_beta <- custom_family(
  "zero_one_mid_inflated_beta", dpars = c("mu", "phi", "zoi", "coi", "mdi"),
  links = c("logit", "identity", "identity", "identity", "identity"),
  lb = c(0, 0, 0, 0, 0), ub = c(1, NA, 1, 1, 1),
  type = "real"
)

#provide Stan functions
stan_funs <- "
     real zero_one_mid_inflated_beta_lpdf(real y, real mu, real phi,
                                    real zoi, real coi, real mdi) {
     row_vector[2] shape = [mu * phi, (1 - mu) * phi]; 
     if (y == 0) { 
       return bernoulli_lpmf(1 | zoi) + bernoulli_lpmf(0 | coi); 
     } else if (y == 1) {
       return bernoulli_lpmf(1 | zoi) + bernoulli_lpmf(1 | coi);
     } else if (y == 0.5) {
       return bernoulli_lpmf(1 | mdi) + bernoulli_lpmf(0 | mdi) + beta_lpdf(y | shape[1], shape[2]);
     } else { 
       return bernoulli_lpmf(0 | zoi) + bernoulli_lpmf(0 | mdi) + beta_lpdf(y | shape[1], shape[2]);
     } 
   }
"

#define stanvars
stanvars <- stanvar(scode = stan_funs, block = "functions")

#run model 
m2 <- brm(y ~ 1, data=d1, family = zero_one_mid_inflated_beta, stanvars = stanvars, cores = 4)
m2

The zoi, coi, and mdi parameters all use the identity link, so they show the estimated proportions directly. You can check these against what I programmed into the data, outlined above. The Intercept and coi are good, but zoi and mdi are overestimated (the real value in the data is only captured by the lower end of the CI), so I might have done something wrong. To use pp_check(), you would also need to define a posterior_predict_zero_one_mid_inflated_beta() function in R (see below).
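
As a quick sanity check, the proportions programmed into the simulated data can be computed directly from d1 and compared against those identity-link estimates (a small helper snippet, using only objects defined above):

#empirical mixture proportions in the simulated data d1
mean(d1$y %in% c(0, 1))                    #zoi: should be about 0.25
mean(d1$y == 1) / mean(d1$y %in% c(0, 1))  #coi: should be about 0.50
mean(d1$y == 0.5)                          #mdi: should be about 0.125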

Note - The + bernoulli_lpmf(0 | mdi) + beta_lpdf(y | shape[1], shape[2]) part can be taken out of the mid-inflation line, and the model runs and obtains the same results. Originally, I was thinking this part was like the zero-inflated negative binomial case, where some zeroes can come from the negative binomial distribution… but since the outcome here is continuous, the probability of getting exactly 0.5 from a beta distribution is actually zero, right? So maybe it really is like the case of the zeroes and the ones.
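
For what it’s worth, writing this out convinced me: for any continuous density, the probability of a single point is zero,

$$\Pr(Y = 0.5 \mid \mu, \phi) = \int_{0.5}^{0.5} \operatorname{Beta}\!\big(y \mid \mu\phi, (1-\mu)\phi\big)\, dy = 0,$$

so any mass exactly at 0.5 has to come from the discrete part of the mixture, just like the zeroes and ones.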

Hopefully someone like @paul.buerkner can chime in and clarify.

I think this would be the code to be able to use pp_check for the above model.

posterior_predict_zero_one_mid_inflated_beta <- function(i, prep, ...) {
  zoi <- get_dpar(prep, "zoi", i)
  coi <- get_dpar(prep, "coi", i)
  mdi <- get_dpar(prep, "mdi", i)
  mu <- get_dpar(prep, "mu", i = i)
  phi <- get_dpar(prep, "phi", i = i)
  hu <- runif(prep$ndraws, 0, 1)
  one_or_zero <- runif(prep$ndraws, 0, 1)
  mid <- runif(prep$ndraws, 0, 1)
  ifelse(hu < zoi,
    ifelse(one_or_zero < coi, ifelse(mid < mdi, 0.5, 1), 0),
    rbeta(prep$ndraws, shape1 = mu * phi, shape2 = (1 - mu) * phi)
  )
}

pp_check(m2, ndraws=150)

And here is what you get.

Now compare to the zero-one-inflated beta model.

m2.zoi <- brm(y ~ 1, data=d1, family=zero_one_inflated_beta, cores=4)
pp_check(m2.zoi, ndraws=150)

So the zero-one-mid-inflated beta model looks somewhat better, but certainly not perfect.
What I don’t understand is that in the model output, the model appears to overestimate the midpoint inflation, but in the posterior predictive check, it seems underestimated…

Also, the dark blue data line from pp_check() doesn’t match the actual data, as the zero, midpoint, and one inflation are all programmed to be the same size (100 observations each).


So, I am not sure why pp_check doesn’t seem to reflect the actual data.

@Solomon maybe you can try a simple intercept-only model on your data to see how this works. Hopefully someone with more expertise in custom families in brms can chime in.

It could just be that I’ve done the posterior predict code incorrectly. If anyone wants to look, here is the source code for the zero_one_inflated_beta from brms that I modified for the above.

posterior_predict_zero_one_inflated_beta <- function(i, prep, ...) {
  zoi <- get_dpar(prep, "zoi", i)
  coi <- get_dpar(prep, "coi", i)
  mu <- get_dpar(prep, "mu", i = i)
  phi <- get_dpar(prep, "phi", i = i)
  hu <- runif(prep$ndraws, 0, 1)
  one_or_zero <- runif(prep$ndraws, 0, 1)
  ifelse(hu < zoi,
    ifelse(one_or_zero < coi, 1, 0),
    rbeta(prep$ndraws, shape1 = mu * phi, shape2 = (1 - mu) * phi)
  )
}

Thank you for your efforts, @jd_c! I’ll confirm that, at least from a pp_check() perspective, your zero_one_mid_inflated_beta solution doesn’t seem to fully work on my data set either. Here’s the pp_check() plot from an intercepts-only version of the model on my data.

@Henrik_Singmann, you’re tricky with custom brms families. Do you have any interest in this topic?

Hmmm, yeah that doesn’t seem to work at all… What did the numbers say? What was your actual mid-inflation vs what the model said?

So, I tried it out on some actual data as well. I used data from slider scale responses with more than 33,000 rows: many responses for each of 500+ individuals. I just ran simple intercept-only models. As you can see, posterior predictive checks are poor for both models (this data is tough, with lots of inflation around specific values), but they are better at the mid-point for the zomi beta model. The zoi beta model simply averages right through the mid-inflation.

When I look at the actual numbers, the zero and one inflation are predicted right on for the zoi beta model. They are a tad off but very close for the zomi beta model; the mid-inflation, however, is overestimated, as the real value is 4.4% and the estimate is 6%.

I think I have figured out why the pp_check density plots look a little off even when the inflation is estimated correctly (for example in the zoi beta model above, which looks off in the plots but the model estimates are right on). I think this may have to do with the density smoothing around the inflation.
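
Here’s a quick way to see the smoothing effect on its own, with toy data and no model involved:

#a kernel density estimate smears a point mass at 0.5 into a smooth bump,
#while a histogram keeps the spike visible
set.seed(1)
x <- c(rep(0.5, 200), rbeta(800, 2, 2))
plot(density(x))
hist(x, breaks = 100)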

@Solomon I am surprised that the zomi beta model didn’t work better for your data after looking at my real data above. Are you sure that you have 0.5 inflation? Or is it simply inflated around the middle? There would be a difference: the zomi beta model only estimates inflation for values exactly at 0.5.

Also, if the pp_check plot is really smoothing around the single point inflation like I think it is, it would be better to check the actual model estimates for the inflation, rather than looking at a density plot. Or maybe use a histogram.
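
For example, something like this targets the 0.5 inflation directly (a sketch using the m2 model from above; pp_check() passes type = "stat" and the stat function through to bayesplot):

#posterior predictive check on the proportion of responses exactly at 0.5
pp_check(m2, type = "stat", stat = function(y) mean(y == 0.5))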

Here’s my model code and the summary output:

fit1 <- brm(
  data = dat,
  family = zero_one_mid_inflated_beta,
  y ~ 1,
  cores = 4,
  seed = 1,
  stanvars = stanvars)

print(fit1)
 Family: zero_one_mid_inflated_beta 
  Links: mu = logit; phi = identity; zoi = identity; coi = identity; mdi = identity 
Formula: y ~ 1 
   Data: dat (Number of observations: 623) 
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000

Population-Level Effects: 
          Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept    -0.69      0.04    -0.77    -0.61 1.00     4556     3124

Family Specific Parameters: 
    Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
phi     4.01      0.24     3.54     4.49 1.00     4775     2856
zoi     0.36      0.02     0.31     0.40 1.00     4709     3136
coi     0.21      0.03     0.15     0.27 1.00     4590     3392
mdi     0.24      0.02     0.21     0.27 1.00     5184     3557

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

The empirical proportion of values exactly at 0.5 is 23%.

library(dplyr)
library(tidyr)  #for drop_na()

dat %>% 
  # there are three missing values
  drop_na(y) %>% 
  summarise(pm = mean(y == 0.5))
#> 0.2263242

OK, so the model did capture the mid-point inflation, as the proportion of values at 0.5 in the data is 0.23 and the model estimates 0.24 (95% CI: 0.21 to 0.27).
The pp_check() plot doesn’t really reflect that, though… so it either has to do with smoothing in the plot, or maybe the bit of code that I wrote for posterior_predict_zero_one_mid_inflated_beta() is wrong in one of those nested ifelse statements.

Also, it’s not just the smoothing in the density curves. For example, here’s what happens if I use type = "hist".

pp_check(fit1, ndraws = 5, type = "hist", binwidth = 0.01)

Ok, haha… it seems like the model is OK (based on the output), but the code to get the pp_check() might be wrong.
@paul.buerkner or anyone else, what did I do wrong here?

posterior_predict_zero_one_mid_inflated_beta <- function(i, prep, ...) {
  zoi <- get_dpar(prep, "zoi", i)
  coi <- get_dpar(prep, "coi", i)
  mdi <- get_dpar(prep, "mdi", i)
  mu <- get_dpar(prep, "mu", i = i)
  phi <- get_dpar(prep, "phi", i = i)
  hu <- runif(prep$ndraws, 0, 1)
  one_or_zero <- runif(prep$ndraws, 0, 1)
  mid <- runif(prep$ndraws, 0, 1)
  ifelse(hu < zoi,
    ifelse(one_or_zero < coi, ifelse(mid < mdi, 0.5, 1), 0),
    rbeta(prep$ndraws, shape1 = mu * phi, shape2 = (1 - mu) * phi)
  )
}

To summarize, here is the updated code.

#create data. 
#proportion of zero-one inflation (zoi) = 200/800 = 0.25 
#proportion of zoi that are ones (coi) = 100/200 = 0.5 
#mid-point inflation (mdi) = 100/800 = 0.125
#mean of the beta distribution is 0.5

library(brms)

y <- c(rep(0, 100), rbeta(500, 1, 1), rep(0.5, 100), rep(1, 100))
hist(y, breaks = 50)
d1 <- data.frame(y)
str(d1)


zero_one_mid_inflated_beta <- custom_family(
  "zero_one_mid_inflated_beta", dpars = c("mu", "phi", "zoi", "coi", "mdi"),
  links = c("logit", "identity", "identity", "identity", "identity"),
  lb = c(0, 0, 0, 0, 0), ub = c(1, NA, 1, 1, 1),
  type = "real"
)


stan_funs <- "
     real zero_one_mid_inflated_beta_lpdf(real y, real mu, real phi,
                                    real zoi, real coi, real mdi) {
     row_vector[2] shape = [mu * phi, (1 - mu) * phi]; 
     if (y == 0) { 
       return bernoulli_lpmf(1 | zoi) + bernoulli_lpmf(0 | coi); 
     } else if (y == 1) {
       return bernoulli_lpmf(1 | zoi) + bernoulli_lpmf(1 | coi);
     } else if (y == 0.5) {
       return bernoulli_lpmf(1 | mdi) + bernoulli_lpmf(0 | mdi);
     } else { 
       return bernoulli_lpmf(0 | zoi) + bernoulli_lpmf(0 | mdi) + beta_lpdf(y | shape[1], shape[2]);
     } 
   }
"

stanvars <- stanvar(scode = stan_funs, block = "functions")

m3 <- brm(y ~ 1, data=d1, family = zero_one_mid_inflated_beta, stanvars = stanvars, cores = 4)
m3


posterior_predict_zero_one_mid_inflated_beta <- function(i, prep, ...) {
  #extract the distributional parameters for observation i
  zoi <- get_dpar(prep, "zoi", i)
  coi <- get_dpar(prep, "coi", i)
  mdi <- get_dpar(prep, "mdi", i)
  mu <- get_dpar(prep, "mu", i = i)
  phi <- get_dpar(prep, "phi", i = i)
  #uniform draws that decide which mixture component each draw comes from
  hu <- runif(prep$ndraws, 0, 1)
  one_or_zero <- runif(prep$ndraws, 0, 1)
  mid <- runif(prep$ndraws, 0, 1)
  #0.5 with probability mdi; otherwise 0/1 with probability zoi (split by coi);
  #otherwise a draw from the beta component
  ifelse(hu < zoi & mid > mdi,
    ifelse(one_or_zero < coi, 1, 0),
    ifelse(mid < mdi, 0.5, rbeta(prep$ndraws, shape1 = mu * phi, shape2 = (1 - mu) * phi))
  )
}

pp_check(m3)

Model results:

Posterior predictive check:

The model over-estimates the mid-point inflation slightly, which can be seen in both the model results and the posterior predictive check.

I updated the posterior predict code. I think I had an ifelse statement in the wrong place. But it would be great if someone else could check it!

Also, I updated the model code because, as I reasoned above and I think is confirmed by reading this, the + beta_lpdf(y | shape[1], shape[2]) term is not needed in the mid-point inflation line: the probability of getting exactly 0.5 from a continuous distribution is zero.
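
One more thought on that slight overestimation: as written, the four branches of the lpdf don’t define mixture weights that sum to one. The y == 0 and y == 1 branches contribute zoi * (1 - coi) and zoi * coi, the y == 0.5 branch contributes mdi * (1 - mdi), and the continuous branch contributes (1 - zoi) * (1 - mdi), which adds up to zoi + mdi * (1 - mdi) + (1 - zoi) * (1 - mdi) rather than 1. Below is an untested sketch of a normalized variant, in which mdi becomes the conditional probability of a 0.5 given that the response is not a 0 or a 1 (so its true value in the simulated data would be 100/600, about 0.167, rather than 0.125), along with a matching posterior predict function. Someone should definitely check this too:

#untested sketch: a normalized lpdf, where mdi is conditional on y not being 0 or 1
#implied weights: P(0) = zoi*(1-coi); P(1) = zoi*coi; P(0.5) = (1-zoi)*mdi;
#beta component weight = (1-zoi)*(1-mdi); these sum to one
stan_funs_norm <- "
     real zero_one_mid_inflated_beta_lpdf(real y, real mu, real phi,
                                    real zoi, real coi, real mdi) {
     row_vector[2] shape = [mu * phi, (1 - mu) * phi];
     if (y == 0) {
       return bernoulli_lpmf(1 | zoi) + bernoulli_lpmf(0 | coi);
     } else if (y == 1) {
       return bernoulli_lpmf(1 | zoi) + bernoulli_lpmf(1 | coi);
     } else if (y == 0.5) {
       return bernoulli_lpmf(0 | zoi) + bernoulli_lpmf(1 | mdi);
     } else {
       return bernoulli_lpmf(0 | zoi) + bernoulli_lpmf(0 | mdi) + beta_lpdf(y | shape[1], shape[2]);
     }
   }
"

#matching posterior predictions for this parameterization:
#first decide 0/1 vs. not (zoi), then 0.5 vs. beta among the rest (mdi)
posterior_predict_zero_one_mid_inflated_beta <- function(i, prep, ...) {
  zoi <- get_dpar(prep, "zoi", i)
  coi <- get_dpar(prep, "coi", i)
  mdi <- get_dpar(prep, "mdi", i)
  mu <- get_dpar(prep, "mu", i = i)
  phi <- get_dpar(prep, "phi", i = i)
  hu <- runif(prep$ndraws, 0, 1)
  one_or_zero <- runif(prep$ndraws, 0, 1)
  mid <- runif(prep$ndraws, 0, 1)
  ifelse(hu < zoi,
    ifelse(one_or_zero < coi, 1, 0),
    ifelse(mid < mdi, 0.5, rbeta(prep$ndraws, shape1 = mu * phi, shape2 = (1 - mu) * phi))
  )
}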

Hopefully this is a good start if anyone is interested in this model. I think a zomi beta model could be useful for those working with slider scale data.

Hopefully some people can chime in to check and see if this is correct :-) Thanks @Solomon for checking it out on some data.


For some reason I got an email notification for this post, although I don’t see any mentions. While I’m here, however, let me take the opportunity to complain about density estimators for posterior retrodictive checks like this. As is evident in the figures above, the density estimators over-smooth the inflation at zero and one, which makes it difficult to figure out exactly what is going on. A histogram summary statistic (see, for example, Towards A Principled Bayesian Workflow) offers a much clearer picture for inflation models like these.

All of that said, note that when inflating continuous variables there’s actually not much benefit to fitting everything jointly. See, for example, Some Mixture Modeling Basics.
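
To make that concrete, here is a minimal sketch reusing the simulated d1 from earlier in the thread. Because the beta density contributes nothing at the inflation points and the point masses contribute nothing in the interior, the likelihood factorizes, and each piece can be estimated on its own:

#the discrete mixture weights can be read off (or modeled) directly
with(d1, prop.table(table(ifelse(y == 0, "zero",
                          ifelse(y == 1, "one",
                          ifelse(y == 0.5, "mid", "interior"))))))

#the continuous part can be fit separately, on the interior observations only
library(brms)
m_interior <- brm(y ~ 1,
                  data = subset(d1, !(y %in% c(0, 0.5, 1))),
                  family = Beta())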


How odd with the email thing. Well, thank you for your thoughts, @betanalpha.