# Discrepancy between y and yrep in pp_check: how bad is it?

Maybe something like this? I cobbled this together this morning…someone should definitely check it before using it, as this is a bit over my head as well. I post it here as an idea. It does OK on the simple simulated data, but seems to overestimate the zero-inflation and middle-inflation.

``````#create data.
#proportion of zero-one inflation (zoi) = 200/800 = 0.25
#proportion of zoi that are ones (coi) = 100/200 = 0.5
#mid-point inflation (mdi) = 100/800 = 0.125
#mean of the beta distribution is 0.5

library(brms)

#define data
y <- c(rep(0, 100), rbeta(500, 1, 1), rep(0.5, 100), rep(1, 100))
hist(y,breaks=50)
d1 <- cbind(y)
d1 <- data.frame(d1)
str(d1)

#custom brms family
zero_one_mid_inflated_beta <- custom_family(
"zero_one_mid_inflated_beta", dpars = c("mu", "phi", "zoi", "coi", "mdi"),
links = c("logit", "identity", "identity", "identity", "identity"),
lb = c(0, 0, 0, 0, 0), ub = c(1, NA, 1, 1, 1),
type = "real"
)

#provide Stan functions
stan_funs <- "
real zero_one_mid_inflated_beta_lpdf(real y, real mu, real phi,
real zoi, real coi, real mdi) {
row_vector[2] shape = [mu * phi, (1 - mu) * phi];
if (y == 0) {
return bernoulli_lpmf(1 | zoi) + bernoulli_lpmf(0 | coi);
} else if (y == 1) {
return bernoulli_lpmf(1 | zoi) + bernoulli_lpmf(1 | coi);
} else if (y == 0.5) {
return bernoulli_lpmf(1 | mdi) + bernoulli_lpmf(0 | mdi) + beta_lpdf(y | shape[1], shape[2]);
} else {
return bernoulli_lpmf(0 | zoi) + bernoulli_lpmf(0 | mdi) + beta_lpdf(y | shape[1], shape[2]);
}
}
"

#define stanvars
stanvars <- stanvar(scode = stan_funs, block = "functions")

#run model
m2 <- brm(y ~ 1, data=d1, family = zero_one_mid_inflated_beta, stanvars = stanvars, cores = 4)
m2
``````

The `zoi`, `coi`, and `mdi` all use identity link, so they are showing the estimated proportion. You can check this against what I programmed into the data, outlined above. `Intercept` and `coi` are good, but `zoi` and `mdi` are overestimated (the real value in the data is captured by the lower end of the CI), so I might have done something wrong. To use `pp_check()` you would need to define `zero_one_mid_inflated_beta_rng` in the `stan_funs`.

Note - The `+ bernoulli_lpmf(0 | mdi) + beta_lpdf(y | shape[1], shape[2])` can be taken out of the mid-inflation line, and the model runs and obtains the same results. Originally, I was thinking this part was like the zero-inflated negative binomial case where some zeroes can come from the negative binomial distribution… but since this isn’t integer, the probability of getting exactly 0.5 from beta distribution is actually zero, right? So, maybe it really is like the case of the zeroes and the ones.

Hopefully someone like @paul.buerkner can chime in and clarify.

I think this would be the code to be able to use `pp_check` for the above model.

``````posterior_predict_zero_one_mid_inflated_beta <- function(i, prep, ...) {
zoi <- get_dpar(prep, "zoi", i)
coi <- get_dpar(prep, "coi", i)
mdi <- get_dpar(prep, "mdi", i)
mu <- get_dpar(prep, "mu", i = i)
phi <- get_dpar(prep, "phi", i = i)
hu <- runif(prep\$ndraws, 0, 1)
one_or_zero <- runif(prep\$ndraws, 0, 1)
mid <- runif(prep\$ndraws, 0, 1)
ifelse(hu < zoi,
ifelse(one_or_zero < coi, ifelse(mid < mdi, 0.5, 1), 0),
rbeta(prep\$ndraws, shape1 = mu * phi, shape2 = (1 - mu) * phi)
)
}

pp_check(m2, ndraws=150)
``````

And here is what you get.

Now compare to the zero-one-inflated beta model.

``````m2.zoi <- brm(y ~ 1, data=d1, family=zero_one_inflated_beta, cores=4)
pp_check(m2.zoi, ndraws=150)
``````

So the zero-one-mid-inflated beta model looks some better, but certainly not perfect.
What I don’t understand is that in the model output, it appears that the model is overestimating the midpoint inflation, but in the posterior predictive check, it seems underestimated…

Also, the dark blue data line from `pp_check` doesn’t match the actually data, as the zero, midpoint, and one inflation are all programmed the same.

So, I am not sure why `pp_check` doesn’t seem to reflect the actual data.

@Solomon maybe you can try a simple intercept only model on your data to see how this works. Hopefully someone with more expertise in custom families in brms can chime in.

It could just be that I’ve done the posterior predict code incorrectly. If anyone wants to look, here is the source code for the `zero_one_inflated_beta` from brms that I modified for the above.

``````posterior_predict_zero_one_inflated_beta <- function(i, prep, ...) {
zoi <- get_dpar(prep, "zoi", i)
coi <- get_dpar(prep, "coi", i)
mu <- get_dpar(prep, "mu", i = i)
phi <- get_dpar(prep, "phi", i = i)
hu <- runif(prep\$ndraws, 0, 1)
one_or_zero <- runif(prep\$ndraws, 0, 1)
ifelse(hu < zoi,
ifelse(one_or_zero < coi, 1, 0),
rbeta(prep\$ndraws, shape1 = mu * phi, shape2 = (1 - mu) * phi)
)
}
``````

Thank you for your efforts, @jd_c! I’ll confirm that, at least from a `pp_check()` perspective, your `zero_one_mid_inflated_beta` solution doesn’t seem to fully work on my data set either. Here’s the `pp_check()` plot from an intercepts-only version of the model on my data.

@Henrik_Singmann, you’re tricky with custom brms families. Do you have any interest in this topic?

Hmmm, yeah that doesn’t seem to work at all… What did the numbers say? What was your actual mid-inflation vs what the model said?

So, I tried it out on some actual data as well. I used data from slider scale responses with more than 33,000 rows. This is many responses for each of 500+ individuals. I just ran simple intercept only models. As you can see, posterior predictive checks are poor for both models (this data is tough! with lots of inflation around specific values), but they are better at the mid-point for the zomi beta model. The zoi beta simply averages right through the mid-inflation.

When I look at the actual numbers, the zero and one inflation are predicted right on for the zoi beta model. They are a tad off but very close for the zomi beta model, but the mid-inflation is over estimated, as the real is 4.4% and the estimate is 6%.

I think I have figured out why the `pp_check` density plots look a little off even when the inflation is estimated correctly (for example in the zoi beta model above, which looks off in the plots but the model estimates are right on). I think this may have to do with the density smoothing around the inflation.

@Solomon I am surprised that the zomi beta model didn’t work better for your data after looking at my real data above. Are you sure that you have 0.5 inflation? Or is it simply inflated around the middle? There would be a difference. The zomi beta model is only estimating inflation for values exactly at 0.5

Also, if the `pp_check` plot is really smoothing around the single point inflation like I think it is, it would be better to check the actual model estimates for the inflation, rather than looking at a density plot. Or maybe use a histogram.

Here’s my model code and the summary output:

``````fit1 <- brm(
data = dat,
family = zero_one_mid_inflated_beta,
y ~ 1,
cores = 4,
seed = 1,
stanvars = stanvars)

print(fit1)
``````
`````` Family: zero_one_mid_inflated_beta
Links: mu = logit; phi = identity; zoi = identity; coi = identity; mdi = identity
Formula: y ~ 1
Data: dat (Number of observations: 623)
Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup draws = 4000

Population-Level Effects:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept    -0.69      0.04    -0.77    -0.61 1.00     4556     3124

Family Specific Parameters:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
phi     4.01      0.24     3.54     4.49 1.00     4775     2856
zoi     0.36      0.02     0.31     0.40 1.00     4709     3136
coi     0.21      0.03     0.15     0.27 1.00     4590     3392
mdi     0.24      0.02     0.21     0.27 1.00     5184     3557

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
``````

The empirical proportion of values exactly at 0.5 is 23%.

``````dat %>%
# there are three missing values
drop_na(y) %>%
summarise(pm = mean(y == 0.5))
``````
``````0.2263242
``````

OK, so then the model did capture the mid-point inflation, as the proportion of values at 0.5 in the data is 0.23 and the model is estimating 0.24 (0.21 - 0.27).
The `pp_check` plot doesn’t really reflect that, though… so it either has to do with smoothing in the plot, or maybe the bit of code that I wrote for the `posterior_predict_zero_one_mid_inflated_beta` is wrong in one of those nested ifelse statements.

Also, it’s not just the smoothing in the density curves. For example, here’s what happens if I use `type = "hist"`.

``````pp_check(fit1, ndraws = 5, type = "hist", binwidth = 0.01)
``````

Ok, haha… seems like the model is ok (based on the output) but the code to get the pp_check might be wrong.
@paul.buerkner or anyone else, what did I do wrong here?

``````posterior_predict_zero_one_mid_inflated_beta <- function(i, prep, ...) {
zoi <- get_dpar(prep, "zoi", i)
coi <- get_dpar(prep, "coi", i)
mdi <- get_dpar(prep, "mdi", i)
mu <- get_dpar(prep, "mu", i = i)
phi <- get_dpar(prep, "phi", i = i)
hu <- runif(prep\$ndraws, 0, 1)
one_or_zero <- runif(prep\$ndraws, 0, 1)
mid <- runif(prep\$ndraws, 0, 1)
ifelse(hu < zoi,
ifelse(one_or_zero < coi, ifelse(mid < mdi, 0.5, 1), 0),
rbeta(prep\$ndraws, shape1 = mu * phi, shape2 = (1 - mu) * phi)
)
}
``````

To summarize: below would be the updated code.

``````#create data.
#proportion of zero-one inflation (zoi) = 200/800 = 0.25
#proportion of zoi that are ones (coi) = 100/200 = 0.5
#mid-point inflation (mdi) = 100/800 = 0.125
#mean of the beta distribution is 0.5

library(brms)

y <- c(rep(0, 100), rbeta(500, 1, 1), rep(0.5, 100), rep(1, 100))
hist(y,breaks=50)
d1 <- cbind(y)
d1 <- data.frame(d1)
str(d1)

zero_one_mid_inflated_beta <- custom_family(
"zero_one_mid_inflated_beta", dpars = c("mu", "phi", "zoi", "coi", "mdi"),
links = c("logit", "identity", "identity", "identity", "identity"),
lb = c(0, 0, 0, 0, 0), ub = c(1, NA, 1, 1, 1),
type = "real"
)

stan_funs <- "
real zero_one_mid_inflated_beta_lpdf(real y, real mu, real phi,
real zoi, real coi, real mdi) {
row_vector[2] shape = [mu * phi, (1 - mu) * phi];
if (y == 0) {
return bernoulli_lpmf(1 | zoi) + bernoulli_lpmf(0 | coi);
} else if (y == 1) {
return bernoulli_lpmf(1 | zoi) + bernoulli_lpmf(1 | coi);
} else if (y == 0.5) {
return bernoulli_lpmf(1 | mdi) + bernoulli_lpmf(0 | mdi);
} else {
return bernoulli_lpmf(0 | zoi) + bernoulli_lpmf(0 | mdi) + beta_lpdf(y | shape[1], shape[2]);
}
}
"

stanvars <- stanvar(scode = stan_funs, block = "functions")

m3 <- brm(y ~ 1, data=d1, family = zero_one_mid_inflated_beta, stanvars = stanvars, cores = 4)
m3

posterior_predict_zero_one_mid_inflated_beta <- function(i, prep, ...) {
zoi <- get_dpar(prep, "zoi", i)
coi <- get_dpar(prep, "coi", i)
mdi <- get_dpar(prep, "mdi", i)
mu <- get_dpar(prep, "mu", i = i)
phi <- get_dpar(prep, "phi", i = i)
hu <- runif(prep\$ndraws, 0, 1)
one_or_zero <- runif(prep\$ndraws, 0, 1)
mid <- runif(prep\$ndraws, 0, 1)
ifelse(hu < zoi & mid > mdi,
ifelse(one_or_zero < coi, 1, 0),
ifelse(mid < mdi, 0.5, rbeta(prep\$ndraws, shape1 = mu * phi, shape2 = (1 - mu) * phi))
)
}

pp_check(m3)
``````

Model results:

Posterior predictive check:

The model over-estimates the mid-point inflation slightly, which can be seen in both the model results and the posterior predictive check.

I updated the posterior predict code. I think I had an ifelse statement in the wrong place. But it would be great if someone else could check it!

Also, I updated the model code, because as I reasoned here and I think confirmed by reading this, that the ` + beta_lpdf(y | shape[1], shape[2])` is not needed in the mid-point inflation line, because the probability of getting exactly 0.5 from a continuous distribution is zero.

Hopefully this is a good start if anyone is interested in this model. I think a zomi beta model could be useful for those working with slider scale data.

Hopefully some people can chime in to check and see if this is correct :-) Thanks @Solomon for checking it out on some data.

1 Like

For some reason I got an email notification for this post, although I don’t see any mentions. While I’m here, however, let me take the opportunity to complain about density estimators for posterior retrodictive checks like this. As is evident in the above figures the density estimators over smooth the inflation at zero and one which makes it difficult to figure out exactly what is going on. A histogram summary statistic, see for example Towards A Principled Bayesian Workflow, offers a much more clear picture for inflation models like these.

All of that said note that when inflating continuous variables there’s actually not much benefit to fitting everything jointly. See for example Some Mixture Modeing Basics.

2 Likes

How odd with the email thing. Well, thank you for your thoughts, @betanalpha.

The hyperlinked `this` in post #18 is a @betanalpha post. I don’t know if its possible to turn off notifications for that sort of thing…

1 Like

Maybe because I linked to a post or comment of yours? Odd thing is, I never got an email notification for this comment, so didn’t see until now.

Haha! Yes, I think I mentioned that somewhere above. I should have switched to histogram in the last post with the code.

Thanks for the great link!! I think there might be some benefit here simply from a brms user standpoint for those who frequently work with slider scale data. In this case the inflation is not just a nuisance, so one would likely not want to ignore it, and it seems it could be convenient as is `zero_one_inflated_beta` in brms currently.

1 Like

I don’t think @betanalpha was advocating for ignoring the inflation, but rather was pointing out that in many cases inflation models for otherwise continuous variables decompose into two pieces that don’t need to be fit jointly (they should still both be fit, but they can be fit separately). There are potentially two reasons to fit jointly. One is if the different pieces share parameters (e.g. covariance parameters in a random effect, or even more explicit forms of sharing, like if the linear predictor from one sub-model is used as a covariate in the other). The other is simply for the convenience of post-processing in `brms` as a single model. But there are also potentially reasons NOT to fit jointly, including the ease of diagnosing problematic posteriors in one sub-model or the other. It’s a modeling decision worth thinking about.

2 Likes

Thanks!
Yes, he writes this, “In other words because the the inflated and non-inflated points are essentially modeled by separate data generating processes the entire model decomposes into a binomial model for the total inflated observations and a baseline model for the non-zero observations.”

But I was referring to this: “In particular if the inflated counts are just a nuisance then we can ignore them entirely and just fit the non-inflated observations directly without any consideration of the inflation!”
I guess I misunderstood.

For slider scale data, I think the reasons to fit jointly that you listed would almost always apply. Usually you have respondents answering multiple questions, in which case, I would think that you would want to model the varying intercepts for respondent as correlated for the continuous and inflation parts in brms like `(1|p|id)`. It is likely that some respondents may be ambivalent and generally pick neutral values or a ‘love it or hate it’ approach and pick all (1’s) or nothing (0’s). It would really depend on the data, but I think the joint approach might be more useful.

Good point!

No, I think you understood perfectly, and I’m the one who read sloppily. If the inflated counts are just nuisance, then you can often ignore them entirely.

I think you’re good to go. To my eye, what the `pp_check()` plot suggests is there might be other interesting things going on with your data, which might push you in a good direction for the next study.