Improving model fit with zero_one_inflated_beta with a specific case study

I’m trying to model a response variable that ranges between 0 and 1 (preference for a habitat), and because of the large number of 0’s and 1’s I’m using the family function zero_one_inflated_beta. However, I’m not very happy with the posterior predictive checks (maybe I’m overthinking?), and I would like to know whether there are better alternatives to improve the fit (the real model is a bit more complex, but I simplified it for the purpose of the example).

Example_data.csv (4.3 KB)

library(tidyverse) # for readr and ggplot2
library(brms)      # for brm() and pp_check()

d = read_csv("Example_data.csv")

model1 = brm(Response ~ Variable1 * Habitat, 
            data = d, family=zero_one_inflated_beta())

# ndraws replaces the deprecated nsamples argument in recent brms versions
pp_check(model1, ndraws = 100) + ylim(0, 10)

Now I repeat the process and try to set more informative priors based on my knowledge of the variables (I’m new to specifying priors in brms, so I apologize if I’m messing up here). As I said above, the response ranges from 0 to 1. The predictor ranges from 0 to 10 in the natural world, but in my dataset the maximum is just a bit over 6.

prior1 <- c(set_prior("normal(0,1)", class = "b", coef = "HabitatB"),
            set_prior("normal(0,1)", class = "b", coef = "HabitatC"),
            set_prior("normal(0,1)", class = "b", coef = "HabitatD"),
            set_prior("normal(0,10)", class = "b", coef = "Variable1"))

model1 = brm(Response ~ Variable1 * Habitat, 
            data = d, prior=prior1, family=zero_one_inflated_beta())

Now I conduct posterior predictive checks.


It is difficult to see anything here, so I adjust the y-axis limits to improve the visualization.

# ndraws replaces the deprecated nsamples argument in recent brms versions
pp_check(model1, ndraws = 100) + ylim(0, 10)

I have also increased the iterations (e.g., warmup = 1000, iter = 3000), but I still get similar results in the posterior predictive checks. So, is this “sufficiently good” to “trust” the model, or can it be improved further to get a better fit? Thanks in advance, any advice would be more than welcome!

Hi, @JoseBSL and welcome to the Stan forums.

It’s neat that brms has both 0 and 1 inflated beta distributions—they all work the same way, so it’s easy enough to do. For plotting the output, though, you should get a discrete probability of having a 0 value, a 1 value, and then everything else will be drawn from the beta distribution.
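To make that three-part structure concrete, here is a minimal base R sketch of drawing from a zero-one-inflated beta. It assumes the parameterization that brms appears to use for this family (`mu`, `phi`, `zoi`, `coi`). The helper name `rzoib` and the parameter values are made up for illustration and are not taken from the model in this thread.

```r
# Hypothetical helper: draw n values from a zero-one-inflated beta.
#   zoi: probability that a value is exactly 0 or 1
#   coi: given that it is 0 or 1, probability that it is 1
#   mu, phi: mean and precision of the beta part
rzoib <- function(n, mu, phi, zoi, coi) {
  stopifnot(mu > 0, mu < 1, phi > 0, zoi >= 0, zoi <= 1, coi >= 0, coi <= 1)
  is_boundary <- rbinom(n, 1, zoi) == 1
  # Continuous part: beta with shape parameters mu * phi and (1 - mu) * phi
  y <- rbeta(n, mu * phi, (1 - mu) * phi)
  # Boundary part: overwrite with exact 0s and 1s
  y[is_boundary] <- rbinom(sum(is_boundary), 1, coi)
  y
}

set.seed(1)
y_sim <- rzoib(1000, mu = 0.3, phi = 5, zoi = 0.6, coi = 0.25)

# Discrete masses at 0 and 1, then the continuous part in between
c(p_zero = mean(y_sim == 0), p_one = mean(y_sim == 1))
summary(y_sim[y_sim > 0 & y_sim < 1])
```

A plot that shows the two point masses as bars and the interior values as a density or histogram matches this structure more directly than a single smoothed density over everything.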

The real problem with 0 and 1 inflation is that it literally only handles those two values. Everything in the middle still has to match a beta distribution. If you have something more complicated, you might want to move to a richer family than beta, like a Gaussian process. I have no idea if that’s possible within brms.

You can check whether your model can recover known parameter values by fitting it to simulated data. To evaluate it against real data, you want to plot some actual posterior predictive checks, which involves choosing the statistics with which to evaluate the fit.
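As one possible way to choose statistics, the sketch below compares a few summaries of the observed response with their distribution over replicated datasets. The helper names (`ppc_stats`, `ppc_summary`, `sim_y`) are hypothetical, and the data are simulated stand-ins so that the block runs on its own. With a fitted brms model, the `yrep` matrix would presumably come from `posterior_predict(model1)`, which returns one row per draw and one column per observation.

```r
# Statistics chosen to probe the parts of a zero-one-inflated beta separately
ppc_stats <- function(y) {
  inner <- y[y > 0 & y < 1]
  c(p_zero = mean(y == 0),
    p_one = mean(y == 1),
    inner_mean = mean(inner),
    inner_sd = sd(inner))
}

# Compare observed statistics with their distribution over replicated data.
# yrep: matrix with one row per posterior draw, one column per observation
ppc_summary <- function(y, yrep) {
  obs <- ppc_stats(y)
  rep_stats <- t(apply(yrep, 1, ppc_stats))
  data.frame(
    observed = obs,
    rep_q05 = apply(rep_stats, 2, quantile, probs = 0.05, na.rm = TRUE),
    rep_q95 = apply(rep_stats, 2, quantile, probs = 0.95, na.rm = TRUE),
    # share of replicated datasets with a statistic at or above the observed one
    p_above = colMeans(sweep(rep_stats, 2, obs, ">="), na.rm = TRUE)
  )
}

# Stand-in data for illustration only (not the data from this thread)
set.seed(2)
sim_y <- function(n) {
  y <- rbeta(n, 2, 5)
  u <- runif(n)
  y[u < 0.4] <- 0
  y[u > 0.85] <- 1
  y
}
y_obs <- sim_y(200)
yrep <- t(replicate(100, sim_y(200)))  # 100 draws x 200 observations

ppc_summary(y_obs, yrep)
```

An observed statistic that falls far outside the range of its replicated values points to the part of the model (zeros, ones, or the beta component) that is not capturing the data.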

Models can almost always be improved to produce better fits. But the real question is how are you measuring fit? If you care about predictive performance, you want to use cross-validation on posterior predictive inferences. If you care about modeling the data, you might get away with just posterior predictive checks.
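For the predictive-performance route, a minimal sketch using approximate leave-one-out cross-validation could look like the following. It assumes two models already fitted with brms to the same data. The wrapper name `compare_by_loo` and the object names in the usage comment are hypothetical.

```r
# Hypothetical helper: compare two fitted brms models by approximate
# leave-one-out cross-validation. The brms package is only needed
# when the function is actually called.
compare_by_loo <- function(fit_a, fit_b) {
  loo_a <- brms::loo(fit_a)
  loo_b <- brms::loo(fit_b)
  # Models are ranked by expected log predictive density (elpd)
  brms::loo_compare(loo_a, loo_b)
}

# Usage, assuming fit_default and fit_priors are brmsfit objects
# fitted to the same data:
# compare_by_loo(fit_default, fit_priors)
```

Differences in elpd are best read alongside their standard errors, since a small difference relative to its standard error does not clearly favor either model.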

See the post “Ordered Beta Regression model: is a custom posterior predictive check plot necessary?” for a discussion of better posterior predictive check plots. I prefer a histogram at the very least, though a combination of both is a good idea.
I have not personally seen those odd spikes in the density pp plots before for beta models. Usually they look pretty smooth, similar to the plots in the post that I link. I think this may be due to the small amount of data between zero and one that the model has to use, so occasionally you get predictions that are quite off, but I am not sure.
Using the default of 10 draws with the example data and model code that you posted, here is the density plot:

And here is a histogram:

The zeros and ones seem to be captured pretty well (although the plots in the post that I linked would show this better), but the beta part of the model struggles a little. My guess is that this is because you only have 16 rows of data that are between 0.1 and 0.9. Everything else is either a 0, a 1, or very close to 0 or 1.
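A quick way to run the same kind of count on another dataset is sketched below. The helper name `unit_interval_counts` is hypothetical, and the vector is toy data rather than the values from this thread.

```r
# Hypothetical helper: tabulate where the values of a 0-1 response fall
unit_interval_counts <- function(y, lower = 0.1, upper = 0.9) {
  stopifnot(all(y >= 0 & y <= 1))
  c(exactly_0 = sum(y == 0),
    near_0 = sum(y > 0 & y < lower),
    middle = sum(y >= lower & y <= upper),
    near_1 = sum(y > upper & y < 1),
    exactly_1 = sum(y == 1))
}

# Toy values for illustration only
y_toy <- c(0, 0, 0, 0.02, 0.05, 0.3, 0.5, 0.97, 1, 1)
unit_interval_counts(y_toy)
```

With the real data this would be called as `unit_interval_counts(d$Response)`. A small `middle` count means the beta component has little information to work with.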

Dear @Bob_Carpenter,

Thanks for your insights, they are very helpful. I hadn’t thought about cross-validation, and I’m going to explore this option too.

@jd_c thanks for your response! It is true that with only 10 draws it looks quite nice. Because I have the option to explore this with another dataset, increasing the sample size may overcome this issue of having few values between 0.1 and 0.9… I may try that too.

Also, the post that you sent looks very nice! It gives me an overall idea of how to explore the fit of these beta models safely. Thanks!
