Assessing Bayesian Beta Regression fit using pp_check

Hi,

I’m analyzing transcription accuracy data from L2 English speakers using Bayesian beta mixed models. Accuracy was originally measured on a 0–100 scale, then transformed to the open interval (0, 1), i.e., without exact 0s and 1s, following Smithson & Verkuilen (2006).

According to what I have learned, Beta regression is theoretically appropriate for this kind of bounded data, and it worked well with native speaker data in my previous experiments.

Here’s a simplified version of my model:
model <- brm(
  TSR_score_squeezed ~ PredictorA + PredictorB + PredictorC +
    PredictorA:PredictorB + PredictorA:PredictorC +
    (1 | ID) +
    (PredictorA + PredictorB + PredictorC +
       PredictorA:PredictorB + PredictorA:PredictorC | Sentence),
  family = Beta(),
  ...
)

However, posterior predictive checks (see attached images 1 and 2) suggest my models aren’t capturing the data distribution adequately. I suspect the problem might be related to non-native speakers’ data being noisier, forming a strongly left-skewed and seemingly multimodal distribution (please see attached image 3).
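For reference, I generated the checks with something like the following (the `ndraws` values are arbitrary):

```r
library(brms)

# Density overlay of observed data vs. posterior predictive draws
pp_check(model, type = "dens_overlay", ndraws = 100)

# Histogram version, which can make lumping/discreteness easier to spot
pp_check(model, type = "hist", ndraws = 10)
```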

I’ve tried several approaches to resolve this issue, such as using different link functions (e.g., cloglog), applying different prior specifications, or further transforming the dependent variable. However, these did not resolve the issue, and strangely, some models with different link functions and priors produced nearly identical pp_check plots, even though their predictive performance differed according to leave-one-out (LOO) cross-validation.

My primary concerns are: Do the posterior predictive plots indicate a fundamental misspecification in my modeling approach? Given the left-skewed, potentially multimodal nature of L2 speaker data, what modeling approach would you recommend instead?

I’m still quite naive about Bayesian statistics, so any advice or insights would be greatly appreciated. Thank you in advance for your help! Please let me know if you need more information.



From a quick glance, it looks like there might be some lumping (rounding?) or discretization artifacts. Is there anything (e.g., from the score transformation) that could potentially result in this?

Hi, thank you so much for your helpful insights!

I’ve double-checked the transformation process, and I think it’s fine. I followed the formula [(Score/100) × (N − 1) + 0.5] / N to rescale the scores while avoiding exact 0s and 1s. I’ve attached a snapshot showing a segment of the original and transformed scores for reference.
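For completeness, here is a minimal R sketch of the Smithson & Verkuilen (2006) squeeze (the function name and the example N are placeholders, not my actual code or sample size):

```r
# Smithson & Verkuilen (2006): y' = (y * (N - 1) + 0.5) / N,
# where y is the proportion (Score / 100) and N is the sample size.
squeeze <- function(score, N) {
  y <- score / 100
  (y * (N - 1) + 0.5) / N
}

# The transform keeps values strictly inside (0, 1):
squeeze(0, N = 50)    # 0.01
squeeze(100, N = 50)  # 0.99
```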

The overall distributions of the original and squeezed scores also appear quite similar (see attached Figures 2 and 3).


That said, I agree that the distribution looks a bit odd, with multiple clusters or “lumpings.” I’m not sure whether this is due to genuinely more variable performance from L2 speakers, or possibly something else going on in the data.
Additionally, I’m a bit puzzled by the first posterior predictive check plot: the dark line representing the observed data (y) doesn’t seem to reflect the actual shape of the data distribution—particularly, it doesn’t capture the high frequency of scores near 1 that’s evident in the raw dataset.

Thanks again for your thoughtful response!

A couple thoughts.

You commented: “I suspect the problem might be related to non-native speakers’ data being noisier.” You might capture that by fitting a distributional version of the model where you attach a linear model to the precision parameter phi. In your case, that would look something like this:

model_2 <- brm(
  bf(TSR_score_squeezed ~ <your_predictors> + (1 | ID),
     phi ~ <your_predictors> + (1 | ID)),
  family = Beta(),
  ...
)

You can learn more about distributional models in this vignette.

Based on your histogram, your raw data contain exact zeros and ones. You don’t have to transform those values away if you instead fit a zero-one-inflated beta model with family = zero_one_inflated_beta(). You can find an example of that here: Issue with group-level effects using 0-1-inflated beta family
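A minimal sketch of what that could look like, reusing the predictor placeholder from above (TSR_score here stands for your untransformed proportions, Score / 100, including any exact 0s and 1s):

```r
model_3 <- brm(
  bf(TSR_score ~ <your_predictors> + (1 | ID),
     phi ~ <your_predictors> + (1 | ID),  # precision of the beta part
     zoi ~ 1,   # probability of an exact 0 or 1
     coi ~ 1),  # conditional probability of a 1, given a 0 or 1
  family = zero_one_inflated_beta(),
  ...
)
```

The zoi and coi parts can also take predictors if you expect the 0/1 rates to vary across conditions.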

You can learn more about both the zero-one-inflated beta and distributional modeling with Heiss’s great blog post: A guide to modeling proportions with Bayesian beta and zero-inflated beta regression models | Andrew Heiss – Andrew Heiss


Also, it looks like this is your first post. Welcome to the Stan forums, @GRH15!


Hi @Solomon. Thank you so much for your warm welcome and insightful suggestions. I plan to explore models with a linear predictor on the precision parameter, as well as zero-one-inflated models, following the resources you’ve provided. I will keep this thread updated with my progress.
Thank you once again, and I hope you have a pleasant day.