Hi,
I’m analyzing transcription accuracy data from L2 English speakers using Bayesian beta mixed-effects models. Accuracy was originally measured on a 0–100 scale and then transformed to the open (0,1) interval (i.e., with no exact 0s or 1s) following Smithson & Verkuilen (2006).
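For reference, the transformation I applied is the standard squeeze from that paper (dat and TSR_score are placeholder names for my data frame and raw score column):

# Smithson & Verkuilen (2006): compress [0, 1] to (0, 1)
# so no exact 0s or 1s remain; n = number of observations
n <- nrow(dat)
dat$TSR_score_squeezed <- ((dat$TSR_score / 100) * (n - 1) + 0.5) / n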
From what I have read, beta regression is theoretically appropriate for this kind of bounded data, and it worked well with native-speaker data in my previous experiments.
Here’s a simplified version of my model:
model <- brm(
  TSR_score_squeezed ~ PredictorA + PredictorB + PredictorC +
    PredictorA:PredictorB + PredictorA:PredictorC +
    (1 | ID) +
    (PredictorA + PredictorB + PredictorC +
       PredictorA:PredictorB + PredictorA:PredictorC | Sentence),
  family = Beta(),
  ...
)
However, posterior predictive checks (see attached images 1 and 2) suggest my models aren’t capturing the data distribution adequately. I suspect the problem is that the non-native speakers’ data are noisier, forming a strongly left-skewed and seemingly multimodal distribution (see attached image 3).
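For context, the checks in the attached images were generated along these lines (model is the fitted brmsfit above):

# overlaid density check (image 1) and histogram comparison (image 2)
pp_check(model, ndraws = 100)
pp_check(model, type = "hist", ndraws = 11)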
I’ve tried several approaches to resolve this, such as using different link functions (e.g., cloglog), applying different prior specifications, and further transforming the dependent variable. None of these solved the problem, and, strangely, some models with different link functions and priors produced nearly identical pp_check plots even though their predictive performance differed under leave-one-out (LOO) cross-validation.
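As a sketch of one variant I compared (same formula as above, only the link on the mean swapped; dat is again a placeholder):

model_cloglog <- brm(
  TSR_score_squeezed ~ PredictorA + PredictorB + PredictorC +
    PredictorA:PredictorB + PredictorA:PredictorC +
    (1 | ID) +
    (PredictorA + PredictorB + PredictorC +
       PredictorA:PredictorB + PredictorA:PredictorC | Sentence),
  family = Beta(link = "cloglog"),
  data = dat
)

# pp_check plots look nearly identical across links,
# yet LOO still distinguishes the models
loo_compare(loo(model), loo(model_cloglog))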
My primary concerns are: (1) Do the posterior predictive plots indicate a fundamental misspecification in my modeling approach? (2) Given the left-skewed, potentially multimodal nature of the L2 speaker data, what modeling approach would you recommend instead?
I’m still quite naive about Bayesian statistics, so any advice or insights would be greatly appreciated. Thank you in advance for your help! Please let me know if you need more information.