Multi-modal distribution (bounded 0-100)

Dear all,
Thanks for your help in advance!
I am working on speech data: I record subjects saying a couple of sentences, then measure acoustic parameters such as speech duration, speech intensity, etc. In the following case I measure voicing percentage (bounded 0-100). This parameter is the percentage of frames in a segment that are voiced; a segment at 100% is completely voiced, and a segment at 0% is completely unvoiced.

I tried modelling this using default priors and gaussian family as in:

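library(brms)

# Gaussian model of voicing percentage with default priors
p_voiced_1 <- brm(
  percentvoiced_f ~ position*voicing*target_vowel+poa+
    (1| Filename) +   # random intercept per recording
    (1| word),        # random intercept per word
  data = fric_,
  sample_prior = TRUE,
  family = gaussian(),
  cores = 8,
  iter = 1000,
  control = list(adapt_delta = 0.999, max_treedepth = 15),
  seed = 1432)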

This produces the following plot.

[Image: Rplot118]

I also tried other families such as zero_one_inflated_beta() and binomial(), but neither worked.
Could you please advise me on what other families or approaches I should try next?
Thanks!

It seems to me like you have fractions where the denominator is either 1, 2 or 4? If you have the denominator available, you could use a logistic regression with family set to binomial, or beta-binomial (not sure whether it is available in brms now?). Something like numerator | trials(denominator) ~ ....

The normal distribution is not really suitable for bounded (and discrete) data.
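
A rough sketch of the binomial suggestion, assuming the per-segment frame counts are still available. The column names `n_voiced` and `n_frames`, the toy rows, and the wrapper `fit_voicing_binomial` are hypothetical; the predictors are copied from the original model. `trials()` is the brms addition term for the number of trials.

```r
# Toy rows standing in for the real data; n_voiced and n_frames are
# hypothetical column names for voiced frames and total frames per segment.
toy <- data.frame(
  Filename = c("s01", "s01", "s02", "s02"),
  word     = c("w1", "w2", "w1", "w2"),
  voicing  = c("voiced", "unvoiced", "voiced", "unvoiced"),
  n_voiced = c(18L, 3L, 20L, 0L),
  n_frames = c(20L, 20L, 20L, 16L)
)

# The percentage is just the ratio of the two counts, so the counts carry
# at least as much information as the percentage does.
toy$percentvoiced_f <- 100 * toy$n_voiced / toy$n_frames

# Wrapper around brms::brm(); defined here, but the model is only fitted
# when the function is called on data that has all the predictors.
fit_voicing_binomial <- function(data, ...) {
  brms::brm(
    n_voiced | trials(n_frames) ~ position * voicing * target_vowel + poa +
      (1 | Filename) + (1 | word),
    data = data,
    family = binomial(),
    ...
  )
}
```

Calling `fit_voicing_binomial(fric_, cores = 4)` would fit this to the real data, provided it contains the two count columns. The coefficients are then on the log-odds scale, but the model still describes how the expected voicing proportion differs between predictor levels.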

Thanks for this @StaffanBetner!
The research question I am trying to answer can’t be answered by logistic regression. So… other solutions are appreciated.

Many thanks again!

Why not?

My response variable is not binary, and I am not looking for a binary answer. It is continuous from 0 to 100, and I hypothesize that the voicing percentage percentvoiced_f varies with some of the predictors investigated. More specifically, the predictor voicing has two levels, voiced and unvoiced. For the voiced category I predict a high percentage of voicing, not 100% but something above 80%. For the unvoiced category I hypothesize a lower voicing percentage, around 20% or so. The same holds for the other predictors in the model, which are all categorical with different levels. In addition, this model is part of a series in which I just run regular linear regressions, which is common in my field, so I want to be consistent. I hope this makes it clear.
Thank you again for your time!

It sounds like your raw data are binary at the frame level and binomial at the segment level. So you could either analyze the data at the frame level as bernoulli distributed (with logistic regression) or at the segment level as binomial distributed (with a logit link). Of course, you lose the trial counts when you transform the data to a percentage. But you should be able to model the data as described by @StaffanBetner.
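
To illustrate the two levels described above, here is a small base R sketch with made-up frame-level data (the `segment` and `voiced` columns are hypothetical). Summing the 0/1 frames gives the binomial counts per segment, while the percentage on its own drops the number of frames.

```r
# Made-up frame-level data: one row per frame, voiced coded as 0/1.
frames <- data.frame(
  segment = rep(c("a", "b", "c"), times = c(4, 2, 1)),
  voiced  = c(1, 1, 1, 0,  1, 0,  1)
)

# Segment level: number of voiced frames and total frames per segment.
n_voiced <- tapply(frames$voiced, frames$segment, sum)
n_frames <- tapply(frames$voiced, frames$segment, length)
segments <- data.frame(
  segment  = names(n_voiced),
  n_voiced = as.integer(n_voiced),
  n_frames = as.integer(n_frames)
)

# The percentage discards n_frames: 3 of 4 and 75 of 100 both give 75.
segments$percent <- 100 * segments$n_voiced / segments$n_frames

# If the percentage and n_frames are both kept, the count can be recovered.
segments$n_voiced_back <-
  as.integer(round(segments$percent / 100 * segments$n_frames))
```

The `frames` table is what a bernoulli model would use, and `segments` (with `n_voiced` and `n_frames`) is what a binomial model would use; both carry the same information.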

By “regular linear regressions” I assume you mean a gaussian distribution with an identity link? If that is the case, I’m not sure how you could achieve a more compatible posterior predictive check.

Many thanks all for this. I will give it a go and see how it goes.

You can try my package ordbetareg, which is available on CRAN.
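
A hedged sketch of how that might look for this data set, assuming that `ordbetareg()` and its `true_bounds` argument work as described in the package documentation (worth checking against the installed version). The wrapper name `fit_voicing_ordbeta` is hypothetical, and the formula reuses the predictors from the original model.

```r
# Same predictors as the original model, kept as a plain formula object.
voicing_formula <- percentvoiced_f ~ position * voicing * target_vowel + poa +
  (1 | Filename) + (1 | word)

# Wrapper around ordbetareg::ordbetareg(); defined here, run only when called.
# true_bounds gives the theoretical limits of the outcome (0 and 100), which
# matters if the observed data never reach one of them.
fit_voicing_ordbeta <- function(data, ...) {
  ordbetareg::ordbetareg(
    formula = voicing_formula,
    data = data,
    true_bounds = c(0, 100),
    ...
  )
}
```

Something like `fit_voicing_ordbeta(fric_, cores = 4)` would then fit the model on the original 0-100 outcome, without needing the frame counts.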
