Gaussian Likelihood over Beta for Proportion Outcome

Hello Everyone,

As this is my first post, I hope I am doing a good job asking this question.
I am currently building models for my thesis on fault prediction, and after some work I am now wondering whether my choice of a Beta likelihood is the right one.

Both outcome variables I have are proportions, so they are between 0 and 1. Both are metrics from fault prediction/localization that represent proportions of code.
The restriction to (0, 1) and the fact that they could be interpreted as probabilities made me go with a Beta likelihood.

I played around with projpred to get a sense of which predictors to use, and then iterated between prior sensitivity analysis and model building. I would build some models, compare them with loo_compare (via brms), and then try to find sensible priors by plotting them with different values.
After arriving at what I thought was the best model I could come up with, I discovered the pp_check function from bayesplot, saw that my posterior predictive did not fit the data all that well, and tried a Gaussian likelihood with mostly the same priors instead.
It outperformed the Beta model by a lot.

Since then I have figured out that I can improve the loo performance of the Beta model a lot by playing around with the priors, but I am now wondering whether the Beta was the right choice in the first place, and whether tweaking prior values like this is “allowed” as long as the models sample fine.
I put the shortened code for the first Gaussian and Beta models below in case it helps to understand what I did.

# Beta likelihood (logit link in brms); phi is the precision parameter
m.beta <- brm(
  formula = EXAM ~ 1 + Weighting + LOC + Origin + (1|Project) + (1|Domain) + (1|Language),
  data = ls.df,
  family = Beta(),
  prior = c(
    prior(normal(0, 10), class = Intercept),
    prior(normal(0, 0.05), class = b),
    prior(cauchy(0, 0.05), class = sd),
    prior(gamma(10, 10), class = phi)
  )
)

# Gaussian likelihood (identity link); sigma is the residual standard deviation
m.gauss <- brm(
  formula = EXAM ~ 1 + Weighting + LOC + Origin + (1|Project) + (1|Domain) + (1|Language),
  data = ls.df,
  family = gaussian(),
  prior = c(
    prior(normal(0, 1), class = Intercept),
    prior(normal(0, 0.05), class = b),
    prior(cauchy(0, 0.05), class = sigma),
    prior(cauchy(0, 0.05), class = sd)
  )
)
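
In case it helps, the comparison itself was roughly along these lines (a minimal sketch; the model names are the ones above, and ndraws needs a recent brms version):

# Compare predictive performance with PSIS-LOO
loo_compare(loo(m.beta), loo(m.gauss))

# Posterior predictive checks (pp_check from bayesplot, via brms)
pp_check(m.beta, ndraws = 100)
pp_check(m.gauss, ndraws = 100)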

Do these proportions come from counts of binary trials or something? Like, for one line of code or section of code or whatever, you see 43 failures out of 115 trials, so the proportion is 43/115?

If this is the type of data you have, then you can talk about the outcome as binomially distributed, with a probability of success that’s a function of your covariates. There’s also a normal approximation that makes sense for large numbers of trials, but the first question is whether you have the trial counts.
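
If you did have counts like that, a brms version would look something like this; purely a sketch, with made-up columns n_faults and n_lines just to show the shape:

# Hypothetical: faults out of inspected lines, binomial likelihood (logit link)
m.binom <- brm(
  n_faults | trials(n_lines) ~ 1 + Weighting + Origin + (1|Project) + (1|Domain) + (1|Language),
  data = ls.df,
  family = binomial()
)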

I don’t think you can interpret them like this.
The first one, the EXAM score, gives the proportion of lines one would have to inspect to find each fault, averaged across all faults.
The other is the area under the cost-effectiveness curve. A little like an ROC curve, it plots the % of lines inspected on the x axis and the % of faults one would find on the y axis; the area under the curve summarizes the shape of the curve.

So while they are kind of related to probabilities of finding faults in higher ranked lines of code, they are not themselves probabilities I think.

If you’re not stuck with these metrics, there’s probably a way to rework this a little.

Like if you have data somewhere that gives you the number of faults in however many lines of code, you could model that as a binomial. That would assume a fault happens at each line of code independently, or whatever.

If the number of lines of code is large, then you could do some sort of Poisson counting thing: Poisson distribution - Wikipedia. The assumption there, again, is that faults are generated independently of each other.

And if you needed to deviate from these, there are overdispersed versions of those distributions (Overdispersion - Wikipedia).
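
In brms terms that might look something like this; again a sketch with a made-up n_faults column, and it assumes LOC is the raw line count so that offset(log(LOC)) can act as the exposure:

# Hypothetical: fault counts with lines of code as the exposure
m.pois <- brm(
  n_faults ~ 1 + Weighting + Origin + offset(log(LOC)) + (1|Project) + (1|Domain) + (1|Language),
  data = ls.df,
  family = poisson()   # or negbinomial() if the counts look overdispersed
)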

I really don’t know much about this myself, but even if it seems a little annoying to throw away metrics people are used to, average rates of occurrence are pretty easy to understand, and I’d guess it’ll make the coefficients in the regression easier to interpret.

With regard to the beta regression: you probably wanna stick with it for a bit and figure out more about what’s going wrong. The answer to is-a-normal-appropriate is probably in the context of what’s screwy with the binomial. I’d assume that a normal would be fine as long as things stay away from 0 and 1, but it’s not clear to me (other than computationally) why a normal would be working better here, so I’d guess the investigation is worthwhile.

Things to try might be:

  1. Here’s an rstanarm vignette on the Beta regression: Modeling Rates/Proportions using Beta Regression with rstanarm. You could try your model in rstanarm and see if anything different happens.

  2. You could try simplifying your model and seeing where things break (try one project or domain or something).

  3. Write your own Stan code for the model (or generate it and the data for it using make_stancode/make_standata in brms) and make sure everything looks good. You could also simulate data from your model and fit it back to make sure everything is sane; rough sketch below.
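
For 3., a rough starting point might be something like this (using the beta model from above; sample_prior = "only" is just one way to get simulated data, and update() will recompile the model):

# Inspect the Stan code and data brms would generate for the beta model
code <- make_stancode(
  EXAM ~ 1 + Weighting + LOC + Origin + (1|Project) + (1|Domain) + (1|Language),
  data = ls.df, family = Beta()
)
sdata <- make_standata(
  EXAM ~ 1 + Weighting + LOC + Origin + (1|Project) + (1|Domain) + (1|Language),
  data = ls.df, family = Beta()
)

# Simulate an outcome from the priors only, then fit the model back to it
m.prior <- update(m.beta, sample_prior = "only")
ls.sim <- ls.df
ls.sim$EXAM <- as.vector(posterior_predict(m.prior, ndraws = 1))
m.refit <- update(m.beta, newdata = ls.sim)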


For my thesis I will probably be stuck with these metrics, but I have found them inadequate in some regards, so thinking about other ways to do this is something I might look into afterwards.

I don’t think complete independence can be assumed, but one could probably work some clustering magic to get close to it. Thank you for the idea!

Another thought I just had: would it help to transform the outcomes to the log scale? That would move them from (0, 1) to (-inf, 0), or to (0, inf) when multiplied by -1.
This could be modeled with a half-normal, for example, which might work better computationally than the Beta.
It would make interpretation even harder, though.
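
Roughly what I have in mind, as a sketch; as far as I can tell brms has no half-normal response family, so this uses a lognormal on the transformed outcome as a stand-in:

# Move EXAM from (0, 1) to (0, Inf) via -log, then use a positive-support family
ls.df$neg_log_EXAM <- -log(ls.df$EXAM)

m.translog <- brm(
  neg_log_EXAM ~ 1 + Weighting + LOC + Origin + (1|Project) + (1|Domain) + (1|Language),
  data = ls.df,
  family = lognormal()   # one option among several positive-support families
)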

Thank you for your input again. I think I will experiment with the Beta further and see if I can make it work.

That kinda rang a bell and got me thinking. I think this is related to doing a normal approximation to the binomial. I think I saw this in a class once.

But anyway, say you have a model:

$$
\begin{aligned}
a &\sim \text{normal}(0, 1) \\
b &\sim \text{normal}(0, 1) \\
\text{logit}(p_i) &= a x_i + b \\
y_i &\sim \text{binomial}(n_i, p_i)
\end{aligned}
$$

And if you’re doing a normal approximation to the binomial you just match the mean and variance:

$$
y_i \sim \text{normal}\left(n_i p_i,\ n_i p_i (1 - p_i)\right)
$$

And you’re saying your data is proportions, so that’s something like:

$$
\frac{y_i}{n_i} \sim \text{normal}\left(p_i,\ \frac{p_i (1 - p_i)}{n_i}\right)
$$

I think what I’ve seen before here is, instead of using $\frac{p_i (1 - p_i)}{n_i}$ on the right side, to just replace that $p_i$ with $\frac{y_i}{n_i}$. So you’re kinda using the MLE estimate of each individual $p_i$ to roughly fix the precision of your problem. You’d still have to have some idea what the $n_i$ values are, or at least be willing to guess.

Anyway, whatever you fix your standard deviations at (and depending on how you do it you’d have a different standard deviation for every data point: for every $y_i$ there’d be a $\sigma_i$, and $\sigma_i$ would be a function of the data, not something inferred), your model now looks like:

$$
\begin{aligned}
a &\sim \text{normal}(0, 1) \\
b &\sim \text{normal}(0, 1) \\
\text{logit}(p_i) &= a x_i + b \\
\frac{y_i}{n_i} &\sim \text{normal}(p_i, \sigma_i)
\end{aligned}
$$

I think with the transforms you’re talking about you’d end up needing a standard deviation parameter; this is where that parameter would come from with a normal approximation, at least. Where I saw this was a class where we were writing Gibbs samplers (so making everything normal was super handy), so you can probably just skip fixing the standard deviations or whatnot, I guess.
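
If you did want the fixed-$\sigma_i$ version in brms rather than raw Stan, I think the se() addition term is the mechanism for it (it fixes each observation’s standard deviation from a data column instead of estimating sigma). Purely a sketch, with a made-up n_trials column and an identity link rather than the logit above:

# Plug-in sd for each proportion: sd_i = sqrt(p_hat_i * (1 - p_hat_i) / n_i)
ls.df$prop_sd <- sqrt(ls.df$EXAM * (1 - ls.df$EXAM) / ls.df$n_trials)

m.approx <- brm(
  EXAM | se(prop_sd) ~ 1 + Weighting + LOC + Origin + (1|Project) + (1|Domain) + (1|Language),
  data = ls.df,
  family = gaussian()   # with se(), sigma is fixed per observation, not estimated
)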

Hopefully that’s not too confusing or incorrect to be useful :D.