Help with likelihoods

Hi,
I wanted to ask for some input on my choice of likelihoods for different models I am building. Also, I’ll be using brms and some of the options I found through Wikipedia are not listed in the brms family list so maybe I’d go with the ones that are?

1.

The outcome is a positive integer representing the number of positive cases in the first n positions of a ranked list. I also know the total number of positive cases, although it can differ between samples.
I would probably model this with a Poisson distribution, although the hypergeometric, binomial and negative-binomial distributions also sound like possible candidates.
Edit: Turns out I don’t know the maximum number of positive cases, so probably just a Poisson?
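
One thing that may be worth checking before settling on a Poisson: the count in the first n positions can never exceed n, while a Poisson puts some probability above n. A minimal sketch using only the Python standard library, comparing a Poisson with a binomial of the same mean. The values of `n` and `p` are made up for illustration, and the binomial assumes every position has the same, independent chance of being a positive case, which a ranked list will not satisfy exactly.

```python
import math

def poisson_pmf(k: int, lam: float) -> float:
    """Probability of observing k events under Poisson(lam)."""
    return math.exp(-lam) * lam**k / math.factorial(k)

def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of k successes in n independent trials with success probability p."""
    return math.comb(n, k) * p**k * (1.0 - p) ** (n - k)

n = 10       # hypothetical cutoff: only the first n positions are inspected
p = 0.3      # hypothetical per-position probability of a positive case
lam = n * p  # Poisson rate chosen to match the binomial mean

# Probability the Poisson assigns to impossible counts (more than n hits).
poisson_mass_above_n = 1.0 - sum(poisson_pmf(k, lam) for k in range(n + 1))

# The binomial places all of its mass on 0..n by construction.
binomial_total = sum(binomial_pmf(k, n, p) for k in range(n + 1))

print(f"Poisson mass above n: {poisson_mass_above_n:.6f}")
print(f"Binomial mass on 0..n: {binomial_total:.6f}")
```

When the expected count is small relative to n the mass above n is tiny and the Poisson is a reasonable approximation; when counts get close to n, a bounded likelihood is likely the safer choice.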

2.

The outcome represents the area under a curve and lies between but not including 0 and 1. I am pretty sure I will model this with a beta distribution.

2.1

I also have cases where I only look at the first part of the curve, so that the areas lie between e.g. 0 and 0.05, and in some cases the area can be exactly 0.
Here I don’t know how to model it, as I can’t just scale it to 0-1 and use a beta again because of the zeros.
Would a transformation like exp(var)/e be viable to scale it to 0-1 and use a beta or is there a better solution?
I thought about a zero-inflated beta, but am not sure what to use to model the zi term.
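
To make the zero-inflated beta idea concrete, here is a minimal sketch of its log-likelihood using only the Python standard library. It assumes the mean/precision parametrisation (`mu`, `phi`) and a single constant zero probability `zi`; the data values and parameter values are invented for illustration. The areas are first divided by the known upper limit (0.05 here) so that they fall in [0, 1). As a side note, exp(x)/e maps 0 to 1/e rather than into a usable spread, so it only moves the zeros without resolving the point mass.

```python
import math

def beta_logpdf(y: float, mu: float, phi: float) -> float:
    """Log density of a beta with mean mu and precision phi, for 0 < y < 1."""
    a, b = mu * phi, (1.0 - mu) * phi  # convert to the usual shape parameters
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return log_norm + (a - 1.0) * math.log(y) + (b - 1.0) * math.log1p(-y)

def zi_beta_loglik(y: float, zi: float, mu: float, phi: float) -> float:
    """Log-likelihood of one observation: a point mass at 0 mixed with a beta."""
    if y == 0.0:
        return math.log(zi)                          # observation is an exact zero
    return math.log1p(-zi) + beta_logpdf(y, mu, phi)  # observation is in (0, 1)

upper = 0.05                                  # upper limit of the partial area
raw_areas = [0.0, 0.012, 0.031, 0.0, 0.044]   # made-up example data
scaled = [a / upper for a in raw_areas]       # rescale to [0, 1)

# Hypothetical parameter values; in a real model these would be estimated.
total_loglik = sum(zi_beta_loglik(y, zi=0.4, mu=0.5, phi=2.0) for y in scaled)
print(f"total log-likelihood: {total_loglik:.4f}")
```

In this formulation `zi` is simply the probability of an exact zero, so the zi term can start as an intercept-only parameter and only get predictors if there is reason to think the zeros depend on something measured.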

3.

Final outcome is a run-time, so a positive float. I was thinking about a truncated normal, exponential or lognormal but don’t really have arguments for or against any of these due to lack of experience.

Thank you for any help in advance :)

This looks like an order statistic to me. How is the list ranked, and what defines a positive case?

Sure, a beta may be a good start. It’s a really flexible distribution.

You could use a zero inflated beta model or use the logit transform and consider using a normal or other distributions.

Normal, lognormal and exponential are not really interchangeable. They presuppose additive, multiplicative and exponential error respectively. The question to ask is what makes your runtime stochastic, and how that term scales or aggregates. You could also just try everything, see what fits and justify it afterwards. Personally, I think that’s data snooping and bad science, but it’s also par for the course.
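
If you do go down the try-everything route, a rough way to compare the three candidates is their maximised log-likelihood on the same data. The sketch below uses only the Python standard library and closed-form maximum likelihood estimates; the runtimes are made up, and it ignores truncation and any predictors, so treat it as a first look rather than a model comparison.

```python
import math

def normal_loglik(xs: list[float]) -> float:
    """Maximised log-likelihood of a normal fitted by maximum likelihood."""
    n = len(xs)
    mu = sum(xs) / n
    var = sum((x - mu) ** 2 for x in xs) / n
    return -0.5 * n * (math.log(2.0 * math.pi * var) + 1.0)

def lognormal_loglik(xs: list[float]) -> float:
    """Maximised lognormal log-likelihood: normal on log(x) plus the Jacobian term."""
    logs = [math.log(x) for x in xs]
    return normal_loglik(logs) - sum(logs)

def exponential_loglik(xs: list[float]) -> float:
    """Maximised exponential log-likelihood (rate = 1 / sample mean)."""
    n = len(xs)
    mean = sum(xs) / n
    return -n * (math.log(mean) + 1.0)

# Made-up runtimes in seconds: mostly fast, with a few slow outliers.
runtimes = [0.2, 0.3, 0.25, 0.4, 0.35, 0.3, 5.0, 0.28, 12.0, 0.33]

for name, fn in [("normal", normal_loglik),
                 ("lognormal", lognormal_loglik),
                 ("exponential", exponential_loglik)]:
    print(f"{name:12s} {fn(runtimes):8.2f}")
```

All three are computed on the original scale of the data, so the numbers are directly comparable; the normal and lognormal have two parameters and the exponential one, which matters if the values are close.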

Thanks for your input @emiruz

It is a list of all lines in a software project, ranked by the probability of containing a fault (as determined by some algorithm). So a higher-ranked line is supposed to have an increased likelihood of containing a fault.
A positive case would be if a given line actually contains a fault.

The algorithm does repository mining and the runtime is determined by how many commits it parses, how many lines those commits contain and what the content of those lines is.
The number of commits parsed is the same for each sample so they only differ by the number and content of the lines, something that I don’t measure. That does sound like an additive error to me.

However, I took a look at the data and it looks like most of the time the algorithm runs in under a second with some exceptions taking way longer:


This makes me think that it might be better to either use some transformation or a likelihood that can handle the sharp spike near 0.

It seems that you’re saying that you have a probability of a fault per line but you don’t want to treat it as an actual probability. You’d like to do a kind of regression on it to determine the relationship between the “fault probability” and the occurrence of a fault. I suspect you’ll have to model that as a conditional probability distribution, or in your case, a regression. For example, given that your fault is $y \in \{0,1\}$ and your “fault probability” is $x \in [0,1]$, you could start with:

$$y \sim \mathrm{Bernoulli}\Big(\mathrm{logit}^{-1}(\alpha + \beta x)\Big)$$
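
As a minimal sketch of what that likelihood computes, using only the Python standard library: the data and the values of `alpha` and `beta` below are hypothetical, and in practice the parameters would be estimated rather than fixed.

```python
import math

def inv_logit(z: float) -> float:
    """Inverse logit: maps any real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def bernoulli_logit_loglik(ys: list[int], xs: list[float],
                           alpha: float, beta: float) -> float:
    """Log-likelihood of binary outcomes ys given scores xs."""
    total = 0.0
    for y, x in zip(ys, xs):
        p = inv_logit(alpha + beta * x)  # modelled probability that y == 1
        total += math.log(p) if y == 1 else math.log1p(-p)
    return total

# Made-up data: algorithm score per line and whether the line was faulty.
scores = [0.9, 0.7, 0.4, 0.1]
faulty = [1, 1, 0, 0]

# Hypothetical parameter values, for illustration only.
print(bernoulli_logit_loglik(faulty, scores, alpha=-2.0, beta=4.0))
```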

I think it’d probably be more useful to think about this conditional on the number of lines in the repository. Your graph could just be an artifact of most repositories being small, for example. So perhaps try a scatter plot of runtime vs. number of lines first; I suspect the distribution conditional on the number of lines will look Gaussian IF the errors are random and additive.

I think my choice of words was vague here. The scores in the ranked list are not actually probabilities. They are scores derived by the algorithm and used for the ranking. A higher score means more fault-prone (in theory).

What I want to do is to compare the performance of two different algorithms based on different evaluation metrics that are common in the field of fault prediction. One of them being the number of actually faulty lines in the first n places of the ranked list.
So I am not working with the ranked lists themselves any more but just with the evaluation metrics.

So for each repository, different methods pick a different set of n items which are supposed to be faulty and for each method you calculate how many of those n items actually are faulty.

If that’s the case then the rate of fault is between 0 and 1 and you could consider the beta distribution to characterise any given method. Again, it strikes me as odd that repositories might vary in size but n stays fixed; this is bound to give weird results.

If all you want to do is compare methods then considering functions of their differences makes more sense to me, e.g. the ratio of method A vs method B scores.
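
A minimal sketch of that ratio idea, using only the Python standard library. The hit counts are invented, and working on the log scale is my assumption, chosen because it makes "A twice as good as B" and "B twice as good as A" symmetric. A zero count for either method would break the ratio and would need separate handling.

```python
import math

# Made-up hits in the top n for each method, one entry per repository.
hits_a = [3, 5, 2, 4]
hits_b = [2, 5, 1, 2]

# Per-repository log ratio of method A to method B.
log_ratios = [math.log(a / b) for a, b in zip(hits_a, hits_b)]

# Averaging on the log scale and transforming back gives the geometric mean ratio.
mean_log_ratio = sum(log_ratios) / len(log_ratios)
geometric_mean_ratio = math.exp(mean_log_ratio)

print(f"geometric mean ratio A/B: {geometric_mean_ratio:.3f}")
```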

You could also consider using things like the F1 score from ML which is pretty standard for binary classification.
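
For reference, a minimal sketch of the F1 computation for a top-n list, using only the Python standard library. The counts are hypothetical, and note that recall requires knowing the total number of faulty lines, which may not be available in every setup.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 score: harmonic mean of precision and recall."""
    if tp == 0:
        return 0.0  # no true positives: define F1 as 0 to avoid dividing by zero
    precision = tp / (tp + fp)  # share of flagged lines that are actually faulty
    recall = tp / (tp + fn)     # share of faulty lines that were flagged
    return 2.0 * precision * recall / (precision + recall)

# Hypothetical example: top n = 10 lines flagged, 3 of them faulty,
# and 5 faulty lines in the repository overall.
n, hits, total_faulty = 10, 3, 5
score = f1_score(tp=hits, fp=n - hits, fn=total_faulty - hits)
print(f"F1 = {score:.2f}")
```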

I agree with you here. The idea behind that metric is that someone using the list for some kind of inspection/intervention will only look at the first n entries on the result list, regardless of how long it is. There are other metrics that do more averaging across all list positions but in total, the world of fault prediction evaluation metrics could be improved.

Thanks for the ideas. I’ll try things out and see what works :)