# Non-linear distribution with some zeros in the data

Hello,
I want to model a non-linear outcome with a covariate, with age as a random slope. I wanted to use non-linear models in brms; however, the assumption there is that omega and theta are the same across age (in the vignette, across time). The outcome is a proportion (range 0 to 1), as shown below. I would appreciate advice on a model that suits this distribution. I wanted to use the gamma family, but there are zero values in the data. Any hints are appreciated.

Are you truly observing proportions as your most-raw data? Or do you have access to less-processed data consisting of zeroes and ones?

@mike-lawrence the data I need to look at is a proportion between 0 and 1, as shown in the histogram.

Ok, I asked because quite often folks want to achieve inference on the proportion scale and aggregate their binomial 0/1 data to proportions to serve as input to modelling. That is a mistake: inference on the proportion scale is most accurate when the model sees the raw binomial data.

Just double-checking: for each proportion, do you happen to have the number of observations/events that led to that proportion? If so, then it's trivial to work out the original counts of 1s/0s.
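For illustration, a minimal sketch of that back-conversion, written in Python only to show the arithmetic (the thread itself is about R/brms). The function name `proportions_to_counts` and the example values are hypothetical, not taken from the poster's data:

```python
def proportions_to_counts(proportions, n_trials):
    """Convert proportions back to (ones, zeros) count pairs.

    Assumes each proportion was computed as ones / n for a known n.
    """
    counts = []
    for p, n in zip(proportions, n_trials):
        ones = round(p * n)
        # Guard against a proportion that does not match a whole count.
        if abs(ones - p * n) > 1e-6:
            raise ValueError(f"{p} is not a whole count out of {n}")
        counts.append((ones, n - ones))
    return counts


# Hypothetical example: four individuals, 30 items each.
example_proportions = [0.0, 0.1, 0.5, 14 / 30]
example_counts = proportions_to_counts(example_proportions, [30] * 4)
print(example_counts)
```

The resulting pairs are what a binomial-type model would take as input in place of the aggregated proportion.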

@mike-lawrence I have the original counts. Someone developed this proportion metric and claims it is the best metric to capture health. I am arguing that it is not. So I have to model this proportion as the outcome with its predictors and compare that with a model based on the raw data. I want to make sure the model for the proportion is correct so that my argument is valid. Does this make sense?

I think I will use beta regression in brms.

Just to follow up on what Mike said: if you have the counts of observations and events, you can model these data as binomial, or as Poisson with an offset. You are still estimating a parameter for the proportion/rate, which is what you are interested in, and you explicitly take into account that 0s are possible.

@stijn This metric is built from 30 items scored 1/0 (yes/no), summed and divided by the total number of items (30) for each individual, which gives a proportion between 0 and 1. It is not a binary event.
Now I need to use this metric as the outcome in a model, and I wonder which family is best for this distribution. I tried the NL model in brms, and I also set the zeros to a very small number (0.0001) and used the gamma family, but neither performs well. The problem is that I am obliged to use this already-developed metric as the outcome to address reviewers' comments, so I need to make sure I use the right model for this distribution. Thanks for any input.

That sounds a lot like an outcome variable that follows a binomial distribution with success probability p and 30 trials, divided by 30, where you are interested in what explains p.
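A small sketch of why that description fits, in Python purely for illustration (the value `p_item = 0.05` is a hypothetical choice, not an estimate from the data): a 30-item score divided by 30 can take only 31 distinct values, and exact zeros have positive probability under the binomial.

```python
import math


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


n_items = 30
p_item = 0.05  # hypothetical per-item probability of a "yes"

# The proportion k / 30 lives on a discrete grid of 31 values.
support = [k / n_items for k in range(n_items + 1)]

# Exact zeros are a regular part of the distribution, not an anomaly.
prob_zero = binomial_pmf(0, n_items, p_item)

print(len(support), round(prob_zero, 4))
```

No adjustment such as replacing zeros with 0.0001 is needed, because the binomial likelihood is defined at a count of zero.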

@stijn A beta distribution? Because binomial or Bernoulli didn't work!

@ssalimi , I think that the suggestion of using a binomial model sounds like the right way to go, but it may help if you share your brms code. If the binomial doesn’t work or sounds wrong to you, it could be that we’re missing or misreading some information you have, or it could be a mistake in the brms code.

As others have said, it sounds like you can do better than using this aggregated data by using the raw counts. If however, for some reason you are required to use the proportions, and they contain zeroes, then you could try the `zero_inflated_beta` in brms. Based on the histogram that you showed, it doesn’t seem like your data contains any ones, but if so, you could try the `zero_one_inflated_beta`.

@jd_c I totally agree with you and others on using the raw data as counts. Indeed, the purpose is to defend this suggestion in my response to the reviewers. I need to fit the aggregate approach and compare models to show them that it is not optimal. I will try the binomial approach again. I will also compare `zero_inflated_beta` with beta.
Thanks a lot.

The binomial model was already mentioned, and if the data are over-dispersed compared to the binomial, then the beta-binomial (see the Wikipedia article on the beta-binomial distribution) is also available in Stan.

The histogram looks like there is some zero-inflation, which could be taken into account, too.

Not certain if these are available in brms directly, but at least the necessary components are available in Stan.
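To make the over-dispersion point concrete, here is a minimal Python sketch (illustration only; the shape parameters `a = 1` and `b = 4` are hypothetical) comparing a binomial and a beta-binomial with the same mean over 30 trials:

```python
import math


def log_beta(a, b):
    """Log of the Beta function, via log-gamma."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def beta_binomial_pmf(k, n, a, b):
    """P(K = k) when p ~ Beta(a, b) and K | p ~ Binomial(n, p)."""
    return math.comb(n, k) * math.exp(log_beta(k + a, n - k + b) - log_beta(a, b))


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


def variance(pmf_values):
    """Variance of a count distribution given its pmf over 0..n."""
    mean = sum(k * w for k, w in enumerate(pmf_values))
    return sum((k - mean) ** 2 * w for k, w in enumerate(pmf_values))


n = 30
a, b = 1.0, 4.0          # hypothetical Beta shape parameters
p = a / (a + b)          # matching binomial success probability (0.2)

bb = [beta_binomial_pmf(k, n, a, b) for k in range(n + 1)]
bi = [binomial_pmf(k, n, p) for k in range(n + 1)]

print("variance:", variance(bi), "vs", variance(bb))
print("P(count = 0):", bi[0], "vs", bb[0])
```

With the same mean, the beta-binomial has a much larger variance and puts far more mass on a count of zero, so some apparent zero-inflation may already be absorbed by over-dispersion before an explicit zero-inflation component is added.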

And in case you need references to bolster your assertion that a proper treatment of such data would involve a hierarchical model with a Bernoulli likelihood, it's been discussed extensively in the quantitative methods for psychology literature; here are a few key refs:

@avehtari Thanks for confirming the beta regression approach.

@mike-lawrence Thank you for the references. Much appreciated.

But I didn't! I said binomial and beta-binomial, which are both models for discrete counts with some maximum. Beta is for continuous data, but based on your description the data are discrete, so beta is the wrong model, and it can be especially bad when there are many zeros.

@avehtari Oops, my bad! Yes, I will use a zero-inflated beta-binomial.
Indeed, beta regression was quite unstable, with many Pareto k values above 0.7. I can communicate this in the paper.
I appreciate you correcting me.