Distribution of the response variable with skewed data containing small values and zeros

Zuzanna_Skora · October 19, 2018, 1:34pm

Hi!

I’m trying to run a brms model on continuous response data with two multi-level categorical predictors and a random intercept. My response data are distributed as in Figure 1. The minimum is 0 and the maximum is 180.

I believe the distribution that describes my data best is a half-normal, but I don’t really know how to implement it (it’s not among the predefined brms families). Distributions usually used for response time data (which are a bit similar to mine) require responses greater than 0.

I’ve tried fitting the model with the gaussian family, which resulted in a pretty bad fit (see the pp_check in Figure 2).

Do you have any idea how I could specify my response distribution more accurately?

Cheers,
Zuzia

Operating System: MacOS Mojave
brms Version: 2.5.0

paul.buerkner · October 19, 2018, 1:47pm

Can you describe your response variable a little bit more?

Zuzanna_Skora · October 19, 2018, 1:49pm

Yes, it’s data from a colour wheel experiment. The response is the absolute difference between the target colour and the chosen colour.

matti · October 19, 2018, 1:58pm

I have little expertise with those kinds of data, but a circular model might be appropriate, such as one using the von Mises distribution in brms.

Zuzanna_Skora · October 19, 2018, 2:04pm

Thanks for the reply!

So, the response was circular, but I figured that if I take only a part of that circle, i.e. the distance between two points on the wheel (my “correctness” variable), then it’s no longer circular.

Now that I’ve written it all down, it seems like it might be a mixture of a uniform and a gaussian.

simon.dp · October 19, 2018, 3:39pm

Interesting problem - it is always easiest if you post reproducible code: I think these few lines come pretty close.

By combining a half-normal and uniform distribution, I can generate data close to what you describe:
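The simulation code itself is not reproduced in this thread. A minimal base R sketch of how such data could be generated is given here, assuming a 30% share of uniform "guess" responses, subject-specific half-normal spreads, and 20 subjects with 50 trials each (all of these values are illustrative assumptions, not the ones behind the plots in this thread):

```r
# Hypothetical simulation: mixing weight, spreads and sample sizes are assumptions
set.seed(1)
n_id  <- 20                      # number of subjects
n_obs <- 50                      # trials per subject
n     <- n_id * n_obs

id_index <- rep(seq_len(n_id), each = n_obs)
sd_id    <- runif(n_id, 10, 30)  # subject-specific spread of the half-normal part

# Each trial is either a uniform "guess" or a half-normal "near miss"
is_guess <- rbinom(n, 1, 0.3) == 1
dist <- ifelse(is_guess,
               runif(n, 0, 180),
               abs(rnorm(n, 0, sd_id[id_index])))
dist <- pmin(dist, 180)          # keep responses inside the 0-180 range

sim <- data.frame(id = factor(id_index), dist = dist)
```

The resulting `sim` data frame has the `dist` and `id` columns used in the model calls that follow.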

If I run

# Default method (gaussian family)
library(brms)
m0 <- brm(dist ~ (1|id), data = sim, chains = 2, iter = 3000, warmup = 1000)
pp_check(m0)

I also seem to get a poor pp_check similar to yours:

However, log-transforming the response brings the fit closer (though still not perfect):

# Log method (gaussian family on the log of the response)
m0log <- brm(log(dist) ~ (1|id), data = sim, chains = 2, iter = 3000, warmup = 1000)
pp_check(m0log)

Does this work better on your data?

All the code is available here.

Also I am not sure if this is the ideal solution, but that is what I intuitively would try.

Zuzanna_Skora · October 22, 2018, 9:07am

Thanks Simon!

I tried running my model on the log of my outcome variable, as in your example, using the gaussian family with an “identity” link. Unfortunately, I got this error for all of my chains:

SAMPLING FOR MODEL ‘87a5d6d2435e3243f9c7416c4e9a45fe’ NOW (CHAIN 1).

Chain 1: Initialization between (-2, 2) failed after 100 attempts.

[1] “Error in sampler$call_sampler(args_list[[i]]) : Initialization failed.”

error occurred during calling the sampler; sampling not done
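A likely cause, assuming the response contains exact zeros (the minimum is 0): the log of zero is -Inf, so the gaussian log-likelihood is not finite for those rows and the sampler cannot find valid starting values. A small base R check with made-up values illustrates this:

```r
# Illustrative values only; the first one is an exact zero
y <- c(0, 0.5, 3, 42, 180)

log_y <- log(y)
log_y[1]                     # -Inf

# Gaussian log-density of the log-transformed response (mean and sd are arbitrary)
ll <- dnorm(log_y, mean = 1, sd = 1, log = TRUE)
is.finite(ll)                # FALSE for the zero, TRUE for the rest

# Number of rows that would break a log-response model
n_zero <- sum(y == 0)
```

Running the same `sum(y == 0)` check on the real response shows how many observations are affected.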

Zuzanna_Skora · November 8, 2018, 1:39pm

So, I’ve managed to solve the problem.

It seems that a truncated lognormal distribution describes this type of data best.

Before running the model, I added a constant of 1 to my response data to remove the zeros, as the lognormal doesn’t allow them.

Then I specified the boundaries of my response variable using the trunc() function.

m0_truncL <- brm(correctness | trunc(lb = 1, ub = 181) ~ (1|ID),
               data = results, family = lognormal(link = "identity"))
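For reference, the shift and the effect of truncation on the likelihood can be sketched in base R. This is only an illustration under assumed parameter values: `dtrunc_lnorm` is a hypothetical helper (not a brms function) that rescales the lognormal density so that it integrates to 1 over the bounds.

```r
# Shift the raw response (0..180) so that zeros become 1 (illustrative values)
raw     <- c(0, 2.5, 10, 180)
shifted <- raw + 1               # now in [1, 181]

# Hypothetical helper: lognormal density truncated to [lb, ub]
dtrunc_lnorm <- function(y, meanlog, sdlog, lb = 1, ub = 181) {
  # Probability mass the untruncated lognormal puts inside the bounds
  norm_const <- plnorm(ub, meanlog, sdlog) - plnorm(lb, meanlog, sdlog)
  dens <- dlnorm(y, meanlog, sdlog) / norm_const
  ifelse(y >= lb & y <= ub, dens, 0)
}

# The truncated density integrates to (approximately) 1 over the bounds
total <- integrate(dtrunc_lnorm, lower = 1, upper = 181,
                   meanlog = 2, sdlog = 1)$value
```

With the identity link, the linear predictor corresponds to `meanlog`, i.e. it lives on the log scale of the shifted response.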

The resulting pp_check looks way more reasonable:

Hope that helps anyone with a similar problem.

Cheers,
Zuzanna

LailaFr · October 9, 2019, 7:09am

Thanks for sharing, Zuzanna! I have a similar-looking distribution to yours. I have been trying to solve this distributional problem myself for a long time now, and your solution looks pretty good!
I was just wondering, did you also consider the negative binomial distribution (or zero-inflated models)? I find it hard to understand whether it is okay to use such a distribution for a non-count variable (but with count properties). Does anyone have an idea on this?

Many thanks in advance!
Best,
Laila

Zuzanna_Skora · October 31, 2019, 2:05am

No, I haven’t tried the negative binomial or the zero-inflated models (it’s small values in general that are the most probable, not necessarily just zeroes, so I wasn’t sure if that would be the right solution).
Cheers,
Zuzanna

MegBallard · November 2, 2019, 7:00am

You could try geometric or gamma families.
If you have zeros, it won’t allow you to use gamma distributions, but hurdle_gamma allows for zero values (where otherwise only positive reals are allowed).

Note, though, that the negative binomial (like the geometric) is a count distribution, so strictly speaking it does imply integer values.

In brms you get an error if you try to run a model with a non-integer response using family = poisson, and as far as I know the same check applies to negbinomial.
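To make the hurdle idea concrete, here is a rough base R sketch of a hurdle-gamma density: a point mass `hu` at zero plus a gamma density, scaled by `1 - hu`, for positive values. The helper name `dhurdle_gamma` and the parameter values are assumptions for illustration; the gamma part is parameterised by its mean `mu` and `shape`.

```r
# Hypothetical helper: point mass at zero plus a scaled gamma for y > 0
dhurdle_gamma <- function(y, hu, mu, shape) {
  ifelse(y == 0,
         hu,
         (1 - hu) * dgamma(y, shape = shape, rate = shape / mu))
}

hu    <- 0.1   # assumed probability of an exact zero
mu    <- 20    # assumed mean of the positive part
shape <- 2     # assumed gamma shape

# Mass of the continuous part, integrated over the positive reals
pos_mass <- integrate(
  function(y) (1 - hu) * dgamma(y, shape = shape, rate = shape / mu),
  lower = 0, upper = Inf
)$value

# Zero mass and positive mass together account for all probability
total <- hu + pos_mass
```

Unlike adding a constant, this treats the zeros as a separate process, which may or may not be appropriate when small positive values are just as common as exact zeros.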
