# Distribution of the response variable with skewed data containing small values and zeros

Hi!

I’m trying to run a brms model on continuous response data with two multi-level categorical predictors and a random intercept. The distribution of my response is shown in Figure 1; the minimum is 0 and the maximum is 180.

I believe the distribution that describes my data best is a half-normal, but I don’t really know how to implement it (it’s not among the predefined brms families). Distributions usually used for response time data (a bit similar to mine) require responses greater than 0, and my data contain zeros.

I’ve tried fitting the model with the gaussian family, which resulted in a pretty bad fit (see the pp_check in Figure 2).

Do you have any idea how I could specify my response distribution more accurately?

Cheers,
Zuzia

• Operating System: MacOS Mojave
• brms Version: 2.5.0

Can you describe your response variable a little bit more?

Yes, it’s data from a colour wheel experiment. So a difference between the target colour and the chosen colour in absolute terms.

I have little expertise with those kinds of data, but a circular model might be appropriate, such as using the von mises distribution in brms.
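For reference, a rough sketch of what that could look like. It assumes the signed error in degrees (from -180 to 180) is available rather than only its absolute value; the data frame `d` and its columns are made up for illustration. `von_mises()` is a brms family, and as far as I know it expects the response in radians between -pi and pi.

```r
set.seed(1)

# Hypothetical signed colour-wheel errors in degrees; not real data.
d <- data.frame(
  id      = factor(rep(1:10, each = 30)),
  err_deg = runif(300, min = -180, max = 180)
)

# Convert degrees to radians so the response lies in [-pi, pi].
d$err_rad <- d$err_deg * pi / 180

# The model call is wrapped in a function so nothing is fitted until it is called.
fit_von_mises <- function(data) {
  brms::brm(err_rad ~ (1 | id), data = data, family = brms::von_mises())
}
```

Calling `fit_von_mises(d)` would then fit the circular model, provided brms is installed.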

The original response was circular, but I figured that once I take the distance between two points on the wheel (my “correctness” variable), it’s no longer circular.

Now that I’ve written it all down, it seems like it might be a mixture of a uniform and a gaussian.

Interesting problem - it is always easiest if you post reproducible code: I think these few lines come pretty close.

By combining a half-normal and uniform distribution, I can generate data close to what you describe:
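The original snippet did not survive in this thread, so here is a minimal sketch of that kind of simulation, not the original code. The guessing rate (0.3), the half-normal scale (20), and the number of participants and trials are assumptions chosen only for illustration; the column names `dist` and `id` match the model calls used further down.

```r
set.seed(123)  # for reproducibility

n_id  <- 20            # assumed number of participants
n_obs <- 50            # assumed trials per participant
n     <- n_id * n_obs

# Each trial is either a guess (uniform over 0-180) or a "memory" response
# whose absolute error follows a half-normal distribution.
is_guess  <- rbinom(n, size = 1, prob = 0.3) == 1
half_norm <- abs(rnorm(n, mean = 0, sd = 20))
uniform   <- runif(n, min = 0, max = 180)

sim <- data.frame(
  id   = factor(rep(seq_len(n_id), each = n_obs)),
  dist = pmin(ifelse(is_guess, uniform, half_norm), 180)  # cap at the maximum
)

summary(sim$dist)
```

Note that simulated continuous values will essentially never be exactly 0, unlike real data recorded on a discrete scale.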

If I run

```r
# Default method
m0 <- brm(dist ~ (1 | id), data = sim, chains = 2, iter = 3000, warmup = 1000)
pp_check(m0)
```

I also seem to get a poor pp_check similar to yours:

However, a log function can make it closer (though still not perfect):

```r
# Log method
m0log <- brm(log(dist) ~ (1 | id), data = sim, chains = 2, iter = 3000, warmup = 1000)
pp_check(m0log)
```

Does this work better on your data?

All the code is available here.

Also I am not sure if this is the ideal solution, but that is what I intuitively would try.

Thanks Simon!

I’ve tried to run my model by taking the log of my outcome variable as in your example and using the gaussian family with the “identity” link. Unfortunately, I got this error for all of my chains:

SAMPLING FOR MODEL ‘87a5d6d2435e3243f9c7416c4e9a45fe’ NOW (CHAIN 1).

Chain 1: Initialization between (-2, 2) failed after 100 attempts.

[1] "Error in sampler$call_sampler(args_list[[i]]) : Initialization failed."

error occurred during calling the sampler; sampling not done
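A likely cause, assuming the response contains exact zeros (the minimum is 0): `log(0)` is `-Inf` in R, and a non-finite outcome would explain why Stan cannot find valid initial values. A quick check, with a hypothetical vector `y` standing in for the real response:

```r
# Hypothetical values; the real data are not shown in this thread.
y <- c(0, 0.5, 3, 12, 180)

log_y <- log(y)
log_y                    # the first element is -Inf

# Number of observations that would break the log-transformed model
n_bad <- sum(!is.finite(log_y))
n_bad
```

If the count is greater than 0 on the real data, the zeros have to be handled before a log transform or a lognormal family can be used.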

So, I’ve managed to solve the problem.

It seems that a truncated lognormal distribution describes this type of data best.

Before running the model, I added a constant of 1 to my response data to remove the zeros, as the lognormal doesn’t allow them.

Then I specified the boundaries of my response variable using the trunc() term in the model formula.

```r
# Truncated lognormal: response shifted by 1, so the bounds are 1 and 181
m0_truncL <- brm(correctness | trunc(lb = 1, ub = 181) ~ (1 | ID),
                 data = results, family = lognormal(link = "identity"))
```
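For anyone reproducing this, the shift-by-one step described above might look roughly like the sketch below. The data frame here is made up and the raw column name `error` is an assumption; only `correctness`, `ID`, and the bounds 1 and 181 come from the model call.

```r
set.seed(1)

# Hypothetical stand-in for `results`; not the original data.
results <- data.frame(
  ID    = factor(rep(1:10, each = 20)),
  error = round(runif(200, min = 0, max = 180))  # raw absolute error, may be 0
)

# Shift by 1 so that every value is strictly positive.
results$correctness <- results$error + 1

# The truncation bounds must be shifted by the same constant.
lb <- 0 + 1
ub <- 180 + 1
stopifnot(all(results$correctness >= lb), all(results$correctness <= ub))

# Predictions are on the shifted scale; subtract 1 to get back to the raw scale.
to_raw_scale <- function(y_shifted) y_shifted - 1
```

Keep in mind that any summaries or predictions from the fitted model refer to the shifted variable until they are moved back with something like `to_raw_scale()`.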

The resulting pp_check looks way more reasonable:

Hope that helps anyone with a similar problem.

Cheers,
Zuzanna

Thanks for sharing, Zuzanna! I have a similar-looking distribution to yours. I have been trying to solve this distributional problem myself for a long time now, and your solution looks pretty good!
I was just wondering, did you also consider the negative binomial distribution (or zero-inflated models)? I find it hard to understand whether it is okay to use such a distribution for a non-count variable (but with count properties). Does anyone have an idea on this?
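One option that may be worth a look, offered as a sketch rather than a recommendation: the negative binomial is defined for non-negative integer counts, so it does not apply directly to a continuous outcome. For continuous data with exact zeros, brms provides hurdle families such as `hurdle_lognormal()`, which model the zeros and the positive values separately. The data frame `d` and the 10% zero rate below are made up for illustration.

```r
set.seed(42)
n <- 200

# Hypothetical continuous outcome with some exact zeros; not real data.
is_zero <- rbinom(n, size = 1, prob = 0.1) == 1
d <- data.frame(
  id = factor(rep(1:10, each = 20)),
  y  = ifelse(is_zero, 0, rlnorm(n, meanlog = 2, sdlog = 1))
)

mean(d$y == 0)  # observed share of zeros

# The model call is wrapped in a function so nothing is fitted until it is called.
fit_hurdle <- function(data) {
  brms::brm(y ~ (1 | id), data = data, family = brms::hurdle_lognormal())
}
```

Unlike the truncated lognormal above, this sketch does not account for an upper bound such as 180, so it would only suit data without a hard maximum.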