How to choose the likelihood function for distance data?

Dear Stan/brms community,

May I ask you for help?
I fit a multilevel/mixed-effects model in brms. The predictor is a condition variable with two levels, C and D, and the outcome is the distance that participants walk from a reference point. Here is the brms syntax:

bm1 <-
  brm(data = d,
      distance ~ condition + (1 | ppid),  # family defaults to gaussian()
      iter = 4000,
      warmup = 2000,
      chains = 4,
      cores = 4,
      seed = 941129)

I have no model fitting issues. However, when I performed the pp_check, there appear to be large discrepancies between the observed data and the simulated data.

I think I should choose a different likelihood function/family, but I have no idea which one.
Does anyone know what model family I should choose, or, more generally, how to solve this problem?
Thank you very much in advance!

Since I have no model fitting issues, I wonder whether it is OK to keep using the Gaussian likelihood, or whether it would be better to use a more appropriate one?

Hi @HuaiyuLiu,
Your choice of likelihood brings some assumptions with it. In your case, the default (a Gaussian) assumes that negative values are possible. I would guess that negative distances are impossible in your data. You could check out some of the continuous positive families, such as the log-normal or the gamma. You can find all supported families here, with a quick explanation under Details of what each can be useful for.
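In brms this amounts to setting the `family` argument of `brm()`. A minimal sketch, reusing the data frame `d` and grouping variable `ppid` from the original post (untested here; sampler settings are illustrative):

```r
library(brms)

# Refit with a strictly positive continuous family
# instead of the default gaussian()
bm_lognormal <-
  brm(distance ~ condition + (1 | ppid),
      data = d,
      family = lognormal(),        # alternative: Gamma(link = "log")
      iter = 4000, warmup = 2000,
      chains = 4, cores = 4,
      seed = 941129)

# Re-run the posterior predictive check to see whether the fit improved
pp_check(bm_lognormal, ndraws = 100)
```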

While you might have no sampling issues (which is what I assume you mean by no fitting issues), your pp_check shows that you definitely have a somewhat misspecified model at hand, so you might want to iterate on it a bit.

1 Like

Hi @scholz , thank you very much for your reply. Indeed, negative values are impossible in my data. I agree that the Gaussian likelihood is somewhat misspecified for my data, and I will check out the other continuous positive families.

Also, is your data continuous or discrete? Based on the density plot you show, it almost looks like it starts at 1, and maybe goes 1, 2, 3…


@jd_c makes a good point. If the data is discrete, a negbinomial or poisson would be possible choices.


@HuaiyuLiu if the data is discrete then, as @scholz says, you may want to try negbinomial or poisson, and if a value of zero is also not possible (for example, participants will always walk at least 1 unit per the experiment), then you might want to use a truncated distribution.

Hi @jd_c , thank you very much for your reply. However, my distance data is continuous, so it could be 2 meters, 2.234 meters, etc. I guess a negbinomial or poisson would therefore be inappropriate?

Hi @scholz , sorry to bother you again. I’ve taken a look at the supported families that you shared with me. It seems that a log-normal, a gamma, or a skew_normal likelihood would be plausible for my data. I am a bit confused; may I ask for your suggestion? All three of these distributions look quite similar to my observed data.

Correct. If it’s continuous, then you don’t want to use negative binomial or Poisson.


A skew_normal still allows negative values, so I would try the other two and compare the models via loo_compare and the pp_checks to see if there is a meaningful difference (likely not).
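A hedged sketch of that comparison workflow, assuming both candidate models have already been fit (the object names `bm_lognormal` and `bm_gamma` are illustrative):

```r
# PSIS-LOO cross-validation for each candidate model
loo_ln <- loo(bm_lognormal)
loo_ga <- loo(bm_gamma)

# Models are ordered best-first; an elpd_diff within roughly 2 standard
# errors of zero usually means no meaningful difference
loo_compare(loo_ln, loo_ga)

# Visual check: overlay posterior predictive draws on the observed density
pp_check(bm_lognormal, ndraws = 100)
pp_check(bm_gamma, ndraws = 100)
```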

Got it, makes perfect sense! Many thanks

As @scholz mentioned, you might have to fit different models to see which one describes your data best. This can be especially hard to judge with multilevel data, since skew etc. can arise from combining the different levels.
Also, concerning the distributions: they differ in their properties; in particular, the relationship between the mean/median and the variance differs between the log-normal and the gamma.

Sometimes you can also decide for one distribution based on the underlying data generating process (not always easy to know though).
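To make that mean/median difference concrete: for a log-normal with parameters mu and sigma, the median is exp(mu) and the mean is exp(mu + sigma^2/2), so the mean always sits above the median; for a gamma with shape a and rate b, the mean is a/b, while the median has no closed form. A quick sketch with base R (the parameter values are arbitrary):

```r
# Log-normal: median = exp(mu), mean = exp(mu + sigma^2 / 2)
mu <- 0.5; sigma <- 0.8
exp(mu)                     # median
exp(mu + sigma^2 / 2)       # mean, always larger (right skew)
qlnorm(0.5, mu, sigma)      # numeric median, matches exp(mu)

# Gamma: mean = shape / rate; the median is only available numerically
shape <- 2; rate <- 1.5
shape / rate                # mean
qgamma(0.5, shape, rate)    # median
```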

@jd_c also makes a valid point. Even if your data is continuous, it can have a lower bound. Is your measurement limited to a specific minimum value? If so, you might want to include a boundary condition in the model.

Hi @danielparthier , thank you very much for your reply and the nice suggestions. I would actually like to try including a boundary condition in the model, as @jd_c suggested. So I guess I should use ‘DV | trunc(lb = xx) ~ condition + (1 | ppid)’ in my brms model, right?
But I have two new questions:
(1) What does this trunc do? Does it customize the likelihood function, e.g., turn a Gaussian likelihood into one that only allows positive values?
(2) How do I use a truncated likelihood in brms?

@scholz , I tried both the gamma and the lognormal families. The pp_check now looks much better (the upper one is the Gamma model and the bottom one is the lognormal model).

And model comparison shows that the Gamma model is slightly better than the log-normal one.
However, what surprises me is that when I examine the pp_check for the median, both the gamma and the lognormal fail to capture this statistic of the raw data (again, the upper one is the Gamma model and the lower one is the lognormal model).

I am confused: is this normal for the Gamma or lognormal families? I would very much appreciate any suggestions.
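For reference, a median-focused check like the one described can be produced with the `stat` type of `pp_check` (a sketch; the model object names are illustrative):

```r
# Posterior predictive check on a summary statistic (here: the median)
pp_check(bm_gamma,     type = "stat", stat = "median", ndraws = 1000)
pp_check(bm_lognormal, type = "stat", stat = "median", ndraws = 1000)
# If the observed median lies in the tail of the histogram of simulated
# medians, the model does not capture that aspect of the data well
```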

What’s happening here is that the best-fitting lognormals and gammas yield medians that are substantially higher than the median of the data. This is strong evidence for some degree of misspecification. The data seem to have a huge peak near 1 that isn’t getting captured well by the model.

You can view what the truncation is doing by using the stancode() function on your model and viewing the likelihood part of the Stan code that is generated from the brms call.
You would include truncation as you suggested. However, I would not use a Gaussian and then truncate it at zero for example, just to get positive values. I would try something like lognormal, as suggested. The reason to use truncation would be if say values less than 1 meter were not possible, or maybe some upper bound.
I can see you tried gamma and lognormal, and as @jsocolar says, you can see there is misspecification. It would be a good idea to go back to step one and think about the process by which these data are generated. Is a value less than 1 meter possible? Is there an upper bound, either naturally (perhaps participants arrive at some point) or via some measurement process? Why are there so many apparent ones (or values near one) in the data? Can the process that generates so many ones be explained and incorporated into the model?

Although you said earlier that the data is indeed continuous, the density plot makes it look like most of the data falls on 1, 2, and ~2.5-3. Is that so, and if so, why would that be? Are these actual measurements of distance, or are they self-reported measures where many people round to some integer?

Those are just example questions that might be good to think about. I would think hard about the data-generating process before I went hunting through gamma, lognormal, exponential, truncated gaussian, etc.
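If a hard lower bound does turn out to be justified, a truncated fit could look like the following sketch (assuming, purely for illustration, that distances below 1 meter are impossible):

```r
# Truncated log-normal: the likelihood is renormalized over [1, Inf)
bm_trunc <-
  brm(distance | trunc(lb = 1) ~ condition + (1 | ppid),
      data = d,
      family = lognormal(),
      iter = 4000, warmup = 2000,
      chains = 4, cores = 4)

# Inspect the generated Stan code to see how the truncation
# enters the likelihood
stancode(bm_trunc)
```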


Hi @jd_c @jsocolar , thanks for your valuable suggestions. I fully agree that I should think hard about the data-generating process before moving on. I will do that.

1 Like