How bad is this pp_check? Should I alter the distribution?

Hi! I am trying to estimate the effect of treatment on my response variable. I am using a Gaussian GLMM in brms in R. However, when I check my model with pp_check, my posterior predictions seem to underestimate the peakiness of my observed data.

Is this pp_check okay, or should I alter my model somehow to better fit the data? I was thinking that maybe I need to transform my normal distribution to better reflect the data’s peakiness?

model <- bf(response ~ treatment * distance + (1 | individual / trial), family = gaussian())

Yeah, that doesn’t look great. It appears that your model’s assumptions probably don’t align with what you know about your response variable. It looks like there are no negative values in your response, but the Gaussian distribution doesn’t know that and thinks there should be a bunch. So, some questions:

What is your response variable? Can it only be a positive number? Can it only be a non-negative integer (i.e. whole numbers including zero)?

Also, what do your other variables mean here? If you explain the structure and nature of your data better, we’ll be able to help you sort out your modelling troubles in a much more principled way.

Hi sjp!

My response variable is a difference between two angles, so response = angle1 - angle2. The response does have some negative observations, although far fewer than positive ones, and the data are right-skewed. I think I may have landed on a better model fit with the Student t distribution, as it does a better (although not perfect) job of simulating the “peaky-ness” of my data.


You could try family = skew_normal() (see the list of all available families on the brms documentation page “Special Family Functions for brms Models”, brmsfamily).

Would it make sense for your research question to work with the log ratio of the two angles instead of their difference? Skew normal is worth a try but my impression is it can handle only moderate skewness.

I’m definitely intrigued by this idea!

I don’t fully understand the upside of the log ratio of angles compared with a difference in angles. Is it meant to account for the skew, or are there other advantages to a log ratio response?

I am looking at acoustic bat echolocation data to determine which of two very closely spaced objects, object1 or object2, a bat directs the center of its echolocation beam closer to. So my response is the difference between angle1 (the angle between the echolocation call’s center and object 1) and angle2 (the angle between the echolocation call’s center and object 2).

I tried this and the model fits so much better using a log ratio response and the student t distribution! Thank you for your insight!!

With raw differences you have both positive and negative values, which prevents a log-transformation (or the use of the lognormal distribution) to handle the skewness. With the angle ratio you straddle 1 rather than zero, and with a log ratio you are looking at a multiplicative difference (for base 2, log(a1/a2) = 1 would indicate that a1 is twice as large as a2, so the call center is two thirds of the way toward object 2). Now that I wrote that, I am thinking that if the call center is always between the two objects, you could also work with a1/(a1+a2) and a Beta distribution.
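A quick numeric check of the base-2 reading above, as a minimal sketch in base R. The angle values are made up for illustration, and the last quantity only has the stated interpretation if the call center lies between the two objects.

```r
# Hypothetical angles (degrees) between the call center and each object
a1 <- 2
a2 <- 1

log_ratio  <- log2(a1 / a2)   # multiplicative difference on the log2 scale
ratio_back <- 2^log_ratio     # back-calculate the ratio a1 / a2
prop       <- a1 / (a1 + a2)  # fraction of the way from object 1 to object 2

print(c(log_ratio = log_ratio, ratio = ratio_back, prop = prop))
```

A log ratio of 1 corresponds to a ratio of 2 and a proportion of 2/3; a log ratio of 0 would mean the call center is equally far from both objects.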

Oh fun! Thank you so much!

The angular differences are very small, I was playing with the natural log initially, but the base 2 transformation is more intuitive for back calculating the ratio! I’m including the histogram of the log ratio base 2 transformation as well as the beta, just for fun!


I tried a beta regression model with the ratio angle1 / (angle1 + angle2), and my pp_check still doesn’t fit very well, again due to the “peakiness” of my data. What methods are possible for adjusting the beta regression to better fit my data?

If the distribution is the difference between two angles, wouldn’t a von Mises distribution be what you want? This has support in [0, 2pi]

My understanding was that von Mises helps account for circularity, i.e. 1° being closer to 359° than to 90°. Because I am comparing differences between two very similar angles, my data don’t occupy the entire circular range: the difference between the two angles ranges from -1° to 8°, so on the circle it would span 359° to 8°. I thought von Mises doesn’t perform well over very limited ranges with high concentration?

What is the treatment and the distance?

treatment1 is an object that has greater spatial complexity, like two balls that are textured and closely spaced, while treatment0 is a simple object, like one smooth ball.

Distance is how far the bat was from the object at each echolocation call. So a large distance is when the bat is far from the object, and distance gets smaller as it approaches.

But doesn’t each trial involve two objects (also two distances)?

Yup, the objects are super closely spaced, centimeters apart, and I measure the distance between the bat’s location and the object. They are so closely spaced that we don’t know whether the bat can tell via echolocation that there are two objects, so you can think of it as an “object complex” rather than two objects.

I look at the distance between the bat and the mean of the two objects (so the center of the object complex). For the simple object, I look at the distance between the bat and the center of the simple object. I look at echolocation behavior starting from when the bat is 5 m away to when it approaches.

If a trial in treatment1 involves two closely spaced objects and a1, a2 are the angles between the call’s center and each object, what are the angles in treatment0?

The edge of the ball that is present in both treatments. We want to know if the bat is consistently shifting its call towards the center of the object complex or shifting its attention between the edges of the object complex, whereas for the simple object we expect the bat’s call to be consistently in the center.

OK that really clears things up :)
It’s also a good demonstration for why the Beta would not make sense here. See how in treatment 1 the call center is to the right of the right point you calculate an angle for? The log-ratio would still make sense as it would register that as a call that is biased towards X2. But with calls whose center is outside the X1-X2 range, your two angles are not proportions of a full X1-bat-X2 angle.
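To make the point about calls outside the X1-X2 range concrete, here is a small base R sketch. It assumes a simplified planar setup in which everything is described by a single bearing (in degrees) as seen from the bat; the bearings and call centers are invented for illustration.

```r
# Hypothetical bearings (degrees) of the two points as seen from the bat
x1 <- 0
x2 <- 2
full_angle <- abs(x2 - x1)  # the full X1-bat-X2 angle

# Two hypothetical call centers: one between the points, one right of X2
center <- c(inside = 1.5, outside = 2.5)

a1 <- abs(center - x1)  # angle between call center and X1
a2 <- abs(center - x2)  # angle between call center and X2

prop      <- a1 / (a1 + a2)     # candidate Beta response
position  <- a1 / full_angle    # actual position as a fraction of the full angle
log_ratio <- log2(a1 / a2)      # log-ratio response

print(data.frame(a1, a2, sum = a1 + a2, prop, position, log_ratio))
```

For the call inside the range, a1 + a2 equals the full angle and prop matches the position. For the call outside the range, a1 + a2 exceeds the full angle, so prop still falls in (0, 1) but no longer describes where the call sits, while the log ratio remains positive and simply registers a bias towards X2.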

Let’s break this down. Assuming individuals are exposed to both treatments and given that distance changes during a trial, the index-type specification of the model would be:

response ~ 0 + treatment + treatment:distance + 
          (0 + treatment + treatment:distance | individual) + 
          (0 + treatment:distance | individual:trial)

You are estimating an intercept for each treatment and a slope for distance for each treatment, assuming the effect of distance is linear. You let both vary by individual and you let the slope vary by trial. I think the corresponding contrast-type specification should be:

response ~ 1 + treatment + distance + treatment:distance + 
          (1 + treatment + distance + treatment:distance | individual) + 
          (0 + distance | individual:trial)
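As a rough check that the two specifications describe the same population-level structure, here is a sketch using lm() on simulated data. This is only the fixed-effects part, fitted by least squares rather than with brms, and every number in the simulation is made up. In brms the two parameterizations put priors on different quantities, so posteriors need not match exactly.

```r
set.seed(1)

# Simulated stand-in data; all values are invented for illustration
n <- 200
d <- data.frame(
  treatment = factor(rep(c(0, 1), each = n / 2)),
  distance  = runif(n, 0.2, 5)
)
d$response <- ifelse(d$treatment == "1",
                     1.0 - 0.15 * d$distance,
                     0.2 - 0.02 * d$distance) + rnorm(n, sd = 0.1)

# Index-type: one intercept and one distance slope per treatment
fit_index <- lm(response ~ 0 + treatment + treatment:distance, data = d)

# Contrast-type: reference-level intercept and slope plus differences
fit_contrast <- lm(response ~ 1 + treatment + distance + treatment:distance,
                   data = d)

print(coef(fit_index))
print(coef(fit_contrast))

# Same fitted values, different parameterization
same_fit <- isTRUE(all.equal(fitted(fit_index), fitted(fit_contrast)))
print(same_fit)
```

The treatment1 intercept in the index-type fit equals the intercept plus the treatment1 contrast in the contrast-type fit, and likewise for the distance slopes.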

Depending on how many calls you have per trial you might also relax the linearity assumption.

I really appreciate this thoughtful discussion!! Here is another fun caveat, what if the angular difference DOES have a ~somewhat~ nonlinear relationship with distance? Because we are looking at angles, as the bat approaches the object, the angles tend to increase exponentially as distance decreases, SO if there are larger angular differences, these measurements increase exponentially as distance decreases, BUT if the call is more equidistant between the points, then it stays close to zero.
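A small geometric sketch of that pattern in base R, under the simplifying assumption that a call axis misses a target point by a fixed lateral offset, so that the angle at a given distance is atan(offset / distance). The offsets and distances are hypothetical.

```r
# Hypothetical lateral offsets (m) between the call axis and a target point
offset <- c(centered = 0, small = 0.02, large = 0.10)

# Distances (m) from 5 m down to close range
distance <- c(5, 2, 1, 0.5, 0.25)

# Angle (degrees) subtended by each offset at each distance
angle_deg <- outer(distance, offset, function(d, o) atan2(o, d) * 180 / pi)
dimnames(angle_deg) <- list(paste0(distance, " m"), names(offset))

print(round(angle_deg, 2))
```

Under this assumption the centered column stays at zero at every distance, while the other columns grow faster and faster as distance shrinks (roughly in proportion to 1/distance for small angles), which matches the description of a relationship that is nonlinear in distance only when the offset is nonzero.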