brms: issues with model specification (Gaussian, Student-t) and posterior predictive checks

Hi,

I am using brms and I am seeking advice on how to adjust my model specification based on posterior predictive check results.

My response variable is continuous and represents the difference between the final and the initial distances (measured by GPS) of an individual to a fixed polygon, for each month.

The distance variable is then positive when the individual moves away from the polygon during the month (i.e., its final distance is greater than the initial distance), and negative when it gets closer (i.e., final distance < initial distance). If the individual was and remained within the polygon, or stayed at an equal distance from the polygon, the distance is 0.

I have 59 individuals (females and males), monitored over several months and sometimes across several years, leading to 765 observations (ID-month).

Here is the distribution of my distance variable:

You can see that a lot of individuals stay within the polygon (peak at 0), and that some individuals can move a lot at greater distances.
As a first attempt to understand the monthly variation in distance, I tried to fit a brm() model with the Gaussian family:

mod_gauss <- brm(Dist ~ Month + Sex + Weight + (1 | ID) + (1 | Year),
                 data = tab_dat, family = gaussian(),
                 chains = 3, iter = 24000, warmup = 4000, thin = 4,
                 cores = 3, control = list(adapt_delta = 0.98))

where Month and Sex are factors (and Weight is the number of GPS locations recorded during the month for each individual, included to control for a potential bias due to uneven GPS recording).

That model converged well, here is the summary:

Family: gaussian 
Links: mu = identity; sigma = identity 
Formula: Dist ~ Month + Sex + Weight + (1 | ID) + (1 | Year) 
Data: all_month_ind (Number of observations: 765) 
Draws: 3 chains, each with iter = 6000; warmup = 1000; thin = 4;
total post-warmup draws = 3750

Group-Level Effects: 
  ~ID (Number of levels: 59) 
                Estimate Est.Error l-95% CI u-95% CI   Rhat Bulk_ESS Tail_ESS
sd(Intercept)   1.9038    1.5884   0.0659   5.9645 1.0000    14317    13498

~Year (Number of levels: 10) 
                Estimate Est.Error l-95% CI u-95% CI   Rhat Bulk_ESS Tail_ESS
sd(Intercept)   2.4372    2.1880   0.0805   8.0871 1.0000    13866    14661

Population-Level Effects: 
                Estimate Est.Error  l-95% CI u-95% CI   Rhat Bulk_ESS Tail_ESS
Intercept       5.5497   24.2095  -41.2484  53.3543 1.0000    14662    14310
Monthmai      -16.1251   10.2041  -35.8115   3.6511 0.9999    12978    14178
Monthjuin     -41.2289   10.1286  -60.8787 -21.2383 1.0000    13154    13986
Monthjuil.    -62.1557   10.3287  -82.4877 -42.0309 1.0000    13075    13573
Monthaoût     -28.9785   10.3716  -49.0260  -8.5567 1.0000    13742    14304
Monthsept.    -26.5548   10.4687  -46.7636  -5.8644 1.0000    13311    13878
Monthoct.     -35.9533   10.6489  -56.6318 -14.9359 0.9999    13321    14093
Monthnov.    -121.3006   11.0212 -143.1058 -99.8523 1.0001    13107    14077
Monthdéc.     -73.5587   11.2449  -95.9426 -51.4944 1.0000    13493    14013
Monthjanv.P1  -27.1956   11.3819  -49.7920  -5.1402 1.0001    12417    13840
Monthfévr.P1  -13.5913   11.6732  -36.9981   9.1901 1.0001    13571    13913
MonthmarsP1   -23.6835   12.1765  -47.4323   0.3809 1.0000    13956    14082
SexM           -7.3838    4.9331  -16.9902   2.2179 1.0001    14187    14268
Weight         0.1416    0.1372   -0.1263   0.4120 0.9999    14767    13910

Family Specific Parameters: 
  Estimate Est.Error l-95% CI u-95% CI   Rhat Bulk_ESS Tail_ESS
sigma  64.4803    1.6609  61.3537  67.8209 1.0000    14222    14132

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

But I find the pp_check very odd (left panel: full view; right panel: zoomed ×30):

(1) Any ideas on what is happening here?
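For reproducibility, the two panels were generated along these lines (the number of draws and the zoom limits shown here are illustrative, not the exact values used):

```r
library(brms)
library(ggplot2)

# Full view: overlay densities of replicated datasets on the observed data
pp_check(mod_gauss, ndraws = 100)

# Zoomed view: same plot, restricted x-axis to see the peak at zero
pp_check(mod_gauss, ndraws = 100) +
  coord_cartesian(xlim = c(-50, 50))
```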

I saw that the Gaussian is a special case of the Student-t family, and that the Student-t family could help since it is more flexible and allows for fat tails. I ran the same model with family = student(), but it was extremely slow (several hours for the Student-t vs. half an hour for the Gaussian) and I never got it to converge properly (Rhat > 3). I am not even sure the pp_check of a model that did not converge is meaningful, but here it is:

(2) Is the Gaussian model appropriate here? If not, is there something I can do to fix it, or should I use a different family?
(3) Is the Student-t family a good choice here? Any idea why it takes so much longer to run than the Gaussian, and how to get it to converge?
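For reference, the Student-t attempt was the same call with family = student(). A sketch is below; the explicit prior on the degrees of freedom nu is my addition (not something from the run above), as making it explicit is sometimes suggested to stabilize sampling:

```r
library(brms)

# Hypothetical sketch of the Student-t variant. The gamma(2, 0.1) prior on nu
# is stated explicitly here as an assumption; it keeps the degrees of freedom
# away from extreme values that can slow or destabilize sampling.
mod_student <- brm(
  Dist ~ Month + Sex + Weight + (1 | ID) + (1 | Year),
  data = tab_dat, family = student(),
  prior = prior(gamma(2, 0.1), class = nu),
  chains = 3, iter = 6000, warmup = 1000,
  cores = 3, control = list(adapt_delta = 0.98)
)
```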

Thank you in advance for your help; here are the data if needed:

Data.csv (34.7 KB)

Neither the Gaussian nor the Student-t distribution looks like a good fit here. The main issue is that your response is zero-inflated: well over half of your observations appear to be zero. Something like a hurdle model seems more appropriate. A hurdle model is basically a mixture in which an observation is either zero or, if non-zero, arises from some other probability distribution. brms has a few built-in hurdle distributions, but all of them are for non-negative responses: Hurdle: Hurdle Distributions in brms: Bayesian Regression Models using 'Stan'

You might be better off writing your own model in Stan.
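One workaround that stays inside brms, sketched under the assumption that a two-part decomposition is acceptable for your question: fit a Bernoulli part for whether the animal moved at all in a month, and model the signed distance among the non-zero observations separately. (Variable names follow the model above; `moved` is a new helper column.)

```r
library(brms)

# Helper indicator: did the individual move at all this month?
tab_dat$moved <- as.integer(tab_dat$Dist != 0)

# Part 1: probability of any displacement
mod_zero <- brm(moved ~ Month + Sex + Weight + (1 | ID) + (1 | Year),
                data = tab_dat, family = bernoulli())

# Part 2: signed distance, restricted to months with non-zero movement
mod_move <- brm(Dist ~ Month + Sex + Weight + (1 | ID) + (1 | Year),
                data = subset(tab_dat, Dist != 0),
                family = gaussian())
```

This loses the joint estimation a true hurdle model provides (the two parts share no parameters), which is why writing the mixture directly in Stan remains the cleaner option.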