Identify response variable probability distribution: on the use of pp_check

Hello everyone!

Using brms, I want to define a hierarchical model capable of relating the response variable (fish_body_size) and a predictor variable (seawater temperature, SST). The response variable is positive and continuous, having no values ​​equal to zero (Figure 1 and Figure 2). Initially, I used an “empty” model, in which only the intercept varies according to the species (since the model with the SST variable takes more than 8 hours to run). I then used the pp_check function to check the empty model’s ability to explain the distribution and range of observed values ​​(Figure 3). I have identified the Gamma distribution and the skew_normal distribution as possible candidates. But I discarded the skew_normal because the model gives negative predictive values. But it seems to me that the empty model with Gamma distribution is not capable of representing the probability distribution of fish_body_size. I ask the community if I am following a wrong method (use the pp_check) or if the lack of capacity of the empty model derives from the formulation of the model (with or without predictor variable, SST). If you consider it appropriate, I can share the raw data.
Figure 1
longo_freq-counts.pdf (10.0 KB)
Figure 2
fish_per species_frequency.pdf (12.6 KB)
Figure 3
pp_check.pdf (991.8 KB)


unfortunately discrepancies can be caused by both a wrong response distribution and insufficient predictors, so in general hard to say exactly. What I would definitely do would be to subset the dataset first to a much smaller one where you can experiment more quickly (e.g. just take 10 species at random or something similar).

EDIT: I originally missed the histograms, so my first take was a bit of.

However, in your specific case, there is a very suspicious feature of the data, namely that looking at the histrograms, multiples of five and “nice” numbers in general are highly overrepresnted in the data. This probably means that some rounding happened during data collection. In that sense your data likely cannot be usefully considered completely continuous. For some use cases, it might be safe to ignore this, but in general you might need to handle the rounding explicitly (e.g. as in 6.1 Bayesian measurement error model | Stan User’s Guide) or - at the cost of loosing some signal - round all the numbers and then treat the data as discrete/ordinal.

Best of luck with your model!

Note: you should be able to copy-paste figures from R directly into your post or upload the figures as .png/.jpg which would make them rendered in the post and increase readability :-)