Identify response variable probability distribution: on the use of pp_check

leomarameo7 · April 16, 2021, 1:47pm

Hello everyone!

Using brms, I want to define a hierarchical model capable of relating the response variable (fish_body_size) and a predictor variable (seawater temperature, SST). The response variable is positive and continuous, having no values equal to zero (Figure 1 and Figure 2). Initially, I used an “empty” model, in which only the intercept varies according to the species (since the model with the SST variable takes more than 8 hours to run). I then used the pp_check function to check the empty model’s ability to explain the distribution and range of observed values (Figure 3). I have identified the Gamma distribution and the skew_normal distribution as possible candidates. But I discarded the skew_normal because the model gives negative predictive values. But it seems to me that the empty model with Gamma distribution is not capable of representing the probability distribution of fish_body_size. I ask the community if I am following a wrong method (use the pp_check) or if the lack of capacity of the empty model derives from the formulation of the model (with or without predictor variable, SST). If you consider it appropriate, I can share the raw data.
Thanks
Figure 1
longo_freq-counts.pdf (10.0 KB)
Figure 2
fish_per species_frequency.pdf (12.6 KB)
Figure 3
pp_check.pdf (991.8 KB)

martinmodrak · April 22, 2021, 3:59pm

Hi,
unfortunately discrepancies can be caused by both a wrong response distribution and insufficient predictors, so in general hard to say exactly. What I would definitely do would be to subset the dataset first to a much smaller one where you can experiment more quickly (e.g. just take 10 species at random or something similar).

EDIT: I originally missed the histograms, so my first take was a bit of.

However, in your specific case, there is a very suspicious feature of the data, namely that looking at the histrograms, multiples of five and “nice” numbers in general are highly overrepresnted in the data. This probably means that some rounding happened during data collection. In that sense your data likely cannot be usefully considered completely continuous. For some use cases, it might be safe to ignore this, but in general you might need to handle the rounding explicitly (e.g. as in 6.1 Bayesian measurement error model | Stan User’s Guide) or - at the cost of loosing some signal - round all the numbers and then treat the data as discrete/ordinal.

Best of luck with your model!

Note: you should be able to copy-paste figures from R directly into your post or upload the figures as .png/.jpg which would make them rendered in the post and increase readability :-)

Topic		Replies	Views
Brms: issues with model specification (gaussian, student) and posterior predictive check Modeling fitting-issues , ecology , brms	1	1068	May 18, 2022
Choosing a sampling distribution for left skewed data brms	15	1507	March 20, 2024
How bad is this pp_check? Should I alter the distribution? Modeling fitting-issues , specification , brms	28	282	March 24, 2025
Plot doesn't look good from pp_check() in brms Modeling brms	3	1112	January 27, 2023
Posterior predictive checks: which of these models should I choose? Modeling	4	827	November 18, 2019

Identify response variable probability distribution: on the use of pp_check

Related topics