I believe I have worked out the right approach to this, but would appreciate the insight of others.
To summarize, I am working with the skew normal distribution, which is both a good choice for my subject matter and reasonably easy to fit. The model fits well with good mixing, and the posterior-predictive checks look good. But, my usual approach to summarizing predictions of taking the mean or median across all draws of the PPD is producing “wrong” or at least highly misleading values. I suspect the answer is that I need to calculate the mode of the density of the PPD draws rather than the more traditional (and much easier) mean or median.
Consider a simple model with no covariates:
library(brms)
set.seed(13579)
### generate fake data ###
skew_df <- data.frame(y = rskew_normal(1e4, 89, 14, -8)) # mean, sd, shape
### fit simple y-intercept model and call results ###
skew_mod <- brm(y ~ 1,
family = skew_normal(),
cores = 4, chains = 4,
seed = 13579,
iter = 3000,
warmup = 1000,
data = skew_df)
skew_mod
If one cares to look at the model call, the parameters are substantially recovered, which is good. So now do a posterior predictive check:
pp_check(skew_mod, ndraws = 100) + ggtitle("PP Check, 100 draws")
Again, it looks good.
But, now compare the densities of (1) the original y
variable, (2) a single random draw from each row of the PPD, and (3) the density of colMeans(posterior_predict(skew_mod))
:
Obviously, taking the mean of the PPD draws is not going to work.
Reasoning this out, I posit the following:
- The goal in summarizing at least a unimodal distribution is to find the most likely probable value for each
y
. - With unimodal distributions, this typically is the mean of the draws, or the median, which will either give a similar answer to the mean but accommodate a bit of messiness in the distribution.
- But when the distribution is unimodal and skewed, by definition the mean is no longer a sufficient descriptor of location, because if it were, you would just be using the normal distribution in the first place.
- The “most likely” location in a unimodal density is the mode, and we use the mean and median only as convenient substitutes for the vast majority of situations because they are usually very similar to the mode in value and far easier to calculate.
- But when a unimodal distribution is skewed, the mean and median are no longer sufficient to describe the most likely value, and we revert to the mode.
Am I correct? Incorrect? Sort of correct?
That leaves the question of how to efficiently calculate the mode, particularly across many thousands of rows. Perhaps this is a good start, but if there is an existing, fast function someone else has already written in R
, I’d appreciate a reference.
Thanks much.
- edited to fix mismatched legend