Summarizing skewed, unimodal PPD

bachlaw · January 15, 2024, 6:07pm

I believe I have worked out the right approach to this, but would appreciate the insight of others.

To summarize, I am working with the skew normal distribution, which is both a good choice for my subject matter and reasonably easy to fit. The model fits well with good mixing, and the posterior-predictive checks look good. But, my usual approach to summarizing predictions of taking the mean or median across all draws of the PPD is producing “wrong” or at least highly misleading values. I suspect the answer is that I need to calculate the mode of the density of the PPD draws rather than the more traditional (and much easier) mean or median.

Consider a simple model with no covariates:

library(brms)

set.seed(13579)

### generate fake data ###

skew_df <- data.frame(y = rskew_normal(1e4, 89, 14, -8)) # mean, sd, shape

### fit simple y-intercept model and call results ###

skew_mod <- brm(y ~ 1, 
                family = skew_normal(), 
                cores = 4, chains = 4, 
                seed = 13579,
                iter = 3000,
                warmup = 1000,
                data = skew_df)

skew_mod

If one cares to look at the model call, the parameters are substantially recovered, which is good. So now do a posterior predictive check:

pp_check(skew_mod, ndraws = 100) + ggtitle("PP Check, 100 draws")

Again, it looks good.

But, now compare the densities of (1) the original y variable, (2) a single random draw from each row of the PPD, and (3) the density of colMeans(posterior_predict(skew_mod)):

Obviously, taking the mean of the PPD draws is not going to work.

Reasoning this out, I posit the following:

The goal in summarizing at least a unimodal distribution is to find the most likely probable value for each y.
With unimodal distributions, this typically is the mean of the draws, or the median, which will either give a similar answer to the mean but accommodate a bit of messiness in the distribution.
But when the distribution is unimodal and skewed, by definition the mean is no longer a sufficient descriptor of location, because if it were, you would just be using the normal distribution in the first place.
The “most likely” location in a unimodal density is the mode, and we use the mean and median only as convenient substitutes for the vast majority of situations because they are usually very similar to the mode in value and far easier to calculate.
But when a unimodal distribution is skewed, the mean and median are no longer sufficient to describe the most likely value, and we revert to the mode.

Am I correct? Incorrect? Sort of correct?

That leaves the question of how to efficiently calculate the mode, particularly across many thousands of rows. Perhaps this is a good start, but if there is an existing, fast function someone else has already written in R, I’d appreciate a reference.

Thanks much.

edited to fix mismatched legend

jsocolar · January 15, 2024, 9:15pm

This is one possible goal among many, and its not a common goal in Bayesian statistics.

As you not below, this isn’t actually true; it additional depends on the unimodel distribution being symmetric.

Sufficient is not defined here, but I think your thoughts are along the right lines.

No! We rarely care much about the mode, except when as a matter of happenstance it happens to be a good summary of where the important probability mass is.

Again no. In general we come up with a good way to describe the actual shape of the distribution, rather than summarizing it down to a point.

bachlaw · January 15, 2024, 10:06pm

Thanks for these insights.

But if we put nomenclature to one side, a model that can’t predict y isn’t very useful. Taking the means or medians of draws from this model cannot predict y whatsoever, despite the model fitting the data well. So, how would you be inclined to extract predicted values of y in this situation with reasonable accuracy, matching the approximate distribution of y, like the single random draw from each row of y is able to do?

I recognize that any value of y has uncertainty and I am used to capturing that also. But here, we aren’t even able to do that. How do we accurately predict y from this model across multiple draws? Having multiple draws should be making the prediction for each value of y more accurate, not less. Which means there must be a better way to summarize those rows.

jsocolar · January 16, 2024, 2:42am

I’m not sure what you mean by “predict” here. To predict y from this model, we would take row-wise draws from the posterior predictive distribution. The rest of your question looks to me like it stems from a misplaced belief that there ought to be a good general purpose formula for summarizing arbitrary univariate probability distributions using a single scalar.

bachlaw · January 16, 2024, 3:53am

OK. And it sounds like, in your view, there is no measure of central or predominant tendency that would most likely recover the original y, with some level of uncertainty around that number, from those row-wise draws of a skewed normal distribution.

I remain puzzled that there is no meaningful alternative for this distribution to colMeans(posterior_predict) as we would use for your typical exponential distribution, which also seems to be the foundation of the predict function included with some of the Stan front ends. But that disconnect seems to be on my end. Again, I appreciate the responses.

jsocolar · January 16, 2024, 3:58am

If the goal is to choose a number that is within \epsilon of the true y with maximal probability, for some fixed \epsilon, then this is a well-defined problem that has a clear solution, though the solution will not in general correspond to the mean, median, or mode, and will depend on the details of the distribution an the choice of \epsilon.

Topic		Replies	Views
Posterior predictive check looks weird - what can I do? Modeling posterior-predictive , brms	16	4052	April 24, 2024
Choosing a sampling distribution for left skewed data brms	15	1507	March 20, 2024
Plot doesn't look good from pp_check() in brms Modeling brms	3	1113	January 27, 2023
Posterior predictive checks - kurtosis and skew brms	18	4022	October 9, 2019
Bayesian models - choosing between Means, Medians, MAD, MPE Modeling brms	3	2653	April 1, 2022

Summarizing skewed, unimodal PPD

Related topics