Hi all, I think that checking the distributional assumptions of random effect terms is sometimes called “mixed predictive” checking, to emphasize that the discrepancy measure is a function of parameters and not just of simulated data (as in posterior predictive checking). This sort of checking is good to do, especially if you care about making predictions for weakly informed or out-of-sample groups. If the group means aren’t normally distributed, then you’ll make bad out-of-sample predictions, even if your model is able to constrain the random effect parameters to their best-fitting values.

The question, then, is what discrepancy measure can we use in this mixed predictive check.

### DISCLAIMER

What follows is stuff that I basically made up as I tried to navigate this issue (I wanted to use a model to make predictions for never-observed groups). I have no idea if it aligns with anybody else’s best practices.

One possibility is to use a Kolmogorov-Smirnov test statistic as the discrepancy measure. This potentially works for any arbitrary parametric form of the random-effect distribution.
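To make that concrete, here is a minimal sketch of using the KS distance as a mixed predictive discrepancy. All of the variable names and values here are hypothetical: in a real workflow, `mu`, `sigma`, and the realized random effects would come from posterior draws, and you'd repeat the comparison across many draws.

```
set.seed(1)
n_groups <- 40

# KS distance between the empirical CDF of x and a Normal(mu, sigma) CDF:
ks_stat <- function(x, mu, sigma) {
  as.numeric(ks.test(x, "pnorm", mean = mu, sd = sigma)$statistic)
}

# One (hypothetical) posterior draw of the hyperparameters:
mu <- 0
sigma <- 1

# Realized random effects (misspecified on purpose: heavy-tailed, not normal):
b_realized <- rt(n_groups, df = 3)
# Replicated random effects drawn under the model, i.e. normal by construction:
b_replicated <- rnorm(n_groups, mu, sigma)

d_realized <- ks_stat(b_realized, mu, sigma)
d_replicated <- ks_stat(b_replicated, mu, sigma)
# Repeating this over posterior draws and computing
# Pr(d_replicated >= d_realized) gives a mixed predictive p-value.
```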

However, if we have normally distributed random effects, then we know that in frequentist terms a Shapiro-Wilk test is more powerful than a Kolmogorov-Smirnov test. @mike-lawrence points out that it might be even more powerful to construct multiple tests for the different ranks, but this might carry some risk of elevated Type I error (i.e. failing at least one of these PPCs frequently even if the underlying process is normal). Frequentist intuition suggests that this should be the case, since Shapiro-Wilk is among the most powerful general-purpose tests for normality. However, if you have specific ideas about how normality is likely to be violated (either in terms of which moments are nonzero or in terms of which data-points are “least normal”), then you can do better. Let’s explore the latter situation, where we have clear expectations about which data-points are “least normal”:
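As a quick illustration of the power claim, here is a small simulation comparing Shapiro-Wilk against a naive KS test on heavy-tailed (t-distributed) samples. Note that plugging the sample mean and SD into `ks.test` makes it conservative (the Lilliefors issue), which is part of why it loses power here; this is a sketch, not a careful power study.

```
set.seed(42)
n_sims <- 500
reject_sw <- reject_ks <- logical(n_sims)
for (s in 1:n_sims) {
  x <- rt(50, df = 3)  # heavy-tailed, so not normal
  reject_sw[s] <- shapiro.test(x)$p.value < 0.05
  # Naive KS against a normal with the sample's own mean and SD;
  # this plug-in makes the test conservative (Lilliefors issue):
  reject_ks[s] <- ks.test(x, "pnorm", mean(x), sd(x))$p.value < 0.05
}
mean(reject_sw)  # Shapiro-Wilk rejection rate
mean(reject_ks)  # naive KS rejection rate, typically much lower
```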

If there are a bunch of really poorly informed data, then they are going to be sampled directly from the hyperparameters, and they will be normally distributed regardless of the true random-effect distribution. If we can focus on the strongly informed data, we get a much stronger test. Here’s an illustration of that (using R):

```
library(cmdstanr)
set.seed(123)

# Suppose that the real values aren't normal:
strongly_informed_values <- rt(50, 3)
shapiro.test(strongly_informed_values)

# Suppose that there are also a lot of *really* weakly informed values. In the
# weakly informed limit, those values are drawn directly from their
# random-effect prior. Build an intercept-only model to get that distribution:
regression_code <- brms::make_stancode(strongly_informed_values ~ 1,
                                       data = as.data.frame(strongly_informed_values))
stan_file <- cmdstanr::write_stan_file(regression_code)
regression_model <- cmdstan_model(stan_file)

p_strong <- vector()
bayesian_p_all <- vector()
for (j in 1:100) {
  strongly_informed_values <- rt(50, 3)
  # Can we distinguish the strongly informed values from normal?
  p_strong[j] <- shapiro.test(strongly_informed_values)$p.value
  # Get the hyperparameters for the random effect:
  regression_data <- brms::make_standata(strongly_informed_values ~ 1,
                                         data = as.data.frame(strongly_informed_values))
  class(regression_data) <- "list"
  fitted_params <- regression_model$sample(chains = 1, data = regression_data)
  # Get the mixed predictive distribution for the discrepancy measure (here the
  # Shapiro-Wilk p-value), thinning to 100 posterior iterations:
  shapiro_tests <- vector()
  for (i in 1:100) {
    mu <- as.numeric(fitted_params$draws()[, , "Intercept"])[10 * i]
    sigma <- as.numeric(fitted_params$draws()[, , "sigma"])[10 * i]
    # There are 500 weakly-informed values, drawn straight from the fitted prior:
    weakly_informed_values <- rnorm(500, mu, sigma)
    all_values <- c(strongly_informed_values, weakly_informed_values)
    # Get the discrepancy measure over all values, rather than just the
    # strongly informed ones:
    shapiro_tests[i] <- shapiro.test(all_values)$p.value
  }
  # Compare the discrepancy measure to draws from the mixed predictive
  # distribution (taking advantage of the fact that the Shapiro-Wilk p-value
  # is uniform under the mixed predictive distribution, because the replicated
  # values are normal by construction):
  bayesian_p_all[j] <- sum(shapiro_tests < runif(100)) / 100
}
# We don't have great power to detect misspecification if we look at all of the data:
hist(bayesian_p_all)
# But we have pretty good power if we look at just the strongly informed data:
hist(p_strong)
```

When the data are really imbalanced in how strongly they are informed, one potential solution is to choose some cutoff `X` in terms of the number of data points per group and perform the check only on the estimates for groups with more than `X` data points. Choose `X` too low and you’ll run into the problem above. Choose `X` too high and you’ll have too few group-level estimates left to compute the Shapiro-Wilk statistic, and you’ll get a noisy discrepancy measure.
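A sketch of what that filtering might look like. The names `group_n` (observations per group) and `b_hat` (estimates of the group effects) are hypothetical stand-ins; here they're simulated just so the snippet runs:

```
set.seed(1)
n_total <- 200
group_n <- rpois(n_total, 8)   # hypothetical: observations per group
b_hat <- rt(n_total, df = 3)   # hypothetical: group-effect estimates

X <- 10  # cutoff: only check groups informed by more than X observations
keep <- group_n > X

# shapiro.test needs at least 3 values; with X too high, too few remain:
if (sum(keep) >= 3) {
  shapiro.test(b_hat[keep])
}
```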