Normality assumption for random intercept/coefficients in hierarchical models

Hi, I have a few questions about hierarchical models (this example).

  1. How to assess the normality assumption for the random intercept/coefficients?
beta[l,d] ~ normal(mu[d], sigma[d]);
  2. What other distributions (maybe a t-distribution?) could be used?
  3. Is it possible to use a multi-modal distribution (maybe a mixture of normals)?

I don’t know if all these questions have answers, but any example in Stan would be greatly appreciated. Cheers~


Anything you want!

Yes, for sure a mixture can be used. Hierarchical growth-mixture models are decently common in some fields.

You’d need to do posterior predictive checks for this. To check that the mid-level variates are consistent with the model’s distributional choice, you could get the empirical CDF of the variates and its difference from the expected CDF, in a manner akin to what’s done in the context of SBC.


Thanks for the fast response. I have not heard of “hierarchical growth-mixture models” before! I tried to google but the number of hits is small. Can you provide some good references on it?

I expect that some sort of posterior predictive check would be the answer. I am just confused about what to check against. I have used posterior predictive checks to simulate the outcomes and compare model-based metrics against the actual data, but I don’t know whether that is sufficient to say the normality assumption for the random intercept/coefficients is correct. More input would be appreciated.

Maybe leave off the “hierarchical” part when doing the search. I’ve certainly seen growth mixture models in the medical literature (e.g. modelling longitudinal trajectories of patients, with separate latent classes for those that respond to treatment and those that do not).

I was a little less than clear on my proposed PPC quantity. Assuming you have a quantity x modelled hierarchically, you can compare its sum of squared deviations from the expected CDF to those of a random draw:

...
transformed data{
  // expected CDF values at the rank positions: (i - 0.5) / n
  vector[n] ranks = (linspaced_vector(n, 1, n) - 0.5) / n ;
}
...
parameters{
  real mu ;
  real<lower=0> sigma ;
  vector[n] x ;
...
}
model{
...
  x ~ normal(mu,sigma) ;
...
}
generated quantities{
  // get the sum-squared-difference of ranks for x (Phi is the elementwise standard-normal CDF)
  real ssdr_x = sum(square(ranks - Phi((sort_asc(x) - mu) / sigma))) ;
  // generate simulated data of the same size
  vector[n] x_sim = to_vector(normal_rng(rep_vector(mu, n), sigma)) ;
  // get the sum-squared-difference of ranks for simulated data
  real ssdr_x_sim = sum(square(ranks - Phi((sort_asc(x_sim) - mu) / sigma))) ;
  // compute the difference
  real ssdr_diff = ssdr_x - ssdr_x_sim ;
}

Then you can take a look at the posterior on ssdr_diff and determine whether the mass is appreciably shifted away from zero; if so, then this suggests that the data are informing on values for x that form a distribution that is inconsistent with the model.

It feels like there is a small confusion here: @sonicking seems to be asking about the normal prior on a model parameter, while @mike-lawrence discusses the normal distribution of an observed quantity (correct me if my impression is wrong).

Speaking about putting a normal prior on a coefficient, it should be pointed out that we don’t necessarily assume that it follows a normal distribution. For a prior to be sensible, you only need a set of weaker claims to hold; one pretty safe way to think about “good priors” is that:

  • The prior does not put overly low weight on coefficient values that a reasonable person would find plausible in the problem at hand
  • Your data has enough information about the parameter to make minor details of the prior irrelevant

Those would often be called “weakly informative” in that they are informative only by excluding highly implausible parts of the parameter space - e.g. a change in the log-odds of an outcome by 5 or more is usually highly implausible.

In the context of hierarchical models, we usually put the normal distribution here to provide somewhat strong regularization (a t-distribution would usually regularize less; the “Finnish horseshoe” tries to balance between Cauchy and normal). With enough data for each group the difference won’t matter, but it can matter in low-data settings.

So how do you actually go about checking the assumptions? Prior assumptions can be checked directly with prior predictive checks, and one can definitely check the sensitivity of the results to changing the form of the priors (or any other part of the model structure). For some more background, see the Bayesian workflow preprint: [2011.01808] Bayesian Workflow

This is just a very short window into a potentially complex topic - there are many other sensible ways to think about priors/hyperpriors.

Hope that clarifies more than confuses.

It’s my understanding that the distinction between a “prior” and a “likelihood”, and thereby “parameter” and “data”, starts to become nebulous in the context of hierarchical models. You can call x ~ normal(mu,sigma) a prior with respect to x, but it is also a likelihood with respect to mu and sigma. Regardless, my suggested PostPC quantity should be applicable to checking the distributional structure implied both for mid-level/latent quantities like x here and for the final data.

Note that I posted on this idea over on the Stan Slack, and @jsocolar provided input that when the entries in x vary in the degree to which the data inform their values, those tend to dominate this kind of discrepancy check and lead to under-sensitivity to true discrepancies. But note also that I have a vague suspicion that this under-sensitivity might be driven by collapsing things to a single discrepancy score, and that leaving things as a per-rank deviation-difference might help.


Please let me clarify. I am asking about the normality assumption for the intercept and coefficients in a hierarchical model (using this as the example):

beta[l,d] ~ normal(mu[d], sigma[d]);

The normal distribution is the prior on the regression coefficients. I also interpret it as the unobserved-heterogeneity distribution of the coefficients. Therefore, I am asking whether it can be something else, in particular a multi-modal distribution (a mixture of normals?), and how to assess the adequacy of a single normal distribution.


Hi all, I think that checking the distributional assumptions of random-effect terms is sometimes called “mixed predictive” checking, to emphasize that the discrepancy measure is a function of parameters and not just of simulated data (as in posterior predictive checking). This sort of checking is good to do, especially if you care about making predictions for weakly informed or out-of-sample groups. If the group means aren’t normally distributed, then you’ll make bad out-of-sample predictions, even if your model is able to constrain the random-effect parameters to their best-fitting values.

The question, then, is what discrepancy measure can we use in this mixed predictive check.

DISCLAIMER

What follows is stuff that I basically made up as I tried to navigate this issue (I wanted to use a model to make predictions for never-observed groups). I have no idea if it aligns with anybody else’s best practices.

One possibility is to use a Kolmogorov-Smirnov test statistic as the discrepancy measure. This potentially works for any arbitrary parametric form of the random-effect distribution.
However, if we have normally distributed random effects, then we know that in frequentist terms a Shapiro-Wilk test is more powerful than a Kolmogorov-Smirnov test. @mike-lawrence points out that it might be even more powerful to construct multiple tests for the different ranks, but this might carry some risk of elevated Type I error (i.e. failing at least one of these PPCs frequently even if the underlying process is normal). Frequentist intuition suggests that this should be the case, since Shapiro-Wilk is the most powerful general-purpose test for normality. However, if you have specific ideas about how normality is likely to be violated (either in terms of which moments are nonzero or in terms of which data points are “least normal”), then you can do better. Let’s explore the latter situation, where we have clear expectations about which data points are “least normal”:

If there are a bunch of really poorly informed values, then they are going to be sampled essentially directly from the hyperparameters, and so they will look normally distributed regardless. If we can focus on the strongly informed values, we get a much stronger test. Here’s an illustration of that (using R):

library(cmdstanr)
set.seed(123)

# Suppose that the real values aren't normal:
strongly_informed_values <- rt(50, 3)
shapiro.test(strongly_informed_values)

# Suppose that there are also a lot of *really* weakly informed values. In the weakly informed limit,
# the weakly informed values are drawn directly from their random-effect prior:

# model to get the random-effect distribution:
regression_code <- brms::make_stancode(strongly_informed_values ~ 1, data = as.data.frame(strongly_informed_values))
fileConn <- file("/Users/jacobsocolar/Desktop/mean_model.stan")
writeLines(regression_code, fileConn)
close(fileConn)
regression_model <- cmdstan_model("/Users/jacobsocolar/Desktop/mean_model.stan")

p_strong <- vector()
bayesian_p_all <- vector()
for(j in 1:100){
  strongly_informed_values <- rt(50, 3)
  # Can we distinguish the strongly informed values from normal?
  p_strong[j] <- shapiro.test(strongly_informed_values)$p.value
  
  # get the hyperparameters for the random effect:
  regression_data <- brms::make_standata(strongly_informed_values ~ 1, data = as.data.frame(strongly_informed_values))
  class(regression_data) <- "list"
  fitted_params <- regression_model$sample(chains = 1, data = regression_data)
  
  # Get the mixed predictive distribution for the discrepancy measure (here the Shapiro-Wilk p-value)
  shapiro_tests <- vector()
  for(i in 1:100){   # 100 posterior iterations (every 10th of the default 1000 post-warmup draws)
    mu <- as.numeric(fitted_params$draws()[,,"Intercept"])[10*i]
    sigma <- as.numeric(fitted_params$draws()[,,"sigma"])[10*i]
    
    # There are 500 weakly-informed values
    weakly_informed_values <- rnorm(500, mu, sigma)
    all_values <- c(strongly_informed_values, weakly_informed_values)
    
    # Get the discrepancy measure over all values, rather than just the strongly informed ones
    shapiro_tests[i] <- shapiro.test(all_values)$p.value
  }
  
  # Compare the discrepancy measure to draws from the mixed predictive distribution (taking advantage of the fact that
  # we know the p-value of a Shapiro-Wilk test is uniform under the mixed predictive distribution--this distribution is 
  # normal by construction)
  bayesian_p_all[j] <- sum(shapiro_tests < runif(100))/100
}

# We don't have great power to detect misspecification if we look at all of the data
hist(bayesian_p_all)
# But we have pretty good power if we look at just the strongly informed data
hist(p_strong)

When the data are really imbalanced in how strongly they inform the group-level estimates, one potential solution is to choose some cutoff X in terms of the number of data points per group and perform the check only on the estimates with more than X data points. Choose X too low and you’ll run into the problem above; choose X too high and you’ll have too few estimates to compute the Shapiro-Wilk statistic, and you’ll get a noisy discrepancy measure.
