Hierarchical Linear Models - Bayes vs. Frequentist

I would say that provided it is feasible to do MCMC, having a bunch of not-too-dependent posterior draws for the parameters is always more informative than having just a point estimate and a (negative inverse) Hessian, irrespective of whether the model is hierarchical or flat.
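
To see what the draws buy you in practice, here is a minimal sketch with made-up draws (the numbers, the transform to an odds ratio, and the 1.5 threshold are all hypothetical): tail probabilities and intervals for nonlinear functions of the parameters come straight from the draws, whereas a point estimate plus Hessian forces you through a normal approximation.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Pretend these are 4,000 not-too-dependent MCMC draws of a log odds ratio beta
# (hypothetical numbers; in practice they would come from your sampler).
beta_draws = rng.normal(loc=0.3, scale=0.2, size=4000) + 0.05 * rng.standard_t(df=4, size=4000)

# From the draws: the posterior probability that the odds ratio exceeds 1.5 and
# a 90% interval for the odds ratio itself (a nonlinear transform of beta).
odds_ratio = np.exp(beta_draws)
print("Pr(OR > 1.5):", np.mean(odds_ratio > 1.5))
print("90% interval for OR:", np.quantile(odds_ratio, [0.05, 0.95]))

# From a point estimate plus Hessian you would instead assume beta-hat is normal
# and push the transform through, which can misstate tail probabilities.
beta_hat, se_hat = beta_draws.mean(), beta_draws.std(ddof=1)  # stand-ins for the MLE and its SE
print("Normal-approximation Pr(OR > 1.5):",
      1 - stats.norm.cdf(np.log(1.5), loc=beta_hat, scale=se_hat))
```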

But there are a couple of additional points that are specific to hierarchical models. Suppose that a linear predictor can be written as \boldsymbol{\eta} = \alpha + \mathbf{X} \boldsymbol{\beta} + \mathbf{Z} \mathbf{b}, where \alpha is a common intercept, \boldsymbol{\beta} is a vector of common coefficients, and \mathbf{b} is a vector of group-specific coefficients. The frequentist position is that \mathbf{b} represents “errors” in \boldsymbol{\beta} across groups that vary in repeated sampling of observations from a population. Therefore, \mathbf{b} must be integrated out of the likelihood function, leaving an integrated likelihood that depends only on the common parameters (\alpha, \boldsymbol{\beta}, and \boldsymbol{\Sigma}, the block-diagonal covariance matrix of \mathbf{b}). Since \mathbf{b} is integrated out, you cannot have an “estimator” of \mathbf{b}; you can only “predict” \mathbf{b} from the residuals \mathbf{y} - g^{-1}\left(\alpha + \mathbf{X} \boldsymbol{\beta}\right), conditional on \mathbf{X} alone. So, if you are at all interested in making inferences about the group-specific parameters \mathbf{b}, it is much more natural to use a Bayesian approach, in which the likelihood function depends on \alpha, \boldsymbol{\beta}, \mathbf{b}, and \boldsymbol{\Sigma} and conditions on both \mathbf{X} and \mathbf{Z}. Also, the uncertainty estimates (across datasets) produced by frequentist estimators of hierarchical models are known to be too small because they are calculated conditional on the predicted \mathbf{b} rather than integrated over what \mathbf{b} could be.
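
To make that contrast concrete, here is a rough sketch of the two objects, writing p\left(\mathbf{b} \mid \boldsymbol{\Sigma}\right) for the (multivariate normal) distribution of the group-specific coefficients. The frequentist integrated likelihood is

\mathcal{L}\left(\alpha, \boldsymbol{\beta}, \boldsymbol{\Sigma}; \mathbf{y}\right) = \int p\left(\mathbf{y} \mid \alpha, \boldsymbol{\beta}, \mathbf{b}, \mathbf{X}, \mathbf{Z}\right) \, p\left(\mathbf{b} \mid \boldsymbol{\Sigma}\right) \, d\mathbf{b},

whereas the Bayesian joint posterior is

p\left(\alpha, \boldsymbol{\beta}, \mathbf{b}, \boldsymbol{\Sigma} \mid \mathbf{y}, \mathbf{X}, \mathbf{Z}\right) \propto p\left(\mathbf{y} \mid \alpha, \boldsymbol{\beta}, \mathbf{b}, \mathbf{X}, \mathbf{Z}\right) \, p\left(\mathbf{b} \mid \boldsymbol{\Sigma}\right) \, p\left(\alpha, \boldsymbol{\beta}, \boldsymbol{\Sigma}\right),

in which \mathbf{b} is retained, so inferences about the group-specific parameters come directly from the posterior draws.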

Last, but not least, hierarchical models are more prone than flat models to have a mode that is on the boundary of the parameter space, which is contrary to the assumption underlying maximum likelihood estimation that the parameters are bounded away from the edges of the parameter space. If the hierarchical model only allows the intercept to vary across groups, or only allows the intercept and one slope coefficient to vary across groups, then you have a decent chance of getting an interior mode; but if \boldsymbol{\Sigma} has diagonal blocks that are 3\times 3 or bigger, then in my experience, more often than not you get a \widehat{\boldsymbol{\Sigma}} that is not positive definite, or you get other warnings that the Hessian was not negative definite, or something similar. Moreover, applied statisticians often use a maximum likelihood estimator, find that \widehat{\boldsymbol{\Sigma}} is not positive definite, and then remove some of the random effects in order to get a \widehat{\boldsymbol{\Sigma}} that is positive definite but of smaller size, which is akin to p-hacking.

The Bayesian approach avoids problems with the mode by not caring about the mode; instead we take means, medians, and other quantiles from the posterior draws, none of which are anywhere near the mode. This concentration-of-measure phenomenon seems paradoxical to a lot of people, but in multidimensional continuous spaces it is quite possible for the neighborhood around the mode to have negligible posterior probability. Thus, I have been known to say that even if you do not adopt the Bayesian perspective on probability, and even if you only care about the frequentist properties of a point estimator of the common parameters across repeated samples from a population, the posterior mean or median is a better frequentist estimator than the mode of the (integrated) likelihood function, because the latter has a non-trivial chance of being on the boundary of the parameter space.
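
Here is a minimal sketch of the concentration-of-measure point, using a standard multivariate normal as a stand-in posterior (the dimensions and the radius-1 ball are arbitrary choices for illustration): as the dimension grows, essentially none of the draws land near the mode, even though the density is highest there, and the typical draw sits at a distance of roughly \sqrt{d} from it.

```python
import numpy as np

rng = np.random.default_rng(1)
n_draws = 100_000

for d in (1, 3, 10, 30, 100):
    # Draws from a d-dimensional standard normal, whose mode is the origin.
    draws = rng.standard_normal(size=(n_draws, d))
    dist = np.linalg.norm(draws, axis=1)
    near_mode = np.mean(dist < 1.0)  # fraction of draws within radius 1 of the mode
    print(f"d={d:4d}  median distance from mode={np.median(dist):6.2f}  "
          f"Pr(within radius 1 of mode)={near_mode:.4f}")
```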
