# Hierarchical Linear Models - Bayes vs. Frequentist

Good evening,

This question is a bit off-topic and not programming related but, since the community is probably consisting of a large fraction of Bayesian enthusiasts, it feels like a good place to ask.

I have an upcoming presentation where I illustrate Hierarchical Linear Models in a Bayesian Seminar.

In the concluding part I want to do a comparison of why it makes sense to use the Bayesian Framework but also in what instances it would make sense to do it the Frequentist way.

I have seen some lists in Gelman’s books but they seem to be pretty one-sided as in “Bayesian can do whatever Frequentist can and even better”. His resources are really good but when it comes to this, I think the arguments are a bit biased.

I am happy to hear arguments for both sides, especially in the context of Hierarchical Linear Models.

3 Likes

I would say that provided it is feasible to do MCMC, having a bunch of not-too-dependent posterior draws for the parameters is always more informative than having just a point estimate and a (negative inverse) Hessian, irrespective of whether the model is hierarchical or flat.

But there are a couple of additional points that are specific to hierarchical models. Suppose that a linear predictor can be written as \boldsymbol{\eta} = \alpha + \mathbf{X} \boldsymbol{\beta} + \mathbf{Z} \mathbf{b} where \alpha is a common intercept, \boldsymbol{\beta} is a vector of common coefficients, and \mathbf{b} is a vector of group-specific coefficients. The frequentist position is that \mathbf{b} represents “errors” in \boldsymbol{\beta} across groups that vary in repeated sampling of observations from a population. Therefore, \mathbf{b} must be integrated out of the likelihood function to leave an integrated likelihood that depends only on the common parameters (\alpha, \boldsymbol{\beta}, and \boldsymbol{\Sigma}, which is the block-diagonal covariance matrix for \mathbf{b}). Since \mathbf{b} is integrated out, you cannot have an “estimator” of \mathbf{b}; rather you can only “predict” \mathbf{b} from the residuals \mathbf{y} - g\left(\alpha +\mathbf{X} \boldsymbol{\beta}\right)^{-1} conditional on \mathbf{X} alone. So, if you are interested at all in making inferences about the group-specific parameters \mathbf{b}, it is much more natural to use a Bayesian approach where the likelihood function depends on \alpha, \boldsymbol{\beta}, \mathbf{b} and \boldsymbol{\Sigma} and conditions on both \mathbf{X} and \mathbf{Z}. Also, too, the uncertainty (across datasets) estimates produced by Frequentist estimators of hierarchical models are known to be too small because they are calculated conditional on the predicted \mathbf{b} rather than integrated over what \mathbf{b} could be.

Last, but not least, hierarchical models are more prone than flat models to have a mode that is on the boundary of the parameter space, which is contrary to the assumptions made by the maximum likelihood estimator that the parameters are bounded away from the edges of the parameter space. If the hierarchical model only allows the intercept to vary across groups or only allows the intercept and one slope coefficient to vary across groups, then you have a decent chance to getting an interior mode but if \boldsymbol{\Sigma} has diagonal blocks that are 3\times 3 or bigger, then in my experience, more times than not you get a \widehat{\boldsymbol{\Sigma}} that is not positive definite or you get other warnings that the Hessian was not negative definite or something. Also, applied statisticians often use a maximum-likelihood estimator, find that \widehat{\boldsymbol{\Sigma}} is not positive-definite, and then remove some of the random effects in order to get a \widehat{\boldsymbol{\Sigma}} that is positive definite but a smaller size, which is akin to p-hacking. The Bayesian approach avoids problems with the mode by not caring about the mode; instead we take means, medians, and other quantiles from the posterior draws, none of which are anywhere near the mode. This concentration of measure phenomenon seems paradoxical to a lot of people, but in multidimensional continuous spaces, it is quite possible for the neighborhood around the mode to have zero posterior probability. Thus, I have been known to say that even if you do not adopt the Bayesian perspective on probability and even if you only care about the Frequentist properties of a point estimator of the common parameters across repeated samples from a population, the posterior mean or median is a better Frequentist estimator than is the mode of the (integrated) likelihood function because the latter has a non-trivial chance of being on the boundary of the parameter space.

30 Likes

Thanks, for the extensive reply! I’ve learned a few new things.

I prefer the Bayesian approach, but would add one other idea I have heard mentioned but not discussed here.

Essentially, the argument would be that hierarchical models are a more expensive form of ridge regression, and that a random intercepts model can often be equivalently estimated, and perhaps even better estimated, via glmnet and one of its various implementations (L2 or elasticnet penalties), because glmnet is not restricted to particular assumptions it could apply to individual groups or even groups versus other coefficients.

It seems to me this argument, even if correct as I have stated it, might not hold water once we start thinking about varying slope models and other, more advanced applications.

If anyone has thoughts on this, or can point to examples of any practitioner trying to model classic hierarchical datasets comparing glmnet to hierarchical models, that would be very interesting. Bonus points if the comparison is actually fair and run by somebody who is skilled at both methods.

1. The problem with modes being on the boundary of parameter space is discussed here:
http://www.stat.columbia.edu/~gelman/research/published/chung_etal_Pmetrika2013.pdf
and here:
http://www.stat.columbia.edu/~gelman/research/published/chung_cov_matrices.pdf
If you want to use point estimation (maximum penalized marginal likelihood), then we recommend using zero-avoiding priors as discussed in the two above papers. Here the issue is not really “Bayesian vs. frequentist” but “Draws from the posterior vs. Point estimate.”

2. Here’s a paper where I expressed some attempt at a balanced few of risks and benefits of Bayesian multilevel models:
http://www.stat.columbia.edu/~gelman/research/published/multi2.pdf

4 Likes

I would have said that ridge regression is a particular form of hierarchical model. To make the elastic net type things more analagous to hierarchical models that Bayesians do, you would have to penalize departures from the group-level coefficients rather than penalizing departures from zero (which is what is typically implemented). Also, you would need to be estimating the posterior distribution of the hyperparameters, rather than tuning them to optimal values.

I am not aware of any research that claims glmnet produces good point estimates of the group-specific parameters, just that doing that produces somewhat good predictions of the outcome in testing data.

1 Like

There are a lot of things going on with such a comparison.

If you set up a normal prior with a fixed scale \sigma on the coefficients of a regression centered at zero, you get ridge regression. You can set up a hierarchical model to estimate \sigma along with the regression coefficients by giving it a hyperprior.

Either way, you can take maximum likelihood or max marginal likelihood estimates or you can take Bayesian posteriors. With a Bayesian posterior, you can get point estimates or perform full Bayesian inference.

Practitioners of ridge regression in machine learning would cross-validate on some metric of interest in order to choose the prior (aka penalty). If you’re going to point estimates in the end, this can produce a very similar result to fitting max marginal likelihood (sometimes called “empirical Bayes” in various forms).

Same goes for elastic nets (mixed L1 [Laplace] and L2 [normal]) or other priors.

I’m not sure how you’re assigning cost, but running a single hierarchical regression is usually faster than using cross-validation.

2 Likes

Agree with all these points. To be clear, I didn’t agree with the argument, just found it somewhat novel. Thanks for the discussion.