What is meant by having coefficients in "batches"

Hello everyone,

Andrew Gelman frequently refers to the idea of batching coefficients in hierarchichal models (e.g. here: http://andrewgelman.com/2004/10/27/the_blessing_of/ and here: http://www.stat.columbia.edu/~gelman/presentations/mlmtalk.pdf in reference to dimensionality). Am I understanding correctly that this would essentially mean having for e.g. five predictor variables that are economic related that you specify as having a shared distribution and then using the sd of that distribution as a way of assessing the importance of those predictors as a group? If not, could someone please provide directions to some slightly lower level reading that might help me understand?

I wouldn’t try to guess what Andrew meant in 2014, much less 2004. But if I were to guess, I would guess that he is talking in what came to be known as the lme4 sense where you having grouping variables in the formula, as in y ~ x1 + x2 + (1 + x1 + x2 | g). So, you have an intercept, a coefficient on x1 and a coefficient on x2 in a “batch” for each level of the factor g.

This makes sense to me within interaction or subgroup analysis (e.g. g is a subgroup with several levels), but I’m having trouble thinking of how you would specify the model if you have multiple predictors within classes. Say I have 20 variables, ten of them are similar in that they are related to illness severity, and 10 are related to demographics. Would my data be set up as:

y = response
x1 = value at level of g1
g1 = factor 10 illness severity indicators
x2 = value at level of g2
g2 = factor of 10 demographic indicators

y ~ 1 + x1 + x2 + (x1|g1) + (x1|g2)

Then x1 and x2 fixed effects would be the general effect of illness/severity indicators with random coefficients indicating the deflection for each?

Also I know you mentioned to me that we usually include fixed effects for any slopes we give random terms to, but can you recommend any reading on why?

The frequentist distinction between a fixed parameter (that can be estimated) and a component of the random error term (that cannot be estimated) is not going to make it any easier to understand what is going on. Also, it doesn’t really make sense to think of the levels of g1 and g2 as a random sample from possible severity indicators or demographics.

A Bayesian would tend to look at the sum of the coefficient on x1 the deviation in the coefficient on x1 at each level of g1 and the deviation in the coefficient on x1 at each level of g2. In other words, for each pair of levels of g1 and g2 there are three components to the expected difference in the outcome between two otherwise identical people who differ by one unit on x1.

Doing y ~ (x1 | g1) + (x1 | g2) forces one component to be exactly zero and deviates the coefficient on x1 by level of g1 and g2.

Did you eliminate the x2 in my example because you didn’t think it was needed, or just to make your example easier to understand? I was imagining a dataframe:
id y x1 g1 x2 g2
1 5 5 severity scale 30 distance
1 5 2 # of infections 23 age

Is this a bad thing?

The same goes for x2.

It seems difficult to think of a variable where one would be sure it has no effect on average across levels of g1 and yet believe the variance in the effect across levels of g1 is non-zero.

This notation, if I’m not mistaken, maps to the following much more explicit mathematics:

y = A + B(g1)*x1 + C(g2)*x2 

Where now B(g1) is a different number for each possible value of g1 (and all the same stuff applies to g2 so I’ll just drop it for explication)

Now you can simply redefine the function B(g1) for example as B0 + Bc(g1) where B0 is a constant, and Bc is a function which averaged across all the possible values of g1 averages 0.

Then, you have (B0 + Bc(g1))* x1, which you can rewrite B0*x1 + Bc(g1) * x1

and that’s the difference between y ~ 1 + (x1|g1) vs 1+ x1 +(x1|g1)

It’s more or less just rearrangement of the terms of a polynomial to group them differently.

y ~ 1 + (x1|g1) is the syntax for a restricted model where B0 = 0.
It is tempting to think that y ~ 1 + (x1|g1) is centered and y ~ 1 + x1 + (x1|g1) is non-centered, but the lme4 parser hard-codes a non-centered parameterization where, in terms of vectors and matrices, eta = X * beta + Z * b. And then {g}lmer integrates out b. The stan_{g}lmer functions condition on Z but still use the same non-centered parameterization.

Really? see I knew I hated that simplistic R modeling syntax. One of the things I love about Stan is that I know EXACTLY what model I’m getting by the end of it.

1 Like

I think this is what I assumed I was getting by dropping the x1, so I’m glad it came up. I

Can either of you recommend any resources to really walk a novice through the process of transitioning from lme style syntax to stan/JAGS/WinBUGS? I am gradually learning some WinBUGS as the result of it being the language of choice for a lot of new developments in research synthesis, but so far that’s been limited to applying/slightly modifying/troubleshooting existing code.

So just to bring this full circle for me. This model as specified does accomplish the goal of batching the 10 illness and 10 demographic predictors together into the x1 and x2 coefficients? Wouldn’t I also need to add an intercept for each participant id (1|id)? Am I right that each x1/g1 pair would need to be centered and scaled? And then each x1/g1 slope is interpreted as a deflection from the overall x1?

As a follow-up: If one of the levels of g1 was a four level factor (say ethnicities: white, black, hispanic, indigenous) I would just manually dummy code correct? So if id 1 was white they would be entered as:

id x1 g1
1 0 black
1 0 hispanic
1 0 indigenous

Just learn Stan, there’s no reason to spend time on BUGS/JAGS unless maybe most of your models are with discrete parameters.

I hear good things about McElreath’s book “Statistical Rethinking” for learning more extensive modeling but I haven’t looked at it.

to really walk a novice through the process of transitioning from lme style syntax to stan/JAGS/WinBUGS?

There’s a couple functions in the brms package (https://cran.r-project.org/web/packages/brms/index.html) make_stancode and make_standata that turn formula-like models into Stan models + the input data.

I’ve not done this myself (other than checking to make sure the functions do print out human-readable stuff), but if you want to look under the hood this might be the easiest way. And the package is @bgoodri approved (Brms paper published), so I don’t think I’m leading you too far astray.

I don’t think you are leading anyone astray but it is easy to end up in the Bermuda Triangle anyway.

To start with, one has to understand the lme4 syntax from its documentation or

Then, you have to always keep in mind that lme4 integrates u out of
eta = X * beta + Z * b = X * beta + Z * (s .* (L * u)) * sigma
whereas Bayesian approaches condition on Z (unless Andrew / Dan come up with something that works). In the Stan case, there are independent standard normal priors on the elements of u, a LKJ prior on the correlation matrix L * L', and some prior on the vector of standard deviations, s (which brms does slightly differently than rstanarm). Also, in the Gaussian case sigma is estimated with some prior, but in a logit or Poisson model, sigma is fixed to 1.

Then, you have to keep in mind that rstanarm and brms do equivalent but different versions of this linear predictor where rstanarm (and lme4) creates one Z matrix and brms creates one or more matrices that would be the same as Z if you glued them together columnwise.

Finally, you have to keep in mind that BUGS and other Gibbs-samplers are typically going to use a centered parameterization where
eta_{ij} = x_i * beta_j and the vector of coefficients for group j has a multivariate normal prior with mean vector mu and covariance matrix Sigma = quad_form_diag(s, L * L') * sigma^2. Except they parameterize things in terms of the inverse of this covariance matrix. With a Gibbs sampler and a centered parameterization, you can update each beta_j as a block. With Stan, the leapfrog step evolves all the parameters simultaneously anyway, but the centered parameterization usually (but not always) yields worse geometry.


Really appreciate all of the suggestions, and time spent explaining some of the more minute details. It’s clear I have jumped the line a bit on the understanding basic principles —> responsible application continuum, but I think the more time I spend here the more I will be able to understand where I might benefit from revisting.