When is no-pooling multilevel modelling okay?

This may be a broad question with no general solution, so I'm looking for general guidance and pointers.


I have a setting where I'm looking to obtain thousands of regression coefficients. The data naturally lends itself to some kind of hierarchical structure (I'm leaving the specifics unspecified because I'm interested in the general class of problem). However, each regression has thousands to tens of thousands of data points, so some of the typical arguments for partial pooling don't clearly apply. That is to say, shrinkage should be minimal, because with this much data the hyperprior shouldn't play a major role in any particular regression coefficient's posterior. Also, fitting a hierarchical model in this case is really computationally intense: I'm in the realm of millions of data points and thousands of coefficients, and if I put them all into one model, HMC sampling could take arbitrarily long. (If you're not satisfied that this is large enough to take a long time to fit, please just scale the numbers of data points and regression coefficients up until you're satisfied that it would take an uncomfortably long time.)

Candidate Solution

However, if I forgo a hierarchical model in this case and just fit the thousands of regressions separately, I can parallelize them straightforwardly on a compute cluster, avoid all the standard difficulties of fitting a potentially very complicated hierarchical model, and have all the posteriors fit in under an hour. By contrast, a hierarchical model at this scale might take days just to give me its first failed fit with divergences, etc.
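To make the embarrassingly parallel structure concrete, here is a minimal Python sketch with simulated data. The group size, effect size, and use of closed-form OLS as a stand-in for a full per-group posterior fit are all assumptions for illustration:

```python
import numpy as np

def fit_group(seed, n=2000, true_slope=0.5):
    # One group's regression, fit in closed form (OLS slope).
    # Each call is fully independent of the others, so on a cluster
    # these could run as separate jobs (e.g. one array-job task each).
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)
    y = true_slope * x + rng.normal(size=n)
    return np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)

# Embarrassingly parallel: no shared state between groups, so this
# serial map stands in for thousands of concurrent cluster jobs.
slopes = [fit_group(s) for s in range(200)]
print(f"mean slope across groups: {np.mean(slopes):.2f}")
```

Because there is no shared state, the wall-clock time is just the slowest single fit plus scheduling overhead, which is the appeal of the unpooled approach.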


Am I missing something in this line of reasoning? What's wrong with doing this if I have good reason to think that going hierarchical wouldn't change the marginal posteriors all that much, because I have so much data for each regression? Obviously, I'm no longer modelling the regression coefficients jointly, but beyond that, what am I really losing? If this approach is valid, there are pragmatic questions, such as: how many data points per regression should I have before I can think about throwing away hierarchical models and proceeding carefree with simple separate models?

With my thousands of regression coefficients, I'd like to answer questions such as: which regressions likely have the largest effect sizes, and how confident can I be about that, as reported by posterior rank probabilities, etc.? By running each regression separately, I'm obviously modelling them independently, so the regression coefficients become independent random variables. But presumably, in the limit of large n for each regression, something approximating independence is achieved for the regression coefficients anyway? Are these intuitions right, or am I way off? Any feedback or pointers to existing material related to this would be greatly appreciated.
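To make the rank-probability question concrete: because separately fitted posteriors are independent, joint draws are just independent draws from each marginal, and rank probabilities fall out by counting. A minimal Python sketch (the posterior means and sds below are made-up numbers, and approximating each marginal as normal is an assumption that is reasonable in the large-n regime):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical posterior summaries from three separate (unpooled) fits;
# each marginal posterior is treated as approximately normal.
post_means = np.array([0.10, 0.30, 0.32])
post_sds = np.array([0.02, 0.02, 0.02])

# Independence of the fits means joint draws are just independent
# draws from each marginal posterior.
draws = rng.normal(post_means, post_sds, size=(10_000, 3))

# P(coefficient j has the largest effect size)
p_largest = np.bincount(np.argmax(draws, axis=1), minlength=3) / draws.shape[0]
print(p_largest)
```

Note that coefficients 2 and 3 have overlapping posteriors, so neither is clearly the largest; the rank probabilities quantify exactly that ambiguity.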



Welcome to Discourse!

It’s certainly ok to fit models with non-hierarchical priors (what would you do if you received data collected on just one level?), but in many settings there are big advantages to hierarchical priors. The setting you’ve described is one where those advantages are probably not very large. If you intend to try to make some kind of inference about “significance” based on credible intervals not overlapping zero, then you need to be careful in the non-hierarchical setting to either set a sufficiently strong regularizing prior or to worry about multiple testing. However, in the setting you described, it would probably be silly to make inference based on whether credible intervals overlap zero anyway, as you will likely observe some well-constrained minuscule effect sizes that don’t happen to overlap zero.

With all that said, Stan’s reduce_sum function for within-chain parallelization should be really effective at scaling inference in the multi-level model to your problem. My guess is that reduce_sum should scale nearly as well as fitting all of the 1000s of regressions separately. If the number of regressions goes into the tens or hundreds of thousands then reduce_sum can still be effective, but for performance reasons you would want to make sure to slice the random effects parameter vector rather than the data.

For more about using reduce_sum, see

Hi @jsocolar,

Thanks for your reply.

I have previously read this paper by Gelman and Hill regarding multiple comparisons. http://www.stat.columbia.edu/~gelman/research/published/multiple2f.pdf

My understanding is that the paper demonstrates the advantages of a hierarchical model and argues that it mitigates the need for multiple-comparison corrections, because a hierarchical model explicitly models the variation between groups. The paper gives examples of this in the small-n-per-group regime, however. With large n in each group, the posteriors of the regression coefficients should asymptotically become the same whether or not the model is hierarchical, since the dependence on the prior diminishes. As such, if I were to make many comparisons with a non-hierarchical model, I should largely reach the same conclusions as the hierarchical model, and whether the x% credible intervals overlap zero should lead to similar conclusions in both models when there is a lot of data in each group. Is there something wrong with this? In this large-n setting, with an unpooled model, it's really unclear to me why I would need a strong regularizing prior, or a multiple-comparison correction, as asymptotically the posterior distribution should be well calibrated. Do you have any references for how or why people correct for multiple comparisons when they have many independent posteriors, each with a lot of data? For instance, if I ran many unpooled regressions and checked whether the credible intervals overlapped zero, what procedure would you have in mind for correcting for multiple comparisons in a Bayesian manner?
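To make my intuition about vanishing shrinkage concrete: in the conjugate normal-normal case, the weight the hierarchical prior mean gets in each group's posterior mean can be written down exactly. A quick Python check (the sigma and tau values are made up, and treating them as known is a simplifying assumption):

```python
import numpy as np

def shrinkage_weight(n, sigma=1.0, tau=0.5):
    # Weight on the prior mean mu in the conjugate normal-normal
    # posterior mean for one group of n observations:
    #     E[theta | y] = (1 - w) * ybar + w * mu
    # with w = (1/tau^2) / (n/sigma^2 + 1/tau^2).
    # sigma: residual sd; tau: between-group sd (assumed known here).
    return (1 / tau**2) / (n / sigma**2 + 1 / tau**2)

for n in [5, 100, 5000]:
    print(n, shrinkage_weight(n))
```

With n = 5 the prior carries substantial weight, but by n = 5000 its weight is below a tenth of a percent, which is the sense in which the pooled and unpooled posteriors should nearly coincide.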

If you literally do your inference by checking for credible intervals that overlap with zero, without stopping to think or care about effect sizes, then yes there is something wrong with this. Moving from the small-n regime to the large-n regime does tighten estimates around their true values, BUT: if we (frequentist style) imagine that a bunch of effects are null, then in the large-n regime these effects should be estimated as very close to zero, but they will also be estimated with very tight credible intervals. Without regularization, some too-large fraction of these credible intervals will fail to overlap zero.

With that said, this "problem" is in some sense artificial, in that it arises because we're imagining giving far too much inferential weight to whether or not credible intervals overlap zero. In the large-n regime, you will correctly conclude that these true null effects are all very small, and a more thoughtful inferential procedure simply wouldn't read too deeply into whether or not they overlap zero. As long as the inference is thoughtful in this way, then you're right that regularization becomes less important as n gets large, because the likelihood dominates over the prior.

Edit: this discussion has a direct analogue in the frequentist regime, which might be more clear. If you’re doing null hypothesis significance testing, and you just care about controlling the false discovery rate to be less than some alpha, where “discovery” makes no reference to effect size and just seeks a sufficiently low p-value, then moving to a large-n regime does not avoid the need to correct for multiple comparisons, even though the estimates asymptotically approach the true values. The estimates approach the true values (of zero) but the confidence intervals asymptotically approach zero width, and these processes happen, loosely speaking, at the same rate, such that the false discovery rate for a single model under a true null remains alpha.
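A quick simulation illustrating this point (a sketch with simulated null regressions; the sample sizes and replicate counts are arbitrary): the fraction of flat-prior 95% intervals excluding zero stays near 0.05 no matter how large n gets.

```python
import numpy as np

rng = np.random.default_rng(0)
z = 1.96  # two-sided 95% interval

def null_exclusion_rate(n, reps=2000):
    # Fraction of 95% intervals for a truly-zero slope that miss zero.
    # The slope estimate shrinks toward zero as n grows, but the
    # interval width shrinks at the same rate, so this stays ~0.05.
    hits = 0
    for _ in range(reps):
        x = rng.normal(size=n)
        y = rng.normal(size=n)              # true slope is exactly 0
        beta = x @ y / (x @ x)              # OLS slope (no intercept)
        resid = y - beta * x
        se = np.sqrt((resid @ resid) / (n - 1)) / np.sqrt(x @ x)
        hits += abs(beta) > z * se
    return hits / reps

rates = {n: null_exclusion_rate(n) for n in (100, 10_000)}
print(rates)
```

Both rates come out near 0.05: going from n = 100 to n = 10,000 does not reduce the per-test rate of intervals that exclude a true zero.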

Since you have lots of data, you’re probably in a situation where the so-called “centered” parameterization will sample significantly faster (see here).

Also, if you have a likelihood where there are >10 IID replicates for each subgroup of observations (or at least most subgroups), then you should use the Gaussian Sufficient Stats trick, see here & here & here.
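The identity behind that trick is worth stating: for IID normal observations, the log-likelihood depends on the data only through n, the sample mean, and the sum of squared deviations, so the model never has to loop over the raw replicates. A quick numerical check in Python (the subgroup data and parameter values are made up):

```python
import numpy as np

def normal_loglik_full(y, mu, sigma):
    # Log-likelihood summed over every IID observation.
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                  - 0.5 * ((y - mu) / sigma) ** 2)

def normal_loglik_suffstats(n, ybar, ss, mu, sigma):
    # The same quantity from sufficient statistics only:
    # n, the sample mean ybar, and ss = sum((y - ybar)^2),
    # using sum((y - mu)^2) = ss + n * (ybar - mu)^2.
    return (-0.5 * n * np.log(2 * np.pi * sigma**2)
            - 0.5 * (ss + n * (ybar - mu) ** 2) / sigma**2)

rng = np.random.default_rng(3)
y = rng.normal(2.0, 1.5, size=500)   # one subgroup's IID replicates
n, ybar = y.size, y.mean()
ss = np.sum((y - ybar) ** 2)

full = normal_loglik_full(y, mu=1.8, sigma=1.4)
reduced = normal_loglik_suffstats(n, ybar, ss, mu=1.8, sigma=1.4)
print(np.isclose(full, reduced))
```

The payoff is that the per-subgroup likelihood cost becomes independent of the number of replicates, which matters a lot at millions of data points.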
