Hi Martin and all,
I’ve seen this issue discussed before, though I can’t say where, when, or under what name. Note that the sample mean of a bunch of normally distributed values is a sufficient statistic for their mean, and its sampling distribution is known exactly (edit: conditional on the hyperparameters). That tells us how much variability the model should be willing to tolerate when ambiguously attributing variation to the intercept versus the sample mean of the random-effect parameters. It also means that we can approximate (or maybe exactly recover??) the “regular” model in terms of an intercept, a random effect with a hard sum-to-zero constraint, and an additional term \mu_R that represents the sampling distribution of the random-effect sample mean.
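To spell out the decomposition I have in mind, with some assumed notation (random effects $u_1, \dots, u_J$ with scale $\sigma_u$, conditional on the hyperparameters):

$$
u_j \sim \text{normal}(0, \sigma_u), \qquad
\bar{u} = \frac{1}{J}\sum_{j=1}^{J} u_j \sim \text{normal}\!\left(0, \frac{\sigma_u}{\sqrt{J}}\right),
$$

$$
u_j = \underbrace{(u_j - \bar{u})}_{\text{sums to zero over } j} \; + \; \underbrace{\bar{u}}_{\mu_R}.
$$

The vector of deviations $(u_1 - \bar{u}, \dots, u_J - \bar{u})$ is independent of $\bar{u}$ by the standard normal-theory result, which is why I suspect that the reconstruction (a hard sum-to-zero effect plus $\mu_R \sim \text{normal}(0, \sigma_u/\sqrt{J})$) might actually be exact.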
As you point out, when the number of groups goes up, the posterior for \mu_R gets narrower, which should ameliorate the problem. But I think a countervailing issue is that the posterior for the intercept also narrows. Since only the sum of the intercept and \mu_R is identified, progressively tighter distributions for \mu_R can still induce relatively large posterior correlations with the intercept, because the posterior for the intercept shrinks in tandem with the posterior for \mu_R. I’m not sure exactly how these effects interact as the number of groups gets progressively larger…
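To get a rough feel for the interplay, here’s a toy back-of-the-envelope check in Python under strong simplifying assumptions: a balanced one-way design with n observations per group, known variances, a flat prior on the intercept, and only the grand mean informing the (intercept, \mu_R) pair, so their joint posterior is Gaussian with a precision matrix we can write down directly. The values of sigma_u, sigma_e, and n are made up purely for illustration.

```python
import numpy as np

# Toy check under strong simplifying assumptions: balanced one-way design,
# known variances, flat prior on the intercept alpha, and
# mu_R ~ normal(0, sigma_u / sqrt(J)).  In this setup the grand mean of the
# data is the only thing that informs (alpha, mu_R), so their joint posterior
# is Gaussian and we can write its precision matrix down directly.
# sigma_u, sigma_e, and n are made-up illustration values.
sigma_u = 1.0   # between-group sd (treated as known)
sigma_e = 2.0   # observation-level sd (treated as known)
n = 5           # observations per group

for J in [5, 20, 100, 1000]:
    tau2 = sigma_u**2 / J        # prior variance of mu_R
    s2 = sigma_e**2 / (n * J)    # variance of the grand mean given alpha + mu_R
    prec = np.array([[1 / s2, 1 / s2],
                     [1 / s2, 1 / s2 + 1 / tau2]])   # precision of (alpha, mu_R)
    cov = np.linalg.inv(prec)
    corr = cov[0, 1] / np.sqrt(cov[0, 0] * cov[1, 1])
    print(f"J = {J:4d}: sd(alpha) = {np.sqrt(cov[0, 0]):.3f}, "
          f"sd(mu_R) = {np.sqrt(cov[1, 1]):.3f}, corr = {corr:.3f}")
```

Under these toy assumptions both marginal posterior sds shrink like 1/sqrt(J) while the correlation stays constant, but I wouldn’t want to lean on that outside this idealized setup.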
Note that a hard sum-to-zero constraint implies a weird generative model (if the group effects must sum to exactly zero, how would you predict the value for an unsampled group?), and I think this is why the constraint isn’t standard in practice.
Edited (by me, @jsocolar) to remove potentially unwelcoming language of “it’s obvious that”.