I think that this point cuts to the heart of an apparent confusion in the original post, and is worth elaborating on. This can be made intuitive:
Consider an ordinary Gaussian linear regression. We can think of this regression as containing a random term, namely the residual. Without any predictors at all, this random term "explains" all of the variation. But when we introduce predictors with explanatory power, the model strongly prefers to attribute that variation to the predictors rather than to the residual. Why is this? Because if we can shrink the standard deviation of the residual term while still fitting the data well, we get a higher likelihood: a tall, narrow normal distribution places more probability density on points near its mean than a short, wide one does, so well-fit residuals are rewarded when the residual standard deviation is small.
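A minimal sketch of this in R (simulated data, hypothetical variable names), showing that the same observations get a much higher log-likelihood once a predictor shrinks the residual standard deviation:

```r
# Sketch: adding a useful predictor shrinks the residual sd and
# raises the Gaussian log-likelihood (simulated data).
set.seed(1)
n <- 200
x <- rnorm(n)
y <- 2 + 3 * x + rnorm(n, sd = 1)

m0 <- lm(y ~ 1)   # no predictors: the residual absorbs all of the variation
m1 <- lm(y ~ x)   # the predictor absorbs most of it; the residual sd shrinks

c(sigma(m0), sigma(m1))     # residual sd: roughly 3.2 vs roughly 1.0
c(logLik(m0), logLik(m1))   # log-likelihood is far higher for the model with x
```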
The same thing is happening in the random effects model. Suppose we have a model like y ~ x + (1 | A), where there is just one value of x associated with each level of A. Then this model has the form:
\mu_j = a + bx_j + \epsilon_j
y_i = \mu_j + \mathcal{E}_i
where j refers to the level of A corresponding to observation i, and both \epsilon and \mathcal{E} are Gaussian. Notice that the first line has precisely the form of a linear regression. Again, the model "wants" to attribute all the variation it can to a + bx_j so that it can minimize the standard deviation of \epsilon.
The upshot is that even if (1|country) could, on its own, explain literally all of the variation, when there are explanatory covariates with predictive power the model would much prefer to attribute the variation to them, for the same reason that a linear regression prefers to attribute variation to covariate effects rather than to the residual term.
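As a hedged illustration of this (simulated data and hypothetical variable names; assumes the lme4 package), here is the setup described above, with one value of x per level of A. The random intercept could in principle absorb all of the between-group variation, but the fitted model attributes it to x and keeps the random-intercept standard deviation close to the sd of \epsilon:

```r
# Sketch of the grouped-data situation above (simulated data; assumes lme4).
library(lme4)

set.seed(1)
J <- 50                                    # number of groups (levels of A)
n_per <- 10                                # observations per group
g <- rep(seq_len(J), each = n_per)         # group index for each observation

x_j  <- rnorm(J)                           # one value of x per group
mu_j <- 1 + 2 * x_j + rnorm(J, sd = 0.5)   # group means: a + b*x_j + epsilon_j

dat <- data.frame(A = factor(g),
                  x = x_j[g],
                  y = mu_j[g] + rnorm(J * n_per, sd = 1))   # y_i = mu_j + E_i

fit <- lmer(y ~ x + (1 | A), data = dat)
summary(fit)
# The fixed slope on x is estimated near 2, and the random-intercept sd is
# estimated near 0.5 (the sd of epsilon), not near the full between-group sd
# of mu_j: the model attributes as much of the variation as it can to x.
```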