I hesitated to comment here or in the originating thread but I think that there are enough misconceptions that an attempt at clarification might be worth it.

Firstly “identifiability” is a technical property that isn’t all that related to how it’s often used colloquially in applied communities. “Identifiability” refers to whether or not the likelihood function collapses to the right model configuration with infinite data replications; in general it doesn’t have much to do with the behavior of the likelihood function, or a resulting posterior density function, for finite data sets. For finite data sets the breadth of the likelihood function, or resulting posterior density function, is simply a manifestation of inferential uncertainties. To avoid confusion I refer to complex uncertainties as “degeneracies”. For much more on these terms and their relationships see Identity Crisis.

So long as the observational model is well-behaved then a simple normal hierarchical model

\theta_{k} \sim \text{normal}(0, \tau)
\\
y_{n} \sim \pi(\mu + \theta_{k(n)}, \phi)

will always be identified whenever the infinite data replications fill each of the K contexts. Even if some contexts are not observed, so that the corresponding \theta_{k} is informed only by the hierarchical model, the population parameters \mu and \tau will be identified.

Overlapping factor models, or more colloquially “random effects models”, that overlay multiple hierarchies,

\theta_{k, j} \sim \text{normal}(0, \tau_{j})
\\
y_{n} \sim \pi(\mu + \sum_{j = 1}^{J} \theta_{k(n), j}, \phi)

are a bit more subtle. Here we need enough levels to be observed in each factor so that the factor contributions can be separated from each other; for much more see for example Impact Factor.

In other words identifiability isn’t an issue provided that the factor levels are sufficiently occupied, and there is a wealth of old-school experimental design results that inform how to ensure this in practice. That doesn’t mean, however, that the likelihood functions/posterior density functions for any finite observation will be all that well-behaved.

Perhaps the most common uncertainty is between the baseline and the heterogeneous terms. Even in a simple normal hierarchical model we can often increase \mu while decreasing *all* of the \theta_{k} in observed contexts to achieve a similar fit, resulting in correlated uncertainties between all of those parameters. In a more complicated overlapping factor model one can perturb different factors in different ways to achieve similar behaviors.

To be clear these uncertainties are the consequence of the model and the data. More data and/or more prior information will in many cases suppress these uncertainties without compromising inferential performance. If these strategies are not available then all we can do is accurately quantify the uncertainties to communicate what we weren’t able to learn from the given observations.
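As a rough sketch of the trade-off between the baseline and the context parameters, assume a normal observational model with the scales \phi = sigma and \tau held fixed and a normal(0, mu_prior_sd) prior on \mu. None of these choices come from the discussion above; they are simplifications that make the posterior over (\mu, \theta_{1}, ..., \theta_{K}) exactly Gaussian, so the correlations can be read off from the inverse of the precision matrix. The function name and its arguments are hypothetical.

```python
import numpy as np

def posterior_correlations(counts, sigma=1.0, tau=1.0, mu_prior_sd=10.0):
    """Posterior correlation between mu and each theta_k.

    Assumes y_n ~ normal(mu + theta_k(n), sigma), theta_k ~ normal(0, tau),
    mu ~ normal(0, mu_prior_sd), with sigma and tau known (an assumption
    made only to keep the posterior Gaussian).
    """
    counts = np.asarray(counts, dtype=float)
    K = counts.size
    a = counts / sigma**2  # likelihood precision contributed by each context

    # Precision matrix over (mu, theta_1, ..., theta_K).
    precision = np.zeros((K + 1, K + 1))
    precision[0, 0] = a.sum() + 1.0 / mu_prior_sd**2
    precision[0, 1:] = a
    precision[1:, 0] = a
    precision[1:, 1:] = np.diag(a + 1.0 / tau**2)

    cov = np.linalg.inv(precision)
    sd = np.sqrt(np.diag(cov))
    corr = cov / np.outer(sd, sd)
    return corr[0, 1:]  # correlations between mu and each theta_k

counts = [5, 5, 5, 5]  # hypothetical number of observations in each context
weak_prior = posterior_correlations(counts, mu_prior_sd=10.0)
tight_prior = posterior_correlations(counts, mu_prior_sd=0.1)
print(weak_prior)
print(tight_prior)
```

Under these assumptions every correlation is negative, reflecting that \mu can go up while the \theta_{k} go down, and the tighter prior on \mu weakens the correlations without changing the structure of the model.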

More ad-hoc changes to the model, however, aren’t always so robust. One can always change the model to eliminate uncertainties. Replacing a hierarchical model with a single parameter immediately collapses any complicated interactions between the population parameters and the individual context parameters, but it also introduces model misfit if the heterogeneity across contexts isn’t negligible. Arbitrarily modifying the model simply to “improve the fit” is not good statistics.

This brings us to the often recommended “sum-to-zero” constraint. This constraint *fundamentally* changes the population model and its consequences.

Typical hierarchical models are motivated by *infinite exchangeability* which implies conditional independence between the individual context parameters; for much more see Hierarchical Modeling. For example it’s this conditional independence that allows one to readily make predictions for new, unobserved contexts. Perhaps more relevantly to much of this discussion conditional independence is what allows us to talk about the population location \mu as the “mean of the individual parameters” for normal hierarchical models.

Once we add a sum-to-zero constraint, however, we break infinite exchangeability and its useful properties. Obstructing interactions between the population location and the individual context parameters will reduce uncertainties but it will also complicate the relationship between the population parameters and the individual context parameters. We can no longer talk about the mean of any individual parameter without also considering the behavior *of all of the other contexts*. We also cannot generalize to any new contexts. There are a multitude of other issues that can arise as well – when the individual context parameters are strongly informed by data the sum-to-zero constraint can introduce strong degeneracies of its own!
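To make the coupling concrete, here is a minimal sketch that assumes one particular construction of the constraint: draw independent normal(0, \tau) parameters and subtract their mean, which for normal variables is equivalent to conditioning on a zero sum. Other implementations of a sum-to-zero constraint exist, and the values of K and tau are arbitrary.

```python
import numpy as np

K = 5      # hypothetical number of contexts
tau = 1.0  # hypothetical population scale

rng = np.random.default_rng(1)

# Independent context parameters, as in the unconstrained population model.
theta = rng.normal(0.0, tau, size=(100_000, K))

# Impose the constraint by centering each draw (one assumed construction).
theta_stz = theta - theta.mean(axis=1, keepdims=True)

# Empirical correlations between the constrained context parameters.
emp_corr = np.corrcoef(theta_stz, rowvar=False)

# Exact covariance of the centered parameters: tau^2 * (I - 11^T / K).
exact_cov = tau**2 * (np.eye(K) - np.ones((K, K)) / K)
exact_corr = exact_cov / exact_cov[0, 0]  # all diagonal entries are equal

print(emp_corr[0, 1], exact_corr[0, 1])
```

The context parameters are no longer conditionally independent: every pair has prior correlation -1 / (K - 1), so what the data say about one context immediately constrains all of the others.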

To be clear infinite exchangeability is an assumption, and it may not always be appropriate for a given analysis. Exchangeable but not infinitely exchangeable models can be useful in many circumstances. When building hierarchical models, however, we have to understand the assumptions that we’re making and their inferential consequences. If one wants to add an ad-hoc constraint to reduce uncertainties, and improve computation, then one needs to verify that the resulting model is compatible with their domain expertise. It’s all modeling.

In my opinion much of this confusion arises because people take modeling assumptions for granted. Certainly most of the properties of the infinitely exchangeable hierarchical models on which almost all “random effects models” are built are never discussed in the introductions from which applied practitioners learn. When one doesn’t understand the import of assumptions being made implicitly then it’s easy to change those assumptions without understanding the consequences!