# Sum-to-zero intercepts and terminology

That’s an interesting perspective. Since the non-identifiability between overall and group-level intercepts is only attenuated by the prior, I would have thought that this yields an incorrectly wide posterior (provided that the prior is considerably wider than the likelihood, as should usually be the case). I’m using the term “incorrectly” because the wide posterior is not caused by uncertainty arising from the data but by the non-identifiability, so the cause is a rather technical one. In some sense, yes, this is not “incorrect” if the non-constrained model is the true data-generating model, but I’m not sure if all users are aware of this. The non-constrained model is fine as a data-generating model, but isn’t it problematic as a data-fitting model, at least when the number of groups is small and the marginal posterior of the overall intercept or the marginal posteriors of the group-level intercepts are of interest?

EDIT 1: This refers to @paul.buerkner’s post. I’m not quite sure why it’s not marked as a reply to that.
EDIT 2: My wording here is imprecise or even incorrect at some places. See the comment below.


@fweber144 when the marginal posterior of the overall intercept is of interest, usually we are interested in the grand mean of the entire population of groups, not just of the groups that were sampled. As such, we must include the diffuseness due to the non-identifiability. Otherwise we’re just doing inference on the mean of the groups that we sampled.

When the marginal posteriors of the group-level means are of interest, we obtain these by adding the intercept to the group-level offsets, thereby “summing out” the non-identifiability.


> As such, we must include the diffuseness due to the non-identifiability.

Hm, that’s the point I’m regarding as problematic. Of course, strictly speaking, you are right because we are talking about the non-constrained model taken as data-fitting model. But as I said above, I’m not sure if all users are aware of the fact that the diffuseness is caused by the non-identifiability and is therefore of rather technical nature. I guess most users are not aware that the group-level effects could be constrained to sum to zero, thereby ensuring identifiability and avoiding the “technical” diffuseness. If they were aware, some might consider switching to a constrained model because it might be more appropriate for their analysis. These are cases where

> Otherwise we’re just doing inference on the mean of the groups that we sampled.

is the objective (or one of the objectives) of the analysis. For example, consider a “random-effects meta-analysis”. In such analyses, the overall intercept as a centrality measure for only the groups (studies) that were included in the model fitting is often of interest. Prediction for a new group (study) might also be of interest, but that’s another topic (see next paragraph).

Apart from that, I think we need to make a difference between analysing the posterior and making predictions (for new groups). Could it be that you are thinking of predictions? If yes, then of course the uncertainty from the distribution of the group-level effects needs to be taken into account (additionally to the posterior uncertainty), but this is uncertainty I would differentiate from the technical diffuseness due to non-identifiability.

Again apart from that, I’m sorry because the wording in my comment above was a bit imprecise. I should have said “are needed” instead of “are of interest”. I’m also thinking of situations like calling brms::conditional_effects() with default arguments so that the posterior draws for the overall intercept are used, but the group-level effects are not added on top. In such situations, the non-identifiability does not sum out.

> When the marginal posteriors of the group-level means are of interest, we obtain these by adding the intercept to the group-level offsets, thereby “summing out” the non-identifiability.

Sorry, the wording in my comment above was incorrect. I actually meant the group-level effects (or group-level offsets, as you said), so the differences between the group-level intercepts and the overall intercept. But thinking about it again, I guess these differences are rarely of interest (or needed) on their own. At least currently, I can’t think of a situation where they would be.


@fweber144

For what it’s worth, I wouldn’t have said that the diffuseness was technical in nature or even that it’s caused by a non-identifiability. To me, the diffuseness in the intercept seems analogous to the uncertainty in the mean when fitting a normal distribution to a collection of observations. The difference in the hierarchical context is just that this diffuseness in the intercept induces annoying correlations in the posterior.

Note that if we fit a normal distribution to fixed data Y:
Y \sim Normal(\mu, \sigma)

we can rewrite as
Y = \mu + Z
Z \sim Normal(0, \sigma)

Now \mu and Z show exactly the same non-identifiability as in the hierarchical case. Imposing the sum-to-zero constraint “solves” the non-identifiability by fixing the intercept \alpha to the sample mean of Y and fixing Z = Y - \alpha.
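The equivalence of the two parameterizations, and what the sum-to-zero constraint does, can be checked numerically. This is a minimal sketch with made-up numbers, not taken from the thread's attached Stan files:

```python
import math

def normal_logpdf(x, loc, scale):
    """Log density of Normal(loc, scale) evaluated at x."""
    return -0.5 * math.log(2 * math.pi * scale ** 2) - (x - loc) ** 2 / (2 * scale ** 2)

# Fixed data Y and an illustrative sigma.
Y = [1.3, -0.4, 2.1, 0.7]
sigma = 1.5

for mu in (0.0, 0.9):
    # Rewrite Y = mu + Z, i.e. Z = Y - mu.
    Z = [y - mu for y in Y]
    # The likelihood of mu under Y ~ Normal(mu, sigma) equals the density of
    # the offsets Z under Z ~ Normal(0, sigma): the model only "sees" the
    # sum mu + Z, which is always Y, whatever mu we pick.
    ll_direct = sum(normal_logpdf(y, mu, sigma) for y in Y)
    ll_offset = sum(normal_logpdf(z, 0.0, sigma) for z in Z)
    assert math.isclose(ll_direct, ll_offset)

# A sum-to-zero constraint sum(Z) = 0 pins the intercept to the sample mean:
alpha = sum(Y) / len(Y)
Z_constrained = [y - alpha for y in Y]
assert abs(sum(Z_constrained)) < 1e-9
```

The loop passes for any value of `mu`, which is exactly the flat direction being discussed; only the constraint at the end singles out one point on that line.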

I agree with you that if there are cases where the quantity of interest is the sample mean \alpha, and we are genuinely uninterested in \mu, then the sum-to-zero constraint is appropriate (though we could also recover inference on the sample mean by fitting without the sum-to-zero constraint and then just computing the sample means over the inferred locations of the groups).


If it’s the intercept prior you’re talking about, then no, this prior is not the only thing identifying it. The random-effects assumption that the group means come from a common distribution also does. This Gaussian distribution identifies the model because it doesn’t let the random intercepts wander away infinitely.

I’m basically going to repeat in my own words what was already said above, but I think it’s worth repeating as I suspect it is often misunderstood.

I’m with jsocolar on this point. In my view the wideness of the intercept estimate is desirable. I think trying to estimate this intercept with few groups is like trying to estimate the mean of some distribution from too small a sample size (e.g. 4). It’s just not a good idea.

An analogous problem arises with a simple autoregressive model (e.g. an AR(1) model). Here the same effect (huge variance of the intercept and other parameters) occurs when the autocorrelation is strong (AR(1) parameter close to 1). It simply means that you need a longer time series to estimate these parameters properly, as the autocorrelation reduces the effective sample size.

So in my view (in both of these model types) one shouldn’t think of the diffuseness of the intercept as being caused by a non-identifiability, a technical problem. Rather, the way I see it is that, first of all, it’s not non-identified but actually a completely healthy model, and second, this diffuseness is not caused by a technical problem but accurately reflects the statistical limitations of the underlying estimation problem.

Another, maybe less confusing, way to say this is that the difference between the sum-to-zero model and the other one is that the sum-to-zero model estimates the sample mean of the sample of groups we have, while the other model estimates the mean of the distribution that this sample was taken from. This can easily be confused in conversations, as in the English language both of these might be called the “mean of the groups”.


I’m not sure if I understood you correctly, but for the equivalent model you wrote down, I don’t see a non-identifiability. Z is not a parameter, but a transformed parameter. So when given \mu, then Z is fixed at Y - \mu. I have appended 4 Stan files. The original model is normal_std.stan. If I understood you correctly, the equivalent model you wrote down is normal_stdeqv.stan. To have a non-identifiability, I think a new parameter has to be introduced, producing something like normal_ext.stan. That extended model can be simplified as done in normal_exteqv.stan (which shows the non-identifiability more clearly). Now fitting the models from normal_std.stan or normal_stdeqv.stan to some simulated data works fine (see normal.R). Fitting the other two models fails, as expected because of the non-identifiability (I used the default flat priors, so there’s nothing to attenuate the non-identifiability).

normal.R (995 Bytes)
normal_ext.stan (288 Bytes)
normal_exteqv.stan (239 Bytes)
normal_std.stan (188 Bytes)
normal_stdeqv.stan (255 Bytes)


> If it’s the intercept prior you’re talking about, then no, this prior is not the only thing identifying it. The random-effects assumption that the group means come from a common distribution also does. This Gaussian distribution identifies the model because it doesn’t let the random intercepts wander away infinitely.

With “prior”, I meant the joint prior for all parameters, not a marginal one. And I regarded the Gaussian distribution of the group-level effects as part of the prior. But of course, it may also be seen as part of the likelihood. In hierarchical models, drawing the line between prior and likelihood can be done in multiple ways. Sorry that I did not make this clear.

> In my view the wideness of the intercept estimate is desirable.

It is desirable if the non-constrained model is really the model that a user wants to fit. The question is the latter one. That’s why I wanted to point out that in my opinion, users should be informed more often about the non-identifiability and its consequences. There are situations where the sum-to-zero constraint is preferable.

> Rather, the way I see it is that, first of all, it’s not non-identified but actually a completely healthy model, and second, this diffuseness is not caused by a technical problem but accurately reflects the statistical limitations of the underlying estimation problem.

Well, this might just be a question of terminology now. What you are calling “the statistical limitations of the underlying estimation problem” is what I’m calling “non-identifiability”. The Stan and R code I posted above illustrates pretty well what I mean by “non-identifiability” and the diffuseness due to it.
Btw, in my original comment above, I said “a rather technical”, not just “a technical”, to avoid ambiguities with convergence problems. At some later occurrences, I left “rather” out, simply because of laziness. So perhaps we had a misunderstanding here.


Sorry that we now have basically two parallel discussions going on here. Perhaps we should stop the one I’m involved in (and perhaps move it to a different thread). But I think we largely agree on most of the discussed points. It might just be terminology and emphasis which is different. I just wanted to point out that many users might not be aware of the consequences of the non-identifiability/statistical limitations of the non-constrained model.

I agree its mostly terminological confusion.

I wouldn’t call it “technical” or “rather technical” or anything with “technical”, however, because I feel this implies that it’s a software limitation, while in reality the software is giving the correct answer.


Yes, you’re right.

Hi guys,
I went ahead and split this conversation out as a separate topic in case anybody has more to say.

For my part, I’ll note that:

We can view this model as the limiting case of the model
\vec{Y} \sim Normal(\mu + \vec{Z}, s)
\vec{Z} \sim Normal(0, \sigma)
in the limit as s \rightarrow 0

Not sure if that’s clarifying or if we’re all just on the same page already. My only point, like @Raoul-Kima’s, is that the sum-to-zero constraint yields an intercept that corresponds to the sample mean of Y rather than the population mean from which Y arises (whether we’re just fitting a normal distribution as in my previous post or whether it’s embedded in a hierarchical model as in this post). IMO the weirdness of doing this is nicely exposed in the limit that s \rightarrow 0 as above, yielding the example from my previous post. Also, it’s hard for me to imagine a situation where it’s possible to formulate a prior on the sample mean except via reference to a prior for the population mean.

As I said before, if the inferential goal is inference on the sample mean and one can formulate the correct prior on the sample mean (this is straightforward if your prior on the population mean is Gaussian; otherwise it gets tricky), then this seems like a fine way to obtain that inference, which will be equivalent to the inference that you would obtain by fitting the “standard” model and then adding the intercept to the offsets post hoc.
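The post-hoc route can be made concrete with synthetic draws. The draws below are hypothetical stand-ins for posterior output from an unconstrained fit, just to show that "sample mean of the group means, computed per draw" and "intercept after re-centering the offsets to sum to zero" are the same quantity:

```python
import random

random.seed(0)

K = 4       # number of sampled groups
draws = 1000

for _ in range(draws):
    # Hypothetical posterior draws: a population mean mu and K group offsets.
    mu = random.gauss(0.0, 1.0)
    theta = [random.gauss(0.0, 0.5) for _ in range(K)]

    # Group means implied by this draw.
    group_means = [mu + t for t in theta]

    # "Sample mean" of the sampled groups, computed post hoc ...
    sample_mean = sum(group_means) / K

    # ... equals the intercept obtained by re-centering the offsets to sum
    # to zero within the draw (the sum-to-zero parameterization).
    theta_bar = sum(theta) / K
    centered_intercept = mu + theta_bar
    assert abs(sample_mean - centered_intercept) < 1e-12
```

This is an algebraic identity per draw, which is why the two routes give equivalent inference for the sample mean.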


I sense no more disagreement in the parts of the discussion where I participated.


I hesitated to comment here or in the originating thread but I think that there are enough misconceptions that an attempt at clarification might be worth it.

Firstly, “identifiability” is a technical property that isn’t all that related to how it’s often used colloquially in applied communities. “Identifiability” refers to whether or not the likelihood function collapses to the right model configuration with infinite data replications; in general it doesn’t have much to do with the behavior of the likelihood function, or a resulting posterior density function, for finite data sets. For finite data sets the breadth of the likelihood function, or resulting posterior density function, is simply a manifestation of inferential uncertainties. To avoid confusion I refer to complex uncertainties as “degeneracies”. For much more on these terms and their relationships see Identity Crisis.

So long as the observational model is well-behaved then a simple normal hierarchical model

\theta_{k} \sim \text{normal}(0, \tau) \\ y_{n} \sim \pi(\mu + \theta_{k(n)}, \phi)

will always be identified whenever the infinite data replications fill each of the K contexts. Even if some contexts are not observed, so that the corresponding \theta_{k} is informed only by the hierarchical model, the population parameters \mu and \tau will be identified.

Overlapping factor models, or more colloquially “random effects models”, that overlay multiple hierarchies,

\theta_{k, j} \sim \text{normal}(0, \tau_{j}) \\ y_{n} \sim \pi(\mu + \sum_{j = 1}^{J} \theta_{k(n), j}, \phi)

are a bit more subtle. Here we need enough levels to be observed in each factor so that the factor contributions can be separated from each other; for much more see for example Impact Factor.

In other words identifiability isn’t an issue provided that the factor levels are sufficiently occupied, and there is a wealth of old-school experimental design results that inform how to ensure this in practice. That doesn’t mean, however, that the likelihood functions/posterior density functions for any finite observation will be all that well-behaved.

Perhaps the most common uncertainty is between the baseline and the heterogeneous terms. Even in a simple normal hierarchical model we can often increase \mu while decreasing all of the \theta_{k} in observed contexts to achieve a similar fit, resulting in correlated uncertainties between all of those parameters. In a more complicated overlapping factor model one can perturb different factors in different ways to achieve similar behaviors.
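This trade-off between \mu and the \theta_{k} can be seen directly in a toy joint density. The numbers below are illustrative, not from any real fit: the data term cannot tell the two configurations apart, while the hierarchical term can, which is what leaves the posterior proper but with correlated uncertainties:

```python
import math

def normal_logpdf(x, loc, scale):
    return -0.5 * math.log(2 * math.pi * scale ** 2) - (x - loc) ** 2 / (2 * scale ** 2)

# One observation per context, illustrative numbers.
y = [0.8, -0.2, 1.5]
phi = 1.0   # observational scale
tau = 1.0   # population scale

def data_loglik(mu, theta):
    # log pi(y | mu + theta_k, phi): depends on mu and theta only via sums.
    return sum(normal_logpdf(yk, mu + tk, phi) for yk, tk in zip(y, theta))

def prior_logdens(theta):
    # Hierarchical term theta_k ~ normal(0, tau).
    return sum(normal_logpdf(tk, 0.0, tau) for tk in theta)

mu0, theta0 = 0.3, [0.5, -0.5, 0.0]   # offsets centered at zero
c = 0.7
mu1, theta1 = mu0 + c, [tk - c for tk in theta0]

# The data term cannot distinguish the two configurations ...
assert math.isclose(data_loglik(mu0, theta0), data_loglik(mu1, theta1))
# ... but the hierarchical term penalizes shifting all theta_k away from
# their zero mean, so the joint density differs and the posterior stays
# proper, albeit with correlated uncertainty in mu and the theta_k.
assert prior_logdens(theta1) < prior_logdens(theta0)
```

With weak data (here one observation per context) the penalty is gentle, so a long, correlated ridge survives in the posterior; more data or more prior information steepens it.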

To be clear these uncertainties are the consequence of the model and the data. More data and/or more prior information will in many cases suppress these uncertainties without compromising inferential performance. If these strategies are not available then all we can do is accurately quantify the uncertainties to communicate what we weren’t able to learn from the given observations.

More ad-hoc changes to the model, however, aren’t always so robust. One can always change the model to eliminate uncertainties. Replacing a hierarchical model with a single parameter immediately collapses any complicated interactions between the population parameters and the individual context parameters, but it also introduces model misfit if the heterogeneity across contexts isn’t negligible. Arbitrarily modifying the model simply to “improve the fit” is not good statistics.

This brings us to the often recommended “sum-to-zero” constraint. This constraint fundamentally changes the population model and its consequences.

Typical hierarchical models are motivated by infinite exchangeability, which implies conditional independence between the individual context parameters; for much more see Hierarchical Modeling. For example it’s this conditional independence that allows one to readily make predictions for new, unobserved contexts. Perhaps more relevantly to much of this discussion, conditional independence is what allows us to talk about the population location \mu as the “mean of the individual parameters” for normal hierarchical models.

Once we add a sum-to-zero constraint, however, we break infinite exchangeability and its useful properties. Obstructing interactions between the population location and the individual context parameters will reduce uncertainties, but it will also complicate the relationship between the population parameters and the individual context parameters. We can no longer talk about the mean of any individual parameter without also considering the behavior of all of the other contexts. We also cannot generalize to any new contexts. There are a number of other issues that can arise as well – when the individual context parameters are strongly informed by data the sum-to-zero constraint can introduce strong degeneracies of its own!

To be clear, infinite exchangeability is an assumption, and it may not always be appropriate for a given analysis. Exchangeable but not infinitely exchangeable models can be useful in many circumstances. When building hierarchical models, however, we have to understand these assumptions that we’re making and their inferential consequences. If one wants to add an ad-hoc constraint to reduce uncertainties, and improve computation, then one needs to verify that the resulting model is compatible with their domain expertise. It’s all modeling.

In my opinion much of this confusion arises because people take modeling assumptions for granted. Certainly most of the properties of the infinitely exchangeable hierarchical models on which almost all “random effects models” are built are never discussed in the introductions from which applied practitioners learn. When one doesn’t understand the import of assumptions being made implicitly then it’s easy to change those assumptions without understanding the consequences!


I’m unfamiliar with this notion of identifiability, which to me seems to be conflated a tad with consistency (which is sufficient but not necessary for identification afaik) in this definition. I am also accustomed to seeing models being referred to as unidentified when they have more regressors than data points, for the basic case of OLS. Is that not a finite sample property of the likelihood function? That it has no unique minimum?

I only ask for my own enlightenment, I may be a victim of econometrics-specific jargon. If there is some reference that discusses this in detail I’d appreciate a link!


Sorry I was being a bit sloppy.

Formally identifiability is a property of the observational model \pi(y ; \theta). An observational model is identified only if \pi(y; \theta_{1}) \ne \pi(y; \theta_{2}) whenever \theta_{1} \ne \theta_{2}. In other words every probability distribution over the observations in the observational model has to be different. Note that the differences don’t have to be huge – any two distributions just can’t be exactly the same.

This formal definition is largely motivated by asymptotics – lack of identifiability provides an obstruction to the asymptotic limit where likelihood functions all collapse to a point, estimators are consistent, and the like. In other words it’s a necessary but not sufficient condition for consistency of many estimators. There, strict uniqueness of the probability distributions in the observational model is enough to guarantee that they can be discriminated with enough data, and the asymptotic limit will always accumulate enough data eventually.

The strict uniqueness of those probability distributions, however, doesn’t necessarily mean much in practice. For any finite data set what matters for many frequentist methods and Bayesian inference is the likelihood function,

\mathcal{L}_{\tilde{y}}(\theta) = \pi(\tilde{y}; \theta).

The likelihood function evaluated at different \theta can still be the same even if the corresponding distributions labeled by those parameters are different. Even if an observational model is identified the likelihood functions can still be degenerate, and the nature of the degeneracy can vary from observation to observation.

At the same time a non-identified observational model will by definition have at least two equal likelihood values for any observation, but that might not be enough to result in a pathologically degenerate likelihood function.

The collinearity you mention is a special kind of non-identifiability where the observational model decomposes into infinite subsets of equivalent probability distributions (formally, planes in the model configuration space). In this case, for any observation the realized likelihood function will also decompose into infinite subsets of equal likelihood values (again these subsets define planes in the model configuration space).

Some non-identifiabilities manifest as degeneracies for finite data, but not all do, which is why I warned that identifiability itself is not the most relevant concept.

I’ll again refer to my chapter on identifiability and degeneracy, which goes into the details more carefully: Identity Crisis. It has further references to some of the more classic treatments of the material, including Casella/Berger and van der Vaart.