Is it okay to penalize a parameter to enforce a sum-to-zero constraint? For example, if I want beta in a model to be centered at 0, I could set the prior mean to 0, but I could also penalize the likelihood when the mean of beta isn't 0. My estimation seems to work very well in this case, but I haven't seen anyone do this to essentially introduce a sum-to-zero constraint. Is this acceptable to do?
Something like this, for instance, where I want the betas that belong to a particular gamma to have a mean of 0:
for (i in 1:I) {
  beta[i] ~ normal(0, sigma_beta_k[item_type_for_beta[i]]);
}

// Constraint for betas within each gamma group
for (k in 1:K) {
  vector[num_items_per_group[k]] beta_group;
  int idx = 1;
  for (i in 1:I) {
    if (item_type_for_beta[i] == k) {
      beta_group[idx] = beta[i];
      idx += 1;
    }
  }
  // Enforcing the sum of beta_group to be zero
  target += -150 * square(sum(beta_group)); // Large penalty for non-zero sums
}
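The same penalty can also be accumulated in one pass, without rebuilding beta_group for every k (a sketch only, assuming the same data structures as in the code above):

{
  vector[K] group_sum = rep_vector(0, K);
  for (i in 1:I) {
    group_sum[item_type_for_beta[i]] += beta[i];
  }
  // dot_self(group_sum) is the sum over k of square(group_sum[k]), so this
  // adds the same quadratic penalty on every group sum as the loop above.
  target += -150 * dot_self(group_sum);
}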
I have followed these quite extensively. Ultimately, my current code just soft-constrains with priors. However, I noticed that the "random" effects I have centered with a 0-mean prior end up not having an actual 0 mean in the estimates; let's say beta has a mean of 0.05, for instance. I think this makes my estimation take a little longer to run, since it is struggling a bit with identifiability.
Ideally, I assumed, the means of any random effects I have should be 0, such as if I ran glmer(count ~ -1 + (1|person) + (1|item) + type, family = poisson). In that glmer fit, the means of the person and item parameters would be 0. If I wanted my estimates to match glmer, then I would need to enforce the means of my random effects to be 0, not some value close-ish to 0 that seems to occur with soft constraints through priors. Hard constraints didn't seem to work that well, but I tried penalizing the likelihood when the mean of my random-effect parameters deviated from 0, and this seems to give me results close to glmer, in addition to making my estimation nearly 10x faster. I was mostly confused about why this approach is not used, as I may be missing something here. Is there something wrong with the approach I am using?
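For reference, here is a minimal Stan sketch of the kind of model the glmer call above fits, with the usual unconstrained random effects (population mean zero, no constraint on the sample mean). The data names (person_id, item_id, type_id) and the specific hyperpriors are my assumptions, not from the original post:

data {
  int<lower=1> N;                              // observations
  int<lower=1> J;                              // persons
  int<lower=1> I;                              // items
  int<lower=1> K;                              // item types
  array[N] int<lower=0> y;                     // counts
  array[N] int<lower=1, upper=J> person_id;
  array[N] int<lower=1, upper=I> item_id;
  array[N] int<lower=1, upper=K> type_id;
}
parameters {
  vector[K] type_eff;                          // type effects (no separate global intercept)
  vector[J] z_person;                          // non-centered person effects
  vector[I] z_item;                            // non-centered item effects
  real<lower=0> sigma_person;
  real<lower=0> sigma_item;
}
model {
  // Random effects have population mean 0 by construction; their sample
  // mean is left free, which is the standard "random effects" setup.
  vector[J] u_person = sigma_person * z_person;
  vector[I] u_item = sigma_item * z_item;
  z_person ~ std_normal();
  z_item ~ std_normal();
  sigma_person ~ normal(0, 1);                 // assumed hyperpriors
  sigma_item ~ normal(0, 1);
  type_eff ~ normal(0, 2);
  y ~ poisson_log(type_eff[type_id] + u_person[person_id] + u_item[item_id]);
}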
Since the quadratic penalty you are adding is the log-kernel of a normal distribution with mean zero and known standard deviation, I would be a bit surprised if it yielded markedly different results than that approach (see the line sum(alpha) ~ normal(0, 0.001);).
EDITS: Changed “kernel” to “log-kernel”; corrected the link, and added the corresponding code line here.
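To spell out that equivalence: the penalty matches the normal log-kernel -x^2 / (2 * sigma^2) with 1 / (2 * sigma^2) = 150, i.e. sigma = 1 / sqrt(300), about 0.058. A sketch (use one statement or the other, not both):

// These add the same quadratic term to target, up to an additive constant.
target += -150 * square(sum(beta_group));
sum(beta_group) ~ normal(0, inv_sqrt(300));

The only practical difference is how you think about the scale; 0.058 is a fairly tight soft constraint on the group sum.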
I recently have tried this as well. It does give essentially the same results; however, using this sum(alpha) prior approach is roughly 5x slower than penalizing when the mean of the parameter deviates from 0.
The short answer is, IMHO, that the sum-to-zero model, while practically useful, is a bit weird theoretically. For example, keeping everything else in the model fixed, it implies that a priori the variance attributable to the random effect is smaller when there are fewer groups than when there are many groups. It is also unclear how to use that model to make predictions about unseen groups (if those are of interest), and if the population standard deviation is of interest, the sum-to-zero model will incur a bit of bias. You may not care about this, and then sum-to-zero is probably OK.
This is also why, in the linked post, I was interested in finding a way to keep the sum-to-zero constraint for efficient sampling and then transform the draws to obtain samples from the model without that constraint, as that would let you keep the best of both worlds. It is also possible that lme4 has some smart tricks to avoid those issues (people have recommended the lme4 paper to me, but I haven't yet had the time to read it).
In my case, I do not think it would be possible to obtain samples from the model without the constraint, as without the constraint the model is not identifiable. In short, I have individual test item effects (beta) and item type effects (gamma). Without a constraint, the model would have no way of determining whether the overall effect of an item comes from the individual item or just from its type, and there would be infinitely many solutions. So currently, I am just saying that the individual item effects within an item type must sum to 0.
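To spell out that invariance (my paraphrase, assuming the linear predictor for an item i of type k contains gamma[k] + beta[i]): for any constant c,

gamma[k] + beta[i] = (gamma[k] + c) + (beta[i] - c)   for every item i in group k,

so the likelihood cannot tell how much of a common shift belongs to the type effect versus its item effects. Pinning down the within-group sum of the betas, exactly or softly, removes exactly that flat direction.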
In my experience, models including random effects nested in this way often work pretty well, so I don't think that's inherent to this model class (it might still be specific to the data you have). You may also explore the makemyprior paper, which allows you to set quite sensible priors and might help with identifiability (or may not)…
Just to add: there are certainly examples where "soft" sum-to-zero constraints are useful in Stan (e.g. for identifiability), and there's nothing intrinsically unacceptable about doing so. Here is a prominent example from @mitzimorris, where a soft sum-to-zero constraint is placed on the parameter phi.
From that link:
The expression on line 4 enforces the sum-to-zero constraint on phi:

normal_lpdf(sum(phi) | 0, 0.001 * N);

Since the random vector phi sums to zero, it follows that the mean of phi must also be zero, but instead of requiring the mean to be exactly zero, this constraint "soft-centers" the mean by keeping it as close to zero as possible. This expression calls Stan's implementation of the normal probability density. The calling syntax for a probability density function follows probability function notation, so that a vertical bar is used to separate the outcome from the parameters of the distribution. The straightforward specification of this constraint is:

normal_lpdf(mean(phi) | 0, 0.001);

The mean is the sum of the vector elements divided by the vector length, and division is a relatively expensive operation. By multiplying the location and scale parameters by the vector length, we remove the division operation from the formula.
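Side by side, the two forms discussed in that passage (a sketch only; phi and N as in the quote, and you would include one statement, not both):

// Straightforward form: soft-center the mean of phi at zero.
target += normal_lpdf(mean(phi) | 0, 0.001);
// Rescaled form: since sum(phi) = N * mean(phi), this adds the same
// quadratic term while avoiding the division inside mean().
target += normal_lpdf(sum(phi) | 0, 0.001 * N);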
Doing the soft constraint as you referenced here does bring the mean to essentially 0. It gives the same estimates as penalizing the likelihood when the mean of beta deviates from 0. The soft constraint does take quite a bit longer to sample, though.
So it seems either of these approaches works well in my case. I'd be curious how brms handles random effects to push them towards zero, so I may go take a look there out of curiosity. My full model cannot use any built-in Stan distributions, though.
I also want to confirm: if a random effect in my model doesn't have a zero mean, that would be a problem, right?
No, that would not be a problem, though it depends on exactly what you mean.
If the population mean of the random effect is non-zero, such that the levels are drawn from a normal with mean mu not equal to zero, then that mean will not be identifiable separately from any other intercept terms in the model. Keep things to one single intercept.
If the population mean of the random effect is zero, but there is no constraint on the sample mean, then this is the usual thing that we mean by random effects.
If you enforce that the sample mean of a random effect is zero, then there are some situations where the model might be better identified and marginally faster, but there are very few situations where this gives the inference that you actually want. For example, the linear predictor (leaving out the random effect) no longer corresponds to the population mean, and the variance of the random effect no longer corresponds to the population variance. Unless you have some very specific and nonstandard reason for doing this, don’t do it.
If the mean of my random effect isn't 0 in the data, but I force the estimated mean to be 0, what I see happen is that my fixed effect "absorbs" the shift I forced. So if the mean of beta is 0.05 but I force the estimated mean to be 0, my fixed effect will absorb this 0.05. My intended interpretation of beta in my model is essentially as an error term of the fixed effect: I estimate a fixed effect, and then the variance of beta is whatever is unexplained by my fixed effect.
How would this change the interpretation of my parameters compared to just a soft constraint on beta?
In a standard linear model, for simplicity without covariates, the residual is like an error of the fixed effect intercept. That doesn’t mean that you would want to peg the estimate of the intercept to the sample mean with no uncertainty by forcing the sample mean of the residuals to equal zero. Just be aware that this is exactly analogous to what you are doing when you force the sample mean of the random effect vector to be zero. In this sense, the fixed effect predictor no longer provides a well calibrated estimate of the population mean.
Edit: this is equally true if you soft-constrain the mean to be zero. The standard thing that people mean when they talk about random effects is a parameter given a prior that is a normal distribution with mean zero and some hyperprior on the scale, with no additional constraint, hard or soft, on the sample mean.
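To make the analogy concrete (a minimal illustration, not from the thread): in an intercept-only model y_i = mu + e_i, averaging both sides gives

mean(y) = mu + mean(e),

so requiring mean(e) = 0 pins the estimate of mu to the sample mean mean(y) exactly, leaving no uncertainty for the intercept to carry.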