Centered vs non-centered model, different loglikelihoods

Hi, I am trying to understand the centered vs non-centered parameterizations of hierarchical models. I can understand that non-centered parameterization might work better because it makes the sampled parameters less correlated, avoiding issues such as Neal’s funnel. However, I have trouble accepting that the centering would only affect the geometry of the sampled parameters without affecting the underlying loglikelihood calculations. To me, it seems that the centered parameterization “penalizes” the model correctly for the variance of random effects, whereas the non-centered parameterization does not.

Context: I’m not a trained statistician. I mainly have hands-on experience with frequentist mixed-effects modelling, and some hands-on experience with Bayesian hierarchical modelling.

My initial question is: Is this difference in loglikelihoods a known issue which is not considered meaningful in practice? Any suggestions for reading up on the topic? If this is a widely known issue, then there is no need to go through the lengthy case example below. Otherwise, read on.

I provide a set of minimal working examples below. Although I am interested in estimating all parameters of a hierarchical model, for this minimal set of examples it can be assumed that we only have data from one subject i, and that we estimate only the random-effect value b_i for this subject, together with the random-effects standard deviation \omega. We assume that the measurement error \sigma and the population mean \mu are known and specified as data.

Centered model:

data {
  int N;
  vector[N] y;
  real mu;
  real sigma;
}
parameters {
  real b;
  real logomega;
}
transformed parameters {
  vector[N] ipred;
  real omega = exp(logomega);
  for (i in 1:N) {
    ipred[i] = b;
  }
}
model {
  y ~ normal(ipred, sigma);
  b ~ normal(mu, omega);
  logomega ~ uniform(-10, 10);
}

Non-centered model:

data {
  int N;
  vector[N] y;
  real mu;
  real sigma;
}
parameters {
  real braw;
  real logomega;
}
transformed parameters {
  vector[N] ipred;
  real omega = exp(logomega);
  real b = mu + braw * omega;
  for (i in 1:N) {
    ipred[i] = b;
  }
}
model {
  y ~ normal(ipred, sigma);
  braw ~ normal(0, 1);
  // correction for use of braw; commented out in the "traditional"
  // non-centered parameterization
  target += log(1 / omega);
  logomega ~ uniform(-10, 10);
}

The issue between specifying b \sim N(\mu,\omega^2) and b_{raw} \sim N(0,1) is that in the latter case the log density does not get penalized by log(\omega^{-1}). We can manually include this term in the code, as specified above. I refer to this as the "corrected non-centered" model, and to the non-centered model without the correction as the "traditional non-centered" model.
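As a quick numeric sanity check of this claim, here is a small R sketch (the parameter values are arbitrary): the centered log density of b and the standard-normal log density of b_{raw} differ by exactly -log(\omega).

# Sketch: the centered log density of b equals the standard-normal log
# density of braw = (b - mu) / omega plus the log(1/omega) correction.
mu <- 0; omega <- 1.5; b <- 1.2           # arbitrary values
braw <- (b - mu) / omega
dnorm(b, mu, omega, log = TRUE)           # centered contribution
dnorm(braw, log = TRUE) + log(1 / omega)  # non-centered + correction: identical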

We will now simulate some toy data and see where the differently parameterized models converge. We use the optimizing function for the sake of reducing variability: we just want to see whether the mode of the parameter estimates is affected by the parameterization, and the mode can be found deterministically. The hypothesis is that the centered and "corrected non-centered" models will end up with the same parameter estimates and loglikelihood values, while the "traditional non-centered" model will give differing parameter estimates.

library(tidyverse)
library(rstan)
## generate toy data
set.seed(1)
dat1 <- list(N = 10, y = rnorm(10, 1), mu = 0, sigma = .1)
## find the posterior mode of each model
m1 <- optimizing(stan_model("stan_centered.stan"), data = dat1)
m2 <- optimizing(stan_model("stan_noncentered_trad.stan"), data = dat1)
m3 <- optimizing(stan_model("stan_noncentered_corrected.stan"), data = dat1)
## collect estimates and log density values at the mode
map(list(m1, m2, m3), ~ signif(c(.$par[c("b", "omega")], loglik = .$value), 6)) %>%
  do.call(rbind, .) %>%
  cbind(Model = c("centered", "trad noncent.", "corrected noncent."), .)

Model                b        omega    loglik
centered             1.13132  1.13129  -274.815
trad noncent.        1.13218  85.3197  -274.192
corrected noncent.   1.13132  1.13134  -274.815

We can see that the centered and corrected non-centered models indeed give identical loglikelihood values and almost identical parameter estimates. The same cannot be said of the traditional non-centered model, which seems to over-estimate \omega: the model has no "incentive" to keep \omega small. However, the traditional non-centered model did not push \omega all the way to its maximum allowed value of e^{10}\approx 22026, likely because increasing it beyond a certain threshold no longer produced meaningful improvements in the objective.

Summary: We established that the loglikelihood specification differs between the centered parameterization and the traditionally used non-centered parameterization. It is possible to apply a correction to the non-centered parameterization so that the loglikelihood specification is identical to that of centered parameterization.

Speculation: I have heard that the non-centered parameterization of hierarchical models works well when there is not much data. I wonder whether the good performance of the (traditional, non-penalized) non-centered parameterization on small datasets is due more to the model being able to do "whatever it wants" without being properly penalized for the variance of the random effects.

Repeating the initial question: Is this difference in loglikelihood specifications between the parameterizations a known issue? Is it meaningful in practice? Any suggestions for reading up on the topic?


Hi

Firstly, there is a difference between the likelihood and the posterior. The likelihood is the probability of observing the data given some value of the parameters (or, in a frequentist sense, the non-penalized loss function). The posterior, on the other hand, is the probability of the parameters given the data (in a frequentist sense, the penalized loss function). The difference you are referring to is in the posterior, not the likelihood.

If we look at the model you have:

P(y|\mu_y) = \mathcal{N}(y|\mu_y,\sigma_{y}^2)
P(\mu_y|a) = \mathcal{N}(\mu_y|\mu,e^a), \quad a=\log(\omega)
P(a) = \frac{1}{20}\, I[-10 \le a \le 10]

Or in terms of the log posterior:

ln(P(\mu_y,a|y)) = -\frac{1}{2\sigma^2}(y-\mu_y)^T(y-\mu_y) - \frac{1}{2e^a}(\mu_y-\mu)^T(\mu_y-\mu) - \frac{1}{2}\ln(e^a) + constant

In this case we see that we do have some regularization for the parameter a, but it comes from the normalizing constant of the normal prior on \mu_y, not from the explicit incorporation of the prior on a.

If we now consider the non-centered transformation, letting \mu_{y,raw}=(\mu_y-\mu)/\sqrt{e^a}, we use the determinant of the Jacobian when making a parameter transformation in order to sample the transformed parameter. In this specific case we have (as also detailed in the Stan manual):

P(\mu_{y,raw}) = \left|\frac{d\mu_y}{d\mu_{y,raw}}\right| P(\mu_y|a) = \sqrt{e^a}\, P(\mu_y|a) = \mathcal{N}(\mu_{y,raw}|0,1)

Hence, by inspection, when we move from the centered to the non-centered parameterization we do expect the posterior density to be different (specifically, in this case it will differ by -\frac{1}{2}\ln(e^a)).
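As a sketch, the change of variables above can be checked numerically in R (the values are arbitrary; note that e^a is the variance in this notation, so the standard deviation is \sqrt{e^a}):

# Sketch: verify P(mu_raw) = sqrt(exp(a)) * P(mu_y | a) at one point.
a <- 0.7; mu <- 0; mu_y <- 1.3                # arbitrary values
mu_raw <- (mu_y - mu) / sqrt(exp(a))
dnorm(mu_raw)                                 # N(mu_raw | 0, 1)
sqrt(exp(a)) * dnorm(mu_y, mu, sqrt(exp(a)))  # Jacobian times N(mu_y | mu, exp(a)); identical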

This is hence not a "problem" per se, but rather just an alternative model to the one you specified. You are most likely experiencing these differences due to the prior on logomega.

This prior only constrains the parameter to be in the given range; it does not penalize the model towards any particular value, as you noted (for the non-centered model). The typical prior used for variance parameters is the inverse-gamma prior (or, in some cases, a log-normal or exponential prior).

To get an intuition for why you are getting different results (in the non-centered case): looking at the frequentist objective, we want to minimize the SSE subject to minimizing braw, while constraining logomega to be between -10 and 10. Hence it is not surprising that logomega takes on large values (as you noted here). By default optimizing uses the L-BFGS algorithm, which terminates based on the relative gradient, objective function, or parameter difference; one of these criteria was probably met, hence no further increase in logomega.

Some adjustments for the code

If you have a bounded parameter you can specify this as real<lower=-10,upper=10> logomega. If you specify the constraints, Stan will transform the parameter to an unconstrained space and hence solve an unconstrained problem, incorporating the Jacobian when sampling (or when specifying jacobian=TRUE during optimization; this may differ depending on which Stan interface you use).
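For a two-sided bound, the transform Stan applies is a scaled and shifted inverse logit (as described in the Stan reference manual). A small R sketch of this, with an arbitrary unconstrained value u:

# Sketch of Stan's transform for a <lower=-10, upper=10> parameter.
inv_logit <- function(u) 1 / (1 + exp(-u))
L <- -10; U <- 10
u <- 0.5                                # unconstrained value Stan actually works with
logomega <- L + (U - L) * inv_logit(u)  # constrained value seen in the model block
# log |d logomega / du|, which Stan adds to the target when sampling:
log_jac <- log(U - L) + log(inv_logit(u)) + log1p(-inv_logit(u))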

The ipred is not needed. Using y ~ normal(b,sigma); will work fine.

When sampling from a standard normal, braw ~ std_normal(); is more efficient.


Dear Garren,

Thank you for all the comments.

The most salient issue seems to be that I should have used the Jacobian adjustment in the non-centered model, and this adjustment would make the log posterior equivalent to that of the centered model. If I understand correctly, you arrived at the conclusion that the Jacobian adjustment should be -\frac{1}{2}\ln(e^a)=-a/2, under the assumption that a is the log-variance. My original notation said the "correction" should be log(\omega^{-1})=-log(\omega), which is equal to the Jacobian correction Garren proposed. The apparent "divide by 2" discrepancy between the two corrections is resolved by observing that Garren defined \omega as the variance of \mu_y, whereas my original notation defined \omega as the standard deviation of b_i.
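Spelled out, with e^a = \omega^2 (Garren's variance convention, so that my \omega is the standard deviation):

-\frac{1}{2}\ln(e^a) = -\frac{1}{2}\ln(\omega^2) = -\ln(\omega) = log(\omega^{-1})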

We could also come to the same conclusion by comparing the log posteriors of the centered and non-centered models. The original model with centering, using Garren's notation where \mu_y=b_i and a=\log(\omega), has a log posterior of

ln(P(\mu_y,a|y)) = -\frac{1}{2\sigma^2}(y-\mu_y)^T(y-\mu_y) - \frac{1}{2e^a}(\mu_y-\mu)^T(\mu_y-\mu) - \frac{1}{2}\ln(e^a) + constant

whereas the non-centered model log posterior without the Jacobian correction is

ln(P(\mu_y,a|y)) = -\frac{1}{2\sigma^2}(y-\mu_y)^T(y-\mu_y) - \frac{1}{2e^a}(\mu_y-\mu)^T(\mu_y-\mu) + constant

In other words, the non-centered log posterior without the Jacobian correction was missing the -\frac{1}{2}\ln(e^a) part. This is what I tried to communicate in my original post; the Jacobian adjustment agrees perfectly with what I called the "correction" there.

I had apparently not understood the Stan manual regarding the need for the Jacobian adjustment. Just to make things crystal clear for myself: in this case of the non-centered model, we need the Jacobian adjustment because we are assigning a standard normal distribution to a transformation of the sampled variable (\mu_y in Garren's notation, b_i in mine)? And we would not need the Jacobian adjustment if we were merely using transformed parameters as parameters of a distribution? I ask because the Stan manual states: "Because the transformed parameters are being used, rather than given a distribution, there is no need to apply a Jacobian adjustment for the transform."

Housekeeping comments

Garren: Thank you for all the pointers. I agree I should have talked about the log posterior and not the loglikelihood. I agree that ipred was unneeded, i.e. the minimal working example I provided was not really minimal. I will keep in mind that std_normal() is more efficient than normal(0,1). Thanks for the comment on how to specify the prior for variance parameters; in this case I chose the uniform prior because I wanted to demonstrate that the optimization algorithm ends up giving a very high value for logomega, and a uniform prior seemed the most convenient way to accomplish this. I disagree that I am experiencing the between-parameterization differences due to the prior on logomega; I believe the differences were due to my not including the Jacobian adjustment in the non-centered model, thus indeed having two different models which, in retrospect, were not even expected to give the same results.

General note: I have the impression that many examples of non-centered parameterizations circulate on the internet in which the Jacobian correction has not been used. This is just an impression, but if it is correct, newcomers may keep getting confused about non-centered versus centered parameterizations, because they see versions of the non-centered parameterization without the Jacobian correction in the wild.

Sorry for the change of notation; it just made more sense in the context in which you are applying the problem. To me it seemed like you have a bunch of noisy data y whose mean you want to estimate, hence I used \mu_y instead of b.

What we are doing when using the non-centered model is changing the parameters of our model, which makes it easier to sample from.

The posterior of the original model is as you have written it.

When we now make the parameter transformation, our posterior is P(\mu_{y,raw},a|y), not P(\mu_{y},a|y). Hence we have:

ln(P(\mu_{y,raw},a|y))=-\frac{1}{2\sigma^2} (y-[\mu_{y,raw}\times \sqrt{e^{a}}+\mu])^T(y-[\mu_{y,raw}\times \sqrt{e^{a}}+\mu]) - \frac{1}{2}\mu_{y,raw}^T\mu_{y,raw}+constant

We can substitute back \mu_{y,raw}=(\mu_{y}-\mu)/\sqrt{e^{a}}, but the posterior would still be defined in terms of \mu_{y,raw} and not \mu_y. That is:

ln(P(\mu_{y,raw},a|y))=-\frac{1}{2\sigma^2} (y-\mu_{y})^T(y-\mu_{y}) - \frac{1}{2e^{a}}(\mu_y-\mu)^T(\mu_y-\mu)+constant

Hence we obtain an alternative Bayesian model.

If you want to keep the original posterior of your model, then you would need to include the Jacobian, yes; that is, you would be optimizing P(\mu_{y},a|y). If you include the Jacobian in the non-centered case, it is the same as just having the centered model. Incorporating the Jacobian is only useful when you make a transformation of some parameters and sample the transformed parameters.

Correct, yes. In this case we optimize the transformed posterior P(\mu_{y,raw},a|y).

Using a uniform prior does bring this point across, but again, it does not favor any specific value of the variance. In the original model, the regularization is just a consequence of choosing a normal prior for b. Incorporating the prior belief that omega should be small should give you fairly consistent results.


Many thanks again for the helpful responses, and thanks for patiently teaching me rigor in notation/terminology.

That is fine; I am the one asking for answers, hat in hand, whereas you are using your own time to help out.

It would be the same as the centered model in terms of the log posterior, but not in terms of sampling properties, right? I mean, with the non-centered parameterization, shrinking \omega directly shrinks the random effects b_i correspondingly, hopefully making it a little easier for the sampler to explore small values of \omega. With a centered parameterization, the sampler might run into problems by trying a small value of \omega without decreasing the b_i values correspondingly.

To explore this further, I copied the centered version of Neal's funnel from an earlier version of the Stan User's Guide:

parameters {
  real y;
  vector[9] x;
}
model {
  y ~ normal(0, 3);
  x ~ normal(0, exp(y/2));
}

Then I modified the distribution of x to be non-centered and applied a correction of -9\cdot ln(e^{y/2}) to the log posterior, so that the log posterior of the non-centered model should be the same as the log posterior of the centered model.

Centered:
ln(P(y,x)) = -\frac{y^2}{2\cdot 3^2} - \frac{x^T x}{2e^y} - 9\cdot ln(e^{y/2}) + constant

Non-centered:
\begin{aligned} ln(P(y,x_{raw})) &= -\frac{y^2}{2\cdot 3^2} - \frac{x_{raw}^T x_{raw}}{2} - 9\cdot ln(e^{y/2}) + constant \\ &= -\frac{y^2}{2\cdot 3^2} - \frac{x^T x}{2e^y} - 9\cdot ln(e^{y/2}) + constant \end{aligned}

parameters {
  real y;
  vector[9] x_raw;
}
transformed parameters {
  vector[9] x = exp(y/2) * x_raw;
}
model {
  y ~ normal(0, 3);
  x_raw ~ std_normal();            // implies x ~ normal(0, exp(y/2))
  target += 9 * log(1 / exp(y/2)); // correction term; equals -9 * y / 2
}
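As a sketch, the claimed equality of the two log posteriors can be checked numerically in R at a matching point (x_{raw} = x / e^{y/2}; the point itself is arbitrary):

# Sketch: centered funnel log density at (y, x) versus corrected
# non-centered log density at (y, x_raw); the difference should be ~0.
set.seed(1)
y <- 0.8; x <- rnorm(9)                    # arbitrary point
x_raw <- x / exp(y / 2)
lp_centered <- dnorm(y, 0, 3, log = TRUE) +
  sum(dnorm(x, 0, exp(y / 2), log = TRUE))
lp_noncent <- dnorm(y, 0, 3, log = TRUE) +
  sum(dnorm(x_raw, log = TRUE)) +
  9 * log(1 / exp(y / 2))                  # the correction term
lp_centered - lp_noncent                   # ~0 up to floating-point error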

When I try to sample from the centered version of Neal's funnel, I get lots of divergent transitions, as expected. When I sample from the non-centered, log-posterior-corrected version, I do not get these divergent transitions. Thus, even though the log posterior is identical, the sampling properties of the non-centered version are better.
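For reference, here is a sketch of how one might run this comparison in rstan (the file names funnel_centered.stan and funnel_noncentered_corrected.stan are placeholders for the two models above):

library(rstan)
fit_c  <- sampling(stan_model("funnel_centered.stan"), seed = 1)
fit_nc <- sampling(stan_model("funnel_noncentered_corrected.stan"), seed = 1)
# count post-warmup divergent transitions across all chains
n_div <- function(fit) {
  sum(sapply(get_sampler_params(fit, inc_warmup = FALSE),
             function(x) sum(x[, "divergent__"])))
}
n_div(fit_c)   # many divergences expected
n_div(fit_nc)  # few or none expected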

I agree that incorporating the prior belief that omega should be small would help give fairly consistent results. But I abhor a parameterization in which only the priors keep the random-effects variance within a reasonable range, while the data are allowed to "pull" the variance parameter towards infinity; I think the model should be parameterized such that the data also pull omega towards the actual variance of the random effects. Moreover, I think it is problematic to assume that the priors will take care of the issue: if we routinely use non-centered, uncorrected hierarchical models, there is a chance that results from these models (with over-estimated variances) will eventually be taken as priors for subsequent analyses, and random-effects variances will incrementally creep towards infinity (as data accumulate and analyses get replicated), instead of converging towards the correct value over time.

So, with this line of thinking, I would consider the non-centered parameterization useful for helping Stan sample the parameters, but the log-posterior adjustment should always be combined with it.

I probably should have been more direct about this. If we include the Jacobian in the non-centered case, that is, in P(\mu_{y,raw}, a|y), we recover the original parameter space P(\mu_{y},a|y); hence, when including the determinant, we sample \mu_y and not \mu_{y,raw}.

Parameter transformations in statistics are not the same as in optimization. This is because the parameters are treated as random variables and hence must have a probability distribution that integrates to unity. If we have some distribution over x, P(x), and want to transform to a different variable y=f(x), we obtain the distribution of the new variable through P(y)=\left|\frac{dx}{dy}\right|P(x). Taking the log density we get ln(P(y)) = ln(|J|) + ln(P(x)). Hence, when making parameter transformations, the new log posterior density differs by the Jacobian (the densities should not be equal if you are sampling different parameters). For more details on transformations of random variables see https://stats.libretexts.org/Bookshelves/Probability_Theory/Probability_Mathematical_Statistics_and_Stochastic_Processes_(Siegrist)/03%3A_Distributions/3.07%3A_Transformations_of_Random_Variables.
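As a minimal numeric illustration of this rule (my own example, not specific to the models above): take x \sim \mathcal{N}(0,1) and y = e^x, so that |dx/dy| = 1/y and P(y) = (1/y)\,P(\log y), which is exactly the lognormal density.

# Sketch: change of variables for x ~ N(0, 1), y = exp(x).
y <- 2.5                 # arbitrary point
(1 / y) * dnorm(log(y))  # |dx/dy| * P(x)
dlnorm(y)                # built-in lognormal density; identical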

This would be correct

This, however, is not the non-centered case; it is still the original posterior ln(P(y,x)).

We already have P(x|y)=\mathcal{N}(x|0, exp(y/2)^2), hence we make the transformation x_{raw}=\frac{x}{exp(y/2)}. We thus have \frac{dx}{dx_{raw}}=exp(y/2), and hence P(x_{raw})=\mathcal{N}(x_{raw}|0,I). The joint density (for the non-centered case) is thus:

P(x_{raw}, y)= P(x_{raw})P(y)
which is an alternative to
P(x,y)=P(x|y)P(y)

In Stan, the parameters with respect to which the density is defined are the ones that should be declared in the parameters block. Consider the case of P(x,y)=P(x|y)P(y), but with P(x_{raw}) in place of P(x). Stan code for this would be:

parameters {
  real y;
  vector[9] x;
}

transformed parameters {
  vector[9] x_raw = x ./ exp(y/2);
}

model {
  y ~ normal(0,3);
  x_raw ~ std_normal();
  target += -9*log(exp(y/2)); // add jacobian since we have P(x_raw) not P(x)
}

The above would be the same as the original density (for which you should still encounter divergences).

If we now consider the non-centered case P(x_{raw},y)=P(x_{raw})P(y), the Stan code is the same as that in the Stan docs:

parameters {
  real y;
  vector[9] x_raw;
}

model {
  y ~ normal(0,3);
  x_raw ~ std_normal();
}

generated quantities {
  vector[9] x = x_raw * exp(y/2); // placed in generated quantities since it is not used in the model
}

Hopefully this clears up the confusion regarding parameter transformations in a Bayesian sense.

OK, thanks, I marked your post as the "solution". So I guess the conclusion is that this reparameterization is essentially the same model as the original, centered model, and it would be a misnomer to call it a non-centered model. The log posteriors of centered and non-centered models should differ by definition; reparameterizing a centered model without changing its log posterior is still akin to having a centered model.

However, in the current case, the reparameterized version of the centered Neal's funnel (which I originally called the "non-centered model with correction") did lead to fewer divergent transitions than the original version of the centered model. The take-home message for me is that this reparameterization can still be useful, even if its log posterior corresponds to the original, centered model.