Correlated posterior - Sum to zero constraint for varying intercepts?!

martinmodrak · December 8, 2021, 4:15pm

Hi,
so here’s an issue I’ve ran into in a couple analyses where simple uses of varying (random) intercepts resulted in problematic posterior distributions. In particular, the posterior for the overall intercept is negatively correlated with the posterior for all varying intercepts. The problem seems quite fundamental and appears already for simple models, but I have not yet seen it discussed elsewhere. Maybe I am missing something basic?

It appears one can get rid of the issue by enforcing a sum-to-zero constraint on the varying intercepts. Is there other/better solution? I don’t think I currently understand the problem well, so I’d be happy for any feedback (or links to other people describing/solving the problem). If you want a single file with all the models and code discussed below, you can get it at: https://github.com/martinmodrak/blog/blob/master/content/post/2021-intercept-correlations.Rmd

Similar problem is also described here: Low bulk ess simple random intercept model - #2 by martinmodrak

Tagging @paul.buerkner and @jonah as this appears to affect quite a lot of brms/rstanarm models, so maybe you’ve seen it before and can link me to resources.

Thanks for any hints.

The setup

The problem can be demonstrated with a very simple Poisson regression with one varying intercept and just two observations per group. Let’s simulate some data:

set.seed(5422335)
N <- 40
N_groups <- 20

groups <- rep(1:N_groups, length.out = N)

intercept <- 3 
group_sd <- 1
group_r <- rnorm(N_groups, sd = group_sd)

mu <- intercept + group_r[groups]
y <- rpois(N, exp(mu))

data_stan <- list(N = N,
                  N_groups = N_groups,
                  groups = groups,
                  y = y)

Note that with intercept 3 (on the log scale) even a single Poisson observation is already quite informative. The problem in my experience only appears when the data provide at least moderate amount of information about the parameters and has more pronounced consequences for convergence/sampling effectiveness when there is more information (e.g. when increasing number of observations per group in the code above).

I’ll also note that the problem manifests as well when using fewer groups, but I find it more puzzling with more groups.

Standard non-centered parametrization

Since non-centered is often the default, let’s start here. My Stan code for this case is:

data {
  int<lower=0> N;
  int<lower=1> N_groups;
  int<lower=1,upper=N_groups> groups[N];
  int<lower=0> y[N];
}

parameters {
  real intercept;
  vector[N_groups] group_z; 
  real<lower=0> group_sd;
}

transformed parameters {
  vector[N_groups] group_r = group_z * group_sd;
}

model {
  intercept ~ normal(3, 1);
  group_z ~ std_normal();
  group_sd ~ normal(0, 1);
  y ~ poisson_log(intercept + group_r[groups]);
}

The model fits without issues, with bulk ESSs between 500 - 1000 but there is definitely weird structure in the posterior - here’s a pairs plot:

Note the negative correlation between intercept and all the group_z parameters (and the positive correlations between group_z themselves. This can in my experience cause actual convergence problems if the varying intercepts are embedded in a more complex model than the simple example presented here.

Centered parametrization

So we remember from Mike Betancourt’s case study (Hierarchical Modeling) that when the data inform the varying intercepts very well (which might be the case here), it might be better to use a centered parametrization. So let’s try that:

data {
  int<lower=0> N;
  int<lower=1> N_groups;
  int<lower=1,upper=N_groups> groups[N];
  int<lower=0> y[N];
}

parameters {
  real intercept;
  vector[N_groups] group_r; 
  real<lower=0> group_sd;
}

model {
  intercept ~ normal(3, 1);
  group_r ~ normal(0, group_sd);
  group_sd ~ normal(0, 1);
  y ~ poisson_log(intercept + group_r[groups]);
}

Once again the model fits without big issues (although there are a couple rhats just above 1.01), however the bulk ESSs decrease to 300 - 800 for most parameters and the correlations don’t go away:

Sum to zero, non-centered

This is where stuff gets interesting. Maybe the real problem is that all the varying intercepts increasing a bit and the overall intercept decreasing the same bit produces the same likelihood and the only thing that’s preventing this from becoming a full-blown non-identifiability is the prior. I realized this might help after seeing that R-INLA/inlabru offer a sum to zero constraint as an option for all types of structures, including varying intercepts.

So we may change the model to force the vector of varying intercepts to sum exactly to zero, effectively removing one degree of freedom from the model. Here, I am using the QR parametrization from Test: Soft vs Hard sum-to-zero constrain + choosing the right prior for soft constrain - #31 by andre.pfeuffer (via @aaronjg and @andre.pfeuffer), but even a soft constraint (as discussed in the linked thread) seems to help the real models I’ve worked with noticeably.

I also get back to non-centered parametrization. The Stan code is:

functions {
  vector Q_sum_to_zero_QR(int N) {
    vector [2*N] Q_r;

    for(i in 1:N) {
      Q_r[i] = -sqrt((N-i)/(N-i+1.0));
      Q_r[i+N] = inv_sqrt((N-i) * (N-i+1));
    }
    Q_r = Q_r * inv_sqrt(1 - inv(N));
    return Q_r;
  }

  vector sum_to_zero_QR(vector x_raw, vector Q_r) {
    int N = num_elements(x_raw) + 1;
    vector [N] x;
    real x_aux = 0;

    for(i in 1:N-1){
      x[i] = x_aux + x_raw[i] * Q_r[i];
      x_aux = x_aux + x_raw[i] * Q_r[i+N];
    }
    x[N] = x_aux;
    return x;
  }
}

data {
  int<lower=0> N;
  int<lower=1> N_groups;
  int<lower=1,upper=N_groups> groups[N];
  int<lower=0> y[N];
}

transformed data {
  vector[2 * N_groups] groups_Q_r = Q_sum_to_zero_QR(N_groups);
}

parameters {
  real intercept;
  vector[N_groups - 1] group_r_raw; 
  real<lower=0> group_sd;
}

transformed parameters {
  vector[N_groups] group_r = sum_to_zero_QR(group_r_raw, groups_Q_r) * group_sd;
}

model {
  intercept ~ normal(3,1);
  group_r_raw ~ normal(0, 1);
  group_sd ~ normal(0, 1);
  y ~ poisson_log(intercept + group_r[groups]);
}

No fitting issues, bulk ESSs increase to 600 - 7000 and the pairs plot looks nice:

Now this changes the interpretation of the model. If I get it right, the fitted intercept is no longer the mean of the assumed population of groups, but rather the mean of the actual observed set of groups. This means that making predictions for new, unobserved groups will be a bit more tricky. And maybe there are other potential problems I am missing?

Also the real question is: this looks like it should be a quite common problem. So why isn’t this a well known thing? The simplest explanation seems to be that it is me who is missing something basic, so I’ll be happy to learn what is it?

It also feels like the problem should be less pronounced as I add more groups (as increasing all the group intercepts should incur larger penalty), but I can easily have 200 groups in this example and still see basically the same issue.

Sum to zero, centered

For this problem, the centered version of the sum-to-zero model also works well, with even better bulk ESS (1500 - 8000). Not showing the details for brevity (the code link has this model as well), but for real models where I encountered this, I needed to use both centered parametrization and sum-to-zero to get rid of convergence issues, but I assume those are just two orthogonal problems and maybe the data I’ve worked with just demonstrate both issues.

jsocolar · December 8, 2021, 8:25pm

Hi Martin and all,
I’ve seen this issue discussed before, though I cannot say where/when/under what name. Note that there’s a sufficient statistic for the mean of a bunch of normally distributed observations, so we have a clear expectation for the sampling distribution of the mean of the random effect (edit: conditional on the hyperparameters), and this gives a sense of how much variability the model should be willing to tolerate in terms of attributing it ambiguously to the intercept versus the sample mean of the random effect parameters. It also means that we can approximate (or maybe exactly recover??) the “regular” model in terms of an intercept, a random effect with a hard sum-to-zero constraint, and an additional term \mu_R that represents the sampling distribution of the random effect sample mean.

As you point out, when the number of groups goes up, the posterior for \mu_R gets narrower, which should ameliorate the problem. But I think that a countervailing issue is that the posterior distribution for the intercept also narrows. Since only the sum of the intercept plus \mu_R is identified, progressively tighter distributions for \mu_R can still induce relatively large correlations with the intercept, because the posterior for the intercept is shrinking in tandem with the posterior for \mu_R. I’m not sure exactly how these effects interplay as the number of groups gets progressively larger…

Note that the sum-to-zero constraint implies a weird generative model (suppose you had a hard sum-to-zero constraint; then how would you predict the value for an unsampled group?), and I think this is why the sum-to-zero constraint isn’t standard in practice.

Edited (by me, @jsocolar) to remove potentially unwelcoming language of “it’s obvious that”.

martinmodrak · December 10, 2021, 8:30am

I see how sum to zero is not a great solution in some sense - but what do people do to avoid the issues this seems to bring in more complex models? For me it seems that bumping up adapt_delta mostly works, but it feels quite unsatisfactory (and the performance suffers). Or is it actually fine for most people (and hence my problems might lie in other parts of the models)?

Regarding the sufficient statistic line of thinking: I’ll try to rephrase it to see if I understand you correctly.

Let’s say we have K groups, then the mean for group i is
\mu_i = \alpha + \beta_i where

\alpha \sim N(\mu_\alpha, \sigma_\alpha) \\ \beta_i \sim N(0, \sigma_\beta)

We can reparametrize in terms of a zero sum vector \bar\beta such as

\mu_R = \frac{\sum_i \beta_i}{K}\\ \bar\beta_i = \beta_i - \mu_R \\ \bar\alpha = \alpha +\mu_R \\ \mu_i = \bar\alpha + \bar\beta_i

Now one should be able to derive (too lazy to do this explicitly) a matrix \Sigma that is a function of \mu_\alpha, \sigma_\alpha, \sigma_\beta such that:

\begin{pmatrix} \bar\alpha \\ \bar\beta_1 \\ \vdots \\ \bar\beta_{k - 1} \\ \mu_R \\ \end{pmatrix} \sim MVN \left( \begin{pmatrix} \mu_\alpha \\ 0 \\ \vdots \\ 0 \\ 0 \end{pmatrix} , \Sigma\right)

Now \Sigma transforms any prior on the original model into a prior on the zero-centered model. Additionally, we note that \mu_R doesn’t enter the model beyond this prior, so we can actually leave it out of the model and use \Sigma to derive the distribution \pi(\mu_R|\bar\alpha, \bar\beta). This means we can sample \mu_R in gen.quants and use it to recover the original \alpha, \beta. Since \Sigma is going to have a lot of structure this all should be amenable to simple analytic formulas. Is that what you had in mind?

Once again the question remains: if this works why isn’t it used widely (removing a parameter feels like an unquestionable win). So I presume it is not so simple and either I made an error or there are drawbacks I missed. Thanks for any hints.

jsocolar · December 10, 2021, 3:07pm

Yeah this is what I had in mind. The challenge is that it interferes a bit with our prior specification, but I think this can be dealt with using an appropriately constructed \Sigma which I am also too lazy to derive. To recover exactly the same model we need:

A weaker prior on the intercept, since it’s now a prior on the intercept plus \mu_R
An appropriate prior on \bar{\beta}. We have to be quite careful with \Sigma here to ensure that we are getting the appropriate prior that interacts with the zero-sum constraint to yield the desired implied prior N(0, \sigma_\beta) on \beta. And this then complicates the specification of the prior on \Sigma. But as you say, it’s not too bad. I think there are two available versions of the appropriate prior. One generative: we increment the target by a multivariate normal lpmf over k-1 elements of \bar{\beta}, with a negative covariance term on the off-diagonals. The other is non-generative: we increment the target by a univariate normal over all k elements (including the kth element set to the negative sum of the others), but to inflate the standard deviation to a value larger than the desired standard deviation. In either case it should be possible to solve for the necessary term.

But where I really get lost is if there are multiple correlated random effects in the model. Perhaps the math all works out neatly, but I cannot say.

Edited to remove a dumb error.

avehtari · December 11, 2021, 12:19pm

One paper discussing the issue clearly (in my opinion) is Bayesian functional {ANOVA} modeling using Gaussian process prior distributions

martinmodrak · December 11, 2021, 12:29pm

I’ll also note that Ogle, K., & Barber, J. J. (2020). Ensuring identifiability in hierarchical mixed effects Bayesian models. Ecological Applications. has an IMHO good discussion. I’ll read through this and Aki’s suggestion and references and hopefully come back with a summary :-)

andre.pfeuffer · December 13, 2021, 2:55am

QR Decomposition, priors on the “reverse” R^{-1} * \theta

Following Stan User Manual QR reparameterization. I constructed a sum-to-zero design matrix A with intercept at the last column.

> cbind(contr.sum(4),1)
  [,1] [,2] [,3] [,4]
1    1    0    0    1
2    0    1    0    1
3    0    0    1    1
4   -1   -1   -1    1

Let Q the Q part and R the R part of the QR decomposition and it’s scaled variants Q^* and R^*. We can generate the inverse of R^*, R^{*-1} . We have

\beta = R^{*-1} * \theta.

Since we can estimate R^{*-1} in the data transformation block, we can store R^{*-1} and
above is a linear transformation. Vecause of no need for a Jacobian adjustment, we can
put a prior on:

R^{-1} * \theta \sim normal(0, \sigma_{group})

Note that the data y is fitted via

y \sim normal(Q^{*} * \theta + intercept, \sigma_{noise})

First variant uses a global intercept term.

data {
  int<lower=0> N;
  int<lower=1> N_groups;
  int<lower=1,upper=N_groups> groups[N];
  vector[N] y;
}
transformed data {
  matrix [N_groups,N_groups] A;
  matrix [N_groups,N_groups] Q;
  matrix [N_groups,N_groups] R;
  matrix [N_groups,N_groups] R_ast_inverse;  
  A[2:N_groups, 1:N_groups-1] = diag_matrix(rep_vector(1, N_groups - 1));
  A[1, 1:N_groups] = rep_row_vector(-1, N_groups);
  A[, N_groups] = rep_vector(1, N_groups);
  Q = qr_thin_Q(A) * sqrt(N_groups - 1);
  R = qr_thin_R(A) / sqrt(N_groups - 1);
  R_ast_inverse =  inverse(R);  
}
parameters {
  vector[N_groups] group_z;
  real<lower=0> sigma;
  real<lower=0> group_sd;
}
transformed parameters {
  vector[N_groups] theta = Q * group_z;
  vector[N_groups] R_group_z = R_ast_inverse * group_z;
}
model { 
  R_group_z[1:N_groups-1] ~ normal(0, group_sd);
  R_group_z[N_groups] ~ normal(3, 1);
  group_sd ~ lognormal(0, 0.3);
  sigma ~ cauchy(0, 1);
  y ~ normal(theta[groups], sigma);
}
generated quantities {
  real intercept = R_group_z[N_groups];
  vector[N_groups] group_r_hat = A[, 1:N_groups-1] * R_group_z[1:N_groups-1];
}

Pairs plot:

The second variant comes without a global intercept term, but with a group level mean parameter.

R^{-1} * \theta \sim normal(\mu_{group}, \sigma_{group})

Note that \mu_{group} and a global intercept is linear dependent. Thus the global intercept
had to be removed from the fit:

y \sim normal(Q^{*} * \theta, \sigma_{noise})

data {
  int<lower=0> N;
  int<lower=1> N_groups;
  int<lower=1,upper=N_groups> groups[N];
  vector[N] y;
}
transformed data {
  matrix [N_groups,N_groups] A;
  matrix [N_groups,N_groups] Q;
  matrix [N_groups,N_groups] R;
  matrix [N_groups,N_groups] R_ast_inverse;  
  A[2:N_groups, 1:N_groups-1] = diag_matrix(rep_vector(1, N_groups - 1));
  A[1, 1:N_groups] = rep_row_vector(-1, N_groups);
  A[, N_groups] = rep_vector(1, N_groups);
  Q = qr_thin_Q(A) * sqrt(N_groups - 1);
  R = qr_thin_R(A) / sqrt(N_groups - 1);
  R_ast_inverse =  inverse(R); 
}
parameters {
  vector[N_groups] group_z;
  real<lower=0> sigma;
  real<lower=0> group_sd;
  real group_mean;
  //real intercept;
}
transformed parameters {
  vector[N_groups] theta = Q * group_z;
  vector[N_groups] R_group_z = R_ast_inverse * group_z;
}
model { 
  R_group_z[1:N_groups] ~ normal(group_mean, group_sd);
  group_sd ~ lognormal(0, 0.3);
  sigma ~ cauchy(0, 1);
  //intercept ~ normal(3, 1);
  //y ~ normal(theta[groups] + intercept, sigma);
  y ~ normal(theta[groups], sigma);
}
generated quantities {
  vector[N_groups] group_r_hat = A[, 1:N_groups] * R_group_z[1:N_groups];
}

Pairs plot:

sumtozero_gaussian_QR_grouplevel_mean.R (1.0 KB)
sumtozero_gaussian_QR_grouplevel_mean.stan (1.1 KB)
sumtozero_gaussian_QR.R (918 Bytes)
sumtozero_gaussian_QR.stan (1.1 KB)

paul.buerkner · December 13, 2021, 2:16pm

I am a little late to the party and @jsocolar has already said some things that I also had in mind, but since I promised to respond I will try to give my perspective anyway.

First of all, I am not sure I consider the posterior correlations between overall and varying coefficients a problem per se. They may be annoying to deal with in terms of somewhat lower sampling efficiency and perhaps some divergent transitions but this can be worked around most of the time in my personnel experience.

In reality, the different groups usually don’t know of each other (they are exchangable) so there is, to my current understanding, nothing in nature that implies the groups to interact in a way that they are actually summing (exactly) to zero. Accordingly, the “standard” multilevel model without this constraint is a much more sensible model of reality I think. The problem I see with the (hard) sum-to-zero constraint prior is that this will lead to bad uncertainty calibration especially of the overall coefficent \alpha (too narrow posterior) if the true data generating model would be the model without hard sum-to-zero constraint. We actually had a visiting researcher working on this kind of problem at Aalto who then apparently stopped working on it some time after he left. @avehtari do you remember? So if I remember his preliminary results right, indeed we would not achieve proper uncertainty calibration if we used the (hard) sum-to-zero contraint model on the “standard” data generating multilevel model. For small number of groups I think both actually did not perform so well but I cannot remember exactly right now.

If what I just said holds true, then it may come down to whether we like some more efficiency MCMC sampling or some better uncertainty calibration more. I argue for the latter at least if the efficiency issue can be worked around with some additional effort. In any case, I think this is a really interesting topic, and perhaps it makes sense to write a paper about it together (if you are interested) that complements the papers mentioned above in the thread. I did not read them in detail but my feeling was that they may not have answered the question completely satisfactorily yet.

jsocolar · December 17, 2021, 3:37pm

10 posts were split to a new topic: Sum-to-zero intercepts and terminology

jsocolar · December 15, 2021, 3:23am

This post turned into a huge mess. I’m deleting its contents and working on an update.

spinkney · December 15, 2021, 8:26am

It’s 3 am and I’m up reading the forums so this could be completely unhelpful, take it fwiw. The above looks a lot like a conditional MVN: Multivariate normal distribution - Wikipedia

In the bivariate case it’s

X_1\mid X_2=a \ \sim\ \mathcal{N}\left(\mu_1+\frac{\sigma_1}{\sigma_2}\rho( a - \mu_2),\, (1-\rho^2)\sigma_1^2\right).

which isn’t exactly what you’ve coded. However, there is a sqrt(1.5) on the sum but with different values for Z1 and Z2. I’ve assumed that \sigma_1 = \sigma_2 = 1 and \rho = -0.5.

Maybe this helps point you or someone else in a useful direction? I didn’t quite understand from the above if I should assume 1 for sigma below though. Should it be \sigma_z = \sqrt{1 - \frac{1}{N^3}}? That’s easy enough to add.

data {
  int N;
}
transformed data {
  real rho = -0.5;
}
parameters{
  vector[N] Z;
}
model{
  Z[1] ~ std_normal();
  Z[2] ~ normal(rho * Z[1], sqrt(1 - rho^2) );
  sum(Z) ~ normal(0, sqrt(2 * (1 - rho^2))); 
}
generated quantities{
  real Z3 = -sum(Z);
}

I ran for 10k iterations just to see and get

 variable  mean median   sd  mad    q5   q95 rhat ess_bulk ess_tail
     lp__ -1.01  -0.70 1.00 0.72 -3.01 -0.05 1.00    17235    20784
     Z[1]  0.00   0.00 0.95 0.95 -1.57  1.56 1.00    13482    18203
     Z[2]  0.00   0.00 0.95 0.95 -1.55  1.58 1.00    13604    17849
     Z3    0.00   0.00 0.77 0.78 -1.28  1.27 1.00    39872    29097

hansvancalster · December 15, 2021, 2:22pm

Interesting discussion! I just want to note here that in the Ogle et al. paper that @martinmodrak linked to, a solution is given that looks compatible with concerns expressed here about generative nature (or not) of the model. It solves the non-identifiability problem post-hoc (in the generated quantities block) by “post-sweeping of random effects” (see the paper’s solution 4).

jsocolar · December 15, 2021, 6:11pm

Ok, trying from a clean slate to write down the math here. Thanks to @andre.pfeuffer for helping to track down a bug.

Given \vec{Y} \sim Normal(\alpha, \sigma), we can reparameterize using \vec{Z} = \vec{Y} - \alpha - \mu_R, where \mu_R is the sample mean of \vec{Y} - \alpha and therefore \sum{\vec{Z}} = 0.

We know that \mu_R \sim Normal(0, \frac{\sigma}{\sqrt{N}}), and from this we can obtain the marginal distributions for the elements of \vec{Z}, which are Normal(0, \sigma_Z) where \sigma_Z = \sigma\sqrt{1 - \frac{1}{N}}. A generative model for \vec{Z} will involve a multivariate normal over the first N-1 elements, which together uniquely define the final element. To get the margin for the final element right, we need negative correlations, and exchangeability implies that all correlations must be equal to one another. Since we need the variance of the (negative) sum to be \sigma_Z^2, the negative sum of all (N-1)^2 - (N-1) off-diagonal covariances must be (N-2)\sigma_Z^2. Thus, the covariances should all be -\sigma_Z^2\frac{(N-2)}{(N-1)^2 - (N-1)}.

However, this reparameterization in terms of \vec{Z}, \mu_R, \alpha, and \sigma is not yet useful, because although we’ve orthogonalized \vec{Z} and \alpha, we have a brutal posterior correlation between \mu_R and \alpha. To deal with this, we need to reparameterize in terms of \gamma = \mu_R + \alpha. This is straightforward, as long as we can derive the prior that we want over \gamma. For example, if the prior for \alpha is normally distributed \alpha \sim Normal(0, s) then we would have \gamma \sim Normal(0, \sqrt{s^2 + \frac{\sigma^2}{N}}).

From this parameterization in terms of \vec{Z}, \gamma, and \sigma, we can readily recover \vec{Y} = \gamma + \vec{Z}. Somewhat more tricky is re-deriving \alpha, due to the influence of the prior. We want to take advantage of the fact that we know the posterior margin for \mu_R to be Normal(0, \frac{\sigma}{\sqrt{N}}), but when the prior on \alpha is informative we know that large values of \gamma will arise primarily when we have large values of \mu_R. Thus, we can’t just simulate \mu_R from its margin in generated quantities and subtract. Instead, we need to simulate \mu_R from a density whose margin is Normal(0, \frac{\sigma}{\sqrt{N}}) but that also corrects for the fact that [\alpha = \gamma - \mu_R] \sim P_{{prior}_\alpha}. This feels like it should be achievable, but I’m not quite seeing the solution right now.

However, since I’ve already assumed that the prior on \alpha is Normal in order to derive the prior on \gamma, I think that a potentially simpler alternative is to exploit conjugacy to solve the posterior for \alpha conditional on P_{{prior}_\alpha}, \vec{Y} ,and \sigma. I won’t write that result out here, but with this solution in hand we can sample from it in generated quantities. I’m pretty sure (but not 100% positive) that this will give correct inference on \alpha.

martinmodrak · December 16, 2021, 5:31pm

I admit I had trouble digging through @avehtari’s suggestion (Kaufman & Sein, Bayesian functional ANOVA), as they handle a very general class of models and the abstraction was a bit difficult to me, but if I understand the paper correctly, they just use a sum-to-zero constraint (and some additional constraints in fact). This might be at least partially because it appears they care primarily about the partitioning of the total variance to various model components, which is likely less affected by this stuff.

Ogle & Barber seem to prefer to just fit the model as is and focus on monitoring identifiable quantities - (e.g. ignore Rhat of the raw intercepts, but look at Rhats of the group means) this IMHO can make sense in a lot of cases, but it cannot help with divergences and sampling efficiency.

As @paul.buerkner noted, it is possible that this is not a big deal in practice. The case where I am trying to get an improvement is 12-dimensional multivariate ordinal model with per-subject varying intercepts - so I have 48 global intercepts and a lot of varying intercepts, so the induced correlations operate in quite high dimension. That’s why I am thinking this could be a real issue in my model. But it is still possible that the main efficiency/convergence bottleneck lies elsewhere in the model and I am chasing phantoms.

Regarding @jsocolar’s attempt above, I think it suffers from a big problem:

I think this does not hold if \gamma is assumed a priori independent of \vec{Z} - if we had no data, the posterior of \vec{Y} should have all the elements uncorrelated, however since \vec{Z} has negative correlations and \gamma is independent of \vec{Z}, then \gamma + \vec{Z} will also have negative correlations and thus cannot represent the correct posterior. Or am I missing something?

I think I managed to make some progress in a somewhat different direction (that also IMHO promises to generalize to more than one varying intercept). But the model does not work completely, so any hints where a problem is in my derivation or in my code would be very welcome.

EDIT: This previously contained a bunch of math and code that had bugs. Will post a new reply with corrected math and code.

jsocolar · December 16, 2021, 5:48pm

I could be having a big brain fart here, but I think that what you are missing is that \gamma isn’t fixed but is itself random. The negative correlations in \vec{Z} are precisely chosen to be the right size to offset the positive correlations that are induced by adding the random \mu_R.

Now it’s possible that I’ve messed up my derivation of the generative distribution for \vec{Z}, but what is clear is that it is possible to reparameterize from \vec{Y} and \alpha to \vec{Z} and \gamma, where \gamma is chosen such that \sum{\vec{Z}} = 0. And given the sum-to-zero constraint on \vec{Z}, we know from first principles that its elements must be negatively correlated. So I’m fairly confident that in general we should expect to recover uncorrelated \vec{Y} from the sum of \gamma plus negatively correlated \vec{Z}.

jsocolar · December 16, 2021, 9:58pm

Something looks fishy here to me. Suppose the prior on \alpha is very narrow around zero. By definition, \mu_R = \bar{\alpha} - \alpha. In the limit that \sigma_\alpha \rightarrow 0, we have that \mu_R = \bar{\alpha}, which seems incompatible with \mu_R | \bar{\alpha},\bar{\beta} \sim N(m_R, \sigma_R) unless there’s an implicit formula for \sigma_R that has gone completely over my head and that shrinks to zero in the limit that \sigma_\alpha \rightarrow 0.

This is what I was getting at with:

jsocolar:

Somewhat more tricky is re-deriving \alpha α\alpha , due to the influence of the prior. We want to take advantage of the fact that we know the posterior margin for \mu_R μR\mu_R to be Normal(0, \frac{\sigma}{\sqrt{N}}) Normal(0,σ√N)Normal(0, \frac{\sigma}{\sqrt{N}}) , but when the prior on \alpha α\alpha is informative we know that large values of \gamma γ\gamma will arise primarily when we have large values of \mu_R μR\mu_R . Thus, we can’t just simulate \mu_R μR\mu_R from its margin in generated quantities and subtract. Instead, we need to simulate \mu_R μR\mu_R from a density whose margin is Normal(0, \frac{\sigma}{\sqrt{N}}) Normal(0,σ√N)Normal(0, \frac{\sigma}{\sqrt{N}}) but that also corrects for the fact that [\alpha = \gamma - \mu_R] \sim P_{{prior}\alpha} [α=γ−μR]∼Ppriorα[\alpha = \gamma - \mu_R] \sim P{{prior}_\alpha} . This feels like it should be achievable, but I’m not quite seeing the solution right now.

However, since I’ve already assumed that the prior on \alpha α\alpha is Normal in order to derive the prior on \gamma γ\gamma , I think that a potentially simpler alternative is to exploit conjugacy to solve the posterior for \alpha α\alpha conditional on P_{{prior}\alpha} PpriorαP{{prior}_\alpha} , \vec{Y} →Y\vec{Y} ,and \sigma σ\sigma . I won’t write that result out here, but with this solution in hand we can sample from it in generated quantities . I’m pretty sure (but not 100% positive) that this will give correct inference on \alpha α\alpha .

avehtari · December 17, 2021, 5:36pm

longitudinal disease progression study where it did matter (maybe this example is more clear than Ogle & Barber)

lgpr: an interpretable non-parametric method for inferring covariate effects from longitudinal data
R package build on top of Stan: Longitudinal Gaussian Process Regression • lgpr

Whether it matters computationally depends on the amount of data and parameterization.

Whether it can change the implied prior, depends on how strong the likelihood is.

Whether it matters when reporting the results depends on what you want to report.

spinkney · December 17, 2021, 6:16pm

Hi Martin, can you share the code for this sbc?

martinmodrak · December 20, 2021, 10:06pm

Definitely possible that I am missing something, but if you have \gamma \sim Normal(0, \sqrt{s^2 + \frac{\sigma^2}{N}}) where s is a constant and a zero-centered vector \vec{Z} \sim N(0, \sigma_Z), then a priori \vec{Z} has negatively correlated elements. Since \gamma is a-priori independent of \gamma then \gamma + \vec{Z} has a priori negatively correlated elements as well. So with little data or no data at all the posterior after reparametrization will have negative correlations, while the posterior for the original model would have uncorrelated posterior in such a case. So either the prior on \gamma should be different or the statement that \vec{Y} = \gamma + \vec{Z} doesn’t hold in your reparametrization.

It sounds promising, but I cannot add any more projects to my already too long wishlist. If I’ll be able to solve this for the model I am trying to build I’ll be more than happy. But if you or anyone else is interested or has students that are interested in doing the hard work (sim studies, etc.) needed for a paper just on this topic, I’d be happy to provide some assistance.

However, there was a bunch of errors in my derivation of A and A^{-1} and fixing those seems to lead to a useful model with both a centered and non-centered variant. Rather than editing the original post, I’ll repeat here, with the errors fixed. The centered version now appears to work quite well.

Using notation from my previous post, we can rephrase the model via a zero-sum vector \bar\beta and a new parameter \mu_R.

\begin{pmatrix} \alpha \\ \beta_1 \\ \vdots \\ \beta_{k - 1} \\ \beta_{k} \\ \end{pmatrix} = \mathbf{A} \begin{pmatrix} \bar\alpha \\ \bar\beta_1 \\ \vdots \\ \bar\beta_{k - 1} \\ \mu_R \\ \end{pmatrix} \\

Expand values of A and its inverse

\mathbf{A} = \begin{pmatrix} 1 & 0 & \ldots & 0 & -1 \\ 0 & 1 & \ldots & 0 & 1 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & \ldots & 1 & 1 \\ 0 & -1 & \ldots & -1 & 1 \end{pmatrix}

i.e. the top-left K \times K part of \mathbf{A} is an identity matrix, the last column starts with -1 and then has just 1 and the last row starts with 0, ends with 1 and has -1 in between. We then have:

\mathbf{A}^{-1} = \begin{pmatrix} 1 & \frac{1}{K} & \frac{1}{K} & \ldots & \frac{1}{K} & \frac{1}{K} \\ 0 & \frac{K - 1}{K} & -\frac{1}{K} & \ldots & -\frac{1}{K} & -\frac{1}{K} \\ 0 & \frac{1}{K} & \frac{K - 1}{K} & \ldots & -\frac{1}{K} & -\frac{1}{K} \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & -\frac{1}{K} & -\frac{1}{K} & \ldots & \frac{K - 1}{K} & -\frac{1}{K} \\ 0 & \frac{1}{K} & \frac{1}{K} & \ldots & \frac{1}{K} & \frac{1}{K} \end{pmatrix}

A-priori, we have

\begin{pmatrix} \alpha \\ \beta_1 \\ \vdots \\ \beta_{k - 1} \\ \beta_{k} \\ \end{pmatrix} \sim MVN \left( \mathbf{m} , \mathbf{\Sigma}\right) \\ \mathbf{m} = \begin{pmatrix} \mu_\alpha \\ 0 \\ \vdots \\ 0 \\ 0 \end{pmatrix}, \mathbf{\Sigma} = \mathbf{I}^{K+1} \begin{pmatrix} \sigma_\alpha^2 \\ \sigma_\beta^2 \\ \vdots \\ \sigma_\beta^2 \\ \sigma_\beta^2 \end{pmatrix}

i.e. \Sigma is a diagonal matrix and the first element is \sigma_\alpha^2 and all the other elements are \sigma_\beta^2.

From the properties of normal distribution we have:

\begin{pmatrix} \bar\alpha \\ \bar\beta_1 \\ \vdots \\ \bar\beta_{k - 1} \\ \mu_R \\ \end{pmatrix} \sim MVN\left( \mathbf{m} , \bar{\mathbf{\Sigma}} \right) \\ \bar{\mathbf{\Sigma}} = \mathbf{A}^{-1} \mathbf{\Sigma} (\mathbf{A}^{-1})^{\mathrm{T}}

(note: \mathbf{A}^{-1} \mathbf{m} = \mathbf{m}).

So we can directly transform the prior to apply to the sum-to-zero model.

Now the references seem to agree that actually we learn nothing about \mu_R beyond the prior, so it should be possible to recover \mu_R by conditioning using the prior covariance matrix. This is one of the steps I am the least sure about. This gives us the conditional distribution:

\mu_R | \bar\alpha, \bar\beta \sim N\left(m_R, \sigma_R\right) \\ m_R = \bar{\mathbf{\Sigma}}_{[K+1,2:K + 1]} \mathbf{\bar{\Sigma}}_{[1:K, 1:K]}^{-1} \begin{pmatrix} \bar\alpha - \mu_\alpha \\ \bar\beta_1 \\ \vdots \\ \bar\beta_{K-1}\\ \end{pmatrix}

Where \mathbf{\bar{\Sigma}}_{[a, b]} is Stan-style indexing over rows and columns of the covariance matrix to get sub-blocks and \sigma_R^2 is the Schur complement of the bottom-right element of \bar{\mathbf{\Sigma}}, i.e. we invert \bar{\mathbf{\Sigma}}, take only the bottom right element and invert that back.

It appears that after simplification:

m_R = \frac{(\bar\alpha - \mu_\alpha)\sigma_\beta^2}{K \sigma_\alpha^2 + \sigma_\beta^2} \\ \sigma_R^2= \frac{1}{\frac{1}{\sigma_\alpha^2} + \frac{K}{\sigma_\beta^2}}

Since there is a lot of structure in all the matrices, it should also be possible to arrive at formulas directly for the Cholesky decomposition of \mathbf{\bar{\Sigma}} (for the prior), but I wasn’t able to get it quickly. Even without using the cholesky decomposition, the code is actually pretty fast and seems to remove the correlations (only limited testing so far)

The centered version:

functions {
  vector Q_sum_to_zero_QR(int N) {
    vector [2*N] Q_r;

    for(i in 1:N) {
      Q_r[i] = -sqrt((N-i)/(N-i+1.0));
      Q_r[i+N] = inv_sqrt((N-i) * (N-i+1));
    }
    Q_r = Q_r * inv_sqrt(1 - inv(N));
    return Q_r;
  }

  vector sum_to_zero_QR(vector x_raw, vector Q_r) {
    int N = num_elements(x_raw) + 1;
    vector [N] x;
    real x_aux = 0;

    for(i in 1:N-1){
      x[i] = x_aux + x_raw[i] * Q_r[i];
      x_aux = x_aux + x_raw[i] * Q_r[i+N];
    }
    x[N] = x_aux;
    return x;
  }
}

data {
  int<lower=0> N;
  int<lower=1> N_groups;
  int<lower=1,upper=N_groups> groups[N];
  int<lower=0> y[N];
}

transformed data {
  vector[2 * N_groups] groups_Q_r = Q_sum_to_zero_QR(N_groups);
  matrix[N_groups + 1, N_groups + 1] inv_A;
  real mu_alpha = 3;
  real sigma_alpha = 1;

  inv_A[1,1] = 1;
  inv_A[2:(N_groups + 1), 1] = rep_vector(0, N_groups);
  inv_A[1, 2:(N_groups + 1)] = rep_row_vector(inv(N_groups), N_groups);
  inv_A[N_groups + 1, 2:(N_groups + 1)] = rep_row_vector(inv(N_groups), N_groups);

  inv_A[2:N_groups, 2:(N_groups + 1)] = rep_matrix(-inv(N_groups), N_groups - 1, N_groups);
  for(g in 2:N_groups) {
    inv_A[g,g] = (N_groups - 1) * inv(N_groups);
  }
}

parameters {
  real intercept_sweep;
  vector[N_groups - 1] group_r_raw; 
  real<lower=0> group_sd;
}

transformed parameters {
  vector[N_groups] group_r_sweep = sum_to_zero_QR(group_r_raw, groups_Q_r);
}

model {
  matrix[N_groups + 1, N_groups + 1] prior_sigma =
    quad_form(diag_matrix(append_row([sigma_alpha^2]', rep_vector(group_sd^2, N_groups))), transpose(inv_A));
  
  vector[N_groups + 1] intercept_and_groups = append_row([intercept_sweep]', group_r_sweep[1:(N_groups - 1)]);
   
  target += multi_normal_lpdf(intercept_and_groups | 
       append_row([mu_alpha]', rep_vector(0, N_groups - 1)), 
       prior_sigma[1:N_groups, 1:N_groups]);
  
  target += normal_lpdf(group_sd | 0, 1);
  target += poisson_log_lpmf(y | intercept_sweep + group_r_sweep[groups]);
}

generated quantities {
  real mean_group_r;
  vector[N_groups] group_r;
  real intercept;
  real m_R;
  real sigma_R;
  
  m_R = (intercept_sweep - mu_alpha)*(group_sd^2) / (N_groups * sigma_alpha^2 + group_sd^2);
  sigma_R = sqrt(inv(inv(sigma_alpha^2) + N_groups / (group_sd ^ 2)));

  mean_group_r = normal_rng(m_R, sigma_R);
  group_r = group_r_sweep + mean_group_r;
  intercept = intercept_sweep - mean_group_r;
}

And here’s the resulting pairs plots:

Here’s the SBC ecdf diff plot after 500 fits:

There was quite a lot of fits with divergences, so that’s probably something to investigate. Most likely some sort of non-centering would help, but the most obvious way to decenter the varying intercepts induces terrible geometry.

And here’s the SBC code for @spinkney or anyone else interested:

SBC validation code

library(SBC)
library(future)
plan(multisession)
options(SBC.min_chunk_size = 5)

SBC_cache_dir <- "_intercept-correlations-sbc-cache"
if(!dir.exists(SBC_cache_dir)) {
  dir.create(SBC_cache_dir)
}

model_suff_c <- cmdstan_model("intercept_suff_centered.stan")

generator <- function(N, N_groups) {

  groups <- rep(1:N_groups, length.out = N)

  intercept <- rnorm(1, mean = 3, sd = 1) 
  group_sd <- abs(rnorm(1))
  group_r <- rnorm(N_groups, sd = group_sd)

  mu <- intercept + group_r[groups]
  y <- rpois(N, exp(mu))

  list(parameters = list(
    intercept = intercept,
    group_sd = group_sd,
    group_r = group_r
  ),
       generated = list(N = N,
                    N_groups = N_groups,
                    groups = groups,
                    y = y)
  )
}

set.seed(5465524)

generator_c <- SBC_generator_function(generator, N = 50, N_groups = 8)
datasets_c <- generate_datasets(generator_c, n_datasets = 500)

suff_c_backend <- SBC_backend_cmdstan_sample(model_suff_c, chains = 2)
SBC_res_suff_c <- compute_results(datasets_c, suff_c_backend, 
                                  gen_quants = generated_quantities(
                                    mu_r = mean(group_r),
                                    group_r_c = group_r - mean(group_r),
                                    intercept_c = intercept + mean(group_r),
                                    sd_c = sd(group_r),
                                    log_lik = sum(dpois(y, exp(intercept + group_r[groups]), log = TRUE))),
                                  keep_fits = FALSE,
                                  cache_mode = "results",
                                  cache_location = file.path(SBC_cache_dir, "suff_c_datasets_c.rds"))


plot_ecdf_diff(SBC_res_suff_c)
plot_ecdf_diff(SBC_res_suff_c[SBC_res_suff_c$backend_diagnostics$n_divergent == 0])
plot_rank_hist(SBC_res_suff_c)
plot_rank_hist(SBC_res_suff_c[SBC_res_suff_c$backend_diagnostics$n_divergent == 0])

Luckily for me \mathbf{\bar{\Sigma}} contains elements proportional to \sigma^2_\alpha that end up propagating to \sigma_R and it seems (I still didn’t do the math properly to be able to say for sure) that we get: \sigma_R^2 = \frac{1}{\frac{1}{\sigma_\alpha^2} + \frac{K}{\sigma_\beta^2}} which satisfies this limiting relation.

jsocolar · December 20, 2021, 11:03pm

But \gamma is a scalar that expands to a vector with all elements identical. That is, all elements of \gamma are perfectly correlated with one another. So we’re adding this perfectly and positively correlated \gamma vector to the negatively correlated \vec{Z}. And if I’ve done the math right, what we get out at the end is uncorrelated \vec{Y}. I’m not at all sure I’ve done the math right or made the correct assumptions. Edit: note that definitionally \gamma = mean(\vec{Y}), and \vec{Y} = \gamma + \vec{Z}. Regardless of whether or not I have the arithmetic right, this equivalency can certainly be satisfied with iid gaussian \vec{Y}, sum-to-zero (and therefore negatively correlated) \vec{Z}, and scalar \gamma = mean(\vec{Y}).

Your new derivation looks awesome to me. Note that it is subject to one significant limitation, which is that it only works if the prior on the intercept is Gaussian. Even if my version is correct, it suffers from this same limitation.

Note also that there must be a typo in your formula for \sigma^2_R. Consider the limit when \sigma_\alpha is very large. Then the variance should just be the variance of the sample mean of a collection of iid Gaussian variates. And thus where you have \sigma_\beta you must need \sigma_\beta^2.

Ultimately, I still wonder whether or not what I’ve written down turns out to be the same thing as what you have. If so, it might yield some computational shortcuts (but on the other hand it does not easily yield clarity or generality)!

Topic		Replies	Views
Linear model intercept has unexpectedly small variance Modeling brms	1	48	December 24, 2024
Sum-to-zero intercepts and terminology General	18	2292	February 29, 2024
Model samples perfect correlation between intercept and slope brms	3	501	October 20, 2022
Problem of a high correlation among intercept of a probit model Modeling ordinal-response , brms	1	46	September 27, 2024
Model specification with random intercept for group vs. predicting group-level mean Modeling fitting-issues , specification , brms	1	33	September 9, 2024