Correlated posterior - Sum to zero constraint for varying intercepts?!

Definitely possible that I am missing something, but if you have \gamma \sim Normal(0, \sqrt{s^2 + \frac{\sigma^2}{N}}) where s is a constant, and a zero-centered vector \vec{Z} \sim N(0, \sigma_Z), then a priori \vec{Z} has negatively correlated elements. Since \gamma is a priori independent of \vec{Z}, \gamma + \vec{Z} also has negatively correlated elements a priori. So with little or no data the posterior after reparametrization will have negative correlations, while the posterior for the original model would be uncorrelated in such a case. So either the prior on \gamma should be different, or the statement that \vec{Y} = \gamma + \vec{Z} doesn’t hold in your reparametrization.

It sounds promising, but I cannot add any more projects to my already too long wishlist. If I manage to solve this for the model I am trying to build, I’ll be more than happy. But if you or anyone else is interested, or has students interested in doing the hard work (sim studies, etc.) needed for a paper just on this topic, I’d be happy to provide some assistance.

However, there were a bunch of errors in my derivation of A and A^{-1}, and fixing those seems to lead to a useful model with both a centered and a non-centered variant. Rather than editing the original post, I’ll repeat the derivation here with the errors fixed. The centered version now appears to work quite well.

Using notation from my previous post, we can rephrase the model via a zero-sum vector \bar\beta and a new parameter \mu_R.

\begin{pmatrix} \alpha \\ \beta_1 \\ \vdots \\ \beta_{K - 1} \\ \beta_{K} \\ \end{pmatrix} = \mathbf{A} \begin{pmatrix} \bar\alpha \\ \bar\beta_1 \\ \vdots \\ \bar\beta_{K - 1} \\ \mu_R \\ \end{pmatrix}
The values of \mathbf{A} and its inverse are:
\mathbf{A} = \begin{pmatrix} 1 & 0 & \ldots & 0 & -1 \\ 0 & 1 & \ldots & 0 & 1 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & \ldots & 1 & 1 \\ 0 & -1 & \ldots & -1 & 1 \end{pmatrix}

i.e. the top-left K \times K block of \mathbf{A} is an identity matrix, the last column is -1 followed by 1s, and the last row is 0 followed by -1s and ending with 1. We then have:

\mathbf{A}^{-1} = \begin{pmatrix} 1 & \frac{1}{K} & \frac{1}{K} & \ldots & \frac{1}{K} & \frac{1}{K} \\ 0 & \frac{K - 1}{K} & -\frac{1}{K} & \ldots & -\frac{1}{K} & -\frac{1}{K} \\ 0 & -\frac{1}{K} & \frac{K - 1}{K} & \ldots & -\frac{1}{K} & -\frac{1}{K} \\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & -\frac{1}{K} & -\frac{1}{K} & \ldots & \frac{K - 1}{K} & -\frac{1}{K} \\ 0 & \frac{1}{K} & \frac{1}{K} & \ldots & \frac{1}{K} & \frac{1}{K} \end{pmatrix}
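Just as a numerical sanity check (not part of the model itself), here is a small R sketch that builds \mathbf{A} as described above and compares the claimed closed form of \mathbf{A}^{-1} against solve(); the value K <- 5 and all object names are arbitrary choices for illustration:

# Build A: identity in the top-left K x K block,
# last column (-1, 1, ..., 1), last row (0, -1, ..., -1, 1).
K <- 5
A <- diag(K + 1)
A[1, K + 1] <- -1
A[2:K, K + 1] <- 1
A[K + 1, 2:K] <- -1
A[K + 1, K + 1] <- 1

# Build the claimed closed-form inverse.
A_inv <- matrix(0, K + 1, K + 1)
A_inv[1, ] <- c(1, rep(1 / K, K))
A_inv[K + 1, ] <- c(0, rep(1 / K, K))
A_inv[2:K, 2:(K + 1)] <- -1 / K
for (i in 2:K) A_inv[i, i] <- (K - 1) / K

max(abs(A_inv - solve(A)))           # should be ~0
max(abs(A %*% A_inv - diag(K + 1)))  # should be ~0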

A priori, we have

\begin{pmatrix} \alpha \\ \beta_1 \\ \vdots \\ \beta_{K - 1} \\ \beta_{K} \\ \end{pmatrix} \sim MVN \left( \mathbf{m} , \mathbf{\Sigma}\right) \\ \mathbf{m} = \begin{pmatrix} \mu_\alpha \\ 0 \\ \vdots \\ 0 \\ 0 \end{pmatrix}, \mathbf{\Sigma} = \operatorname{diag}\left( \sigma_\alpha^2, \sigma_\beta^2, \ldots, \sigma_\beta^2, \sigma_\beta^2 \right)

i.e. \Sigma is a diagonal matrix whose first element is \sigma_\alpha^2 and whose remaining elements are all \sigma_\beta^2.

From the properties of the multivariate normal distribution we have:

\begin{pmatrix} \bar\alpha \\ \bar\beta_1 \\ \vdots \\ \bar\beta_{K - 1} \\ \mu_R \\ \end{pmatrix} \sim MVN\left( \mathbf{m} , \bar{\mathbf{\Sigma}} \right) \\ \bar{\mathbf{\Sigma}} = \mathbf{A}^{-1} \mathbf{\Sigma} (\mathbf{A}^{-1})^{\mathrm{T}}

(note: \mathbf{A}^{-1} \mathbf{m} = \mathbf{m}).

So we can directly transform the prior to apply to the sum-to-zero model.
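Continuing the same R sketch (reusing K and A_inv from above; sigma_alpha and sigma_beta are arbitrary illustrative values), the transformed prior covariance can be computed directly:

# Transform the diagonal prior covariance into the covariance of
# (alpha_bar, beta_bar_1, ..., beta_bar_{K-1}, mu_R).
sigma_alpha <- 1
sigma_beta  <- 0.7
Sigma <- diag(c(sigma_alpha^2, rep(sigma_beta^2, K)))
Sigma_bar <- A_inv %*% Sigma %*% t(A_inv)
round(Sigma_bar, 3)
# Note the structure: mu_R (last row/column) is correlated only with alpha_bar.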

Now the references seem to agree that we actually learn nothing about \mu_R beyond the prior, so it should be possible to recover \mu_R by conditioning on the remaining parameters via the prior covariance matrix. This is one of the steps I am least sure about. This gives us the conditional distribution:

\mu_R | \bar\alpha, \bar\beta \sim N\left(m_R, \sigma_R\right) \\ m_R = \bar{\mathbf{\Sigma}}_{[K+1,1:K]} \mathbf{\bar{\Sigma}}_{[1:K, 1:K]}^{-1} \begin{pmatrix} \bar\alpha - \mu_\alpha \\ \bar\beta_1 \\ \vdots \\ \bar\beta_{K-1}\\ \end{pmatrix}

Where \mathbf{\bar{\Sigma}}_{[a, b]} is Stan-style indexing over rows and columns of the covariance matrix to get sub-blocks, and \sigma_R^2 is the Schur complement of the top-left K \times K block of \bar{\mathbf{\Sigma}}, i.e. we invert \bar{\mathbf{\Sigma}}, take only the bottom-right element and invert that back.
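Written out, this is just the standard conditional-variance / block-inverse identity:

\sigma_R^2 = \bar{\mathbf{\Sigma}}_{[K+1,K+1]} - \bar{\mathbf{\Sigma}}_{[K+1,1:K]} \bar{\mathbf{\Sigma}}_{[1:K,1:K]}^{-1} \bar{\mathbf{\Sigma}}_{[1:K,K+1]} = \left( \left(\bar{\mathbf{\Sigma}}^{-1}\right)_{[K+1,K+1]} \right)^{-1}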

It appears that after simplification:

m_R = \frac{(\bar\alpha - \mu_\alpha)\sigma_\beta^2}{K \sigma_\alpha^2 + \sigma_\beta^2} \\ \sigma_R^2= \frac{1}{\frac{1}{\sigma_\alpha^2} + \frac{K}{\sigma_\beta^2}}
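And continuing the R sketch once more (again only a numerical check; the values chosen for \bar\alpha - \mu_\alpha and for \bar\beta are arbitrary), the simplified formulas can be compared against the generic conditional-Gaussian expressions:

# Conditional of mu_R given (alpha_bar, beta_bar_1, ..., beta_bar_{K-1}),
# using the generic Gaussian conditioning formulas on Sigma_bar.
alpha_bar_centered <- 0.9   # arbitrary value of alpha_bar - mu_alpha
beta_bar <- rep(0.2, K - 1) # arbitrary values of beta_bar (they drop out of m_R)

cond_vec <- c(alpha_bar_centered, beta_bar)
S12 <- Sigma_bar[K + 1, 1:K, drop = FALSE]
S11 <- Sigma_bar[1:K, 1:K]

m_R_block   <- drop(S12 %*% solve(S11) %*% cond_vec)
var_R_block <- Sigma_bar[K + 1, K + 1] - drop(S12 %*% solve(S11) %*% t(S12))

# Simplified formulas from above:
m_R_simple   <- alpha_bar_centered * sigma_beta^2 / (K * sigma_alpha^2 + sigma_beta^2)
var_R_simple <- 1 / (1 / sigma_alpha^2 + K / sigma_beta^2)

c(m_R_block - m_R_simple, var_R_block - var_R_simple)  # both should be ~0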

Since there is a lot of structure in all these matrices, it should also be possible to arrive at formulas directly for the Cholesky decomposition of \mathbf{\bar{\Sigma}} (for the prior), but I wasn’t able to get that quickly. Even without using the Cholesky decomposition, the code is actually pretty fast and seems to remove the correlations (only limited testing so far).

The centered version:

functions {
  // Precompute the coefficients of the QR-based sum-to-zero transform; the final
  // scaling makes each element of the constrained vector have unit marginal sd
  // when the raw vector is standard normal.
  vector Q_sum_to_zero_QR(int N) {
    vector [2*N] Q_r;

    for(i in 1:N) {
      Q_r[i] = -sqrt((N-i)/(N-i+1.0));
      Q_r[i+N] = inv_sqrt((N-i) * (N-i+1));
    }
    Q_r = Q_r * inv_sqrt(1 - inv(N));
    return Q_r;
  }

  // Map an unconstrained vector of length N - 1 to a length-N vector that sums to zero.
  vector sum_to_zero_QR(vector x_raw, vector Q_r) {
    int N = num_elements(x_raw) + 1;
    vector [N] x;
    real x_aux = 0;

    for(i in 1:N-1){
      x[i] = x_aux + x_raw[i] * Q_r[i];
      x_aux = x_aux + x_raw[i] * Q_r[i+N];
    }
    x[N] = x_aux;
    return x;
  }
}

data {
  int<lower=0> N;
  int<lower=1> N_groups;
  int<lower=1,upper=N_groups> groups[N];
  int<lower=0> y[N];
}

transformed data {
  vector[2 * N_groups] groups_Q_r = Q_sum_to_zero_QR(N_groups);
  matrix[N_groups + 1, N_groups + 1] inv_A;
  real mu_alpha = 3;
  real sigma_alpha = 1;

  // Fill in A^{-1} as derived above (first row, last row, off-diagonal and diagonal blocks).
  inv_A[1,1] = 1;
  inv_A[2:(N_groups + 1), 1] = rep_vector(0, N_groups);
  inv_A[1, 2:(N_groups + 1)] = rep_row_vector(inv(N_groups), N_groups);
  inv_A[N_groups + 1, 2:(N_groups + 1)] = rep_row_vector(inv(N_groups), N_groups);

  inv_A[2:N_groups, 2:(N_groups + 1)] = rep_matrix(-inv(N_groups), N_groups - 1, N_groups);
  for(g in 2:N_groups) {
    inv_A[g,g] = (N_groups - 1) * inv(N_groups);
  }
}

parameters {
  real intercept_sweep;
  vector[N_groups - 1] group_r_raw; 
  real<lower=0> group_sd;
}

transformed parameters {
  vector[N_groups] group_r_sweep = sum_to_zero_QR(group_r_raw, groups_Q_r);
}

model {
  // Transformed prior covariance: Sigma_bar = A^-1 * diag(sigma_alpha^2, group_sd^2, ..., group_sd^2) * (A^-1)'
  matrix[N_groups + 1, N_groups + 1] prior_sigma =
    quad_form(diag_matrix(append_row([sigma_alpha^2]', rep_vector(group_sd^2, N_groups))), transpose(inv_A));
  
  vector[N_groups + 1] intercept_and_groups = append_row([intercept_sweep]', group_r_sweep[1:(N_groups - 1)]);
   
  target += multi_normal_lpdf(intercept_and_groups | 
       append_row([mu_alpha]', rep_vector(0, N_groups - 1)), 
       prior_sigma[1:N_groups, 1:N_groups]);
  
  target += normal_lpdf(group_sd | 0, 1);
  target += poisson_log_lpmf(y | intercept_sweep + group_r_sweep[groups]);
}

generated quantities {
  real mean_group_r;
  vector[N_groups] group_r;
  real intercept;
  real m_R;
  real sigma_R;
  
  m_R = (intercept_sweep - mu_alpha)*(group_sd^2) / (N_groups * sigma_alpha^2 + group_sd^2);
  sigma_R = sqrt(inv(inv(sigma_alpha^2) + N_groups / (group_sd ^ 2)));

  mean_group_r = normal_rng(m_R, sigma_R);
  group_r = group_r_sweep + mean_group_r;
  intercept = intercept_sweep - mean_group_r;
}

And here are the resulting pairs plots:

Here’s the SBC ecdf diff plot after 500 fits:

There were quite a lot of fits with divergences, so that’s probably something to investigate. Most likely some sort of non-centering would help, but the most obvious way to non-center the varying intercepts induces terrible geometry.

And here’s the SBC code for @spinkney or anyone else interested:

SBC validation code
library(SBC)
library(cmdstanr) # for cmdstan_model()
library(future)
plan(multisession)
options(SBC.min_chunk_size = 5)

SBC_cache_dir <- "_intercept-correlations-sbc-cache"
if(!dir.exists(SBC_cache_dir)) {
  dir.create(SBC_cache_dir)
}

model_suff_c <- cmdstan_model("intercept_suff_centered.stan")

generator <- function(N, N_groups) {

  groups <- rep(1:N_groups, length.out = N)

  intercept <- rnorm(1, mean = 3, sd = 1) 
  group_sd <- abs(rnorm(1))
  group_r <- rnorm(N_groups, sd = group_sd)

  mu <- intercept + group_r[groups]
  y <- rpois(N, exp(mu))

  list(parameters = list(
    intercept = intercept,
    group_sd = group_sd,
    group_r = group_r
  ),
       generated = list(N = N,
                    N_groups = N_groups,
                    groups = groups,
                    y = y)
  )
}

set.seed(5465524)

generator_c <- SBC_generator_function(generator, N = 50, N_groups = 8)
datasets_c <- generate_datasets(generator_c, n_datasets = 500)

suff_c_backend <- SBC_backend_cmdstan_sample(model_suff_c, chains = 2)
SBC_res_suff_c <- compute_results(datasets_c, suff_c_backend, 
                                  gen_quants = generated_quantities(
                                    mu_r = mean(group_r),
                                    group_r_c = group_r - mean(group_r),
                                    intercept_c = intercept + mean(group_r),
                                    sd_c = sd(group_r),
                                    log_lik = sum(dpois(y, exp(intercept + group_r[groups]), log = TRUE))),
                                  keep_fits = FALSE,
                                  cache_mode = "results",
                                  cache_location = file.path(SBC_cache_dir, "suff_c_datasets_c.rds"))


plot_ecdf_diff(SBC_res_suff_c)
plot_ecdf_diff(SBC_res_suff_c[SBC_res_suff_c$backend_diagnostics$n_divergent == 0])
plot_rank_hist(SBC_res_suff_c)
plot_rank_hist(SBC_res_suff_c[SBC_res_suff_c$backend_diagnostics$n_divergent == 0])

Luckily for me \mathbf{\bar{\Sigma}} contains elements proportional to \sigma^2_\alpha that end up propagating to \sigma_R, and it seems (I still haven’t done the math properly to be able to say for sure) that we get \sigma_R^2 = \frac{1}{\frac{1}{\sigma_\alpha^2} + \frac{K}{\sigma_\beta^2}}, which satisfies this limiting relation.
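(Assuming the limiting relation in question is the behaviour as the intercept prior becomes flat, the formula indeed gives

\lim_{\sigma_\alpha \to \infty} \sigma_R^2 = \lim_{\sigma_\alpha \to \infty} \frac{1}{\frac{1}{\sigma_\alpha^2} + \frac{K}{\sigma_\beta^2}} = \frac{\sigma_\beta^2}{K},

i.e. the variance of the mean of K independent Normal(0, \sigma_\beta) effects.)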
