Poisson latent Gaussian model - covariance structure is not estimated

The paradox here is that the model seems to do a better job with less data.

Consider a typical latent Gaussian model:
\phi \sim \pi \\ \theta_i \sim N(0, \Sigma_\phi) \\ y_{j \in g(i)} \sim \mathrm{Poisson}(e^{\theta_i})

with M groups and N data points. How do we generate a data point? We first sample the \theta's. Next, we randomly assign a new point to one of M groups, indexed i. Then we generate y_j from a \mathrm{Poisson}(e^{\theta_i}). This means there are not the same amount of observations for each group.

Now \phi determines the variance in the Normal and the correlation is set to 0.9.

I started with M = 10 and N = 100. In this regime, I cannot recover the correlation structure. The pairs plot suggest the \theta's are seemingly uncorrelated. If M = N, I do a better job describing the correlation structure (albeit a slight asymmetry which shouldn’t be there). See the pairs plot.

For N > M:

For N = M:

The paradox here is that you expect that with more data you would do a better job recovering the true data generating process. So what exactly is going on?

The data is simulated on R:

  mu = rep(0, M)

  # construct covariance matrix
  # (not an optimal implementation but easy to code)
  Covariance <- matrix(NA, nrow = M, ncol = M)
  for (i in 1:M) {
    for (j in 1:(i - 1)) {
      Covariance[i, j] = corr * phi
      Covariance[j, i] = corr * phi
    Covariance[i, i] = phi

  x = mvrnorm(n = 1, mu, Covariance)

  index = sample(x = 1:M, size = N, replace = TRUE)
  y = rpois(N, exp(x[index]))

  data <- list(y = y,
               index = index,
               M = M,
               N = N)
  if (corr_is_fixed) data$rho <- corr
  if (phi_is_fixed) data$phi <- phi

  # output to data file
  with(data, stan_rdump(ls(data), file =
                          file.path(data.dir, paste0('data_', M, '.R'))))

The Stan model is:

data {
  int N;  // number of observations
  int M;  // number of groups
  int y[N];
  int<lower = 1, upper = M> index[N];
  real<lower = -1, upper = 1> rho;

parameters {
  // global parameters
  real<lower = 0> phi;

  // local parameter
  vector[M] theta;

transformed parameters {
  cov_matrix[M] Sigma;
  cholesky_factor_cov[M] L;

  for (i in 1:M) {
    for (j in 1:(i - 1)) {
        Sigma[i, j] = rho * phi;
        Sigma[j, i] = rho * phi;
    Sigma[i, i] = phi;

  L = cholesky_decompose(Sigma);

model {
  phi ~ lognormal(-1, 0.5);

  theta ~ multi_normal_cholesky(rep_vector(0.0, M), L);
  for (i in 1:N) y[i] ~ poisson_log(theta[index[i]]);

Did you find an answer? Is it possible that the data is overruling the prior?

1 Like

No, haven’t worked it out yet. I agree with you a disagreement between the prior and the data may be the explanation. But the data at hand is simulated by a model “consistent” (to use a technical term loosely) with the prior. So, some of kind of non-identifiability must also kick in.

@avehtari and @anon75146577, any ideas? This is related to the computer experiment for the Laplace approximator.

I had an enlightning conversation with @avehtari, @anon75146577, and @betanalpha, all of whom are currently in New York.

The first thing to realize is that the correlation between the \theta_i's is NOT captured by the correlation displayed in the pairs plot. The pairs plot shows the samples from the posterior distribution of \theta. Bearing in mind the hierarchical structure of the model, when we are in the sparse data regime, the partial pooling plays an important part. That is, the inference on \theta_i is strongly influenced by the inference on \theta_j, where i \neq j. Hence the correlation. On the other hand, in the large data regime, the data (rather than the hyperprior) dominates the inference. Since we use different data points for each group, the inference on the \theta_i's appears independent.

Does this means we cannot make any inference on the correlation between the \theta_i's? Well, we can, through our estimation of \phi, which controls the covariance matrix \Sigma_\phi.

Note this is all related to the discussion on central and non-central parametrization, and which one to use when either the data or the prior dominates. Of course, HMC is more effective when the posteriors for the \theta_i's appear independent.