The paradox here is that the model seems to do a better job with less data.
Consider a typical latent Gaussian model:
\phi \sim \pi \\
\theta \sim N(0, \Sigma_\phi) \\
y_{j \in g(i)} \sim \mathrm{Poisson}(e^{\theta_i})
with M groups and N data points. How do we generate a data point? We first sample the vector of \theta's. Next, we randomly assign each new point to one of the M groups, indexed by i. Then we generate y_j from a \mathrm{Poisson}(e^{\theta_i}). As a result, the groups do not all contain the same number of observations.
Here \phi determines the variance of the normal, and the pairwise correlation between the \theta's is fixed at 0.9.
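Concretely, reading the covariance off the simulation code below, \Sigma_\phi has \phi on the diagonal and \rho \phi everywhere else:

\Sigma_\phi = \phi \begin{pmatrix} 1 & \rho & \cdots & \rho \\ \rho & 1 & \cdots & \rho \\ \vdots & \vdots & \ddots & \vdots \\ \rho & \rho & \cdots & 1 \end{pmatrix}, \qquad \rho = 0.9.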
I started with M = 10 and N = 100. In this regime, I cannot recover the correlation structure: the pairs plot suggests the \theta's are essentially uncorrelated. If M = N, I do a better job describing the correlation structure (albeit with a slight asymmetry that shouldn't be there). See the pairs plots below.
For N > M: [pairs plot]

For N = M: [pairs plot]
The paradox is that you would expect more data to lead to a better recovery of the true data generating process. So what exactly is going on?
The data are simulated in R:
library(MASS)   # for mvrnorm
library(rstan)  # for stan_rdump

# M, N, phi, corr, corr_is_fixed, phi_is_fixed, and data.dir
# are set upstream (e.g. M = 10, N = 100, corr = 0.9).

mu <- rep(0, M)

# Construct the covariance matrix: phi on the diagonal,
# corr * phi off the diagonal.
# (Not an optimal implementation, but easy to code.)
Covariance <- matrix(NA, nrow = M, ncol = M)
for (i in 1:M) {
  for (j in seq_len(i - 1)) {  # seq_len avoids the 1:0 pitfall when i == 1
    Covariance[i, j] <- corr * phi
    Covariance[j, i] <- corr * phi
  }
  Covariance[i, i] <- phi
}

# Draw the latent vector, assign each observation to a group
# uniformly at random, and generate the Poisson counts.
x <- mvrnorm(n = 1, mu, Covariance)
index <- sample(x = 1:M, size = N, replace = TRUE)
y <- rpois(N, exp(x[index]))

data <- list(y = y,
             index = index,
             M = M,
             N = N)
if (corr_is_fixed) data$rho <- corr
if (phi_is_fixed) data$phi <- phi

# Output to data file.
with(data, stan_rdump(ls(data), file =
  file.path(data.dir, paste0('data_', M, '.R'))))
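As a quick sanity check (a minimal sketch, assuming the objects above are still in scope), we can confirm that the groups end up with unequal numbers of observations, and that the constructed covariance produces the intended correlation:

# Group sizes: random assignment leaves the groups unbalanced.
table(index)

# Draw many replicates of the latent vector and compare the
# sample correlations to the target corr.
x_rep <- mvrnorm(n = 1e4, mu, Covariance)
round(cor(x_rep)[1:3, 1:3], 2)  # off-diagonal entries should be near corr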
The Stan model is:
data {
  int N;                               // number of observations
  int M;                               // number of groups
  int y[N];                            // observed counts
  int<lower = 1, upper = M> index[N];  // group assignment of each observation
  real<lower = -1, upper = 1> rho;     // correlation, fixed and passed in as data
}
parameters {
  // global parameter
  real<lower = 0> phi;
  // local parameters
  vector[M] theta;
}
transformed parameters {
  cov_matrix[M] Sigma;
  cholesky_factor_cov[M] L;
  // Sigma has phi on the diagonal and rho * phi off the diagonal.
  for (i in 1:M) {
    for (j in 1:(i - 1)) {
      Sigma[i, j] = rho * phi;
      Sigma[j, i] = rho * phi;
    }
    Sigma[i, i] = phi;
  }
  L = cholesky_decompose(Sigma);
}
model {
  phi ~ lognormal(-1, 0.5);
  theta ~ multi_normal_cholesky(rep_vector(0.0, M), L);
  y ~ poisson_log(theta[index]);  // vectorized over observations
}
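For completeness, here is one way to fit the model and draw the pairs plots with rstan (a sketch; the model file name and the parameters shown are illustrative assumptions, not taken from the original code):

library(rstan)

# Read back the dumped data and fit the model
# ("model.stan" is an assumed file name).
stan_data <- read_rdump(file.path(data.dir, paste0('data_', M, '.R')))
fit <- stan("model.stan", data = stan_data, chains = 4, iter = 2000)

# Pairs plot of a few of the theta's to inspect the recovered correlation.
pairs(fit, pars = c("theta[1]", "theta[2]", "theta[3]"))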