I am trying to validate a model, and thus generating simulated data and running it through two models, one with and without a covariate. This seems to work well, as in the covariate model estimated parameters(theta, beta, type) are very close to the true parameters, and the baseline model estimated parameters (theta and beta) are not. I am generating my data according to the covariate model and running that data through both, so this result makes sense as the baseline model isnt able to account for the effect of type. However, when i compare them using LOO, they do not seem to be any different in predictive accuracy.
model_covariate <- stan_model(model_code = "data {
int<lower=0> I;
int<lower=0> J;
int<lower=0> K;
int<lower=1> N;
int<lower=1,upper=I> ii[N];
int<lower=1,upper=J> jj[N];
int<lower=1,upper=K> kk[N];
int<lower=0> y[N];
}
parameters {
vector[J] theta;
vector[I-1] b_free;
vector[K-1] k_free;
}
transformed parameters{
vector[I] beta = append_row(b_free, -sum(b_free));
vector[K] type = append_row(k_free, -sum(k_free));
}
model {
theta ~ normal(0, 1);
target += normal_lpdf(beta | 0, 3);
target += normal_lpdf(type | 0, 3);
y ~ poisson_log(theta[jj] - beta[ii] - type[kk]);
}
generated quantities{
vector[N] log_lik;
for(n in 1:N){
log_lik[n] = poisson_lpmf( y[n] | exp(theta[jj[n]] - beta[ii[n]] - type[kk[n]]));
}
}
")
The only difference between my covariate model above and the baseline is the absence of “type”. I am assuming there is overfitting going on? I am not sure. Below is my data generation code if needed. And here are the LOO results.
> loo_baseline
Computed from 8000 by 8000 log-likelihood matrix
Estimate SE
elpd_loo -11373.2 103.6
p_loo 228.6 7.5
looic 22746.3 207.3
Monte Carlo SE of elpd_loo is 0.2.
Pareto k diagnostic values:
Count Pct. Min. n_eff
(-Inf, 0.5] (good) 7991 99.9% 856
(0.5, 0.7] (ok) 9 0.1% 234
(0.7, 1] (bad) 0 0.0% <NA>
(1, Inf) (very bad) 0 0.0% <NA>
-------------------------------------------------
> loo_covariate
Computed from 8000 by 8000 log-likelihood matrix
Estimate SE
elpd_loo -11372.7 103.6
p_loo 228.2 7.5
looic 22745.3 207.2
------
Monte Carlo SE of elpd_loo is 0.2.
Pareto k diagnostic values:
Count Pct. Min. n_eff
(-Inf, 0.5] (good) 7990 99.9% 644
(0.5, 0.7] (ok) 10 0.1% 296
(0.7, 1] (bad) 0 0.0% <NA>
(1, Inf) (very bad) 0 0.0% <NA>
> loo::loo_compare(loo_covariate,loo_baseline)
elpd_diff se_diff
model1 0.0 0.0
model2 -0.5 0.3
I <- 40;
J <- 200;
K <- I/10
theta <- rnorm(J, 0, 1)
beta <- rnorm(I, 0, 1)
type <- rnorm(K, 0, 1)
N <- I*J
ii <- rep(1:I, times = J)
jj <- rep(1:J, each = I)
kk <- rep(1:K, times = N/K)
y <- numeric(N)
for(n in 1:N){
y[n] <- rpois(1,exp(theta[jj[n]] - beta[ii[n]] - type[kk[n]]))
}
fit_covariate <- sampling(model_covariate, data = list(I = I,
J = J,
K = K,
N = N,
ii = ii,
jj = jj,
kk = kk,
y = y),
iter = 4000)
fit_baseline <- sampling(model_baseline, data = list(I = I,
J = J,
N = N,
ii = ii,
jj = jj,
y = y),
iter = 4000)
loo_baseline <- loo::loo(fit_baseline)
loo_covariate <- loo::loo(fit_covariate)
loo::loo_compare(loo_covariate,loo_baseline)