Dear all,
I have conducted a simulation study to check how loo works for an item response theory (IRT) model. Specifically, I generated datasets from an IRT model where there are six ordinal indicators which are considered as indicators of a latent variable. For each simulated dataset, I generate 200 subjects, and as said, each subject has data for six ordinal indicators.
Then in R, I fit the true model to check how loo works. The R codes and stan model are included at the end of this post. When calling loo, I received the following output.
Pareto k diagnostic values:
Count Pct. Min. ESS
(-Inf, 0.7] (good) 1187 98.9% 219
(0.7, 1] (bad) 13 1.1% <NA>
(1, Inf) (very bad) 0 0.0% <NA>
See help('pareto-k-diagnostic') for details.
(For many simulated datasets, I even saw a bigger number for (0.7, 1] (bad) such as 18, 42, or even 60).
p_loo for this simulated dataset is 194.7.
Computed from 7500 by 1200 log-likelihood matrix.
Estimate SE
elpd_loo -968.7 25.4
p_loo 194.7 8.2
looic 1937.5 50.8
If we count the total number of parameters in the parameter block in the Stan code, we get 227 (= 27 + 200 for latent variable xi[N]). Thus we fall in this category (\widehat{k}>0.7 and p_loo <p). As discussed here by @avehtari, “This can happen even when using the true model, in which case there is now way to improve the model [I guess this is a typo, it should be no way], and we just have to improve the computation”.
Because I simulated from an IRT and I fitted the same model, I did not expect any problem with the dataset and the model. After receiving the loo output, I then would like to have several questions
-
Why do we obtain high Pareto values here even we are working with the simulated data and the true model?
-
How does leave-one-out cross-validation approach work exactly in this case? I am asking this because the input (int Y[N, K]) is a matrix of size [200, 6]. If we call (i) an entry of that matrix as an observation, we then have 1200 observations and that we leave each observation out, (ii) or do we call a row (= a subject) as an observation, and each time with loo, we leave one row out? In Stan generated quantities block, I specified the lok-lig as
generated quantities { matrix[N, K] log_lik; for (i in 1 : N){ log_lik[i, 1] = ordered_logistic_lpmf(Y[i, 1] | lambda_matrix[1] * xi[i], theta1); log_lik[i, 2] = ordered_logistic_lpmf(Y[i, 2] | lambda_matrix[2] * xi[i], theta2); log_lik[i, 3] = ordered_logistic_lpmf(Y[i, 3] | lambda_matrix[3] * xi[i], theta3); log_lik[i, 4] = ordered_logistic_lpmf(Y[i, 4] | lambda_matrix[4] * xi[i], theta4); log_lik[i, 5] = ordered_logistic_lpmf(Y[i, 5] | lambda_matrix[5] * xi[i], theta5); log_lik[i, 6] = ordered_logistic_lpmf(Y[i, 6] | lambda_matrix[6] * xi[i], theta6); } }
and in R I call,
log_lik_m1a = extract_log_lik(model1a) loo_1a = loo(log_lik_m1a)
So now I am confused how loo treats “one observation”? This is because pareto_k_values(loo_1a) has length 1200, thus, if I understand correctly, loo calls one entry of the input matrix (int Y[N, K]) as an observation but I am not sure.
-
Related to the previous point, as seen here “data partition should match the predictive task”. From the codes I have posted, it is not clear to me how Stan and loo perform the data partition and the predictive tasks.
-
“we just have to improve the computation” Does it mean that PSIS loo does not work well here and that we need to improve it eve for a well-known model, I mean, IRT model.
-
I have tried with a bigger sample size but because the number of parameters increases with the number of subjects, I guess the same problem of high Pareto-k values will appear.
The stan model is here
// Simul 40
data {
int N; // Sample size
int K; // Number of items
int R; // Number of latent factors
int Y[N, K];
int ncate1;
int ncate2;
int ncate3;
int ncate4;
int ncate5;
int ncate6;
}
parameters {
ordered[ncate1-1] theta1;
ordered[ncate2-1] theta2;
ordered[ncate3-1] theta3;
ordered[ncate4-1] theta4;
ordered[ncate5-1] theta5;
ordered[ncate6-1] theta6;
real mu_theta;
real<lower=0.000001> sigma_theta;
real<lower=0.000001> lambda11;
real<lower=0.000001> lambda21;
real<lower=0.000001> lambda31;
real<lower=0.000001> lambda41;
real<lower=0.000001> lambda51;
real<lower=0.000001> lambda61;
real<lower=0.000001> sigma_lambda;
vector[R] xi[N];
}
transformed parameters {
matrix[K, R] lambda_matrix;
lambda_matrix[1, 1]=lambda11;
lambda_matrix[2, 1]=lambda21;
lambda_matrix[3, 1]=lambda31;
lambda_matrix[4, 1]=lambda41;
lambda_matrix[5, 1]=lambda51;
lambda_matrix[6, 1]=lambda61;
}
model{
// Prior
theta1 ~ normal(mu_theta, sigma_theta);
theta2 ~ normal(mu_theta, sigma_theta);
theta3 ~ normal(mu_theta, sigma_theta);
theta4 ~ normal(mu_theta, sigma_theta);
theta5 ~ normal(mu_theta, sigma_theta);
theta6 ~ normal(mu_theta, sigma_theta);
mu_theta ~ cauchy(0, 5);
sigma_theta ~ cauchy(0, 5);
lambda11 ~ normal(1, sigma_lambda);
lambda21 ~ normal(1, sigma_lambda);
lambda31 ~ normal(1, sigma_lambda);
lambda41 ~ normal(1, sigma_lambda);
lambda51 ~ normal(1, sigma_lambda);
lambda61 ~ normal(1, sigma_lambda);
sigma_lambda ~ cauchy(0, 5);
for (i in 1 : N){
xi[i] ~ normal(0, 1); // mean vector = rep(0, R)
}
// likelihood
for (i in 1 : N){
{Y[i, 1] ~ ordered_logistic(lambda_matrix[1, ] * xi[i], theta1);}
{Y[i, 2] ~ ordered_logistic(lambda_matrix[2, ] * xi[i], theta2);}
{Y[i, 3] ~ ordered_logistic(lambda_matrix[3, ] * xi[i], theta3);}
{Y[i, 4] ~ ordered_logistic(lambda_matrix[4, ] * xi[i], theta4);}
{Y[i, 5] ~ ordered_logistic(lambda_matrix[5, ] * xi[i], theta5);}
{Y[i, 6] ~ ordered_logistic(lambda_matrix[6, ] * xi[i], theta6);}
}
}
generated quantities {
matrix[N, K] log_lik;
for (i in 1 : N){
log_lik[i, 1] = ordered_logistic_lpmf(Y[i, 1] | lambda_matrix[1] * xi[i], theta1);
log_lik[i, 2] = ordered_logistic_lpmf(Y[i, 2] | lambda_matrix[2] * xi[i], theta2);
log_lik[i, 3] = ordered_logistic_lpmf(Y[i, 3] | lambda_matrix[3] * xi[i], theta3);
log_lik[i, 4] = ordered_logistic_lpmf(Y[i, 4] | lambda_matrix[4] * xi[i], theta4);
log_lik[i, 5] = ordered_logistic_lpmf(Y[i, 5] | lambda_matrix[5] * xi[i], theta5);
log_lik[i, 6] = ordered_logistic_lpmf(Y[i, 6] | lambda_matrix[6] * xi[i], theta6);
}
}
The R codes are here
#################################
### model1a ###
#################################
num.iter = 5000
data1a = list(N=N, K=K, Y=Y, ID=blirtm$ID,
R=1,
ncate1=4,
ncate2=4,
ncate3=4,
ncate4=4,
ncate5=4,
ncate6=4)
init1a = list(list(lambda11=1),
list(lambda11=1),
list(lambda11=1))
model1a = stan(file="model1a.stan", data = data1a,
iter=num.iter, chains = 3, init = init1a,
pars = c('theta1', 'theta2', 'theta3', 'theta4', 'theta5', 'theta6',
'mu_theta', 'sigma_theta',
'lambda_matrix', 'sigma_lambda',
"xi", "log_lik"))
result1a=summary(model1a)$summary
round(result1a[1:30, c(1, 4, 8, 9, 10)], 4)
log_lik_m1a = extract_log_lik(model1a)
# ***********
# DIC
# ***********
# First, we calculate log likelihood evaluated at the posterior mean of the parameters
# We need to calculate for each i then sum over i (here i used to indicate y_i)
# From the model, y ~ normal(theta, sigma),
# the likelihood for y[i] is the pdf of a normal dist
result1a_post_mean = result1a[, "mean"]
ll_at_post_1a = 0
for (k in 1 : K){
for (i in 1 : N){
ll_at_post_1a = ll_at_post_1a +
log_ordered_logistic(k = Y[i, k],
eta = result1a_post_mean[c(paste0("lambda_matrix[", k, ",1]"))] *
result1a_post_mean[c(paste0("xi[", i, ",1]"))]
,
c = c(result1a_post_mean[c(paste0("theta", k, "[1]"))],
result1a_post_mean[c(paste0("theta", k, "[2]"))],
result1a_post_mean[c(paste0("theta", k, "[3]"))]))
}
}
# Now calculate p_DIC
# First, we need to extract log_lik
# log_lik_m1a = extract_log_lik(model1a)
# str(log_lik_m1a)
# head(log_lik_m1a)
#dim(log_lik_m1a)
# It is a matrix of size [number of iterations / 2 * 4 (chains)] * [J]
p_DIC_1a = 2 * (ll_at_post_1a - sum(colMeans(log_lik_m1a)))
# This colMeans(log_lik_m1a) is to calculate for each y_i, then we sum over i
DIC_1a = -2 * ll_at_post_1a + 2 * p_DIC_1a
write.table(x=DIC_1a, file = paste('DIC_1a', chain_id, '.txt', sep = ''),
quote = F, sep = ' ', row.names = FALSE, col.names = FALSE)
write.table(x=p_DIC_1a, file = paste('p_DIC_1a', chain_id, '.txt', sep = ''),
quote = F, sep = ' ', row.names = FALSE, col.names = FALSE)
## End of DIC
# WAIC and loo
waic_1a = waic(log_lik_m1a)
loo_1a = loo(log_lik_m1a)
print(paste0("chain_id = ", chain_id))
The first ten rows of the dataset
Y1 Y2 Y3 Y4 Y5 Y6 ID No
1 0 2 1 1 2 1 1
2 1 0 0 1 2 2 2
0 1 0 0 1 1 3 3
1 1 1 1 2 2 4 4
1 2 2 3 2 2 5 5
1 1 1 0 1 2 6 6
1 0 0 0 0 1 7 7
1 1 0 0 1 2 8 8
1 1 1 0 1 2 9 9
2 2 3 3 3 2 10 10
Convergence was OK
mean 2.5% 97.5% n_eff Rhat
theta1[1] -3.0714 -3.7536 -2.4626 5215.107 1.0010
theta1[2] -0.9162 -1.3278 -0.5381 3920.249 1.0008
theta1[3] 3.0726 2.4509 3.7605 5936.204 0.9999
theta2[1] -3.5100 -4.3645 -2.7548 3568.218 1.0000
theta2[2] -1.4112 -1.9784 -0.8822 2842.017 1.0010
theta2[3] 3.0202 2.3651 3.7563 4900.481 1.0006
theta3[1] -3.2293 -4.2909 -2.3047 2649.794 1.0013
theta3[2] -0.3721 -1.0963 0.3385 1859.145 1.0025
theta3[3] 2.7762 1.9415 3.7609 2656.568 1.0010
theta4[1] -3.0630 -4.8289 -1.7927 1543.400 1.0045
theta4[2] -0.7148 -1.9183 0.3200 1612.288 1.0035
theta4[3] 2.0314 0.8831 3.5105 1866.499 1.0005
theta5[1] -4.6385 -5.9427 -3.5438 3098.388 1.0005
theta5[2] -1.0495 -1.7885 -0.3762 2101.662 1.0014
theta5[3] 4.6874 3.6068 6.0165 3956.021 1.0002
theta6[1] -4.2235 -5.2482 -3.3566 5286.935 1.0002
theta6[2] 0.9351 0.4779 1.4175 3143.651 1.0006
theta6[3] 5.5038 4.3080 6.9252 7079.035 0.9997
mu_theta -0.2349 -1.8639 1.3884 7584.857 0.9998
sigma_theta 3.3552 2.3834 4.7763 8590.206 1.0001
lambda_matrix[1,1] 1.4160 1.0183 1.8540 5353.988 0.9999
lambda_matrix[2,1] 2.2921 1.7527 2.9183 4273.519 0.9997
lambda_matrix[3,1] 4.0638 3.0747 5.2749 3288.835 0.9999
lambda_matrix[4,1] 6.4429 4.2837 9.9314 1428.458 1.0022
lambda_matrix[5,1] 3.6115 2.7600 4.6732 3203.823 0.9999
lambda_matrix[6,1] 1.9932 1.4760 2.5837 4944.440 1.0005
sigma_lambda 3.4081 1.7567 6.7359 4320.694 1.0007
xi[1,1] 0.0982 -0.3571 0.5636 7602.082 0.9999
xi[2,1] 0.6375 0.0921 1.2381 8381.351 0.9999
xi[3,1] 1.0401 0.4225 1.7570 10673.522 1.0000