High Pareto-k values for an IRT model with simulated datasets

Dear all,

I have conducted a simulation study to check how loo works for an item response theory (IRT) model. Specifically, I generated datasets from an IRT model in which six ordinal indicators load on a single latent variable. Each simulated dataset contains 200 subjects, and each subject has responses on the six ordinal indicators.

Then, in R, I fit the true model to check how loo works. The R code and Stan model are included at the end of this post. When calling loo, I received the following output.


Pareto k diagnostic values:
                         Count Pct.    Min. ESS
(-Inf, 0.7]   (good)     1187  98.9%   219     
   (0.7, 1]   (bad)        13   1.1%   <NA>    
   (1, Inf)   (very bad)    0   0.0%   <NA>    
See help('pareto-k-diagnostic') for details.

(For many simulated datasets, I saw an even larger count in the (0.7, 1] (bad) bin, such as 18, 42, or even 60.)

p_loo for this simulated dataset is 194.7.

Computed from 7500 by 1200 log-likelihood matrix.

         Estimate   SE
elpd_loo   -968.7 25.4
p_loo       194.7  8.2
looic      1937.5 50.8

If we count the total number of parameters in the parameters block of the Stan code, we get 227 (= 27 non-latent parameters, i.e. 6 × 3 cutpoints + mu_theta + sigma_theta + 6 loadings + sigma_lambda, plus 200 for the latent variable xi[N]). Thus we fall into the category \widehat{k}>0.7 and p_loo < p. As discussed here by @avehtari, “This can happen even when using the true model, in which case there is now way to improve the model [I guess this is a typo, it should be ‘no way’], and we just have to improve the computation”.

Because I simulated from an IRT model and fitted the same model, I did not expect any problems with the dataset or the model. After receiving the loo output, I have several questions:

  • Why do we obtain high Pareto-k values here even though we are working with simulated data and the true model?

  • How exactly does the leave-one-out cross-validation approach work in this case? I am asking because the input (int Y[N, K]) is a matrix of size [200, 6]. Do we (i) call an entry of that matrix an observation, so that we have 1200 observations and each entry is left out in turn, or (ii) call a row (= a subject) an observation, so that loo leaves one row out at a time? In the Stan generated quantities block, I specified the log_lik as

    generated quantities {
    	matrix[N, K] log_lik;
     
    	for (i in 1 : N){
    
    		log_lik[i, 1] = ordered_logistic_lpmf(Y[i, 1] | lambda_matrix[1] * xi[i], theta1);
    		log_lik[i, 2] = ordered_logistic_lpmf(Y[i, 2] | lambda_matrix[2] * xi[i], theta2);
    		log_lik[i, 3] = ordered_logistic_lpmf(Y[i, 3] | lambda_matrix[3] * xi[i], theta3);
    		log_lik[i, 4] = ordered_logistic_lpmf(Y[i, 4] | lambda_matrix[4] * xi[i], theta4);
    		log_lik[i, 5] = ordered_logistic_lpmf(Y[i, 5] | lambda_matrix[5] * xi[i], theta5);
    		log_lik[i, 6] = ordered_logistic_lpmf(Y[i, 6] | lambda_matrix[6] * xi[i], theta6);
    
    	}
    }
    

    and in R I call,

    log_lik_m1a = extract_log_lik(model1a)
    loo_1a = loo(log_lik_m1a)
    

    So I am confused about what loo treats as “one observation”. Since pareto_k_values(loo_1a) has length 1200, I understand that loo treats one entry of the input matrix (int Y[N, K]) as an observation, but I am not sure (see the short R sketch after this list).

  • Related to the previous point, as stated here, the “data partition should match the predictive task”. From the code I have posted, it is not clear to me how Stan and loo perform the data partition and define the predictive task.

  • Regarding “we just have to improve the computation”: does this mean that PSIS-LOO does not work well here and that we need to improve it even for a well-known model such as an IRT model?

  • I have tried a larger sample size, but because the number of parameters increases with the number of subjects, I expect the same problem of high Pareto-k values to appear.
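
For reference, here is a minimal R sketch of what I think loo is doing with my log_lik (I believe extract_log_lik() flattens the matrix[N, K] log_lik in Stan's column-major order, so column (k - 1) * N + i of the 7500 x 1200 matrix holds log_lik[i, k], i.e. one (subject, item) entry):

log_lik_m1a = extract_log_lik(model1a)     # 7500 draws by 1200 columns (= N * K entries)
dim(log_lik_m1a)                           # 7500 1200
loo_1a = loo(log_lik_m1a)
length(pareto_k_values(loo_1a))            # 1200, one Pareto-k per (subject, item) entry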


The Stan model is here

// Simul 40

data {
	int N;					// Sample size
	int K;					// Number of items
	int R;					// Number of latent factors

	int Y[N, K];

	int ncate1;
	int ncate2;
	int ncate3;
	int ncate4;
	int ncate5;
	int ncate6;
}

parameters {

	ordered[ncate1-1] theta1;
	ordered[ncate2-1] theta2;
	ordered[ncate3-1] theta3;
	ordered[ncate4-1] theta4;
	ordered[ncate5-1] theta5;
	ordered[ncate6-1] theta6;

	real mu_theta;
	real<lower=0.000001> sigma_theta;

	real<lower=0.000001> lambda11;
	real<lower=0.000001> lambda21;
	real<lower=0.000001> lambda31;
	real<lower=0.000001> lambda41;
	real<lower=0.000001> lambda51;
	real<lower=0.000001> lambda61;

	real<lower=0.000001> sigma_lambda;

	vector[R] xi[N];
}

transformed parameters {

	matrix[K, R] lambda_matrix;

	lambda_matrix[1, 1]=lambda11;
	lambda_matrix[2, 1]=lambda21;
	lambda_matrix[3, 1]=lambda31;
	lambda_matrix[4, 1]=lambda41;
	lambda_matrix[5, 1]=lambda51;
	lambda_matrix[6, 1]=lambda61;
}

model{
	// Prior

	theta1 ~ normal(mu_theta, sigma_theta);
	theta2 ~ normal(mu_theta, sigma_theta);
	theta3 ~ normal(mu_theta, sigma_theta);
	theta4 ~ normal(mu_theta, sigma_theta);
	theta5 ~ normal(mu_theta, sigma_theta);
	theta6 ~ normal(mu_theta, sigma_theta);

	mu_theta ~ cauchy(0, 5);
	sigma_theta ~ cauchy(0, 5);

	lambda11 ~ normal(1, sigma_lambda);
	lambda21 ~ normal(1, sigma_lambda);
	lambda31 ~ normal(1, sigma_lambda);
	lambda41 ~ normal(1, sigma_lambda);
	lambda51 ~ normal(1, sigma_lambda);
	lambda61 ~ normal(1, sigma_lambda);

	sigma_lambda ~ cauchy(0, 5);

	for (i in 1 : N){
		
		xi[i] ~ normal(0, 1); // mean vector = rep(0, R)
	}

	// likelihood

	for (i in 1 : N){

		{Y[i, 1] ~ ordered_logistic(lambda_matrix[1, ] * xi[i], theta1);}
		{Y[i, 2] ~ ordered_logistic(lambda_matrix[2, ] * xi[i], theta2);}
		{Y[i, 3] ~ ordered_logistic(lambda_matrix[3, ] * xi[i], theta3);}
		{Y[i, 4] ~ ordered_logistic(lambda_matrix[4, ] * xi[i], theta4);}
		{Y[i, 5] ~ ordered_logistic(lambda_matrix[5, ] * xi[i], theta5);}
		{Y[i, 6] ~ ordered_logistic(lambda_matrix[6, ] * xi[i], theta6);}
	}
}

generated quantities {
	matrix[N, K] log_lik;
 
	for (i in 1 : N){

		log_lik[i, 1] = ordered_logistic_lpmf(Y[i, 1] | lambda_matrix[1] * xi[i], theta1);
		log_lik[i, 2] = ordered_logistic_lpmf(Y[i, 2] | lambda_matrix[2] * xi[i], theta2);
		log_lik[i, 3] = ordered_logistic_lpmf(Y[i, 3] | lambda_matrix[3] * xi[i], theta3);
		log_lik[i, 4] = ordered_logistic_lpmf(Y[i, 4] | lambda_matrix[4] * xi[i], theta4);
		log_lik[i, 5] = ordered_logistic_lpmf(Y[i, 5] | lambda_matrix[5] * xi[i], theta5);
		log_lik[i, 6] = ordered_logistic_lpmf(Y[i, 6] | lambda_matrix[6] * xi[i], theta6);

	}
}

The R code is here


#################################
###      model1a              ###
#################################
num.iter = 5000

data1a = list(N=N, K=K, Y=Y, ID=blirtm$ID,
              R=1,
              
              ncate1=4,
              ncate2=4,
              ncate3=4,
              ncate4=4,
              ncate5=4,
              ncate6=4)

init1a = list(list(lambda11=1),
              list(lambda11=1),
              list(lambda11=1))

model1a = stan(file="model1a.stan", data = data1a,
               iter=num.iter, chains = 3, init = init1a,
               pars = c('theta1', 'theta2', 'theta3', 'theta4', 'theta5', 'theta6',
                        'mu_theta', 'sigma_theta',
                        'lambda_matrix', 'sigma_lambda',
                        
                        
                        "xi", "log_lik"))

result1a=summary(model1a)$summary
round(result1a[1:30, c(1, 4, 8, 9, 10)], 4)


log_lik_m1a = extract_log_lik(model1a)

# ***********
# DIC
# ***********

# First, we calculate the log likelihood evaluated at the posterior means of the parameters.
# We need to calculate it for each entry Y[i, k] and then sum over i and k.
# Here the likelihood of Y[i, k] is the ordered-logistic pmf, evaluated at the posterior
# means of the loading, the latent score xi[i], and the cutpoints; log_ordered_logistic()
# below is a user-defined helper (not shown here) that returns its log.

result1a_post_mean = result1a[, "mean"]

ll_at_post_1a = 0

for (k in 1 : K){
  for (i in 1 : N){
    ll_at_post_1a = ll_at_post_1a +
      log_ordered_logistic(k = Y[i, k],
                           
                           
                           eta = result1a_post_mean[c(paste0("lambda_matrix[", k, ",1]"))] * 
                             result1a_post_mean[c(paste0("xi[", i, ",1]"))]
                           
                           ,
                           c = c(result1a_post_mean[c(paste0("theta", k, "[1]"))],
                                 result1a_post_mean[c(paste0("theta", k, "[2]"))],
                                 result1a_post_mean[c(paste0("theta", k, "[3]"))]))
  }
}


# Now calculate p_DIC
# First, we need to extract log_lik
# log_lik_m1a = extract_log_lik(model1a)

#  str(log_lik_m1a)
# head(log_lik_m1a)
#dim(log_lik_m1a)

# It is a matrix of size [post-warmup iterations * number of chains] x [number of observations],
# here 7500 x 1200

p_DIC_1a = 2 * (ll_at_post_1a - sum(colMeans(log_lik_m1a)))
# colMeans(log_lik_m1a) averages the log likelihood over posterior draws for each observation;
# we then sum over observations

DIC_1a = -2 * ll_at_post_1a + 2 * p_DIC_1a


write.table(x=DIC_1a, file = paste('DIC_1a', chain_id, '.txt', sep = ''),
            quote = F, sep = ' ', row.names = FALSE, col.names = FALSE)

write.table(x=p_DIC_1a, file = paste('p_DIC_1a', chain_id, '.txt', sep = ''),
            quote = F, sep = ' ', row.names = FALSE, col.names = FALSE)

## End of DIC

# WAIC and loo
waic_1a = waic(log_lik_m1a)
loo_1a = loo(log_lik_m1a)

print(paste0("chain_id = ", chain_id))

The first ten rows of the dataset


Y1 Y2 Y3 Y4 Y5 Y6 ID No
1 0 2 1 1 2 1 1
2 1 0 0 1 2 2 2
0 1 0 0 1 1 3 3
1 1 1 1 2 2 4 4
1 2 2 3 2 2 5 5
1 1 1 0 1 2 6 6
1 0 0 0 0 1 7 7
1 1 0 0 1 2 8 8
1 1 1 0 1 2 9 9
2 2 3 3 3 2 10 10

Convergence was OK

 mean    2.5%   97.5%     n_eff   Rhat
theta1[1]          -3.0714 -3.7536 -2.4626  5215.107 1.0010
theta1[2]          -0.9162 -1.3278 -0.5381  3920.249 1.0008
theta1[3]           3.0726  2.4509  3.7605  5936.204 0.9999
theta2[1]          -3.5100 -4.3645 -2.7548  3568.218 1.0000
theta2[2]          -1.4112 -1.9784 -0.8822  2842.017 1.0010
theta2[3]           3.0202  2.3651  3.7563  4900.481 1.0006
theta3[1]          -3.2293 -4.2909 -2.3047  2649.794 1.0013
theta3[2]          -0.3721 -1.0963  0.3385  1859.145 1.0025
theta3[3]           2.7762  1.9415  3.7609  2656.568 1.0010
theta4[1]          -3.0630 -4.8289 -1.7927  1543.400 1.0045
theta4[2]          -0.7148 -1.9183  0.3200  1612.288 1.0035
theta4[3]           2.0314  0.8831  3.5105  1866.499 1.0005
theta5[1]          -4.6385 -5.9427 -3.5438  3098.388 1.0005
theta5[2]          -1.0495 -1.7885 -0.3762  2101.662 1.0014
theta5[3]           4.6874  3.6068  6.0165  3956.021 1.0002
theta6[1]          -4.2235 -5.2482 -3.3566  5286.935 1.0002
theta6[2]           0.9351  0.4779  1.4175  3143.651 1.0006
theta6[3]           5.5038  4.3080  6.9252  7079.035 0.9997
mu_theta           -0.2349 -1.8639  1.3884  7584.857 0.9998
sigma_theta         3.3552  2.3834  4.7763  8590.206 1.0001
lambda_matrix[1,1]  1.4160  1.0183  1.8540  5353.988 0.9999
lambda_matrix[2,1]  2.2921  1.7527  2.9183  4273.519 0.9997
lambda_matrix[3,1]  4.0638  3.0747  5.2749  3288.835 0.9999
lambda_matrix[4,1]  6.4429  4.2837  9.9314  1428.458 1.0022
lambda_matrix[5,1]  3.6115  2.7600  4.6732  3203.823 0.9999
lambda_matrix[6,1]  1.9932  1.4760  2.5837  4944.440 1.0005
sigma_lambda        3.4081  1.7567  6.7359  4320.694 1.0007
xi[1,1]             0.0982 -0.3571  0.5636  7602.082 0.9999
xi[2,1]             0.6375  0.0921  1.2381  8381.351 0.9999
xi[3,1]             1.0401  0.4225  1.7570 10673.522 1.0000

I’m not in a good position to answer your technical questions. That’s @avehtari’s domain. But in my experience, this can happen when you have few upper-level groups. In your case, that would be the 6 items. My bet is you’d have fewer high k values if you simulated data with, say, 30 items.

Assuming this isn’t time series data and each row represents a group, both approaches are valid, but they’re answering different questions. For (i), the question is whether you can predict a new observation from the same group, and for (ii), it’s whether you can predict a whole new group given observations of other groups. For example, scikit-learn’s model selection API supports both.
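
As a rough R analogue (a sketch only; subject_id below is a hypothetical vector marking which subject each of the 1200 entries belongs to, assuming the column-major ordering of the log-likelihood columns), the loo package has helpers for building both kinds of folds:

library(loo)
subject_id = rep(1:200, times = 6)                       # subject of each (subject, item) entry
folds_i  = kfold_split_random(K = 10, N = 1200)          # (i) leave out individual entries
folds_ii = kfold_split_grouped(K = 10, x = subject_id)   # (ii) leave out whole subjects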

These are estimates. @avehtari will be better able to answer, but I’d consider 1187 reasonable values and 13 mildly bad ones a win. I don’t know how many we’d expect if everything were well calibrated.

This is up to you. You save a variable log_lik of a given size, and how you populate it determines the data partition.

P.S. Some comments on the Stan code.

  1. You shouldn’t need those <lower=0.000001>. If you’re continually running into 0 values, you have bigger problems with the model that you don’t want to sweep under the rug this way.

  2. You don’t need to define scalars and then combine them into a matrix; you can just define the matrix with a constraint as follows.

parameters {
  matrix<lower=0>[K, R] lambda;
}
...
model {
  to_vector(lambda) ~ normal(1, sigma_lambda);
}

Why are you centering around 1 here?

  3. It looks like all of theta1 to theta6 are different sizes?

  4. You don’t need all those braces around sampling statements in Stan; they don’t do anything when wrapping a single statement.

  5. Those Cauchy priors have super wide tails, so if you simulate from them, you will run into problems.

  6. You can turn the prior on xi into a one-liner (once xi is declared as a matrix, as in the next point):

to_vector(xi) ~ std_normal();
  7. You can declare xi as a matrix and then vectorize the Y sampling statements, e.g., with the new matrix lambda,
  matrix[N, R] xi;
  ...
  Y[ , 1] ~ ordered_logistic(xi * lambda[1]', theta1);
  8. Doesn’t the log_lik variable have to be a scalar? Or is there implicit group-level comparison if you give it a matrix?

Thanks! Fixed.

CV-FAQ says: “it is likely that the model is so flexible or the population prior so weak that it’s difficult to predict the left out observation”. Another way to say this is that some of the leave-one-out posteriors are so different from the full-data posterior that the importance weights have a problematic distribution (indicated by high Pareto-k).
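
In symbols: PSIS-LOO reweights the posterior draws theta^(s) with importance ratios proportional to 1 / p(y_i | theta^(s)); when leaving out y_i changes the posterior a lot, these ratios have a very heavy right tail, and that is what a fitted Pareto shape \widehat{k} > 0.7 is flagging.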

This depends on how you compute log_lik. You have now computed it so that you have 1200 separate observations; the loo() output also indicated this with “Computed from 7500 by 1200 log-likelihood matrix.” If you want, you can sum the log_lik values within each subject to get leave-one-subject-out cross-validation.
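
A minimal sketch of that summing in R (assuming the column-major ordering of the extract_log_lik() columns, with N = 200 subjects and K = 6 items):

N = 200
K = 6
log_lik_subject = matrix(NA_real_, nrow = nrow(log_lik_m1a), ncol = N)
for (i in 1:N) {
  cols = i + (0:(K - 1)) * N                     # columns holding log_lik[i, 1], ..., log_lik[i, K]
  log_lik_subject[, i] = rowSums(log_lik_m1a[, cols])
}
loo_subject = loo(log_lik_subject)               # leave-one-subject-out, 200 "observations"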

You have defined the partition yourself in the generated quantities block.

PSIS-LOO can also fail for even more well-known models. It is not about how well known the model is, but about how different the full-data posterior and the LOO-CV posteriors are.

I added to CV-FAQ: “Sometimes it helps to switch to priors that put less prior mass for very flexible models.” (I do have a plan to write a book, which will have longer explanations, but that will take some years.) So that is one option which does not require better computation.

  1. It is possible that PSIS-LOO would work with better priors. Maybe that independent normal prior on xi is too wide (and it gets worse with an increasing number of dimensions). Also, Cauchy(0, 5) priors are weak, especially as the Cauchy has very thick tails.

Better computation options are

  1. As you have only a few high Pareto-k-hats, it is possible that this is just by chance, since a finite number of posterior draws is used to compute the importance weights. The Pareto smoothed importance sampling paper discusses how to compute the posterior for Pareto-k, so you can assess the uncertainty. You can try running more iterations and check whether these Pareto-k’s drop below 0.7 (see the R sketch after this list).
  2. You can use mixture importance sampling LOO
  3. You can use moment matching LOO
  4. You can run MCMC only for the problematic LOO-CV folds
  5. You can use K-fold-CV
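
A hedged R sketch of options 1 and 3 (option 3 assumes a reasonably recent rstan/loo in which the loo() method for stanfit objects accepts a moment_match argument):

# Option 1: see which entries are problematic and how far above 0.7 they sit;
# rerunning with more iterations may pull borderline k-hats back below 0.7.
k_hat = pareto_k_values(loo_1a)
which(k_hat > 0.7)
summary(k_hat[k_hat > 0.7])

# Option 3: moment matching improves the importance sampling for the problematic
# observations using the existing posterior draws, without refitting the model.
loo_mm_1a = loo(model1a, pars = "log_lik", moment_match = TRUE)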

There is lots of good advice already; I would just add that Section 4.2 of the paper below has an IRT example where we compute PSIS-LOO using a likelihood that is marginal over the person parameters. The marginalization usually seems desirable for this type of model.

Also, if you are willing to use a graded response model with probit link, the blavaan package will calculate this marginal version of PSIS-LOO for you (though it can get computationally intensive). Some additional detail is at the link below, especially see the “likelihood computations” section.

Ordinal Models in blavaan • blavaan

Thanks @edm, I thought about mentioning this approach, but I did not remember that you had a specifically IRT example there!

I also hesitated to mention marginalization, as the implementation in the case of more than one dimension is complicated if there is no software support. Great to know that blavaan supports this!
