Issues fitting a multidimensional IRT model with K-fold cross-validation

Dear Stan users:

Hi, I’m picking up a project that I had to shelve for a few months. I’m fitting multidimensional Item Response Theory (IRT) models and am running into large Pareto k values during leave-one-out (LOO) cross-validation. This happens even when I am using exactly the right model, because I generated the data myself from an IRT model with the same number of dimensions as the model I am fitting. The reason, I believe, is that by random chance some respondents will answer a difficult test item correctly despite having low ability, and IRT models have so many parameters that a few such data points are almost guaranteed to show up as outliers with high Pareto k values. Additionally, I am using priors to identify the models (setting the origin and the units and orientation of the axes in the coordinate system describing the model, as per Reckase (2009)). Because I am new to multidimensional IRT models, it is possible that while my choice of priors works for identification, the priors are still weak enough that the outliers have room to work their mischief, leading to the high Pareto k values. I probably have more research to do on how other researchers have chosen specific priors for specific models, and my gut tells me that at least half of this sort of knowledge just comes with experience.

Dr. Vehtari recommended K-fold cross-validation (in particular, 10-fold cross-validation) as a work-around for the large Pareto k values encountered with LOO, so I have gone ahead and implemented K-fold cross-validation for multidimensional IRT models. Those models appear to be working and generating valid results (as I show below). That is a small victory for me, as I am new both to Stan and to multidimensional IRT models, with their associated issues of prior selection and label switching within and between chains, issues not encountered with unidimensional models. The problem is that my three-dimensional IRT model with K-fold cross-validation appears to fit two-dimensional simulated data better than the two-dimensional K-fold cross-validation model does:

########################
# K-fold cross-validation comparison
#
#                2-d model      3-d model
# elpd_kfold     -22758.8        -22492.77
# se_elpd_kfold    107.67          105.96
#
# [pseudocode for a] custom function comparing elpd results for k-fold cross-validation models
# that I hacked together from functions in loo_compare.R in the 'loo' package
#
#> compare_kfold([2-d model], [3-d model])
#                 elpd_diff     se 
#                    266.0      19.3
#
########################

Because I was curious and as a check on my coding, I went ahead and calculated LOO cross-validation statistics using the log-likelihood for the holdout data from each of the K folds (I have placed my code calculating that log-likelihood at the end of this post). The resulting LOO comparisons replicate results that I had obtained for my models that did not go through the extra step of splitting my data into 10 folds and combining the likelihoods for the holdout data of each of the resulting models. That, I believe, is at least an oblique confirmation that my implementation of the k-fold cross-validation models is working. Here are those results, but note that I don’t completely trust the LOO results anyway because of the high Pareto k values:

########################
# LOO comparison (using likelihood assembled from k-fold cross-validation models)
#             elpd_diff se_diff  elpd_loo p_loo    looic   
# 2-d model      0.0       0.0 -24975.9   2217.1  49951.9
# 3-d model    -84.9      26.6 -25060.8   2568.0  50121.6
# 
# LOO comparison (using likelihoods from individual models)
#            elpd_diff se_diff  elpd_loo p_loo    looic   
# 2-d model      0.0       0.0 -24581.8   1971.3  49163.7
# 3-d model    -23.6       3.8 -24605.5   2291.3  49210.9
# 1-d model   -586.6      33.5 -25168.5   1231.1  50336.9
########################

(I accidentally ran the 1-d K-fold model using the Stan file that does not implement K-fold cross-validation, so I’ll have to re-run that model and can report the results later; my question, though, is about the discrepancy between the two-dimensional and three-dimensional models.)

The LOO results in both cases show that the two-dimensional model fits the data best, beating the three-dimensional model by a close but statistically significant margin. That would be great, except that I just can’t trust those results when the Pareto k values are large. Here are those values for the LOO statistics computed from the K-fold cross-validation models:

> loo18 [2-d model]

Computed from 1500 by 51300 log-likelihood matrix

         Estimate    SE
elpd_loo -24975.9 123.6
p_loo      2217.1  21.2
looic     49951.9 247.2
------
Monte Carlo SE of elpd_loo is NA.

Pareto k diagnostic values:
                         Count Pct.    Min. n_eff
(-Inf, 0.5]   (good)     51036 99.5%   165       
 (0.5, 0.7]   (ok)         249  0.5%   99        
   (0.7, 1]   (bad)         15  0.0%   33        
   (1, Inf)   (very bad)     0  0.0%   <NA>      
See help('pareto-k-diagnostic') for details.

> loo19 [3-d model]

Computed from 1500 by 51300 log-likelihood matrix

         Estimate    SE
elpd_loo -25060.8 124.0
p_loo      2568.0  23.6
looic     50121.6 248.0
------
Monte Carlo SE of elpd_loo is NA.

Pareto k diagnostic values:
                         Count Pct.    Min. n_eff
(-Inf, 0.5]   (good)     50866 99.2%   150       
 (0.5, 0.7]   (ok)         412  0.8%   89        
   (0.7, 1]   (bad)         20  0.0%   56        
   (1, Inf)   (very bad)     2  0.0%   12        
See help('pareto-k-diagnostic') for details.

These models, using simulated data, are already a best-case scenario in which relatively few observations are associated with high Pareto k values. Actual data exhibits more skew, with a few respondents with low ability improbably answering very difficult items correctly.

In short, to all appearances I do not have a bug in my code implementing the K-fold cross-validation models, yet using that approach instead of LOO cross-validation would lead me to select a model that is inconsistent with the actual dimensionality of the data, whereas LOO would lead me to choose the correct one!

For these models, in order to get a statistically significant difference between the two- and three-dimensional models, I’m working with about 1,400 simulated respondents responding to about 40 simulated test items. On my four-core processor it takes a number of hours to run a single chain, which translates to about a week to do 10-fold cross-validation with three chains per fold. Tweaking my model and re-fitting it is therefore costly, which is why I am posting. One hypothesis I have is that I might need a larger simulated data set: when I fit the model to 90% of the data, the signal is slightly weaker, so the 3-dimensional model ends up fitting more of the noise and therefore out-performs the 2-dimensional model. But I don’t like the idea of running a model for two weeks just to test a hunch. A second possibility is simply that I have a bug in my code somewhere, but my replication of the LOO results suggests otherwise.

Previously, I had to deal with label switching between chains and, for 5-dimensional models and higher, within chains as well. I wonder whether something similar is going wrong during cross-validation: when I combine the likelihoods across the multiple folds, I may be accidentally combining results for chains that have different labels as if they had the same label, producing a large amount of noise that the three-dimensional model fits better than the two-dimensional model simply because it has more parameters. But if that were so, why do my LOO results for the models using 10-fold cross-validation match those for the models run without it?

One odd behavior that I’ve observed is that the two-dimensional model often takes much longer to finish running than the three-dimensional model. The variance in the running time of each chain is larger, such that it’s really just a few chains that take a while to complete (ranging from three hours to as many as three days for a single chain), for a total of nine days using all four cores of a four-core CPU. With the three-dimensional model, the running time was always much shorter (about five hours per chain, such that thirty chains on three cores finished in about two days). I thought at first that, for the two-dimensional model, using all four cores was slowing the computation because the operating system was generating a large number of context switches while fighting for computation time, but repeating the run with three cores also produces high variance in the time for a chain to finish. I don’t know what could be producing this behavior.

I’m not sure how to proceed. I definitely need the model comparisons to work properly with the simulated data before I would be able to trust my conclusions with my actual data.

Below is my code for the one-dimensional and multi-dimensional Stan models using k-fold cross-validation, as well as the code that creates the folds and computes the log-likelihood. Per the forum guidelines, at the end I’m also including the code that simulates the data and that actually runs the models, but fair warning: these models took 11 days to run and used 23 GB of storage space! If anyone has ideas about how to proceed, I’d greatly appreciate them, as it could take quite some time for me to work out what is happening here on my own.

A last point: to deal with the skew in my actual data, I’ve written a (definitely suboptimal) heuristic that separates all responses into two groups for the purpose of stratified assignment of observations to folds using the caret package. For the first analysis with the simulated data (shown here), assignment to each group was random and each group contained half the data. That approach led to the three-dimensional model fitting the two-dimensional data best, but I doubt that the way I was doing the stratification explains what is going on. Nonetheless, the second round of analyses currently underway uses a different heuristic, designed to take the roughly 20% of observations that are least likely to occur (i.e. the source of the skew) and place them into one group, so that these outliers are evenly distributed across all folds; a sketch of the idea follows below. The simulated data does not exhibit any skew, so this detail should be irrelevant, but I’m mentioning it for completeness. The second round should be complete in five days or so.
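Here is a minimal sketch of the kind of two-group stratification I mean. It is illustrative only: the per-item proportion correct stands in for a model-based response probability, and my actual get_group_membership() heuristic is more involved; data_irt is the data list defined in the run script further down.

library(caret)

# Flag the roughly 20% of observations whose observed response is least likely
# (using the per-item proportion correct as a crude probability estimate), then
# let caret stratify the folds by that two-level grouping.
Ymat   <- as.matrix(data_irt$Y)
p_item <- matrix(colMeans(Ymat), nrow = nrow(Ymat), ncol = ncol(Ymat), byrow = TRUE)
p_obs  <- ifelse(Ymat == 1, p_item, 1 - p_item)          # probability of each observed response
flat_p <- as.vector(p_obs)                               # same column-major flattening as the holdout vectors
groups <- ifelse(flat_p <= quantile(flat_p, 0.2), 1, 2)  # least likely ~20% -> group 1
folds  <- createFolds(factor(groups), k = 10, list = FALSE)
table(groups, folds)  # the group-1 observations should be spread roughly evenly across the folds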

Thank you very much for your time! I hope this post might help someone else dealing with this type of problem for k-fold cross-validation with data-hungry models.

Implementation details concerning the k-fold cross-validation: I’m using OpenMP for parallelization. As it stands, each core computes 10 out of 30 chains, then stops. It would be great to know how to get the core(s) that finish first to run the Stan models for additional chains, because currently at the end I am waiting for one straggling core to handle two or three models in serial. But that’s a separate thread.
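One option I have not tried yet, since the chain-level jobs in the listing below are dispatched with mclapply(): turn off prescheduling so that jobs are handed out one at a time as cores free up. A minimal sketch, with n_fold, chains, cores, and generate_sample_chain as defined in stan_kfold_memory_friendly() further down:

library(parallel)
# With mc.preschedule = FALSE each job is forked only when a core becomes free,
# so a core that finishes early picks up one of the remaining chains instead of
# idling while a straggler works through its pre-assigned batch.
mclapply(1:(n_fold * chains), generate_sample_chain,
         mc.cores = cores, mc.preschedule = FALSE)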

Best,

Richard

#### STAN MODEL

## Stan file for the multidimensional IRT model with K-fold cross-validation:
// see Reckase(2009) and https://rfarouni.github.io/assets/projects/BayesianMIRT/BayesianMIRT.html
// Identification comes from the standard multivariate normal prior on the thetas (fixing the origin, units, and orientation of the latent space); the alphas get a lognormal prior.
data{
	int<lower=1> n_respondents;
	int<lower=1> n_items;
	int<lower=0,upper=1> Y[n_respondents,n_items];
	int<lower=1> D; // number of dimensions
	int<lower=0,upper=1> holdout[n_respondents,n_items];
}
transformed data{
	cov_matrix[D] Sigma;
	row_vector [D] mu_theta = rep_row_vector(0.0, D);
	Sigma = rep_matrix(0.0, D, D);
	for(i in 1:D){
		Sigma[i,i] = 1.0;
	} // = [ [1, 0, 0], [0, 1, 0], [0, 0, 1] ]; // Identity matrix as per Reckase (2009) and Chun Wang (2015)
}
parameters{
	vector [D] theta [n_respondents];
	matrix<lower=0>[n_items,D] alpha;
	real mu_alpha; 
	vector[n_items] beta;
	real<lower=0> sigma_alpha;
	real<lower=0> sigma_beta;
	real mu_beta;
}

model{
	// hyperpriors
	sigma_alpha ~ cauchy(0,5); // RJC: could allow the prior mean for alpha to vary as well
	mu_alpha ~ normal(0,4); // weak but not uninformative
	sigma_beta ~ cauchy(0,5); // weak but not uninformative
	mu_beta ~ normal(0.0, 4); // weak but not uninformative

	// priors
	for(i in 1:n_respondents){
		theta[i] ~ multi_normal(mu_theta, Sigma);  // Reckase 2009
	}
	beta ~ normal(mu_beta,sigma_beta);
	to_vector(alpha) ~ lognormal(mu_alpha, sigma_alpha);

	// likelihood
	// More efficient to write loops column by column as matrices are stored internally in column-major order (Stan manual p. 327)
	//int total_data_counter=0;
	//int sample_data_counter=0;
	for(j in 1:n_items){
		for(i in 1:n_respondents){
			if(holdout[i,j]==0){
				real p;
				p = inv_logit(row(alpha,j)*theta[i]+beta[j]);
				target += bernoulli_lpmf(Y[i,j] | p);
				//sample_data_counter += 1;
			}
			//total_data_counter += 1;
			// print("alpha[j] = ", alpha[j]);
			// print("theta[i] = ", theta[i]);
		}
	}
	//print("sample_data_counter = ", sample_data_counter); // for debugging
	//print("total_data_counter = ", total_data_counter); // for debugging
}
generated quantities{
	vector[n_items] log_lik[n_respondents];
	vector[n_items] pY[n_respondents];
	for(j in 1:n_items){
		for(i in 1:n_respondents){
			real p;
			pY[i,j] = inv_logit(row(alpha,j)*theta[i]+beta[j]);
			log_lik[i,j] = bernoulli_lpmf(Y[i,j] | pY[i,j]);
		}
	}
}
#### R FUNCTIONS

# Simple function to take the results from a K-fold cross-validation and return the loo value.
# Of course, in practice the whole point of doing K-fold cross-validation is that you can't calculate loo
# directly. The only point of this function is for checking that the results from both K-fold and 
# loo are consistent for models for which loo would work. It's a check that my code works.
get_loo_from_kfold <- function(kfold){
  log_lik <- kfold$ee_for_loo
  r_eff <- relative_eff(exp(log_lik))
  result <- loo(log_lik, r_eff=r_eff)
  return(result)
}

## This function takes in a data set (for input to a normal Stan object), 
## the Stan file that normally gets run, a number of folds (default = 10), 
## and a variable with levels along which we'll do the stratification using caret. 
## It generates the folds, runs Stan K*nchains times, gets the corresponding Stan objects,
## and extracts the log likelihood. It returns a single matrix of log-likelihoods for all the hold-out data. 
## I'll then need to run kfold() on the results.
## One assumption that I make is that the data I pass in has the actual outcomes stored in data$Y.
do_K_fold_cross_validation_memory_friendly <- function(data, 
                                                       stan_file, 
                                                       save_file_stem,
                                                       do_stratification=TRUE, 
                                                       n_folds=10, 
                                                       n_chains=3, 
                                                       n_cores=3,
                                                       merge_chains=FALSE,
                                                       iter=2000,
                                                       control=list(max_treedepth = 15),
                                                       desired_stratification_percentage=0.015,
                                                       truncation_percentage=0.02,
                                                       debug=TRUE)
{
  library(rstan)
  rstan_options(auto_write = TRUE)
  library(pbmcapply)
  library(loo)
  library(caret)
  
  K <- n_folds
  
  ## TODO: Make this its own function. (Get holdout data in flattened and unflattened formats)
  flattened.data <- as.vector(as.matrix(data$Y))
  if(do_stratification){
    ones_and_twos <- get_group_membership(data, 
                                          desired_group_1_percentage=desired_stratification_percentage, 
                                          truncation_percentage=truncation_percentage, 
                                          debug=TRUE)    
  }else{
    ones_and_twos <- rep(1, length(flattened.data))
    ones_and_twos[((length(flattened.data)/2)+1):length(flattened.data)] <- 2    
  }

  folds <- createFolds(ones_and_twos, n_folds, list=FALSE)
  
  holdout_K_from_caret <- list()
  for(k in 1:K)
  {
    holdout_vector <- rep(0,nrow(data$Y)*ncol(data$Y))
    holdout_vector[which(folds==k)] <- 1
    #holdout_vector <- flattened.data[folds[[k]]]
    holdout_K_from_caret[[k]] <- unflatten(holdout_vector,nrow=nrow(data$Y),ncol=ncol(data$Y))
    #unflatten(holdout_vector,nrow=nrow(data_irt$Y),ncol=ncol(data_irt$Y))
  }
  holdout_K_from_caret_flattened <- list()
  for(i in 1:K){
    holdout_K_from_caret_flattened[[i]] <- flatten(holdout_K_from_caret[[i]])
  }
  
  #run the functions
  I <- dim(data)[1]
  J <- dim(data)[2]
  #data_m <- list(n_respondents=I,n_items=J,Y=data)
  data_m <- data
  #create a list of data list
  data_l <- rep(list(data),K)
  #add the holdout index to it
  #for(i in 1:10) data_l[[i]]$holdout <- holdout_10[[i]]
  for(i in 1:K) data_l[[i]]$holdout <- holdout_K_from_caret[[i]]
  

  cat("In do_K_fold_cross_validation: Running K memory-friendly Stan models...\n")
  stan_kfold_memory_friendly(file=stan_file,
                             save_file_stem=save_file_stem,
                             data_l,
                             chains=n_chains,
                             cores=n_cores,
                             iter=iter,
                             control=control)
  # This doesn't return a list of Stan fit objects. So I want to pass in the 
  # save file stem to extract_log_lik_K.
  
  
  cat("In do_K_fold_cross_validation: extracting log-likelihood...\n")
  ee_for_loo <- extract_log_lik_K_memory_friendly(save_file_stem=save_file_stem,
                                                  n_folds=K,
                                                  list_of_holdout=holdout_K_from_caret_flattened,
                                                  merge_chains=merge_chains)
  ee_for_kfold <- extract_log_lik_K_memory_friendly(save_file_stem=save_file_stem,
                                                    n_folds=K,
                                                    list_of_holdout=holdout_K_from_caret_flattened)
  #r_eff <- relative_eff(exp(ee))
  
  cat("In do_K_fold_cross_validation: calculating ELPD and standard error...\n")
  kk <- kfold(ee_for_kfold)
  result <- list()
  result$kk <- kk
  result$ee_for_loo <- ee_for_loo
  result$ee_for_kfold <- ee_for_kfold
  return(result)
}


# Adapted from a function https://www.r-bloggers.com/k-fold-cross-validation-in-stan/ to use less memory
# all at once
extract_log_lik_K_memory_friendly <- function(save_file_stem, n_folds, list_of_holdout, ...){
  require(loo)
  K <- n_folds
  #list_of_log_liks <- plyr::llply(1:K, function(k){
  #  extract_log_lik(list_of_stanfits[[k]],...)
  #})
  
  load(paste(save_file_stem,"k=1",sep="")) # "merged_model"
  log_lik_1 <- extract_log_lik(merged_model,...)
  # A little inefficient -- I could probably save one read with clever coding -- but fine for now

  # `log_lik_heldout` will include the loglike of all the held out data of all the folds.
  # We define `log_lik_heldout` as a (samples x N_obs) matrix
  # (similar to each log_lik matrix)
  log_lik_heldout <- log_lik_1 * NA
  for(k in 1:K){
    load(paste(save_file_stem,"k=1",sep="")) # "merged_model"
    log_lik <- extract_log_lik(merged_model,...)
    samples <- dim(log_lik)[1] 
    N_obs <- dim(log_lik)[2]
    # This is a matrix with the same size as log_lik_heldout
    # with 1 if the data was held out in the fold k
    heldout <- matrix(rep(list_of_holdout[[k]], each = samples), nrow = samples)
    # Sanity check that the previous log_lik is not being overwritten:
    if(any(!is.na(log_lik_heldout[heldout==1]))){
      warning("Heldout log_lik has been overwritten!!!!")
    }
    # We save here the log_lik of the fold k in the matrix:
    log_lik_heldout[heldout==1] <- log_lik[heldout==1]
    rm(merged_model)
    gc()
  }
  return(log_lik_heldout)
}

## Revised version of stan_kfold that isn't as efficient, but which isn't as much of a memory hog.
## TODO: Error handling (low priority)
stan_kfold_memory_friendly <- function(file, save_file_stem, list_of_datas, chains, cores,iter,control,...){
  library(pbmcapply)
  badRhat <- 1.1 # don't know why we need this?
  n_fold <- length(list_of_datas)
  model <- stan_model(file=file)
  generate_sample_chain <- function(i){
    # Fold number:
    k <- ceiling(i / chains)
    s <- sampling(model, data = list_of_datas[[k]], 
                  chains = 1, chain_id = i,
                  iter=iter,
                  control=control)
    save(s, file=paste(save_file_stem,"i=",i,"_k=",k,sep=""))
    rm(s)
    gc()
    return(NULL)
  }
  #for(i in 1:(n_fold*chains)){
  #  s <- generate_sample_chain(i)
  #  rm(s) 
  #  gc() # Freeing up memory
  #}

  # First parallelize all chains:
  sflist <- 
    mclapply(1:(n_fold*chains), mc.cores = cores, 
               generate_sample_chain)
  # The sflist object is extraneous here.
  # NOTE: RStudio will mask the output of the individual calls to Stan().
  #       If I want to see that output, run everything directly from the console!
  #       It will be a bit of a mess to read, but provides finer-grain information 
  #       about the progress of parallelized model fitting for really big models.
  
  # Then merge the K * chains to create K stanfits, but save them to disk instead of returning in a list:
  for(k in 1:n_fold){
    sflist <- list()
    for(i in 1:chains){
      load(file=paste(save_file_stem,"i=",(chains*(k-1)+i),"_k=",k,sep=""))
      sflist[[i]] <- s
      rm(s)
      gc()
    }
    inchains <- 1:chains
    
    #  Merge `chains` of each fold
    merged_model <- sflist2stanfit(sflist[inchains])
    save(merged_model, file=paste(save_file_stem,"k=",k,sep=""))
    
    # Delete the intermediate files I just created
    for(i in 1:chains){
      filename <- paste(save_file_stem,"i=",(chains*(k-1)+i),"_k=",k,sep="")
      if(file.exists(filename)){
        file.remove(filename)
      }
    }
  }
  
  # At this point the directory referred to by save_file_stem should contain K stanfit files, 
  # each containing n_chains chains. We don't have those loaded into memory. 
  return() 
}

######################################################################################
#### Data simulation
## Gen2DMC2PL_model53.R
## Generate 2-dimensional (multidimensional) compensatory 2PL model parameters and data

library(MASS)

remove(list=ls())
set.seed(12345) # model 16
N <- 1350
K <- 38 # 100 # 38 # 125 # 200 # 500 (broke R) # 38
ndim <- 2
# D <- 1.0

mu_theta <- c(0.00,0.00)
sigma_theta <- matrix(data=c(1.00, 0.00, 0.00, 1.00), nrow=2, ncol=2)
mu_alpha <- 0.0 # log(1.0)
sigma_alpha <- 0.5
mu_beta <- 1.0
sigma_beta <- 0.5

outfile <- "~/Documents/IRT_Research/Data/IRT_simulated_data/for_R/mod53_data.csv"
params_directory <- "~/Documents/IRT_Research/Data/IRT_simulated_data/Model53/"

# Simulate model parameters
thetas <- mvrnorm(N, mu=mu_theta, Sigma = sigma_theta)
alphas <- matrix(data=rlnorm(K*ndim, meanlog = mu_alpha, sdlog = sigma_alpha),nrow=K,ncol=ndim)
betas <- rnorm(K, mean=mu_beta, sd=sigma_beta)

# Compute probability that respondent i answers question j correctly
pY <- matrix(data=NA,nrow=N,ncol=K)
for(i in 1:nrow(pY))
{
  for(j in 1:ncol(pY))
  {
    num <- exp(alphas[j,]%*%thetas[i,]+betas[j])
    pY[i,j] <- num/(1+num)
  }
}

# Simulate observed responses
Y <- matrix(data=NA,nrow=N,ncol=K)
for(i in 1:nrow(pY))
{
  for(j in 1:ncol(pY))
  {
    if(runif(1,min=0,max=1)<pY[i,j])
    {
      Y[i,j] = 1.0
    }
    else
    {
      Y[i,j] = 0.0
    }
  }
}
length(which(is.na(Y)))
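# Aside (a sketch added for illustration, not part of the original script): the
# two nested loops above can be reproduced with a single vectorized expression.
# pY_vec is a new name so nothing in the original script, including the RNG
# stream used to draw Y, changes; plogis() is R's inverse logit.
pY_vec <- plogis(thetas %*% t(alphas) + matrix(betas, nrow = N, ncol = K, byrow = TRUE))
all.equal(unname(pY), unname(pY_vec))  # should be TRUE
# Y could likewise be drawn in one call, e.g. matrix(rbinom(N*K, 1, pY), N, K),
# but that is left as a comment here so the simulated responses are unchanged.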

# Label the questions
col.labels <- rep(NA,ncol(Y))
for(j in 1:ncol(Y))
{
  col.labels[j] <- paste("Q",j,sep="")
}
colnames(Y) <- col.labels

# Check results: responses and probabilities should be correlated at about 0.5
cor(as.vector(Y),as.vector(pY))

# Output results
write.csv(Y, file=outfile,quote=FALSE)
write.csv(thetas, 
          file=paste(params_directory,"thetas.csv",sep=""))
write.csv(alphas, 
          file=paste(params_directory,"alphas.csv",sep=""))
write.csv(betas, 
          file=paste(params_directory,"betas.csv",sep=""))
write.csv(pY, 
          file=paste(params_directory,"pY.csv",sep=""))
##################################################################################

################################################
# Actually running the model
## Import libraries
source('~/Documents/IRT_Research/R_work/Stan_IRT_utilities.R')
options(mc.cores = parallel::detectCores())
library(rstan)
rstan_options(auto_write = TRUE)
library(loo)
library(bayesplot)
library(caret)
source('~/Documents/IRT_Research/R_work/k_fold_functions.R') # relevant functions are above
source('~/Documents/IRT_Research/R_work/loo_compare.R') 
# This is from the R package loo, minor changes made to allow simple comparisons of two 
# models fit with K-fold cross-validation

##  Fit a K-fold 2-D model to the 2-D dataset
data <- read.table("~/Documents/IRT_Research/Data/IRT_simulated_data/for_R/mod53_data.csv",header=TRUE,sep=",",row.names=1)
I = dim(data)[1]
J = dim(data)[2]
data_irt <- list(n_respondents=I,n_items=J,D=2,Y=data)
kfold18 <- do_K_fold_cross_validation_memory_friendly(data=data_irt,
                                                      stan_file="~/Documents/IRT_Research/Stan/Simulated_data/mod40/stan_simulated_data_model_40_kfold_cv.stan",
                                                      save_file_stem <- "~/Documents/IRT_Research/Data/IRT_simulated_data/cv_test/Model5/saved_models/2D_model/",
                                                      stratification_factor=NULL,
                                                      n_folds=10,
                                                      n_chains=3,
                                                      n_cores=4,
                                                      merge_chains=FALSE,
                                                      iter=1000,
                                                      control=list(max_treedepth = 15))
save(kfold18, file="~/Documents/IRT_Research/Data/IRT_simulated_data/cv_test/Model5/2d_k_fold_elpd.dta")
load(file="~/Documents/IRT_Research/Data/IRT_simulated_data/cv_test/Model5/2d_k_fold_elpd.dta")
loo18 <- get_loo_from_kfold(kfold18)
save(loo18, file="~/Documents/IRT_Research/Data/IRT_simulated_data/cv_test/Model5/2d_k_fold_loo.dta")
load(file="~/Documents/IRT_Research/Data/IRT_simulated_data/cv_test/Model5/2d_k_fold_loo.dta")
library(loo)
plot(loo18)
# This last model took nine days to run, ouch. 


##  Fit a K-fold 3-D model to the 2-D dataset
data <- read.table("~/Documents/IRT_Research/Data/IRT_simulated_data/for_R/mod53_data.csv",header=TRUE,sep=",",row.names=1)
I = dim(data)[1]
J = dim(data)[2]
data_irt <- list(n_respondents=I,n_items=J,D=3,Y=data)
kfold19 <- do_K_fold_cross_validation_memory_friendly(data=data_irt,
                                                      stan_file="~/Documents/IRT_Research/Stan/Simulated_data/mod40/stan_simulated_data_model_40_kfold_cv.stan",
                                                      save_file_stem <- "~/Documents/IRT_Research/Data/IRT_simulated_data/cv_test/Model5/saved_models/3D_model/",
                                                      stratification_factor=NULL,
                                                      n_folds=10,
                                                      n_chains=3,
                                                      n_cores=3,
                                                      merge_chains=FALSE,
                                                      iter=1000,
                                                      control=list(max_treedepth = 15))
save(kfold19, file="~/Documents/IRT_Research/Data/IRT_simulated_data/cv_test/Model5/3d_k_fold_elpd.dta")
load(file="~/Documents/IRT_Research/Data/IRT_simulated_data/cv_test/Model5/3d_k_fold_elpd.dta")
loo19 <- get_loo_from_kfold(kfold19)
#Warning message:
#  In log(z) : NaNs produced
save(loo19, file="~/Documents/IRT_Research/Data/IRT_simulated_data/cv_test/Model5/3d_k_fold_loo.dta")
load(file="~/Documents/IRT_Research/Data/IRT_simulated_data/cv_test/Model5/3d_k_fold_loo.dta")
load(file="~/Documents/IRT_Research/Data/IRT_simulated_data/cv_test/Model5/2d_k_fold_loo.dta")
compare(loo18, loo19)

plot(loo18)
plot(loo19)
load(file="~/Documents/IRT_Research/Data/IRT_simulated_data/cv_test/Model5/3d_k_fold_elpd.dta")
load(file="~/Documents/IRT_Research/Data/IRT_simulated_data/cv_test/Model5/2d_k_fold_elpd.dta")
source('~/Documents/IRT_Research/R_work/loo_compare.R')
kfold18_struct <- structure(kfold18$kk, class="loo")
kfold19_struct <- structure(kfold19$kk, class="loo")
compare(loo18, loo19)

Nice detailed question. I have time to write more comments tomorrow, but one quick comment

That get_loo_from_kfold doesn’t make sense. In 10-fold cross-validation you should compute the log_liks using the left-out data, and then they do not need an importance sampling correction anymore. It is sufficient to simply compute the log_mean_exp of those. Can you do that and include the result? (Sorry for the short message, need to go now.)

Thank you, sir! Yes, I agree that the correct approach is just to compute the log-mean-exp values you mentioned. In fact, those are the elpd_kfold values that I present at the top of the post. It’s understandable that you might have missed that, because despite the detail in the post I still forgot to include some code. In the R code listing, do_K_fold_cross_validation_memory_friendly() calls the function kfold(), which was not included with my first post. Here it is:

# This is from https://www.r-bloggers.com/k-fold-cross-validation-in-stan/
#compute ELPD
kfold <- function(log_lik_heldout)  {
  library(matrixStats)
  logColMeansExp <- function(x) {
    # should be more stable than log(colMeans(exp(x)))
    S <- nrow(x)
    colLogSumExps(x) - log(S)
  }
  # See equation (20) of @VehtariEtAl2016
  pointwise <-  matrix(logColMeansExp(log_lik_heldout), ncol= 1)
  colnames(pointwise) <- "elpd"
  # See equation (21) of @VehtariEtAl2016
  elpd_kfold <- sum(pointwise)
  se_elpd_kfold <-  sqrt(ncol(log_lik_heldout) * var(pointwise))
  out <- list(
    pointwise = pointwise,
    elpd_kfold = elpd_kfold,
    se_elpd_kfold = se_elpd_kfold)
  #structure(out, class = "loo")
  return(out)
}

The reason for my get_loo_from_kfold() function is just to attempt to reproduce the loo results using the log likelihoods from the k-fold cross-validation. I recognize that Pareto smoothed importance sampling is not necessary in that case, but I figured that if I could at least come close to replicating the results from my models that did not use k-fold cross-validation, then I would be more confident that I had not simply written a bug when creating the folds or implementing my Stan model for k-fold cross-validation. So mathematically I agree that this was an unnecessary step, but it seemed like a reasonable check to do. (Actually, I was also initially misled by this blog, which runs loo on the likelihoods returned by k-fold cross-validation for the purpose of model comparison, so it appears this confusion may still be prevalent in the Stan community.)

It looks like in my first post I also omitted the very last few lines of code that actually display the elpd_kfold values. Here’s what that looks like, for clarity:

load(file="~/Documents/IRT_Research/Data/IRT_simulated_data/cv_test/Model5/3d_k_fold_elpd.dta")
load(file="~/Documents/IRT_Research/Data/IRT_simulated_data/cv_test/Model5/2d_k_fold_elpd.dta")

kfold18$kk$elpd_kfold
# [1] -22758.8 [compare to top of post]
kfold19$kk$elpd_kfold
# [1] -22492.77

kfold18$kk$se_elpd_kfold
# 107.6672
kfold19$kk$se_elpd_kfold
# 105.9595

# I'm artificially and misleadingly assigning the class "loo" to these results so that I can 
# take functions from loo_compare.R in the loo package, only modify a few lines of code, and use 
# them to compare the results for the kfold analyses. It's a complete hack, written for expediency.
kfold18_struct <- structure(kfold18$kk, class="loo") 
kfold19_struct <- structure(kfold19$kk, class="loo")
compare_kfold(kfold18_struct, kfold19_struct)
# elpd_diff        se 
# 266.0           19.3

# The rest of this listing is all code from loo_compare.R, except for just a few lines, and I've 
# changed the function names...
compare_kfold <- function(..., x = list()) {
  dots <- list(...)
  if (length(dots)) {
    if (length(x)) {
      stop("If 'x' is specified then '...' should not be specified.",
           call. = FALSE)
    }
    nms <- as.character(match.call(expand.dots = TRUE))[-1L]
  } else {
    if (!is.list(x) || !length(x)) {
      stop("'x' must be a list.", call. = FALSE)
    }
    dots <- x
    nms <- names(dots)
    if (!length(nms)) {
      nms <- paste0("model", seq_along(dots))
    }
  }

  #if (!all(sapply(dots, is.loo))) {
  #  stop("All inputs should have class 'loo'.")
  #}
  if (length(dots) <= 1L) {
    stop("'compare' requires at least two models.")
  } else if (length(dots) == 2L) {
    loo1 <- dots[[1]]
    loo2 <- dots[[2]]
    comp <- compare_two_models_kfold(loo1, loo2)
    return(comp)
  } else {
    Ns <- sapply(dots, function(x) nrow(x$pointwise))
    if (!all(Ns == Ns[1L])) {
      stop("Not all models have the same number of data points.", call. = FALSE)
    }

    x <- sapply(dots, function(x) {
      est <- x$estimates
      setNames(c(est), nm = c(rownames(est), paste0("se_", rownames(est))) )
    })
    colnames(x) <- nms
    rnms <- rownames(x)
    comp <- x
    ord <- order(x[grep("^elpd", rnms), ], decreasing = TRUE)
    comp <- t(comp)[ord, ]
    patts <- c("elpd", "p_", "^waic$|^looic$", "^se_waic$|^se_looic$")
    col_ord <- unlist(sapply(patts, function(p) grep(p, colnames(comp))),
                      use.names = FALSE)
    comp <- comp[, col_ord]

    # compute elpd_diff and se_elpd_diff relative to best model
    rnms <- rownames(comp)
    diffs <- mapply(elpd_diffs, dots[ord[1]], dots[ord])
    elpd_diff <- apply(diffs, 2, sum)
    se_diff <- apply(diffs, 2, se_elpd_diff)
    comp <- cbind(elpd_diff = elpd_diff, se_diff = se_diff, comp)
    rownames(comp) <- rnms
    class(comp) <- c("compare.loo", class(comp))
    comp
  }
}

compare_two_models_kfold <- function(loo_a, loo_b, return = c("elpd_diff", "se"), check_dims = TRUE) {
  if (check_dims) {
    #if (dim(loo_a)[2] != dim(loo_b)[2]) {
    if (dim(loo_a[[1]])[1]!=dim(loo_b[[1]])[1]){
      stop(paste("Models don't have the same number of data points.",
                 "\nFound N_1 =", dim(loo_a[[1]])[1], "and N_2 =", dim(loo_b[[1]])[1]),
           call. = FALSE)
    }
  }

  diffs <- elpd_diffs_kfold(loo_a, loo_b)
  comp <- c(elpd_diff = sum(diffs), se = se_elpd_diff_kfold(diffs))
  structure(comp, class = "compare.loo")
}

elpd_diffs_kfold <- function(loo_a, loo_b) {
  pt_a <- loo_a$pointwise
  pt_b <- loo_b$pointwise
  elpd <- grep("^elpd", colnames(pt_a))
  pt_b[, elpd] - pt_a[, elpd]
}
se_elpd_diff_kfold <- function(diffs) {
  N <- length(diffs)
  sqrt(N) * sd(diffs)
}

Thank you again for your time, sir!

And this question is for anyone: in a related post, Ben Goodrich mentioned that “if you are getting variance on the order of days, then your Markov chain is not efficient enough to be doing anything reliable with the draws anyways.” I feel as though I do not know enough about how the No-U-Turn Sampler explores the posterior distribution to understand when I can draw inferences from the results and when I cannot. I may have mistakenly thought that as long as the standard convergence diagnostics look okay (checking autocorrelations and making sure the chains mix), I am safe and the difference in running times is not relevant. What’s the proper literature to refer to?

Thanks,

Richard

There is quite a lot of code and functions, and I wasn’t able to find the error, but I think it’s likely that you have an error in some part of the kfold computation. The kfold values are quite close to the posterior log predictive densities, as if you were not computing them with held-out data. The kfold values should be much closer to, and lower than, the loo log predictive densities.

Can you tell how many parameters each model has? I didn’t see that mentioned.


I appreciate your time just the same, sir. The two- and three-dimensional models have 2,814 and 4,202 parameters, respectively.

A bit more. If we look at the lpd values (eq. 2 in “Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC”, arXiv:1507.04544):

1-d: -23937.4
2-d: -22610.5
3-d: -22314.2

we see that the values get higher with increasing model complexity.
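(For reference, a minimal sketch of that lpd computation; log_lik_full here stands for the draws-by-observations log-likelihood matrix extracted from a fit to the full data, so the name is an assumption, not something from the listings above.)

library(matrixStats)
# lpd: for each observation, the log of the mean (over posterior draws) of its
# likelihood, computed stably on the log scale, then summed over observations.
lpd <- sum(colLogSumExps(log_lik_full) - log(nrow(log_lik_full)))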

Your elpd_loo results say that the 2-d and 3-d models are clearly better than the 1-d model. The difference between 2-d and 3-d is small, and although the pairwise SE is small too, the problem is the high Pareto k values, which make the loo accuracy suspect. In cases with high Pareto k values, elpd_loo is usually too optimistic, i.e., it has too high a value:

# 2-d model      0.0       0.0 -24581.8   1971.3  49163.7
# 3-d model    -23.6       3.8 -24605.5   2291.3  49210.9
# 1-d model   -586.6      33.5 -25168.5   1231.1  50336.9

You listed the number of high Pareto k values only for the case where the likelihood was assembled from the k-fold cross-validation models, but if the Pareto k values for the actual loo runs are similar, then it’s likely that the true difference between 2-d and 3-d would be similar or slightly larger in favor of the 2-d model. In Bayesian inference it is quite common that a model more complex than the true data-generating model is not much worse for prediction, as the priors and integration over the posterior “regularize” it.

Now if we look at the kfold values you report

2-d: -22758.8
3-d: -22492.8

If kfold were working, these should be close to the elpd_loo values and very likely lower, as only 90% of the data is used for fitting each model. They are, however, quite close to your lpd values: the gap is within about 10% of the p_loo values. Furthermore, the difference between your kfold values is 266 and the difference between the lpd values is 296, again a discrepancy of only about 10%. This hints that you are computing an average of the lpd and the kfold test lpd. I recommend re-checking your kfold computations.

Since the p_loo values are 70% and 54% of the total number of parameters, we are in the regime where the importance sampling in PSIS-LOO has difficulties, but it’s good that the p_loo values are not larger than the total number of parameters, which would indicate the possibility of severe model misspecification.


Thank you very much! Yes, I checked my code and found no fewer than three bugs. For the record:

  1. In extract_log_lik_K_memory_friendly(), I should have the lines:
  for(k in 1:K){
    load(paste(save_file_stem,"k=",k,sep="")) # "merged_model"

instead of

  for(k in 1:K){
     load(paste(save_file_stem,"k=1",sep="")) # "merged_model"

That alone would produce the behavior you described, because 90% of the log likelihoods that I considered “holdout data” were actually computed from training data.

  2. Second, it turns out that in do_K_fold_cross_validation_memory_friendly(), I extracted the holdout likelihood correctly for the purpose of computing the expected log predictive densities, but (perhaps ironically) extracted it incorrectly when my goal was to return the data in a different format to pass into my get_loo_from_kfold() function for catching bugs. Here’s the corrected code in case anyone else wants to use it:

do_K_fold_cross_validation_memory_friendly <- function(data, 
                                                       stan_file, 
                                                       save_file_stem,
                                                       do_stratification=TRUE, 
                                                       n_folds=10, 
                                                       n_chains=3, 
                                                       n_cores=3,
                                                       #merge_chains=FALSE,
                                                       iter=2000,
                                                       control=list(max_treedepth = 15),
                                                       desired_stratification_percentage=0.015,
                                                       truncation_percentage=0.02,
                                                       debug=TRUE,
                                                       recalculate=FALSE)
{
  library(rstan)
  rstan_options(auto_write = TRUE)
  library(pbmcapply)
  library(loo)
  library(caret)
  
  K <- n_folds
  if(!(recalculate)){
    ## TODO: Make this its own function. (Get holdout data in flattened and unflattened formats)
    flattened.data <- as.vector(as.matrix(data$Y))
    ## For now, the stratification is done locally.
    if(do_stratification){
      ones_and_twos <- get_group_membership(data, 
                                            desired_group_1_percentage=desired_stratification_percentage, 
                                            truncation_percentage=truncation_percentage, 
                                            debug=TRUE)    
    }else{
      ones_and_twos <- rep(1, length(flattened.data))
      ones_and_twos[((length(flattened.data)/2)+1):length(flattened.data)] <- 2    
    }
    
    folds <- createFolds(ones_and_twos, n_folds, list=FALSE)
    # TODO: Testing my work: I want the number of "yes" responses in each fold to be about the same.
    
    holdout_K_from_caret <- list()
    for(k in 1:K)
    {
      holdout_vector <- rep(0,nrow(data$Y)*ncol(data$Y))
      holdout_vector[which(folds==k)] <- 1
      holdout_K_from_caret[[k]] <- unflatten(holdout_vector,nrow=nrow(data$Y),ncol=ncol(data$Y))
    }
    holdout_K_from_caret_flattened <- list()
    for(i in 1:K){
      holdout_K_from_caret_flattened[[i]] <- flatten(holdout_K_from_caret[[i]])
    }
    save(holdout_K_from_caret_flattened, file=paste(save_file_stem,"holdout_indices",sep=""))
    
    #run the functions
    I <- dim(data)[1]
    J <- dim(data)[2]
    data_m <- data
    #create a list of data list
    data_l <- rep(list(data),K)
    #add the holdout index to it
    for(i in 1:K) data_l[[i]]$holdout <- holdout_K_from_caret[[i]]
    
    cat("In do_K_fold_cross_validation: Running K memory-friendly Stan models...\n")
    stan_kfold_memory_friendly(file=stan_file,
                               save_file_stem=save_file_stem,
                               data_l,
                               chains=n_chains,
                               cores=n_cores,
                               iter=iter,
                               control=control)
    # This doesn't return a list of Stan fit objects. So I want to pass in the 
    # save file stem to extract_log_lik_K.
  }
  
  if(recalculate){
    load(paste(save_file_stem,"holdout_indices",sep="")) # holdout_K_from_caret_flattened
  }
  
  cat("In do_K_fold_cross_validation: extracting log-likelihood...\n")
  ees <- extract_log_lik_K_memory_friendly(save_file_stem=save_file_stem,
                                           n_folds=K,
                                           list_of_holdout=holdout_K_from_caret_flattened)
  ee_for_loo <- ees$log_lik_loo
  ee_for_kfold <- ees$log_lik_kfold
  #ee_for_loo <- extract_log_lik_K_memory_friendly(save_file_stem=save_file_stem,
  #                                                n_folds=K,
  #                                                list_of_holdout=holdout_K_from_caret_flattened,
  #                                                merge_chains=merge_chains,
  #                                                debug=debug)
  #ee_for_kfold <- extract_log_lik_K_memory_friendly(save_file_stem=save_file_stem,
  #                                                  n_folds=K,
  #                                                  list_of_holdout=holdout_K_from_caret_flattened,
  #                                                  debug=debug)
  #r_eff <- relative_eff(exp(ee))
  
  cat("In do_K_fold_cross_validation: calculating ELPD and standard error...\n")
  kk <- kfold(ee_for_kfold)
  result <- list()
  result$kk <- kk
  result$ee_for_loo <- ee_for_loo
  result$ee_for_kfold <- ee_for_kfold
  return(result)
}
  3. Perhaps most importantly, even with the previous two bugs fixed, my code still would not work, because label switching is taking place between chains. I’ve seen that before and addressed it, but with K-fold cross-validation I have the added responsibility of making sure that the labels I am attaching to each chain are consistent across the model runs. Because this was not the case in my previous analysis, it’s no wonder the 3-D model appears to have the best fit: I’m essentially just fitting noise!

I’m working on some plotting functions now so that I can better visualize the assignment of labels to chains across all ten folds and make my analysis less error-prone. I’ll post back once I’ve got the code working; it may be a couple of weeks, as I’ll have to put this project down to do some other work. Thank you very much for your assistance, Dr. Vehtari! The feedback definitely helped me to catch these bugs.

Gratefully,

Richard


Follow-up: fixing the first two bugs solved the problem! As for the label switching between chains, that actually turns out not to be a concern, because the likelihood is computed from the inner product of a row vector of alphas and a column vector of thetas. With label switching, the order in which the parameters appear within those vectors changes from chain to chain, but the product used to calculate the likelihood remains the same. In other words, for some chains I’ll be calculating inv_logit(alpha_{1,j} * theta_{i,1} + alpha_{2,j} * theta_{i,2} + beta_j), referring to respondent i answering question j, whereas for other chains I’ll be calculating inv_logit(alpha_{2,j} * theta_{i,2} + alpha_{1,j} * theta_{i,1} + beta_j), because the assignment of labels to the latent dimensions is arbitrary. The fact that the labels change does not change the computed likelihood. What good news! It saves me quite a bit of time, because I don’t have to fix the label switching in the model output when using K-fold cross-validation for model comparison. That’s kind of elegant, so I thought I’d share.
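Here is a quick numeric illustration of that invariance (the parameter values are made up for the example; plogis() is R’s inv_logit):

# Permuting the latent dimensions the same way in alpha and theta leaves the
# response probability, and hence the likelihood, unchanged.
alpha_j <- c(0.8, 1.4)   # discriminations of item j on dimensions 1 and 2
theta_i <- c(-0.3, 0.9)  # abilities of respondent i on dimensions 1 and 2
beta_j  <- 0.5           # intercept of item j
perm    <- c(2, 1)       # a label switch: swap the two dimensions
p_orig  <- plogis(sum(alpha_j * theta_i) + beta_j)
p_perm  <- plogis(sum(alpha_j[perm] * theta_i[perm]) + beta_j)
all.equal(p_orig, p_perm)  # TRUE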
