IRT: Troubleshooting likelihood-based model comparisons in the context of label switching within chains

Dear Stan users,

I’m comparing multidimensional IRT models to assess the dimensionality of respondents’ abilities and the item difficulties, and I’m using K-fold cross-validation to do so because the data are skewed enough that LOO cross-validation fails because too many Pareto K-values are too high (see this post). Just as I found with simulated data, I’m finding that as I increase the number of dimensions, I start to see multimodality within chains (i.e. within-chain label switching). In the present case I have a three-dimensional model that is a slightly better fit to the data than the two-dimensional model, but for which the improvement in fit is not statistically significant.

It is possible that I won’t have to care about within-chain label switching at this stage of the analysis, when I’m merely comparing models. (Please see the end of this post for the Stan code.) As I mentioned in a previous post, the likelihood is computed based on the matrix product of a row vector of D item difficulties (alphas) and a column vector of D respondent abilities (thetas), where D is the dimensionality of the latent abilities. (So, for the SAT, we might have D = 2, where the dimensions correspond to verbal and quantitative aptitude.) With label switching between chains, for different chains the order in which the parameters appear within a vector will change, but the matrix product used to calculate the likelihood remains the same. For a two-dimensional model for example, I’ll be calculating for some chains inv_logit(alpha_{1,j} * theta_{i,1} + alpha_{2,j} * theta_{i,2} + beta_j, referring to respondent i answering question j. But for other chains, I’ll be calculating inv_logit(alpha_{2,j} * theta_{i,2} + alpha_{1,j} * theta_{i,1} + beta_j because the assignment of labels to person ability and item difficulty dimensions is arbitrary. So whether I have to care about the label switching comes down to the convergence diagnostics for the computed probabilities that respondents correctly respond to questions.

This model uses K-fold cross-validation, so I’ve written a wrapper function for the Traceplot procedure to record the results for multiple folds. In (traceplots_p[3,19].pdf (1.2 MB)), let’s take a look at the traceplot for Fold 5, which corresponds to respondent #3 responding to item #19. (Here’s a screenshot.)

This particular item is designed to require expertise in only a single dimension to answer correctly. So in a way it’s reassuring to see that the estimated item difficulty is greater in one of the dimensions (about 4.0) than the other two (2.0). This respondent gets the question correct, and the results show that the respondent demonstrates an ability (theta) that is higher in the dimension corresponding to the higher item difficulty than in the other two dimensions. (That may be a bit hard to read because of the within-chain label switching going on.) Additionally, this respondent only answers one test question correctly, and it is this one. The theta values for two of the three dimensions are zero, corresponding to no ability in those dimensions.

We can see that beta is quite negative, meaning that the intercept for the item difficulty is compensating for the high, positive values of the three difficulty parameters alpha for each question. In the present case, the respondent has an ability of about 1.0 in the dimension associated with the greatest item difficulty (4.0), so subtracting the beta value of about 4.0 gives a resulting probability of response of about 0.75. Decreasing beta, however, decreases the probability of a correct response. Since beta[19] varies greatly between -4 and -6, p[y] also has a high variance for this respondent. The model therefore doesn’t do very well with prediction for this respondent, but at present my interest is only that the model is stable enough that I can infer the dimensionality of the latent variable.

My concern here is the autocorrelation plot for p[Y]. For the first chain, for example, at about the 1600th sample, p[Y] drops to a value close to zero and stays there for 100 samples. That region is associated with label switching within chains: whereas Dimension 3 appears to be the dimension that this question measures for other draws, that switches to Dimension 1 between samples 1600 and 1700. At the moment that theta[3,3] and alpha[19,3] (corresponding to Dimension 3) both decrease significantly for Dimension 3 in this region, theta and alpha only increase a small amount in Dimension 1, while beta drops to -4 and doesn’t fluctuate much for 100 samples. The fact that the relative increases in theta and alpha for Dimension 1 are less than the decrease for beta[19] would explain why p[Y] drops to zero for those samples.

The question is, what do I do about this? Do I need to stop the label switching within chains by specifying constraints, in order to be more confident in the stability of p[Y] for the purpose of extracting log likelihoods and comparing models? Is the label switching within chains itself an indication that I am already trying to squeeze too much out of my data, and three dimensions are not going to work? Is the prior for beta too relaxed? Should I consider something else? Thank you very much.

Here’s the Stan code:

// some ideas from from
// RJC: Note the (0,1) normal prior on the alphas to identify their scale. 
	int<lower=1> n_respondents;
	int<lower=1> n_items;
	int<lower=0,upper=1> Y[n_respondents,n_items];
	int<lower=1> D; // number of dimensions
	int<lower=0,upper=1> holdout[n_respondents,n_items];
transformed data{
	cov_matrix[D] Sigma;
	row_vector [D] mu_theta = rep_row_vector(0.0, D);
	Sigma = rep_matrix(0.0, D, D);
	for(i in 1:D){
		Sigma[i,i] = 1.0;
	}  // Identity matrix as per Reckase (2009) and Chun Wang (2015)
	vector [D] theta [n_respondents];
	matrix<lower=0>[n_items,D] alpha;
	real mu_alpha; // note that only the LOG of mu_alpha must be positive, so there's no lower bound at zero here.

	vector[n_items] beta;
	real<lower=0> sigma_alpha;

	real<lower=0> sigma_beta;
	real mu_beta;

	// hyperpriors
	sigma_alpha ~ cauchy(0,5); // RJC: could allow the prior mean for alpha to vary as well

	mu_alpha ~ normal(0,4); // weak but not uninformative
	sigma_beta ~ cauchy(0,5); // weak but not uninformative
	mu_beta ~ normal(0.0, 4); // weak but not uninformative

	// priors
	for(i in 1:n_respondents){
		theta[i] ~ multi_normal(mu_theta, Sigma);  // Reckase 2009
	beta ~ normal(mu_beta,sigma_beta);
	to_vector(alpha) ~ lognormal(mu_alpha, sigma_alpha);

	// likelihood
	// More efficient to write loops column by column as matrices are stored internally in column-major order (Stan manual p. 327)
	for(j in 1:n_items){
		for(i in 1:n_respondents){
				real p;
				p = inv_logit(row(alpha,j)*theta[i]+beta[j]);
				target += bernoulli_lpmf(Y[i,j] | p);
generated quantities{
	vector[n_items] log_lik[n_respondents];
	vector[n_items] pY[n_respondents];
	for(j in 1:n_items){
		for(i in 1:n_respondents){
			real p;
			pY[i,j] = inv_logit(row(alpha,j)*theta[i]+beta[j]);
			log_lik[i,j] = bernoulli_lpmf(Y[i,j] | pY[i,j]);
1 Like