Convergence problem with IRT models


#1

Hi all,

I am fitting an IRT model with three binary variables (K=3).

The Rcodes to generate the data is as follows:


# Rcodes to generate and fit IRT models
# NORMAL distribution

rm(list = ls())

# load libraries

library(rstan)
rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())

library(xlsx)

#####################################################
###            CASE 1: BINARY                     ###
#####################################################

#####################################################
###            Generating data                    ###
#####################################################

# Number of items
K=3

# Number of subjects
N=1000

# Parameters

# thres and slope

thres=runif(n=K, min=-1, max=1)
lambda=runif(n=K, min=0.4, max=2)

para=c(thres, lambda)

# latent traits

xi=rnorm(n=N, mean=0, sd=1)

# matrix to store probabilities
prob=matrix(data=NA, nrow = N, ncol = K)
# matrix to store the outcome

Y=matrix(data=NA, nrow = N, ncol = K)

for (i in 1 : N){
   for (k in 1: K){
      prob[i, k]=pnorm(thres[k] + lambda[k] * xi[i])
   
      Y[i, k]=ifelse(runif(1)<prob[i, k], 1, 0)
   }
}

write.table(x=Y, file = 'outcome.txt', quote = F, sep = ' ',
            row.names = F, col.names = T)

#####################################################
###            Fitting models                     ###
#####################################################

#################################
###      model1a              ###
#################################

data1a = list(N=N, K=K, Y=Y)

model1a = stan(file="model1a.stan", data = data1a,
              iter=5000, chains = 4,
              pars = c('thres', 'lambda',
                       'mu_thres', 'sigma_thres', 'sigma_lambda'))

# trace plots

pdf('thres1a.pdf')
stan_trace(model1a, pars = 'thres', inc_warmup = F)
dev.off()

pdf('lambda1a.pdf')
stan_trace(model1a, pars = 'lambda', inc_warmup = F)
dev.off()

pdf('mu_thres1a.pdf')
stan_trace(model1a, pars = 'mu_thres', inc_warmup = F)
dev.off()

pdf('sigma_thres1a.pdf')
stan_trace(model1a, pars = 'sigma_thres', inc_warmup = F)
dev.off()

pdf('sigma_lambda1a.pdf')
stan_trace(model1a, pars = 'sigma_lambda', inc_warmup = F)
dev.off()

# summary

summ1a=summary(model1a)$summary
write.xlsx(summ1a, file = 'summ1a.xlsx')

and below is stan model


data {
	int N;

	int K;

	int Y[N, K];
}

parameters {
	vector[K] lambda;
	real sigma_lambda;

	vector[K] thres;
	real mu_thres;
	real sigma_thres;

	vector[N] xi;
}

model{
	lambda ~ lognormal(0, sigma_lambda);
	sigma_lambda ~ cauchy(0, 5);

	thres ~ normal(mu_thres, sigma_thres);
	mu_thres ~ cauchy(0, 5);
	sigma_thres ~ cauchy(0, 5);

	xi ~ normal(0, 1);

	for (i in 1: N){
		for (k in 1 : K){
			Y[i, k] ~ bernoulli(Phi(thres[k] + lambda[k] * xi[i]));
		}
	}	
}

The trace plot does not show convergence for lambda2

lambda2_1a

It seems that lambda[2] is not identified from the model and data.

Do you have any idea? Thank you for your help!

Tran.

model1a.stan (552 Bytes)
No missing.R (2.2 KB)
outcome.txt (6.8 KB)


#2

If read your R code correctly, lambda should be between .4 and 2.

The priors you give the lambda is a lognormal with a cauchy scale. Both cauchy and lognormal distributions have long tails and that is causing all the spikes and probably make it difficult to converge. You will need tighter priors on your lambdas because you know that the lambdas can’t be very large.


#3

Hi @stijn,

That condition is only for generating lambda, it does not mean any restriction that I impose in the model.

I agree that lambda cannot be so large but the values I generated for lambda are 0.65 1.96 1.04 and they are plausible (not so high).

The priors I follow the Stan 2.16.0 manual in pages 143-144.

Tran.


#4

I still think that you need a tighter prior.

Or you could do the opposite. If you really want to test whether your Stan program works. You should generate your lambda not from the uniform as in your R program but from the lognormal like in your prior.

In addition you need to constraint the sigmas in your Stan model to be positive like.

real<lower=0> sigma_lambda;
real<lower=0> sigma_thres;

You probably want to do the same thing for the lambda if they need to be positive. I am just guessing this from your use of the log normal.


#5

Yes, you are correct that lambda~lognormal to keep them positive, following from the IRT models that they are discrimination parameters.

As you suggest that I should use tighter prior, do you have any idea of the plausible values that lambda (and theta (difficulty parameters)) should be?

Tran.


#6

This depends on what you know about the values in your model. The prior should be driven by what you know. If the prior is not consistent with the posterior, there will often be computational issues.


#7

Hi,

I would investigate how the priors look like so I run the following Stan code


data {
	int N;
}

model {
		
}

generated quantities {
	vector[N] sigma_lambda;
	vector[N] lambda;

	for (i in 1 : N){
		sigma_lambda[i] = cauchy_rng(0, 5);
		lambda[i] = lognormal_rng(0, sigma_lambda[i]);
	}
}

with R codes


# Simulated sample size
N=1000

data1a = list(N=N)

model1a = stan(file="model1a.stan", data = data1a,
               iter=5000, chains = 1,
               pars = c('lambda', 'sigma_lambda'))

but Stan gives an error

SAMPLING FOR MODEL ‘model1a’ NOW (CHAIN 1).
[1] "Error in sampler$call_sampler(args_list[[i]]) : “
[2] " Must use algorithm=“Fixed_param” for model that has no parameters.”
[1] “error occurred during calling the sampler; sampling not done”

Do you have an idea on that? Thanks a lot!

Tran.


#8

Hi,

I would have another question following the use of cauchy distributions.

As some of you discussed that cauchy distribution is slower than normal, I would ask what motivation that in the example of IRT models in Stan manual, you would suggest to use cauchy distribution instead of normal?

Thanks a lot!

Tran.


#9

The error you get also give you the solution. In your call to stan in R you need to specify algorithm="Fixed_param because you are not using any parameters in your stan file. Your stan generated quantities section won’t work though. You will generate negative sigma_lambdas and the log normal will give an error (nan I believe).

I think you can do the same thing in R with the following code but there has been some discussion on this forum which gave me pause. So use this with caution.

sigma_lambda = abs(rcauchy(5000, 0, 5))
lambda = exp(rnorm(5000, 0, sigma_lambda))

The advice to use a cauchy distribution for scales comes from this paper. I think the Stan team has moved to advising priors with less probability mass in the tails such as normal or t(3). More up-to-date advice on priors is given in the wiki.


#10

I suspect the main problem is caused by the small number of items (K=3). I think you would have much better results if you increased the number of items to 10 or more. Since the latent dimension along which the item-parameters are scaled (xi) is unobserved, there needs to be a sufficient number of items (and test-takers) to define this scale. I would also recommend increasing N (the number of test-takers) to about 100. If you get better results, then you can decrease the numbers of items and test-takers in subsequent runs to see where the model identification starts to break down. Probably around 5-6 items and 30 test-takers based on my experience.


#11

Thanks @segalldo!

Yes, I see in one simulated data set that it works better for K=4. Mathematically they said in the literature that K should be at least 3, and unfortunately my data set has only K=3 observed variables.

That is why I generate data with K=3 although I also thought that it might not be enough but I cannot work outside my data.

In this situation (K=3) I thought that the identification problem might be less serious if we have a good priors, i.e. the priors help to partially identify the item parameters. But what is a good prior in the IRT litterature?

Tran.


#12

For parameterizations that use a difficulty (rather than a threshold) parameter [ i.e., lambda[k] * (xi[i] - diff[k]) ], I have found that the same prior used on ability parameters xi ~ normal(0, 1) works well for the difficulty parameters: diff ~ normal(0, 1). And although the heirarchical setup you are using is very elegant (and very Bayesian), I think the hyper-parameters sigma_lambda, mu_thres, sigma_thres for the lambda and thres parameters are also poorly identified because of the small number of items. So I would suggest trying to fix these at specific values. Maybe figure out the values the lognormal parameters that result in a mean and variance of 1 for lambda, and then reparameterize lambda[k] * (xi[i] - diff[k]) and use diff ~ normal(0, 1).


#13

Hi Segalldo,

I will try to apply your suggestion!

Do I understand you correctly, I mean when

it leads to non-identification problems for lambda and thres parameters as a consequence?


#14

Yes, I believe that problems with the identification of lambda and thres parameters could be caused by their poorly identified hyper-parameter counterparts (sigma_lambda, mu_thres, sigma_thres). Good luck!


#15

There can also be problems if you do this hierarchically with centered parameterizations. It’s not non-identifiability so much as challenging posterior geometry. There’s a discussion of identifiability for all these IRT models in the extensive case studies Daniel Furr put together and in the Stan manual. (Also a great discussion in Gelman and Hill’s regression book, but don’t take their advice to use something unidentified in Gibbs and try to standardize it—that will not work in HMC and I’m not convinced it even works in Gibbs, though Andrew is convinced it works.)