Test zero inflated Graded response model

We want to estimate the parameters of a zero-inflated mixture IRT model, similar to the zero-inflated Poisson model (https://mc-stan.org/docs/2_20/stan-users-guide/zero-inflated-section.html) but with multivariate ordinal data. There are 9 items, each with 4 ordered response categories. A large proportion of people in the sample endorse 0 for all 9 items. In our simulated data, 160 of N = 500 people have all-0 response patterns.

The IRT component of the mixture model predicts some proportion of all-0 response patterns (sampling 0s) — that is, some of the 160 all-0 response patterns come from the IRT model. We will refer to the individuals whose responses come from this standard IRT model as belonging to Class 1. The remaining all-0 response patterns belong to a latent class that is degenerate, in that all members of this class respond 0 to each item with probability 1 (structural 0s). We will call this Class 0. It is the all-0 responses from Class 0 that are responsible for the zero inflation.

In our code, we use an indicator variable dat_bin[I], such that dat_bin[i] == 1 if the respondent endorses 0 for all 9 items. These individuals could belong to either Class 0 or Class 1. If dat_bin[i] == 0, the individual must belong to Class 1. The parameter theta_zi is the proportion of the entire sample (N = 500, all response patterns) that is estimated to belong to Class 0. It is not possible to identify which all-0 respondents belong to Class 0 vs. Class 1; however, we should be able to estimate the proportion of the entire sample that belongs to each class (theta_zi = 0.08 in our simulated data). We are able to do this with ML. When we use the attached code, the mean, sd, and credible interval for theta_zi all come to out be exactly 0. This suggests that none of the all-0 response patterns are being treated as structural 0s and that the zero-inflated part of the mixture model is being ignored.

This type of test zero inflation comes from this paper " IRT Modeling in the Presence of Zero-Inflation With Application to Psychiatric Disorder Severity", which does the test zero inflation 2PL model with ML in Mplus 0146621615588184.pdf (750.0 KB)

Here is the Stan model, and I am attaching simulated data TISimData.csv (9.3 KB) set , and full R code to replicate the issue
TestInflation.R (2.3 KB)

Any help is appreciated on how to specify the multivariate test inflation

Thank you


data{
int n; //number of subjects
int p; // number of items
int K; // number of categories
int x[n,p]; // categorical data matrix
int x_bin[n]; // indicator variable for all-0 responses
}

parameters{
real<lower=0, upper=5> alpha[p]; // discrimination for each item
ordered[K-1] kappa[p]; // category thresholds for each item
real theta[n]; // factor scores
real<lower=0, upper=1> theta_zi; // test zero inflation
real mu_kappa; //mean of the prior distribution of thresholds
real<lower=0> sigma_kappa; //sd of the prior distribution of thresholds
real<lower=0> sigma_alpha; //sd of the prior distribution of discrimination
}

transformed parameters{
vector[K-1] thresholds[p]; 

for(i in 1:p){
for(k in 1:(K-1)){
thresholds[i,k] = kappa[i,k]/alpha[i]; // IRT threshold parameterization
}
}
}

model{

for(i in 1:n){
for(j in 1:p){

if (x_bin[i] == 1){
target += log_sum_exp(bernoulli_lpmf(1 | theta_zi),
bernoulli_lpmf(0 | theta_zi) + ordered_logistic_lpmf(x[i,j] | alpha[j] * theta[i], kappa[j]));
}else{
target += bernoulli_lpmf(0 | theta_zi) + ordered_logistic_lpmf(x[i,j] | alpha[j] * theta[i], kappa[j] );
}
}
}

for(i in 1:p){
for(k in 1:(K-1)){
kappa[i,k] ~ normal(mu_kappa, sigma_kappa);
}
}

mu_kappa ~ normal(0,5);
sigma_kappa ~ cauchy(0,5);

alpha ~ lognormal(0, sigma_alpha);
sigma_alpha ~ cauchy(0,5);

theta ~ normal(0,1);
}
1 Like

I think the problem might lie in the loop structure at

for(i in 1:n){
for(j in 1:p){

if (x_bin[i] == 1){
target += log_sum_exp(bernoulli_lpmf(1 | theta_zi),
bernoulli_lpmf(0 | theta_zi) + ordered_logistic_lpmf(x[i,j] | alpha[j] * theta[i], kappa[j]));
}else{
target += bernoulli_lpmf(0 | theta_zi) + ordered_logistic_lpmf(x[i,j] | alpha[j] * theta[i], kappa[j] );
}
}
}

if I read the code correctly, you mix for each response individually (i.e. a person can have some items as “structural” zeros while other items as “non-structural” zeroes). This would also mean you add the bernoulli_lpmf(0 | theta_zi) term 9 times more often than you should which probably drives the posterior for theta_zi to 0. The mixing should probably happen for all the responses by a single participant at once, i.e. :

for(i in 1:n){
  vector[p] ordinal_response_lpmf;
  for(j in 1:p){
    ordinal_response_lpmf[j] = ordered_logistic_lpmf(x[i,j] | alpha[j] * theta[i], kappa[j]);
  }
  if (x_bin[i] == 1){
    target += log_sum_exp(bernoulli_lpmf(1 | theta_zi),
    bernoulli_lpmf(0 | theta_zi) + sum(ordinal_response_lpmf));
  }else{
    target += bernoulli_lpmf(0 | theta_zi) + sum(ordinal_response_lpmf);
  }
}

With that modification my estimates for theta_zi are:

                 mean se_mean   sd  2.5%   25%   50%   75% 97.5% n_eff Rhat
theta_zi         0.20    0.00 0.04  0.12  0.17  0.20  0.22  0.26  1000    1

This is still different from your theta_zi, but that also might be due to the very wide priors you’ve used - probably the values of other parameters you used in the simulation are not very representative of those priors…

Hope that helps.

BTW the way I discovered the bug was that I was trying to write an R function that would simulate the data by exactly mirroring the Stan code (sampling statements become RNG calls). This is generally a very helpful debugging step, and I can recommend it.

2 Likes

Thank you – that was exactly the problem! (I’m working on this project with Mauricio, who posted the question.)