# The trouble with parameter estimation in Cognitive Diagnostic Models (CDMs)

I am currently studying psychostatistics at the postgraduate level.
I am using Stan to estimate the parameters of a cognitive diagnostic model (CDM), but I am having difficulty with parameter recovery in simulations and would very much appreciate your advice.

The R simulation code and the Stan code are as follows.

rm(list = ls())
library(tidyverse)
library(rstan)
library(cmdstanr)

seed = 1234
setwd(dirname(rstudioapi::getActiveDocumentContext()$path))
set.seed(seed)
n <- 500

#### advance preparation ####
N <- n # sample size
Q <- matrix(c(1,0,0,0,0,
              0,1,0,0,0,
              0,0,1,0,0,
              0,0,0,1,0,
              0,0,0,0,1,
              1,1,0,0,0,
              0,1,1,0,0,
              0,0,1,1,0,
              0,0,0,1,1,
              0,1,0,1,1,
              1,1,1,0,0,
              1,0,0,1,1,
              1,0,1,1,1,
              1,1,1,0,1,
              1,1,1,1,1),
            nrow = 15, ncol = 5, byrow = TRUE) # Q matrix
J <- nrow(Q) # number of items, determined by the number of rows of Q
K <- ncol(Q) # number of attributes, determined by the number of columns of Q
L <- 1       # L = 1 because there is only one ability parameter this time

#### ability parameter theta ####
theta <- rnorm(N, 0, 1) # drawn from a standard normal distribution

#### attribute-acquisition parameters ####
# Set following previous studies; the aim of this simulation is to recover these values
lambda <- rep(1.5, 5)
beta   <- c(-1.0, -0.5, 0.0, 0.5, 1.0)

#### item parameters: guess, slip ####
# slip and guess are all fixed at 0.1
guess <- rep(0.1, J)
slip  <- rep(0.1, J)

#### attribute-acquisition probabilities ####
delta <- matrix(nrow = N, ncol = K)
for (k in 1:K) {
  delta[, k] <- exp(lambda[k] * theta - beta[k]) / (1 + exp(lambda[k] * theta - beta[k]))
}

#### item response probabilities ####
p <- matrix(nrow = N, ncol = J)
for (j in 1:J) {
  p[, j] <- apply(delta, 1, function(x) {
    guess[j] + (1 - slip[j] - guess[j]) * prod(x ^ Q[j, ])
  })
}

#### item responses (observed) ####
Y <- matrix(nrow = N, ncol = J)
for (j in 1:J) {
  for (n in 1:N) {
    Y[n, j] <- rbinom(1, 1, p[n, j])
  }
}

#### estimation ####
datastan <- list(N = N, J = J, K = K, Q = t(as.matrix(Q)), Y = t(as.matrix(Y)))
model <- cmdstan_model("simu_highorder_stan_0610.stan")
fit1 <- model$sample(data = datastan,
                     iter_warmup = 1000, iter_sampling = 1000,
                     chains = 4, parallel_chains = 4,
                     refresh = 200, seed = seed)
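Before looking at the fit, a quick sanity check on the simulated data can help; this check is my addition, not part of the original script:

```r
# Sanity check on the simulated data (my addition):
colMeans(delta)  # average mastery probability per attribute
colMeans(p)      # model-implied proportion correct per item
colMeans(Y)      # empirical proportion correct; should track colMeans(p)
```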

#### results ####
fit1$print('beta')
fit1$print('lambda')
fit1$print('s')
fit1$print('g')

The Stan code (simu_highorder_stan_0610.stan) is:

data {
  int<lower=0> N;     // number of respondents
  int<lower=0> J;     // number of items
  int<lower=0> K;     // number of attributes
  matrix[K, J] Q;     // Q matrix, transposed (attributes x items)
  array[J, N] int Y;  // item responses, transposed (items x respondents)
}

transformed data {
  // column of -1s, so that append_col(theta, Ones) * [lambda, beta]'
  // yields lambda[k] * theta[n] - beta[k]
  vector[N] Ones = rep_vector(-1, N);
}

parameters {
  vector<lower=0>[K] lambda;           // attribute slopes
  vector[K] beta;                      // attribute intercepts

  vector<lower=0, upper=1>[J] s;       // slip parameters
  vector<lower=0, upper=1>[J] g_base;  // guessing parameters, before the g < 1 - s constraint
  vector[N] theta;                     // individual abilities

  real<lower=0> sigma_lam;
  real<lower=0> sigma_beta;
}

transformed parameters {
  // enforce g[j] < 1 - s[j]
  vector<lower=0, upper=1>[J] g;
  for (j in 1:J) {
    if (g_base[j] > 1 - s[j]) {
      g[j] = 1 - s[j] - 1e-10;
    } else {
      g[j] = g_base[j];
    }
  }

  // delta[n, k] = inv_logit(lambda[k] * theta[n] - beta[k])
  matrix[N, 2] X = append_col(theta, Ones);
  matrix[K, 2] b = append_col(lambda, beta);
  matrix[N, K] X_bT = X * b';
  matrix[N, K] delta = inv_logit(X_bT);

  // eta[n, j] = prod_k delta[n, k]^Q[k, j], computed in log space
  matrix[N, J] eta = exp(log(delta + 1e-8) * Q);

  // item response probabilities
  matrix[N, J] p;
  for (n in 1:N) {
    for (j in 1:J) {
      p[n, j] = g[j] + (1 - s[j] - g[j]) * eta[n, j];
    }
  }
}

model {
  sigma_lam ~ lognormal(0, 0.5);
  sigma_beta ~ lognormal(0, 0.5);
  lambda ~ lognormal(0, sigma_lam);
  beta ~ normal(0, sigma_beta);

  theta ~ normal(0, 1);

  s ~ beta(2, 5);
  g_base ~ beta(2, 5);

  // likelihood
  for (j in 1:J) {
    Y[j,] ~ bernoulli_logit(p[,j]);
  }
}
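One step worth unpacking is `eta = exp(log(delta + 1e-8) * Q)`: because the Q matrix is passed to Stan transposed (K × J), the matrix product accumulates \sum_k q_{jk} \log \delta_{nk}, and exponentiating recovers the conjunctive product \eta_{nj} = \prod_k \delta_{nk}^{q_{jk}}. A small R illustration with made-up numbers:

```r
# The identity behind the Stan line: prod_k d^q = exp(sum_k q * log(d))
delta_n <- c(0.7, 0.4, 0.9, 0.5, 0.8)  # one person's mastery probabilities (made up)
q_j     <- c(1, 1, 0, 0, 0)            # Q-matrix row for a two-attribute item
prod(delta_n ^ q_j)                    # direct conjunctive product: 0.28
exp(sum(q_j * log(delta_n)))           # log-space equivalent:      0.28
```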


The data-generating model is based on the “HO-PINC Model” from this paper.

In my simulation, the data-generating model is as follows:

• \delta_{nk} = \frac{\exp(\lambda_{k} \theta_{n} - \beta_{k})}{1 + \exp(\lambda_{k} \theta_{n} - \beta_{k})}

• \theta_{n}: individual ability (drawn from a standard normal distribution)
• \delta_{nk}: probability that person n has mastered attribute k
• \beta_{k}: intercept in the mastery probability of attribute k
• \lambda_{k}: slope with respect to \theta_{n} in the mastery probability of attribute k
• \delta_{nk} is a probability, so it takes values between 0 and 1.
• \eta_{nj} = \prod_{k=1}^{K} \delta_{nk}^{q_{jk}}

• q_{jk}: entry of the J × K Q-matrix (binary)
• p_{nj} = g_{j} + (1 - s_{j} - g_{j}) \eta_{nj}

• s_{j}: slip parameter in the probability of a correct response to item j
• g_{j}: guessing parameter in the probability of a correct response to item j
• Slip and guess are probabilities, so they take values between 0 and 1.
• The constraint g_{j} < 1 - s_{j} is imposed.
• Y_{nj} \sim \mathrm{Bernoulli}(p_{nj})

• Y_{nj}: observed item response (binary)

In the simulation above, N = 500, K = 5, and J = 15.

The parameters I want to recover this time are β and λ. Their true values are as follows.

• β=(-1.0, -0.5, 0.0, 0.5, 1.0)
• λ=(1.5, 1.5, 1.5, 1.5, 1.5)
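
To make these targets concrete, here is a small worked example (my own illustration) of what the true values imply for an average respondent with θ_n = 0, using item 6 (Q row 1 1 0 0 0):

```r
# Worked example for theta = 0 (my illustration, values from the setup above):
lambda <- rep(1.5, 5)
beta   <- c(-1.0, -0.5, 0.0, 0.5, 1.0)
delta0 <- plogis(lambda * 0 - beta)     # 0.731 0.622 0.500 0.378 0.269
eta6   <- prod(delta0[1:2])             # item 6 requires attributes 1 and 2: ~0.455
p6     <- 0.1 + (1 - 0.1 - 0.1) * eta6  # with slip = guess = 0.1: ~0.464
```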

In the estimation results, β deviates substantially from its true values, even though the convergence indicators suggest the chains have converged.
I have changed the seed and rerun the simulation, but the result is the same; I have been trying to figure out why, but have not been able to solve the problem.
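
For reference, a minimal sketch (my addition) of how the recovery comparison can be tabulated with cmdstanr's summary method:

```r
# Recovery table sketch (my addition): posterior summaries next to true values.
truth <- tibble::tibble(
  variable = c(sprintf("beta[%d]", 1:5), sprintf("lambda[%d]", 1:5)),
  true     = c(-1.0, -0.5, 0.0, 0.5, 1.0, rep(1.5, 5))
)
fit1$summary(c("beta", "lambda")) |>
  dplyr::select(variable, mean, sd, rhat) |>
  dplyr::left_join(truth, by = "variable")
```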