I am currently studying psychostatistics at postgraduate level.
I am using Stan to perform parameter estimation in cognitive diagnostic models (CDM), but I am having difficulty with parameter recovery in simulations and would very much appreciate your advice.
The simulation code and stan code are as follows.
rm(list = ls())
library(tidyverse)
library(rstan)
library(cmdstanr)
seed = 1234
setwd(dirname(rstudioapi::getActiveDocumentContext()$path))
set.seed(seed)
n <- 500
#### advance preparation ####
N <- n #sample size
Q <- matrix(
c(1,0,0,0,0,
0,1,0,0,0,
0,0,1,0,0,
0,0,0,1,0,
0,0,0,0,1,
1,1,0,0,0,
0,1,1,0,0,
0,0,1,1,0,
0,0,0,1,1,
0,1,0,1,1,
1,1,1,0,0,
1,0,0,1,1,
1,0,1,1,1,
1,1,1,0,1,
1,1,1,1,1),
nrow=15, ncol=5, byrow=TRUE) #Qmatrix
J <- nrow(Q) #Number of items; determined by the number of rows in the Q matrix.
K <- ncol(Q) #Number of attributes, determined by the number of columns in the Q matrix.
L <- 1 #L=1 because this time there is only one ability parameter.
#### ability parameter θ ####
theta <- rnorm(N, 0, 1) #Generated from standard normal distribution
#####
#### Attribute acquisition parameters ####
#Set up following previous studies
#The aim of this simulation is to recover these values
lambda <- rep(1.5, 5)
beta <- c(-1.0, -0.5, 0.0, 0.5, 1.0)
#####
#### item parameters guess, slip ####
#Slip, Guess all fixed at 0.1
guess <- rep(0.1, J)
slip <- rep(0.1, J)
#####
#### Attribute acquisition probability ####
delta <- matrix(nrow = N, ncol = K)
for (k in 1:K) {
delta[,k] <- exp(lambda[k] * theta - beta[k]) / (1 + exp(lambda[k] * theta - beta[k]))
}
#####
#### item response probability ####
p <- matrix(nrow = N, ncol = J)
for (j in 1:J) {
p[,j] <- apply(delta, 1, function(x) {
return (guess[j] + (1-slip[j]-guess[j])*prod(x ** Q[j, ]))
}
)
}
#####
#### Item response (observed) ####
Y <- matrix(nrow = N, ncol = J)
for (j in 1:J) {
for (n in 1:N) {
Y[n,j] <- rbinom(1, 1, p[n, j])
}
}
#####
#### estimate ####
datastan <- list(N=N, J=J, K=K, Q=t(as.matrix(Q)), Y=t(as.matrix(Y)))
model <- cmdstan_model("simu_highorder_stan_0610.stan")
fit1 <- model$sample(data=datastan, iter_warmup=1000, iter_sampling=1000, chains=4, parallel_chains=4, refresh=200, seed = seed)
#### result ####
fit1$print('beta')
fit1$print('lambda')
#### result ####
fit1$print('s')
fit1$print('g')
data {
int<lower=0> N;
int<lower=0> J;
int<lower=0> K;
matrix[K,J] Q;
array[J, N] int Y;
}
transformed data {
vector[N] Ones = rep_vector(-1, N);
}
parameters {
vector<lower=0>[K] lambda;
vector[K] beta;
vector<lower=0, upper=1>[J] s;
vector<lower=0, upper=1>[J] g_base;
vector[N] theta;
real<lower=0> sigma_lam;
real<lower=0> sigma_beta;
}
transformed parameters {
vector<lower=0, upper=1>[J] g;
for (j in 1:J) {
if (g_base[j] > 1-s[j]) {
g[j] = 1-s[j]-1e-10;
} else {
g[j] = g_base[j];
}
}
matrix[N, 2] X = append_col(theta, Ones);
matrix[K, 2] b = append_col(lambda, beta);
matrix[N, K] delta;
matrix[N, K] X_bT = X * b';
delta = inv_logit(X_bT);
matrix[N, J] eta;
eta = exp(log(delta + 1e-8)*Q);
matrix[N,J] p;
for (n in 1:N) {
for (j in 1:J) {
p[n, j] = g[j] + (1 - s[j] - g[j]) * eta[n, j];
}
}
}
model {
sigma_lam ~ lognormal(0, 0.5);
sigma_beta ~ lognormal(0, 0.5);
lambda ~ lognormal(0, sigma_lam);
beta ~ normal(0, sigma_beta);
theta ~ normal(0, 1);
s ~ beta(2, 5);
g_base ~ beta(2, 5);
for (j in 1:J) {
Y[j,] ~ bernoulli_logit(p[,j]);
}
}
The data generation model is based on the “HO-PINC Model” from this paper.
In my simulations, the data generation model is as follows:
-
δ_{nk} = \frac{\exp(\lambda_{k} \theta_{n} - \beta_{k})}{1+\exp(\lambda_{k} \theta_{n} - \beta_{k})}
- θ[N]: individual ability (from a standard normal distribution)
- δ[K]: probability of mastering each attribute (k)
- β[K]: intercept in the probability of learning each attribute (k)
- λ[K]: slope with respect to θ in the probability of mastering each attribute (k)
- δ is a probability value, so it takes values between 0 and 1.
-
\eta_{nj} = \prod_{k=1}^K \delta_{nk}^{q_{ik}}
- q[N,J]: Q matrix values (binary variables)
-
p_{nj} = g_{j} + (1-s_{j}-g_{j})\eta_{nj}
- slip[J]:slip parameter in the probability of correct response for each item (j)
- guess[J]:guess parameter of the probability of correct answer for each item (j)
- slip and guess are probability values and therefore take values between 0 and 1.
- The constraint “g_{j} < 1-s_{j}” is entered.
-
Y_{nj} ~ Bernoulli(p_{nj})
- Y_{nj}: Individual item responses (binary variables)
In the aforementioned simulation, N=500, K=5 and J=15.
What we want to estimate this time are β and λ. The true values of these are as follows.
- β=(-1.0, -0.5, 0.0, 0.5, 1.0)
- λ=(1.5, 1.5, 1.5, 1.5, 1.5)
The estimation results are as follows, where β deviates significantly from the true value.The indicators suggest that there is convergence, but the estimates are far off.
I have changed the seed and run the simulation, but the result is the same, and I am trying to figure out why, but have not been able to solve the problem.
I am very troubled.Please give me some advice on this problem.
Any advice on modifications to the code, the theory behind it, etc. would be appreciated.
Thank you in advance.