Hi all,
I am trying to create an IRT model in Stan with missing data by following this official example. However, I have trouble understanding what the data should even look like.
Let’s say that we have 500 respondents and 20 items, and that overall 20% of the data is missing. The example that I’m following states:
The data provided for an IRT model may be declared as follows to account for the fact that not every student is required to answer every question.
And then provides the following code for the data block:
data {
  int<lower=1> J;                // number of students
  int<lower=1> K;                // number of questions
  int<lower=1> N;                // number of observations
  int<lower=1,upper=J> jj[N];    // student for observation n
  int<lower=1,upper=K> kk[N];    // question for observation n
  int<lower=0,upper=1> y[N];     // correctness for observation n
}
For starters, I am confused about the following: is N the total number of observations including missing data (so, 500*20 = 10000), or is N the total number of observations excluding missing data (8000)?
If N is 10000, then I don’t understand what the values of y should look like in the cases where the data is missing. Should they just be coded as NA? But I’ve heard that Stan doesn’t accept this. And if N is 8000, then I don’t understand how we denote which observations are not present in the data, and how we use that in a model.
EDIT:
To try and be more straightforward with my question, I’ve decided to include the following R code used for data simulation:
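library(tidyverse)
library(magrittr)

# Simulate binary responses for 500 respondents on 20 items
theta <- rnorm(500, 0, 1)   # respondent abilities
b <- rnorm(20, 1.5, 1.1)    # item intercepts
a <- rnorm(20, 1.3, 0.6)    # item discriminations
y <- list()
for (i in 1:20) {
  # response probability is the inverse logit of a * theta + b
  y[[i]] <- rbinom(500, 1, plogis(a[[i]] * theta + b[[i]]))
  # set roughly 20% of the responses on each item to NA
  y[[i]][which(rbinom(500, 1, .8) == 0)] <- NA
}
# combine the 20 item vectors into a 500 x 20 tibble
y %<>%
  reduce(bind_cols)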
I was wondering how I can prepare this data for analysis in Stan, as in the official example I linked above.
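To make the question concrete, here is a minimal sketch of the long-format layout that the jj / kk / y declarations seem to imply. It rests on the assumption, not confirmed in this post, that N counts only the observed responses and that missing cells are simply left out rather than coded. The matrix y_wide and the list name stan_data are hypothetical and used only for illustration; the code is base R.

```r
# Hypothetical 4 x 3 response matrix (rows = students, columns = questions)
y_wide <- matrix(
  c( 1, NA,  0,
     0,  1, NA,
     1,  1,  1,
    NA,  0,  0),
  nrow = 4, byrow = TRUE
)

# Row/column positions of the observed (non-NA) cells
obs <- which(!is.na(y_wide), arr.ind = TRUE)

# Assumption: N is the number of observed responses only
stan_data <- list(
  J  = nrow(y_wide),                # number of students
  K  = ncol(y_wide),                # number of questions
  N  = nrow(obs),                   # number of observed responses
  jj = as.integer(obs[, "row"]),    # student index for observation n
  kk = as.integer(obs[, "col"]),    # question index for observation n
  y  = as.integer(y_wide[obs])      # observed response for observation n
)
```

Under this reading, a missing response is identified only by the absence of its (student, question) pair from jj and kk. With the simulated data above, as.matrix(y) could take the place of y_wide.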