Missing data in Stan - some difficulties understanding

Hi all,

I am trying to create an IRT model in Stan with missing data by following this official example. However, I have trouble understanding what the data should even look like.

Let’s say that we have 500 respondents and 20 items, and let’s also say that overall 20% of the data is missing. The example that I’m following states:

The data provided for an IRT model may be declared as follows to account for the fact that not every student is required to answer every question.

And then provides the following code for the data block:

data {
  int<lower=1> J;              // number of students
  int<lower=1> K;              // number of questions
  int<lower=1> N;              // number of observations
  int<lower=1,upper=J> jj[N];  // student for observation n
  int<lower=1,upper=K> kk[N];  // question for observation n
  int<lower=0,upper=1> y[N];   // correctness for observation n
}

For starters, I am confused about the following: Is N the total number of observations including missing data (so, 500*20 = 10000), or is N the total number of observations excluding missing data (8000)?

If N is 10000, then I don’t understand what the values of y should look like in the cases where the data is missing. Should they just be coded as NA? But I’ve heard that Stan doesn’t accept this. And if N is 8000, then I don’t understand how we denote which observations are not present in the data, and how we use that in a model.


To try and be more straightforward with my question, I’ve decided to include the following R code used for data simulation:


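theta <- rnorm(500, 0, 1)   # respondent abilities
b <- rnorm(20, 1.5, 1.1)    # item intercepts
a <- rnorm(20, 1.3, 0.6)    # item discriminations
# d <- rbinom()             # incomplete call (no arguments), so commented out

y <- list()
for (i in 1:20) {
  # responses to item i; plogis(x) equals exp(x) / (1 + exp(x))
  y[[i]] <- rbinom(500, 1, plogis(a[i] * theta + b[i]))
  # set roughly 20% of the responses to item i to missing
  y[[i]][which(rbinom(500, 1, .8) == 0)] <- NA
}
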
y %<>%

I was wondering how I can prepare this data for analysis in Stan, as in the official example I linked above.

The missing values should simply be omitted from y, along with the corresponding entries of jj and kk.

So then N is 8000?

Oh, sorry, yes N=8000. The combination of labels for which participant and item is associated with each response implicitly handles the scenario where the data don’t have all participant-by-item combinations present.
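To make the data preparation concrete, here is a minimal R sketch. It assumes the responses are stored as a 500 x 20 matrix with NA in the missing cells; the names Y, obs and stan_data are illustrative and do not come from the official example. It keeps only the observed cells and records the respondent and the item of each one. Because the missingness is simulated at random, N comes out close to 8000 rather than exactly 8000.

```r
set.seed(1)
J <- 500  # respondents
K <- 20   # items

# Simulate responses in the same way as the question (a * theta + b on the logit scale)
theta <- rnorm(J, 0, 1)
a <- rnorm(K, 1.3, 0.6)
b <- rnorm(K, 1.5, 1.1)
Y <- sapply(1:K, function(k) rbinom(J, 1, plogis(a[k] * theta + b[k])))

# Set roughly 20% of the cells to missing
Y[matrix(rbinom(J * K, 1, 0.8) == 0, J, K)] <- NA

# (row, col) position of every observed cell: row = respondent, col = item
obs <- which(!is.na(Y), arr.ind = TRUE)

stan_data <- list(
  J  = J,
  K  = K,
  N  = nrow(obs),     # number of observed responses only
  jj = obs[, "row"],  # respondent for observation n
  kk = obs[, "col"],  # item for observation n
  y  = Y[obs]         # observed responses; contains no NA
)
```

If the data are in a list of item vectors, as in the question, `Y <- do.call(cbind, y)` should give the same kind of matrix before the `which()` step.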

Thank you very much for the clarification, but I am still trying to wrap my head around one thing.

Say that the 1st respondent has not responded to item number 5 and has responded to all other 19 items. His vector of responses will then be of length 19. However, if the skipped question had been number 8 or 14 or any other number, that vector would still be of length 19. So if we have simply dropped the missing data instead of putting a missing data indicator (NA) in place of respondent 1's response to item 5, I don’t understand how the computer can know that the missingness occurred specifically in that location and not somewhere else, i.e. how it can know that the person didn’t skip, say, the 19th item.

Because jj and kk tell the model which student and item effects are influencing each response.
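As a rough illustration of that indexing, the R sketch below evaluates a log-likelihood over long-format data. It assumes the a * theta + b parameterization from the simulation in the question, which may differ from the one in the official example, and the tiny data set (4 respondents, 3 items, one response dropped) is made up for illustration. Each observation n picks out its own respondent and item parameters through jj[n] and kk[n], so a dropped response simply never enters the sum.

```r
set.seed(2)
J <- 4  # respondents (hypothetical toy size)
K <- 3  # items

theta <- rnorm(J)            # respondent abilities
a <- rnorm(K, 1.3, 0.6)      # item discriminations
b <- rnorm(K, 1.5, 1.1)      # item intercepts

# Long-format data: respondent 1 did not answer item 2, so that row is absent
jj <- c(1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4)
kk <- c(1, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3)
y  <- c(1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1)

# Linear predictor for each observation, looked up via the index vectors
eta <- a[kk] * theta[jj] + b[kk]

# Bernoulli log-likelihood summed over the observed responses only
loglik <- sum(dbinom(y, size = 1, prob = plogis(eta), log = TRUE))
```

In Stan the same lookup would sit inside the model block, with theta, a and b declared as parameters.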

I think I get it! So if the 5th item of the 1st respondent is missing, there will be one fewer 1 in jj, and there will be no 5 among the entries of kk that belong to respondent 1.
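A quick way to check this is with a toy example. The R sketch below uses a hypothetical 2 x 6 response matrix in which respondent 1 skipped item 5, and shows that the skipped item can be recovered from jj and kk alone.

```r
# Hypothetical toy data: 2 respondents, 6 items, all answered correctly
Y <- matrix(1L, nrow = 2, ncol = 6)
Y[1, 5] <- NA  # respondent 1 skipped item 5

# Long format: keep only the observed cells
obs <- which(!is.na(Y), arr.ind = TRUE)
jj <- obs[, "row"]  # respondent index
kk <- obs[, "col"]  # item index

# Number of observations per respondent: 5 for respondent 1, 6 for respondent 2
n_per_respondent <- table(jj)

# Items that never appear for respondent 1
skipped <- setdiff(seq_len(ncol(Y)), kk[jj == 1])
print(skipped)  # 5
```

The order of the rows does not matter here, since each observation carries its own respondent and item label.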