Missing data in Stan - some difficulties understanding

JohnDoe · August 15, 2021, 7:32pm

Hi all,

I am trying to create an IRT model in Stan with missing data by following this official example. However, I have trouble understanding what the data should even look like.

Let’s say that we have 500 respondents and 20 items, and that overall 20% of the data is missing. The example that I’m following states:

The data provided for an IRT model may be declared as follows to account for the fact that not every student is required to answer every question.

And then provides the following code for the data block:

data {
  int<lower=1> J;              // number of students
  int<lower=1> K;              // number of questions
  int<lower=1> N;              // number of observations
  int<lower=1,upper=J> jj[N];  // student for observation n
  int<lower=1,upper=K> kk[N];  // question for observation n
  int<lower=0,upper=1> y[N];   // correctness for observation n
}

For starters, I am confused about the following: Is N the total number of observations including missing data (so, 500*20 = 10000), or is N the total number of observations excluding missing data (8000)?

If N is 10000, then I don’t understand what the values of y should look like where the data is missing. Should they just be coded as NA? But I’ve heard that Stan doesn’t accept this? And if N is 8000, then I don’t understand how we denote which observations are not present in the data, and how we use that in a model.

EDIT:

To try and be more straightforward with my question, I’ve decided to include the following R code used for data simulation:

library(tidyverse)
library(magrittr)

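theta <- rnorm(500, 0, 1)  # abilities of the 500 respondents

b <- rnorm(20, 1.5, 1.1)   # item intercepts

a <- rnorm(20, 1.3, 0.6)   # item discriminations

y <- list()

for(i in 1:20){
  # simulate responses to item i under a 2PL model
  y[[i]] <- rbinom(500, 1, ((exp(a[[i]]*theta + b[[i]]))/(1 + exp(a[[i]]*theta + b[[i]]))))
  # set each response to NA with probability 0.2
  y[[i]][which(rbinom(500, 1, .8) == 0)] <- NA
}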

y %<>%
  reduce(bind_cols)

I was wondering how I can prepare this data for analysis in Stan, as in the official example I linked above.

mike-lawrence · August 15, 2021, 7:47pm

The missing values should simply be omitted from y.

JohnDoe · August 15, 2021, 7:49pm

So then N is 8000?

mike-lawrence · August 15, 2021, 7:58pm

Oh, sorry, yes N=8000. The combination of labels for which participant and item is associated with each response implicitly handles the scenario where the data don’t have all participant-by-item combinations present.
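
For anyone landing here later, here is a minimal R sketch of that omission step. It assumes the responses sit in a respondents-by-items matrix with NA in the missing cells; the name y_wide and the toy values are made up for illustration. With the simulated data from the question, something like as.matrix(y) would presumably play the role of y_wide.

```r
# Toy 3-respondent x 4-item response matrix; NA marks a skipped item.
# (y_wide and its values are hypothetical, for illustration only.)
y_wide <- matrix(c(1, 0, NA, 1,
                   0, NA, 1, 1,
                   1, 1, 0, NA),
                 nrow = 3, byrow = TRUE)

# Row/column positions of the observed (non-missing) cells
obs <- which(!is.na(y_wide), arr.ind = TRUE)
obs <- obs[order(obs[, "row"], obs[, "col"]), ]  # sort by respondent, then item

stan_data <- list(
  J  = nrow(y_wide),               # number of students
  K  = ncol(y_wide),               # number of questions
  N  = nrow(obs),                  # number of observed responses only
  jj = as.integer(obs[, "row"]),   # student for observation n
  kk = as.integer(obs[, "col"]),   # question for observation n
  y  = as.integer(y_wide[obs])     # observed responses, no NAs left
)
```

Here N comes out as 9 rather than 12, and jj, kk and y all have length N, so nothing that Stan would reject as NA is ever passed in.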

JohnDoe · August 15, 2021, 9:44pm

Thank you very much for the clarification, but I am still trying to wrap my head around one thing.

Say that the 1st respondent has not responded to item number 5 and has responded to all other 19 items. His vector of responses will then be of length 19. However, if the skipped question was number 8 or 14 or any other, that vector would still be of length 19. So if we have simply dropped the missing data instead of putting a missing data indicator (NA) in place of respondent 1’s response to item 5, I don’t understand how the computer can know that the missingness occurred specifically in that location and not somewhere else, i.e. how it can know that the person didn’t skip, say, the 19th item instead.

mike-lawrence · August 15, 2021, 9:55pm

Because jj and kk tell the model which student and item effects are influencing each response.
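
To make that concrete, here is a rough R illustration of the lookup. The parameter values are invented, and the a * theta + b form is borrowed from the simulation code in the question rather than from the official example. Stan allows the same kind of indexing by an integer array, so a model would pick out parameters in a similar way.

```r
# Hypothetical parameter values for 2 students and 3 items (illustration only)
theta <- c(-0.5, 1.0)        # one ability per student
a     <- c(1.2, 0.8, 1.5)    # one discrimination per item
b     <- c(0.0, 0.5, -1.0)   # one intercept per item

# Long-format data: student 1 skipped item 2, so that pair never appears
jj <- c(1L, 1L, 2L, 2L, 2L)  # student for each observation
kk <- c(1L, 3L, 1L, 2L, 3L)  # item for each observation

# Indexing by jj and kk yields one value per observed response
eta <- a[kk] * theta[jj] + b[kk]
p   <- plogis(eta)           # probability of a correct answer
```

The second element of eta, for example, uses theta[1], a[3] and b[3]. The missing (student 1, item 2) pair simply contributes no term to the likelihood.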

JohnDoe · August 16, 2021, 6:19am

I think I get it! So if the 5th item of the 1st respondent is missing, there will be one fewer 1 in jj and, with the observations ordered by respondent, there will not be a 5 among the first 19 entries of kk.
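
A quick way to check this understanding in R is to rebuild the wide layout from the long one and see where the NA reappears. This is only a sketch with made-up responses for a single respondent:

```r
# Hypothetical long-format data for one respondent and 20 items,
# where respondent 1 skipped item 5 (illustration only)
K  <- 20
kk <- setdiff(1:K, 5)         # items actually answered: no 5
jj <- rep(1L, length(kk))     # nineteen 1s instead of twenty
y  <- rep(c(0L, 1L), length.out = length(kk))  # made-up responses

# Rebuild the wide layout: cells never mentioned in (jj, kk) stay NA
y_wide <- matrix(NA_integer_, nrow = 1, ncol = K)
y_wide[cbind(jj, kk)] <- y

skipped <- which(is.na(y_wide[1, ]))  # item(s) respondent 1 skipped
```

Here skipped is 5, so the position of the missing response is fully recoverable from jj and kk even though y itself contains no NA.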
