I would definitely recommend @jsocolar’s suggestions here; I’ll just add a little more info.
As an example on a smaller scale, let’s say you assessed 10 people with up to 5 questions. The ‘wide’ data would look like:
> library(tidyverse)
> set.seed(2021)
>
> dat_wide = structure(sample(c(0, 1, NA), 50, replace = TRUE), dim = c(10, 5))
> dat_wide
      [,1] [,2] [,3] [,4] [,5]
 [1,]   NA    0    1    0   NA
 [2,]    1   NA    1   NA   NA
 [3,]    1   NA    0    1    1
 [4,]    1    0    1    1    0
 [5,]   NA   NA   NA   NA    1
 [6,]    1   NA    1   NA   NA
 [7,]    1    1   NA    1    1
 [8,]   NA   NA    0    1   NA
 [9,]    1    0    1    1    0
[10,]    1   NA    0    1    0
Given that several person-question combinations have no responses, it’s computationally wasteful to estimate the p, q, and theta parameters for those combinations. So the natural choice is simply not to (although this is only appropriate when these parameters are independent across person-question combinations).
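To put a number on the waste here: in the simulated data above, 18 of the 50 cells in the person-question grid are empty, which you can check directly:

# Number of person-question combinations with no response
sum(is.na(dat_wide))     # 18
# Proportion of the full grid that would be wasted parameters
mean(is.na(dat_wide))    # 0.36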
To reduce this to a ‘long’ format, you have one row per response with unused combinations omitted:
> dat_long =
+   dat_wide %>%
+   data.frame() %>%
+   mutate(ID = 1:nrow(.)) %>%
+   pivot_longer(X1:X5, values_drop_na = TRUE,
+                names_to = "Q", values_to = "k") %>%
+   mutate(Q = as.integer(factor(Q)))
>
> dat_long
# A tibble: 32 x 3
      ID     Q     k
   <int> <int> <dbl>
 1     1     2     0
 2     1     3     1
 3     1     4     0
 4     2     1     1
 5     2     3     1
 6     3     1     1
 7     3     3     0
 8     3     4     1
 9     3     5     1
10     4     1     1
# … with 22 more rows
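As a quick sanity check, the rows for any one person in the long format should match the non-missing entries in the corresponding row of the wide matrix:

# Long-format rows for person 3...
dat_long %>% filter(ID == 3)
# ...should match the non-NA entries in row 3 of the wide data
dat_wide[3, ]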
The corresponding Stan model for this construction is then:
model2 <- "
data {
  int<lower=0> M;                    // Total number of responses
  int<lower=0> N_ID;                 // Total number of individuals
  int<lower=0> N_Q;                  // Total number of questions
  int<lower=1, upper=N_ID> ID[M];    // Person index for each response
  int<lower=1, upper=N_Q> Q[M];      // Question index for each response
  int<lower=0, upper=1> k[M];        // Observed binary responses
}
parameters {
  vector<lower=0, upper=1>[N_ID] p;  // Per-person probabilities
  vector<lower=0, upper=1>[N_Q] q;   // Per-question probabilities
}
model {
  // Priors for people and questions
  p ~ beta(1, 1);
  q ~ beta(1, 1);
  // Likelihood over the M observed responses only
  k ~ bernoulli(p[ID] .* q[Q]);
}
"
This model can then be fit to the long-format data via:
library(rstan)

dat_list = list(
  M = nrow(dat_long),                   # Total number of responses
  N_ID = length(unique(dat_long$ID)),   # Number of individuals
  N_Q = length(unique(dat_long$Q)),     # Number of questions
  ID = dat_long$ID,
  Q = dat_long$Q,
  k = dat_long$k)

mod = stan(model_code = model2,
           data = dat_list)
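From there you can inspect the person- and question-level estimates in the usual way, e.g.:

# Posterior summaries for the person and question parameters
print(mod, pars = c("p", "q"))

# Or pull out the posterior draws directly
p_draws = rstan::extract(mod)$p   # 4000 x N_ID matrix with rstan defaults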