I would definitely recommend @jsocolar’s suggestions here; I’ll just add a little more info.
As an example on a smaller scale, let’s say you assessed 10 people with up to 5 questions. The ‘wide’ data would look like:
> library(tidyverse)
> set.seed(2021)
>
> dat_wide = structure(sample(c(0, 1, NA), 50, replace = TRUE), dim = c(10, 5))
> dat_wide
      [,1] [,2] [,3] [,4] [,5]
 [1,]   NA    0    1    0   NA
 [2,]    1   NA    1   NA   NA
 [3,]    1   NA    0    1    1
 [4,]    1    0    1    1    0
 [5,]   NA   NA   NA   NA    1
 [6,]    1   NA    1   NA   NA
 [7,]    1    1   NA    1    1
 [8,]   NA   NA    0    1   NA
 [9,]    1    0    1    1    0
[10,]    1   NA    0    1    0
Given that there are several person-question combinations with no responses, it’s computationally wasteful to estimate the p, q, and theta parameters for these combinations. So the natural choice is to not do that (although this is only appropriate when these parameters are independent across person-question combinations).
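To make the waste concrete with the simulated data above, over a third of the 50 person-question cells are empty:

> sum(is.na(dat_wide))
[1] 18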
To reduce this to a ‘long’ format, you keep one row per observed response, with the unused combinations omitted:
> dat_long =
+   dat_wide %>%
+   data.frame() %>%
+   mutate(ID = 1:nrow(.)) %>%
+   pivot_longer(X1:X5, values_drop_na = TRUE,
+                names_to = "Q", values_to = "k") %>%
+   mutate(Q = as.integer(factor(Q)))
>
> dat_long
# A tibble: 32 x 3
      ID     Q     k
   <int> <int> <dbl>
 1     1     2     0
 2     1     3     1
 3     1     4     0
 4     2     1     1
 5     2     3     1
 6     3     1     1
 7     3     3     0
 8     3     4     1
 9     3     5     1
10     4     1     1
# … with 22 more rows
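A quick sanity check that the reshape only dropped the missing responses:

> nrow(dat_long) == sum(!is.na(dat_wide))
[1] TRUE
> anyNA(dat_long)
[1] FALSE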
The corresponding Stan model for this construction is then:
model2 <- "
data {
  int<lower=0> M;     // Total number of responses
  int<lower=0> N_ID;  // Total number of individuals
  int<lower=0> N_Q;   // Total number of questions
  int<lower=1, upper=N_ID> ID[M];  // Person index for each response
  int<lower=1, upper=N_Q> Q[M];    // Question index for each response
  int<lower=0, upper=1> k[M];      // Observed binary response
}
parameters {
  vector<lower=0, upper=1>[N_ID] p;
  vector<lower=0, upper=1>[N_Q] q;
}
model {
  // Priors for people and questions
  p ~ beta(1, 1);
  q ~ beta(1, 1);
  // One Bernoulli term per observed response, via vectorised indexing
  k ~ bernoulli(p[ID] .* q[Q]);
}
"
This can then be fit with the above data via:
dat_list = list(
  M = nrow(dat_long),
  N_ID = length(unique(dat_long$ID)),
  N_Q = length(unique(dat_long$Q)),
  ID = dat_long$ID,
  Q = dat_long$Q,
  k = dat_long$k)
library(rstan)

mod = stan(model_code = model2,
           data = dat_list)
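Once sampling finishes (rstan’s defaults of 4 chains and 2000 iterations apply, since none are specified), the per-person and per-question estimates can be inspected with, e.g.:

print(mod, pars = c("p", "q"))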