Issues with chi square ppmc for sparse data?

I am attempting to replicate Daniel C. Furr et al’s syntax for PPMC here: http://mc-stan.org/users/documentation/case-studies/tutorial_twopl.html

However, I have encountered some problems when taking steps to calculate replicated chi square values. I am using wide formatted data, so my syntax is slightly different from the original document:

score.rep = apply(y_rep, 1, function(x) rowSums(x))
replicated = apply(score.rep, 2, function(x) table(x))

Ideally, this would create a table for each replicated data set. For my case, I have 9 items (i.e., 10 score groups), so this would be my ‘replicated’ object if I had 3 replicated data sets (sorry for the poor formatting):

> replicated
   iterations
    [,1] [,2] [,3]
  0   30   37   29
  1   40   28   46
  2   37   38   26
  3   22   23   25
  4   13   17   20
  5   22   20   22
  6   16   20   12
  7   12    9    8
  8    4    5    8
  9    4    3    4

However, if I have a replicated data set with zero cases of a particular score group, the replicated object looks like this:

> replicated
[[1]]
x
 0  1  2  3  4  5  6  7  8
33 43 32 25 23 19 10 11  4

[[2]]
x
 0  1  2  3  4  5  6  7  8  9
30 40 37 22 13 22 16 12  4  4

[[3]]
x
 0  1  2  3  4  5  6  7  8  9
37 28 38 23 17 20 20  9  5  3

I have figured out how to handle the different structure when calculating chi-square values. However, I am concerned that the chi-square values for some of the replicated data sets will be inaccurate, since their expected counts would be based on 9 score groups (0-8) rather than 10 score groups (0-9).
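
A minimal sketch that reproduces the behaviour described above, using made-up total scores (the objects `scores_full` and `scores_sparse` and their values are invented for illustration and are not from the actual data). When every column produces the same set of score groups, `apply` can simplify the tables into a matrix; when one column is missing a group, the tables differ in length and `apply` returns a list instead.

```r
# Hypothetical total scores: 6 respondents (rows) x 2 replicated data sets (columns).
# Scores only range over 0-2 here to keep the example small.
scores_full   <- cbind(c(0, 1, 2, 0, 1, 2), c(2, 1, 0, 0, 1, 2))
scores_sparse <- cbind(c(0, 1, 1, 0, 1, 0), c(2, 1, 0, 0, 1, 2))  # column 1 has no 2s

full   <- apply(scores_full, 2, function(x) table(x))
sparse <- apply(scores_sparse, 2, function(x) table(x))

is.matrix(full)    # TRUE: every table has length 3, so apply() simplifies
is.list(sparse)    # TRUE: the tables have different lengths, so apply() returns a list
lengths(sparse)    # number of score groups that table() reported for each column
```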

Should this be a concern?

Thank you!
Danny

I missed this before. Even if you have zero cases in a score group in the replication, that doesn’t mean you can’t compute chi-square. You just plug in the zero. It’s only if the expected count for a group is zero that you’ll run into trouble, because the statistic divides by the expected count.
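
To illustrate with made-up counts (the helper name `chisq_stat` and all the numbers are assumptions for this sketch, not taken from the thread or the tutorial): a zero observed count simply contributes (0 - E)^2 / E = E to the sum, whereas a zero expected count puts a zero in the denominator.

```r
# Pearson-type discrepancy: sum over score groups of (O - E)^2 / E.
# chisq_stat is a hypothetical helper name used only for this sketch.
chisq_stat <- function(observed, expected) sum((observed - expected)^2 / expected)

expected <- c(30, 40, 20, 10)

# Observed count of zero in the last group: that term is (0 - 10)^2 / 10 = 10.
obs_zero <- c(35, 45, 20, 0)
chisq_stat(obs_zero, expected)                # finite

# Expected count of zero: division by zero, so the statistic is not finite.
expected_zero <- c(30, 40, 30, 0)
chisq_stat(c(30, 40, 29, 1), expected_zero)   # Inf, from 1 / 0
chisq_stat(c(30, 40, 30, 0), expected_zero)   # NaN, from 0 / 0
```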

@danielcfurr may have more to say. I’m not sure he was on Discourse when you posted this originally (and I’m just catching up on some messages I missed earlier).

Thanks! My original question wasn’t exactly about whether I could run chi-square when I have 0 cases. It was more about whether the formatting discrepancies (in which a score group isn’t represented, even by a 0) would cause issues.

It turns out it caused major problems with computation, but I figured out how to resolve the issue. I restored my summaries to the original format from @danielcfurr’s tutorial by taking the length of each score group (as opposed to creating a table of total scores). The length is 0 when a score group has no cases, whereas the table drops the group entirely. It required a for-loop, so it wasn’t as pretty as the original code, but it did the trick.

If anyone else encounters this issue, here’s what I did (again, this is wide-formatted data, which deviates from the referenced tutorial, so I’m using rowSums to create total scores):


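# chi.test is not a base R function. The definition below is an assumption:
# the usual discrepancy sum((O - E)^2 / E) over score groups.
chi.test = function(observed, expected) sum((observed - expected)^2 / expected)

# observed score data frame
score.obs = rowSums(data)
observed = data.frame(matrix(nrow = 10, ncol = 2))
observed[1:10, 1] = 0:9
for (r in 1:10) {
  # number of respondents with total score r - 1 (0 if the group is empty)
  observed[r, 2] = length(score.obs[score.obs == r - 1])
}
colnames(observed) <- c("score", "obs")
observed$score = as.numeric(as.character(observed$score))

# replicated scores matrix: one column of total scores per replicated data set
score.rep = apply(y_rep, 1, function(x) rowSums(x))
replicated = matrix(nrow = 10, ncol = 2500)
for (r in 1:10) {
  for (c in 1:2500) {
    replicated[r, c] = length(score.rep[score.rep[, c] == r - 1, c])
  }
}

# expected count per score group: mean count across replicated data sets
expected = apply(replicated, 1, mean)
rep.chi2 = apply(replicated, 2, chi.test, expected = expected)
obs.chi2 = chi.test(observed$obs, expected)
# posterior predictive p-value
PPP <- mean(obs.chi2 <= rep.chi2)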
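
A possible loop-free variant of the counting step, as a sketch only. The `y_rep` here is simulated, and its dimensions and the 0.3 success probability are arbitrary assumptions, not the real posterior draws. The idea is to declare every possible score as a factor level, so that `table` reports a 0 for empty score groups and every replicated data set yields a table of the same length.

```r
# Simulated stand-in for y_rep: 50 draws x 200 respondents x 9 binary items.
# The dimensions and the 0.3 success probability are arbitrary assumptions.
set.seed(1)
n_draws <- 50; n_persons <- 200; n_items <- 9
y_rep <- array(rbinom(n_draws * n_persons * n_items, 1, 0.3),
               dim = c(n_draws, n_persons, n_items))

# Total score per respondent (rows) within each replicated data set (columns).
score.rep <- apply(y_rep, 1, rowSums)

# With all possible scores declared as factor levels, table() keeps empty
# score groups as 0, so every column has length 10 and apply() returns a matrix.
replicated <- apply(score.rep, 2, function(x) table(factor(x, levels = 0:n_items)))

dim(replicated)        # score groups by replicated data sets
colSums(replicated)    # each column sums to the number of respondents
```

With a low success probability, a score of 9 almost never occurs in the simulated data, so this mimics the sparse case; the resulting matrix should match what the nested for-loop above produces.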


Thanks for clarifying and reporting back with a solution.