Generated quantities with hold out data and new levels


While working on my latest model, I often pass in a “hold-out” data set so that the generated quantities block gives me posterior predictive draws for this new data. (This borrows from the old-school idea of a training data set and a testing data set.)

y_rep[i] = normal_rng(theta[student[i]], sigma);

Normally this works nicely, but my current model poses a new challenge. It is a mixed effects model, and the hold-out data may contain an effect level that is not in the training data. This is a longitudinal study of students and test scores (repeated measures per student). The hold-out data has a mix of old students (who are in the training data set) and new students.

But if we have a new student in the hold-out data, then theta has no element for student[i]: the index is out of range, and Stan will report an error.

Does anybody have any suggestions on how to handle this indexing issue?


Stan has conditional statements (if/else), so you can branch on whether a student was in the training data. I think what makes sense is to generate new thetas for the levels that appear only in the held-out data set, then use those. If there’s only one new level, that’s straightforward, and it’s not much harder with multiple.
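
Here is a minimal sketch of that idea, not a drop-in solution. It assumes a hierarchical prior theta ~ normal(mu, tau), which the original post does not show. The names mu, tau, J, J_new, N_new, student_new and theta_new are made up for illustration. It also assumes the hold-out student ids are coded so that 1..J are the training students and J+1..J+J_new are the new ones. The array syntax needs Stan 2.26 or later.

```stan
data {
  int<lower=1> N;                           // training observations
  int<lower=1> J;                           // students seen in training
  array[N] int<lower=1, upper=J> student;   // student id per training row
  vector[N] y;                              // training test scores
  int<lower=0> N_new;                       // hold-out observations
  int<lower=0> J_new;                       // students only in hold-out (assumed coding)
  array[N_new] int<lower=1, upper=J + J_new> student_new;
}
parameters {
  real mu;                                  // population mean (assumed)
  real<lower=0> tau;                        // between-student sd (assumed)
  real<lower=0> sigma;                      // residual sd
  vector[J] theta;                          // per-student effects
}
model {
  // Priors on mu, tau and sigma are omitted here; add ones that
  // suit the scale of your scores.
  theta ~ normal(mu, tau);
  y ~ normal(theta[student], sigma);
}
generated quantities {
  vector[J_new] theta_new;                  // effects for unseen students
  vector[N_new] y_rep;
  // One draw per new student, so repeated measures share an effect.
  for (j in 1:J_new) {
    theta_new[j] = normal_rng(mu, tau);
  }
  for (i in 1:N_new) {
    if (student_new[i] <= J) {
      // Old student: reuse the fitted effect.
      y_rep[i] = normal_rng(theta[student_new[i]], sigma);
    } else {
      // New student: use the effect generated above.
      y_rep[i] = normal_rng(theta_new[student_new[i] - J], sigma);
    }
  }
}
```

Drawing theta_new once per new student, rather than once per observation, keeps that student's repeated measures correlated within each posterior draw, which matches the longitudinal structure described above.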