I’m going to frame this in an item response theory (IRT) framework, but I’m open to alternative modeling approaches if this general kind of problem has been addressed in other literatures.
General Background
Suppose a two-parameter logistic (2PL) model for the response of person j to item i, described as follows:
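In equation form (my reconstruction of the notation; the priors and the exact parameterization are placeholders):

```latex
\mathrm{logit}\,P(y_{ij} = 1 \mid \theta_j)
  = \exp(\alpha_i)\,\bigl(\theta_j - (\beta_0 + \gamma_i)\bigr),
\qquad
\theta_j \sim \mathcal{N}(0, 1), \quad
\gamma_i \sim \mathcal{N}(0, \sigma_\gamma^2), \quad
\alpha_i \sim \mathcal{N}(\mu_\alpha, \sigma_\alpha^2)
```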
Assuming that I didn’t mess up any of the notation, this is a random item model where each item parameter is given a random intercept (\gamma_i) in addition to a fixed intercept (\beta_0). Item discriminations, \alpha_i, are constrained to be positive via exponentiation.
Current Question
This model works perfectly well when the number of item responses is fixed (e.g., scoring a particular test with X questions), but I’m interested in a case where the response length is not necessarily the same as the test length. Specifically, I’m thinking of list learning tests in which an individual is read a list of words (which I consider to be the test length) and then repeats as many of the words as they can (order doesn’t matter). The response length may thus be anywhere from zero (no words recalled) to the test length itself.
Previously, I have treated every word recalled as a correct response (y_{ij} = 1) and every word not recalled as an incorrect response (y_{ij} = 0). These models have worked perfectly fine thus far, but I feel that information in the response process is being left out that I’d like to account for in an appropriate manner. Specifically, these are the response process elements for which I’m trying to identify an appropriate modeling approach:

Dependency of Responses: I anticipate that recalling certain words effectively primes or cues recall of other words on the list (through semantic similarity or through the order in which the words were presented). This violates the IRT model’s local independence assumption, and I’d like to incorporate it as a factor that dynamically shifts the predictions for words not yet recalled, conditional on the words recalled so far.
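To make the idea concrete, here is a toy sketch of the kind of dynamic shift I have in mind: the additive priming term, the similarity matrix `sim`, and the weight `lam` are all illustrative assumptions, not a worked-out model.

```python
import math

def recall_logits(theta, beta0, gamma, alpha, recalled, sim, lam=1.0):
    """Logit of recalling each word, shifted by similarity to words
    recalled so far (a simple additive priming term; illustrative only)."""
    logits = []
    for i in range(len(gamma)):
        base = math.exp(alpha[i]) * theta + beta0 + gamma[i]  # static 2PL part
        # dynamic part: strongest cue among already-recalled words
        prime = lam * max((sim[i][r] for r in recalled), default=0.0)
        logits.append(base + prime)
    return logits
```

The point is only that the linear predictor becomes a function of the recall history, so the likelihood has to be evaluated sequentially over the response protocol rather than word-by-word independently.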

Strategy-Induced Dependency: this is really just a special case of the prior point, but I think it is important enough to merit a separate note. Individuals typically use one of two strategies when learning the list: serial encoding (trying to memorize the words in order) or semantic clustering (trying to group the words by similarity). I am currently treating the intrinsic dependency among words as separate from the dependency induced by individual recall strategies. I anticipate that accounting for strategy will involve a mixture model in which the relative serial and semantic dependency effects of the words vary depending on which strategy/mixture component the person is modeled as belonging to.
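Computationally, the per-person marginalization over strategies would look something like the following (a hypothetical two-component log-sum-exp; the component log-likelihoods would come from the serial and semantic versions of the recall model):

```python
import math

def strategy_mixture_loglik(ll_serial, ll_semantic, pi):
    """Marginal log-likelihood of one person's recall protocol over two
    latent strategies (serial vs. semantic) with mixing weight pi.
    Computed on the log scale for numerical stability."""
    a = math.log(pi) + ll_serial        # serial-encoding component
    b = math.log(1.0 - pi) + ll_semantic  # semantic-clustering component
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))
```

In Stan this is the standard `log_mix` / `log_sum_exp` marginalization over a discrete latent class.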

Probability of Stopping Recall: I anticipate that the probability that someone stops recalling additional words is a function of (a) how many words have already been recalled, (b) the relatedness/priming strength of words already recalled versus those yet to be recalled, and (c) the discrepancy between the difficulty of recalling the remaining words and the person’s ability level.
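One way I could imagine formalizing (a) through (c) is as a discrete-time hazard evaluated after each recalled word; the logistic form and all weights here are placeholders, not estimates:

```python
import math

def stop_probability(n_recalled, priming_gap, difficulty_gap,
                     w0=-2.0, w1=0.3, w2=0.5, w3=0.5):
    """Discrete-time hazard of ending recall after the current word.
    n_recalled   : (a) words recalled so far
    priming_gap  : (b) relatedness of recalled vs. remaining words
    difficulty_gap: (c) mean remaining difficulty minus person ability
    All weights are illustrative placeholders."""
    eta = w0 + w1 * n_recalled + w2 * priming_gap + w3 * difficulty_gap
    return 1.0 / (1.0 + math.exp(-eta))
```

Framed this way, the stopping process is exactly the survival submodel mentioned below, sharing ability and difficulty parameters with the IRT submodel.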

Certainty or Confidence Thresholds: I also expect that individuals differ in their internal level of confidence that a candidate response they have in mind is actually correct. In this respect, non-recalled words are not necessarily forgotten or incorrect; they may simply be words the individual was not sufficiently confident were on the list to offer as a guess. I currently think of this as a kind of person-level effect, where there is some threshold of response probability that a word must exceed before the person will provide it as a response (i.e., a simple dichotomous “if p(word) > 0.50, then guess” rule is misleading).
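As a sketch, the rule I have in mind replaces the fixed 0.50 cutoff with a person-specific threshold tau_j (the names and the deterministic form are hypothetical; in practice this would be a censoring mechanism in the likelihood):

```python
import math

def provides_word(latent_logit, tau_j):
    """A word is offered only if its subjective recall probability
    clears the person-specific confidence threshold tau_j.
    tau_j = 0.5 recovers the naive dichotomous rule."""
    p = 1.0 / (1.0 + math.exp(-latent_logit))
    return p > tau_j
```

Under this view, a non-response is consistent with either a low recall probability or a high threshold, which is precisely the identifiability tension that would need informative priors or extra data (e.g., confidence ratings) to resolve.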
Help Sought From This Post
Ultimately, I’m hoping to gain a deeper and better understanding of how to go about modeling these kinds of complexities. I don’t necessarily need the answer so much as direction, recommendations, resources, and examples of how to handle these kinds of considerations. These modeling problems initially seem rather straightforward, since I am literally listing out the parameters and processes that I think need to be modeled; however, I invariably run into non-identifiability, posteriors that are impossible to sample, prohibitively long estimation times, and many other practical difficulties as I try to address these topics.
Some thoughts that I’ve had as to how to approach this thus far:

Joint Longitudinal-Survival Model: the longitudinal component here is the IRT model, with the probability of the recall trial stopping estimated via the survival component. Associating the two components via the ability and the average difficulty of the remaining items would seem reasonable, but I haven’t seen a joint model along these lines before.

Hidden Markov Model: whether a word has been learned seems like a fairly intuitive latent state for the observed responses to indicate. I’ve personally never used HMMs, and I don’t feel particularly confident about the assumptions or conditions necessary for their appropriate use. Additionally, I worry that an HMM could remove intuitive clinical/scoring interpretations of the model parameters.
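For intuition, here is a minimal two-state (not-learned/learned) forward pass over a single word's recall history across repeated study-test trials; all transition and emission probabilities are made-up placeholders:

```python
def forward_likelihood(obs, p_learn=0.3, p_forget=0.05, p_recall=(0.1, 0.9)):
    """P(observed recall sequence) under a two-state HMM.
    obs: list of 0/1 recalls of one word across trials.
    State 0 = not yet learned, state 1 = learned.
    All probabilities are illustrative placeholders."""
    trans = [[1 - p_learn, p_learn],      # from not-learned
             [p_forget, 1 - p_forget]]    # from learned

    def e(y, s):  # emission: P(recall = y | state s)
        return p_recall[s] if y == 1 else 1 - p_recall[s]

    alpha = [e(obs[0], 0), 0.0]  # assume every word starts unlearned
    for y in obs[1:]:
        alpha = [sum(alpha[s] * trans[s][t] for s in (0, 1)) * e(y, t)
                 for t in (0, 1)]
    return sum(alpha)
```

The learned/not-learned state keeps a clinical interpretation (probability a word has been acquired by trial t), which may partly address the interpretability worry.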

Context Maintenance and Retrieval (CMR) Model: this is a model developed in cognitive research settings specifically for list learning. It is a relatively simple neural network model, but I know that Stan doesn’t play too nicely with these kinds of models. An implementation in Stan is available (ncms_toolbox/models/cmr_stan at master · vucml/ncms_toolbox · GitHub), and I’ve adapted it to my own use case. Unfortunately, I cannot obtain reliable posterior samples from it, and I don’t entirely know whether that’s because the model is simply a poor fit for Stan’s sampler or because I’ve messed something up along the way.
Thanks in advance to anyone who responds, provides insights, shares resources, or assists in any way!