Modeling Likelihood of both Response Pattern and Length

I’m going to frame this in an item response theory (IRT) framework, but I’m open to alternative modeling approaches if this general kind of problem has been addressed in other literature bases.

General Background
Suppose a two parameter logistic model for the response of person j to item i described as follows:

y_{ij} \sim Bernoulli(logit^{-1}(\eta_{ij}))
\eta_{ij} = \alpha_i\theta_j + b_i
\theta_j \sim N(0, 1)
\alpha_i = e^{a_i}
a_i = \beta^{a}_{0} + \gamma^{a}_i
b_i = \beta^{b}_{0} + \gamma^{b}_i
\begin{pmatrix} \gamma^a_i \\ \gamma^b_i \end{pmatrix} \sim MVN(0, \Sigma)
\Sigma = \begin{pmatrix} \sigma^2_a & \rho\sigma_a\sigma_b \\ \rho\sigma_a\sigma_b & \sigma^2_b \end{pmatrix}

Assuming that I didn’t mess up any of the notation, this is a random item model where each item parameter is given a random intercept (\gamma_i) in addition to a fixed intercept (\beta_0). Item discriminations, \alpha_i, are constrained to be positive via exponentiation.
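
For concreteness, a minimal Stan sketch of this model might look like the following (a non-centered parameterization of the item effects; all variable names are just my own conventions):

```stan
data {
  int<lower=1> I;                        // number of items
  int<lower=1> J;                        // number of persons
  array[I, J] int<lower=0, upper=1> y;   // item responses
}
parameters {
  vector[J] theta;                  // person abilities
  real beta_a;                      // fixed intercept, log-discrimination
  real beta_b;                      // fixed intercept, easiness
  matrix[2, I] z;                   // standardized item effects
  cholesky_factor_corr[2] L_Omega;  // correlation of item effects
  vector<lower=0>[2] sigma;         // scales of item effects
}
transformed parameters {
  matrix[2, I] gamma_item = diag_pre_multiply(sigma, L_Omega) * z;
  vector[I] alpha = exp(beta_a + gamma_item[1]');  // positive discriminations
  vector[I] b = beta_b + gamma_item[2]';           // easiness intercepts
}
model {
  theta ~ std_normal();
  beta_a ~ normal(0, 0.5);
  beta_b ~ normal(0, 2);
  to_vector(z) ~ std_normal();
  L_Omega ~ lkj_corr_cholesky(2);
  sigma ~ normal(0, 1);
  for (i in 1:I) {
    y[i] ~ bernoulli_logit(alpha[i] * theta + b[i]);
  }
}
```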

Current Question
This model is perfectly fine when responses are fixed in length (e.g., scoring a test with a set number of questions), but I’m interested in a case where the response length is not necessarily the same as the test length. Specifically, I’m thinking of list learning tests in which an individual is read a list of words (which I consider to be the test length) and then repeats as many of the words as they can (order doesn’t matter). The response length may thus be anywhere between zero (no words recalled) and the test length itself.

Previously, I have treated every word recalled as a correct response (y_{ij} = 1) and every word not recalled as an incorrect response (y_{ij} = 0). These models have worked perfectly fine thus far, but I feel that information about the response process is being left out that I’d like to account for in an appropriate manner. Specifically, these are the response-process elements for which I’m trying to identify an appropriate modeling approach:

  1. Dependency of Responses - I anticipate that recalling certain words is, in effect, a prime or cue for recalling other words on the list (whether through semantic similarity or through the order in which the words were presented). This violates the IRT model’s local independence assumption, and I’d like to incorporate it as a factor that dynamically shifts the predicted probability of words yet to be recalled conditional on the words recalled so far (a rough sketch of one way to encode this follows this list).

  2. Strategy Induced Dependency - this is really just a special case of the prior point, but I think it is important enough to merit a separate note. Individuals typically use one of two strategies when learning the list: serial encoding (trying to memorize the words in order) or semantic clustering (trying to group the words based on similarities). I am currently treating the intrinsic dependency among words as separate from the dependency induced by individual recall strategies. I anticipate that accounting for strategy will involve a mixture model in which the relative serial and semantic dependency effects of the words vary depending on which strategy/mixture component the person is modeled as belonging to.

  3. Probability of Stopping Recall - I anticipate that the probability that someone stops recalling additional words is a function of (a) how many words have already been recalled, (b) the relatedness/priming strength of words already recalled versus those yet to be recalled, and (c) the discrepancy between the difficulty of recalling the remaining words and the person’s ability level.

  4. Certainty or Confidence Thresholds - I also expect that individuals differ in how confident they must be that a candidate response is correct before they will offer it. In this respect, non-recalled words are not necessarily forgotten: they may simply be words the individual was not sufficiently confident were on the list to provide as a guess. I’m currently thinking of this as a person-level effect in which some threshold of response probability must be exceeded before a word is given as a response (i.e., a simple dichotomous “if p(word) > 0.50, then guess” rule is misleading).
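
To make the first point concrete, the simplest version I can imagine is an auto-logistic-style pseudo-likelihood in which each word’s logit gets a boost from the similarity-weighted set of other words the person recalled. To be clear, this is only a structural sketch: it ignores recall order, it conditions on the observed responses (so it isn’t a proper generative likelihood), and the similarity matrix S would have to come from outside the model (e.g., semantic embeddings):

```stan
data {
  int<lower=1> I;                        // list length
  int<lower=1> J;                        // number of persons
  array[J, I] int<lower=0, upper=1> y;   // 1 = word recalled
  matrix[I, I] S;                        // precomputed word-word similarity, zero diagonal
}
parameters {
  vector[J] theta;        // person abilities
  vector[I] b;            // word easiness
  real<lower=0> lambda;   // strength of similarity-based cuing
}
model {
  theta ~ std_normal();
  b ~ normal(0, 2);
  lambda ~ normal(0, 1);
  for (j in 1:J) {
    // cuing: similarity-weighted count of the other words person j recalled
    vector[I] cue = S * to_vector(y[j]);
    y[j] ~ bernoulli_logit(theta[j] + b + lambda * cue);
  }
}
```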

Help Sought From This Post
Ultimately, I’m hoping to gain a deeper and better understanding of how to go about modeling these kinds of complexities. I don’t necessarily need the answer so much as direction, recommendations, resources, and examples of how to handle these kinds of considerations. These modeling problems initially seem rather straightforward, since I can literally list out the parameters and processes that I think need to be modeled; in practice, however, I invariably run into non-identifiability, impossible-to-sample posteriors, prohibitively long estimation times, and many other practical difficulties.

Some thoughts that I’ve had as to how to approach this thus far:

  • Joint Longitudinal-Survival Model - the longitudinal model here is the IRT model, with the probability of the recall trial stopping estimated via the survival model. Associating the two components via the ability and the average difficulty of the remaining items would seem reasonable, but I haven’t seen a joint model along these lines before (a rough sketch of this idea follows this list).

  • Hidden Markov Model - whether a word has been learned seems like an intuitive latent state that the observed responses indicate. I’ve personally never used HMMs, and I don’t feel particularly confident regarding the assumptions or conditions necessary for their appropriate use. Additionally, I worry that an HMM would remove intuitive clinical/scoring interpretations of the model parameters.

  • Context Maintenance and Retrieval Model - this is a specific model developed in cognitive research settings for list learning. It is a relatively simple neural network model, but I know that Stan doesn’t play too nicely with these kinds of models. An implementation in Stan is available (ncms_toolbox/models/cmr_stan at master · vucml/ncms_toolbox · GitHub), and I’ve adapted it to my own use case. Unfortunately, I cannot obtain reliable posterior samples from it, and I don’t entirely know whether that’s because the model is simply not ideal for Stan estimation or because I’ve messed something up along the way.
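
For the joint model idea, what I have in mind is something like a discrete-time hazard of stopping after each recalled word, sharing theta with the IRT submodel. This is only a rough sketch: the covariates, the linear hazard form, and all names are placeholders, and the IRT likelihood itself is omitted:

```stan
data {
  int<lower=1> J;                              // number of persons
  int<lower=1> I;                              // list length
  array[J] int<lower=0, upper=I> n_recalled;   // words recalled before stopping
}
parameters {
  vector[J] theta;    // ability, shared with the (omitted) IRT submodel
  real kappa_0;       // baseline logit of stopping
  real kappa_count;   // effect of number of words recalled so far
  real kappa_theta;   // association between ability and stopping
}
model {
  theta ~ std_normal();
  kappa_0 ~ normal(0, 2);
  kappa_count ~ normal(0, 1);
  kappa_theta ~ normal(0, 1);
  for (j in 1:J) {
    int n = n_recalled[j];
    // continued recalling after each of the first n - 1 words...
    for (t in 1:(n - 1)) {
      target += bernoulli_logit_lpmf(0 | kappa_0 + kappa_count * t + kappa_theta * theta[j]);
    }
    // ...and stopped after word n (no stopping event if the whole list was recalled)
    if (n < I) {
      target += bernoulli_logit_lpmf(1 | kappa_0 + kappa_count * n + kappa_theta * theta[j]);
    }
  }
}
```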

Thanks in advance to anyone who responds, provides insights, shares resources, or assists in any way!

Hi, @wgoette. These open-ended modeling questions are hard to answer on the forum unless they ring a bell with someone who knows some related work.

Was there a Stan question in there somewhere we can answer in the meantime? I did try to address Stan and neural networks below, but you’re right they don’t play nicely together (it’s not actually Stan—Bayes doesn’t play nicely with NNs because posterior integration is intractable).

This model has to be incredibly challenging to identify. It’s hard enough with just an IRT 2PL model, though you do have the first step here, which is pinning the thetas to standard normal and factoring the discrimination (alpha) so it only applies to ability (theta). Identification suffers whenever you add extra degrees of freedom that can explain the same thing (the original IRT 1PL problem: maybe it’s not a bad student, maybe it’s just a hard question); when you add item-level and person-level effects here, it’s going to get very challenging. Don’t you want your a[i] centered around zero rather than being offset? That way, the discriminations are centered around 1, which helps with identification.
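
Concretely, something like this for the discrimination parameterization (just a sketch of the prior structure; the names are arbitrary):

```stan
data {
  int<lower=1> I;  // number of items
}
parameters {
  vector[I] a_raw;        // log-discriminations, centered at zero
  real<lower=0> sigma_a;  // scale of log-discriminations
}
transformed parameters {
  // discriminations multiplicatively centered around 1
  vector[I] alpha = exp(sigma_a * a_raw);
}
model {
  a_raw ~ std_normal();
  sigma_a ~ normal(0, 0.5);  // half-normal keeps alpha near 1 a priori
}
```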

This sounds like it may lead to a combinatorial explosion when trying to evaluate log densities. Anything that involves reasoning over sets is challenging. The kinds of ballistic accumulator models people use in psycholinguistics for reading times are super expensive to compute.

Again, you are probably thinking about learning sets of words, which will be combinatorially prohibitive. You also wind up introducing a transition matrix that is quadratic in the number of states.

It’s not so much Stan as that these ML models can’t be made to pass simulation-based calibration tests. Their posteriors tend to be so multimodal that traditional sampling won’t work; that is, you can’t technically compute posterior intervals. So all kinds of approximations get used. Mostly it’s stochastic gradient descent, which seems able to solve the optimization problem relatively well. You can then sample around a mode, which some people like to do, but that’s not giving you proper Bayesian posterior inference.

I’m afraid the 300-line Stan model is a bit much for me to read in full, but it doesn’t look like a neural network model insofar as there are Cholesky factors of covariance matrices involved! One of the things you need to do in models with lots of random effects is try to achieve identification either by pinning one of the values to zero (asymmetric, pushing that effect into the intercept) or by enforcing a sum-to-zero constraint on the parameters, at which point you can use ordinary priors, though they don’t quite act the way you might expect because of the reduced degrees of freedom.
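
For example, the sum-to-zero version of an item effect looks like this (sketch only):

```stan
data {
  int<lower=2> I;  // number of items
}
parameters {
  vector[I - 1] b_free;  // I - 1 free item effects
}
transformed parameters {
  // the last effect is determined by the constraint, so b sums to zero
  vector[I] b = append_row(b_free, -sum(b_free));
}
model {
  b_free ~ normal(0, 1);  // implied prior on b is not iid because of the constraint
}
```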