This is really more of a conceptual question that I’m hoping to get clarity on before trying to make the modeling work. In the past, I’ve tried getting a finite mixture IRT model to work, but I never really cracked how to specify these models without serious problems for the posterior sampling.
Context of Problem
My interest is in fitting item response theory models to list-learning tests in which a set of words is read to a participant several times, and after each repetition the participant says back as many words as they can remember.
I’m preparing to work on one such test where all the words can be grouped into one of four categories. The existing literature suggests that the strategy someone uses on this test contributes to differences in test performance, with categorical clustering of the words being the most efficient, followed by serial clustering (i.e., trying to remember the words in the order they are presented), and then an unorganized strategy (e.g., just saying whatever comes to mind as soon as it does).
The Problem
The crux of the modeling question, to my mind, is how to most accurately and efficiently adapt the IRT estimation to account for interindividual differences in strategy. The primary concern is that failing to account for the strategy used on this test will result in an estimate of the latent trait that is contaminated by non-memory-related information arising simply from the use of a more or less efficient learning strategy.
General Code Caveat: There are definitely some issues, like local independence violations, that aren’t addressed in all the model code. I’m aware of that; I really just want to give a general idea of each model’s specification, so I’m fine with those details missing in order to keep the code presentations focused on my specific questions.
Possibility A: Treat Strategy as Continuous DIF
My first thought was to approach the matter as one of differential item functioning (DIF) wherein the item properties differ depending on what strategy is used. To run this model, some kind of indicator variable would have to be created from the data to count the number of words recalled within categories on each trial (`category`) and the number of words recalled in serial order on each trial (`serial`). In this case the model looks something like this:
```r
brm(bf(Resp ~ beta + exp(logalpha) * theta,
       nl = TRUE, decomp = "QR",
       theta ~ 0 + (1 | ID),
       beta ~ 0 + Items + Items:category + Items:serial,
       logalpha ~ 0 + Items + Items:category + Items:serial),
    data = df, family = brmsfamily("bernoulli", link = "logit"),
    prior = priors, iter = 3000, warmup = 1000,
    backend = "cmdstanr")
```
The interactions are probably not specified right there, but it gets the idea across that the interest is just adding something to let the item parameters change as a function of how much one strategy is used.
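For concreteness, here is one way the `category` and `serial` counts might be derived from a recall sequence. This is only a sketch: `list_order`, `word_cat`, the toy words, and the function name are all made up for illustration, and real scoring rules for clustering indices vary by test.

```r
# Sketch of deriving per-trial strategy indicators from a recall sequence.
# `list_order` gives each word's presented position; `word_cat` its category.
list_order <- c(apple = 1, hammer = 2, pear = 3, saw = 4, plum = 5, drill = 6)
word_cat   <- c(apple = "fruit", hammer = "tool", pear = "fruit",
                saw = "tool", plum = "fruit", drill = "tool")

strategy_counts <- function(recalled) {
  # all adjacent pairs in the recall order
  pairs <- cbind(recalled[-length(recalled)], recalled[-1])
  # category: adjacent recalls drawn from the same semantic category
  category <- sum(word_cat[pairs[, 1]] == word_cat[pairs[, 2]])
  # serial: adjacent recalls that follow the presented order exactly
  serial <- sum(list_order[pairs[, 2]] - list_order[pairs[, 1]] == 1)
  c(category = category, serial = serial)
}

strategy_counts(c("apple", "pear", "plum", "hammer", "saw"))
# category = 3 (apple-pear, pear-plum, hammer-saw); serial = 0
```

Applied per person and trial, these counts would then be merged into `df` before fitting.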
Possibility B: Treat Strategy as Conditional DIF
The issue with treating this as a pure continuous DIF problem is that it ignores the non-independence introduced by an individual who is using the categorical clustering strategy. In other words, if someone recognizes that words can be grouped into a category and begins doing so, then items within a learned category are no longer independent of one another. This could be modeled using local dependence indicators (0 = item recall independent of the previous response, 1 = item recall dependent on/related to the previous response). In this case, a categorical dependence matrix (`category_dep`) and a serial dependence matrix (`serial_dep`) would be needed and can be interacted with items to create DIF only when one strategy is being used:
```r
brm(bf(Resp ~ beta + exp(logalpha) * theta,
       nl = TRUE, decomp = "QR",
       theta ~ 0 + (1 | ID),
       beta ~ 0 + Items + Items:category_dep + Items:serial_dep,
       logalpha ~ 0 + Items + Items:category_dep + Items:serial_dep),
    data = df, family = brmsfamily("bernoulli", link = "logit"),
    prior = priors, iter = 3000, warmup = 1000,
    backend = "cmdstanr")
```
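One way the 0/1 dependence indicators could be built, again as a sketch: this assumes each trial’s rows are ordered by recall position, with hypothetical columns `cat` (word category) and `pos` (presented list position); none of these names come from the actual data.

```r
# Sketch of building the 0/1 local-dependence indicators described above.
# Rows must be ordered by recall position within a trial.
dep_indicators <- function(trial_df) {
  prev_cat <- c(NA, head(trial_df$cat, -1))
  prev_pos <- c(NA, head(trial_df$pos, -1))
  # 1 if the previous recall came from the same category
  trial_df$category_dep <- as.integer(!is.na(prev_cat) & trial_df$cat == prev_cat)
  # 1 if the previous recall occupied the immediately preceding list position
  trial_df$serial_dep <- as.integer(!is.na(prev_pos) & trial_df$pos - prev_pos == 1)
  trial_df
}

dep_indicators(data.frame(cat = c("fruit", "fruit", "tool", "tool"),
                          pos = c(1, 3, 4, 5)))
# category_dep: 0 1 0 1; serial_dep: 0 0 1 1
```

For the full data set this would be applied within each ID-by-Trial group, e.g. via `split()` and `lapply()`.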
Possibility C: Model Strategies as Mixture Components
Ultimately, the issue that I have with the DIF-related models is that they leave information on the table and treat the strategies as something continuous (e.g., one may use more of a categorical than a serial strategy), but this isn’t consistent with clinical knowledge. Instead, it is much more the case that individuals use one strategy to learn the list. Further, we know that certain kinds of individuals are more likely to use the categorical strategy over the serial strategy. Individuals with high verbal reasoning, semantic language, and problem-solving skills are much more likely to readily identify that the learning task can be simplified by reorganizing the list.
This information seems like something that could be passed to a mixture model through the mixing ratio like this:
```r
# note: in a brms mixture, the mixing proportions are named theta1, theta2, ...,
# so the latent trait is renamed here to avoid clashing with the nlpar "theta"
brm(bf(Resp ~ beta + exp(logalpha) * ability,
       nl = TRUE, decomp = "QR",
       ability ~ 0 + (1 | ID),
       beta ~ 0 + Items,
       logalpha ~ 0 + Items,
       theta1 ~ verbal + semantic + problem),
    data = df, family = mixture("bernoulli", "bernoulli"),
    prior = priors, iter = 3000, warmup = 1000,
    backend = "cmdstanr")
```
The issues that I encounter with these kinds of specifications are the following:

How do I identify the clusters?
In a Gaussian mixture, the intercepts can be ordered so that one group has a lower value than the other(s). In a nonlinear model like this, I don’t know how to specify it so that the terms `beta` and `logalpha` are ordered to be different between the two classes.
How do I check whether the clusters identified correspond to strategy over something else?
My understanding of this is that the model will try to identify two separate groups (or however many I specify in the mixture), but there’s no guarantee that the model splits those groups as I am anticipating. That’s fine if that’s the case – still good information – but it then leaves me in the place of trying to figure out whether there is still a need to fold in strategy as a factor affecting performance within these clusters.
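On the second question, one post-hoc check is available in brms itself: `pp_mixture()` returns each observation’s posterior probability of belonging to each mixture component. Averaging those within person and comparing them with the clustering counts is one way to see whether the recovered classes track strategy. A sketch, assuming a fitted mixture model `fit` and the `category` count from Possibility A:

```r
# Sketch: check whether recovered mixture components track strategy use.
# Assumes `fit` is the fitted mixture model and `df` contains the per-trial
# `category` clustering count from Possibility A.
library(brms)

post <- pp_mixture(fit)           # N x summary-statistic x component array
p_comp2 <- post[, "Estimate", 2]  # P(observation belongs to component 2)

# average membership probability per person, then compare with strategy use
person_p2  <- tapply(p_comp2, df$ID, mean)
person_cat <- tapply(df$category, df$ID, mean)
cor(person_p2, person_cat)  # near 0 would suggest the split isn't strategy-driven
```

If the correlation is weak, that would point toward the classes capturing something other than strategy, which is exactly the ambiguity described above.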
Possibility D: Latent Change Model for Discretizing Strategy
The other alternative that I could think of to get learning strategy into specific groups was to use a latent change model with the `category` and `serial` count variables. My understanding of the latent change model is that it tries to identify a kind of tipping point where a continuous variable could be discretized meaningfully. This seems like a promising application for this problem, as it lets me get discrete groups for a more traditional DIF model without having to figure out how to identify a mixture model.
The problems for me with this idea are the following:

Is it possible to fit a latent change model in brms?
I don’t have to use brms; it’s just so dang convenient. I’ve tried implementing latent change models in cmdstanr before, but the notation gets pretty confusing to me, and I’ve never been successful with it as a result. 
How can the strategy be discretized by the number of times the list has been presented already?
Individuals almost always begin with a serial strategy when they first hear the list (because why would you expect to be able to categorize the words?). It is usually on the second presentation of the list that people recognize the words can be grouped, and then it’s really that third trial when people start using the strategy. That said, factors like problem-solving skills and processing speed can affect the rate of adoption of a more efficient strategy. I imagine that there is some way of incorporating a trial indicator as a random effect somehow – e.g., `... + (cut_category + cut_serial | Trial)`, perhaps? 
How would a latent change model like this differ from a recursive partitioning tree model?
One of the more interesting (imo) ways of dealing with DIF from a variety of potential discrete and continuous covariates is through recursive partitioning methods. An example of this approach can be found here, and it is implemented in R for the Rasch model in the package psychotree. The major formative paper for the recursive Rasch tree models is here. As far as I know, these models have been examined and packaged for only a very narrow subset of IRT models, but I think the methods have a lot of potential. So, if there’s a Bayesian way of getting this to work generally, then I think that’s great. It seems that the latent change model gets at the tree partitioning, but I don’t see how it could be extended to permit recursive DIF detection among a series of variables – maybe it’s possible with some kind of Gaussian copula?
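On the feasibility question for brms: one way to approximate a latent change model is a nonlinear change-point formula, where the clustering count shifts from a pre-change level to a post-change level around a person-specific trial `omega`. Everything below is an illustrative sketch under assumed names: `pre`, `post`, `omega`, the covariates `problem`/`speed`, the Poisson family for counts, and the fixed logistic steepness (5) are all my inventions, not established specifications.

```r
# Hedged sketch of a change-point model for the per-trial category count:
# level `pre` before a person-specific change point `omega` (in trials),
# level `post` after, with a smooth logistic transition between them.
brm(bf(category ~ pre + (post - pre) * inv_logit(5 * (Trial - omega)),
       pre ~ 1 + (1 | ID),
       post ~ 1 + (1 | ID),
       # adoption point shifted earlier/later by person-level covariates
       omega ~ 1 + problem + speed + (1 | ID),
       nl = TRUE),
    data = df, family = poisson(),  # count outcome; an assumption
    prior = c(prior(normal(0, 1), nlpar = "pre"),
              prior(normal(1, 1), nlpar = "post"),
              prior(normal(3, 1), nlpar = "omega", lb = 1)),
    backend = "cmdstanr")
```

Posterior estimates of `omega` per person could then serve as the discretization point, with the covariate effects on `omega` capturing differences in the rate of strategy adoption.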
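For comparison, the frequentist Rasch tree mentioned above is only a few lines in psychotree. The data-layout details here are assumptions: `df_wide$resp` would be a person-by-item 0/1 response matrix stored as a matrix column, and the covariate names are carried over from the mixture sketch.

```r
# Frequentist comparison point: a Rasch tree recursively splits the sample
# over covariates wherever the Rasch item parameters differ (DIF).
library(psychotree)

rt <- raschtree(resp ~ verbal + semantic + problem + speed, data = df_wide)
plot(rt)  # terminal nodes show separate item-difficulty profiles
```

The tree handles the recursive-DIF-detection part automatically, which is the piece I don’t see how to replicate with a single latent change model.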
Wrap-Up
Any and all thoughts or suggestions are highly encouraged. Like I said, this is really more of a conceptual question for me as I start trying to think about what model captures the data generation process most accurately. That being said, there’s definitely a technical element to this too that is currently well beyond my previous experience with brms or cmdstanr, so I’m definitely eager to get any code examples or other kinds of help on that front. I guess, at the end of the day, just any way of thinking through the issue is helpful to me!