How to model a recognition experiment with two measurement time points?

Hello! As part of my master’s degree in psychology we are currently conducting a word recognition experiment online. The conditions are varied within-subjects: all participants complete both the control and the experimental condition at the first session (T1). At the first session the participants receive a treatment which is expected to improve recognition performance in the recognition task of the experimental condition.
One week later the participants complete the control and experimental conditions again (T2). We want to evaluate whether the treatment effect persists from T1 to T2, specifically whether recognition performance is still higher in the experimental condition.


I’m gonna use the brms package to specify and fit an ordinal probit model with heteroscedastic error.

The Variables are defined as follows:

Variables of the Model

item = factor variable with 2 levels [new;old]
condition = factor variable with 2 levels [K;E]
time2 = numeric with [0 = T1; 1= T2]
old = numeric with [0 = new; 1 = old]


But I am unsure about:

  1. The condition x item interaction is the difference in detection of old vs. new items (a change in the category probabilities, i.e. a shift relative to the thresholds) between the experimental and the control condition at T1, isn’t it? Does this interaction use just the data from T1 and not all the data from T1 + T2?
  2. I used the discrimination parameter to allow the standard deviation of the old (previously learned) items to differ from 1. Do I have to include more discrimination parameters for condition and time?
  3. The condition x time2 x item interaction is the change in detection from T1 to T2. Does a non-significant coefficient for this interaction term imply an enduring effect of the treatment after one week, given that the condition x item interaction was significant?
  4. What does | i | model in this case? Is its benefit a more accurate estimation of the coefficients?

The Model I used:

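library(brms)

# Unequal-variance SDT model as a cumulative probit regression
uvsdt <- brm(
  bf(Response ~ 1 + item * condition * time2 + (1 + item * condition * time2 | i | ID_T1T2),
     # disc submodel: lets the latent SD differ for old items (old = 0/1)
     disc ~ 0 + old + (0 + old | i | ID_T1T2)),
  data = data_hypnomemory, family = cumulative("probit"),
  iter = 2500, inits = 0  # recent brms versions name this argument `init`
)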

Hi,
sorry for taking quite long to respond:

One way to understand what interactions such as condition * item mean is to expand the dummy coding that will be used in the model. Assuming the reference levels are new and K, we get four cases:

Case      Intercept  item     condition     item:condition
[new, K]  1          0        0             0
[new, E]  1          0        1             0
[old, K]  1          1        0             0
[old, E]  1          1        1             1

I.e. ignoring all other elements in the formula, the linear predictor for [new, E] will equal 1 * b_Intercept + 0 * b_itemold + 1 * b_conditionE + 0 * b_itemold:conditionE = b_Intercept + b_conditionE. Similarly, for [old, E] the linear predictor will equal 1 * b_Intercept + 1 * b_itemold + 1 * b_conditionE + 1 * b_itemold:conditionE. With that in mind we can see that:

([old, E] - [new, E]) - ([old, K] - [new, K]) = 1 * b_itemold:conditionE

or equivalently

([old, E] - [old, K]) - ([new, E] - [new, K]) = 1 * b_itemold:conditionE

Thinking about how the actual linear predictors are built is a very general way to understand what the interactions mean even in more complex scenarios (e.g. factors with more than two levels).
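The expansion above can be checked numerically. The following is a minimal base R sketch; the coefficient values are made up for illustration and do not come from any fitted model.

```r
# Build the four design cells with reference levels "new" and "K"
cells <- expand.grid(
  item = factor(c("new", "old"), levels = c("new", "old")),
  condition = factor(c("K", "E"), levels = c("K", "E"))
)
X <- model.matrix(~ item * condition, data = cells)
rownames(X) <- paste(cells$item, cells$condition, sep = ",")

# Hypothetical coefficients (illustrative values only)
b <- c("(Intercept)" = 0.2, "itemold" = 1.0,
       "conditionE" = -0.1, "itemold:conditionE" = 0.5)

# Linear predictor for each cell
eta <- drop(X %*% b[colnames(X)])

# Difference of differences, written both ways
did_1 <- (eta["old,E"] - eta["new,E"]) - (eta["old,K"] - eta["new,K"])
did_2 <- (eta["old,E"] - eta["old,K"]) - (eta["new,E"] - eta["new,K"])
```

Both `did_1` and `did_2` come out equal to the interaction coefficient, whatever values are chosen for the other three coefficients.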

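In a model that contains only condition * item, the interaction would involve both time points; condition * item * time2 lets you estimate it separately for each time point. With time2 coded 0/1, the item:condition coefficient of the three-way model then refers to T1 (where time2 = 0), and the three-way coefficient is the change in that interaction from T1 to T2.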
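To see how the time dimension enters, the same kind of check can be done for the three-way formula. This is only a sketch with invented coefficient values; it assumes treatment coding with reference levels new and K and time2 coded 0/1 as in the original post.

```r
# All eight cells: item x condition x time2
cells <- expand.grid(
  item = factor(c("new", "old"), levels = c("new", "old")),
  condition = factor(c("K", "E"), levels = c("K", "E")),
  time2 = c(0, 1)
)
X <- model.matrix(~ item * condition * time2, data = cells)
rownames(X) <- paste(cells$item, cells$condition, cells$time2, sep = ",")

# Hypothetical coefficients (illustrative values only)
b <- c("(Intercept)" = 0.2, "itemold" = 1.0, "conditionE" = -0.1, "time2" = 0.05,
       "itemold:conditionE" = 0.5, "itemold:time2" = -0.3,
       "conditionE:time2" = 0.1, "itemold:conditionE:time2" = -0.2)
eta <- drop(X %*% b[colnames(X)])

# Item x condition difference of differences at a given time point
did <- function(t) {
  cell <- function(i, cond) eta[paste(i, cond, t, sep = ",")]
  unname((cell("old", "E") - cell("new", "E")) - (cell("old", "K") - cell("new", "K")))
}
did_T1 <- did(0)  # the item:condition coefficient alone
did_T2 <- did(1)  # item:condition plus the three-way coefficient
```

So under this coding the effect at T2 is the sum of two coefficients, and judging whether the effect persists means looking at that sum, not at the three-way term alone.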

Hope that answers this part of your inquiry :-)

No. First, I am not sure you are correctly describing what the interaction term actually does. Second, you need to interpret the uncertainty in the coefficient. If by “non-significant” you mean something like “the 95% posterior interval includes 0”, it is also important how wide the interval is. If the interval is narrow around 0, then you can be somewhat confident that the interaction really is negligible. If it is wide, your data aren’t enough to learn much useful about the interaction. Also note that “significant” is a frequentist term and does not directly translate to a Bayesian context.

Finally, you can’t ignore that you also have the (1 + item * condition * time2 | i | ID_T1T2 ) and for some inference tasks it might be important to consider this term as well. I discussed this in a slightly different context at: Inferences marginal of random effects in multi-level models

It means that all the terms that use | i | share a correlation matrix, e.g. if the 1 + item * condition * time2 coefficients tend to be similar/different for some values of ID_T1T2 then the coefficient for 0 + old will also be more likely to be similar/different for those values of ID_T1T2. The benefit depends on whether this assumption is correct. I find it slightly weird in your case, because I would not necessarily expect the disc and overall response to behave similarly.

Overall, this seems to be a very ambitious model that would require A LOT of data to learn anything useful about the parameters.

Best of luck with your project!

(Responding belatedly after a period of not having time to check here)
(Also, I suspect you may have a better understanding of this than I do, but I’ll summarize for others coming later)

As I think you have discerned, data like this (participants saying “old”-or-“new” to stimuli that are either old or new) are often modelled using a “Signal Detection Theory” framework that actually ends up being well-captured by standard hierarchical GLM. To do it properly, you model the participant’s response (literally what they said, not response accuracy) as a function of the response-associated stimulus identity (here, whether the stim was old or new) plus any other observed or manipulated variables. As I understand it, in your data, old reflects the response-associated stimulus identity, condition reflects the experimental-vs-control manipulation, time2 encodes the two different session times. Now, you also have a variable item that you also indicate has two levels, “new” and “old”, but if this is not perfectly redundant to the old variable I am confused as to what it actually is. Is it perchance the actual word token, in which case shouldn’t it have many levels, not merely two?

Leaving my uncertainty wrt the item aside and continuing with my summary (again, likely for others later) and ignoring for now the multiple participants, the model:

Response ~ old * condition * time2 

has (interpretations assume intercept-independent contrasts like sum contrasts or half-sum contrasts):

  • intercept: overall response bias
  • main effect of old: overall discrimination (literally d’ if using a probit link)
  • main effect of condition: effect of condition on response bias
  • main effect of time2: effect of time on response bias
  • condition:time2: interaction between condition and time on response bias
  • old:condition: effect of condition on discrimination
  • old:time2: effect of time on discrimination
  • old:condition:time2: interaction between condition and time on discrimination
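The first two bullets can be illustrated with the simplest possible case. This is a hedged sketch, not the ordinal model from the thread: it assumes a binary "old"/"new" response, a single participant, made-up counts, and half-sum coding (new = -0.5, old = +0.5).

```r
# Hypothetical aggregated yes/no data: 100 new and 100 old trials
d <- data.frame(
  old     = c(-0.5, 0.5),  # half-sum coding: new = -0.5, old = +0.5
  say_old = c(20, 80),     # "old" responses: false alarms, hits
  say_new = c(80, 20)
)
fit <- glm(cbind(say_old, say_new) ~ old,
           family = binomial(link = "probit"), data = d)

# Classical SDT estimates from hit and false-alarm rates
hit <- 80 / 100
fa  <- 20 / 100
dprime <- qnorm(hit) - qnorm(fa)
bias   <- (qnorm(hit) + qnorm(fa)) / 2  # equals -c in the usual criterion notation

coef(fit)
```

In this saturated example the slope for `old` matches `dprime` and the intercept matches `bias`, which is the sense in which the GLM coefficients carry the SDT interpretation.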

When there are multiple participants, the full hierarchical model (which you should definitely employ, following Barr’s “keep it maximal” dictate) would be:

Response ~ old*condition*time2 + (1+old*condition*time2 | participant)

where potential across-participant correlations among all the above denoted effects are modeled.

If I’m right that item is intended to encode the actual tokens used for the old/new judgement (ex. the actual words), and if tokens were randomly assigned to old/new identities across participants, then you may want to model systematic differences among items in the same way as the above models systematic differences among participants. This could be done via the model:

Response  ~ (
            old*condition*time2
            + (1+old*condition*time2 | participant)
            + (1+old*condition*time2 | item)
)

Finally, you mention employing an ordered probit response link. Is my understanding correct that this is to avoid the assumption of equal variance in the latent “psychological newness” distributions of old and new items? I haven’t seen this done this way, but I think that makes sense and cool if this is how it’s supposed to be done. Possibly this also explains why you are modelling a seemingly separate outcome, disc? If not, and disc is truly a data parameter that you’ve computed from the recognition data separately in R, then hopefully you’ll see from the above that this is unnecessary as inference on discrimination performance comes from the GLM automatically.

Feel free to correct me if I’m misunderstanding anything!


Quick note: disc is AFAIK a parameter of the cumulative probit distribution.

Ah, thanks to @matti’s awesome tutorial series on SDT (particularly part 3), I now see that the disc parameter is for enabling unequal variance and effects of variables thereon. Cool!


The discrimination parameter is a way to let the standard deviation differ from 1 for the old items (unequal variance signal detection theory model). If I am right, the disc coefficient is estimated on the log scale and disc itself is the inverse of the standard deviation, so the coefficient is the negative natural logarithm of the standard deviation.
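A small base R sketch of how disc relates to the standard deviation, assuming the parameterisation P(Y <= k) = pnorm(disc * (threshold_k - eta)) with disc modeled on the log scale, which is how I read the brms documentation for the cumulative family. The function name `uvsdt_probs` and all numbers are invented for illustration.

```r
# Category probabilities for one stimulus type under the assumed parameterisation
uvsdt_probs <- function(thresholds, eta, log_disc) {
  disc <- exp(log_disc)                      # log link on disc
  cum  <- pnorm(disc * (thresholds - eta))   # cumulative probabilities
  diff(c(0, cum, 1))                         # probabilities of the 6 categories
}

thresholds <- c(-1.5, -0.5, 0, 0.5, 1.5)     # hypothetical, 6 response categories
p_new <- uvsdt_probs(thresholds, eta = 0, log_disc = 0)     # SD = 1
p_old <- uvsdt_probs(thresholds, eta = 1, log_disc = -0.3)  # SD = exp(0.3) > 1

# Same probabilities from a normal distribution with SD = 1 / disc
p_old_sd <- diff(c(0, pnorm(thresholds, mean = 1, sd = exp(0.3)), 1))
```

Under this reading, a negative disc coefficient for old items corresponds to a latent standard deviation larger than 1.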


@martinmodrak

If I wanna calculate the expected value of the latent variable familiarity for the first measurement time point, doesn’t that mean that the estimates for itemold and conditionE are estimated for the participants measured at the first time point?

E( Y' | condition = 1, item = 1, time2 = 0) =
  1 * b_Intercept + 1 * b_itemold + 1 * b_conditionE + 1 * b_itemold:conditionE +
  0 * b_time2 + 0 * b_conditionE:time2 + 0 * b_itemold:time2 +
  0 * b_itemold:conditionE:time2

The time dimension confuses me. I want to know if the treatment on the first time point is effective to enhance the recognition performance (detection). To evaluate this question I would consider the Interaction b_itemold:conditionE, wouldn’t I?

I am not sure I completely understand what you are asking. There might be some additional misunderstanding. We have itemold and conditionE as population-level effects - those are estimated once. Then we have the varying effects from (1+old*condition*time2 | participant), which create a separate set of parameters for each value participant can take. So in that sense, yes, if you have a participant (call it p1) that was only measured at one time point, they will still have their own time2[p1] parameter, but this coefficient will never interact with the likelihood and your inferences should be the same as if this parameter didn’t exist.

Does that answer your question?

If the condition represents the treatment on the first time point, then yes. (If the values of condition for the same participant can differ between time points, then you would need to introduce a new variable “condition on first time point” and use it instead.)

Your description of the experiment is slightly confusing to me. You say there is “condition” (which is control or experimental), but then that there is “treatment” which is something else? Or how is the treatment represented in the model? And what does “old” actually mean? I think clarifying what exactly happened and what do the data mean could help us get faster to a good solution.

Best of luck!


First of all: I appreciate all your efforts :).

Let’s try to clarify the experiment:

The general research question: Can we enhance memory performance in a visual word recognition task through hypnosis? (If there is an effect:) Is the improvement still measurable after one week?

We have 2 measurement time points. The second is one week after the initial measurement (= T2). All Participants attend both: experimental and control condition (within-subject).
The treatment is the hypnosis, which is intended to improve performance in the recognition task. During the first session all participants undergo hypnosis, which is then bound to a post-hypnotic trigger (a note with an “E”) that is intended to enhance memory performance later on in the recognition task. There is another note with a “K” on it as a control cue.

Let’s demonstrate the whole experiment for one Participant Xi:
T1 (Initial measurement/session):

  1. The participant Xi gets the control note (“K”). Then he undergoes the hypnosis, during which he gets the post-hypnotic trigger (note with the “E”).
  2. Xi learns words. After the learning phase he has to take one of the notes (randomly “K” or “E”) and wear it close to his body. Then he is asked to rate presented words regarding their previous occurrence in the learning phase (from 1 to 6: 1 - surely a new word, 2, 3, 4, 5, 6 - surely seen before). During this phase all the previously learned words occur together with the same number of “new” items (not learned during the learning phase).
  3. Xi learns words again. He has to take the other note (if he had “K” first, he is now asked to wear the “E” note). He is again asked to rate the presented words with regard to their previous occurrence.
    -> So Xi did both, control and experimental condition

T2(after one week):

  1. All Participants redo 2. and 3. from the initial measurement/session.

The idea is to model the data as an unequal variance signal detection model (varying standard deviation for the previously learned items), specified with the brms package as an ordered probit model with heteroscedastic error. I hope this is a more comprehensible description.

The Variables:
item = previously learned (= old), not learned during the experiment (= new)
time2 = data measured at T1 (= 0), data measured at T2 (=1)
Response = the Rating for the presented word (1 to 6)
condition = control (=0), experimental (=1)

@martinmodrak The OP and Matti’s tutorial both use treatment contrasts, and I wonder if (1) that’s a source of confusion here (bc with interactions treatment contrasts are hard to think about IMO) and (2) if sum contrasts can still “work” for the unequal variances model (in the sense that the disc parameter is now not 1 for the new stimuli and x for the old stimuli but instead disc_intercept-x/2 for the new and disc_intercept+x/2 for the old). Unless I’m mistaken, fixing the disc value for the new stimuli at 1 is a means of ensuring identifiability amidst the cumulative probit, yes? In which case would a sufficiently narrow (or even fixed-at-1) prior on the disc_intercept achieve the same ends?
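The identifiability point can be made concrete with a small base R sketch. It assumes the cumulative probit is parameterised as P(Y <= k) = pnorm(disc * (threshold_k - eta)); the helper `cat_probs` and all numbers are hypothetical.

```r
# Category probabilities under the assumed parameterisation
cat_probs <- function(thr, eta, disc) diff(c(0, pnorm(disc * (thr - eta)), 1))

thr     <- c(-1.5, -0.5, 0, 0.5, 1.5)  # hypothetical thresholds
eta_old <- 1.0                          # hypothetical latent mean for old items (new = 0)
b0 <- 0                                 # hypothetical log-disc intercept
x  <- -0.4                              # hypothetical old-vs-new effect on log disc

# Sum-style coding: log disc = b0 - x/2 for new, b0 + x/2 for old
disc_new <- exp(b0 - x / 2)
disc_old <- exp(b0 + x / 2)

# Rescale: multiply both disc values by k, divide thresholds and means by k
k <- 2
same_new <- all.equal(cat_probs(thr, 0, disc_new),
                      cat_probs(thr / k, 0, k * disc_new))
same_old <- all.equal(cat_probs(thr, eta_old, disc_old),
                      cat_probs(thr / k, eta_old / k, k * disc_old))
```

Both comparisons return TRUE: the likelihood cannot distinguish the two parameter sets, so the overall scale (here `b0`) has to be pinned down somehow, whether by fixing disc for one stimulus type or by a tight prior on the intercept. Only the difference `x` is informed by the data.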


I know @paul.buerkner wrote an inspiring paper on ordinal regression models. Maybe he has an idea if it’s better to model this via two models (one for T1 and one for T2) or in one and what discrimination parameters should be allowed to vary and how.

OK, this is clearer.

I think one way to make the data easier to analyze would be to recode the responses to new words in reverse and ignore the item term in the model. This is not completely safe, but I think it would make thinking about the model much easier, as all coefficients are then positive if they are associated with an increase in performance and negative when performance decreases. Eventually, you will probably want to move back to the original model, but by then you IMHO might have a better sense of what is going on.

I am not completely sure your full model is justifiable. I would start with a much simpler model, i.e.

ResponseRecoded ~ time2 * condition + item + (1 | participant)

where ResponseRecoded is the response with new words reversed. This should be much easier to interpret… Or

Response ~ item*time2*condition + (1 + item | participant)

For the non-recoded version.
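For the recoded variant, reversing the responses to new words on the 1 to 6 scale could look like the following sketch. The toy data frame is invented; only the column names Response and item are taken from the thread.

```r
# Toy data: two new and two old words with ratings from 1 to 6
d <- data.frame(
  item     = factor(c("new", "new", "old", "old"), levels = c("new", "old")),
  Response = c(1, 5, 6, 2)
)

# Reverse the scale for new words so that high values always mean "correct":
# 1 ("surely new") becomes 6 for a new word, 6 becomes 1, and so on
d$ResponseRecoded <- ifelse(d$item == "new", 7 - d$Response, d$Response)
```

After recoding, a confident correct rejection (rating 1 for a new word) and a confident hit (rating 6 for an old word) both score 6.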

Then I would do a posterior predictive check (pp_check) for the sd of the response by various groups (stat_grouped), and only if I saw discrepancies would I model the disc as differing between subgroups.
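To show what such a grouped check computes, here is a base R sketch done by hand. Everything in it is simulated: `y` stands in for the observed ratings and `yrep` for a draws-by-observations matrix such as the one returned by posterior_predict(fit). In brms the corresponding call would, if I remember the interface correctly, be something like pp_check(fit, type = "stat_grouped", stat = "sd", group = "item").

```r
set.seed(1)
n <- 200
item <- factor(rep(c("new", "old"), each = n / 2), levels = c("new", "old"))

# Hypothetical observed ratings (stand-in for the real Response column)
y <- c(sample(1:6, n / 2, replace = TRUE, prob = c(.30, .30, .20, .10, .05, .05)),
       sample(1:6, n / 2, replace = TRUE, prob = c(.10, .10, .15, .15, .20, .30)))

# Hypothetical replicated data sets, one row per posterior draw
yrep <- matrix(sample(1:6, 50 * n, replace = TRUE), nrow = 50)

# SD of the response per group: observed, and for each replicated data set
sd_obs <- tapply(y, item, sd)
sd_rep <- t(apply(yrep, 1, function(r) tapply(r, item, sd)))

# Share of replicated SDs at least as large as the observed one, per group
p_ge <- c(new = mean(sd_rep[, "new"] >= sd_obs["new"]),
          old = mean(sd_rep[, "old"] >= sd_obs["old"]))
```

If the observed sd for one group sits far in the tail of the replicated values (a share near 0 or 1), that is the kind of discrepancy that would motivate letting disc differ for that group.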

Similarly, you could do other pp_checks to see if additional terms need to be added to the model.

Also unless you have a ton of participants (my guess would be at least a few hundreds), I don’t think you can reliably learn about all the parameters in the full model, so you IMHO need to simplify anyway. Even the simpler models I proposed might be too rich if you have just a few dozen participants…

Does that make sense?

Yeah, it seems that the sd for the previously learned words is indeed higher.

Is it possible to model different discriminations for the old (previously learned) items only? My intent is to model discrimination parameters for the old items in the control condition (old), in the experimental condition (old x conditionE), for the old words at time point 2 (old x time2 = 1) and for the old items in the experimental condition at T2. If I try to model this I always get discrimination parameters for the new items as well.

My idea is to compare the above model with the model that has just one parameter for the differing sd of the old items. Or I fit two models, one for T1 and one for T2, to compare the conditions, but then the estimated standard error is higher.

Maybe you have a few ideas :)