Speeded-2afc modellers: preliminary model proposal for review

(n.b. Footnotes in square brackets)

The title of this post is likely to select for folks already familiar with the terminology, but for curious others: in cognitive science a common tool for mental chronometry is the speeded two-alternative forced-choice experiment, in which human participants are instructed to respond as quickly and accurately as possible when a stimulus is shown, with one kind of stimulus mapped to one response and another kind mapped to another response. An example might be an experiment where participants are instructed to respond with their left hand when they see a white circle and with their right hand when they see a black circle. This experiment yields, for each response, two pieces of data: which response was made, and the duration (response time, RT) between the moment of stimulus onset and the moment of response registration[1]. The response identity is usually combined with the stimulus identity to derive response accuracy[2], and analysis of the RT data is often restricted to the correct responses only[3]. Furthermore, many such pairings of {response accuracy, RT} are obtained from each of multiple participants and are usually associated with predictor variables that can vary within participants, between participants, or both.

So we’re in the general world of hierarchical models as presented in SUG1.13, but we need to accommodate both a binary and a continuous outcome variable, with potentially important nuance in how to model the heterogeneous variability and positive skew manifest in typical RT data.

Now, there exist advanced process models (ex. the diffusion model, the LBA) that can achieve nuanced inference from such data, as well as more descriptive distribution-fitting approaches (ex. modelling RTs as exGaussian, shifted-Weibull, etc.). However, these tools usually require lots of data to achieve reasonable uncertainty reduction, and I wanted something that could be deployed in smaller-data scenarios yet still achieve more robust description/inference than the approaches I’d seen.

A naive approach would be to generalize the SUG1.13 model to have twice as many correlated variates per participant, associating half with effects on the mean RT (with the RT observations log-transformed for a closer approximation to a normal likelihood) and half with effects on the log-odds of error, with the appropriate likelihood included for each. Want to additionally account for effects of predictors on the scale of the RT distribution? Just add yet more correlated variates! Easy, right?
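For concreteness, here's a minimal sketch of what that naive approach might look like in Stan. This is my own illustration rather than code from the repo, and names like `K` (number of within-participant predictors) and `X` (the trial-level design matrix) are placeholders:

```stan
// Naive joint model: one big multivariate normal over all per-participant
// effects on RT location, RT scale (log-SD), and log-odds of error.
data {
  int<lower=1> N;                       // trials
  int<lower=1> P;                       // participants
  int<lower=1> K;                       // within-participant predictors (incl. intercept)
  array[N] int<lower=1, upper=P> id;    // participant per trial
  matrix[N, K] X;                       // trial-level design matrix
  vector[N] log_rt;                     // log-transformed RTs
  array[N] int<lower=0, upper=1> err;   // 1 = error response
}
parameters {
  vector[3 * K] mu;                     // population means: [location, log-scale, logit-error]
  vector<lower=0>[3 * K] tau;           // across-participant SDs
  cholesky_factor_corr[3 * K] L;        // one big (3K x 3K) correlation matrix
  matrix[3 * K, P] z;                   // non-centred participant deviates
}
transformed parameters {
  // per-participant effects: P rows, 3K columns
  matrix[P, 3 * K] beta
    = rep_matrix(mu', P) + (diag_pre_multiply(tau, L) * z)';
}
model {
  vector[N] loc;
  vector[N] log_scale;
  vector[N] logit_err;
  for (n in 1:N) {
    loc[n]       = X[n] * beta[id[n], 1:K]';
    log_scale[n] = X[n] * beta[id[n], (K + 1):(2 * K)]';
    logit_err[n] = X[n] * beta[id[n], (2 * K + 1):(3 * K)]';
  }
  mu ~ normal(0, 1);
  tau ~ normal(0, 1);
  L ~ lkj_corr_cholesky(1);
  to_vector(z) ~ std_normal();
  log_rt ~ normal(loc, exp(log_scale));
  err ~ bernoulli_logit(logit_err);
}
```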

Well, that’s what I thought too, but after obtaining inference from that approach that was dramatically at odds with my domain expertise, I discovered that blindly increasing the size of the multivariate normal in this way self-sabotages the whole point of modelling everything in a single model. That is, presumably the motivation for simultaneously modelling the error & RT data is that they can mutually inform, decreasing the uncertainty of inference to the degree that they indeed contain substantially correlated information. But as the dimension of a multivariate normal increases, the space of correlation matrices available to capture/make use of any such shared information collapses towards the identity matrix, meaning that your measures become less able to mutually inform!
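One quick way to get a feel for this (my illustration, not a formal treatment of the phenomenon) is to sample from a uniform prior over correlation matrices (LKJ with eta = 1) at increasing dimension `D` and look at the marginal prior on any single correlation:

```stan
// Prior-only model: sample a correlation matrix uniformly (LKJ eta = 1)
// and inspect one marginal correlation. As D grows, the positive-definiteness
// constraint squeezes the marginal prior on r_12 towards zero.
data {
  int<lower=2> D;  // dimension of the correlation matrix
}
parameters {
  cholesky_factor_corr[D] L;
}
model {
  L ~ lkj_corr_cholesky(1);
}
generated quantities {
  matrix[D, D] R = multiply_lower_tri_self_transpose(L);
  real r_12 = R[1, 2];
}
```

Comparing, say, D = 2 against D = 18 (six predictors times three latent types) shows how much less mass is left on large values of |r_12| once the matrix gets big.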

I’m still working on a write-up of this phenomenon, as it makes sense in retrospect but certainly wasn’t obvious to me at the outset[4], but in the interim I thought I’d share the efforts I’ve made on an alternative approach that I think is far more straightforward.

What I’ve come up with is a location-scale model for RT (intended for application to log-transformed RTs, but it would be trivial to apply to other descriptive distributions) and of course a Bernoulli model of errors with a latent log-odds quantity of primary interest. The predictors have independent “fixed effects” on the across-participant means for location, scale and log-odds of error, and the across-participant variabilities in each effect are captured by a partial-pooling model of the log-variances, where the partial pooling is applied within effects on location, scale and log-odds of error separately. So, for example, the variability across participants in the Intercept for location is partially pooled with the variability across participants in EffectX for location, but not with the variability across participants in the Intercept for scale.
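Here's a rough sketch of just that partial-pooling piece (again my own paraphrase rather than the repo's code; I parameterize via log-SDs rather than log-variances, which differs only by a factor of two):

```stan
// Per-type partial pooling of across-participant variability: the K per-effect
// log-SDs for a given latent type (RT location, RT log-scale, log-odds of
// error) are drawn from that type's own distribution, so e.g. the SD of the
// location Intercept informs the SD of the location EffectX, but not the SD
// of the scale Intercept.
data {
  int<lower=1> K;  // number of within-participant predictors (incl. intercept)
}
parameters {
  matrix[3, K] coef_mean;        // "fixed" effects; rows = location, log-scale, logit-error
  vector[3] log_sd_mean;         // per-type mean of the effect log-SDs
  vector<lower=0>[3] log_sd_sd;  // per-type spread of the effect log-SDs
  matrix[3, K] log_sd_z;         // non-centred per-effect log-SD deviates
}
transformed parameters {
  // across-participant SD of each effect, partially pooled within its type
  matrix[3, K] coef_sd
    = exp(rep_matrix(log_sd_mean, K) + diag_pre_multiply(log_sd_sd, log_sd_z));
}
model {
  to_vector(coef_mean) ~ normal(0, 1);
  log_sd_mean ~ normal(0, 1);
  log_sd_sd ~ normal(0, 1);
  to_vector(log_sd_z) ~ std_normal();
  // participant-level effects (with SDs coef_sd and the correlation structure
  // described below) and the normal/Bernoulli likelihoods would follow here.
}
```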

For correlations, however, I employ a mix: a traditional multivariate normal for the more “uncertain” correlations, and separate modelling of a specific set of correlations that I have decent domain expertise to expect to be substantially non-zero. Specifically, for a given predictor, I decided that it’s generally reasonable to expect substantially non-zero correlations among the influences of that predictor on location, scale and log-odds of error. For example, for a predictor encoding the Intercept, I think it’s reasonable to expect substantially non-zero correlations among the Intercept for location, the Intercept for scale and the Intercept for log-odds of error. Similarly, it’s usually reasonable to expect substantially non-zero correlations among the EffectX on location, the EffectX on scale and the EffectX on log-odds of error. My formulation doesn’t necessitate/assert strong priors on the sign of these correlations; it merely removes them from the lump of less-certainly-non-zero correlations to be modelled by the multivariate normal, in turn giving the multivariate normal more freedom to explore correlation matrices consistent with high correlations itself.

So, with the trio of correlations among each predictor’s influences on location, scale and log-odds of error accounted for, what’s left is to use the multivariate normal for “everything else”. There are a variety of factor-analytic strategies to achieve this, but an approach I found elegantly simple is to pick one of location, scale or log-odds of error, model correlations among the predictors’ influences on that one latent type, and let the remaining cross-predictor-and-cross-type correlations be implied by the trio of correlations I’d already modelled separately. So, I chose the predictors’ effects on location as the set of variates to model with the multivariate normal, then for each such effect derived the variates associated with scale and log-odds of error through their correlation with the location variates. With this formulation, there are actually only two correlations being modelled for a given predictor’s variates: one between the predictor’s location variates and the predictor’s scale variates, and one between the predictor’s location variates and the predictor’s log-odds-of-error variates; the correlation between the predictor’s scale variates and the predictor’s log-odds-of-error variates is derivable in the posterior from the other two (though I haven’t added the math for that yet).
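For what it's worth, here's how I'd write that construction and the implied third correlation, under the assumption (mine; the repo may parameterize this differently) that the scale and error innovations are standard normal and independent of each other and of the location variate. For predictor $k$, with $z^{(k)}_{\mathrm{loc}}$ the standardized location variate (correlated across predictors via the multivariate normal):

$$
\begin{aligned}
z^{(k)}_{\mathrm{scale}} &= \rho^{(k)}_{\mathrm{ls}}\, z^{(k)}_{\mathrm{loc}} + \sqrt{1 - \big(\rho^{(k)}_{\mathrm{ls}}\big)^2}\, \epsilon^{(k)}_{\mathrm{scale}} \\
z^{(k)}_{\mathrm{err}} &= \rho^{(k)}_{\mathrm{le}}\, z^{(k)}_{\mathrm{loc}} + \sqrt{1 - \big(\rho^{(k)}_{\mathrm{le}}\big)^2}\, \epsilon^{(k)}_{\mathrm{err}} \\
\mathrm{corr}\!\big(z^{(k)}_{\mathrm{scale}},\, z^{(k)}_{\mathrm{err}}\big) &= \rho^{(k)}_{\mathrm{ls}}\, \rho^{(k)}_{\mathrm{le}}
\end{aligned}
$$

So the correlation between the scale and error variates falls out as a simple product in generated quantities. One consequence worth flagging is that its magnitude is at most the product of the two modelled correlations' magnitudes, so this construction constrains it to be no stronger than either of them.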

So that’s basically it. I think this makes sense as a reasonable balance between hand-coding a complex full structural equation model and being dangerously over-automated, but I’d be very keen to get folks’ thoughts; maybe there’s some pathology in this approach I haven’t considered!

I’ve started a repo with the model & R code for prepping data & sampling here. It has a couple of other tricks I’ve collected for speeding up inference for hierarchical models with highly redundant predictor matrices (ex. when an experimental manipulation has two levels). Those tricks necessitate some complex data prep and transforms in the model, but hopefully it’s not too difficult to follow, and certainly chime in here if you have any questions. I’ve written it for the no-between-participant-predictors case, but it’s a fairly straightforward extension to get to the mixed-within-and-between and between-only-predictors cases (preview of the kind of extending code here). And apologies for the partially-done (inc. outdated comments) nature of this code, but this has been a work-in-progress for a while and I frankly needed the motivational catharsis of posting what I have now that I have at least something that seems to work :)

[1] More data is possible; for example, if one uses a response mechanism that enables a continuous response (ex. Xbox gamepad triggers), a full response trajectory (inc. derived peak force) can be recorded as well. More people need to adopt this simple, data-rich method!
[2] The reduction of response identity × stimulus identity to accuracy notably risks throwing away important information if there is the possibility of a response bias that could/should be characterized. Characterizing it is trivial: simply encode the response identity as a 0/1 outcome and include stimulus identity as a predictor interacting with everything else; any effects that do not include the stimulus-identity predictor then reflect effects on bias, while any effects that do reflect effects on discriminability (the thing that accuracy seeks to measure). (A minimal sketch of this coding appears after these footnotes.) To the degree that these may be correlated, they can mutually inform (or inform on other correlated effects, like the response-time quantities), so failing to let the model see the full information risks greater posterior uncertainty than the data truly warrant.
[3] The framework here could be easily adapted to model the distribution of error RTs as well, and as in the preceding footnote it seems silly to toss a source of information by not doing so.
[4] There’s also a neat thing where, if you try to abuse the multivariate normal in this way for non-hierarchical data, HMC throws divergences super reliably; it’s only in a hierarchical context that we can fail to notice so clearly that we’ve fundamentally mis-specified things.
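And here is the minimal (non-hierarchical) sketch of the coding described in footnote [2]; `resp`, `stim`, `x` and the `b_*` names are placeholders of mine, not anything from the repo:

```stan
// Model the response identity itself, with contrast-coded stimulus identity
// interacting with the other predictors. Terms without `stim` are bias;
// terms with `stim` are discriminability (what accuracy tries to measure).
data {
  int<lower=1> N;
  array[N] int<lower=0, upper=1> resp;  // response identity (0/1), NOT accuracy
  vector[N] stim;                       // stimulus identity, contrast-coded -0.5 / +0.5
  vector[N] x;                          // some other predictor of interest
}
parameters {
  real b_bias;    // overall response bias
  real b_bias_x;  // effect of x on bias (no stim term)
  real b_disc;    // overall discriminability (stim term)
  real b_disc_x;  // effect of x on discriminability (x-by-stim interaction)
}
model {
  b_bias ~ normal(0, 1);
  b_bias_x ~ normal(0, 1);
  b_disc ~ normal(0, 1);
  b_disc_x ~ normal(0, 1);
  resp ~ bernoulli_logit(b_bias + b_bias_x * x
                         + (b_disc + b_disc_x * x) .* stim);
}
```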


This paper seems relevant: HDDM: Hierarchical Bayesian estimation of the Drift-Diffusion Model in Python (Frontiers in Neuroinformatics).

Yes, the Wiener first-passage-time distribution is also implemented in Stan, and there are quite a few resources for hierarchical estimation of the DDM (e.g. brms, hBayesDM, and StanDDM, to name a few). But as @mike-lawrence points out, these process models typically require quite a lot of data to be identifiable.


I should say that I may have put things back in that realm with my model (and possibly should have done more validation work before posting) as I’m encountering issues getting it to sample well so far. In going back to basics I’ve discovered that the way I’m doing the SEM part might have fundamental issues I don’t quite understand yet.
