I am trying to fit a Bayesian logistic regression model for a psycholinguistic experiment using brms. I would very much appreciate feedback on my model specification, especially the formula. I am pretty new to using Bayesian models (and statistics in general), so there may be much better ways to do what I want that I am not aware of.
The experiment consists of 3 tasks for which correct (1) and incorrect (0) responses were recorded (response is NA in case of timeout):
- picture selection task (picsel) with 80 items,
- written grammaticality judgment task (written) with 40 items in both grammatical (G) and ungrammatical (U) conditions,
- spoken grammaticality judgment task (spoken) with 40 items in both grammatical (G) and ungrammatical (U) conditions,
- the items of written and spoken are identical.
There are 3 groups of participants; all participants saw all items from all tasks (80 + 80 + 80 = 240 in total):
- native speakers
- immersion learners
- classroom learners
I am interested in modelling the probability of a correct response for each participant group (group) in each task, taking into account the item structure: the written and spoken tasks share the same items, and each item occurs in both the G and U conditions of gramm in each of those tasks. For the picsel task, grammaticality (gramm) does not apply, so I added the dummy level “ps” alongside “G” and “U”.
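To make the coding concrete, here is a minimal sketch of the task-by-gramm crossing. The miniature data frame is made up for illustration (one row per combination that actually occurs), and I am assuming the third task level is labelled "picsel". On the real data the same check would be table(toydf$task, toydf$gramm).

```r
# Made-up miniature of the design: one row per task/gramm combination that occurs.
# The level label "picsel" is an assumption about how the third task is coded.
design <- data.frame(
  task  = factor(c("picsel", "written", "written", "spoken", "spoken"),
                 levels = c("spoken", "written", "picsel")),
  gramm = factor(c("ps", "G", "U", "G", "U"),
                 levels = c("G", "ps", "U"))
)

# Cross-tabulate: "ps" should occur only with picsel, and G/U never with picsel.
cells <- table(design$task, design$gramm)
print(cells)

# Combinations that never occur in the design (structurally empty cells).
n_empty <- sum(cells == 0)
```

The empty cells are the combinations that the three-way interaction in my formula cannot estimate from data.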
This is the model that I have specified so far, including group-level effects for items (question) and for participants within each task.
fit <- brm(response ~ task + group +
             task:group:gramm +
             (1 | question) +
             (1 | task:participant),
           data = toydf,
           family = "bernoulli",
           prior = c(prior(normal(0, 4), class = b),
                     prior(normal(0, 4), class = Intercept)),
           warmup = 500,
           iter = 2000,
           chains = 4,
           inits = 0,
           seed = 42)
This is a toy dataset with 2 participants per group (6 in total) that shows what the relevant parts of the data look like.
toydf.csv (94.5 KB)
The structure of the toy data set is:
> str(toydf)
'data.frame': 1440 obs. of 7 variables:
$ participant: Factor w/ 6 levels "00265018-0164-4417-8664-6eedd7839004.txt",..: 4 4 4 4 4 4 4 4 4 4 ...
$ response : int 0 1 0 1 1 1 0 1 1 0 ...
$ task : Factor w/ 3 levels "spoken","written",..: 1 1 1 1 1 1 1 1 1 1 ...
$ question : Factor w/ 120 levels "1","10","11",..: 8 31 27 5 17 30 9 39 29 19 ...
$ group : Factor w/ 3 levels "native","immersion",..: 3 3 3 3 3 3 3 3 3 3 ...
$ taskident : int 1 1 1 1 1 1 1 1 1 1 ...
$ gramm : Factor w/ 3 levels "G","ps","U": 3 1 3 1 1 1 3 3 1 3 ...
From what I can see in the results and the marginal/conditional_effects plots, the model seems to do what I want and gives me the predictions I would expect for each group in each task: I can filter out the relevant predictions for the picsel task at gramm level “ps”, and the predictions for the written/spoken tasks at gramm levels “G” and “U”.
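The filtering I describe could be sketched roughly as follows. The helper valid_cells() is a hypothetical name of my own, not a brms function, and the group labels "g1" to "g3" in the demo are placeholders. The last block only runs if a fitted model fit and the data toydf exist in the session; it uses fitted() with re_formula = NA to get population-level predictions.

```r
# Hypothetical helper (not part of brms): keep only the task/gramm combinations
# that exist in the design, i.e. "ps" with picsel and G/U with the other tasks.
valid_cells <- function(tasks, groups, gramms,
                        no_gramm_task = "picsel", dummy_level = "ps") {
  grid <- expand.grid(task = tasks, group = groups, gramm = gramms,
                      stringsAsFactors = FALSE)
  keep <- (grid$task == no_gramm_task) == (grid$gramm == dummy_level)
  out <- grid[keep, ]
  rownames(out) <- NULL
  out
}

# Demo with placeholder group labels (assumed, not the real factor levels).
demo <- valid_cells(c("spoken", "written", "picsel"),
                    c("g1", "g2", "g3"),
                    c("G", "ps", "U"))
print(demo)

# Only attempted when the fitted model and the data are available.
if (exists("fit") && exists("toydf") && requireNamespace("brms", quietly = TRUE)) {
  newdata <- valid_cells(levels(toydf$task), levels(toydf$group), levels(toydf$gramm))
  # Population-level predictions on the probability scale, ignoring group-level effects.
  pred <- fitted(fit, newdata = newdata, re_formula = NA)
  print(cbind(newdata, pred))
}
```

This gives one row per group for picsel and one row per group and gramm level for written and spoken, so no predictions are requested for combinations that do not exist.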
This is a plot showing the conditional effects:
In a traditional logistic regression model, I would probably have to specify two separate interaction terms for the population level effects that I am interested in
- response ~ task + group + task:group + task:group:taskident:gramm
where the four-way interaction term would include an indicator variable (taskident) that is 0 for picsel and 1 for written/spoken, so that the interaction term including gramm is 0 for picsel (since gramm does not apply to that task).
It seems that I can do without this indicator variable in brms. Is that really the case, should I use a different formula, or are there generally better ways to model the response given the structure of my predictors?
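To illustrate what I am unsure about, here is a minimal sketch that inspects the population-level design matrix implied by the dummy-level formula, using base R's model.matrix() on one row per existing design cell. The group labels are placeholders, and this only shows what base R builds from the formula; I am not certain how brms treats such columns internally.

```r
# One row per task/group/gramm cell that exists in the design.
# Group labels are placeholders (assumed); "picsel" is the assumed task label.
cells <- expand.grid(task  = c("spoken", "written", "picsel"),
                     group = c("g1", "g2", "g3"),
                     gramm = c("G", "ps", "U"))
cells <- cells[(cells$task == "picsel") == (cells$gramm == "ps"), ]

# Design matrix for the population-level part of my formula.
X <- model.matrix(~ task + group + task:group:gramm, data = cells)

n_cols <- ncol(X)       # number of coefficients implied by the formula
n_rank <- qr(X)$rank    # number that the design can actually separate

# Columns that are zero in every row, e.g. picsel combined with G or U.
empty_cols <- colnames(X)[colSums(X != 0) == 0]

cat("columns:", n_cols, " rank:", n_rank, "\n")
print(empty_cols)
```

If the matrix has more columns than its rank, some coefficients are not identified by the data alone, and in a Bayesian fit they would be informed only by the prior. That is the part where I do not know whether the dummy level is a sound replacement for the indicator variable.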
I am grateful for any suggestions or pointers to references on this issue!
- Operating System: ubuntu 18.04
- brms Version: 2.9.0