Help specifying a logistic regression with nested predictors in a three-way interaction

Hi!
I am trying to fit a Bayesian logistic regression model for a psycholinguistic experiment using brms. I would very much appreciate feedback on my model specification, especially the formula. I am pretty new to using Bayesian models (and statistics in general), so there may be much better ways to do what I want that I am not aware of.

The experiment consists of 3 tasks for which correct (1) and incorrect (0) responses were recorded (response is NA in case of timeout):

  • picture selection task (picsel) with 80 items,
  • written grammaticality judgment task (written) with 40 items in both grammatical (G) and ungrammatical (U) conditions,
  • spoken grammaticality judgment task (spoken) with 40 items in both grammatical (G) and ungrammatical (U) conditions,
  • the items of written and spoken are identical.

There are 3 groups of participants, all participants saw all items from all tasks (in total 80+80+80=240):

  • native speakers
  • immersion learners
  • classroom learners

I am interested in modelling response accuracy, i.e. the probability of a correct response, for each participant group (group) in each task, taking into account the item structure: the items of the written and spoken tasks are repeated across tasks and occur in the G and U conditions of gramm. For the picsel task, grammaticality (gramm) does not apply, so I added the dummy level “ps” alongside “G” and “U”.
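A minimal base-R sketch of how such a dummy level can be constructed (the column names follow the toy data below, but the recoding itself is my assumption about how the factor was built):

```r
# Toy rows only, to illustrate the recoding; gramm is undefined for picsel
toydf <- data.frame(
  task  = factor(c("picsel", "written", "spoken")),
  gramm = c(NA_character_, "G", "U"),
  stringsAsFactors = FALSE
)

# Give all picsel rows the dummy level "ps", then fix the level order
toydf$gramm[toydf$task == "picsel"] <- "ps"
toydf$gramm <- factor(toydf$gramm, levels = c("G", "ps", "U"))

levels(toydf$gramm)  # "G" "ps" "U"
```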

This is the model that I have specified so far, including group-level effects for items (question) and for participants nested within tasks.

fit <- brm(response ~ task + group +
             task:group:gramm +
             (1 | question) +
             (1 | task:participant),
           data = toydf,
           family = bernoulli(),
           prior = c(prior(normal(0, 4), class = b),
                     prior(normal(0, 4), class = Intercept)),
           warmup = 500,
           iter = 2000,
           chains = 4,
           inits = 0,
           cores = 4,
           seed = 42)

This is a toy dataset with 2 participants per group (6 in total) that shows what the relevant parts of the data look like.
toydf.csv (94.5 KB)

The structure of the toy data set is:

> str(toydf)
'data.frame':	1440 obs. of  7 variables:
 $ participant: Factor w/ 6 levels "00265018-0164-4417-8664-6eedd7839004.txt",..: 4 4 4 4 4 4 4 4 4 4 ...
 $ response   : int  0 1 0 1 1 1 0 1 1 0 ...
 $ task       : Factor w/ 3 levels "spoken","written",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ question   : Factor w/ 120 levels "1","10","11",..: 8 31 27 5 17 30 9 39 29 19 ...
 $ group      : Factor w/ 3 levels "native","immersion",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ taskident  : int  1 1 1 1 1 1 1 1 1 1 ...
 $ gramm      : Factor w/ 3 levels "G","ps","U": 3 1 3 1 1 1 3 3 1 3 ...

From what I can see in the results and the marginal_effects/conditional_effects plots, the model seems to do what I want and gives me the expected predictions for each group in each task: I can filter out the relevant predictions for the picsel task at gramm level “ps”, and the predictions for the written/spoken tasks at both gramm levels “G” and “U”.

This is a plot showing the conditional effects:

In a traditional logistic regression model, I would probably have to specify two separate interaction terms for the population-level effects that I am interested in:

  • response ~ task + group + task:group + task:group:taskident:gramm

where the four-way interaction term would include an indicator variable (taskident) that is 0
for picsel and 1 for written/spoken, so that the interaction term including gramm is 0 for picsel (since gramm does not apply to that task).
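For concreteness, the indicator and the traditional-style formula could be set up like this (base-R sketch; the variable names match the post, but the construction itself is my assumption):

```r
# taskident: 1 for written/spoken (where gramm applies), 0 for picsel
task <- factor(c("picsel", "written", "spoken", "picsel"))
taskident <- as.integer(task %in% c("written", "spoken"))
taskident  # 0 1 1 0

# The formula with the explicit four-way indicator interaction
f <- response ~ task + group + task:group + task:group:taskident:gramm
```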

It seems that I can do without this indicator in brms. Is that really the case? Should I use a different formula, or are there generally better ways to model the response given the structure of my predictors?

I am grateful for any suggestions or pointers to references on this issue!

  • Operating System: ubuntu 18.04
  • brms Version: 2.9.0

I don’t know that I’m following all the details of your four-way interaction. It might help if you spelled out your intended equation in LaTeX syntax. But generally speaking, your model formula and the other arguments within brm() look reasonable given your description.

Thank you for your reply, Solomon!

Maybe I wasn’t clear enough about my main question; I’ll try to reformulate.
I want to define an interaction for only a sub-portion of the data. In the end, I did that with an additional dummy variable in the interaction that is 0 for the part of the data where the interaction is not defined and 1 where it is.
The model predictions for all defined interactions make sense, and I generally see what I expected. The only thing I don’t understand is that I also get predictions for interactions between levels for which the interaction is not defined, which I wanted to exclude by setting the dummy variable to 0 for those observations. Should I worry about these predictions because they may affect the model’s predictions for the others, or can I ignore them because this is simply how a Bayesian model works?