# Help specifying logistic regression with nested predictors in a three-way interaction

Hi!
I am trying to fit a Bayesian logistic regression model for a psycholinguistic experiment using brms. I would very much appreciate feedback on my model specification, especially the formula. I am pretty new to using Bayesian models (and statistics in general), so there may be much better ways to do what I want that I am not aware of.

The experiment consists of 3 tasks for which correct (1) and incorrect (0) responses were recorded (response is NA in case of timeout):

• picture selection task (picsel) with 80 items,
• written grammaticality judgement task (written) with 40 items in both grammatical (G) and ungrammatical (U) conditions,
• spoken grammaticality judgment task (spoken) with 40 items in both grammatical (G) and ungrammatical (U) conditions,
• the items of written and spoken are identical.

There are 3 groups of participants, all participants saw all items from all tasks (in total 80+80+80=240):

• native speakers
• immersion learners
• classroom learners

I am interested in modelling the response accuracy or probability of a correct response for
each participant group (group) in each task, taking into account the item structure: repeated items in the written and spoken tasks, occurring in the G and U conditions of gramm in each task. For the picsel task, grammaticality (gramm) does not apply, so I added the dummy level "ps" alongside "G" and "U".
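For reference, the recoding could be sketched like this (hypothetical names and a two-row example; the actual code and data may differ):

```r
# Hypothetical two-row example (assumed recoding, not the real toydf):
# picsel items get the dummy grammaticality level "ps", while
# written/spoken items keep "G"/"U".
df <- data.frame(task  = c("picsel", "written"),
                 gramm = c(NA, "G"),
                 stringsAsFactors = FALSE)
df$gramm[df$task == "picsel"] <- "ps"
df$gramm <- factor(df$gramm, levels = c("G", "ps", "U"))
```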

This is the model that I have specified so far, including group-level effects for items (question) and participants across tasks.

```r
fit <- brm(response ~ task * group * gramm +
             (1 | question) +
             (1 | participant),
           data = toydf,
           family = "bernoulli",
           prior = c(prior(normal(0, 4), class = b),
                     prior(normal(0, 4), class = Intercept)),
           warmup = 500,
           iter = 2000,
           chains = 4,
           inits = 0,
           cores = 4,
           seed = 42)
```

This is a toy dataset with 2 participants per group (6 in total) that shows what the relevant parts of the data look like:
toydf.csv (94.5 KB)

The structure of the toy data set is:

```
> str(toydf)
'data.frame':	1440 obs. of  7 variables:
 $ participant: Factor w/ 6 levels "00265018-0164-4417-8664-6eedd7839004.txt",..: 4 4 4 4 4 4 4 4 4 4 ...
 $ response   : int  0 1 0 1 1 1 0 1 1 0 ...
 $ task       : Factor w/ 3 levels "spoken","written",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ question   : Factor w/ 120 levels "1","10","11",..: 8 31 27 5 17 30 9 39 29 19 ...
 $ group      : Factor w/ 3 levels "native","immersion",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ taskident  : int  1 1 1 1 1 1 1 1 1 1 ...
 $ gramm      : Factor w/ 3 levels "G","ps","U": 3 1 3 1 1 1 3 3 1 3 ...
```

From what I can see in the results and the marginal_effects/conditional_effects plots, the model seems to do what I want and gives me the expected predictions for each group in each task, in the sense that I can filter out the relevant predictions for the picsel task at gramm level "ps", and the predictions for the written/spoken tasks at both gramm levels "G" and "U".

This is a plot showing the conditional effects:

In a traditional logistic regression model, I would probably have to specify two separate interaction terms for the population-level effects that I am interested in, where the four-way interaction term would include an indicator variable (taskident) that is 0 for picsel and 1 for written/spoken, so that the interaction term including gramm is 0 for picsel (since gramm does not apply to that task).

It seems that I can do without that in brms. Is that really the case? Should I use a different formula, or are there generally better ways to model the response given the structure of my predictors?
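Spelled out, that traditional specification might look roughly like the following sketch (my reading of the description above, not a formula I have fitted; taskident gates the four-way term so it vanishes for picsel):

```r
# Sketch only: the three-way task:group:gramm structure is replaced by
# task * group plus a four-way term gated by taskident
# (0 for picsel, 1 for written/spoken).
response ~ task * group +
  task:group:gramm:taskident +
  (1 | question) + (1 | participant)
```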

I am grateful for any suggestions or pointers to references on this issue!

• Operating System: ubuntu 18.04
• brms Version: 2.9.0

I don't know that I'm following all the details for your four-way interaction. It might help if you spelled out your intended equation using LaTeX syntax. But generally speaking, your model `formula` and other arguments within `brm()` looked reasonable given your description.

Maybe I wasn't clear enough about my main question, so I'll try to reformulate:
I want to define an interaction for only a sub-portion of the data. In the end, I did that with an additional dummy variable in the interaction, which is 0 for the part of the data for which the interaction is not defined and 1 for the part where it is.
The model predictions for all defined interactions make sense, and I generally see what I expected. The only thing I don't understand is that I also get predictions for interactions between levels for which the interaction is not defined, which I wanted to exclude by setting my dummy variable to 0 for those observations. Should I worry that predicting interactions between levels that I do not "need" may affect the model's predictions of the others, or can I ignore them because this may simply be how a Bayesian model works?
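To illustrate the mechanism I mean, here is a tiny sketch (hypothetical three-row data, names as in toydf) showing how taskident = 0 zeroes the interaction columns for picsel:

```r
# Hypothetical mini data set: one picsel row (taskident = 0) and two
# written rows (taskident = 1).
df <- data.frame(
  task      = c("picsel", "written", "written"),
  gramm     = factor(c("ps", "G", "U"), levels = c("G", "ps", "U")),
  taskident = c(0, 1, 1)
)
# Design matrix for the gated interaction alone: every taskident:gramm
# column is 0 in the picsel row, so that row's linear predictor gets
# no contribution from the interaction.
mm <- model.matrix(~ taskident:gramm, df)
mm[, -1]   # drop the intercept column for display
```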