brms model formulation for dual hierarchy

Hi folks,

I am seeking your advice on the following question. I have a dataset of scientific publications and their citation counts, for which I know which working group published them and to which university each group belongs. I want to take this structure into account in a regression model. With the model I hope to estimate to what degree the variance in the citation counts can be accounted for by the different levels (authors, working groups, universities). For that reason I want to estimate random effects for all units at these levels, although I am not primarily interested in the unit scores as such, only in their variability within levels.

The difficulty is that, besides this institutional structure, there is a hierarchical relationship between authors and papers, and I was unable to figure out how to correctly specify both hierarchies simultaneously in a brms model. Typically, authors participate in several papers and papers are written by several authors. The actual dataset looks something like this fictitious, simplified example:

PAPER  AUTHOR_CNT  AUTHOR  WORKING_GROUP  UNIVERSITY  CITATIONS
P_A    3           AU_1    U_1_WG_1       U_1         20
P_A    3           AU_2    U_1_WG_1       U_1         20
P_A    3           AU_3    U_2_WG_1       U_2         20
P_B    5           AU_1    U_1_WG_1       U_1         44
P_B    5           AU_4    U_1_WG_2       U_1         44
P_B    5           AU_5    U_2_WG_1       U_2         44
P_B    5           AU_3    U_2_WG_1       U_2         44
P_B    5           AU_6    U_1_WG_1       U_1         44
P_C    6           AU_7    U_2_WG_1       U_2         5
P_C    6           AU_8    U_2_WG_2       U_2         5
P_C    6           AU_9    U_2_WG_2       U_2         5
P_C    6           AU_2    U_1_WG_1       U_1         5
P_C    6           AU_3    U_2_WG_1       U_2         5
P_C    6           AU_1    U_1_WG_1       U_1         5
P_D    7           AU_2    U_1_WG_1       U_1         11
P_D    7           AU_4    U_1_WG_2       U_1         11
P_D    7           AU_5    U_2_WG_1       U_2         11
P_D    7           AU_3    U_2_WG_1       U_2         11
P_D    7           AU_7    U_2_WG_1       U_2         11
P_D    7           AU_8    U_2_WG_2       U_2         11
P_D    7           AU_9    U_2_WG_2       U_2         11

Each row is for one author in a paper, so papers have as many rows as they have authors.
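As a quick sanity check on this layout, here is a minimal base R sketch that rebuilds the fictitious example and confirms that each paper has as many rows as AUTHOR_CNT says and that CITATIONS is simply repeated on every row of a paper. The object names (`d`, `papers`, and so on) are made up for illustration.

```r
# Fictitious example from the post, one row per (paper, author) pair.
n_rows <- c(3, 5, 6, 7)
d <- data.frame(
  PAPER      = rep(c("P_A", "P_B", "P_C", "P_D"), times = n_rows),
  AUTHOR     = c("AU_1", "AU_2", "AU_3",
                 "AU_1", "AU_4", "AU_5", "AU_3", "AU_6",
                 "AU_7", "AU_8", "AU_9", "AU_2", "AU_3", "AU_1",
                 "AU_2", "AU_4", "AU_5", "AU_3", "AU_7", "AU_8", "AU_9"),
  AUTHOR_CNT = rep(c(3, 5, 6, 7), times = n_rows),
  CITATIONS  = rep(c(20, 44, 5, 11), times = n_rows),
  stringsAsFactors = FALSE
)

# Does every paper have exactly AUTHOR_CNT rows?
rows_match_count <- all(tapply(d$AUTHOR_CNT, d$PAPER,
                               function(n) length(n) == n[1]))

# Is the outcome identical on all rows of a paper?
one_outcome_per_paper <- all(tapply(d$CITATIONS, d$PAPER,
                                    function(y) length(unique(y)) == 1))

# No (paper, author) pair should appear twice.
no_duplicate_pairs <- !any(duplicated(d[c("PAPER", "AUTHOR")]))

# Paper-level view: one row per distinct outcome.
papers <- unique(d[c("PAPER", "AUTHOR_CNT", "CITATIONS")])
```

The paper-level view makes it visible that the 21 rows carry only four distinct outcome values, which is the dependency any model on the long format has to account for.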

The institutional hierarchy for this example looks like this:

UNIVERSITY  WORKING_GROUP  AUTHOR
U_1         U_1_WG_1       AU_1
U_1         U_1_WG_1       AU_2
U_1         U_1_WG_1       AU_6
U_1         U_1_WG_2       AU_4
U_2         U_2_WG_1       AU_3
U_2         U_2_WG_1       AU_5
U_2         U_2_WG_1       AU_7
U_2         U_2_WG_2       AU_8
U_2         U_2_WG_2       AU_9

That is, there are several universities, each with several working groups, each of which in turn has several authors. Authors freely collaborate across groups and universities to write papers.
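Whether the labels really encode a strict hierarchy can be checked directly. The following is a small base R sketch on the example table above; `is_nested` is a hypothetical helper written for this illustration and is not part of brms.

```r
# Institutional hierarchy from the example: one row per author.
inst <- data.frame(
  UNIVERSITY    = c("U_1", "U_1", "U_1", "U_1",
                    "U_2", "U_2", "U_2", "U_2", "U_2"),
  WORKING_GROUP = c("U_1_WG_1", "U_1_WG_1", "U_1_WG_1", "U_1_WG_2",
                    "U_2_WG_1", "U_2_WG_1", "U_2_WG_1", "U_2_WG_2", "U_2_WG_2"),
  AUTHOR        = c("AU_1", "AU_2", "AU_6", "AU_4",
                    "AU_3", "AU_5", "AU_7", "AU_8", "AU_9"),
  stringsAsFactors = FALSE
)

# TRUE if every level of `child` occurs under exactly one level of `parent`.
is_nested <- function(child, parent) {
  all(tapply(parent, child, function(p) length(unique(p))) == 1)
}

groups_in_universities <- is_nested(inst$WORKING_GROUP, inst$UNIVERSITY)
authors_in_groups      <- is_nested(inst$AUTHOR, inst$WORKING_GROUP)
```

If either check fails on the real data (for example, an author who changed working groups), the institutional side is no longer a clean hierarchy and would need separate treatment.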

My first thought was simply to include PAPER as another level below AUTHOR:
CITATIONS | weights(1/AUTHOR_CNT) ~ 1 + (1 | UNIVERSITY/WORKING_GROUP/AUTHOR/PAPER)
But it occurred to me that the information on the dependency of observations due to PAPERs would be lost.
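One way to see the problem with nesting PAPER under AUTHOR is to tabulate who wrote what. This base R sketch uses the fictitious example; the object names are illustrative only. It also shows what the 1/AUTHOR_CNT weights do at the paper level.

```r
n_rows <- c(3, 5, 6, 7)
d <- data.frame(
  PAPER      = rep(c("P_A", "P_B", "P_C", "P_D"), times = n_rows),
  AUTHOR     = c("AU_1", "AU_2", "AU_3",
                 "AU_1", "AU_4", "AU_5", "AU_3", "AU_6",
                 "AU_7", "AU_8", "AU_9", "AU_2", "AU_3", "AU_1",
                 "AU_2", "AU_4", "AU_5", "AU_3", "AU_7", "AU_8", "AU_9"),
  AUTHOR_CNT = rep(c(3, 5, 6, 7), times = n_rows),
  stringsAsFactors = FALSE
)

# Author-by-paper membership matrix (TRUE where the author is on the paper).
membership <- table(d$AUTHOR, d$PAPER) > 0

papers_per_author <- rowSums(membership)
authors_per_paper <- colSums(membership)

# PAPER would be nested in AUTHOR only if each paper had a single author.
paper_nested_in_author <- all(authors_per_paper == 1)

# With weights 1 / AUTHOR_CNT, the rows of each paper sum to a weight of 1.
d$W <- 1 / d$AUTHOR_CNT
weight_per_paper <- tapply(d$W, d$PAPER, sum)
```

Because most authors have several papers and every paper has several authors, the two factors are crossed. In a term like UNIVERSITY/WORKING_GROUP/AUTHOR/PAPER the lowest level becomes the AUTHOR-by-PAPER combination, that is, one level per row, so the rows of one paper no longer share a common paper effect.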

I then came up with this crossed structure:
CITATIONS | weights(1/AUTHOR_CNT) ~ 1 + (1 | UNIVERSITY/WORKING_GROUP/AUTHOR) + (1 | PAPER/AUTHOR)
I believe this is incorrect because part of the nested structure is lost, but I’m not sure.

I considered using a multiple group membership structure, but from what I understand this would mean creating a group variable for each PAPER and assigning membership weights for each author to each of those variables, which seems excessive and difficult to arrange with the current structure of the dataset.

I would be very grateful for any advice or pointers to similar examples.
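For reference, here is a rough sketch of what a multi-membership layout could look like, assuming the usual brms pattern of one row per paper with one grouping column and one weight column per author slot, passed to `mm()`. The column names `A1`, `W1`, etc. are made up, and padding short author lists with a repeated author at weight 0 is an assumption worth verifying against the `mm()` documentation before relying on it. The reshaping itself is plain base R.

```r
n_rows <- c(3, 5, 6, 7)
d <- data.frame(
  PAPER     = rep(c("P_A", "P_B", "P_C", "P_D"), times = n_rows),
  AUTHOR    = c("AU_1", "AU_2", "AU_3",
                "AU_1", "AU_4", "AU_5", "AU_3", "AU_6",
                "AU_7", "AU_8", "AU_9", "AU_2", "AU_3", "AU_1",
                "AU_2", "AU_4", "AU_5", "AU_3", "AU_7", "AU_8", "AU_9"),
  CITATIONS = rep(c(20, 44, 5, 11), times = n_rows),
  stringsAsFactors = FALSE
)

authors_by_paper <- split(d$AUTHOR, d$PAPER)
k <- max(lengths(authors_by_paper))          # widest author list
pad <- function(x, fill) c(x, rep(fill, k - length(x)))

# Author slots: unused slots repeat the first author but get weight 0.
A <- t(sapply(authors_by_paper, function(a) pad(a, a[1])))
W <- t(sapply(authors_by_paper,
              function(a) pad(rep(1 / length(a), length(a)), 0)))
colnames(A) <- paste0("A", seq_len(k))
colnames(W) <- paste0("W", seq_len(k))

wide <- data.frame(
  PAPER     = rownames(A),
  CITATIONS = as.vector(tapply(d$CITATIONS, d$PAPER, function(y) y[1])),
  A, W,
  stringsAsFactors = FALSE
)

# Build the formula programmatically so the slot count is not hard-coded.
mm_term <- sprintf("(1 | mm(%s, weights = cbind(%s)))",
                   paste(colnames(A), collapse = ", "),
                   paste(colnames(W), collapse = ", "))
f_mm <- as.formula(paste("CITATIONS ~ 1 +", mm_term))

# Not run here; the family is a placeholder choice for count data:
# fit <- brms::brm(f_mm, data = wide, family = brms::negbinomial())
```

The group variables here are author slots rather than one variable per paper, so the number of extra columns grows with the largest author count, not with the number of papers. Working groups and universities would need the same treatment, since a single paper can span several of them.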

Best,
Paul.

  • Operating System: Windows 10
  • brms Version: 2.8.0

I will think about this further, since I am supposed to be fixing my own model problem now (of course). And if no one smarter responds before me I’ll take a whack at it. I come from the perspective of traditional multilevel/hierarchical/mixed models which are more limited in their capacity so I would have to think further about possibilities offered by brms.

In the interim a question and a preliminary recommendation.

Is there a way you have seen a model like this specified in any related literature?

I feel like I am misinterpreting your question (the question you are asking of your data, that is). So I’m going to rephrase it in two different ways, and you can correct me.

You are interested in knowing whether the number of citations is best explained by author, university, working group, or the paper itself.

OR whether the variation in the number of citations of papers associated with one grouping factor is caused by a second grouping factor.

  • papers by Author A have 2, 3, 4
  • papers by Author B have 3, 12, 34
  • papers by Author C have 1, 5, 30
    They are all at the same university but different working groups (and maybe B,C are in a group together.)

Or I’m totally wrong!

A few off-the-cuff comments and recommended steps, in case no one smarter comes along soon.

  • I suspect the number of years since publication will have a nontrivial effect on the number of citations.
  • The larger the number of papers an author (or group or university) has written, the greater the variation in the number of citations will be (at least that is my general feeling)
  • You may be interested in some sort of interaction vs just the variation between levels.

Simulate some data for the number of citations, varied by some within-groups structure, then see whether the model you specify captures the variation you built in (even if it isn’t the variation your data actually have, this will help you see whether the way you have described the model is on the right track). To that end, build it up in parts. Eat your idiomatic elephant bite by bite (or climb your mountain step by step, etc.).
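A minimal simulation sketch along these lines, in base R. Everything about the generating process is an assumption chosen for illustration: the sizes, the standard deviations, the intercept of 2, and in particular the choice to let a paper’s log mean depend on the average of its authors’ effects (each author effect being the sum of a university, a working group and an author deviation) plus a paper deviation.

```r
set.seed(1)

# Known "truth" to recover later (all values are arbitrary).
sd_univ <- 0.3; sd_group <- 0.5; sd_author <- 0.4; sd_paper <- 0.6
n_univ <- 5; groups_per_univ <- 4; authors_per_group <- 6; n_papers <- 200

# Explicitly coded hierarchy: university > working group > author.
inst <- expand.grid(a = seq_len(authors_per_group),
                    g = seq_len(groups_per_univ),
                    u = seq_len(n_univ))
inst$UNIVERSITY    <- paste0("U_", inst$u)
inst$WORKING_GROUP <- paste0(inst$UNIVERSITY, "_WG_", inst$g)
inst$AUTHOR        <- paste0("AU_", seq_len(nrow(inst)))

u_univ  <- setNames(rnorm(n_univ, 0, sd_univ), unique(inst$UNIVERSITY))
u_group <- setNames(rnorm(n_univ * groups_per_univ, 0, sd_group),
                    unique(inst$WORKING_GROUP))
inst$effect <- unname(u_univ[inst$UNIVERSITY] +
                      u_group[inst$WORKING_GROUP]) +
               rnorm(nrow(inst), 0, sd_author)

# One paper at a time: draw 2 to 7 authors, then a Poisson citation count.
sim_paper <- function(p) {
  k   <- sample(2:7, 1)
  idx <- sample(nrow(inst), k)
  eta <- 2 + mean(inst$effect[idx]) + rnorm(1, 0, sd_paper)
  data.frame(PAPER = paste0("P_", p), AUTHOR_CNT = k,
             inst[idx, c("AUTHOR", "WORKING_GROUP", "UNIVERSITY")],
             CITATIONS = rpois(1, exp(eta)),
             row.names = NULL, stringsAsFactors = FALSE)
}
long <- do.call(rbind, lapply(seq_len(n_papers), sim_paper))
```

The result has the same long format as the example data, so any candidate formula can be fitted to `long` and its estimated group-level standard deviations compared with the values set at the top. Starting with only one or two nonzero standard deviations and adding the others one at a time keeps the comparison readable.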

If I recall correctly, because your levels are coded explicitly, (1 | UNIVERSITY/WORKING_GROUP) = (1 | UNIVERSITY) + (1 | WORKING_GROUP).
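The equivalence can be illustrated without fitting anything. A nested term expands into the outer factor plus the outer-by-inner combination, so the question is only whether that combination splits the rows into the same groups as WORKING_GROUP on its own. A small base R sketch, with `same_partition` as a hypothetical helper and `wg_implicit` as a made-up counterexample in which group labels are reused across universities:

```r
univ <- c("U_1", "U_1", "U_1", "U_1", "U_2", "U_2", "U_2", "U_2", "U_2")
wg_explicit <- c("U_1_WG_1", "U_1_WG_1", "U_1_WG_1", "U_1_WG_2",
                 "U_2_WG_1", "U_2_WG_1", "U_2_WG_1", "U_2_WG_2", "U_2_WG_2")
# Counterexample: strip the university prefix so labels repeat ("WG_1", "WG_2").
wg_implicit <- sub("^U_[0-9]+_", "", wg_explicit)

# TRUE if two labelings split the rows into exactly the same groups.
same_partition <- function(a, b) {
  a <- as.character(a); b <- as.character(b)
  all(tapply(b, a, function(x) length(unique(x))) == 1) &&
    all(tapply(a, b, function(x) length(unique(x))) == 1)
}

# Grouping implied by the UNIVERSITY:WORKING_GROUP part of a nested term.
nested_explicit <- interaction(univ, wg_explicit, drop = TRUE)
nested_implicit <- interaction(univ, wg_implicit, drop = TRUE)

explicit_equivalent <- same_partition(nested_explicit, wg_explicit)
implicit_equivalent <- same_partition(nested_implicit, wg_implicit)
```

With the explicit labels from the example the two groupings coincide, so the additive form loses nothing. With reused labels they do not, and the nested syntax (or a relabeling) would be needed.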

You can also then nest each within paper or author.
(1 | PAPER : UNIVERSITY) + (1 | PAPER : WORKING_GROUP)

Or some structure like ~ AUTHOR + (1 | PAPER : UNIVERSITY) + (1 | PAPER : WORKING_GROUP) .

Ack that was more than I meant to write!

I’ll hope someone has good advice to give beyond my musings and check back later to see.

This is not an easy question, so I am not sure if my suggestions make much sense.

  1. As @MegBallard noted already, since, for instance, working groups are coded uniquely across universities (i.e., U_1_WG_1 appears only in university U_1), the nested structure
    (1 | university/working_group) is equivalent to (1 | university) + (1 | working_group), and I personally prefer the latter way of writing this down. You may even shorten it to (1 | university + working_group), but this is just some syntactic sugar I added in brms.

  2. What if we just used (1 | university) + (1 | working_group) + (1 | paper) + (1 | author)? That way all the levels are taken into account and we can directly compare how much variance is explained by the different levels.
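A sketch of how that comparison might be set up, written with the upper-case column names from the example data. The formula is an ordinary R formula object; the `brm()` call is left as a comment and its family is a placeholder. `variance_shares` is a hypothetical helper, and the draws matrix below contains made-up numbers standing in for posterior draws of the group-level standard deviations (in brms these are the parameters whose names start with `sd_`, as far as I know).

```r
f <- CITATIONS ~ 1 + (1 | UNIVERSITY) + (1 | WORKING_GROUP) +
  (1 | PAPER) + (1 | AUTHOR)

# Not run here:
# fit <- brms::brm(f, data = d, family = brms::negbinomial())

# Share of the total group-level variance per level, computed draw by draw.
# `sd_draws`: matrix with one row per draw and one column per level.
variance_shares <- function(sd_draws) {
  v <- sd_draws^2
  v / rowSums(v)
}

# Made-up stand-in for posterior draws, only to show the shapes involved.
sd_draws <- rbind(
  c(UNIVERSITY = 0.3, WORKING_GROUP = 0.5, AUTHOR = 0.4, PAPER = 0.6),
  c(UNIVERSITY = 0.2, WORKING_GROUP = 0.6, AUTHOR = 0.5, PAPER = 0.5)
)
shares <- variance_shares(sd_draws)

# Summaries per level, e.g. the mean share across draws.
mean_share <- colMeans(shares)
```

Computing the shares within each draw and summarising afterwards carries the uncertainty through to the proportions. The shares refer only to the group-level variances on the scale of the linear predictor, and they leave out the observation-level variability implied by the chosen family.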