Multiple membership models when replicates belong to different numbers of groups

I would like to fit a model wherein replicates belong to different groups (a multiple membership model), but importantly the replicates can differ in the number of groups to which they belong. What is the appropriate syntax to account for multiple membership when replicates belong to different numbers of groups? I have tried several alternatives, but the differing answers make me uncertain about which is correct.

From a previously answered question, it’s clear that filling in the extra group columns with NAs will cause the model to drop these replicates entirely. We can reproduce that undesired behaviour with an example dataset:

set.seed(234)

dat1 <- data.frame(
  y = rnorm(100), x1 = rnorm(100), x2 = rnorm(100),
  g1 = sample(letters[1:10], 100, TRUE),
  g2 = sample(letters[1:10], 100, TRUE),
  g3 = sample(c(letters[1:10], NA_character_), 100, TRUE),
  stringsAsFactors = FALSE
)
fit1 <- brm(y ~ x1 + (1|mm(g1, g2, g3)), data = dat1)
sum(is.na(dat1$g3))  # 8 rows have missing g3
nrow(fit1$data)  # model throws away rows with NA

The answer proposed using filling of the NAs and weighting schemes to fit the model appropriately and with the complete dataset. I can think of three alternative strategies to do this:

(1) Fill extra groups using only the group levels to which the replicate is a member, balancing to achieve the appropriate weighting

tl;dr Clear that this works easily in some scenarios, but grows to be very tedious when replicates can belong to a wide range of group numbers.

This works only in limited cases. For example, when there are four group columns (g1, g2, g3, g4) and a replicate only belongs to two groups, one could code it as g1 = "A", g2 = "B", g3 = "A", g4 = "B". Each group would be given equal weights. This approach fails for the example dataset generated above, unless one were to expand the number of grouping columns to a values whose factors include all of the possible numbers of groups a replicate could belong to. Here, that would require at least 6 group columns to address the cases that belong to 2 vs. 3 groups (6 is the lowest number with factors 2 & 3).

(2) Introduce a new “dummy” group, with a corresponding weight of zero

tl;dr This behaves unexpectedly by estimating an intercept for the “dummy” group despite always having a weight of zero.

dat2 <- dat1
dat2$w1 <- ifelse(is.na(dat2$g3), 0.5, 1/3)
dat2$w2 <- ifelse(is.na(dat2$g3), 0.5, 1/3)
dat2$w3 <- ifelse(is.na(dat2$g3), 0, 1/3)
dat2$g3[is.na(dat2$g3)] <- "dummy"  # recode NAs to "dummy" in g3

fit2 <- update(fit1, . ~ x1 + (1|mm(g1, g2, g3,
                                    weights = cbind(w1, w2, w3))),
               newdata = dat2)
nrow(fit2$data) == 100  # keeps all data as expected
ranef(fit2)  # fits intercept for "dummy" (unexpected & undesired behaviour)

The model should not estimate an intercept for dummy because none of the replicates belong to that group…but it does. I suspect the estimate will always be nearly zero, but clearly is undesired behaviour.

(3) Haphazardly fill missing group with a pre-existing level, with a corresponding weight of zero

tl;dr The choice of group(s) used to replace NAs may be affecting the model fit, despite always having a weight of zero.

We ought to be able to replace the dummy group introduced above with any pre-existing group value provided that the corresponding weight remains 0. For convenience, let’s just use the value from g1:

dat3a <- dat2
dat3a$g3[dat3a$g3 == "dummy"] <- dat3a$g1[dat3a$g3 == "dummy"]
fit3a <- update(fit2, newdata = dat3a, seed = 984)
ranef(fit3a)  # no intercept for "dummy" (expected & desired behaviour)

It works! Provided that our choice of dummy replacement doesn’t matter. But it appears that the choice can change the fit. We can see this by replacing dummy with the value from g2 and comparing the fits:

dat3b <- dat2
dat3b$g3[dat3b$g3 == "dummy"] <- dat3b$g2[dat3b$g3 == "dummy"]
fit3b <- update(fit2, newdata = dat3b, seed = 984)
identical(ranef(fit3a), ranef(fit3b))  # FALSE

It’s possible that the seed argument isn’t behaving as I expect, and that the fits aren’t truly different. I don’t see the clear way to test this.

I would very much appreciate help/guidance as to how to appropriately handle this multiple membership problem.

  • Operating System: OSX 10.14.6
  • brms Version: 2.10.0

(1) Is the correct solution I think. You can always achieve the desired weighting using the weights argument.

Thanks for your reply, @paul.buerkner. Do you have a sense for why (3) gives different fits depending on which group is used to replace the NAs? Taken together with the behaviour of (2), it seems like the weights argument never treats a zero weight as I had expected.

Let me check, what zero weights actually imply. Perhaps there is a bug in that regard.

When I run

dat <- data.frame(
  y = rnorm(100), x1 = rnorm(100), x2 = rnorm(100),
  g1 = sample(1:10, 100, TRUE), g2 = sample(1:10, 100, TRUE),
  w1 = rep(1, 100), w2 = c(1, rep(0, 99))
)

make_standata(y ~ x1 + (1|mm(g1, g2, weights = cbind(w1, w2))), data = dat)

I see that the weights are correctly zero.

Great, thanks! This resolves my question for now. I’ll re-open if I learn anything else from continued testing with approaches (1) and (3).

Revisiting this, @wpetry, I’m wondering if you learned more about option 3, which would seemingly be useful in cases where an observed outcome should have zero weights across the board. For example, imagine a dataset in which hospital patients’ outcomes are modeled in part as a function of the nurses who provided care for them (a multiple membership structure), but some patients had no nurses assigned to them for one reason or another.

I was able to use @paul.buerkner’s solution for my problem, and haven’t encountered a problem like the one you describe. I wonder whether zeroing out the multi-membership term would do what you describe in the patient care scenario. Wouldn’t this imply that patients with no nurse would have an outcome that was the same as an average nurse or group of nurses (i.e., equal to the population-level intercept)? I may be way off base, but my off-the-cuff inclination would be to encode “no nurse” as one of the multimembership group ids. This would allow the outcomes for these patients to deviate strongly from the population-level mean.

Thanks, Will. For background, I’m borrowing concepts from this paper by Tranmer, Pallotti, and Lomi: The embeddedness of organizational performance: Multiple Membership Multiple Classification Models for the analysis of multilevel networks - ScienceDirect

At one point, the authors write: "If the network did include isolates, the rows of W corresponding to those isolates would have zero weights, w_{i,j}, for all columns of W.

Implementing this is brms has been a little tricky because the brm() function throws an error message when all of the weights are zero (which is probably sensible overall). As a workaround, I’ve been building on your Option 3 while assigning an infinitesimal decimal amount to the pre-existing level that I’m including as a dummy of sorts for the first multiple membership level (and then zero weights for all other pre-existing levels assigned to the case). That seems inelegant but reasonable in light of the constraints. I’d be grateful for any feedback.