Multiple membership models when replicates belong to different numbers of groups

I would like to fit a model wherein replicates belong to different groups (a multiple membership model), but importantly the replicates can differ in the number of groups to which they belong. What is the appropriate syntax to account for multiple membership when replicates belong to different numbers of groups? I have tried several alternatives, but the differing answers make me uncertain about which is correct.

From a previously answered question, it’s clear that filling in the extra group columns with NAs will cause the model to drop these replicates entirely. We can reproduce that undesired behaviour with an example dataset:

set.seed(234)

dat1 <- data.frame(
  y = rnorm(100), x1 = rnorm(100), x2 = rnorm(100),
  g1 = sample(letters[1:10], 100, TRUE),
  g2 = sample(letters[1:10], 100, TRUE),
  g3 = sample(c(letters[1:10], NA_character_), 100, TRUE),
  stringsAsFactors = FALSE
)
fit1 <- brm(y ~ x1 + (1|mm(g1, g2, g3)), data = dat1)
sum(is.na(dat1$g3))  # 8 rows have missing g3
nrow(fit1$data)  # model throws away rows with NA

The answer proposed using filling of the NAs and weighting schemes to fit the model appropriately and with the complete dataset. I can think of three alternative strategies to do this:

(1) Fill extra groups using only the group levels to which the replicate is a member, balancing to achieve the appropriate weighting

tl;dr Clear that this works easily in some scenarios, but grows to be very tedious when replicates can belong to a wide range of group numbers.

This works only in limited cases. For example, when there are four group columns (g1, g2, g3, g4) and a replicate only belongs to two groups, one could code it as g1 = "A", g2 = "B", g3 = "A", g4 = "B". Each group would be given equal weights. This approach fails for the example dataset generated above, unless one were to expand the number of grouping columns to a values whose factors include all of the possible numbers of groups a replicate could belong to. Here, that would require at least 6 group columns to address the cases that belong to 2 vs. 3 groups (6 is the lowest number with factors 2 & 3).

(2) Introduce a new “dummy” group, with a corresponding weight of zero

tl;dr This behaves unexpectedly by estimating an intercept for the “dummy” group despite always having a weight of zero.

dat2 <- dat1
dat2$w1 <- ifelse(is.na(dat2$g3), 0.5, 1/3)
dat2$w2 <- ifelse(is.na(dat2$g3), 0.5, 1/3)
dat2$w3 <- ifelse(is.na(dat2$g3), 0, 1/3)
dat2$g3[is.na(dat2$g3)] <- "dummy"  # recode NAs to "dummy" in g3

fit2 <- update(fit1, . ~ x1 + (1|mm(g1, g2, g3,
                                    weights = cbind(w1, w2, w3))),
               newdata = dat2)
nrow(fit2$data) == 100  # keeps all data as expected
ranef(fit2)  # fits intercept for "dummy" (unexpected & undesired behaviour)

The model should not estimate an intercept for dummy because none of the replicates belong to that group…but it does. I suspect the estimate will always be nearly zero, but clearly is undesired behaviour.

(3) Haphazardly fill missing group with a pre-existing level, with a corresponding weight of zero

tl;dr The choice of group(s) used to replace NAs may be affecting the model fit, despite always having a weight of zero.

We ought to be able to replace the dummy group introduced above with any pre-existing group value provided that the corresponding weight remains 0. For convenience, let’s just use the value from g1:

dat3a <- dat2
dat3a$g3[dat3a$g3 == "dummy"] <- dat3a$g1[dat3a$g3 == "dummy"]
fit3a <- update(fit2, newdata = dat3a, seed = 984)
ranef(fit3a)  # no intercept for "dummy" (expected & desired behaviour)

It works! Provided that our choice of dummy replacement doesn’t matter. But it appears that the choice can change the fit. We can see this by replacing dummy with the value from g2 and comparing the fits:

dat3b <- dat2
dat3b$g3[dat3b$g3 == "dummy"] <- dat3b$g2[dat3b$g3 == "dummy"]
fit3b <- update(fit2, newdata = dat3b, seed = 984)
identical(ranef(fit3a), ranef(fit3b))  # FALSE

It’s possible that the seed argument isn’t behaving as I expect, and that the fits aren’t truly different. I don’t see the clear way to test this.

I would very much appreciate help/guidance as to how to appropriately handle this multiple membership problem.

  • Operating System: OSX 10.14.6
  • brms Version: 2.10.0

(1) Is the correct solution I think. You can always achieve the desired weighting using the weights argument.

Thanks for your reply, @paul.buerkner. Do you have a sense for why (3) gives different fits depending on which group is used to replace the NAs? Taken together with the behaviour of (2), it seems like the weights argument never treats a zero weight as I had expected.

Let me check, what zero weights actually imply. Perhaps there is a bug in that regard.

When I run

dat <- data.frame(
  y = rnorm(100), x1 = rnorm(100), x2 = rnorm(100),
  g1 = sample(1:10, 100, TRUE), g2 = sample(1:10, 100, TRUE),
  w1 = rep(1, 100), w2 = c(1, rep(0, 99))
)

make_standata(y ~ x1 + (1|mm(g1, g2, weights = cbind(w1, w2))), data = dat)

I see that the weights are correctly zero.

Great, thanks! This resolves my question for now. I’ll re-open if I learn anything else from continued testing with approaches (1) and (3).