I would like to fit a model wherein replicates belong to different groups (a multiple membership model), but importantly the replicates can differ in the number of groups to which they belong. What is the appropriate syntax to account for multiple membership when replicates belong to different numbers of groups? I have tried several alternatives, but the differing answers make me uncertain about which is correct.
From a previously answered question, it’s clear that filling in the extra group columns with NA
s will cause the model to drop these replicates entirely. We can reproduce that undesired behaviour with an example dataset:
set.seed(234)
dat1 <- data.frame(
y = rnorm(100), x1 = rnorm(100), x2 = rnorm(100),
g1 = sample(letters[1:10], 100, TRUE),
g2 = sample(letters[1:10], 100, TRUE),
g3 = sample(c(letters[1:10], NA_character_), 100, TRUE),
stringsAsFactors = FALSE
)
fit1 <- brm(y ~ x1 + (1|mm(g1, g2, g3)), data = dat1)
sum(is.na(dat1$g3)) # 8 rows have missing g3
nrow(fit1$data) # model throws away rows with NA
The answer proposed using filling of the NA
s and weighting schemes to fit the model appropriately and with the complete dataset. I can think of three alternative strategies to do this:
(1) Fill extra groups using only the group levels to which the replicate is a member, balancing to achieve the appropriate weighting
tl;dr Clear that this works easily in some scenarios, but grows to be very tedious when replicates can belong to a wide range of group numbers.
This works only in limited cases. For example, when there are four group columns (g1
, g2
, g3
, g4
) and a replicate only belongs to two groups, one could code it as g1 = "A"
, g2 = "B"
, g3 = "A"
, g4 = "B"
. Each group would be given equal weights. This approach fails for the example dataset generated above, unless one were to expand the number of grouping columns to a values whose factors include all of the possible numbers of groups a replicate could belong to. Here, that would require at least 6 group columns to address the cases that belong to 2 vs. 3 groups (6 is the lowest number with factors 2 & 3).
(2) Introduce a new “dummy” group, with a corresponding weight of zero
tl;dr This behaves unexpectedly by estimating an intercept for the “dummy” group despite always having a weight of zero.
dat2 <- dat1
dat2$w1 <- ifelse(is.na(dat2$g3), 0.5, 1/3)
dat2$w2 <- ifelse(is.na(dat2$g3), 0.5, 1/3)
dat2$w3 <- ifelse(is.na(dat2$g3), 0, 1/3)
dat2$g3[is.na(dat2$g3)] <- "dummy" # recode NAs to "dummy" in g3
fit2 <- update(fit1, . ~ x1 + (1|mm(g1, g2, g3,
weights = cbind(w1, w2, w3))),
newdata = dat2)
nrow(fit2$data) == 100 # keeps all data as expected
ranef(fit2) # fits intercept for "dummy" (unexpected & undesired behaviour)
The model should not estimate an intercept for dummy
because none of the replicates belong to that group…but it does. I suspect the estimate will always be nearly zero, but clearly is undesired behaviour.
(3) Haphazardly fill missing group with a pre-existing level, with a corresponding weight of zero
tl;dr The choice of group(s) used to replace NA
s may be affecting the model fit, despite always having a weight of zero.
We ought to be able to replace the dummy
group introduced above with any pre-existing group value provided that the corresponding weight remains 0
. For convenience, let’s just use the value from g1
:
dat3a <- dat2
dat3a$g3[dat3a$g3 == "dummy"] <- dat3a$g1[dat3a$g3 == "dummy"]
fit3a <- update(fit2, newdata = dat3a, seed = 984)
ranef(fit3a) # no intercept for "dummy" (expected & desired behaviour)
It works! Provided that our choice of dummy
replacement doesn’t matter. But it appears that the choice can change the fit. We can see this by replacing dummy
with the value from g2
and comparing the fits:
dat3b <- dat2
dat3b$g3[dat3b$g3 == "dummy"] <- dat3b$g2[dat3b$g3 == "dummy"]
fit3b <- update(fit2, newdata = dat3b, seed = 984)
identical(ranef(fit3a), ranef(fit3b)) # FALSE
It’s possible that the seed
argument isn’t behaving as I expect, and that the fits aren’t truly different. I don’t see the clear way to test this.
I would very much appreciate help/guidance as to how to appropriately handle this multiple membership problem.
- Operating System: OSX 10.14.6
- brms Version: 2.10.0