Hi,

I’m trying to wrap my head around modeling of missing data, in particular of categorical group-level data.

I’ve looked at the brms vignette “Handle Missing Values with brms” but could not find my use case described there, so I’m wondering whether it is supported by brms/Stan at all.

Problem description:

- I have a fairly large data set (~31k observations) with an inherent grouping structure. I fit a hierarchical model with group-level terms (1 + A + B | g1/g2), in addition to the usual population-level effects.
- One of the grouping factors (g1) is certain to contain some missing (unknown) values, while g2 is known to be complete (it is the basis of the data collection).
- I can be fairly certain (for organizational reasons) that the missing g1 values belong to one of the existing levels (g1 is a factor with 11 levels), so I shouldn’t be looking for additional groups. Adding an extra group is what I am doing right now, with an “Unknown” level in g1, but this is just to get the machinery going while I work on a better model :)
- The g1 values are all derived from another variable (a1, not part of the model), which itself has 75 levels, but 10 of these levels cannot identify which g1 value to use. One can think of the problem as “place all rows that share the same a1 value into the most likely existing g1 level”. Note that a1 is not part of the general model; it is only used for choosing the correct (or rather, the most likely) g1 level.
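
To make the “most likely existing g1 level” idea concrete, here is a rough sketch of what I have in mind, written in plain Python rather than brms/Stan. As far as I understand, Stan cannot sample discrete parameters directly, so the unknown level would have to be summed out (marginalized) along these lines. Everything in the sketch is made up for illustration: the function names, the Gaussian likelihood with known group means and a known sigma, the three groups (instead of 11), and the data values. It is not my actual model.

```python
import math

def log_sum_exp(values):
    # Numerically stable log(sum(exp(v))) over a list of log-values.
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def normal_logpdf(y, mu, sigma):
    # Log density of a normal distribution (toy likelihood, an assumption).
    return (-0.5 * math.log(2 * math.pi) - math.log(sigma)
            - 0.5 * ((y - mu) / sigma) ** 2)

def g1_membership(ys, group_means, sigma, log_prior=None):
    """Posterior probability of each g1 level for ONE a1 block.

    All rows in `ys` share the same a1 value and therefore the same
    (unknown) g1 level, so the level is summed out once per block.
    """
    k = len(group_means)
    if log_prior is None:
        # Uniform prior over the existing levels (an assumption).
        log_prior = [-math.log(k)] * k
    # Log joint of "block belongs to level j" and the observed rows.
    log_joint = [
        lp + sum(normal_logpdf(y, mu, sigma) for y in ys)
        for lp, mu in zip(log_prior, group_means)
    ]
    # Marginal log-likelihood of the block with the level summed out.
    log_marginal = log_sum_exp(log_joint)
    probs = [math.exp(lj - log_marginal) for lj in log_joint]
    return probs, log_marginal

# Hypothetical example: three g1 levels, three rows from one a1 level.
group_means = [0.0, 2.0, 5.0]
ys = [1.8, 2.3, 2.1]
probs, log_marginal = g1_membership(ys, group_means, sigma=1.0)
best = max(range(len(probs)), key=probs.__getitem__)
```

In a real model the group means would of course be parameters estimated jointly with everything else, and `log_marginal` would be the contribution of that a1 block to the likelihood. I don’t know whether something like this can be expressed in brms at all.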

Can brms (or stan) help me sort this out? How would I go about solving this problem in a reasonable way?

All in all, we are only talking about ~2% of the observations (624 out of 31007), and two of the a1 levels account for over 2/3 of those missing values.

So, if this problem is unsolvable - what is your recommendation?

- Excluding the missing values?
- Modeling them as a separate (“Unknown”) group?

Best wishes,

/Anders