Hi everyone, I’m running a super basic brms model that looks like this (leaving out technical specifications like chains, iterations etc.):
brm(categorical outcome ~ binary predictor + (1 | participants), data, family = categorical())
The binary predictor splits the data into two groups based on their condition (text type in my case). There is more data for one text type than for the other. I have now been asked a few times whether the model normalizes those amounts to account for the difference. I’m sure it does, since having unequal amounts of data across conditions is the most normal thing in the world (at least in linguistics it feels that way), but I cannot find any source that actually says so.
I’m looking for:
- sources telling me whether this normalization happens
- tips on terminology so I can do more research myself (What is this size difference called? It isn’t a difference in sample sizes, nor am I dealing with the sort of groups that would be specified as such in the model.)
Thank you!
- Operating System: macOS
- brms Version: 2.22.0
You’re right that it’s normal to have different amounts of data in different regions of covariate space, and as long as you aren’t trying to make unprincipled extrapolations into data-poor regions of covariate space, there’s nothing particularly interesting or tricky about having imbalances in covariates. No “normalization” is necessary to get correct inference, and I think that either these questions are misinformed or perhaps you are misunderstanding them.
The actual problem, particularly in nonparametric approaches to classification, is how to handle serious imbalance in the categorical response data. As a simple example, suppose you train a binary classifier using a loss function based on the proportion classified correctly, and the sample is 90% zeros and 10% ones. You can then find that the best-performing classifier (according to your loss function) is simply to classify everything as zero, even when the data contain a strong signal that the covariate does matter. Maybe at one value of the covariate there’s a 60% chance of a zero and at another value there’s a 99% chance of a zero, but the classifier doesn’t care, because all it sees is that no matter what it should predict zero.
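To make the accuracy trap concrete, here is a small Python sketch (Python rather than R, purely for self-containedness) using made-up deterministic counts that match the 60%/99% figures above. Everything in it, including the group sizes, is invented for illustration.

```python
# Toy illustration of the accuracy trap under response imbalance.
# Made-up deterministic counts matching the 60%/99% example:
#   at x = 0: 600 zeros and 400 ones (60% zeros)
#   at x = 1: 990 zeros and  10 ones (99% zeros)
data = [(0, 0)] * 600 + [(0, 1)] * 400 + [(1, 0)] * 990 + [(1, 1)] * 10

def accuracy(classify):
    """Proportion of observations the rule classifies correctly."""
    return sum(classify(x) == y for x, y in data) / len(data)

always_zero = lambda x: 0                      # ignores the covariate entirely
use_covariate = lambda x: 1 if x == 0 else 0   # predicts a one where ones are most common

print(accuracy(always_zero))    # 0.795
print(accuracy(use_covariate))  # 0.695
```

Under proportion-correct loss, the constant all-zero rule wins, even though P(one | x) is 0.40 at one covariate value and 0.01 at the other.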
Parametric modeling approaches such as the GLMs implemented in brms avoid this issue by using better loss functions, in this case based on a parametric likelihood function, to estimate the parameters of the model (I say “better” with my tongue partway in my cheek, but I doubt many on this forum would disagree). So I wouldn’t worry here, as long as there are enough data to constrain the responses in all the categories and yield estimates that aren’t so uncertain as to be useless, and that aren’t highly sensitive to the prior.
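The contrast with a likelihood-based fit can be sketched in a few lines of Python (again, made-up data; this is not the brms machinery, just the maximum-likelihood estimate for a Bernoulli model, which reduces to the sample proportion within each covariate value):

```python
import math

# Same made-up deterministic data as the 60%/99% example.
data = [(0, 0)] * 600 + [(0, 1)] * 400 + [(1, 0)] * 990 + [(1, 1)] * 10

def mle_p_one(subset):
    """MLE of P(y = 1) under a Bernoulli model: the sample proportion of ones."""
    return sum(y for _, y in subset) / len(subset)

# Likelihood-based estimates recover the per-group probabilities
# despite the heavy imbalance in the response.
p_given_x = {x: mle_p_one([(xx, y) for xx, y in data if xx == x]) for x in (0, 1)}
print(p_given_x)  # {0: 0.4, 1: 0.01}

def log_lik(p_of_x):
    """Bernoulli log-likelihood of the data under a probability rule."""
    return sum(math.log(p_of_x(x) if y == 1 else 1 - p_of_x(x)) for x, y in data)

pooled = mle_p_one(data)  # a single probability that ignores the covariate
print(log_lik(lambda x: p_given_x[x]) > log_lik(lambda x: pooled))  # True
```

The likelihood clearly prefers the model that uses the covariate, which is exactly the signal the accuracy-based classifier threw away.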
I second @jsocolar. Based on what you’ve described so far, it sounds like you’re fine. There’s nothing in your model setup that requires equal sample sizes across levels of your grouping predictor. No normalization is necessary. The onus is on the people asking you about normalization to justify such a procedure (unless it’s your boss or advisor who is asking…).
@jsocolar @Solomon
Thank you both for your replies! I’m very happy to hear that there is nothing to worry about.
Regarding imbalances in the categorical response: yes, there are definitely some, but I have already made sure to include only categories above a certain count threshold in the statistical analysis, to minimize weak-likelihood issues. The remaining categories are of course still not perfectly balanced, but no category completely eclipses any other in number of observations now.
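That kind of thresholding step can be sketched in a few lines of Python; the category labels and the cutoff below are hypothetical, not from the actual analysis:

```python
from collections import Counter

# Hypothetical response observations and a made-up minimum-count threshold.
responses = ["A"] * 50 + ["B"] * 30 + ["C"] * 4 + ["D"] * 2
MIN_COUNT = 10  # assumed cutoff, not the one used in the original analysis

# Keep only categories that occur at least MIN_COUNT times.
counts = Counter(responses)
kept = {cat for cat, n in counts.items() if n >= MIN_COUNT}
filtered = [r for r in responses if r in kept]

print(sorted(kept))   # ['A', 'B']
print(len(filtered))  # 80
```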