Categorical family: group-levels not present in each response category - is this ok?

Hi,

I am working on a categorical regression with a single predictor and a single group-level effect. However, the grouping levels are not present in every response category (I measured multiple individuals per group), yet the model still estimates a value for each grouping level within each response category.
Here is a simple example:

set.seed(1)  # any seed; the printed data below came from an unrecorded seed
X <- rnorm(18, 0, 1)
Y <- rep(letters[1:3], each = 6)
G <- rep(1:6, each = 3)
df <- dplyr::tibble(X = X, Y = Y, group = G)

        X Y     group
    <dbl> <chr> <int>
 1 -0.410 a         1
 2  0.555 a         1
 3 -1.71  a         1
 4 -0.619 a         2
 5 -1.55  a         2
 6 -0.462 a         2
 7  0.482 b         3
 8 -0.972 b         3
 9  1.10  b         3
10 -0.620 b         4

In this dataset group level 1 is only found in category “a”.

mod1 <- brm(Y ~ X + (1 | group),
            family = categorical(link = "logit", refcat = "a"),
            data = df)

> posterior_summary(mod1) %>% round(digits = 3)
                          Estimate Est.Error    Q2.5   Q97.5
b_mub_Intercept              0.452     3.260  -5.760   7.223
b_muc_Intercept              0.191     2.978  -6.142   6.229
b_mub_X                      2.464     5.463  -4.154  15.879
b_muc_X                      1.386     4.111  -6.159  11.838
sd_group__mub_Intercept     10.172    10.299   2.018  39.487
sd_group__muc_Intercept     11.310    11.976   2.158  41.000
r_group__mub[1,Intercept]  -10.265    14.011 -43.116   0.529
r_group__mub[2,Intercept]   -8.547    12.212 -37.523   1.356
r_group__mub[3,Intercept]   11.371    14.573   0.220  44.897
r_group__mub[4,Intercept]   10.562    12.806   0.272  40.739
r_group__mub[5,Intercept]   -6.096    13.921 -40.179   9.429
r_group__mub[6,Intercept]   -5.223    14.422 -40.256  10.334
r_group__muc[1,Intercept]  -11.016    14.123 -49.811   0.497
r_group__muc[2,Intercept]   -9.569    13.562 -44.353   1.344
r_group__muc[3,Intercept]   -6.402    13.830 -42.161   8.826
r_group__muc[4,Intercept]   -6.379    14.707 -43.210   9.205
r_group__muc[5,Intercept]   12.545    14.153   0.392  49.174
r_group__muc[6,Intercept]   12.435    15.262   0.111  46.443

This model estimates a value for group level 1 in categories “b” and “c”, even though those combinations do not exist in the data. This seems strange to me: the model cannot effectively estimate r_group__mub[1,Intercept] because there are no group-1 observations in category “b”! And in practice, the SDs of these estimates are really high.

I tried a variety of approaches to have the model estimate group-level effects only for the specific categories in which they occur, such as nested and partially nested structures, but nothing worked (which was probably intentional).

Is this model still valid as it is? Is there a different way to write the formula, or a way to remove certain category combinations from the model? Or should I take a different approach to this problem altogether?

Thanks for the advice!


Hi,
I am not sure how representative the simulated data are of your actual data, but categorical models in general expect some variability in the response given the predictors. (This is similar to your other inquiry at Multilevel ordinal model - large values in threshold and group level sd estimates?) In the data you’ve shown, the group seems to completely determine Y. In a frequentist context, one would speak of “complete separation”, and frequentist models will fail because the maximum likelihood estimate for the group effect is infinite. Bayesian models do not fail so badly in the face of complete separation, because the prior lets us avoid infinite estimates, but the data + prior are still unable to inform the coefficients beyond putting a lower/upper bound on them. The prior also has a big influence on the final values of the coefficients in this case.

Also note that in a categorical model the coefficients are always relative to the reference category, so you cannot avoid estimating a value for every category if you want a full distribution for the outcome. I therefore don’t think removing the coefficients would make much sense; if you really wanted to do it, I think you’d need to go to pure Stan.
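To make the “relative to the reference category” point concrete, here is a minimal sketch using the point estimates from the posterior summary above (X = 0.5 is an arbitrary illustrative value): the reference category’s linear predictor is fixed at 0, and the category probabilities come from a softmax over all three predictors, so every coefficient affects every category’s probability.

# Linear predictors at X = 0.5; "a" is the reference, fixed at 0
eta_a <- 0
eta_b <- 0.452 + 2.464 * 0.5  # b_mub_Intercept + b_mub_X * X
eta_c <- 0.191 + 1.386 * 0.5  # b_muc_Intercept + b_muc_X * X

# Softmax: each category's probability depends on all linear predictors,
# which is why none of the coefficients can simply be dropped
eta <- c(a = eta_a, b = eta_b, c = eta_c)
probs <- exp(eta) / sum(exp(eta))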

Best of luck with your model!


Thanks for the reply @martinmodrak! I am also tagging the other post Multilevel ordinal model - large values in threshold and group level sd estimates? here, because these two separate questions stem from the same problem with “complete separation” in the group-level predictors.

I’ll explain the specific problem in a bit more detail and ask the community for ways of dealing with it, recognizing that there may not be a good solution.

The df dataset above has X as a continuous predictor, and Y as a categorical response. Then I have group. What group represents in reality are different species. So the real data look like this

   measurement category species
         <dbl> <chr>      <int>
 1     -1.88   a              1
 2     -1.20   a              1
 3      0.608  a              1
 4     -1.21   a              2
 5     -0.0216 a              2
 6      0.824  a              2
 7      1.14   b              3
 8      0.232  b              3
 9      1.30   b              3
10      1.16   b              4

Each species belongs to only a single response category, but I have multiple measurements per species: 96 total measurements from 24 species. Unlike this fake dataset, though, I have uneven sampling per species (mean = 3.5 measurements, range = 1–5).

My goal is to use the intraspecific error in the model rather than using the mean of each species as a predictor, but of course each species belongs to only one response category, so I can’t use a regular category ~ measurement + (1|species) structure. Here are a few possible solutions:

  1. Ignore the species grouping variable, and run the model with all data. This, of course, means uneven sampling of species, and therefore uneven contributions from different species, which could bias the results.

  2. Estimate a mean and standard error for each species, and use a measurement-error model such as category ~ me(mean_measurement, se_measurement) (in brms, me() takes the value and its standard deviation as two separate variables). The problem with this is that I reduce my dataset from n = 96 to n = 24.

  3. Some form of non-linear model, perhaps, where I jointly estimate a distribution of mean values for each species and use those estimates as predictors in the focal model category ~ measurement. This is similar conceptually to what I mentioned above with

a different way to remove certain categorical combinations from the model?

but different because there is no direct group effect. However, I am not really experienced with the non-linear framework and am not sure whether that is possible or advised.
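A rough sketch of option 2, assuming the df layout shown above and hypothetical column names (mean_m, se_m). One caveat: species with a single measurement yield sd = NA, so those would need a pooled or otherwise imputed standard error.

library(dplyr)
library(brms)

species_df <- df %>%
  group_by(species, category) %>%   # category is constant within species
  summarise(mean_m = mean(measurement),
            se_m   = sd(measurement) / sqrt(n()),  # NA when n() == 1
            .groups = "drop")

mod_me <- brm(category ~ me(mean_m, se_m),
              family = categorical(link = "logit", refcat = "a"),
              data = species_df)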

In this post I was discussing the categorical response family, though I believe the same principles apply to any of the ordinal families, or to the Bernoulli family for that matter.

Does anyone have suggestions on a good way to proceed with this data structure? I feel like this can’t be a unique problem. Thank you in advance.


I think that the problem as you’ve posed it is potentially not really well-formed: if species already completely determines category, we can’t do better at predicting category than by using species, so adding more predictors is meaningless.

A potential way to reframe this would be to use measurement to predict species. Since species gives you category directly, you can then use the fitted model to also make predictions and inferences about category (which would be more precise than the inferences you make about species).
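A sketch of that reframing (assuming species is coded as a factor, and with a hypothetical species_to_category named lookup vector): fit the 24-level species model, then collapse the posterior species probabilities into category probabilities.

library(brms)

# Predict the 24-level species label from measurement
mod_sp <- brm(species ~ measurement, family = categorical(), data = df)

# Posterior expected species probabilities for a new measurement
newd <- data.frame(measurement = 0.5)
pp <- posterior_epred(mod_sp, newdata = newd)  # draws x 1 x n_species
p_species <- apply(pp, 3, mean)                # average over draws

# Collapse species probabilities into category probabilities using the
# known species -> category mapping
p_category <- tapply(p_species, species_to_category[names(p_species)], sum)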

Another model that could make sense would be to have category ~ measurement + (1 | species), but that has a somewhat different interpretation.

Additionally, I don’t think there is anything particularly wrong with the me approach: while you reduce the number of rows in the dataset, you should not be losing much information, as each row now contains more information.

Best of luck with your model.


I think that the problem as you’ve posed it is potentially not really well-formed: if species already completely determines category we can’t really do better at predicting category than by using species, so adding more predictors is meaningless.

That’s a good point! I didn’t describe the end goal. The question is whether measurement predicts category. The multiple measurements are just an artifact of the sampling. The ultimate goal is to have a predictive model where we predict category from measurement for species we know nothing about, in this case extinct species that died off millions of years ago.

Additionally I don’t think there is anything particularly wrong with the me approach

Ok, I like getting a second opinion on that approach :) I think this could also work well in the predictive framework I mentioned above.

Another model that could make sense would be to have category ~ measurement + (1 | species), but that has a somewhat different interpretation.

So that was the original model, category ~ measurement + (1 | species), family = categorical, that led to the problems at the beginning: each species belongs to only one category, hence the blown-up error estimates for species not in a given category. Though as a side note, using category ~ measurement + (0 + measurement | species), family = categorical actually gave much more realistic error estimates, even for the “completely separated” categories.

Thanks @martinmodrak !


Oh, you’re right, my attention slipped a bit, sorry for the confusion. The category ~ measurement + (0 + measurement | species) model is IMHO quite sensible!
