Weak priors lead to high Rhat values

I am struggling with a binomial model

I am fitting the model with brms as follows

library(brms)

mb <- brm(
    bf(y | trials(total) ~ 1 
       + category + category : prop_category
       + (1 | item))    
  , data = data
  , family = binomial
  , prior = c(
        prior(normal(0, 0.1), class = sd)
      , prior(normal(0, 2), class = b)
      , prior(normal(0, 0.01), class = Intercept)
    )
  , cores = 4
  , chains = 4
  , warmup = 1000
  , iter = 4000
  , control = list(adapt_delta = 0.8)
)

The data look as follows

  y     total item   category  prop_category
  <dbl> <dbl> <chr>  <chr>     <dbl>
1    29    55 item_1 c1       0.0157  
2     2    47 item_1 c2       0.0134  
3     0    26 item_1 c3       0.00742 
4     0     3 item_1 c4       0.000857
5     0  3371 item_1 c5       0.963   
6   519 13097 item_2 c1       0.978   

Each item appears with at least 2 of the 5 possible categories, and prop_category is the proportion with which that category appears with that item (computed over the trials, not the y outcome).

The prior prior(normal(0, 0.1), class = sd) is there for theoretical reasons. However, the strong prior on the intercept is for purely fitting reasons. If I set a weaker prior, the chains do not mix well for the intercept, and I get very low ESS and high Rhat (again, only for the intercept).

My guess is that this is caused by some collinearity in the predictors. However, I cannot really fit a smaller model, as it wouldn't make much sense theoretically.

Here is the pairs plot:

Weirdly, the estimates of the model with a weak or strong prior on the Intercept are essentially identical (except for a bit of variation on the Intercept due to the chain mixing issue).

I also tried using QR decomposition and horseshoe priors, as Bürkner suggested in another thread, but this didn't help.
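For reference, this is roughly the syntax I used (a sketch, in case I got something wrong there): decomp = "QR" in bf() for the QR decomposition, and a horseshoe() prior on the population-level effects in place of normal(0, 2).

# QR decomposition of the design matrix
f_qr <- bf(y | trials(total) ~ 1
           + category + category : prop_category
           + (1 | item)
         , decomp = "QR")

# horseshoe prior on class b, replacing normal(0, 2)
p_hs <- prior(horseshoe(1), class = b)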

Is there any way around this issue?

Alternatively, given that the estimates seem to be stable independently of the intercept prior, is it justifiable to use a very strong prior on the grounds that otherwise the chains fail to converge?

Can you tell us more about the data? To me, the prop_category seems like a bit of an odd predictor - if I understand it correctly, there’s a unique value for each item : category combination, but then you also add an interaction of it with category?

Looking at the pairs plot, it looks like you have collinearity between one of the categories and the category-by-prop interactions. So the larger the model estimates the main effect, the larger it estimates the interaction as well. Also, the slope estimates for the interactions are very big, much bigger than for main effects. If they’re on the log odds scale, then those are some unbelievably strong effects. Lastly, the intercept is completely outside the scale of your priors, so that to me suggests that the prior isn’t correctly specified.
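If you haven’t already, a prior predictive check should make that mismatch visible. A minimal sketch, reusing your model call but sampling from the priors only (sample_prior = "only" in brms):

mb_prior_only <- brm(
    bf(y | trials(total) ~ 1
       + category + category : prop_category
       + (1 | item))
  , data = data
  , family = binomial
  , prior = c(
        prior(normal(0, 0.1), class = sd)
      , prior(normal(0, 2), class = b)
      , prior(normal(0, 0.01), class = Intercept)
    )
  , sample_prior = "only"   # ignore the likelihood, draw from the priors
  , chains = 4
  , cores = 4
)

# do the prior-implied counts even cover the observed y?
pp_check(mb_prior_only, ndraws = 100)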

The normal(0, 0.1) prior also seems like a really narrow range for the group-level (item) standard deviation, given how big some of the slope estimates are.

Have you tried modeling the trial-level data?

Hi, thank you for your answer.

Can you tell us more about the data?

Let’s say I have a text corpus where nouns can appear in either red or blue. In this corpus, nouns also always appear with one of five prepositions (the categories in the model). So the data would look something like this in long format:

Noun_1 preposition_1 blue
Noun_1 preposition_1 blue
Noun_1 preposition_2 blue
Noun_1 preposition_2 red
...

We believe that the rate at which a noun appears with blue is mostly governed by the preference of the noun itself, but also to a smaller degree by the color preference of the preposition. However, at the same time, nouns have preferences for prepositions, so maybe Noun_1 really likes to appear with prep_1 but Noun_2 likes to appear with prep_2. This is what the interaction term is supposed to be capturing, but maybe it’s completely wrong?

We also have two different datasets. We expect to see much less variation in the effect of preposition for Dataset 1, while we expect that for Dataset 2 the effect of Item should completely overpower everything else.

So the larger the model estimates the main effect, the larger it estimates the interaction as well. Also, the slope estimates for the interactions are very big, much bigger than for main effects. If they’re on the log odds scale, then those are some unbelievably strong effects.

My guess was that this was caused because of the scale of prop_category, but maybe I’m wrong?

Lastly, the intercept is completely outside the scale of your priors, so that to me suggests that the prior isn’t correctly specified.

The weird thing is, the estimate of the intercept is basically the same if I set the prior as Normal(0, 10), for example.

Have you tried modeling the trial-level data?

You mean, instead of aggregating the data as observations | trials, do logistic regression on each observation?
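Something roughly like this, i.e. one row per observation with a 0/1 outcome (a sketch; data_trial_level and the blue column are made up here)?

mb_trial <- brm(
    blue ~ 1 + category + category : prop_category + (1 | item)
  , data = data_trial_level
  , family = bernoulli
)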


Oh, I see! I may have to think this through a bit more, but I don’t think you need to include the proportion as a predictor, or any interaction at all. That is, if you’re interested in the effect of the noun in predicting color, but at the same time want to control for the effect of the preposition, then you can just include both as main effects:

color ~ noun + preposition

I might be wrong, but I think the frequency with which the nouns and prepositions appear together shouldn’t matter in this case. Think of it like this: it doesn’t matter for the probability of a coin landing heads whether you throw it 10, 100 or 1,000 times. That is, unless you have reason to believe that the frequency with which a noun x preposition combination appears influences the probability that the color ends up blue. You might run into trouble if you have very low counts of certain combinations, in that you won’t be able to accurately estimate the predictor slopes & your credible intervals will be very wide, but that should still be less of an issue if you only include main effects.

Also, as shown above, I think preposition may be better modeled as a fixed effect rather than a random effect, given that: 1) it sounds like the effects of individual prepositions may be very different & so it feels weird to think of some superpopulation from which the prepositions are randomly drawn, 2) you only have 5 levels, 3) the effect of individual prepositions may be interesting(?).

Right, that’s a good point, I didn’t think of that. That actually may be where some of your sampling problems come from: the proportion predictor is constrained between 0 and 1. Under the linear assumption, that would mean that the difference between 90% and 99.999999% probability is the same as the difference between 50% and 60%. So if you were to model the proportion, it might help to transform the probability into log odds.
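A small sketch of what I mean, assuming your prop_category column (if any proportions are exactly 0 or 1, you would need to nudge them off the boundary first, since the logit is undefined there):

# qlogis() is the logit (log-odds) function in base R
data$logit_prop <- qlogis(data$prop_category)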

The estimate may be the same because the data has overwhelmed the prior. As far as I know, however, outside of very specific use cases, an extreme overwhelming of the prior like this means that the prior was far too restrictive & didn’t include the regions of the parameter space that are plausible under domain knowledge. If you get problems with the sampler with a wider prior, that to me suggests problems with the model specification.

Try the simple model that I outlined above. Set some reasonable, weakly informative priors like normal(0, 1) on all parameters. If the model converges, make sure to do posterior predictive checks, and possibly cross-validate and use PSIS-LOO to diagnose the model.
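Roughly something like this, as a sketch (the blue outcome and data_trial_level are placeholders based on your description, not your actual column names):

fit_simple <- brm(
    blue ~ 1 + noun + preposition    # trial-level 0/1 outcome, main effects only
  , data = data_trial_level
  , family = bernoulli
  , prior = c(
        prior(normal(0, 1), class = b)
      , prior(normal(0, 1), class = Intercept)
    )
  , chains = 4
  , cores = 4
)

pp_check(fit_simple, ndraws = 100)   # posterior predictive check
loo(fit_simple)                      # PSIS-LOO diagnostics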

Yep. I don’t have much experience modeling aggregated data, but I find that it’s often a lot easier to think about a modeling problem on the level of the observational unit (which in this case would be a single observation of a blue/red noun with a specific preposition).

I think this answers my question. Thanks.