Bayes factors change depending on reference level!

I have a mixed-effects regression model with a binary response variable (Yes/No). The predictors are all binary as well. It turns out that when I calculate the Bayes factors for each parameter, the results change based on the reference level of the response variable and the predictors. When the reference level is Yes, one of the predictors gets a Bayes factor of 0.14; when I change it to No, the Bayes factor of the same predictor is > 1000.

[Screenshot 2021-05-07 at 16.22.01]
[Screenshot 2021-05-07 at 16.24.56]

How is this possible? Is this the way it is supposed to be?


I think people will be better able to comment or help if you can show the underlying model (including the priors) and the Stan implementation. It might also be useful to clarify what null model the Bayes factor for each parameter is being computed against.


This is the model:
baymodel_full4 = brm(Clitic ~ NumberDO*WordOrder + AnimacyIO +
                     cores = 4, seed = 135, save_pars = save_pars(all = TRUE),
                     control = list(max_treedepth = 15, adapt_delta = 0.99),
                     prior = set_prior("cauchy(0,2.5)", class = "b") +
                             set_prior("cauchy(0,10)", class = "Intercept"))

I’m using the Bayes factors to study the effect of the predictors, not for model selection, so I’m not comparing to a null model.

This is the syntax:
full_bf4= bayesfactor_parameters(baymodel_full4)


Where does the bayesfactor_parameters function come from and what does it do?


From the bayestestR package. From the authors of the package: “This method computes Bayes factors against the null (either a point or an interval), based on prior and posterior samples of a single parameter. This Bayes factor indicates the degree by which the mass of the posterior distribution has shifted further away from or closer to the null value(s) (relative to the prior distribution), thus indicating if the null value has become less or more likely given the observed data.”
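For intuition, here is a toy sketch (in Python, with made-up samples; not the package's actual code) of the Savage–Dickey density-ratio idea this description refers to: compare the prior and posterior density at the null value.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(1)

# Hypothetical prior and posterior samples for a single parameter.
prior = 2.5 * rng.standard_cauchy(50_000)      # wide Cauchy(0, 2.5) prior
prior = prior[np.abs(prior) < 50]              # trim extreme tails so the KDE behaves
posterior = rng.normal(1.2, 0.3, 50_000)       # posterior mass shifted away from 0

# Savage-Dickey: BF10 = prior density at the null / posterior density at the null.
bf10 = gaussian_kde(prior)(0.0)[0] / gaussian_kde(posterior)(0.0)[0]
print(bf10)  # much greater than 1: the null lost plausibility after seeing the data
```

Here the posterior has moved away from 0, so the Bayes factor against the point null is large; a posterior piling up near 0 would instead give a value below 1.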


I think I know what’s going on. The Bayes factor is testing the effect of the binary variable at the reference level of the other binary predictors, so when you change the level of the predictors, the directionality of the effect changes, and this is what the BF shows.
I will use some contrasts and report what happens, since in that case one is testing for a main effect.
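A small numeric illustration of that point (hypothetical cell means on the linear-predictor scale, nothing to do with the actual data): under treatment (dummy) coding of y ~ f1 * f2, the f1 coefficient is the simple effect of f1 at the reference level of f2, so relevelling f2 can change it completely.

```python
# Hypothetical cell means for a 2x2 design, factor 1 in {A, B}, factor 2 in {X, Y}.
mu = {("A", "X"): 0.0, ("B", "X"): 1.0, ("A", "Y"): 0.5, ("B", "Y"): -0.5}

def treatment_coefs(ref1, oth1, ref2, oth2):
    """Coefficients of y ~ f1*f2 under treatment (dummy) coding with given references."""
    b0 = mu[(ref1, ref2)]                     # intercept: the reference cell
    b1 = mu[(oth1, ref2)] - b0                # simple effect of f1 at f2 = ref2
    b2 = mu[(ref1, oth2)] - b0                # simple effect of f2 at f1 = ref1
    b12 = mu[(oth1, oth2)] - b0 - b1 - b2     # difference of differences
    return b0, b1, b2, b12

print(treatment_coefs("A", "B", "X", "Y"))  # f1 coefficient is +1.0 (effect at X)
print(treatment_coefs("A", "B", "Y", "X"))  # f1 coefficient is -1.0 (effect at Y)
```

Same data, same model family, opposite sign on the f1 coefficient, which is exactly what a per-parameter Bayes factor will pick up.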


Right, this is a trap (one of many :-) in using Bayes factors. I think a good background is at Bayes Factors • bayestestR, where they suggest using orthonormal coding of categorical predictors precisely for this reason. We’ll be happy to help you compute your Bayes factors, but please be advised that Bayes factors are tricky beasts. Beyond the coding of factors, your choice of prior for the b class will make huge differences to the Bayes factors. Additionally, it appears your model has fitting problems, since you ramped up adapt_delta and max_treedepth very high (presumably to avoid divergence/treedepth warnings); this can introduce additional fragility in the computation of Bayes factors.

An extensive background on some problems with Bayes factors is in [2103.08744] Workflow Techniques for the Robust Use of Bayes Factors; my current best thoughts on the general topic of hypothesis testing/model selection are at Hypothesis testing, model selection, model comparison - some thoughts.

Best of luck with your model!

Ohh wow, I had no idea about this. I did read their vignettes, but I may have overlooked this detail; I’ll read it again. The adapt_delta and max_treedepth were like that because the full model had a very complex ranef structure, and when I reduced it I didn’t change those parameters. Should I avoid those values unless absolutely necessary?

In a model that doesn’t require such high values, the only penalty for using them is efficiency, but this efficiency penalty can be large.

However, if your model doesn’t require such high values, then it is also reassuring to know that it doesn’t. Well-behaved posteriors, where the computation has the strongest guarantee (which is never an absolute guarantee) of being trustworthy, generally don’t require such large treedepths or adapt_delta values. Regardless of your specified max_treedepth, you can check the treedepths the model actually used post hoc from Stan’s CSV outputs. But with a high adapt_delta you lose some of the sensitivity of divergences as a diagnostic for pathologies in the computation, and there’s no way to recover that sensitivity post hoc.
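To illustrate the post-hoc check (a sketch over a made-up three-draw file, not real output): Stan’s sampler CSVs contain `treedepth__` and `divergent__` columns, with comment lines starting with `#`, so the realised treedepths and divergence counts can be read off directly.

```python
import csv
import io

# Stand-in for a Stan sampler CSV (real files have the same column names).
csv_text = """# model: example
lp__,accept_stat__,treedepth__,divergent__,b_x
-10.2,0.98,5,0,0.4
-10.5,0.91,6,0,0.5
-10.1,0.99,4,0,0.3
"""

# Skip '#' comment lines, then parse the remaining rows as CSV.
rows = list(csv.DictReader(
    line for line in io.StringIO(csv_text) if not line.startswith("#")))
max_treedepth_used = max(int(r["treedepth__"]) for r in rows)
n_divergent = sum(int(r["divergent__"]) for r in rows)
print(max_treedepth_used, n_divergent)  # 6 0
```

If `max_treedepth_used` stays well below your `max_treedepth` setting, the high cap was never actually exercised.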

I’m struggling with the interpretation of the model coefficients under orthonormal contrasts; I can’t find anything on it online. Any references? Should I present the coefficients, or should I use marginal-effects plots to report the model in the paper? I was using both posterior intervals and marginal-effects plots, but I don’t understand the signs of the coefficients (they don’t make sense to me). Thanks

Maybe I should be more specific. The signs of the interaction coefficient estimates are negative when I know they should be positive (with treatment coding). Is this because both coefficient contrasts are negative (-0.7071) in this case, so the estimate of the interaction becomes negative?

Interactions in particular are very tricky to interpret in almost all models, and the way you code factors changes how they behave. I think it is generally more useful not to fixate on the coefficients you have in the model but rather on the question you want answered. So e.g. in dummy coding an interaction between factors can be interpreted as a “difference of differences”: if your factor 1 has levels A:B and factor 2 has levels X:Y and the model is y ~ f1 * f2, then the B:Y coefficient corresponds to

(E(y | f1 = B, f2 = Y) - E(y | f1 = A, f2 = Y)) 
- (E(y | f1 = B, f2 = X) - E(y | f1 = A, f2 = X))

You can construct the same expectations (and their differences) by summing/subtracting the relevant coefficients.
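To make the sign question concrete, here is a toy sketch (hypothetical cell means, not the actual model) solving exactly for the coefficients of mu = b0 + b1*x1 + b2*x2 + b12*x1*x2 under ±0.7071-style level codes. Flipping which level of one factor gets the negative code flips the sign of that factor's main effect and of the interaction, while the fitted cell means are unchanged:

```python
import numpy as np

s = 1 / np.sqrt(2)                    # the 0.7071 in orthonormal-style codes
mu = np.array([0.0, 1.0, 0.5, 2.5])   # cell means for (A,X), (B,X), (A,Y), (B,Y)

def coefs(code1, code2):
    """Exact coefficients given numeric codes for the levels of each factor."""
    x1 = np.array([code1["A"], code1["B"], code1["A"], code1["B"]])
    x2 = np.array([code2["X"], code2["X"], code2["Y"], code2["Y"]])
    X = np.column_stack([np.ones(4), x1, x2, x1 * x2])
    return np.linalg.solve(X, mu)     # 4 cells, 4 coefficients: exact solve

b_orig = coefs({"A": -s, "B": +s}, {"X": -s, "Y": +s})
b_flip = coefs({"A": +s, "B": -s}, {"X": -s, "Y": +s})   # f1's codes flipped
print(b_orig)   # interaction coefficient is +0.5 here...
print(b_flip)   # ...and -0.5 here; only the labelling changed, not the model
```

So a “wrong-looking” sign can be purely an artifact of which level got the negative code.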

And if your model is more complex, you can always use model predictions (from posterior_linpred / posterior_predict) for inference: you create a dataset representing a hypothetical future experiment, predict what would happen according to the model, and then subtract the predictions to get samples of the differences/comparisons of interest. This has the benefit of forcing you to be explicit about what type of comparison you are doing (which may not be clear when looking at coefficients), e.g. how you include the varying effects (not at all vs. predicting for a hypothetical new verb/country vs. using the fitted verbs/countries).
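As a minimal language-agnostic sketch of that workflow (made-up posterior draws for a two-coefficient logistic model, not output from the actual fit): build a design row per hypothetical condition, push every posterior draw through the model, and subtract.

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical posterior draws of (intercept, slope) for a logistic model.
draws = np.column_stack([rng.normal(-0.3, 0.2, 4000),
                         rng.normal(0.8, 0.25, 4000)])

def inv_logit(x):
    return 1.0 / (1.0 + np.exp(-x))

x_cond0 = np.array([1.0, 0.0])   # intercept only: predictor "off"
x_cond1 = np.array([1.0, 1.0])   # predictor "on"

p0 = inv_logit(draws @ x_cond0)  # per-draw P(Yes) in condition 0
p1 = inv_logit(draws @ x_cond1)  # per-draw P(Yes) in condition 1
diff = p1 - p0                   # posterior samples of the comparison of interest

print(float(np.mean(diff)), np.quantile(diff, [0.025, 0.975]))
```

The `diff` samples live on the probability scale, so their summary is directly interpretable regardless of how the factors were coded inside the model.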

Best of luck!