Proper interpretation of Bayes factor (from rstanarm + bridgesampling)

When I was learning Stan and Bayesian methods, the Bayes factor was not covered much, but I’m finding it might be useful for presenting findings to clients. I want to make sure that I am interpreting some output correctly.

Consider the following R code:

library(rstanarm); library(bridgesampling)
inv_logit <- function(x) exp(x) / (1 + exp(x))
# simulate a binary x, an outcome y with a modest positive effect of x,
# and a binary z that is unrelated to y
set.seed(1839)
n <- 500
x <- rbinom(n, 1, .5)
y <- rbinom(n, 1, inv_logit(.3 + .2 * x))
z <- rbinom(n, 1, .3)
dat <- data.frame(x, y, z)

This creates the data. Now what I can do is look at the probability (given the data, priors, and likelihood) that there is a positive effect of x (note that this is usually done with some value of practical significance, like 1 ppt, but I left it at zero here):

# probability of effect of x > 0
mod <- stan_glm(y ~ x, binomial, dat, seed = 1839)
mean(as.matrix(mod)[, "x"] > 0)  # proportion of posterior draws with a positive coefficient on x
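For illustration, the practical-significance version mentioned above just swaps the threshold. The sketch below uses 0.01 on the log-odds scale as an arbitrary example (the “1 ppt” in the text would be on the probability scale and would need a bit of extra conversion):

# proportion of posterior draws where the coefficient on x exceeds
# an illustrative threshold (0.01 on the log-odds scale, chosen arbitrarily)
mean(as.matrix(mod)[, "x"] > 0.01)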

This gives me about a 97% chance that the effect is positive. Now, let’s imagine we want to test whether the effect of x depends on a demographic variable z. I could do some model comparison like:

# probability of no interaction vs. interaction
# (use separate diagnostic files so the second fit does not overwrite the first;
#  bridge_sampler() relies on the diagnostic file from each fit)
mod_h0 <- stan_glm(y ~ x + z, binomial, dat, seed = 1839,
                   diagnostic_file = file.path(tempdir(), "df0.csv"))
mod_h1 <- stan_glm(y ~ x * z, binomial, dat, seed = 1839,
                   diagnostic_file = file.path(tempdir(), "df1.csv"))
bayes_factor(bridge_sampler(mod_h0), bridge_sampler(mod_h1))

This tells me that the Bayes factor for h0 over h1 is 4.32. Since the priors in both models are the same (note that I usually give the coefficients some semi-informative prior, but it is always the same in both models), the posterior odds of no interaction versus an interaction equal the Bayes factor. So, would a fair interpretation be: “It is 4.32 times more likely that the effect of X is the same across both groups of Z,” and “There is about an 81% chance there is no interaction, while there’s a 19% chance the interaction is non-zero” (since 4.32 / (4.32 + 1) = .81)?
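A quick check of that last bit of arithmetic in R, using the Bayes factor value printed above:

bf_01 <- 4.32        # h0 / h1 Bayes factor from the output above
bf_01 / (bf_01 + 1)  # ~0.81: posterior probability of h0, assuming equal prior odds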

Am I interpreting the Bayes factor correctly?

That is basically right, although I wouldn’t say the priors are “the same” simply because they are both independent normal distributions: the number of margins of that normal prior changes, and hence so do the predictive distributions. I personally prefer bridgesampling::post_prob, which will do these calculations for you. Also, be sure to pay attention to the error measures in the calculation of the marginal likelihoods.
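For example, a minimal sketch of that suggestion, reusing mod_h0 and mod_h1 from above (the object names are just for illustration):

b0 <- bridge_sampler(mod_h0)
b1 <- bridge_sampler(mod_h1)
post_prob(b0, b1)              # posterior model probabilities (equal prior odds by default)
error_measures(b0)$percentage  # approximate percentage error of the marginal likelihood estimate
error_measures(b1)$percentage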

And you should emphasize, to yourself at least, that all of these X% calculations are made under the assumption that exactly one of the two models you estimated is correct, which is usually questionable. You can often get something (e.g. stacking weights) under different / weaker assumptions that is as easily interpreted by people with limited interpretation skills.


Thank you for the prompt response! Would you mind elaborating more on this point:

You can often get something (e.g. stacking weights) under different / weaker assumptions that is as easily interpreted by people with limited interpretation skills.

https://projecteuclid.org/euclid.ba/1516093227
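As a rough illustration of what those stacking weights look like for the two models above, here is a sketch using the loo package (mod_h0 and mod_h1 as fit earlier; this snippet is illustrative and not taken from the linked paper):

library(loo)
loo_h0 <- loo(mod_h0)
loo_h1 <- loo(mod_h1)
# stacking weights: how much each model contributes to the best predictive mixture
loo_model_weights(list(h0 = loo_h0, h1 = loo_h1), method = "stacking")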


And here is a counterpoint (see the links therein): https://www.bayesianspectacles.org/rejoinder-more-limitations-of-bayesian-leave-one-out-cross-validation/. I’m not trying to confuse anyone here, just presenting an alternative viewpoint.


Thank you for this! Good read. However, it seems like the goal of that paper is to combine a number of candidate models to generate a predictive distribution for future cases. That is, it is aimed at prediction, whereas I’m interested in inference (particularly causal inference, given that one of the factors is an experimental condition). So it might give me a distribution for a future case, but I’m interested in how likely two hypothetical models are, relatively speaking. Or am I reading it incorrectly / assuming a false dichotomy between prediction and inference?

I like the post_prob function for that reason. I think I’m OK assuming, hypothetically, that one of the candidate models I’m comparing is the ground truth. I know this assumption isn’t true, but I think it is useful nonetheless to say, “Assume that one of these models is correct. Given the data and our priors, it is X% likely that M1 is the correct model, and (100 - X)% likely that it is M2.”

Also, I understand now what you mean by:

Although I wouldn’t say the priors are “the same” simply because they are both independent normal distributions: the number of margins of that normal prior changes, and hence so do the predictive distributions.

There is a difference between (a) the priors I put on the coefficients within each model, which are part of M1 and M0 in the BF and posterior-probability calculations, and (b) the prior probabilities put on the models themselves, which multiply the BF (as prior odds) to give the posterior odds. Even if I put different priors on the coefficients in M1 and M0, the BF will still equal the posterior odds (and hence determine the posterior probabilities) as long as both models are given equal prior probability. That is, there are priors on the coefficients and separate priors on how likely each model is to be the correct one?
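To check that understanding in code, a small sketch using post_prob’s prior_prob argument (b0 and b1 as in the earlier sketch; the prior probabilities here are just illustrative):

post_prob(b0, b1, prior_prob = c(0.5, 0.5))  # equal prior model probabilities: posterior matches BF / (BF + 1)
post_prob(b0, b1, prior_prob = c(0.8, 0.2))  # unequal prior model probabilities shift the posterior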