Tiny Bayes Factor for a model with a non substantial interaction vs one without interaction

I want to know the Bayes Factors in favour of the lack of interaction in a model. I ran two models, one with interaction and one without it. Then compared them with bayes_factor(inter, nointer) and got this:

Estimated Bayes factor in favor of inter over nointer: 0.50862

How do I interpret this? My supervisor said that to publish a negative result (lack of interaction) we needed first see the Bayes factor in favour of this negative result. I think 0.50 is not relevant.

But what I don’t understand is how a model that explores an interaction only to find that there is no substantial evidence of it, is any better from one without an interaction. I thought that the point of BF was to see how much better is one model compared to another with an effect, so you can say that the size of that effect is BF = 10 or whatever.

Another question: accuracy is whether they get an answer correct or not. They can get it correct by just guessing, 50% of the times. Do I need to specify this somehow in the model or is that irrelevant?

inter <- brm(
  accuracy ~ 1 + time + condition + ft
  + time:condition
  + ( 1 + delay | id ),
  data = df,
  family = bernoulli(),
  control = list(adapt_delta = 0.95),
  save_all_pars = TRUE,
  file = "inter"

nointer <- brm(
  accuracy ~ 1 + time + condition + ft
  + ( 1 + time | id ),
  data = df,
  family = bernoulli(),
  control = list(adapt_delta = 0.95),
  save_all_pars = TRUE,
  file = "nointer"
  • Operating System: Windows 10
  • brms Version: 2.12.0
1 Like

I don’t know if this is the right place to ask about Bayes factors, since afaik most of the senior developers on here tend to work with approximate out-of-sample predictive accuracy using Pareto-smoothed importance sampling leave-one-out cross-validation (PSIS-LOO cv) for model comparison, rather than BFs.

The nice thing about PSIS-LOO is that it can also help you diagnose problems with model fit and influential outliers. Additionally, instead of making a decision based on one number (i.e. BF), the differences in predictive accuracy estimated via PSIS-LOO have standard errors, so those can be helpful for model comparison as well (if there’s a large difference between the predictive accuracy of two models, but the standard error of that difference is large as well, it might be a good idea to go with the more parsimonious models). If you wanted to try out PSIS-LOO, here’s a vignette by Aki Vehtari: https://cran.r-project.org/web/packages/loo/vignettes/loo2-example.html.

(there’s also other options like model averaging/stacking but I don’t know much about those)


Hi, and thank you for your response. I do use PSIS-LOO cv for model comparison. The message I received asking me to explore BF said: “It would be worth adding Bayesian stats to have an idea of how solid is the negative result”. When I asked them what did they want (as I have done the same analysis but with brms instead of lme4 already) they asked me to get Bayes Factors to see if there was evidence of the negative result. Can I evaluate the strength of a negative result with PSIS-LOO cv?

The result for the comparison is this:

               elpd_diff   se_diff
 nointer         0.0          0.0   
 inter          -1.8          0.7   

Does it mean that the model without interaction is a better fit? Does it say anything about “how solid the negative result” is?

They can get it correct by just guessing, 50% of the times. Do I need to specify this somehow in the model or is that irrelevant?

Depending on how your response format is setup, you could use a metric called D-prime, which factors in chance rates of being correct http://phonetics.linguistics.ucla.edu/facilities/statistics/dprime.htm

Your BF for the interaction model vs. no interaction is .5, meaning that the interaction model is favored half as much as the no interaction model. Or, flip it around, the no interaction model is favored twice as much as the interaction model. That is not a hugely conclusive BF against the interaction, but clearly there is not any evidence for the interaction model, and so you might say for parsimony the no interaction model is favored, though it is not conclusive.

The ELPD values also suggest the no interaction model is favored (in ELPD diff, the difference is taken relative to the best model, which gets ‘0’ and everything else is relative to that). I think the difference shown here is quite negligible, so it’s not like the interaction is making your predictions much worse, but clearly the interaction is not improving model performance and so again for parsimony you might conclude that the interaction term is not really relevant.


One small further thing to note. It seems like there is something different in your models besides the presence of an interaction? One has

( 1 + delay | id )

the other has

( 1 + time | id )

If you wish to make an interence about how one aspect such as the interaction changes your inferences, then you would ideally keep all other parts of the models constant, to enable a comparison where only that factor is changed.


I think both @abartonicek’s @JimBob’s suggestions are reasonable first steps, but I would like to add a bit of broader context.

I would be very cautious about claiming “lack of interaction” from just BF or LOO results. Remember that LOO is really just a guess at the model’s ability to predict new observations (without claims about causality, truth etc.) and BF is … well a complicated number related to how likely are the whole spaces of possible configurations of the individual models to produce the data you see (caveat, I don’t understand BFs much, they’ve always been confusing to me).

What you actually see is IMHO likely to be more accurately described as “the data we collected are unable to determine how big the interaction term is”, or similarly “the data we collected is explained well by a model without interaction”.

There are two ways to test for yourself if that is the case:

  1. Run the same analysis with some (say 1/4) of the data omitted at random. My guess is that you’ll get stronger preference in LOO for the model without interaction with this smaller dataset. I have much less intuition about BFs (which are fairly unintuitive), but I wouldn’t be surprised if the smaller dataset also gave you stronger support for the simpler model in BF.

  2. Look at the posterior distribution of the interaction coefficient in the model. My guess is that it would be very wide, implying the data is consistent with a wide range of interaction strengths.

To really claim that “there is no interaction” you would IMHO need to fit a model with interaction term and have the posterior for the interaction coefficient concentrated around 0, which would need loads of data.

All of the above is obviously assuming that the linear model you propose is a good description of the data you have, which is almost 100% not true, so you would also need to check if the model is “importantly wrong”, e.g. via posterior predictive checks.

When you dive into those things you often find that making a claim confidently is quite hard and many published works don’t have a high standard of proof/analysis/… Unfortunately it is hard to publish a paper along the lines of “We collected some data and it turns out we didn’t learn much”. To what extent are you OK with “playing the game” and be happy with “doing what others are doing” or if you want to invest in doing stuff better (usually at cost to your career progress and your supervisor’s satisfaction) is a tough call and I don’t want to second guess anybody’s choices. I just want to highlight that the tension is there.

Hope this makes sense.

And also:

I don’t think you need to specify this in the model if you are just after model comparison, but I would guess “correct” would be a better name for the variable as accuracy usually implies an average over multiple trials. I think you should keep that in mind if you tried to make real-world decisions based on model predictions (e.g. if some participants are consistently worse than chance in guessing than their accuracy could be improved by just switching the labels on the answers, which is likely an artifact of the experimental setup).

1 Like

Thank you for such a thoughtful response. This was a small side experiment and they wanted to investigate if it made sense to do more experiments in the same topic and try to publish them, or if we just drop it. Because there was no evidence for the interaction, they wanted to know if we had strong evidence against. If it’s not conclusive, they might decide to collect more data or to just ignore this experiment.

Now, for future experiments I want to ask something about this:

I took the posterior samples (I think that’s what you mean with looking at the posterior distribution?) and found the point estimate and the CIs like

 samples <- posterior_samples(inter)
 b1 <- round(mean(samples$`b_timehour:condition2`),2)
SD1 <- round(sd(samples$`b_timehour:condition2`),2)
CI1 <- round(quantile(samples$`b_timehour:condition2`, probs = c(0.025, 0.975)),2)

and the results are b1 = 0.02, SD1 = 0.2, and CI = 2.5% (-0.37) and 97.5% (0.42).

If I understood you correctly, I can claim that there is no interaction because the posterior for the interaction coefficient is actually concentrated around 0 (in case the model is not terribly wrong).

Thanks again for your response. For beginners like me it’s very hard to understand all of this just by reading papers, since we are still familiarizing ourselves with the material.

1 Like

Martin raised some great points. The way I understood it, the main take-away is that not concluding you need an interaction (to predict out-of-sample data) is not the same as concluding you don’t need an interaction. Absence of evidence is not evidence of absence.

Whether the coefficient is concentrated around zero really depends on what units your dependent variable is in. The raw numbers don’t tell us anything. E.g. if you had two brothers and wanted to decide whether they are about the same height, and you estimated that the difference between brother A and B is 0.02 meters (-0.37 - 0.42 95% CI), that wouldn’t be a precise enough estimate to base the decision on because it’d be pretty plausible that A’s either 37 cm shorter or 42 cm taller than B. If, on the other hand, the estimated difference between the brothers’ heights was 0.02 centimetres (-0.37 - 0.42 95% CI), then it might be reasonable to conclude that the brothers are equally tall, for all practical purposes.


This is important, but what I wanted to say is that there is difference between “concluding you don’t need an interaction to explain/predict the data I have now” (which you can do reasonably well with LOO, provided you also check your models carefully) and “concluding the process I study doesn’t involve an interaction” or, similarly “concluding I would not need an interaction to explain/predict any dataset possibly ever created by this process”

As @abartonicek noted, scale is important here. You would likely need to think about SESOI (smallest effect size of interest) and if the bulk of the posterior is smaller in absolute value than the SESOI, you can claim that the interaction is non-existent or not of interest. Note that for logistic regression (bernoulli family), even relatively small numbers can correspond to large effects. Assuming the covarietes are around +/- 1 (which would be true if your covariates are factors), changing the log-odds (the unit the model works with) by 0.42 * 1 means multiplying the odds by exp(0.42) = 1.5 and 1.5 times the odds would in most settings be IMHO considered substantial.

That looks correct, but note that with brms, just print(inter) or summary(inter) should give you all of those quantities easily.


Edit: I think Martin answered this question whilst I was typing it!

I haven’t thought about it! It’s a pretty awesome example but I am not sure how to apply it to my dependent variable. I have used Bernoulli family because my dependent variable are 1 (correct) and 0s (incorrect). I thought that these type of models predicted the probability of getting an item correct. So this 0.02 means that the slope in condition 1 between time 1 (reference level) and time 2, is only different from the slope between time 1 and time 2 in condition 2 is not substantially different, but the difference in any case was of 2%. So the difference between slopes is 2% of probabilities of getting an item correct?

Ok, noted! But I thought that if the 95% CI contained a 0 that means no substantial effect!

If I change both CI to log-odds I get 0.69 and 1.15, which contains 1 and again, I thought that with odds having a 1 in the range meant no interactions. So how can that 1.15 times the odds can be substantial if the CI range does contain 1?

Maybe it has to do with the “Assuming the covariates are around +/- 1”. My independent variables (covariates?) are condition with 2 levels, and time, with three levels. When you say they could be around +/- 1, do you mean how they are coded?

I think that is the core of the misunderstanding: it means that the data is consistent with no substantial effect. So if the 95% CI is (-1000, 1000) the data is consistent with a huge positive effect, huge negative effect and also a negligible effect. We don’t have evidence for an effect but we also don’t have evidence against an effect. But if the CI is (-0.0001,0.0002) then the data is consistent only with a very small effect, so we actually have evidence against effect larger than 0.0002 and can confidently say the effect is non-existent/negligible.

In the case of your model, we can say, that (provided the model is a good approximation of reality), the log-odds for the interaction term is unlikely to be much less than -0.37 or much larger than 0.42, but all values within the 95% posterior credible interval are reasonably plausible and we cannot rule them out.

Yes - if your covarietes are factors, than they would be translated using dummy coding to 0/1 values and interpreting the coefficients directly is usually OK. If, as @abartonicek noted, the covarietes would be continuous variables (time, height, …) you need to worry about the units.

Does that make sense?


It makes a lot of sense! I don’t really have evidence against the effect, and I finally understand what the CI actually mean.

If I understand correctly, I won’t need Bayes Factors (which makes me very happy) to discuss this with my supervisors.

Knowing all of this will be extremely helpful for the rest of my experiments, thank you very much. And thanks for mentioning the SESOI, it will be very helpful to think about it as well.


Good on you! Posting on forums about a topic one is inexperienced in can often be daunting, so it’s great to hear that you feel like you learned something,

One last thing, as @martinmodrak mentioned in regards to the size of the coefficients, log odds can represent quite a large changes in probability of the outcome. For example, if the probability of your outcome was 50% without the interaction (which corresponds to 0 log odds), and your interaction was binary, then a -0.37 change in log odds would correspond to the probability decreasing to ~40% (= exp(-0.37) / [exp(-0.37) + 1]) of the outcome, and 0.42 change would correspond to increase to ~60% (= exp(0.42) / [exp(0.42) + 1]). So even though the most likely estimate is that your interaction doesn’t really affect the probability of the outcome, it’s also plausible that it changes it by plus or minus 10% (assuming all the other effects add up to 50/50 probability) , which is not a meaningless difference in most fields (i.e. not a SESOI).


I was actually too nervous to even post the question. Sometimes some responses can be so difficult to understand that it makes me feel like I’ll never understand statistics. So your responses are quite encouraging.

I think I understand what you mean, I’m just a bit confused by this:

How do I know if that is the case? By transforming all the estimates to probabilities and then adding them up? Just to make this clearer, here are the Population-Level Effects from the model inter:

 Population-Level Effects: 
                        Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
 Intercept                  1.19      0.12     0.97     1.42 1.00     3198     2863
 timehour                  -0.34      0.13    -0.59    -0.08 1.00     3450     3086
 timeday                   -0.42      0.13    -0.66    -0.17 1.00     3889     3104
 condition2                 0.67      0.16     0.35     1.00 1.00     3168     3108
 fttarget                  -0.55      0.08    -0.70    -0.40 1.00     8078     2618
 timehour:condition2        0.01      0.21    -0.38     0.42 1.00     3466     3325
 timeday:condition2        -0.11      0.20    -0.51     0.27 1.00     3605     3138

So, to know if all the other effects add up to 50/50 (1?) I need to do transform all the numbers under Estimate and transform them to probabilities (exp(Estimate)/(exp(Estimate)+1)), and add up the probabilities to see if all added up are 1?

Just in case it matters, fttarget means if the item was a target or a foil. In this case then the ref level must be foil, which apparently was easier to be identified correctly than the targets.


I think what @abartonicek is saying regarding the 50/50 thing is just hypothetical - if it were around 50/50 then the effect could be quite large, but at other values the change caused by the interaction might lead to only a small change in your outcome in terms of probabilities. You could check the effect by making calculations from the regression equation with your parameter estimates, and converting the outcomes to probabilities similarly to what you are suggesting.

I think brms actually has a quick way of checking this sort of thing, if you ask for marginal effects plots http://paul-buerkner.github.io/brms/reference/marginal_effects.html these can include the interaction, and so you should see a plot that shows the impact of your interaction term.


Like this? I think the points are the posterior probability and the bars are standard deviations from the mean?

1 Like

Yes, at least I think this can give you a basic/starting sense of the impact of the interaction. The interaction is indicated by how the size of the difference between conditions 1 and 2 changes depending on the delay. They look quite similar to me but I don’t know in your field what would be considered a sizable effect. This is also of course just eyeballing it, and there are likely more sensitive ways of judging this that the others might suggest - but this could give you a basic indication.

Given what I know of brms, I expect that the point estimate in the middle of each error bar is the mean estimate of the probability from the posterior distribution. The error bars could be the 95% ‘ETI’ (equal tailed interval) - the middle 95% of values from the posterior distribution of the estimated value. I find that more likely than the standard deviation, but I am not certain.