Bayesian credible intervals: Significant compared to what?

I have a question regarding the interpretation of credible intervals retrieved from, e.g., a brms model.

From this article (link below), I understand that significance is defined as both the upper and lower bounds of the 95% credible interval being either above or below zero, since a parameter value of zero is defined as no effect. (Understanding and interpreting confidence and credible intervals around effect estimates - PubMed)

However, when the credible intervals of a variable are then said to be significant, what are they then significant compared to?

Let me give an example:
Given a model with e.g. three variables, each of either two or three levels:

Time ~ Shoes + Experience + Age

where Time is the time (in minutes) it takes participants to run a certain track, “Shoes” is a categorical factor of three levels indicating the brand of shoes participants are wearing (Nike, Adidas, or Puma), “Experience” is a categorical factor of three levels (Beginner, Intermediate, or Pro), and “Age” is a categorical factor of two levels (Young or Old).

The coefficient table with credible intervals is extracted, where the intercept contains the alpha-numerically lowest level of each variable:

                 L-CI    U-CI
Intercept       -3.13   -1.51
Nike            -1.20   -1.00
Puma            -1.32    4.55
Intermediate    -1.44   -1.24
Pro             -3.22   -2.24
Young           -1.39    2.67

Thus, the intercept, “Nike”, “Intermediate”, and “Pro” all have significant, negative effects on the time, indicating that they significantly reduce the time it takes participants to run the track. BUT compared to what?

I hope my question makes sense, otherwise I’ll gladly try and elaborate, and I hope you guys can help me understand how Bayesian credible intervals should be interpreted.

Any literature on the topic would also be very much appreciated!

Best regards,

The answer to this question doesn’t actually depend on whether the analysis is Bayesian, nor even on whether we are reporting C*I’s (confidence or credible). Really it boils down to: how do we interpret a coefficient estimate in a model with a categorical predictor?

Here’s one treatment of the topic (this isn’t a uniquely great treatment–I just found it by googling–but I’ve glanced at it and it seems correct and clear enough).


Thank you, but I’m still unsure as to whether the variables are compared to the pooled intercept or to each of the alpha-numerically lowest values of each variable? Put another way using my example:

Is the significance of “Nike” compared only to that of Adidas (the alpha-numerically lowest value of the variable “Shoes”) or to the whole intercept (“Adidas”, “Beginner”, “Old”, pooled)?

The latter. An important clue is the fact that there is no coefficient estimated for Adidas.

Again, I’d encourage you to think more broadly than what the “significance” is in reference to. The fundamental question is not about the significance of the coefficient, but rather about the meaning of the coefficient, irrespective of what its posterior distribution is and whether or not that distribution includes substantial probability mass on both sides of zero.

In your model, the parameter for Nike means the expected difference between Nike and Adidas with all other covariates held constant. And the 95% credible interval for this parameter that brms has constructed for you doesn’t overlap zero.


@jsocolar has a good answer here. Regression and Other Stories (by Gelman et al) also has some good chapters on interpreting regression models, though I forget which ones exactly.

Basically (and I’ll probably gloss over some technical details), all the 95% credible interval says is that, conditional on the model and data, we believe there is a 95% posterior probability that the effect of predictor X lies between these two values. Or, conditional on the model and data, there is a certain posterior probability that it is greater than some value (say 0). As with p-values, there is a distinction between the technically correct interpretation (for a p-value, the probability of observing a test statistic as or more extreme than the one observed, given our model assumptions) and the interpretation we want to make (that this variable is “important”).
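To make that concrete (a minimal sketch in Python; the thread uses R/brms, but the arithmetic is the same, and the draws below are simulated rather than from a real model): a central 95% credible interval is just the 2.5% and 97.5% quantiles of the posterior draws for a coefficient, and posterior probabilities of directional statements are just proportions of draws.

```python
import numpy as np

# Hypothetical posterior draws for the "Nike" coefficient
# (in a real analysis these come from the fitted model).
rng = np.random.default_rng(1)
draws = rng.normal(loc=-1.1, scale=0.05, size=4000)

# Central 95% credible interval: the 2.5% and 97.5% posterior quantiles.
lower, upper = np.percentile(draws, [2.5, 97.5])

# Posterior probability that the effect is below zero.
p_negative = (draws < 0).mean()

print(lower, upper, p_negative)
```

Because the whole interval sits below zero here, the statement “the effect is negative” carries essentially all the posterior probability, which is what the “doesn’t overlap zero” shorthand is pointing at.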

The general point is that “importance” as we’d like it to mean depends on what we want to use our model to do. For example, do we want to recommend an athlete switch to a different shoe type? There will probably be some quantifiable benefits from faster times (e.g. prize money, value of achieving personal goals, etc.) and drawbacks (e.g. upfront cost of shoes, impacts to training, etc.). So here, both the effect size and the uncertainty will play a role. Or, maybe we just want to make a simple statement about which shoe is “better”. Again, there will be some utility/disutility attached to making certain errors, say mistaking the sign of the coefficient or its magnitude.


@js592 @jsocolar Thank you both!

I just have a hard time understanding how these measures (credible and confidence intervals) make any sense compared to p-values, when potentially half of the information is “lost” or “uninterpretable”, as it is pooled in the intercept. Whereas an ANOVA table contains all the information on all levels of a variable.

Is it possible to extract the “missing” credible intervals, does it even make sense to do, or is it complete statistical witchcraft?

Sorry for all my questions, I just want to make sure that I completely understand this topic and how to use it/communicate it.

Perhaps it would be helpful for your understanding to get two posterior predictions, one with, say, {Nike, Beginner, Old}, the other with {Adidas, Beginner, Old}, and subtract the two posteriors.
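As a sketch of that subtraction (Python with made-up posterior draws; in brms you would extract real draws from the fitted model): because the two combinations differ only in the shoe indicator, the draw-wise difference of the two linear predictors recovers exactly the posterior of the Nike coefficient.

```python
import numpy as np

# Hypothetical posterior draws for the intercept (Adidas, Beginner, Old)
# and for the Nike coefficient; in practice these come from the model.
rng = np.random.default_rng(2)
intercept = rng.normal(-2.3, 0.4, size=4000)
beta_nike = rng.normal(-1.1, 0.05, size=4000)

# Linear predictor for each combination (all other covariates at baseline).
pred_nike = intercept + beta_nike   # {Nike, Beginner, Old}
pred_adidas = intercept             # {Adidas, Beginner, Old}

# The draw-wise difference is exactly the Nike coefficient's posterior.
contrast = pred_nike - pred_adidas
print(np.allclose(contrast, beta_nike))  # → True
```

This is why, for a model with an identity link, the coefficient's credible interval and the "difference of two predictions" interval coincide.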

If you are including factor levels in the model then by mathematical necessity you have to set one as a baseline (unless you enforce some hierarchical constraint); including all of them will result in collinearity with the intercept term. But this isn’t much of a problem for interpretation – if one shoe type is set as the baseline (usually the first in alphabetical order, by R’s convention), then the coefficient on another is interpreted as the change in average time relative to the baseline shoe type, holding all else constant. The 95% credible interval covers the range of that change we believe is compatible with the data, given our modeling assumptions.

A few footnotes:

  1. “Conditional on the model” is important here: you’d want to verify that your model represents the real-world data well before making inferences based on estimated coefficients
  2. Concluding that the change in shoe “causes” the change in average time, rather than only being associated with it cet. par., requires much stronger assumptions

Ok, but is it possible to ensure that the intercept has only one baseline value? Thus, for the made-up example, is it possible to define e.g. only Adidas as the baseline, and thereby “exclude” “Beginner” and “Old” from the intercept?

Otherwise, your statement kind of contradicts @jsocolar’s answer to my question as to whether “the significance of “Nike” is compared only to that of Adidas (the alpha-numerically lowest value of the variable “Shoes”) or to the whole intercept (“Adidas”, “Beginner”, “Old”, pooled)”.

The intercept represents the expected value when all factors are at their baseline/reference values and covariates are equal to zero. Unless you parameterize the model differently.

You could, for example, suppress the intercept and estimate outcomes for each level of the first predictor directly.

This is generally how regression works across both Bayesian and frequentist implementations.

ANOVA is really just a highly constrained version of simple linear regression with the model estimates presented in a slightly different way.


The parameter Nike is about the difference between Nike and Adidas irrespective of the values of the remaining covariates, as long as those covariate values are fixed. It’s the difference between Nike and Adidas when the observation is otherwise beginner and old. It’s also the difference between Nike and Adidas when the observation is otherwise intermediate and young, or any other combination of covariates, as long as these other covariates remain fixed (i.e. the same) for both the Nike and Adidas cases.

Importantly, if your model is a glm with a non-identity link, then the “constant” difference associated with the predictor Nike is constant on the link scale, and thus the implied difference on the outcome scale may indeed depend on the values of the remaining covariates.
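A tiny numeric illustration of that last point (Python, with made-up coefficients and a log link): the Nike effect is a constant additive shift on the link scale, but after applying the inverse link (here, exp) the implied outcome-scale difference depends on where the other covariates put the linear predictor.

```python
import numpy as np

# Constant link-scale (log-scale) effect of Nike vs. Adidas.
beta_nike = -0.2

# Linear predictors for two different covariate combinations
# (values are invented purely for illustration).
eta_old = 3.0     # e.g. Adidas + Old
eta_young = 2.5   # e.g. Adidas + Young

# Outcome-scale differences via the inverse link (exp):
diff_old = np.exp(eta_old + beta_nike) - np.exp(eta_old)
diff_young = np.exp(eta_young + beta_nike) - np.exp(eta_young)

print(diff_old, diff_young)  # same link-scale effect, different sizes
```

Both differences have the same sign, but their magnitudes differ by a factor of exp(eta_old − eta_young), so "the" Nike effect on the outcome scale is not a single number under a non-identity link.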


Thank you, that makes sense!

But I’m still not sure how to address the case where the effects of both levels of a variable are important in the analysis. For the example given, let’s say that “Age” is a very important variable, and the effect of both being “Young” and “Old” on the running time is interesting. When the analysis is done, one level is used as the baseline, and the credible intervals (hence, the effect) of that level cannot be assessed directly.

Then what is done? How can this problem be solved? Is it possible to extract the credible intervals of the baseline?

The intercept is the estimate for the reference level, so just use the intercept’s CIs.


It seems like you’re expecting there to be two effects here, one for young and one for old. But there’s only one effect: the difference between young and old. The effect of being old as opposed to young is just the negative of the effect of being young as opposed to old.

If you want to extract CIs or predictions for an old runner (with some combination of other covariates), you can just use the intercept (plus any other relevant terms pertaining to other covariates) as @andymilne says. This is true in general. The intercept gives you the (link-scale) expectation for the reference level of all covariates (or zero value of continuous covariates), and if you want expectations for other levels/values you just add the necessary terms.

Perhaps it would be clarifying to think about how this model is actually coded under the hood. The factor Shoes is expanded to a two-column matrix of ones and zeros (with a row for each observation), one column for Nike and one for Puma. Adidas is neither Nike nor Puma, and so has a zero in both columns. From here on, these columns are treated exactly as if they were continuous (like if you had an additional column for the runner’s height in cm). When the Nike covariate is zero (i.e. the intercept), the shoe is not Nike. When the Puma covariate is zero (i.e. the intercept) the shoe is not Puma. Since there are no shoes that are both Nike and Puma, there will be no row with ones for both Nike and Puma. When both Nike and Puma covariates are zero, the shoe therefore is Adidas. So a factor with N levels expands to N-1 binary numeric covariates which are treated exactly like any other covariates in the model. If you have intuition about how to interpret ordinary continuous covariates in a regression model, this intuition is directly applicable to the interpretation of the factors.
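That expansion is easy to look at directly. A sketch in Python (the thread's models are in R, where this happens automatically inside the formula machinery, but the resulting matrix is the same): dropping the first level turns a three-level factor into two indicator columns, with the baseline encoded as all zeros.

```python
import pandas as pd

# Hypothetical observations of the three-level Shoes factor.
df = pd.DataFrame({"Shoes": ["Adidas", "Nike", "Puma", "Adidas", "Nike"]})

# drop_first=True drops the alphabetically first level (Adidas),
# which becomes the baseline absorbed into the intercept.
X = pd.get_dummies(df["Shoes"], drop_first=True, dtype=int)

print(X)
# An Adidas row is [0, 0]; a Nike row is [1, 0]; a Puma row is [0, 1].
```

So "the coefficient for Adidas" is not missing information; it is the all-zeros column pattern, whose expectation is carried by the intercept.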


It is maybe worth noting that you don’t have to parameterise the ANOVA model this way in a regression. You can also use what’s known as a cell means parameterisation. This fits a model with no intercept and models each of the cell means.

e.g., Time ~ 0 + Shoes

Gives you a model with means for Nike, Adidas, and Puma rather than the Adidas mean (intercept) and the Nike - Adidas and Puma - Adidas differences.
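As a language-agnostic sketch of why this works (Python least squares with made-up times; the same logic underlies the R formula above): with one indicator column per level and no intercept, the fitted coefficients are simply the group means.

```python
import numpy as np

# Made-up running times (minutes) for three shoe groups.
adidas = np.array([10.0, 11.0, 12.0])
nike = np.array([8.0, 9.0, 10.0])
puma = np.array([12.0, 13.0, 14.0])
y = np.concatenate([adidas, nike, puma])

# Cell-means design: one indicator column per level, no intercept.
X = np.zeros((9, 3))
X[0:3, 0] = 1  # Adidas
X[3:6, 1] = 1  # Nike
X[6:9, 2] = 1  # Puma

# Least-squares coefficients are the group means directly.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coef)  # → [11.  9. 13.]
```

Differences between levels (e.g. Nike − Adidas) are then contrasts of these coefficients rather than parameters of the model itself.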

This gets fiddly to expand to interactions but can be done as:

e.g., Time ~ 0 + Shoes:Age
Time ~ 0 + Shoes:Age:Experience

In practice it may be easier to create a single factor with all combinations.

In the classical approach you then construct contrasts for the patterns of interest (differences, interactions, etc.). If using something like brms, it is even easier, as you can just set up the contrast of interest using hypothesis().

One thing that hasn’t been mentioned yet is the use of ‘conditional_effects()’ or ‘fitted()’ to obtain results expressed in the space of the response variable.

In ANOVA, the information about all the levels of a categorical variable is obtained by ‘post-hoc’ tests. ANOVA only returns the p-value for the test hypothesis across levels; the additional p-values or confidence intervals are obtained by some additional method. I don’t think using ANOVA, or thinking about broader modelling results in terms of ANOVA, is useful, because it tends to trap you into a significance-testing approach.

When you set up a general linear model using dummy coding for a categorical variable (which is the brms default), the summary() call returns, as you’ve seen, the coefficients and CIs, which are for the effect of each level relative to the reference level. Those are the parameters of the model, which have been explicitly estimated (sampled).

You aren’t restricted to reporting those. If you’re interested in explicit comparisons between the different levels, you can use hypothesis(). If you’re interested in the expected values of the response for the different levels, you can obtain those (with CIs) using fitted(), or plot them using conditional_effects(). The package ‘emmeans’ seems popular for this type of approach for frequentist models. An advantage of working in brms is that these quantities are already implied by the model; you just need to obtain the information from the samples.
