Help with understanding Bayesian inference using GLMs and brms


I am still relatively new to Bayesian inference and am now using it for the first time in a real project. I have some surprising findings, but I suspect I may have made a mistake in the process. I do not get supervision for this, and due to my lack of experience I would highly appreciate it if you could take a look at what I did. Even if you do not work with brms, perhaps you could check whether my conceptual understanding is correct. If it helps, I could extract the Stan code from the model.

I use the following model (brms):

# specify formula
f1 <- as.formula("abundance |mi() ~ treatment*time + age_s + mi(covariate) + (1 + time + age_s + mi(covariate)|subject_id)")
f2 <- as.formula("covariate |mi() ~ treatment*time + abundance + age_s + (1 + time + age_s + abundance|subject_id)")
formula <- bf(f1) + bf(f2) + set_rescor(FALSE)

# fit model
fit <- brm(
    family = skew_normal(), data = data, formula = formula,
    chains = 4, warmup = 1000,
    control = list(adapt_delta = 0.9999, max_treedepth = 15),
    prior = c(
      set_prior("normal(0, 2)", class = "b", resp = "abundance"),
      set_prior("exponential(25)", class = "sd", resp = "abundance"),
      set_prior("normal(0, 10)", class = "Intercept", resp = "abundance"),
      set_prior("lkj(2)", class = "cor"),
      set_prior("normal(0, 2)", class = "b", resp = "covariate"),
      set_prior("exponential(25)", class = "sd", resp = "covariate"),
      set_prior("normal(0, 10)", class = "Intercept", resp = "covariate")
    )
)
Outcome = Continuous measure of bacterial abundance. For simplicity just assume it is any continuous outcome.

Treatment = “yes” or “no”
Time = “pre” or “post”
Age = continuous age standardized
Covariate = Continuous

Note that I used the exponential(25) prior because some models had divergent transitions or did not converge with wider priors. I use mi() because the covariate has 10 missing values and the model would otherwise drop those cases, which would mean less power for my comparisons of interest.

I fit this type of model for 130 different bacterial species. Then, for each of the 130 models, I hypothesize that the parameter \mu of the skew-normal model differs in the post-treatment group compared to the other three subgroups. To check this, I did the following:

Given the way I set contrasts, the posterior distribution of the intercept reflects the parameter \mu for the pre-no group. The intercept + the treatment posterior is then the pre-yes group, the intercept + the time posterior the post-no group, and finally intercept + treatment + time + treatment:time posterior the post-yes group.

So I added them accordingly and ended up with four posterior distributions that reflect the parameter \mu for each of my subgroups.
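Concretely, the four subgroup \mu's can be reconstructed from the posterior draws. A minimal sketch in R, assuming dummy coding; the column names below are placeholders for illustration (in brms the real names come from `as_draws_df(fit)` and look like `b_abundance_Intercept`, `b_abundance_treatmentyes`, etc.), and the draws here are simulated rather than taken from a fitted model:

```r
# Simulated stand-ins for posterior draws; with a real fit you would use
#   draws <- as_draws_df(fit)
set.seed(1)
n <- 4000
draws <- data.frame(
  Intercept        = rnorm(n, 5.0, 0.3),  # mu for pre-no (reference cell)
  b_treatment      = rnorm(n, 0.1, 0.2),
  b_time           = rnorm(n, 0.2, 0.2),
  b_treatment_time = rnorm(n, 0.4, 0.3)
)

# Cell means, draw by draw, following the contrast logic in the text
mu_pre_no   <- draws$Intercept
mu_pre_yes  <- draws$Intercept + draws$b_treatment
mu_post_no  <- draws$Intercept + draws$b_time
mu_post_yes <- draws$Intercept + draws$b_treatment +
               draws$b_time + draws$b_treatment_time
```

Because the addition is done per draw, each resulting vector is itself a valid set of posterior draws for that subgroup's \mu.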

Then to answer my hypothesis, in R I calculated e.g.

mean((yespost - yespre) < 0)

as the posterior probability that the parameter \mu is smaller in the yes-post group than in the yes-pre group. Using this approach I also compared the group "yespost" to all other groups and thus end up with 130 probabilities for each of the three comparisons.

Of course, before I went on, I checked the diagnostics. I used shinystan for visualization and checked the NUTS diagnostic parameters. Of the 130 models I could include 120; for the remaining ones, increasing the rate of the exponential prior to e.g. 30 might do the trick.

By subtracting the posterior distributions that reflect \mu for my subgroups, I can quantify the mean difference with a credible interval and also make a statement about the probability of the difference being greater than 0.
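For one contrast, the posterior mean difference, 95% credible interval, and directional probability can all be computed directly from the draws. A sketch, with simulated draws standing in for the real posterior vectors:

```r
# Simulated stand-ins for the posterior draws of two subgroup means
set.seed(2)
mu_post_yes <- rnorm(4000, 5.7, 0.4)
mu_pre_yes  <- rnorm(4000, 5.1, 0.4)

delta <- mu_post_yes - mu_pre_yes  # draw-wise difference

summary_delta <- c(
  mean   = mean(delta),                    # posterior mean difference
  quantile(delta, c(0.025, 0.975)),        # 95% credible interval
  p_gt_0 = mean(delta > 0)                 # P(difference > 0 | data)
)
summary_delta
```

Note that `mean(delta > 0)` and `mean(delta < 0)` sum to 1 (up to draws exactly at zero), so either one fully characterizes the directional probability.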

Next I tried to correct for multiple testing. Here I struggle the most; if someone has a good resource where this is explained, that would be great. My idea was to sort the probabilities for each comparison and then keep the largest set of outcomes such that the mean probability of those stays below e.g. 0.05. The latter would mean that, on average, 5% of the values I keep are false discoveries. Is this a correct way to account for multiple comparisons?
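The rule described above (sort, then keep the largest set whose running mean error probability stays below alpha) can be sketched as follows. Here min(p, 1 - p) is treated as each comparison's posterior probability of being wrong about the sign of the effect; the 130 probabilities are simulated stand-ins, not real results:

```r
# Simulated stand-ins for 130 directional probabilities, e.g. P(delta < 0)
set.seed(3)
p_effect <- runif(130)

# Posterior error probability (PEP): prob. of the less-supported sign
pep <- pmin(p_effect, 1 - p_effect)

# Sort ascending and keep the largest prefix whose mean PEP <= 0.05.
# The cumulative mean of sorted values is non-decreasing, so the prefix
# is well defined.
ord          <- order(pep)
running_mean <- cumsum(pep[ord]) / seq_along(ord)
idx          <- which(running_mean <= 0.05)
keep_n       <- if (length(idx) > 0) max(idx) else 0

discoveries <- ord[seq_len(keep_n)]  # indices of the kept comparisons
keep_n
```

This matches procedures that control a Bayesian analogue of the false discovery rate by bounding the expected proportion of sign errors among the declared discoveries, under the assumption that the model's posterior probabilities are trustworthy.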

What surprises me is that I find many very small effects. Even though they might not be very meaningful, I would expect that, if there is no effect at all, zero would be included in the credible interval most of the time. To compare, I looked at the same question using lme4, but a direct comparison is difficult because: 1. I cannot use shrinkage for the age and covariate parameters, and 2. as far as I know, I would only do post-hoc testing if the interaction is significant, which is not nearly as often the case as the "significant" differences I found using the model above.

Any thoughts or feedback is welcome!

Well, AFAIK running multiple chains is not equivalent to independent testing. Is that what you mean? Multiple chains are basically used to confirm convergence and other sampling-based diagnostics. You cannot, for example, run multiple chains and claim that those are independent tests/statistical samples.

No, that is not what I mean. Let me phrase it differently:

I use a GLM to check whether there is a difference in the outcome between two categories over time. So I have a variable treatment (yes/no) and a pre and post measurement (time). By fitting a GLM, I end up with posterior distributions for:

- the Intercept, which reflects time = pre & treatment = no;
- treatment: adding this to the Intercept gives the posterior for time = pre & treatment = yes;
- time: adding only this to the Intercept, but not the treatment posterior, gives time = post & treatment = no;
- the interaction: adding this to the Intercept together with both time and treatment gives the posterior for time = post & treatment = yes.

What the intercept parameter reflects depends on your contrast settings in R. But with my settings, this is how it is.

Now, my hypothesis is that the post-treatment group differs "significantly" from all other groups in the mean outcome. Thus, I expect the difference of the posteriors to be either clearly > 0 or < 0. In my understanding, I can simply subtract the posterior that reflects one group from the posterior of any other group and then calculate the proportion of draws that is < 0 or > 0. This proportion reflects the probability of a directional difference between the groups (at least under the model). So my first question is whether you can confirm this line of reasoning. If it is correct, this brings me to the perhaps more difficult problem:

I need to do this for 130 outcomes, so I fit 130 GLMs, one for each outcome. In the Bayesian setting I think I can make the direct comparisons as described above, and for each comparison (e.g. time = post & treatment = yes versus time = pre & treatment = yes) I end up with 130 probabilities. In my understanding, under the null you would expect 1% of those probabilities to be <= 0.01. So, by chance, there will be more than one "significant" finding.

In a frequentist setting, I would also end up with 130 p-values for each comparison that I make, and here too you would expect 1% of those to be <= 0.01 under the null. Usually the error rate is controlled using e.g. Bonferroni or, in my case, Benjamini-Hochberg correction. Bonferroni is simpler to explain: if you consider a p-value < 0.05 "significant", then you instead check for p < 0.05/130 ≈ 0.00038.
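For reference, both corrections are available in base R via `p.adjust()`; the p-values below are simulated stand-ins for the 130 real ones:

```r
# Simulated stand-ins for 130 p-values (here drawn under the null)
set.seed(4)
pvals <- runif(130)

# Number of discoveries after each correction at the 0.05 level
n_bonf <- sum(p.adjust(pvals, method = "bonferroni") < 0.05)
n_bh   <- sum(p.adjust(pvals, method = "BH") < 0.05)

# Bonferroni adjustment is equivalent to comparing each raw p-value
# against 0.05/130, as described in the text
same <- all((p.adjust(pvals, "bonferroni") < 0.05) == (pvals < 0.05 / 130))
same
```

The difference in target is worth keeping in mind: Bonferroni controls the family-wise error rate (any false positive), while Benjamini-Hochberg controls the expected proportion of false positives among the discoveries.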

Thus, my second question is: How do we deal with this when using Bayesian Inference? What to do or how to interpret my 130 probabilities that I have for each comparison?

Ok, with regard to my second question, Richard McElreath pointed me to this post from Andrew as well as this paper. I will check these out. Maybe I will come back with questions about this afterwards.


Ok, I checked out the paper. It is really helpful as an introduction to this issue. It mainly describes multiple comparisons when there is a single outcome variable. Those strategies I can apply within each single model, because for each model I do indeed make more than one comparison.

However, my main issue is not multiple comparisons for a single outcome variable but multiple comparisons due to having multiple outcomes. The paper also touches upon that, especially for cases where the outcomes are interrelated, such as a battery of six cognitive tests. My outcomes are bacterial abundances, and I am not sure whether I can really apply a multilevel model across outcomes here; at least I currently would not know how to fit such a model. Definitely something to look into later when I have more time to dive a bit deeper. Any further input on this would be welcome.

For now I need to find a way to correct for the multiple comparisons, or at least to interpret my findings correctly. I am considering just explaining how to interpret the results without further adjustments:

Remember, I have 130 outcomes, and for each outcome I make 4 comparisons. So I might say that one would expect 5% * 130 = 6.5 comparisons to be significant by chance for each effect I test.

Since I indeed find significant effects for 4-6% of the outcomes in those comparisons where we hypothesized no effect, but for 30% of the outcomes in those comparisons where we did hypothesize an effect, this is for me evidence that the treatment had an effect, even though the effect sizes are very small for most of them. What is the probability of finding 39 of 130 probabilities < 0.025 or > 0.975 if there was no effect? It should be essentially zero.
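That tail probability can be computed with a binomial model: under "no effect anywhere", and assuming independent, well-calibrated comparisons, each of the 130 probabilities lands below 0.025 or above 0.975 with probability 0.05:

```r
# Number of "tail hits" out of 130 under the null ~ Binomial(130, 0.05),
# with expected value 6.5. P(X >= 39) is the chance of seeing 39 or more
# such hits by luck alone.
p_39_or_more <- pbinom(38, size = 130, prob = 0.05, lower.tail = FALSE)
p_39_or_more
```

The result is astronomically small, which supports the "essentially zero" claim, though the independence assumption is doing real work here: correlated outcomes (as bacterial abundances typically are) would make extreme counts more likely than this calculation suggests.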

The question is whether such an interpretation is enough. And of course we are also interested in which outcomes are the falsely significant ones.