Modelling multiple comparisons / testing multiple associations in brms?

In Bayesian inference, I have come across multiple papers (e.g., see Why We (Usually) Don’t Have to Worry About Multiple Comparisons and Ordinal Regression Models in Psychology) stating that when you have asked participants multiple items (e.g., different items from the same scale measuring the same construct), the way to correct for multiple comparisons is simply to model the items within a multilevel framework.

Let me quickly explain with an example. Imagine I am interested in how sex affects depression. I have asked participants 5 questions about depression from a popular validated psychometric measure (all on the same Likert scale, with 1 indicating strong disagreement and 5 indicating strong agreement). Since I am now running 5 separate tests, one per item (i.e., multiple hypothesis testing), according to Gelman et al. and Paul and Matti, you just model it as:

likert_score ~ 1 + sex + (1 + sex | item) + (1 | pid)

Note that (1 + sex | item) already contains the varying item intercepts, and (1 | pid) appears because the same person has answered all 5 items.
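In brms, this might be fit along the following lines (a minimal sketch; the long-format data frame d with columns likert_score, sex, item, and pid, and the cumulative family for the ordinal responses, are my assumptions):

```r
library(brms)

# Long format: one row per participant x item, with columns
# likert_score (1-5), sex, item, and pid.
fit <- brm(
  likert_score ~ 1 + sex + (1 + sex | item) + (1 | pid),
  data   = d,
  family = cumulative("logit"),  # ordinal likelihood for Likert responses
  cores  = 4,
  seed   = 1
)

# The item-specific sex effects are partially pooled toward the
# population-level sex effect; that shrinkage is the multiple-comparisons
# correction Gelman describes.
summary(fit)
```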

According to Gelman, the multilevel framework automatically corrects for multiple comparisons through the partial pooling and shrinkage imposed by the item-level grouping. This makes conceptual sense when all items come from the same questionnaire and measure the same construct.

However, in most applied settings where people test multiple hypotheses, they often test the same model on very distinct items that measure very different constructs. For instance, I might also ask participants a distinct item about how strongly they believe in God, and about agreement with the statement “I see delusions” (to measure religiosity and the experience of delusions, respectively; same scale, 1 = strong disagreement, 5 = strong agreement). How does one now model these 7 items (5 depression-scale items + 1 religiosity item + 1 delusion item) together in the multilevel framework using brms to correct for the multiple comparisons?

Gelman notes here that imposing the same multilevel framework is a bit tricky, as you would now assume that the 2 new items are exchangeable with the other 5. He also outlines a conceptual solution, but I found it quite hard to follow, especially without any brms code.

Does anyone have a good vignette or applied data-analysis example to share for these sorts of situations? What’s the correct way to model this? Tagging for help: @Solomon @Erik_Ringen @matti @lukas @paul.buerkner

The simplest way to do this is to fit one multilevel model for each questionnaire/construct of interest. E.g., if you have 20 items, with 10 from one questionnaire and 10 from the other, fit two models. It’s also possible to fit a 3-level model in which the items are nested within the questionnaires.
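In brms syntax, that nested version might look something like this (a sketch using the variable names from your post, plus an assumed questionnaire column):

```r
# Items nested within questionnaires; participants crossed with both.
# (1 + sex | questionnaire/item) expands to effects varying by
# questionnaire and by item-within-questionnaire.
fit_nested <- brm(
  likert_score ~ 1 + sex + (1 + sex | questionnaire/item) + (1 | pid),
  data   = d,
  family = cumulative("logit"),
  cores  = 4
)
```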

It looks like this is your first post here. Welcome to the Stan forums, @adigitaltanay!

Hi Solomon, thanks so much for the warm welcome and for taking the time to answer my question :)

Re: fitting 2 separate models, one per questionnaire: that makes sense, but then don’t we lose the multiple-hypothesis-testing correction that Gelman proposes? (Your other solution does resolve that, though: nesting the questionnaires within the big original participant question list that contained both questionnaires.)

Also, modelling this seems straightforward for the example you outlined, i.e., 10 items from one questionnaire + 10 from another = 20 items asked of participants. However, what should one do with 10 different items, all measuring distinct (but somewhat related) constructs, that a researcher just made up for a particular study and that aren’t part of any validated questionnaire?

As Gelman notes in the same paper, the exchangeability assumption for a clustering variable gets violated in such complicated cases (like the one I just outlined): “Similar issues arise when researchers attempt to evaluate the impact of a program on many different outcomes. If you look at enough outcomes, eventually one of them will appear to demonstrate a positive and significant impact, just by chance. In theory, multilevel models can be extended to accommodate multiple outcomes as well. However, this often requires a bigger methodological and conceptual jump. The reason that multilevel models were such a natural fit in the examples previously described is because all the estimated groups effects were estimates of the same phenomenon. Thus it was reasonable to assume that they were exchangeable. Basically this means that we could think of these estimates as random draws from the same distribution without any a priori knowledge that one should be bigger than another. If we had such information then this should be included in the model, and in more complicated settings it is not trivial to set up such a model.”

What is the solution, then, for these more complicated cases? As you suggested before, should we just run separate models on the 10 different items, since they measure separate constructs, and then just report the full posteriors? Or is there a way to model the multiple comparisons in such complex cases?

P.S.: Big fan of your blog posts and have learnt a lot from them!!

I think that’s too strong. Fitting two models still imposes multilevel partial pooling within each model. But sure, the pooling might be weaker than with a single model. One consideration is the skill of the modeler: IMO, a two-model solution fit competently is better than a poorly-executed three-level model. And even where the three-level model is fit skillfully, it might be overly confusing to a given target audience. One has to pick one’s battles…

If you had 10 items all measuring distinct but related constructs, I think it might be fine to fit a single multilevel model. I also think there’s room for differing opinions on that one.

Thanks for the kind words on my blogs; I’m glad to hear they’ve been of some use.

Thanks for your prompt response @Solomon. The practical tradeoffs perspective makes a lot of sense.

Do you have a reference that explicitly tackles this with code? It might be useful to cite for reviewers/colleagues in case someone asks for a rationale. Also, what are the differing opinions on this?

Is the solution just to report the full posteriors of the 10 items and look at their direction? I ask because some say the advantage of going Bayesian while testing multiple distinct hypotheses (some of which might concern related constructs, others not) is that we can see the full posteriors and their direction, while others focus on modelling the multiple comparisons. Just asking as a confused bystander and a relatively naive modeller.

The only paper that comes to mind is the Gelman et al. multiple comparisons paper (https://sites.stat.columbia.edu/gelman/research/published/multiple2f.pdf), which you might use as a starting point for a literature search if you need more specific citations. As to differing opinions, others will have to chime in on that.

Regarding your final paragraph, this is where you, the researcher (or team), get to be very specific about what you find meaningful for your scientific purposes. Another way of putting this is: define your estimand. For more on this way of thinking, see Lundberg et al. (https://doi.org/10.1177/00031224211004187) or the brand new preprint from Rohrer and Arel-Bundock (OSF). I should note that this whole define-your-estimand business is a big old rabbit hole, in the best way possible. I started getting into it 3-4 years ago, and it has had a big influence on how I think about science.

Thanks for sharing these and the estimand material; very useful!

P.S.: I am sure there will be a big audience for a future Kurz blog post on the different ways to handle multiple comparisons in a multilevel framework :))

@Solomon @matti @paul.buerkner @richard_mcelreath I have reflected a bit on the estimand issue and have a further question for you all.

Say we return to my example of 10 depression items in a questionnaire. These have been administered to individuals who belong to households, which are in turn nested in communities. Now I have two estimands of interest:

(a) How much variation (via adjusted ICCs) in the Likert-scale score is present at the household and community level for each of the 10 items? Basically, I want to see whether the ICCs at these social levels vary across items.

(b) Does the effect of higher-level variables (e.g., community size) vary across items, after adjusting for individual-level factors (e.g., age, sex, income)?

From my limited understanding, the big brms model that handles the multiple comparisons appropriately would look something like:

likert_score ~ 1 + age + sex + income + community_size + (1 + age + sex + income + community_size | item) + (1 | community/household) + (1 | pid)

(1 | pid) is added because each individual now answers all 10 items. (I have dropped the separate (1 | item) term, since (1 + ... | item) already contains the varying item intercepts.)

However, several issues arise instantly:

  1. This model won’t estimate the community and household variance separately for each item, right? How do I do this, given that I want to calculate the ICCs for communities and households separately for each item? (See my tentative sketch after this list.)
  2. @Solomon How should I calculate the ICCs accurately for each item? My prior intuition is to add up the community variance, the household variance, and pi^2/3 (the constant residual variance of logistic models) in the denominator, and use the pertinent level’s variance for a specific item as the numerator. However, some questions arise: can I just ignore the (1 | pid) variance and the item-level intercept and varying-slope variances in these calculations?
  3. Furthermore, there is the issue of standardisation. As per the Enders & Tofighi (2007) paper on centering (@Solomon, thanks for sending this), grand-mean centering or global standardisation is appropriate when the estimand is the effect of higher-level predictors after adjusting for individual-level differences, OR the between-cluster variation after adjusting for individual differences. However, they caveat that introducing varying slopes can affect the interpretation of the intercept variance. Does introducing the item-level varying slopes complicate the ICC calculations? Also, does one globally standardise before or after converting to a long data frame? This is easy to think through in wide format, but it isn’t so straightforward in long format.
  4. Given the increased complexity of calculating ICCs in this more complex model, my initial reflex is to be scared and just run 10 separate item-level models for the ICCs, using strong null priors to correct for the multiple comparisons (e.g., normal(0, 0.5) on the log-odds coefficients). Is this alternative appropriate?
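For what it’s worth, here is a tentative sketch of what I have in mind for point 1, replacing (1 | community/household) with item-indexed varying effects so that each item gets its own community- and household-level SD (the column names follow my example; please correct me if this is off):

```r
# (0 + item || community) gives one SD per item at the community level;
# `||` drops the between-item correlations to keep the model lighter.
# Household IDs are assumed unique across communities.
fit_big <- brm(
  likert_score ~ 1 + age + sex + income + community_size +
    (1 + age + sex + income + community_size | item) +
    (0 + item || community) +
    (0 + item || household) +
    (1 | pid),
  data   = d,
  family = cumulative("logit"),
  cores  = 4
)

# My point-2 intuition for the item-specific latent-scale ICC, e.g. at the
# community level for item j:
#   ICC_j = sd_community_j^2 / (sd_community_j^2 + sd_household_j^2 + pi^2/3)
# Check variables(fit_big) for the exact sd_* parameter names first.
```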

Any inputs would be much appreciated :)

I’m not in a good position to answer most of your questions. It’s been several years since I computed or even thought in terms of an intraclass correlation coefficient (ICC). Broadly speaking though, I don’t like thinking in terms of “variance” when fitting ordinal models. My vote is to reserve ICCs for continuous models, and embrace the strangeness of ordinal data with terms more natural to ordinal data. You, of course, are not beholden to my preferences.
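If it helps, here is the flavor of what I mean on the probability scale (a hypothetical sketch against your model above; the covariate values and level names are placeholders):

```r
# Implied response-category probabilities for a reference person on one
# item, in two observed communities, ignoring all other varying effects.
nd <- expand.grid(
  age = 30, sex = "female", income = 50000, community_size = 1000,
  item = "item1", community = c("c01", "c02")
)
pp <- posterior_epred(
  fit_big, newdata = nd,
  re_formula = ~ (0 + item || community)  # keep only the community effects
)
str(pp)  # draws x 2 communities x 5 response categories
```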

As to your fourth question, my recommendation here is similar to one from a few exchanges back. Simpler models done well might be preferable to sophisticated models done poorly. My general recommendation is to fit the simplest model possible to your data as a first step, and as the very next step make sure you fully understand the model. You might even call these steps 1a and 1b. Then if needed, add one or at most two additional elements to the model, and just iterate like this until the model serves your scientific purposes. I do this all the time with my own work: Start simple and build slowly.

Point well taken regarding building models slowly!