Hey
Not sure if this is the right forum for my question.

I have a number of subjects that each gets 3 treatments (A, B, C).
There are several measurements per subject.
I’m interested in which pairs of treatments are significantly different within the subjects.
I know that most subjects do not respond to any treatment.
Some do respond and will probably respond differently to the 3 treatments.

I started to model this by putting a random effect on the subject-treatment interaction

y ~ subject + (1 | subject:treatment)
There is one common hyperprior on the sigma of the random effect for all subjects.
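In Stan, a minimal sketch of this structure (my own variable names, assuming a normal likelihood and a non-centered parameterisation) could look like:

```stan
data {
  int<lower=1> N;                            // observations
  int<lower=1> S;                            // subjects
  int<lower=1> T;                            // treatments
  array[N] int<lower=1, upper=S> subject;
  array[N] int<lower=1, upper=T> treatment;
  vector[N] y;
}
parameters {
  vector[S] alpha;                           // subject effects
  matrix[S, T] z;                            // non-centered interaction effects
  real<lower=0> sigma_u;                     // one common scale for all subjects
  real<lower=0> sigma_y;
}
model {
  alpha ~ normal(0, 5);
  to_vector(z) ~ std_normal();
  sigma_u ~ normal(0, 1);                    // the single common hyperprior
  sigma_y ~ normal(0, 1);
  for (n in 1:N)
    y[n] ~ normal(alpha[subject[n]] + sigma_u * z[subject[n], treatment[n]],
                  sigma_y);
}
```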

When applying this model to ground truth data, I have the impression that my mixed-effects model doesn’t capture the variance structure properly.
I have the impression that there is too much shrinkage for the subjects that respond to the treatment (high variance across the 3 treatments) and not enough for the subjects that do not respond (no treatment effect).

I thought of putting a different hyperprior on the sigma of the random effect for each subject.
But then I ignore the information that is present in the other subjects.

What would be the correct way to deal with this problem?
Can you do something like putting a hyperprior per subject that is shrunken towards a common global hyperprior? (I guess something like a hyper-hyperprior.) Or is there some other way to deal with this?
I haven’t had much luck searching for something similar in the literature.
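To make the idea concrete, I mean replacing the single common scale with per-subject scales that are themselves pooled, something like this (a sketch with my own names, not tested):

```stan
parameters {
  vector<lower=0>[S] sigma_u;    // one scale per subject instead of one global scale
  real mu_log_sigma;             // global location of the log scales
  real<lower=0> tau_log_sigma;   // global spread: the 'hyper-hyperprior' level
  // ... remaining parameters as before ...
}
model {
  // per-subject scales shrunk towards a common value
  sigma_u ~ lognormal(mu_log_sigma, tau_log_sigma);
  mu_log_sigma ~ normal(0, 1);
  tau_log_sigma ~ normal(0, 1);
  // ... likelihood as before, with sigma_u[s] multiplying subject s's effects ...
}
```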

I should also mention that in the end during inference I want to report which subjects show a difference in treatment effect between a given pair of treatments at a certain FDR level.
I also haven’t had much luck finding literature on how to calculate this FDR in a Bayesian setting.
Any pointers for this are welcome :)
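To make the FDR part concrete, what I mean is selecting a set $D$ of subject-contrast pairs such that the posterior expected fraction of nulls among them,

$$\widehat{\mathrm{FDR}}(D) = \frac{1}{|D|} \sum_{i \in D} \big(1 - \Pr[\text{effect}_i \mid \text{data}]\big),$$

stays below the chosen level. (This is my own phrasing of the target, not something I have found stated in the literature.)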

Well, the goal is to implement this with the Stan software, and I searched the online Stan manual/examples for similar problems.
Other than that, I hoped the ‘general’ category was meant for broader questions about Bayesian statistics.
Sorry if I’m wrong :)

We do answer a lot of stats questions here, but the general category is intended for things that don’t apply to a single Stan interface. The developers are pretty overwhelmed by the volume here, so we tend to pick off easier-to-answer questions or ones with long-term ramifications for the project. Sorry about that!

If you’re interested in statistical significance, then you probably don’t want to be using Stan.

How are you assessing this if you haven’t coded the example yet?

They probably should get different priors if they’re different treatments. Three would be too few to usefully pool a higher level up.

You might want to track down some of Andrew’s posts on why he doesn’t like FDR-based analyses. He also has a paper with Jennifer and Masanao on how hierarchical modeling already adjusts for multiple comparisons.

I already have an implementation running with lmer from the R package lme4.
Basically I have a mixed model for each subject, with a random effect for treatment to induce a ridge penalisation of the treatment parameters.
I calculate p-values for all possible treatment contrasts within a subject.
Then I do a Benjamini-Hochberg correction to control the FDR over all subjects.
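For reference, the Benjamini-Hochberg step I use: with the $m$ p-values sorted as $p_{(1)} \le \dots \le p_{(m)}$ and target FDR level $q$, reject the hypotheses corresponding to $p_{(1)}, \dots, p_{(k)}$ where

$$k = \max\{\, i : p_{(i)} \le \tfrac{i}{m}\, q \,\}.$$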

I should note that I simplified my example quite a bit. The real problem and the actual model is more complex.
The biggest problem is that in my case it’s hard to get the correct degrees of freedom to calculate p-values for the contrasts.
I was hoping to avoid these problems by using the posteriors of the treatment parameters in a fully Bayesian model for inference.

So in my case I have a hyperprior for treatment estimated for each individual subject. But some subjects have few data points, so there is almost no shrinkage there. I was hoping to borrow some information from the other subjects to induce more shrinkage. On the other hand, there are a few subjects with very strong treatment effects, and I don’t want the shrinkage to be too strong in that case. That’s why I thought that a ‘hyper-hyperprior’ estimated from all subjects, placed on the individually estimated ‘hyperprior’ per subject for the treatment random effect, might be a good idea.

I’ve read Andrew’s paper about multiple comparisons, but I still struggle to see whether it applies in my case.
Out of curiosity, what did you mean by your comment:
‘If you’re interested in statistical significance, then you probably don’t want to be using Stan.’

You can do the computations, but trying to reduce things to FDR and significance/p-values is missing the point of Bayesian inference, which is characterizing uncertainty. We tend to focus on model calibration (is interval coverage correct) and sharpness (how narrow are posterior intervals, with narrower being better given calibration).

As far as I know, there’s not a single line of code in all the gazillions of lines of code and doc in the Stan project that talks about statistical significance. The word doesn’t even show up in the manual :-)

The Bayesian perspective separates inference from decision making. With Stan you can build powerful inferences with bespoke models, but they will not have any guaranteed performance for subsequent decisions, such as false discovery rate for a selection decision. Instead you have to calibrate your model as best as you can and live with whatever performance your model has. For more information see https://arxiv.org/abs/1803.08393.

@betanalpha and @Bob_Carpenter
Thanks for your answers!
I went through the paper you linked; interesting read, although I will probably have to go through it a second time to fully grasp it :)
Are there examples of these techniques implemented in Stan somewhere? (e.g. the Bayesian Limit Setting with Posterior Quantiles)

Unfortunately I don’t have any examples immediately available, but it will help to first reason about the inferences and then, when you’re happy with your model, construct the decisions you want to make in the generated quantities block.
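As a rough sketch of what that could look like (assuming the subject-by-treatment effects live in a matrix u, and a practical-relevance threshold delta is passed in as data; the names are made up):

```stan
generated quantities {
  // per-draw indicators; their posterior means estimate the probability
  // that each within-subject contrast exceeds the threshold
  matrix[S, T * (T - 1) / 2] exceeds;
  for (s in 1:S) {
    int k = 0;
    for (a in 1:(T - 1)) {
      for (b in (a + 1):T) {
        k += 1;
        exceeds[s, k] = fabs(u[s, a] - u[s, b]) > delta;
      }
    }
  }
}
```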

It’s definitely interesting. I like it, but I don’t particularly find the region of practical equivalence (ROPE) stuff very helpful. It’s still very much a binary procedure like hypothesis testing, which can be appropriate, but certainly in far fewer cases than current practice would suggest. I very much agree that it is helpful to consider what you think is of practical equivalence, particularly in decision theory, but that doesn’t have to be framed as a hypothesis test on a model parameter.