Multi-level modelling for discrete (symptom-based) groups in medicine


Hi Stan community,

I have a couple of theoretical questions regarding hierarchical multi-level modelling for multiple groups and repeated measures, in the situation where we do not know whether the groups are genuinely distinct or lie on the same continuum.

Let me give an example:

Say we have a four-group study in medicine/psychiatry (Group 1: healthy controls; Group 2: healthy controls at increased familial risk of addiction; Group 3: currently addicted patients; Group 4: remitted from addiction).

Aim: The goal of such a study would be to fit behavioural data from all participants on a task and test whether latent variables of interest (say, from our best-fitting model) differ between groups.

Question 1:

What do you think is the most appropriate way of modelling these 4 groups hierarchically?

Option 1: Should I fit my multi-level model with priors over parameters that are shared across all 4 groups (i.e. partial pooling across all groups)? As in:

model_option1 {
	// hyperpriors shared across all 4 groups (this is the partial pooling)
	mu_group ~ normal(0, 1);
	sd_group ~ cauchy(0, 5);	// half-Cauchy via a <lower=0> constraint on sd_group

	// every subject, whatever the group, is drawn from one common population
	param_subject ~ normal(mu_group, sd_group);

	y[group, subject, trial] ~ normal(param_subject[subject] * x[group, subject], …);
}
The problem with such a model is that it forces the groups to be more similar to each other than they might really be (i.e. group differences get shrunk towards zero, so the estimates may be too conservative).

Option 2: Should I instead fit each group separately and compare the ‘group-level’ priors rather than the ‘subject-level’ parameters?

The problem with option 2 is that it is more likely to make the groups look dissimilar and potentially inflate differences between them. Theoretically this could be justified if we assume that the groups really do come from distinct populations because they carry different diagnoses (but not everyone agrees with this; some researchers argue that all subjects, regardless of diagnosis, lie on a continuum from ‘healthy’ to ‘pathological’). That is, with option 2, are the parameter estimates going to be ‘too liberal’ and find significant group differences more often than is warranted?
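To make option 2 concrete, here is a sketch of a single-group model that would be fit four times, once per group's data, so that nothing is pooled across groups (all data-block names such as N_subj, N_obs, and subject are hypothetical):

```stan
// Option 2 sketch: fit this model separately to each group's data
data {
  int<lower=1> N_subj;                           // subjects in this group
  int<lower=1> N_obs;                            // observations in this group
  array[N_obs] int<lower=1, upper=N_subj> subject;
  vector[N_obs] x;
  vector[N_obs] y;
}
parameters {
  real mu_group;                                 // this group's population mean
  real<lower=0> sd_group;                        // this group's population sd
  vector[N_subj] param_subject;
  real<lower=0> sigma;
}
model {
  mu_group ~ normal(0, 1);
  sd_group ~ cauchy(0, 5);                       // half-Cauchy via <lower=0>
  param_subject ~ normal(mu_group, sd_group);
  y ~ normal(param_subject[subject] .* x, sigma);
}
```

The group comparison would then be made across the four separate posteriors for mu_group and sd_group.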

Option 3: Should I add another layer to the hierarchy (as in the radon-levels case study), where the group-level priors are themselves drawn from an underlying shared distribution, as in:

model_option3 {
	// shared hyper-hyperpriors across all groups
	mu_shared_groupwise ~ normal(0, 1);
	sd_shared_groupwise ~ cauchy(0, 5);

	// group-level priors, each drawn from the shared distribution
	mu_group1 ~ normal(mu_shared_groupwise, …);
	sd_group1 ~ cauchy(sd_shared_groupwise, …);

	mu_group2 ~ normal(mu_shared_groupwise, …);
	sd_group2 ~ cauchy(sd_shared_groupwise, …);

	mu_group3 ~ normal(mu_shared_groupwise, …);
	sd_group3 ~ cauchy(sd_shared_groupwise, …);

	mu_group4 ~ normal(mu_shared_groupwise, …);
	sd_group4 ~ cauchy(sd_shared_groupwise, …);

	// subject-level parameters within each group
	param_subject_group[1, :] ~ normal(mu_group1, sd_group1);
	param_subject_group[2, :] ~ normal(mu_group2, sd_group2);
	param_subject_group[3, :] ~ normal(mu_group3, sd_group3);
	param_subject_group[4, :] ~ normal(mu_group4, sd_group4);

	for (i_group in 1:4) {
		for (i_subj in 1:N_subj) {
			y[i_group, i_subj] ~ normal(param_subject_group[i_group, i_subj] * x[i_group, i_subj], …);
		}
	}
}

Is option 3 the least biased (i.e. neither too conservative nor too liberal)? If I use option 3, would you compare the ‘subject-level’ parameters (as in param_subject_group[1, :] vs. param_subject_group[2, :]) or the ‘group-level’ parameters?

Thank you so much for your help
Best regards,


Sorry—I think this one fell through the cracks at the holidays!

What we usually recommend is fitting a bunch of different models and seeing which one works best, usually measured predictively in some way we care about.

You can compare predictions for new elements within groups or new groups as a whole.

Presumably your groups are different and the real question is what the effect is and how large it is.

Your options are modelling decisions about exchangeability. When you do partial pooling, the data will tell you how much to pool. The problem is that with only four groups, you probably won’t get much information with which to estimate the hyperpriors. Jennifer Hill, Andrew Gelman, and Masanao Yajima have a paper on multiple comparisons with hierarchical models.

If you look at the radon example, there are over 100 counties, so you get a lot of information to set the hierarchical parameters.

I don’t think it’s going to be helpful to think in terms of bias here. What you want to think about is calibration (as checked with posterior predictive checks) and sharpness (how tight the predictive intervals are).
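For the posterior predictive checks, a minimal sketch (assuming the option-1 likelihood, with hypothetical names param_subject, subject, x, and sigma) is to draw replicated data in generated quantities:

```stan
generated quantities {
  vector[N_obs] y_rep;   // one replicated dataset per posterior draw
  for (n in 1:N_obs)
    y_rep[n] = normal_rng(param_subject[subject[n]] * x[n], sigma);
}
```

Comparing statistics of y_rep against the same statistics of the observed y (e.g. with bayesplot's ppc_dens_overlay) checks calibration; the width of the y_rep intervals measures sharpness.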

Coding tip: whenever you see repeated numerical suffixes in variable names, you want to replace those with arrays (or vectors).
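For instance, the mu_group1 … mu_group4 and sd_group1 … sd_group4 in your option 3 collapse to containers indexed by group. A sketch (with a hypothetical per-group subject count N_subj):

```stan
parameters {
  real mu_shared_groupwise;
  real<lower=0> sd_shared_groupwise;
  vector[4] mu_group;                    // replaces mu_group1 ... mu_group4
  vector<lower=0>[4] sd_group;           // replaces sd_group1 ... sd_group4
  matrix[4, N_subj] param_subject_group;
}
model {
  mu_shared_groupwise ~ normal(0, 1);
  sd_shared_groupwise ~ cauchy(0, 5);
  mu_group ~ normal(mu_shared_groupwise, sd_shared_groupwise);  // vectorized over groups
  sd_group ~ cauchy(0, sd_shared_groupwise);
  for (g in 1:4)
    param_subject_group[g] ~ normal(mu_group[g], sd_group[g]);
}
```

This is shorter, less error-prone, and generalizes immediately if you add a fifth group.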