Interpretation of hierarchical model results - difference in group level parameter mean

Hi all,
this is maybe a trivial question, but I just can’t figure out where i’m going wrong…

I have data that has a 3-level hierarchy: 2 groups of subjects -> ~ 40 subjects each -> 50 sessions per subject (with 20 trials per session). For simplicity let’s say that for each session I then fit 1 parameter. There is no structure across sessions, i.e. i’m ignoring that there may be systematic effects how all participants might perform better in session 50 than in session 1. My measure of interest is how variable subjects are between sessions and whether this differs between the groups.

Based on help here on the forum before (Parameters during warmup far outside prior range), I have tried modelling the data in two ways: either taking into account the 3-level hierarchy or ignoring it. If I ignore it, I instead do: 2 groups of subjects -> 40*50 sessions. From the output of the two-level model, I then (in Matlab) again compute the parameter for each subject by averaging across the sessions.

Surprisingly, when I simulate data from the 3-level hierarchy, I find that the recovered parameters from the 2-level model correlate better with the true parameters than from the 3-level model (for the group level to some extend, but very strongly for the subject and session level). I wonder whether this might be because there is not that much data on each level (i.e. only 20 trials per session etc.). In any case, these simulation results make me think that I should also analyse my real data with the 2-level model. (right?)

Now the confusion I have: I want to compute whether a) the two groups are different in their mean of the parameter and b) whether they differ in how variable the parameter is across sessions (so, the mean of the parameter standard deviation for each subject across sessions).

For b), it seemed straight-forward to compute for each subject the standard deviation across sessions and then the group level mean. When I make a histogram of the 1000 samples (from Stan) for this, the results are about the same as when I do a t-test on the data from the subjects after averaging over the 1000 samples. - so that’s reassuring.

However, for a), I thought the analysis should work the same way. But it does not: when I plot histograms of the means of the two groups (either computed as for the standard deviation above, or directly based on the samples of the group-level parameters that I get from Stan) [see top plot below] it looks very different than if I plot histograms of the single-subject parameters for the two groups [bottom plot]. (And the only stats I can do on this is a t-test because I have averaged out the samples – so that doesn’t seem good either). When I compute the means (either across samples [top] or across subjects [bottom]), they are the same.

So which figure now tells me whether my two groups are different? I assumed it would be the top figure and just did the bottom one as a sanity check, but now I’m not sure anymore.

I would be very grateful for any advice

It’s hard to fit anything with N = 2.

That can happen, especially when the model is very fine-grained and you don’t have enough data to support the richer model. See above. You wind up needing very strong priors if you only have N = 2 groups.

How did you simulate the data? Did you simulate the parameters from the priors and the data from the parameters and run multiple simulations to test calibration? That’s the Cook-Gelman-Rubin approach.

I don’t understand what you mean here—that’s not a parameter, per se. The parameters are what are declared in the parameters block of the program. The program defines a density over those, and that’s what’s sampled. Everthing else is a derived quantity like transformed parameters or generated quantities.