Is my approach to posterior distribution of effect size in linear model valid?

[Edit - my first proposal of an approach in this message was not correct due to mistaking what the sigma term in the model output means. In the later comments from ReneTwo and myself, a more suitable approach is (hopefully) arrived at]

I am looking to check the validity of an approach I have taken to get a posterior distribution of effect sizes in a linear model.

I have fitted the following regression model. I suppressed the intercept so that I begin with estimated means for each group. Group and Prox are categorical variables with 2 values:

model <- brm(outcomeVar ~ groupindex-1 + proxindex + proxindex:groupindex + (1|subject)

To approximate a ‘main effect’ of group, I:

  • drew samples from the posterior (posterior_samples)
  • calculated the overall score of each group, irrespective of proximity, by adding different parameter estimates together. E.g., for group1 overall score, I just calculated: (group 1 mean + (group 1 mean + effect of proximity))/2 - so the result is the average of group 1, over both proximity conditions.
  • I then calculated differences between the groups by subtracting one Group’s overall score from the other, over all the samples from the posterior.
  • I then can get a posterior distribution of the differences. So, this is just a simple group level contrast that averages over one of the variables, and I think it is correct

Is it valid to then generate a posterior distribution of possible effect sizes (Cohen’s d - the difference between groups divided by the standard deviation) by simply dividing the group difference in each posterior sample by the estimated sigma value for each sample? This makes sense to me, and does produce sensible results, but I wanted to confirm that it is an acceptable way to use the posterior. I have attached a plot I made of one such effect size posterior distribution (the very large size of the effect is not a mistake - it is for a manipulation check where there is almost no overlap in the values from each group).

What I did next was also to compare two effect sizes by subtracting their estimates from each other. I then could get a posterior distribution of the difference between effects sizes, showing whether or not one was likely larger than the other. Is that also a viable approach?

If you are able to confirm whether this approach to making effect sizes is valid, it would be awesome!


Please also provide the following information in addition to your question:

  • Operating System: macOS Sierra 10.12.6
  • brms Version: 2.8.0

Hi JimBob,

I lately pondered about a similar way to circumvent Stan-hardcoding… (where this question has a different answer, namely: estimate the standardized effect directly, and then transform it to the group mean estimates).

I think, in general your way of doing this is somewhat justifiable. But, I see three general issues in doing this, which need some elaboration if you want to sell this:

  1. You say you are using the estimates for sigma to standardize the mean differences. But why not the subject intercepts, or both (I think you simply can add both variances together, not sure exactly). That is, you -have- two variance clusters, one caused by the presence of different subjects (by subject intercepts) and one taking the rest (the residuals that can not be explained by the model, including the differences between individuals). So the subject intercepts imply, that you have repeated measurements of your DV. And taking this variability into account will lead to a different -theoretical- interpretation of your effect size. Now, would you want to standardize a between-group effect relative to between-subject variability? The answer seems to be ‘yes’, and it should be the rather intercept variance, because random intercepts represent the deviations of the subjects average response from the group mean(s), while the remaining residuals, are within subject errors. Within subject variance usually is smaller than between subject variance, such that using the residuals probably drastically underestimates your “between-subjects” effect (and thus, still, the large effect size looks like a mistake :)); no offense).

  2. There is an estimation issue I guess (I did not try it out yet), with respect to what is called effect- ‘shrinkage’ (of the difference), which is an important/debated topic in statistical modeling (i.e. shrinkage = somewhat desirable, you will easily find papers on that, if not, let me know). Shrinkage (in short: ‘downscaled’ estimation of the ‘difference effect’) might be absent with the way you estimate your cell-means, and calculate differences afterwards. In other words, usually, you define an effect prior (for the difference; e.g. Cauchy(m=0, sd=1)) which serves as starting point for the sampling procedure, and then construct the cell-means from that mean difference. And the starting point always influences the posterior (e.g. with a prior of Gaussian(m=0,sd=0.1), you will hardly obtain a posterior different from that (e.g. if the true difference is 10), because the most likely differences are never sampled. This is an extreme (and unreasonable) case of shrinkage… :)) . But you see, with an -actual- Null-Prior on the difference, the estimated difference will be another one than the one you got. Maybe not (it might depend on the way you define the cell-mean priors). But you will need arguments for or against this :)

  3. Back to the variance issue. Traditionally, mean standardization is never done by using -overall variance estimates- (regardless of which you variance term you will use eventually), but variance estimates which are the pooled variances of both groups. So lets take the subject intercepts: I wondered somewhat (not too far) whether one could simply use the overall variance estimate of the subjects, or whether one should separately estimate subject variance in each group. But in the end, one might simply stop thinking, and do the latter to be on the safe side. For the subject intercepts this is definitely possible since the new brms release. i.e.:

 .... +(1|gr(subject, by = group))

You can then pool these separate intercept variance estimates (weighted by the sample sizes of course). And if you found a reasonable answer to the shrinkage issue (or to whether it is important to you) than this might be your best shot :)

Hope this helps.
Best René

Thanks @ReneTwo for your quick response.

Regarding point 1, are you suggesting just adding the estimate of the variance in intercepts for each subject to the overall sigma given in the model output, and then using that as the divisor of the difference between groups?
No offence taken about the size of the effect - I agree it is very large, perhaps questionably so. On the other hand, it is ratings of emotion in a very high vs a low fear group, and they are told they are just about to be confronted with the feared object, so in the one group they really are freaking out, and the other group are just not bothered - the responses they give are totally different.

  1. For this shrinkage consideration, I used a rather vague prior as I knew the differences could be very large. Could this issue however be addressed with a sensitivity analysis - trying some alternative priors for the effect that are increasingly conservative, and confirming at which point, if at all, the estimates change?

  2. I think I follow you, but I’m not sure. Indeed if we did the effect size the standard way in frequentist stats we would get grp1 sd and grp2 sd separately, then do a pooled estimate to divide the group difference. In a separate regression, I did look at whether overall sigma varied by group (but not the variance in intercepts of subjects), and sigma seemed the same for each group, so I thought an overall sigma value might be possible. However, I did not ask for separate intercept sds per group.

I don’t think I completely understand what the code you noted would do - I think provide separate intercept variances for each group? Would you then pool these estimates, and add them to the overall sigma, or just use them on their own as the divisor of the group difference? Is gr a function?


No big deal :) . I like this topic.

on 1) To be completely and un-agreeably clear. I think using the residuals (sigma) of the model is simply wrong (still no offense), apart from the nature of the manipulation (like emotion, I see your point). But your manipulation seems to be between-subjects. And in a model, which separates residuals from between-participant variance (subject intercepts), the residuals have nothing to do anymore with between-variance, which has to be the main basis of the standardization (i.e. the subject intercepts). One can surely do nothing wrong to also include the residuals, but using only is not want you want.

on 2) Sounds good to me :) But, unfortunately, I am no expert on shrinkage.

on 3) Again, the residuals (sigma) in your model are -not- between variance, but -within-variance; see point 1) (and check this post This means, to find out whether the groups have different variances between-participants, you need to look at the subject intercepts. And yes, the code I provided does exactly what you think it does :) You simply need to replace (1|subject) by it and voila :), separate intercept estimates for each group (within-each-group variability between subjects). It is actually quite important to do this, since the overall sample size greatly determines the variance estimates. And if one group has less variance then the other then the variance of the latter will be underestimated if you do not take them separately. In turn, if the ‘actual’ variance in a group is larger than your estimate tells, then the posterior of the corresponding group mean, will underestimate the range of likely mean parameters, as well (i.e. overestimate the certainty you can put into the mean), which, in turn, will affect your difference estimate :))

Best, René

I appreciate the clearness! Really this type of analysis is quite new to me so I think I may have been misinterpreting what the sigma value in the overall model actually represents. As it is called something like ‘family specific parameters’ I had thought it is something like a pooled overall estimate of the standard deviation around the mean in each group, and thus would be like what you would use to make Cohen’s d in a typical analysis. From what you’re saying, I’m quite mistaken!?

I will try out editing the model with your suggested addition and also read the link to get a better idea. I will probably write back again once I’ve gone over everything properly…


Well, think about it this way:
If there is a common definition of what a ‘residual’ is, then it would be: The difference between a predicted and an actual outcome. And all residuals together describe the residual variance. If you further go on and define what a ‘prediction’ is (i.e. the model), this determines what you can say additionally about what the residual is, e.g. like you just did.
For example, a classical (non-mixed-model), would look like:

y ~ [intercept + effect] + residual'

Where the brackets are what the definition above, refers to as ‘model’, and error’ (residuals) are left over.
In mixed models, in which you have repeated measures for subjects, this would look like

y~[intercept + effect + (1+effect|subject)] + residual''

Now, the error (residual) is something different, because the subject variance is partialed out, or ‘bound’ to the intercept and slope variance.
And notice, when comparing both models you see:
(classic) residual’ = (1+effect|subject)] + residual’’

And you also see, if every participant would have only given one response, then:
(1+effect|subject)] + residual’’ defines three parameters (intercept, slope, and residuals) for -one and only one deviation from a mean for each individual (i.e. subject variance), which is beyond overparameterization (i.e. unidentifiable). In other words, subject intercepts and residuals are the same (and slopes don’t exist). And in this case, one-response-per-subject, you would be right, residuals = individual variation. But with repeated measures (or stimulus replications) subject intercepts ‘remove’ the individual variation from the residual terms, and residuals become within subject variance.

Hope this helps clarifying mixed-models (or how to make plausible hierarchical Bayes models). I am still trying to find the most intuitive/effective way of explaining this kind of statistics :))

Best, René

This does clarify things, thank you. I think my confusion partly stemmed from just seeing residuals labeled as ‘sigma’ in the model summary, whereas with lmer, the estimate of the residuals is actually labelled as ‘residual’. I saw ‘sigma’ and just thought it means the estimated sd for the means, assuming them to be homogeneous over all the levels. Now I see it is essentially the ‘prediction error’ of the model, if I understand you correctly.

I successfully implemented the model with the edited code you suggested. Looking into the posterior for these I was also able to see that some had what are to me quite implausibly low estimates for deviations, and so I was then able to set a prior on the sd values with a gamma distribution that makes such low estimates more unlikely to arise, which could otherwise lead to inflated effect size estimates. I was also able to make pooled estimates over both groups, taking account that one group had one extra subject, so should be weighted slightly differently in the calculation. Thanks very much for this!

Now to compound my confusion and possibly test your patience, when we request the model to estimate (1|gr(subject, by = group) as you suggested, the values returned are, for example, labeled as sd(Intercept:group0Low) - i.e., deviations around the intercept for ‘low’ and ‘high’ groups separately. If I were to use these for estimating effect sizes, is it problematic to use them for estimating the effect size when the between group factor I am manipulating is not the value that is used for the intercept? E.g., I am imagining that the variability might be different at different levels of the within groups factor, and this difference might also vary by group.

Looking at your example of a mixed model above, I think I may have managed to retrieve the sd around the means for the different levels implied by the 2x2 design: WithinGroup0@BetweenGroup0, WithinGroup1@BetweenGroup0, WithinGroup0@BetweenGroup1, and WithinGroup1@BetweenGroup1 separately with the following code:

(1 + proxindex | gr(subject, by = group))

For the full model that is then:

brm(outcome ~ group-1 + proxindex + proxindex:group + (1 + proxindex | gr(subject, by = group))

The output I then get for the group level effects lists (as well as covariances):

Can I rightly interpret these to mean?:
sd(Intercept:groupindex0Low) <- stddev around the mean for WithinGroup0@BetweenGroup0
sd(proxindex1Proximal:groupindex0Low) <- stddev around the mean for WithinGroup1@BetweenGroup0
sd(Intercept:groupindex1High) <- stddev around the mean for WithinGroup0@BetweenGroup1
sd(proxindex1Proximal:groupindex1High) <- stddev around the mean for WithinGroup1@BetweenGroup1

If that is the case, then I really think these could be used to make proper calculations of effect sizes for any comparison at different levels of the model, by selecting and pooling the appropriate sds.

If I have lost the plot here, or if you don’t understand what I’m getting at, then I can try to clarify.

Thanks so much for your explanations - I think I am getting somewhere…

Ah yeah… good point. It slipped my attention. There is a trick.
To being able to actually do what you have in mind, the code would be (I also use bf() lately, because you can plug it into useful other functions).

bayformula<-bf(outcome ~ 0 +proxindex:group + (1 + proxindex | gr(subject, by = group)))
get_prior(bayformula, data=THEDATA)
brm(bayformula,data and options andsoforth)

get_prior will show you how the output will be structured. In this case, the estimated fixed effect parameters will be (I don’t know the actual labels now):
groupX1: proxindexProximal
groupX1: proxindexDistal
groupX2: proxindexProximal
groupX2: proxindexDistal

which just means, that the model estimates all four cell means. On top, you still can model the random effect (or hierarchical) structure as before, this does not matter much, as it still simply defines a source of variance on the DV.
This means, you can finally take the cell means, and each has a variance part.
That is, for the group-effect, for instance, it is the intercept sd_group1 and sd_group2, which you can use to pool and standardize the mean differences.

I guess there is a way to start of with ‘real’ intercepts (DV~1 + group … etc), but I have no idea, why this might be wanted (back to the shrinkage issue I guess, because, the group effect with a real intercept, is a deviation from the intercept-estimate, instead of the cell-mean). In the end, I have the feeling that the model design formula (like intercept vs. no intercept) in this case is simply arbitrary, and important to know to interpret the effects, but nothing more, as long as it describes the same thing (e.g. group means).

Best, René

Ps: by saying “I have the feeling” I mean, that there might be a debate about the consequences in MCMC sampling when using intercept+effect or cell-means directly. Formally (without sampling questions) both approaches will fit the data identically.

1 Like

@ReneTwo this seems to work very well! I now have estimates of cell means, and estimates of the sd around them. It seems quite reasonable to me to estimate some effect sizes based on appropriately pooling those sds and looking at differences between cells/averages of cells. Thanks very much for your help with it :)