Checking interpretation of sd parameter

I just have a quick question to check that I am correctly interpreting the output of a multilevel model I have run using BRMS. The formula for the model is:

Outcome ~ 1 + Session + Session:Condition + (1 + Session | gr(PPN, by = Condition))

With this model, I obtain the following output (the precise numbers are not relevant):

Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
sd(Intercept:Condition1) 0.48 0.11 0.31 0.73 2731 1.00
sd(sessions2:Condition1) 0.62 0.12 0.43 0.89 2525 1.00
sd(sessions3:Condition1) 0.71 0.14 0.48 1.04 2986 1.00
sd(sessions4:Condition1) 0.77 0.15 0.53 1.10 3070 1.00
sd(Intercept:Condition2) 0.42 0.06 0.31 0.57 2695 1.00
sd(sessions2:Condition2) 0.55 0.08 0.41 0.72 2778 1.00
sd(sessions3:Condition2) 0.57 0.08 0.44 0.74 2674 1.00
sd(sessions4:Condition2) 0.61 0.08 0.47 0.79 2915 1.00

What I wanted to check is that the sd terms for sessions2, sessions3, and sessions4 are not themselves sd terms, but rather deviations from the intercept sd. Is that correct?

For example, in order to calculate the sd for sessions2 in Condition 1, I would not just take the estimate sd(sessions2:Condition1). Instead, I should add together the posterior distributions of the intercept sd(Intercept:Condition1) and sd(sessions2:Condition1), and then make an estimate from the result.

This is quite important to understand, because if the values of the sd parameters must be combined, then it seems clear that the sds for the later sessions are higher than for the intercept (which makes sense, as people respond differently to a treatment, but are quite similar to one another at the start). On the other hand, if the sds for the later sessions are just direct estimates of the the session sds (rather than deviations from the intercept sd), the interpretation is quite different.

Have I understood the sd parameters correctly?

All the best,

Please also provide the following information in addition to your question:

  • Operating System: MacOS 10.12.06
  • brms Version: 2.8.0
1 Like

If I understand you correctly, this is not correct.

sd(sessions2:Condition1) AFAIK estimates how much does the coefficient for sessions2:Condition1 vary between the levels of PPN

I am also not sure what you mean by “sd for sessions2 in Condition 1”, could you clarify? In general it tends to be tricky to interpret coefficients of a hierarchical model directly and I find it preferable to just get posterior predictions with posterior_predict or posterior_linpred and interpret the predictions.

Hi Martin,
I don’t think I was super clear, and also gave an unnecessarily complex example. Let’s imagine a much simpler model, where we have a bunch of people seeking treatment at baseline, and we give them all a treatment, and then we test them again, so our only key effect is Session (Baseline vs. Posttreatment):

Outcome ~ 1 + Session + (1 + Session | Participant)

We might expect that the treatment affects different people in different ways, and so at baseline, the standard deviation is relatively low for Outcome, but at Posttreatment, some people get better and some people don’t, so we have a larger standard deviation around the estimated mean at Posttreatment. To incorporate this, we allow the model to have varying intercepts for each participant, and a varying effect of session. My understanding is that this allows the sd parameters to then vary from Baseline to Posttreatment in the model.

If I were to run this model, I would get two sd parameters:
sd(Intercept) - which I interpreted as sd around the mean at baseline, and
sd(Posttreatment) - which I interpreted as either 1) the sd around the mean at posttreatment, or 2) the change in the sd around the mean at posttreatment, relative to baseline. So if I wanted to know what the estimated sd is at posttreatment, I might then have to add (via samples from the posterior) the sd estimates of sd(Intercept) and sd(Posttreatment).

I hope I have clarified somewhat what I was getting at, but if not then I am happy to try and explain further.

I see. I think your approach will give you a rough answer, but I still think a better way is to make predictions - the main problem I see is that adding those sds will ignore any correlation in the posterior. E.g. it is well possible that if the data do not constrain the parameters very well high sd(Intercept) will tend to be paired with low sd(Posttreatment) and vice versa.

The only way to incorporate this correlation is to use the posterior_predict function to generate samples of predictions and compute sds over those samples. This also makes explicit the choice between predicting for the participants in the study (i.e. using the data you used for fitting for prediction) and predicting for hypothethical new participants (the inferences for those could - in general - differ). See the allow_new_levels parameter to extract_draws as well.

Hope that helps :-)

That does help indeed, thank you! Just to clarify with regards to my original question about whether the sd terms reflect the actual sd at that level, or a deviation relative to the intercept sd - is it correct that if you did want to understand the sds in the way I suggest, the sd for posttreatment in this case would be sd(intercept) + sd(Posttreatment), and not simply sd(Posttreatment), in the same way that the mean for Posttreatment would be mu(intercept) + mu(Posttreatment), rather than merely mu(Posttreatment)?

I’m a little confused about why adding the 2 could not take the possible correlated levels into account. If I extracted samples from the posterior, then added them together, and then summarised the whole distribution of samples, would this not incorporate correlations between the parameters, as each sample from the posterior is a set of jointly credible parameter values?

I will in any case look further into posterior_predict, as it seems a very useful and powerful approach. If you know of a decent tutorial on using posterior predict to make inferences with a simple model for a metric outcome it would be much appreciated! I’m currently looking at which seems to be along the lines you are talking about.

I think it’s also possible that I am really quite mistaken in interpreting the sd parameters, and if I want to get the sd for e.g., pretreatment and posttreatment, in fact what I might better be doing is explicitly asking for a prediction of standard deviations (aka) sigma in the regression equation, as Buerkner does in a tutorial on estimating distributional models, e.g.:

fit1 <- brm(bf(symptom_post ~ group, sigma ~ group), 
            data = dat1, family = gaussian())

Essentially, I just wish to have not only the means pre and post-treatment, but also the standard deviations, given that these are likely to vary. I thought this is what I was allowing to happen with the varying slopes, but now I think not!

I am also not sure what exactly do you want to do with the sd. Standard deviation of what? Of the predictions? Or residual standard deviation? Or some combination of those? If you need some measure of “variance explained”, then intra-class correlation coefficient (ICC) might be for you - as in Intraclass correlation - Wikipedia or Nakagawa & Schielzeth 2010, ‘Repeatability for Gaussian and non‐Gaussian data: a practical guide for biologists’

Also -one thing I missed before: you should IMHO not add sds - you should add variances (which are additive) and then take the square root to have sd of the sum.

Exactly - when you compute whatever value you need per sample and then take the summary everything is smooth. I thought you wanted to take the summary values of the individual parameters and then add, which might be problematic.

Roughly yes. The sd post treatment would IMHO be sqrt(sd(intercept)^2 + sd(Posttreatment)^2 + sigma^2), where sigma is the residual variance. One problem I notice is that this implies that sd post treatment is always higher than sd pre-treatment - which is true for this model, but not necessarily in reality (if the treatment worked very well you would expect the variance to be reduced).

(Now just speculating as I am bit out of my depth) Using Outcome ~ 1 + Session + (0 + Session | Participant) will treat baseline in the same way it does treatment and so post-treatment variability could actually be lower. Alternatively having distributional model with sigma ~ Session will explicitly let the omdel to have lower variability post treatment.

Anyway, please check for yourself that what I say makes sense, I am definitely not 100% sure all I am saying is rock solid, but I hope it is of some use to you :-)

Thanks for your patient explanation @martinmodrak - what I am trying to do is get a model that allows for, and estimates, heterogeneous variances/standard deviations around the mean in each cell of the model. Perhaps it is easier to get what I mean by thinking of it as a sort of ANOVA type design rather than a regression type design. Essentially in this simple model it would we a 2 (session, pre vs. post treatment) x 2 (treatment, placebo vs. real treatment) design. Each cell (e.g., pretreatment, placebo - which in the regression model would be the intercept) could be described by normal distribution with a mean and standard deviation. I have reason to believe that the means in each of these 4 cells may well differ, but also that the standard deviations around the mean might also differ. So I am trying to make a model that does not demand the standard deviations all be the same, or at least for example that it allows them to differ for pre vs. post-treatment.

I had thought, I now think very mistakenly, that by allowing the slopes to vary, I was also implicitly saying that the standard deviations in each cell could also differ. I think it might be more appropriate to explicitly ask the model to estimate both mu and sigma.

The reasons for doing this are twofold. Firstly, I would not like the model to make assumptions about the data that are violated, if the different levels in fact do have very different standard deviations. Secondly, if the sd is estimated for each cell, it would also be possible to compute estimates of effect sizes.

ANOVA-like designs IMHO do call for the distributional regression features of brms, so I think you are on the right track.

Yes and no :-). The standard deviations of the outcome from the predicted mean, i.e. the observation noise (which roughly corresponds to the variability within cells you would expect when running the experiment multiple times with the same person, assuming complete “reset” of the person for each run) is assumed constant across cells even when slopes vary. Do you believe that this “noise” could differ between cells?

On the other hand, the standard deviations of predicted values within cells (which roughly corresponds to the variability you would expect within each cell in a hypothetical replication of the experiment with either a new set of people or the same people, depending on the way you predict) could do all sorts of stuff with varying slopes. For example, the model fit could have that participants with higher pre-treatment values tend to have lower slopes and thus variance post-treatment would decrease. The opposite could also be true.

Yes, I think so too, and brms seems a very nice way of doing it, if I can get to grips with it!

Frustratingly complicated! As far as I understand what you are saying, I don’t think that noise should necessarily differ. I also agree that the sds of predicted values within cells could indeed go down or up, especially e.g., if there is some sort of ‘normal’ value that everyone treated moves towards after treatment.

What I would like to do is essentially use the data to get estimates of whether this is happening, if that is possible. So I am wondering, how might you set up a model where at the end of it, you could read out some parameters in the same way as you could when describing descriptive statistics for the raw data in a 2x2 design. E.g., you could say we estimate that pretreatment, real treatment group, scores are distributed with a mean of X and an sd of Y (including credible intervals from the posterior). Post-treatment, scores are distributed with a mean of A and a sd of B. You could perhaps even then compare the posterior distributions of the means, and divide by the pooled sd (also from the posterior distribution), to get an estimate of plausible Cohen’s d values. Might this be possible, either with a combination of a carefully specified model and then adding parts of the posterior distributions for parameters together, or by carefully specifying the model that allows for such effects and then summarising posterior predictions?

Apologies that this is dragging on rather a lot - I really appreciate the help!

You can do that with any Bayesian model - you just predict (via posterior samples) new values for the cells, compute the relevant sds for each sample and then summarize the distribution of the samples. The tricky part is to make sure your model is a good fit to your data and that the numbers you have computed are relevant to the real world. This involves both theoretical considerations and empirical checks - e.g. posterior predictive checks as described in the visualisation paper. For brms models this is quite eaasy to do via the pp_check.brmsfit method.

If in doubt about model choice, you can always fit more than one model and see if the inferences differ noticeably, in the spirit of “multiverse analysis”. You can then report all the models in you paper/other output. There are many ways to report such multiverse analysis. Apart from the way Andrew does it in the linked paper, here is for example how I’ve done this in a recent collaboration (which I am very happy about :-):

Where each row is yes/no statement that can be evaluated for each sample (in your case it could be for example “sd post treatment < sd pre treatment”) and posterior probability is simply the proportion of samples where the statement holds.

While this was some extra work to do, it also simplified things: since the conclusions we care about do not change much with model choice, there is less need to spend too much time making sure you’ve chosen the best model. Also reviewer complaints in the spirit of “you should have accounted for X in your model” were pretty minimal.

Very nice representation, wow!

I guess the final thing I want to check, which comes back to the interpretation/understanding of the parameters and the initial model specification, my understanding is that if I do not specify in the model that it should have a parameter that estimates standard deviations separately for each ‘cell’ in the 2x2 design, then the posterior predictions of the sds in general will not differ for each cell, as they are based on what is specified in the model. So what, precisely, should I incorporate into the model to allow for this to happen? E.g., if I just specified:

Outcome ~ 1 + Session + (1 | subject)

Then the posterior predicted values might give a different mean, but the standard deviations around the means in session 1 vs. session 2 would not have been modelled as differing, and so would in general not differ (or am I just wrong here?). So would I need to specify something like:

bf(Outcome ~ 1 + Session + (1 | subject), Sigma ~ 1 + Session + (1 | subject))

So that posterior predictions can have varying standard deviations? (I’m also not sure what to do about the nested observations within each subject when specifying the sigma formula).

The tutorials are great, but I would kill for a tutorial that essentially just uses brms to run a quite simple mixed (between + within subjects) ANOVA with a metric outcome variable!

Yes, here there is nothing to make the sd of the cells differ. But with a varying slope (e.g. Outcome ~ 1 + Session + (1 + Session | subject)), the sd of the cells can differ due to the varying slope per subject.

Letting sigma vary per subject is probably not a great idea as you probably don’t have data to estimate separate error term per subject (unless you have many measurements of the same subject per cell). I would go for something like

bf(Outcome ~ 1 + Session + (1 | subject), sigma ~ 1 + Session)

Which you are more likely to have enough data to estimate.

(but really you should test if this is sensible, the distributional regression vignette has some hints on this, and also do pp_check)

Then please write up your experience, once you are done with it! :-) Even a short post will help someone and we IMHO lack people who write tutorials.

Also note that the first example in the distrubtional regression vignette is AFAIK one way ANOVA, so make sure you are generalizing from there.

Ahh yes, I think that’s where my initial confusion in understanding those sd slope terms came from - I was thinking about what the parameters would mean for model predictions, but not directly about what they exactly mean.

Indeed, I was thinking it is strange to have that in there, but didn’t know if it would somehow throw the model of if the equations for the outcome and sigma don’t mirror each other or something.
Likely I do not have sufficient observations to model both varying slopes and sigma, and for some of my data even just measuring one may be tricky as for several variables it is just one observation at each time point per subject. However, I do have some data with multiple observations so I will try what I can, now that I know more what I am trying to achieve!

Yes, the distributional vignette I think has a simple example with an ANOVA, and I’ve also successfully done this in a different data set with just a comparison between groups and estimating sigma and mu - it works very easily. Things seem a bit trickier when it comes to repeated measures in terms of convergence, so perhaps either my data is much more messy or I need some more informative priors.

Yes I absolutely would like to do this - I was speaking with some of my colleagues who are also perplexed by multilevel. I think once even I can understand it, I should be able to explain it very simply - just a little hesitant as every time I think I’ve made a breakthrough I realise there is one other slight thing I’ve misunderstood!

Note that convergence issues can easily arise when your model could be mismatched to the data or your model is not identifiable. Shameless plug: I’ve written something beginner-friendly on divergences and non-identifiability which might be of value to you:

Also feel free to ask for help with convergence issues (probably better with a new thread, this one is getting long)


Awesome, I will read through those posts - thank you!