Computing ICC-like reliability (& within-unit variance) for hierarchical models?

So, take an hierarchical data scenario where each of multiple subjects in an experiment are measured many times in each condition of some manipulated variable IV, a saturated model of which being (using lmer syntax):

DV ~ 1+IV + ( 1+IV | SUBJECT )

This will estimate an intercept and effect of V1 for each level of SUBJECT, plus the correlation therebetween.

If one has two sessions of data, it is fairly straightforward to achieve inference on the traditional “test-retest” correlations by binding the rows of the two sessions together and using indicator contrasts for each session, resulting in the following unique rows in the contrast matrix:

# A tibble: 4 x 4
  `Intercept_1`  IV_1 `Intercept_2`  IV_2
          <dbl> <dbl>         <dbl> <dbl>
1             1     1             0     0
2             1    -1             0     0
3             0     0             1     1
4             0     0             1    -1

With that formulation, the correlation between Intercept_1 & Intercept_2 encodes the test-retest reliability of the intercept, and the correlation between IV_1 & IV_2 encodes the test-retest reliability of the effect of the variable IV. And note that each reliability comes with a full posterior distribution.

However, I feel that, given the existence of classical quantities such as intraclass-correlation, there should be a way to compute reliability from even a single session. Gelman & Pardoe (2006) show a simple computation for a quantity I think is related to this, the “pooling factor” (\lambda) but that quantity would take the form of a single point value for each term, and I (possibly naively) was hoping there was a sensible quantity for which we could obtain a full posterior distribution.

Now, to be clear, G&P2006 define \lambda for a given term (i.e. the intercept or the effect of IV) as (with my shorthand at the far right):

\lambda = 1 - \frac{V^{K}_{k=1}(E(\epsilon_k))}{E(V^{K}_{k=1}(\epsilon_k))} = 1 - \frac{\lambda_n}{\lambda_d}

Where, \lambda_n the numerator of the fraction is straightforward, representing the between-subjects variance (though possibly not an identical quantity to the latent between-subjects variance that we typically model but a quasi-empirical variant? At least, in the G&P2006’s R code they seem to look at the latter even though the BUGS model incorporates the former).

The denominator, \lambda_d, is a bit more complicated, but I think it’s intended to reflect the expected variance of the term within each subject. With multiple sessions, it’s straightforward to compute this for a given term (ex. the intercept) by computing it for each subject and session, then within each subject computing the variance of the values and finally collapsing the resulting collection of variances to a single mean across subjects. I expect that value would correspond well with the correlations derived above.

With one session, however, it seems more difficult to derive the expected within-subject variance of a term. G&P2006 propose using the variance across samples in the posterior, thus yielding a single point value rather than a full posterior.

In principle I guess one still has a full posterior on the between-subjects variance (or, at least if one uses the model parameter for this quantity, c.f. G&P2006’s use of a quasi-empirical quantity), one could then divide this value by the point value for \lambda_d to yield a full posterior on the pooling factor, but I still think I’m missing something with that “solution”.

I presume this is all old had somewhere and I’m just terrible at searching to find the relevant literature on this computation. Anyone have any tips? (@andrewgelman yourself perhaps?)

One idea I’ve had is to use the posterior samples to literally generate many simulated sessions of data for many simulated subject for each sample in the posterior, but that feels like a lot of compute and again I have this nagging feeling there’s an analytic solution that I’m missing. If the model were just an intercept, then the magnitude of measurement noise and number of samples would determine the within-subject variance of the intercept term, but with a more complicated model including effects reflecting difference scores between conditions, I’m finding it harder to think of the formula for the implied magnitude of within-subject variance for all terms.


Over on Twitter @donny pointed me to some papers he wrote with @Stephen_Martin that look to be exactly what I’m looking for! I’ll need to take some time to digest, but I’ll report back here.

(Oh, and btw, I’m personally more interested in the \lambda_d term than reliability per-se, but I know that reviewers will want to see reliability too and thought the context of reliability would yield solutions to getting a posterior on \lambda_d)

1 Like

Glad they might work.

If I recall correctly, the shrinkage factor is the average score reliability, no ?

From Introduction to multilevel modeling using rstanarm: A tutorial for education researchers


And then here is average score reliability from

So the idea of using shrinkage to overcome multiple comparisons, etc., is related exactly to psychometrics, i.e., measurment reliability (actually, the earliest derivation I know of for the shrinkage factor comes from a psychometrician from the 1920’s in educational testing).

1 Like

Yes, and @Nathaniel-Haines shows/explores the relation in this post.

I suggest the following modeling approach.

  1. Format your data as below:
subject     session       IV        DV
subj1        ses1        -0.5       ...
subj1        ses1         0.5       ...
subj1        ses2        -0.5       ...
subj1        ses2         0.5       ...
  1. Adopt the model formulation (17) in our recent test-retest reliability paper

  2. Execute the model as

DV ~ ~0+session+session:IV+(0+session|subject)+(0+session:IV|subject)

The correlation associated with (0+session:IV|subj) is what you are looking for (the contrast between the two levels of IV) while the correlation associated with (0+session|subj) is for the average between the two levels of IV.

Right, sorry, I could have been more clear in my presentation of the topic. I am pretty comfortable with the computation in the context of multiple sessions of data. It’s the single-session scenario that has left me puzzled.

Here is an old school source for computing the pooled estimates.

Kelley, T. L. (1927). Interpretation of educational measurements.

Here is from Gelman and Pardoe “Bayesian Measures of Explained Variance and
Pooling in Multilevel (Hierarchical) Models”


Same thing (\omega is reliability), but we have the citation from psychology (my field) from 1927. I heard Gelman once say at a talk, if you thought of it, a psychometrician probably thought of it decades ago. Which many are surprised to see such things from psychology related fields, so I am sure to cite Kelley :-)

1 Like

One note is that model assumes everyone has the same measurement error, which is hardly the case in, say, cognitive tasks. In our work linked above, we allow for individual differences in sigma (or the residual variance), which leads to individual difference in reliability, i.e., per subject reliability.

1 Like

(still haven’t looked at the refs, but couldn’t resist looking at your responses)

Yeah, I’ve been doing full location-scale models. Wait, actually, the main one I’m working on is multi-outcome with location-scale for one outcome (log-RT) but the multivariate normal also has log-odds coefficients for a binomial outcome (response error rates); presumably for a binomial outcome the within-subjects variance term is entirely determined by number of trials? Or the probability and number of trials? Feel free to ignore this if the binomial outcome case is discussed in your papers already…

Hi–I’m getting lost in the details. My intuition here is that for each wrinkle in the design you can include a new variance parameter in your model. This paper may give a sense of what I’m talking about:
I hope this helps!
See you

1 Like

@donny Finally got around to looking at the refs you suggested (all this is for publishing my dissertation work, so takes side-priority to my day job), and so far as I can tell they show how to use the MELSS framwork to get posterior distributions on the individual-level reliability in each condition of a multi-condition within-Subjects experimental design. So, for a stroop experiment, we get a reliability for Subject 1 in the Congruent condition and a reliability for Subject 1 in the Incongruent condition.

Is there any similarly straightforward way to derive the posterior distribution on Subject 1’s reliability for the Incongruent-minus-Congruent difference score?

There was one paper that used a multivariate model in psychometrika not to long ago. I tried to find it, but could not.

But what you could do, assuming you have two sessions, is fit a multivariate mixed effects model. Then the correlation between the inhibition effects would be test-retest reliability.

Other than that, if there is some equation for difference score reliability for one session, then maybe this can be applied directly to the posteriors somehow ?

Yes, I’ve done that part already. Part of why I’m trying to work out if there is a way to do it in a single session is that my multi-session correlations are unreasonably low for even quantities like the intercept, which should be very high. I thought the lkj prior might have been over-regularizing to zero amid a massive set of correlations, but I just ran with no prior on the correlations and getting the same odd behaviour. When I run the model with a much smaller correlation matrix (ex. modelling just the intercept), they come out high as expected, so I thought I’d explore alternate computations and see if they yield the same behaviour.

Yeah, this is my core question, whether said equation already exists somewhere.

Oh! If K is the across-subjects covariance between congruent & incongruent “true” scores and j indexes a particular subject, and N_{Q_j} is the number of trials observed in condition Q for subject j, then wouldn’t it be simply:

  • First work out the between-Ss SD of the difference score given the variance & covariance of the component scores:
    \sigma_{B_{\Delta}} = \sqrt{ \sigma_{B_{Incongruent}}^2 + \sigma_{B_{Congruent}}^2 - 2*K }

  • then for each component score compute the within-Ss standard error given the within-Ss SD and measurement effort:
    SE_{W_{Q_j}} = \frac{\sigma_{W_{Q_j}}}{\sqrt{N_{Q_j}}}

  • Now combine the two SEs:
    \sigma_{W_{\Delta_j}} = \sqrt{(SE_{W_{Incongruent_j}})^2 + (SE_{W_{Congruent_j}})^2}

  • and finally use the standard ratio:
    \rho_{\Delta_j} = \frac{\sigma_{B_{\Delta}} }{\sigma_{B_{\Delta}} + \sigma_{W_{\Delta_j}} }


Hm, \sigma_{B_{\Delta}} feels right but I’m less sure of \sigma_{W_{\Delta_j}}; What do you think @donny?

(edit: corrected a silly mis-expression of the standard error)

My kids are up, so have to come back to this.

But here is that paper

I am not sure if it will work for your case, as it seems their formulation is for actual pre post, rather than repeated measurements from the different conditions.

No rush! I think I worked things out for the Gaussian-outcome case in my last post here, now to work out the binomial-outcome case…

Just a quick update to this topic in case anyone has interests in this domain. While I think I worked out how to obtain the single-session reliability of a measure with Gaussian-distributed measurement noise through the eqns posted above, I was having trouble working out a similar solution to the binomial outcome case and also had some other issues with extracting quantities of interest from hierarchical data like this, so I came up with a much more brute-force solution that is also far more flexible.

The approach I’m now working on is:

  • Sample hierarchical model with whatever contrast scheme makes most sense for appropriate regularization, etc. (so, no need for complicated contrasts to split multi-session data to sample test-retest correlations; that quantity will be derivable later) and for each sample in the posterior:
    • extract the parameters associated with characterizing the group-level multivariate structure (means, sds, correlation matrix) and use this structure to generate many simulated subjects, then within each:
      • compute any derived quantity of interest (\Psi_{true})
      • generate lots of simulated observation-level datasets using the same structure as that which obtained the original real data, and in each:
        • Use Stan to estimate the simSubj’s “true” parameters, using the group-level multivariate structure as priors, then for each sample in the resulting posterior
          • compute each DQI (\Psi_{est})
        • collapse the resulting posteriors on each \Psi_{est} to some point estimate (ex. mean/median). (It’s possible that this part needs to be avoided, causing later steps to have to instead sample from each posterior, but going with this for now for compute/storage minimization)
      • Each DQI for this simSubj now has a distribution of values from many simulated sessions, so within each, compute a mean and an SD (\mu_{\Psi_{est}}, \sigma_{\Psi_{est}})
    • Each simSubj:DQI combo now has a true value, a mean estimated value and the sd of the estimated values, so now we can compute:
      • core quantities:
        • group-level correlations among \Psi_{true} (\rho_{\Psi_{true}})
        • SD of \Psi_{true} (\sigma_{\Psi_{true}}, true between-Ss variability in the DQI)
        • Mean of \sigma_{\Psi_{est}} (\mu_{\sigma_{\Psi_{est}}}; mean within-Ss variability in the DQI)
        • “reliability” of \Psi (\Phi_{\Psi}) via \frac{\sigma_{\Psi_{true}}}{\sigma_{\Psi_{true}} + \mu_{\sigma_{\Psi_{est}}}}
        • (if interested, one could also compute here each simSubj’s individual reliability, which wouldn’t be of interest by itself given they’re not real subjects, but the variability of the by-subject reliabilities might be of interest)
      • extras (not sure these acutally have usefulness)
        • group-level correlations among \Psi_{est} (\rho_{\Psi_{est}})
        • SD of \Psi_{est}
        • SD of \sigma_{\Psi_{est}} (\sigma_{\sigma_{\Psi_{est}}}; variability of within-Ss variability in the DQI)
        • “reliability” of \Psi_{est} (\Phi_{\Psi_{est}}) via \frac{\sigma_{\Psi_{est}}}{\sigma_{\Psi_{est}} + \mu_{\sigma_{\Psi_{est}}}}
  • Now we have a posterior distribution on each of the core and extra quantities!