Example hierarchical models with 3+ levels of hierarchy?

Are there any example hierarchical models anywhere with 3 or more levels of hierarchy? For example, one score from each of multiple students, each in precisely one of multiple schools, each in precisely one of multiple districts?

Embarrassing as it is, I’m having trouble wrapping my head around implementing something like that and I’d like to see how some canonical examples approach it to help get past what I’m sure is going to be a head-smacking block.

I just coded up a quick example using brms. Does this help you?

# number of students/scores
N_g <- 50
# number of schools
N_j <- 20
# number of districts
N_k <- 10
# total cases
N <- N_g * N_j * N_k

# intercept (e.g. IQ scores)
alpha <- 100
student_vec <- 1:N
school_vec <- rep(1:N_j, each=N_g, times = N_k)
district_vec <- rep(1:N_k, each=N_j*N_g)
# random intercept for schools
r_j <- rnorm(N_j, 0, 10)
# random intercepts for districts
r_k <- rnorm(N_k, 0, 5)
# linear predictor
mu <- alpha + r_j[school_vec] + r_k[district_vec]
# noise
sigma <- 5

# simulate data
y <- rnorm(N, mu, sigma)

# create data frame
d <- data.frame(
  school = school_vec, 
  district = district_vec

# fit the model (can take a few minutes depending on N)
fit <- brms::brm(
  y ~ 1 + (1 | school) + (1 | district), 


And the output from a representative run:

Family: gaussian 
  Links: mu = identity; sigma = identity 
Formula: y ~ 1 + (1 | school) + (1 | district) 
   Data: d (Number of observations: 10000) 
Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup samples = 4000

Group-Level Effects: 
~district (Number of levels: 10) 
              Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
sd(Intercept)     3.72      1.06     2.29     6.17        622 1.01

~school (Number of levels: 20) 
              Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
sd(Intercept)    11.44      1.91     8.25    15.86        512 1.00

Population-Level Effects: 
          Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
Intercept   100.73      2.81    95.08   106.36        369 1.00

Family Specific Parameters: 
      Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
sigma     4.99      0.03     4.93     5.06       1389 1.00


Where the SD among students within school i is \sigma_i, I’m hoping to have partially-pool the values as log(\sigma_i) \sim N(\mu_{schools\sigma},\sigma_{schools\sigma})

Then, where the SD among schools within district j is \sigma_j, I’m hoping to have partially-pool the values as log(\sigma_j) \sim N(\mu_{districts\sigma},\sigma_{districts\sigma})

Does brms permit that kind of complexity? (I’m still not really familiar with it; more a raw-Stan person)

So you want heteroskedastic variances, varying by school and district?

brms can handle a variety of distribution-specific parameter formulas. For a model with a Gaussian likelihood, you could do something like:

fit <- brms::brm(
    y ~ 1 + (1 | school) + (1 | district), 
    sigma ~ 1 + (1 | school) + (1 | district),

although this is a much more difficult beast to fit, in my experience. I have a two-level hierarchical ordinal probit model with heteroskedastic variances here, if your interested: https://github.com/ConorGoold/GooldNewberry_modelling_shelter_dog_behaviour/blob/master/Stan_full_model.stan


Wow, that’s remarkably simple! (Specification-wise; I take your point that the geometry may reflect a difficult space to sample)

Yes, brms is excellent! Although I’m not actually a frequent user, tending towards raw Stan too.

I would probably code up those types of models in raw Stan anyway, because they will likely need careful tuning when it comes to fitting on real data sets with other complexities.

1 Like

Hm, on actually inspecting the generated Stan code, I actually don’t think this gets at the kind of nested hierarchy with heterogeneity I was seeking to implement. The specification:

          y ~ 1 + (1 | school) + (1 | district)
   ,  sigma ~ 1 + (1 | school) + (1 | district)

Allows for school and district random effects on the magnitude of variability among students, but still expresses that there is a single SD that conveys how schools vary from one another in their mean in each district. The idea I was trying to implement would have a separate SD per district to capture variability across districts in how schools’ means vary. Ditto separate SD per district to capture variability across districts in how schools’ SDs vary.

To be concrete, here’s some pseudocode of the generative process I’m thinking of (easiest to parse if you start at the innermost for loop):

for(this_district in districts){

	this_district_mean_school_mean = rnorm(1,mean_district_mean_school_mean,sd_district_mean_school_mean)
	this_district_sd_school_mean = exp(rnorm(1,mean_district_logsd_school_mean,sd_district_logsd_school_mean))

	this_district_mean_school_logsd = rnorm(1,mean_district_mean_school_logsd,sd_district_mean_school_logsd)
	this_district_sd_school_logsd = rnorm(1,mean_district_sd_school_logsd,sd_district_sd_school_logsd)

	for(this_school in schools_in_this_district){

		this_school_mean = rnorm(1,this_district_mean_school_mean,this_district_sd_school_mean)
		this_school_sd = exp(rnorm(1,this_district_mean_school_logsd,this_district_sd_school_logsd))

		for(this_student in students_in_this_school){
			score = rnorm(1,this_school_mean,this_school_sd)

Do you mean something like this:

data {
   int N_districts;
   int N_schools;
   int N_students;
   int school_in_district_idx[N_schools];
   int student_in_school_idx[N_students];
   vector[N_students] score;

parameters {
  real top_mu;
  real<lower=0> top_sigma;

  vector[N_districts] district_mu;
  vector<lower=0>[N_districts] district_sigma;
  vector[N_schools] school_mu;
  vector<lower=0>[N_schools] school_sigma;

  vector[N_students] student_mu;
  real<lower=0> sigma;

model {
  top_mu ~ std_normal()
  top_sigma ~ std_normal()
  district_mu ~ normal(top_mu, top_sigma);

  for (s in 1:N_schools) {
     int idx = school_in_district_idx[s];
     school_mu[s] ~ normal(district_mu[idx],district_sigma[idx]);

  for (k in 1:N_students) {
    int idx = student_in_school_idx[k];
    student_mu[k] ~ normal(school_mu[idx],school_sigma[idx]);
score ~ normal(student_mu, sigma);

Yes, though I think you’ve left off the specification of partial pooling for district_sigma.

I’m coding this in raw Stan myself right now (with some more complexity in design plus my reduced-computation trick), so I’ll post back when I’m done. I am nonetheless still curious if brms can do it out-of-the-box.

1 Like

FYI, I’ve been playing with this a bunch and have discovered that it really matters whether you employ the centered or non-centered parameterization, and independently so for each level of the hierarchy. For example, I’m finding that with the simulation space I’m exploring right now, the schools have to be non-centered and the students have to be centered.

I’ll be posting a mini case-study on this shortly.

@betanalpha made a nice figure giving some guidance when centered/non-centered parameterization work well.

1 Like

Yeah, it’s on my to-do to sit down and put some serious thought into how geometry of the 3-level case I’m exploring relates to that. I think the idea will be that you can have one layer of the hierarchy for which the data strongly inform and one layer where it doesn’t?

I have not thought deeply about this, my only intuitions are that the more fewer groups one describes with a higher level prior, the less informed the estimation of that prior is.

On the other hand, if this higher level prior has itself an (even …) higher prior, is estimation could also be informed by other groups on the same level.

I don’t have a strong intuition and would probably resort to simulations to figure this out and sharpen/clarify my intuition…

You can do this with MELSMs, of course; unfortunately, brms does not (currently) support adding scale models to RE-SD parameters, and though my package (LMMELSM) does support adding scale models to RE-SD parameters, it does not (currently) support more than one random factor (e.g., no crossed or nested effects).

But you may want to look at the MELSM literature; I know I’ve seen 3-levels MELSMs. In the end, it’s really not conceptually harder than a 2-level MELSM, and nesting is not really different than just including a A:B interaction of grouping terms. You basically implement it all the same way.

Edit: Also for centering/non-centering, you can write an _lp function that, for each factor, transforms each level based on whether it should be a ncp or cp transform, and increment the target accordingly. The function just needs raw RE parameters, and a specification that indicates whether each level of each factor should be treated as a ncp or cp RE.