Example hierarchical models with 3+ levels of hierarchy?

Are there any example hierarchical models anywhere with 3 or more levels of hierarchy? For example, one score from each of multiple students, each in precisely one of multiple schools, each in precisely one of multiple districts?

Embarrassing as it is, I’m having trouble wrapping my head around implementing something like that and I’d like to see how some canonical examples approach it to help get past what I’m sure is going to be a head-smacking block.

I just coded up a quick example using brms. Does this help you?

# number of students/scores
N_g <- 50
# number of schools
N_j <- 20
# number of districts
N_k <- 10
# total cases
N <- N_g * N_j * N_k

# intercept (e.g. IQ scores)
alpha <- 100
student_vec <- 1:N
school_vec <- rep(1:N_j, each=N_g, times = N_k)
district_vec <- rep(1:N_k, each=N_j*N_g)
# random intercept for schools
r_j <- rnorm(N_j, 0, 10)
# random intercepts for districts
r_k <- rnorm(N_k, 0, 5)
# linear predictor
mu <- alpha + r_j[school_vec] + r_k[district_vec]
# noise
sigma <- 5

# simulate data
y <- rnorm(N, mu, sigma)

# create data frame
d <- data.frame(
  school = school_vec, 
  district = district_vec

# fit the model (can take a few minutes depending on N)
fit <- brms::brm(
  y ~ 1 + (1 | school) + (1 | district), 


And the output from a representative run:

Family: gaussian 
  Links: mu = identity; sigma = identity 
Formula: y ~ 1 + (1 | school) + (1 | district) 
   Data: d (Number of observations: 10000) 
Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup samples = 4000

Group-Level Effects: 
~district (Number of levels: 10) 
              Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
sd(Intercept)     3.72      1.06     2.29     6.17        622 1.01

~school (Number of levels: 20) 
              Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
sd(Intercept)    11.44      1.91     8.25    15.86        512 1.00

Population-Level Effects: 
          Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
Intercept   100.73      2.81    95.08   106.36        369 1.00

Family Specific Parameters: 
      Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
sigma     4.99      0.03     4.93     5.06       1389 1.00


Where the SD among students within school i is \sigma_i, I’m hoping to have partially-pool the values as log(\sigma_i) \sim N(\mu_{schools\sigma},\sigma_{schools\sigma})

Then, where the SD among schools within district j is \sigma_j, I’m hoping to have partially-pool the values as log(\sigma_j) \sim N(\mu_{districts\sigma},\sigma_{districts\sigma})

Does brms permit that kind of complexity? (I’m still not really familiar with it; more a raw-Stan person)

So you want heteroskedastic variances, varying by school and district?

brms can handle a variety of distribution-specific parameter formulas. For a model with a Gaussian likelihood, you could do something like:

fit <- brms::brm(
    y ~ 1 + (1 | school) + (1 | district), 
    sigma ~ 1 + (1 | school) + (1 | district),

although this is a much more difficult beast to fit, in my experience. I have a two-level hierarchical ordinal probit model with heteroskedastic variances here, if your interested: https://github.com/ConorGoold/GooldNewberry_modelling_shelter_dog_behaviour/blob/master/Stan_full_model.stan


Wow, that’s remarkably simple! (Specification-wise; I take your point that the geometry may reflect a difficult space to sample)

Yes, brms is excellent! Although I’m not actually a frequent user, tending towards raw Stan too.

I would probably code up those types of models in raw Stan anyway, because they will likely need careful tuning when it comes to fitting on real data sets with other complexities.

1 Like

Hm, on actually inspecting the generated Stan code, I actually don’t think this gets at the kind of nested hierarchy with heterogeneity I was seeking to implement. The specification:

          y ~ 1 + (1 | school) + (1 | district)
   ,  sigma ~ 1 + (1 | school) + (1 | district)

Allows for school and district random effects on the magnitude of variability among students, but still expresses that there is a single SD that conveys how schools vary from one another in their mean in each district. The idea I was trying to implement would have a separate SD per district to capture variability across districts in how schools’ means vary. Ditto separate SD per district to capture variability across districts in how schools’ SDs vary.

To be concrete, here’s some pseudocode of the generative process I’m thinking of (easiest to parse if you start at the innermost for loop):

for(this_district in districts){

	this_district_mean_school_mean = rnorm(1,mean_district_mean_school_mean,sd_district_mean_school_mean)
	this_district_sd_school_mean = exp(rnorm(1,mean_district_logsd_school_mean,sd_district_logsd_school_mean))

	this_district_mean_school_logsd = rnorm(1,mean_district_mean_school_logsd,sd_district_mean_school_logsd)
	this_district_sd_school_logsd = rnorm(1,mean_district_sd_school_logsd,sd_district_sd_school_logsd)

	for(this_school in schools_in_this_district){

		this_school_mean = rnorm(1,this_district_mean_school_mean,this_district_sd_school_mean)
		this_school_sd = exp(rnorm(1,this_district_mean_school_logsd,this_district_sd_school_logsd))

		for(this_student in students_in_this_school){
			score = rnorm(1,this_school_mean,this_school_sd)

Do you mean something like this:

data {
   int N_districts;
   int N_schools;
   int N_students;
   int school_in_district_idx[N_schools];
   int student_in_school_idx[N_students];
   vector[N_students] score;

parameters {
  real top_mu;
  real<lower=0> top_sigma;

  vector[N_districts] district_mu;
  vector<lower=0>[N_districts] district_sigma;
  vector[N_schools] school_mu;
  vector<lower=0>[N_schools] school_sigma;

  vector[N_students] student_mu;
  real<lower=0> sigma;

model {
  top_mu ~ std_normal()
  top_sigma ~ std_normal()
  district_mu ~ normal(top_mu, top_sigma);

  for (s in 1:N_schools) {
     int idx = school_in_district_idx[s];
     school_mu[s] ~ normal(district_mu[idx],district_sigma[idx]);

  for (k in 1:N_students) {
    int idx = student_in_school_idx[k];
    student_mu[k] ~ normal(school_mu[idx],school_sigma[idx]);
score ~ normal(student_mu, sigma);

Yes, though I think you’ve left off the specification of partial pooling for district_sigma.

I’m coding this in raw Stan myself right now (with some more complexity in design plus my reduced-computation trick), so I’ll post back when I’m done. I am nonetheless still curious if brms can do it out-of-the-box.

1 Like

FYI, I’ve been playing with this a bunch and have discovered that it really matters whether you employ the centered or non-centered parameterization, and independently so for each level of the hierarchy. For example, I’m finding that with the simulation space I’m exploring right now, the schools have to be non-centered and the students have to be centered.

I’ll be posting a mini case-study on this shortly.

@betanalpha made a nice figure giving some guidance when centered/non-centered parameterization work well.

1 Like

Yeah, it’s on my to-do to sit down and put some serious thought into how geometry of the 3-level case I’m exploring relates to that. I think the idea will be that you can have one layer of the hierarchy for which the data strongly inform and one layer where it doesn’t?

I have not thought deeply about this, my only intuitions are that the more fewer groups one describes with a higher level prior, the less informed the estimation of that prior is.

On the other hand, if this higher level prior has itself an (even …) higher prior, is estimation could also be informed by other groups on the same level.

I don’t have a strong intuition and would probably resort to simulations to figure this out and sharpen/clarify my intuition…