A faster implementation of hierarchical models

Rick_RS95 · May 9, 2026, 12:06pm

Hi all!

I am sure many of you know useful tricks to make hierarchical models fit faster. I have been trying different things for a while, and I would like to share a specification that is very fast and that, as far as I know, I have not seen discussed in the Stan forums.

Because my background is in psychometrics and experimental psychology, I will use a simple example inspired by common experimental tasks such as Stroop, flanker, or Simon tasks. Let Y_{ik\ell} be the observed response for subject i on trial \ell in condition k, where k indexes two conditions, for example congruent and incongruent trials.^[1]

Y_{ikl}\sim\mathcal{N}\!\left(\mu = \alpha_i + \beta_i \cdot X_k,\;\sigma = \sigma\right),

where X_k is coded as -\frac{1}{2} and +\frac{1}{2}. Thus, \alpha_i is the subject’s average response, and \beta_i is the subject-specific experimental effect. At the population level,

\begin{bmatrix} \alpha_i\\ \beta_i\\ \end{bmatrix} \sim \mathcal{MVN}\!\left(\boldsymbol\mu = \begin{bmatrix} \mu_\alpha\\ \mu_\beta\\ \end{bmatrix},\;\mathbf\Sigma = \begin{bmatrix} \sigma^2_\alpha & \sigma_{\alpha\beta}\\ \sigma_{\alpha\beta} & \sigma^2_\beta\\ \end{bmatrix} \right).

The R code below generates the simulated data used as a running example in this post. Specifically, I simulate a balanced design with 150 subjects and 150 trials per subject and condition. The next sections show how to fit the same model to these data using three different specifications: one using brms, and two more efficient versions programmed directly in Stan. In all cases, I use the same prior distributions.

Click here to see the R code used to simulate the data

library(MASS)

# Basic gaussian hierarchical model with random interceps and slopes
# Population parameters from here: https://osf.io/preprints/psyarxiv/gv6k7_v2
set.seed(2026)
mu_alpha        <- .65
mu_beta         <- .06
sigma_alpha     <- .12
sigma_beta      <- .04
rho_ab          <- matrix(c(1,.5,.5,1), ncol=2)
sigma_trial     <- .18
Sigma_REs <- diag(c(sigma_alpha, sigma_beta)) %*% rho_ab %*%
  diag(c(sigma_alpha, sigma_beta))

# Subjects and trials per subject and coondition (2, e.g., congruent/incongruent)
I <- 150
L <- 150

# Simulate random effects
REs <- MASS::mvrnorm(
  n = I, 
  mu = c(mu_alpha, mu_beta), 
  Sigma = Sigma_REs
)

# Simulate observed data
dat <- matrix(NA, ncol = 3, nrow = I * L * 2)
cnt <- 1
for(i in 1:I) {
  for(k in 1:2) {
    
    # Alpha, beta and contidion
    alpha_i <- REs[i,1]
    beta_i  <- REs[i,2]
    X_k     <- (k - 1.5)
    mu_i    <- alpha_i + beta_i * X_k
    
    for(l in 1:L) {
      dat[cnt, 1] <- i
      dat[cnt, 2] <- k
      # Trial-level data
      dat[cnt, 3] <- rnorm(1, mu_i, sigma_trial)
      cnt <- cnt + 1
    }
  }
}

# Store as data frame
dat <- as.data.frame(dat)
colnames(dat) <- c("subj", "cond", "rt")

1. Fitting the model with `brms`

As a reference, I first fit the model using brms:

\texttt{rt = 1 + cond + (1 + cond | subj)}.

This is the most convenient specification. In my computer, this model requires 16 minutes to be estimated.

Click here to see the `brms` code

# Set condition as contrast sum
dat$cond <- dat$cond - 1.5

# Set prior distributions
priors_brms <- c(
  prior(normal(0, 1), class = "Intercept"),
  prior(normal(0, 1), class = "b", coef = "cond"),
  prior(student_t(4, 0, 0.5), class = "sd", group = "subj", coef = "Intercept"),
  prior(student_t(4, 0, 0.5), class = "sd", group = "subj", coef = "cond"),
  prior(student_t(4, 0, 0.5), class = "sigma"),
  prior(lkj(1), class = "cor", group = "subj")
)

# Fit brms model
brms_fit <- brm(
  formula = rt ~ 1 + cond + (1 + cond | subj),
  data = dat, 
  prior = priors_brms,
  iter = 2000, 
  warmup = 1000, 
  chains = 4, 
  cores = 4, 
  seed = 2026)

# Estimation time
et_brms <- mean(rowSums(get_elapsed_time(brms_fit$fit)))

Before moving to the hand-written Stan versions, it is useful to look at what brms does in the model block:

model {
  // likelihood including constants
  if (!prior_only) {
    // initialize linear predictor term
    vector[N] mu = rep_vector(0.0, N);
    mu += Intercept;
    for (n in 1:N) {
      // add more terms to the linear predictor
      mu[n] += r_1_1[J_1[n]] * Z_1_1[n] + r_1_2[J_1[n]] * Z_1_2[n];
    }
    target += normal_id_glm_lpdf(Y | Xc, mu, b, sigma);
  }
  // priors including constants
  target += lprior;
  target += std_normal_lpdf(to_vector(z_1));
}

In this Gaussian model, brms builds a temporary linear predictor of length N, where here N=I \times K \times L = 45,000, and then evaluates the likelihood in one vectorized call. The two Stan versions below avoid building this full N-length vector.

2. A hand-written Stan model using a long-format matrix

The first hand-written Stan version still uses the long-format data. The main change is that I do not build a full N-length linear predictor. Instead, I evaluate the likelihood separately for each subject, using only that subject’s slice of the long data. This only requires two indexing objects: the number of rows per subject and the row where each subject starts. If the data are sorted by subject, both can be obtained with two lines of R code:

# Number of observations for each subject
trialsum <- as.integer(rle(dat$subj)$lengths)

# First row of each subject in the long data frame
idx_start = cumsum(c(1L, head(trialsum, -1L)))

The Stan model below fits the same hierarchical model as the brms specification. The key part is the loop over subjects in the model block: for each subject, I find the relevant rows and apply the normal model to that slice of the data.

data {
  int<lower=1> N;                         // Number of observations
  int<lower=1> I;                         // Number of subjects
  matrix[N, 3] Y;                         // Columns: (1 = subj, 2 = cond, 3 = rt)
  array[I] int<lower=1> trialsum;         // Number of observed trials per subject
  array[I] int<lower=1> idx_start;        // First row of each subject in Y
}          

parameters {
    vector[2] mu_REs;                     // Population means
    vector<lower=0>[2] sd_REs;            // Population std. devs
    real<lower=0> sigma;                  // Trial-level std. dev
    cholesky_factor_corr[2] L_R;          // Population REs correlation
    matrix[2, I] REs_std;                 // Standardized REs
}

transformed parameters{
  // Scaled random effects
  matrix[I, 2] REs = rep_matrix(mu_REs', I) + 
                    (diag_pre_multiply(sd_REs, L_R) * REs_std)';
}

model {
    // Model priors
    mu_REs ~ std_normal();
    sd_REs ~ student_t(4, 0, 0.5);
    sigma  ~ student_t(4, 0, 0.5);
    L_R    ~ lkj_corr_cholesky(1);
    to_vector(REs_std) ~ std_normal();

    // Model Log-Likelihood
    for (i in 1:I) {
      // idx of the last observed raw for subject i
      int idx_end = idx_start[i] + trialsum[i] - 1;
      
      // Vectorized likelihood for subject i
      Y[idx_start[i]:idx_end, 3] ~ normal(
        REs[i, 1] + REs[i, 2] * Y[idx_start[i]:idx_end, 2],
        sigma
      );
   }
}

generated quantities {
  // Model-implied correlation matrix
  corr_matrix[2] Rho_REs = multiply_lower_tri_self_transpose(L_R);
}

Click here to see the R code for the long-format Stan model

# The efficient hierarchical model via blocking
HM_stancode_block <- "
data {
  int<lower=1> N;                         // Number of observations
  int<lower=1> I;                         // Number of subjects
  matrix[N, 3] Y;                         // Columns: (1 = subj, 2 = cond, 3 = rt)
  array[I] int<lower=1> trialsum;         // Number of observed trials per subject
  array[I] int<lower=1> idx_start;        // First row of each subject in Y
}          

parameters {
    vector[2] mu_REs;                     // Population means
    vector<lower=0>[2] sd_REs;            // Population std. devs
    real<lower=0> sigma;                  // Trial-level std. dev
    cholesky_factor_corr[2] L_R;          // Population REs correlation
    matrix[2, I] REs_std;                 // Standardized REs
}

transformed parameters{
  // Scaled random effects
  matrix[I, 2] REs = rep_matrix(mu_REs', I) + 
                    (diag_pre_multiply(sd_REs, L_R) * REs_std)';
}

model {
    // Model priors
    mu_REs ~ std_normal();
    sd_REs ~ student_t(4, 0, 0.5);
    sigma  ~ student_t(4, 0, 0.5);
    L_R    ~ lkj_corr_cholesky(1);
    to_vector(REs_std) ~ std_normal();

    // Model Log-Likelihood
    for (i in 1:I) {
      // idx of the last observed raw for subject i
      int idx_end = idx_start[i] + trialsum[i] - 1;
      
      // Vectorized likelihood for subject i
      Y[idx_start[i]:idx_end, 3] ~ normal(
        REs[i, 1] + REs[i, 2] * Y[idx_start[i]:idx_end, 2],
        sigma
      );
   }
}

generated quantities {
  // Model-implied correlation matrix
  corr_matrix[2] Rho_REs = multiply_lower_tri_self_transpose(L_R);
}
"

# Store trialsum
trialsum <- as.integer(rle(dat$subj)$lengths)

# Prepare Stan data
sdata_block <- list(
  N = nrow(dat),
  I = length(unique(dat$subj)),
  Y = as.matrix(dat),
  trialsum = trialsum,
  idx_start = as.integer(cumsum(c(1L, head(trialsum, -1L))))
)

# Fit stan model
fit_stan_block <- stan(
  model_code = HM_stancode_block, 
  data = sdata_block, 
  iter = 2000, 
  warmup = 1000, 
  chains = 4, 
  cores = 4, 
  seed = 2026)

# Estimation time
et_stan_block <- mean(rowSums(get_elapsed_time(fit_stan_block)))

On my computer, this model takes a little under three minutes to fit, which makes it about six times faster than the brms specification.

3. A hand-written Stan model using a balanced array

The second hand-written Stan version uses an array representation of the data. Because the simulated design is balanced, the response variable can be stored as

Y[i, k, \ell],

where i indexes subjects, k indexes conditions, and \ell indexes trials. This makes the likelihood very compact: for each subject and condition, all trials share the same linear predictor, so the likelihood can be vectorized over trials:

data {
  int<lower=1> I;                         // Number of subjects
  int<lower=1> L;                         // Number of trials per subj and cond
  array[I, 2, L] real Y;                  // Array with observed responses 
}          

parameters {
    vector[2] mu_REs;                     // Population means
    vector<lower=0>[2] sd_REs;            // Population std. devs
    real<lower=0> sigma;                  // Trial-level std. dev
    cholesky_factor_corr[2] L_R;          // Population REs correlation
    matrix[2, I] REs_std;                 // Standardized REs
}

transformed parameters{
  // Scaled random effects
  matrix[I, 2] REs = rep_matrix(mu_REs', I) + 
                    (diag_pre_multiply(sd_REs, L_R) * REs_std)';
}

model {
    // Model priors
    mu_REs ~ std_normal();
    sd_REs ~ student_t(4, 0, 0.5);
    sigma  ~ student_t(4, 0, 0.5);
    L_R    ~ lkj_corr_cholesky(1);
    to_vector(REs_std) ~ std_normal();

    // Model Log-Likelihood
    for (i in 1:I) {
        for(k in 1:2) {
            // Linear prediction
            real mu_ik = REs[i, 1] + REs[i, 2] * (k - 1.5);
            
            // Vectorized log-likelihood
            Y[i, k, ] ~ normal(mu_ik, sigma);
      }
   }
}

generated quantities {
  // Model-implied correlation matrix
  corr_matrix[2] Rho_REs = multiply_lower_tri_self_transpose(L_R);
}

Click here to see the R code for the array-based Stan model

# The efficient hierarchical model
HM_stancode_array <- "
data {
  int<lower=1> I;                         // Number of subjects
  int<lower=1> L;                         // Number of trials per subj and cond
  array[I, 2, L] real Y;                  // Array with observed responses 
}          

parameters {
    vector[2] mu_REs;                     // Population means
    vector<lower=0>[2] sd_REs;            // Population std. devs
    real<lower=0> sigma;                  // Trial-level std. dev
    cholesky_factor_corr[2] L_R;          // Population REs correlation
    matrix[2, I] REs_std;                 // Standardized REs
}

transformed parameters{
  // Scaled random effects
  matrix[I, 2] REs = rep_matrix(mu_REs', I) + 
                    (diag_pre_multiply(sd_REs, L_R) * REs_std)';
}

model {
    // Model priors
    mu_REs ~ std_normal();
    sd_REs ~ student_t(4, 0, 0.5);
    sigma  ~ student_t(4, 0, 0.5);
    L_R    ~ lkj_corr_cholesky(1);
    to_vector(REs_std) ~ std_normal();

    // Model Log-Likelihood
    for (i in 1:I) {
        for(k in 1:2) {
            // Linear prediction
            real mu_ik = REs[i, 1] + REs[i, 2] * (k - 1.5);
            
            // Vectorized log-likelihood
            Y[i, k, ] ~ normal(mu_ik, sigma);
      }
   }
}

generated quantities {
  // Model-implied correlation matrix
  corr_matrix[2] Rho_REs = multiply_lower_tri_self_transpose(L_R);
}
"

# Store data in a three-dimensional array
Y_array <- aperm(array(dat$rt, dim = c(L, 2, I)), c(3, 2, 1))

# Prepare Stan data
sdata_array <- list(
  I = I,
  L = L,
  Y = Y_array
)

# Fit the model
fit_stan_array <- stan(
  model_code = HM_stancode_array,
  data = sdata_array,
  iter = 2000,
  warmup = 1000,
  chains = 4,
  cores = 4,
  seed = 2026
)

# Estimation time
et_stan_array <- mean(rowSums(get_elapsed_time(fit_stan_array)))

Here I use the simplest version, because the simulated design is balanced. The same idea also works with unbalanced data, as long as we can keep track of which observations belong to each subject. In fact, I used this strategy with unbalanced data in my own work, based on Haines et al. (2025), which is where I first saw this way of specifying hierarchical models.

On my computer, this model takes about 27 seconds to fit. This makes it around 6 times faster than the previous Stan version, and around 35 times faster than the brms specification.

4. Model comparison

The table below shows the elapsed time for the three approaches on my computer:

Model	Seconds	Minutes	Relative to array
`brms`	962.6	16.04	34.9
Stan, long-data	171.3	2.86	6.2
Stan, array	27.6	0.46	1.0

Below, I also report the estimates of the group-level parameters obtained with the three models.

# ------------------- #
#    Results: brms    #
# ------------------- #

 Family: gaussian 
  Links: mu = identity 
Formula: rt ~ 1 + cond + (1 + cond | subj) 
   Data: dat (Number of observations: 45000) 
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000

Multilevel Hyperparameters:
~subj (Number of levels: 150) 
                    Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept)           0.12      0.01     0.11     0.14 1.06      138      311
sd(cond)                0.04      0.00     0.03     0.04 1.01     1039     2195
cor(Intercept,cond)     0.64      0.06     0.51     0.75 1.01      942     2183

Regression Coefficients:
          Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept     0.66      0.01     0.64     0.67 1.03       90      182
cond          0.05      0.00     0.05     0.06 1.01      278      667

Further Distributional Parameters:
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma     0.18      0.00     0.18     0.18 1.00     6770     2631

# ----------------------------------- #
#    Results: Long format approach    #
# ----------------------------------- #

# Results: Long format approach
  parameter      variable       mean       sd median   q2.5  q97.5  rhat ess_bulk ess_tail
  <chr>          <chr>         <dbl>    <dbl>  <dbl>  <dbl>  <dbl> <dbl>    <dbl>    <dbl>
1 mu_alpha       mu_REs[1]    0.654  0.00957  0.654  0.636  0.674   1.01     41.7     93.5
2 mu_beta        mu_REs[2]    0.0524 0.00342  0.0524 0.0457 0.0590  1.01    154.     717. 
3 sigma_alpha    sd_REs[1]    0.123  0.00738  0.123  0.110  0.139   1.02    129.     196. 
4 sigma_beta     sd_REs[2]    0.0377 0.00282  0.0376 0.0324 0.0436  1.00    851.    1826. 
5 sigma          sigma        0.181  0.000586 0.181  0.180  0.182   1.00   7633.    2823. 
6 rho_alpha_beta Rho_REs[1,2] 0.636  0.0619   0.639  0.503  0.748   1.00    755.    1665.

# ----------------------------- #
#    Results: Array approach    #
# ----------------------------- #

  parameter      variable       mean       sd median   q2.5  q97.5  rhat ess_bulk ess_tail
  <chr>          <chr>         <dbl>    <dbl>  <dbl>  <dbl>  <dbl> <dbl>    <dbl>    <dbl>
1 mu_alpha       mu_REs[1]    0.655  0.0100   0.655  0.635  0.675  1.01      85.0     164.
2 mu_beta        mu_REs[2]    0.0527 0.00348  0.0526 0.0460 0.0594 1.00     277.      891.
3 sigma_alpha    sd_REs[1]    0.124  0.00713  0.123  0.111  0.138  1.00     240.      530.
4 sigma_beta     sd_REs[2]    0.0379 0.00291  0.0378 0.0326 0.0441 1.00    1164.     1814.
5 sigma          sigma        0.181  0.000589 0.181  0.180  0.182  1.000   6423.     2537.
6 rho_alpha_beta Rho_REs[1,2] 0.634  0.0641   0.638  0.494  0.748  1.00     845.     1920.

To conclude

Overall, I think there may be room for improving the way some common hierarchical models are implemented in experimental psychology, including cognitive-process models such as drift diffusion models, which can be especially slow. I am not sure how far this idea generalizes to other hierarchical settings, such as longitudinal models with autoregressive structure.

In my case, the array-based approach was useful because, in my main work, I extend the previous model to J experimental tasks. The population-level model then becomes something like

\begin{bmatrix} \boldsymbol{\alpha}_i\\ \boldsymbol{\beta}_i \end{bmatrix} \sim \mathcal{MVN}_{2\times J}\!\left( \boldsymbol\mu = \begin{bmatrix} \boldsymbol{\mu}_\alpha\\ \boldsymbol{\mu}_\beta \end{bmatrix}, \mathbf\Sigma = \begin{bmatrix} \mathbf{\Sigma}_{\alpha} & \mathbf{\Sigma}_{\alpha\beta}\\ \mathbf{\Sigma}_{\beta\alpha} & \mathbf{\Sigma}_{\beta} \end{bmatrix} \right),

where \boldsymbol{\alpha}_i and \boldsymbol{\beta}_i contain subject i’s intercepts and experimental effects across J tasks. As a fun fact, in my own version I also parameterize \mathbf{\Sigma}_{\beta} as \mathbf{\Lambda}\mathbf{\Phi}\mathbf{\Lambda}^{\top} + \mathbf{\Psi}, as in a psychometric factor model.

This model specification, with or without the latent factor parameterization, becomes computationally expensive in brms. For example, I tried a version with 200 subjects, 6 tasks, and 100 trials per task. The array-based Stan version took about 5 minutes to fit, whereas the same model in brms took around 40 hours.

Of course, this is much less flexible than brms for specifying arbitrary models. Still, I wonder whether this kind of structured likelihood could be useful in other settings too. I would be very curious to hear what others think.

I leave below the full R code that I used for this illustration.

Click here to see the full R code used in this illustration

# ╔════════════════════════════════════════════════════════════════════════════╗
# ║                             SCRIPT OVERVIEW                                ║
# ╠════════════════════════════════════════════════════════════════════════════╣
# ║ Script Name   : stan_eff_hiermods.R                                        ║
# ║ Author        : Ricardo Rey-Sáez                                           ║
# ║ Role          : PhD Candidate in Psychology                                ║
# ║ Institution   : Autonomous University of Madrid, Spain                     ║
# ║ Email         : ricardoreysaez95@gmail.es                                  ║
# ║ Date          : 09-05-2026                                                 ║
# ╚════════════════════════════════════════════════════════════════════════════╝

# ─────────────────────────────────────────────────────────────────────────────
# SECTION 1: Load Packages
# ─────────────────────────────────────────────────────────────────────────────
# Libraries necessary for the script to function
library(MASS)
library(dplyr)
library(brms)
library(rstan)

# ─────────────────────────────────────────────────────────────────────────────

# ─────────────────────────────────────────────────────────────────────────────
# SECTION 2: User-Defined Functions
# ─────────────────────────────────────────────────────────────────────────────

# Basic gaussian hierarchical model with random interceps (alpha) and slopes (beta)
# Population parameters from here: https://osf.io/preprints/psyarxiv/gv6k7_v2
set.seed(2026)
mu_alpha        <- .65
mu_beta         <- .06
sigma_alpha     <- .12
sigma_beta      <- .04
rho_ab          <- matrix(c(1,.5,.5,1), ncol=2)
sigma_trial     <- .18
Sigma_REs <- diag(c(sigma_alpha, sigma_beta)) %*% rho_ab %*%
  diag(c(sigma_alpha, sigma_beta))

# Subjects and trials per subject and coondition (2, e.g., congruent/incongruent)
I <- 150
L <- 150

# Simulate random effects
REs <- MASS::mvrnorm(
  n = I, 
  mu = c(mu_alpha, mu_beta), 
  Sigma = Sigma_REs
)

# Simulate observed data
dat <- matrix(NA, ncol = 3, nrow = I * L * 2)
cnt <- 1
for(i in 1:I) {
  for(k in 1:2) {
    
    # Alpha, beta and contidion
    alpha_i <- REs[i,1]
    beta_i  <- REs[i,2]
    X_k     <- (k - 1.5)
    mu_i    <- alpha_i + beta_i * X_k
    
    for(l in 1:L) {
      dat[cnt, 1] <- i
      dat[cnt, 2] <- k
      # Trial-level data
      dat[cnt, 3] <- rnorm(1, mu_i, sigma_trial)
      cnt <- cnt + 1
    }
  }
}

# Store as data frame
dat <- as.data.frame(dat)
colnames(dat) <- c("subj", "cond", "rt")

# ─────────────────────────────────────────────────────────────────────────────

# ─────────────────────────────────────────────────────────────────────────────
# SECTION 3: Simulate and fit brms model
# ─────────────────────────────────────────────────────────────────────────────

# Set condition as contrast sum
dat$cond <- dat$cond - 1.5

# Set prior distributions
priors_brms <- c(
  prior(normal(0, 1), class = "Intercept"),
  prior(normal(0, 1), class = "b", coef = "cond"),
  prior(student_t(4, 0, 0.5), class = "sd", group = "subj", coef = "Intercept"),
  prior(student_t(4, 0, 0.5), class = "sd", group = "subj", coef = "cond"),
  prior(student_t(4, 0, 0.5), class = "sigma"),
  prior(lkj(1), class = "cor", group = "subj")
)

# Fit brms model
brms_fit <- brm(
  formula = rt ~ 1 + cond + (1 + cond | subj),
  data = dat, 
  prior = priors_brms,
  iter = 2000, 
  warmup = 1000, 
  chains = 4, 
  cores = 4, 
  seed = 2026)

# Estimation time
et_brms <- mean(rowSums(get_elapsed_time(brms_fit$fit)))

# ─────────────────────────────────────────────────────────────────────────────

# ─────────────────────────────────────────────────────────────────────────────
# SECTION 4: Fit efficient hierarchical model --- block
# ─────────────────────────────────────────────────────────────────────────────

# The efficient hierarchical model via blocking
HM_stancode_block <- "
data {
  int<lower=1> N;                         // Number of observations
  int<lower=1> I;                         // Number of subjects
  matrix[N, 3] Y;                         // Columns: (1 = subj, 2 = cond, 3 = rt)
  array[I] int<lower=1> trialsum;         // Number of observed trials per subject
  array[I] int<lower=1> idx_start;        // First row of each subject in Y
}          

parameters {
    vector[2] mu_REs;                     // Population means
    vector<lower=0>[2] sd_REs;            // Population std. devs
    real<lower=0> sigma;                  // Trial-level std. dev
    cholesky_factor_corr[2] L_R;          // Population REs correlation
    matrix[2, I] REs_std;                 // Standardized REs
}

transformed parameters{
  // Scaled random effects
  matrix[I, 2] REs = rep_matrix(mu_REs', I) + 
                    (diag_pre_multiply(sd_REs, L_R) * REs_std)';
}

model {
    // Model priors
    mu_REs ~ std_normal();
    sd_REs ~ student_t(4, 0, 0.5);
    sigma  ~ student_t(4, 0, 0.5);
    L_R    ~ lkj_corr_cholesky(1);
    to_vector(REs_std) ~ std_normal();

    // Model Log-Likelihood
    for (i in 1:I) {
      // idx of the last observed raw for subject i
      int idx_end = idx_start[i] + trialsum[i] - 1;
      
      // Vectorized likelihood for subject i
      Y[idx_start[i]:idx_end, 3] ~ normal(
        REs[i, 1] + REs[i, 2] * Y[idx_start[i]:idx_end, 2],
        sigma
      );
   }
}

generated quantities {
  // Model-implied correlation matrix
  corr_matrix[2] Rho_REs = multiply_lower_tri_self_transpose(L_R);
}
"

# Store trialsum
trialsum <- as.integer(rle(dat$subj)$lengths)

# Prepare Stan data
sdata_block <- list(
  N = nrow(dat),
  I = length(unique(dat$subj)),
  Y = as.matrix(dat),
  trialsum = trialsum,
  idx_start = as.integer(cumsum(c(1L, head(trialsum, -1L))))
)

# Fit stan model
fit_stan_block <- stan(
  model_code = HM_stancode_block, 
  data = sdata_block, 
  iter = 2000, 
  warmup = 1000, 
  chains = 4, 
  cores = 4, 
  seed = 2026)

# Estimation time
et_stan_block <- mean(rowSums(get_elapsed_time(fit_stan_block)))

# ─────────────────────────────────────────────────────────────────────────────

# ─────────────────────────────────────────────────────────────────────────────
# SECTION 5: Fit efficient hierarchical model --- array
# ─────────────────────────────────────────────────────────────────────────────

# The efficient hierarchical model
HM_stancode_array <- "
data {
  int<lower=1> I;                         // Number of subjects
  int<lower=1> L;                         // Number of trials per subj and cond
  array[I, 2, L] real Y;                  // Array with observed responses 
}          

parameters {
    vector[2] mu_REs;                     // Population means
    vector<lower=0>[2] sd_REs;            // Population std. devs
    real<lower=0> sigma;                  // Trial-level std. dev
    cholesky_factor_corr[2] L_R;          // Population REs correlation
    matrix[2, I] REs_std;                 // Standardized REs
}

transformed parameters{
  // Scaled random effects
  matrix[I, 2] REs = rep_matrix(mu_REs', I) + 
                    (diag_pre_multiply(sd_REs, L_R) * REs_std)';
}

model {
    // Model priors
    mu_REs ~ std_normal();
    sd_REs ~ student_t(4, 0, 0.5);
    sigma  ~ student_t(4, 0, 0.5);
    L_R    ~ lkj_corr_cholesky(1);
    to_vector(REs_std) ~ std_normal();

    // Model Log-Likelihood
    for (i in 1:I) {
        for(k in 1:2) {
            // Linear prediction
            real mu_ik = REs[i, 1] + REs[i, 2] * (k - 1.5);
            
            // Vectorized log-likelihood
            Y[i, k, ] ~ normal(mu_ik, sigma);
      }
   }
}

generated quantities {
  // Model-implied correlation matrix
  corr_matrix[2] Rho_REs = multiply_lower_tri_self_transpose(L_R);
}
"

# Store data in a three-dimensional array
Y_array <- aperm(array(dat$rt, dim = c(L, 2, I)), c(3, 2, 1))

# Prepare Stan data
sdata_array <- list(
  I = I,
  L = L,
  Y = Y_array
)

# Fit stan model
fit_stan_array <- stan(
  model_code = HM_stancode_array, 
  data = sdata_array, 
  iter = 2000, 
  warmup = 1000, 
  chains = 4, 
  cores = 4, 
  seed = 2026)

# Estimation time
et_stan_array <- mean(rowSums(get_elapsed_time(fit_stan_array)))

# ─────────────────────────────────────────────────────────────────────────────

# ─────────────────────────────────────────────────────────────────────────────
# SECTION 6: Model comparison
# ─────────────────────────────────────────────────────────────────────────────

# brms model is 6 times slower
round(et_brms / et_stan_block)

# brms model is 35 times slower (!!)
round(et_brms / et_stan_array)

# stan block model is X times slower than stan array
round(et_stan_block / et_stan_array)

# Same population means
cbind(true = c(mu_alpha, mu_beta),
      brms = fixef(brms_fit)[1:2,"Estimate"], 
      stan_block = summary(fit_stan_block, "mu_REs")$summary[,"mean"],
      stan_array = summary(fit_stan_array, "mu_REs")$summary[,"mean"])

# Same population Std.devs
cbind(true = c(sigma_alpha, sigma_beta),
      brms = VarCorr(brms_fit)$subj$sd[,"Estimate"], 
      stan_block = summary(fit_stan_block, "sd_REs")$summary[,"mean"],
      stan_array = summary(fit_stan_array, "sd_REs")$summary[,"mean"])

# Same trial-level residual standard deviation
cbind(true = sigma_trial,
      brms = VarCorr(brms_fit)$residual__$sd[,"Estimate"], 
      stan_block = summary(fit_stan_block, "sigma")$summary[,"mean"],
      stan_array = summary(fit_stan_array, "sigma")$summary[,"mean"])

# Same true correlation between experimental effects
cbind(true = .5,
      brms = VarCorr(brms_fit)$subj$cor["Intercept" ,"Estimate", "cond"], 
      stan_block = summary(fit_stan_block, "Rho_REs[1,2]")$summary[,"mean"],
      stan_array = summary(fit_stan_array, "Rho_REs[1,2]")$summary[,"mean"])

# ─────────────────────────────────────────────────────────────────────────────

EDIT: I will use the Gaussian likelihood for the sake of clarity, but keep in mind that the same idea can be used with any other likelihood, especially those better suited for response times. ↩︎

ssp3nc3r · May 9, 2026, 3:51pm

In your particular example (Gaussian with a linear predictor that takes only two distinct values per subject, one for each condition) you could take advantage of sufficient statistics and the Stan model might be an order of magnitude faster than your fastest example.

You’d calculate the sufficient statistics, from your R code something like this:


cell_stats ← dat |>
group_by(subj, cond) |>
summarise(
ybar = mean(rt),
ss   = sum((rt - mean(rt))^2),
.groups = “drop”
) |>
arrange(subj, cond)

ybar_mat ← matrix(cell_stats$ybar, nrow = I, ncol = 2, byrow = TRUE)
ss_mat   ← matrix(cell_stats$ss,   nrow = I, ncol = 2, byrow = TRUE)

sdata_ss ← list(
I    = I,
K    = 2,
L    = L,
ybar = ybar_mat,
ss   = ss_mat,
x    = c(-0.5, 0.5)
)

Then in your Stan program, something like this:

data {
  int<lower=1> I;          // subjects
  int<lower=1> K;          // conditions
  int<lower=1> L;          // trials per cell (balanced)
  matrix[I, K] ybar;       // cell means
  matrix[I, K] ss;         // sum of squared deviations from cell mean
  vector[K] x;             // condition codes, e.g. (-0.5, 0.5)
}

transformed data {
  real sum_ss = sum(ss);   // constant in the parameters
  int  N_tot  = I * K * L;
}

parameters {
  vector[2] mu_re;
  vector<lower=0>[2] sigma_re;
  real<lower=0> sigma_trial;
  cholesky_factor_corr[2] L_omega;
  matrix[2, I] eta;
}

transformed parameters {
  matrix[I, 2] re = rep_matrix(mu_re', I)
                  + (diag_pre_multiply(sigma_re, L_omega) * eta)';
}

model {
  mu_re        ~ std_normal();
  sigma_re     ~ student_t(4, 0, 0.5);
  sigma_trial  ~ student_t(4, 0, 0.5);
  L_omega      ~ lkj_corr_cholesky(1);
  to_vector(eta) ~ std_normal();

  // Cell-level means as a single I x K matrix
  matrix[I, K] mu_mat = rep_matrix(re[, 1], K) + re[, 2] * x';

  // One vectorized contribution to the log-posterior
  target += -N_tot * log(sigma_trial)
          - (sum_ss + L * sum(square(ybar - mu_mat)))
            / (2 * square(sigma_trial));
}

generated quantities {
  corr_matrix[2] omega = multiply_lower_tri_self_transpose(L_omega);
}

Rick_RS95 · May 10, 2026, 9:24am

Thanks a lot! Yes, I am aware that sufficient statistics can be used with the Gaussian model. In fact, some multilevel confirmatory factor models in psychometrics use an extension of the idea you showed here, for example here or here.

I used the Gaussian likelihood mainly for the sake of clarity, to show how changing the data structure changes computation time. However, the same data-structuring idea applies directly to other likelihoods, such as ex-Gaussian or shifted-lognormal models^[1], which are widely used for response times and where we do not have an analogous sufficient-statistics shortcut to simplify the computation.

For the shifted-lognormal case, I am assuming that the non-decision time parameter varies between subjects. ↩︎

simonbrauer · May 10, 2026, 7:19pm

Thanks for writing this up and sharing. Great to see the speed comparisons.

I believe that the Stan documentation uses the same strategy in the ragged data structures section. I’ve used that for missing data problems, both where the observations are iid conditional on the respondent-level effects (as in HLM) and multivariate normal cases where you subset the covariance matrix as well (as with FIML). Both work well, though I’ve never timed them.

Rick_RS95 · May 11, 2026, 7:00am

Thanks for sharing! I did not know that section of the Stan documentation on ragged data structures. I see that the example there uses segment() to work with slices of a vector. I think what I did here is like a hand-crafted version of using block() to slice data matrices (or it would be if I had included more than one predictor).

I actually started comparing computation times out of necessity. What surprised me most was the speed of the array-based version… I am not sure how far this strategy can be generalized to other settings for computational gains, although I imagine that the benefits will mostly show up in large datasets. Still, those are quite common in several empirical settings.

avehtari · May 11, 2026, 2:47pm

See also stanc optimization options for structs-of-arrays optimization in Options for improving Stan sampling speed – The Stan Blog, which can give additional speed for vectorized operations. Not certain whether it applies here, but it is possible that your code change is also improving the speed due to better memory locality.

Rick_RS95 · May 12, 2026, 7:41am

Thanks! I did not know about the structs-of-arrays optimization option in stanc. I will take a look at it in case it can further improve the estimation time!

Henrik_Singmann · May 12, 2026, 8:33am

@paul.buerkner Any chance that brms could allow the user to choose a different parameterisation other than the standard one, such as option 2 in this example? I could see cases where this would lead to faster fitting times but I also do not want to give up using brms.

simonbrauer · May 13, 2026, 4:32pm

A couple follow-up suggestions…

Rick_RS95:

model {
  // likelihood including constants
  if (!prior_only) {
    // initialize linear predictor term
    vector[N] mu = rep_vector(0.0, N);
    mu += Intercept;
    for (n in 1:N) {
      // add more terms to the linear predictor
      mu[n] += r_1_1[J_1[n]] * Z_1_1[n] + r_1_2[J_1[n]] * Z_1_2[n];
    }
    target += normal_id_glm_lpdf(Y | Xc, mu, b, sigma);
  }
  // priors including constants
  target += lprior;
  target += std_normal_lpdf(to_vector(z_1));
}

Seems like you should be able to streamline the base brms code, either by vectorizing the assignment or avoiding assignment altogether. Assuming the assignment is the source of slowdown (I have poor intuitions on these things), then you could minimally modify the base code, maybe even with simple string replacement.

// Updated version 1: Vectorize assignment of mu
model {
  // likelihood including constants
  if (!prior_only) {
    // initialize linear predictor term
    vector[N] mu = Intercept + r_1_1[J_1] .* Z_1_1 + r_1_2[J_1] .* Z_1_2;
    target += normal_id_glm_lpdf(Y | Xc, mu, b, sigma);
  }
  // priors including constants
  target += lprior;
  target += std_normal_lpdf(to_vector(z_1));
}

// Updated version 2: No assignment
model {
  // likelihood including constants
  if (!prior_only) {
    // initialize linear predictor term
    target += normal_id_glm_lpdf(Y | Xc, Intercept + r_1_1[J_1] .* Z_1_1 + r_1_2[J_1] .* Z_1_2, b, sigma);
  }
  // priors including constants
  target += lprior;
  target += std_normal_lpdf(to_vector(z_1));
}

You might also using the profiling functionality in Stan to confirm the source of the slowdown.

// Updated version 3: profiling all 3
model{
  if (!prior_only) {
    vector[3] lp;

    profile("Base brms"){
      // initialize linear predictor term
      vector[N] mu = rep_vector(0.0, N);
      mu += Intercept;
      for (n in 1:N) {
        // add more terms to the linear predictor
        mu[n] += r_1_1[J_1[n]] * Z_1_1[n] + r_1_2[J_1[n]] * Z_1_2[n];
      }

      lp[1] = normal_id_glm_lpdf(Y | Xc, mu, b, sigma);
    }

    profile("Vectorized assignment"){
      real lp;
      // initialize linear predictor term
      vector[N] mu = Intercept + r_1_1[J_1] * Z_1_1 + r_1_2[J_1] * Z_1_2;
  
      lp[2] = normal_id_glm_lpdf(Y | Xc, mu, b, sigma);
    }
    
    profile("No assignment"){
      lp[3] = normal_id_glm_lpdf(Y | Xc, Intercept + r_1_1[J_1] * Z_1_1 + r_1_2[J_1] * Z_1_2, b, sigma);
    }

    target += lp[1];
  }

  // priors including constants
  target += lprior;
  target += std_normal_lpdf(to_vector(z_1));
}

Rick_RS95 · May 15, 2026, 2:58pm

Hi Simon,

Thanks a lot for the suggestion. I tried the three specifications you proposed, together with the slice-based and array-based Stan models, and one extra model specification.

EDIT: The Sparse-Matrix Specification

Since we are comparing different model implementations, I decided to also include a version of this hierarchical model using sparse matrix operations. Multilevel models are often presented using the following formulation:

\mathbf{y} \sim \mathcal{MVN}\!_N\!\left( \boldsymbol{\mu} = \mathbf{X}\cdot\boldsymbol{\beta} + \mathbf{Z}\cdot\mathbf{u},\; \mathbf{\Sigma} = \sigma_e^2\cdot \mathbf{I}_N \right)\!,

where \mathbf{X} is the fixed-effects design matrix, \boldsymbol{\beta} is the vector of fixed-effect coefficients, \mathbf{Z} is the random-effects design matrix, \mathbf{u} is the vector of random effects, \sigma_e is the residual standard deviation, and \mathbf{I}_N is the N \times N identity matrix. The matrix \mathbf{Z} can be stored in Stan using the sparse CSR format, and Stan provides a function to compute the product \mathbf{Z}\cdot\mathbf{u} efficiently (csr_matrix_times_vector()), without explicitly constructing the full sparse matrix.

The R code at the end of the post includes a helper function to generate all the required Stan inputs from a data frame and an lme4-style formula.

Sparse Matrix Stan code

data {
  int<lower=1> N;                  // Number of observations
  int<lower=1> P;                  // Number of fixed predictors
  int<lower=1> J;                  // Number of level-2 units
  int<lower=1> b;                  // Number of random effects per group

  matrix[N, P] X;                  // Fixed-effects design matrix
  vector[N] Y;                     // Outcome vector

  // CSR format for random-effects design matrix Z
  int<lower=1> Nz;                         // Number of non-zero entries in Z
  vector[Nz] Z_values;                     // Non-zero values of Z
  array[Nz] int<lower=1> Z_cols;           // Column indices for Z
  array[N + 1] int<lower=1> Z_row_starts;  // Row starts for Z
}

parameters {
  vector[P] beta;                  // Fixed effects
  cholesky_factor_corr[b] L_b;     // Cholesky factor of RE correlation matrix
  vector<lower=0>[b] sigma_b;      // RE standard deviations
  real<lower=0> sigma_e;           // Residual standard deviation
  vector[b * J] Z_u;               // Standard-normal random effects
}

transformed parameters {
  vector[b * J] u;

  {
    matrix[b, b] L_u = diag_pre_multiply(sigma_b, L_b);
    matrix[b, J] Z_u_mat = to_matrix(Z_u, b, J);

    u = to_vector(L_u * Z_u_mat);
  }
}

model {
  // Priors matching the other models
  beta ~ normal(0, 1);
  sigma_b ~ student_t(4, 0, 0.5);
  sigma_e ~ student_t(4, 0, 0.5);
  L_b ~ lkj_corr_cholesky(1);
  Z_u ~ std_normal();

  // Likelihood
    vector[N] mu;
    profile(\"sparse_build_mu\") {
      mu = X * beta + csr_matrix_times_vector(N, b * J, Z_values, Z_cols, Z_row_starts, u);
    }
    profile(\"sparse_log_lik\") {
      Y ~ normal(mu, sigma_e);
    }
}

generated quantities {
  matrix[b, b] Rho_REs = multiply_lower_tri_self_transpose(L_b);
  matrix[b, b] Sigma_REs = quad_form_diag(Rho_REs, sigma_b);
}

I include the full R code at the end of this response so that the benchmark can be copied and run directly.

I profiled two main parts of the computation: (1) the construction of the linear predictor mu, when this was done explicitly, and (2) the evaluation of the log-likelihood, either with mu already constructed or with its construction embedded in the likelihood call.

As a broad overview, the five models differ in the following model blocks: