A faster implementation of hierarchical models

Hi all!

I am sure many of you know useful tricks to make hierarchical models fit faster. I have been trying different things for a while, and I would like to share a specification that is very fast and that, as far as I know, I have not seen discussed in the Stan forums.

Because my background is in psychometrics and experimental psychology, I will use a simple example inspired by common experimental tasks such as Stroop, flanker, or Simon tasks. Let Y_{ik\ell} be the observed response for subject i on trial \ell in condition k, where k indexes two conditions, for example congruent and incongruent trials.[1]

Y_{ikl}\sim\mathcal{N}\!\left(\mu = \alpha_i + \beta_i \cdot X_k,\;\sigma = \sigma\right),

where X_k is coded as -\frac{1}{2} and +\frac{1}{2}. Thus, \alpha_i is the subject’s average response, and \beta_i is the subject-specific experimental effect. At the population level,

\begin{bmatrix} \alpha_i\\ \beta_i\\ \end{bmatrix} \sim \mathcal{MVN}\!\left(\boldsymbol\mu = \begin{bmatrix} \mu_\alpha\\ \mu_\beta\\ \end{bmatrix},\;\mathbf\Sigma = \begin{bmatrix} \sigma^2_\alpha & \sigma_{\alpha\beta}\\ \sigma_{\alpha\beta} & \sigma^2_\beta\\ \end{bmatrix} \right).

The R code below generates the simulated data used as a running example in this post. Specifically, I simulate a balanced design with 150 subjects and 150 trials per subject and condition. The next sections show how to fit the same model to these data using three different specifications: one using brms, and two more efficient versions programmed directly in Stan. In all cases, I use the same prior distributions.

Click here to see the R code used to simulate the data
library(MASS)

# Basic gaussian hierarchical model with random interceps and slopes
# Population parameters from here: https://osf.io/preprints/psyarxiv/gv6k7_v2
set.seed(2026)
mu_alpha        <- .65
mu_beta         <- .06
sigma_alpha     <- .12
sigma_beta      <- .04
rho_ab          <- matrix(c(1,.5,.5,1), ncol=2)
sigma_trial     <- .18
Sigma_REs <- diag(c(sigma_alpha, sigma_beta)) %*% rho_ab %*%
  diag(c(sigma_alpha, sigma_beta))

# Subjects and trials per subject and coondition (2, e.g., congruent/incongruent)
I <- 150
L <- 150

# Simulate random effects
REs <- MASS::mvrnorm(
  n = I, 
  mu = c(mu_alpha, mu_beta), 
  Sigma = Sigma_REs
)

# Simulate observed data
dat <- matrix(NA, ncol = 3, nrow = I * L * 2)
cnt <- 1
for(i in 1:I) {
  for(k in 1:2) {
    
    # Alpha, beta and contidion
    alpha_i <- REs[i,1]
    beta_i  <- REs[i,2]
    X_k     <- (k - 1.5)
    mu_i    <- alpha_i + beta_i * X_k
    
    for(l in 1:L) {
      dat[cnt, 1] <- i
      dat[cnt, 2] <- k
      # Trial-level data
      dat[cnt, 3] <- rnorm(1, mu_i, sigma_trial)
      cnt <- cnt + 1
    }
  }
}

# Store as data frame
dat <- as.data.frame(dat)
colnames(dat) <- c("subj", "cond", "rt")

1. Fitting the model with brms

As a reference, I first fit the model using brms:

\texttt{rt = 1 + cond + (1 + cond | subj)}.

This is the most convenient specification. In my computer, this model requires 16 minutes to be estimated.

Click here to see the `brms` code
# Set condition as contrast sum
dat$cond <- dat$cond - 1.5

# Set prior distributions
priors_brms <- c(
  prior(normal(0, 1), class = "Intercept"),
  prior(normal(0, 1), class = "b", coef = "cond"),
  prior(student_t(4, 0, 0.5), class = "sd", group = "subj", coef = "Intercept"),
  prior(student_t(4, 0, 0.5), class = "sd", group = "subj", coef = "cond"),
  prior(student_t(4, 0, 0.5), class = "sigma"),
  prior(lkj(1), class = "cor", group = "subj")
)

# Fit brms model
brms_fit <- brm(
  formula = rt ~ 1 + cond + (1 + cond | subj),
  data = dat, 
  prior = priors_brms,
  iter = 2000, 
  warmup = 1000, 
  chains = 4, 
  cores = 4, 
  seed = 2026)

# Estimation time
et_brms <- mean(rowSums(get_elapsed_time(brms_fit$fit)))

Before moving to the hand-written Stan versions, it is useful to look at what brms does in the model block:

model {
  // likelihood including constants
  if (!prior_only) {
    // initialize linear predictor term
    vector[N] mu = rep_vector(0.0, N);
    mu += Intercept;
    for (n in 1:N) {
      // add more terms to the linear predictor
      mu[n] += r_1_1[J_1[n]] * Z_1_1[n] + r_1_2[J_1[n]] * Z_1_2[n];
    }
    target += normal_id_glm_lpdf(Y | Xc, mu, b, sigma);
  }
  // priors including constants
  target += lprior;
  target += std_normal_lpdf(to_vector(z_1));
}

In this Gaussian model, brms builds a temporary linear predictor of length N, where here N=I \times K \times L = 45,000, and then evaluates the likelihood in one vectorized call. The two Stan versions below avoid building this full N-length vector.


2. A hand-written Stan model using a long-format matrix

The first hand-written Stan version still uses the long-format data. The main change is that I do not build a full N-length linear predictor. Instead, I evaluate the likelihood separately for each subject, using only that subject’s slice of the long data. This only requires two indexing objects: the number of rows per subject and the row where each subject starts. If the data are sorted by subject, both can be obtained with two lines of R code:

# Number of observations for each subject
trialsum <- as.integer(rle(dat$subj)$lengths)

# First row of each subject in the long data frame
idx_start = cumsum(c(1L, head(trialsum, -1L)))

The Stan model below fits the same hierarchical model as the brms specification. The key part is the loop over subjects in the model block: for each subject, I find the relevant rows and apply the normal model to that slice of the data.

data {
  int<lower=1> N;                         // Number of observations
  int<lower=1> I;                         // Number of subjects
  matrix[N, 3] Y;                         // Columns: (1 = subj, 2 = cond, 3 = rt)
  array[I] int<lower=1> trialsum;         // Number of observed trials per subject
  array[I] int<lower=1> idx_start;        // First row of each subject in Y
}          

parameters {
    vector[2] mu_REs;                     // Population means
    vector<lower=0>[2] sd_REs;            // Population std. devs
    real<lower=0> sigma;                  // Trial-level std. dev
    cholesky_factor_corr[2] L_R;          // Population REs correlation
    matrix[2, I] REs_std;                 // Standardized REs
}

transformed parameters{
  // Scaled random effects
  matrix[I, 2] REs = rep_matrix(mu_REs', I) + 
                    (diag_pre_multiply(sd_REs, L_R) * REs_std)';
}

model {
    // Model priors
    mu_REs ~ std_normal();
    sd_REs ~ student_t(4, 0, 0.5);
    sigma  ~ student_t(4, 0, 0.5);
    L_R    ~ lkj_corr_cholesky(1);
    to_vector(REs_std) ~ std_normal();

    // Model Log-Likelihood
    for (i in 1:I) {
      // idx of the last observed raw for subject i
      int idx_end = idx_start[i] + trialsum[i] - 1;
      
      // Vectorized likelihood for subject i
      Y[idx_start[i]:idx_end, 3] ~ normal(
        REs[i, 1] + REs[i, 2] * Y[idx_start[i]:idx_end, 2],
        sigma
      );
   }
}

generated quantities {
  // Model-implied correlation matrix
  corr_matrix[2] Rho_REs = multiply_lower_tri_self_transpose(L_R);
}
Click here to see the R code for the long-format Stan model
# The efficient hierarchical model via blocking
HM_stancode_block <- "
data {
  int<lower=1> N;                         // Number of observations
  int<lower=1> I;                         // Number of subjects
  matrix[N, 3] Y;                         // Columns: (1 = subj, 2 = cond, 3 = rt)
  array[I] int<lower=1> trialsum;         // Number of observed trials per subject
  array[I] int<lower=1> idx_start;        // First row of each subject in Y
}          

parameters {
    vector[2] mu_REs;                     // Population means
    vector<lower=0>[2] sd_REs;            // Population std. devs
    real<lower=0> sigma;                  // Trial-level std. dev
    cholesky_factor_corr[2] L_R;          // Population REs correlation
    matrix[2, I] REs_std;                 // Standardized REs
}

transformed parameters{
  // Scaled random effects
  matrix[I, 2] REs = rep_matrix(mu_REs', I) + 
                    (diag_pre_multiply(sd_REs, L_R) * REs_std)';
}

model {
    // Model priors
    mu_REs ~ std_normal();
    sd_REs ~ student_t(4, 0, 0.5);
    sigma  ~ student_t(4, 0, 0.5);
    L_R    ~ lkj_corr_cholesky(1);
    to_vector(REs_std) ~ std_normal();

    // Model Log-Likelihood
    for (i in 1:I) {
      // idx of the last observed raw for subject i
      int idx_end = idx_start[i] + trialsum[i] - 1;
      
      // Vectorized likelihood for subject i
      Y[idx_start[i]:idx_end, 3] ~ normal(
        REs[i, 1] + REs[i, 2] * Y[idx_start[i]:idx_end, 2],
        sigma
      );
   }
}

generated quantities {
  // Model-implied correlation matrix
  corr_matrix[2] Rho_REs = multiply_lower_tri_self_transpose(L_R);
}
"

# Store trialsum
trialsum <- as.integer(rle(dat$subj)$lengths)

# Prepare Stan data
sdata_block <- list(
  N = nrow(dat),
  I = length(unique(dat$subj)),
  Y = as.matrix(dat),
  trialsum = trialsum,
  idx_start = as.integer(cumsum(c(1L, head(trialsum, -1L))))
)

# Fit stan model
fit_stan_block <- stan(
  model_code = HM_stancode_block, 
  data = sdata_block, 
  iter = 2000, 
  warmup = 1000, 
  chains = 4, 
  cores = 4, 
  seed = 2026)

# Estimation time
et_stan_block <- mean(rowSums(get_elapsed_time(fit_stan_block)))

On my computer, this model takes a little under three minutes to fit, which makes it about six times faster than the brms specification.


3. A hand-written Stan model using a balanced array

The second hand-written Stan version uses an array representation of the data. Because the simulated design is balanced, the response variable can be stored as

Y[i, k, \ell],

where i indexes subjects, k indexes conditions, and \ell indexes trials. This makes the likelihood very compact: for each subject and condition, all trials share the same linear predictor, so the likelihood can be vectorized over trials:

data {
  int<lower=1> I;                         // Number of subjects
  int<lower=1> L;                         // Number of trials per subj and cond
  array[I, 2, L] real Y;                  // Array with observed responses 
}          

parameters {
    vector[2] mu_REs;                     // Population means
    vector<lower=0>[2] sd_REs;            // Population std. devs
    real<lower=0> sigma;                  // Trial-level std. dev
    cholesky_factor_corr[2] L_R;          // Population REs correlation
    matrix[2, I] REs_std;                 // Standardized REs
}

transformed parameters{
  // Scaled random effects
  matrix[I, 2] REs = rep_matrix(mu_REs', I) + 
                    (diag_pre_multiply(sd_REs, L_R) * REs_std)';
}

model {
    // Model priors
    mu_REs ~ std_normal();
    sd_REs ~ student_t(4, 0, 0.5);
    sigma  ~ student_t(4, 0, 0.5);
    L_R    ~ lkj_corr_cholesky(1);
    to_vector(REs_std) ~ std_normal();

    // Model Log-Likelihood
    for (i in 1:I) {
        for(k in 1:2) {
            // Linear prediction
            real mu_ik = REs[i, 1] + REs[i, 2] * (k - 1.5);
            
            // Vectorized log-likelihood
            Y[i, k, ] ~ normal(mu_ik, sigma);
      }
   }
}

generated quantities {
  // Model-implied correlation matrix
  corr_matrix[2] Rho_REs = multiply_lower_tri_self_transpose(L_R);
}

Click here to see the R code for the array-based Stan model
# The efficient hierarchical model
HM_stancode_array <- "
data {
  int<lower=1> I;                         // Number of subjects
  int<lower=1> L;                         // Number of trials per subj and cond
  array[I, 2, L] real Y;                  // Array with observed responses 
}          

parameters {
    vector[2] mu_REs;                     // Population means
    vector<lower=0>[2] sd_REs;            // Population std. devs
    real<lower=0> sigma;                  // Trial-level std. dev
    cholesky_factor_corr[2] L_R;          // Population REs correlation
    matrix[2, I] REs_std;                 // Standardized REs
}

transformed parameters{
  // Scaled random effects
  matrix[I, 2] REs = rep_matrix(mu_REs', I) + 
                    (diag_pre_multiply(sd_REs, L_R) * REs_std)';
}

model {
    // Model priors
    mu_REs ~ std_normal();
    sd_REs ~ student_t(4, 0, 0.5);
    sigma  ~ student_t(4, 0, 0.5);
    L_R    ~ lkj_corr_cholesky(1);
    to_vector(REs_std) ~ std_normal();

    // Model Log-Likelihood
    for (i in 1:I) {
        for(k in 1:2) {
            // Linear prediction
            real mu_ik = REs[i, 1] + REs[i, 2] * (k - 1.5);
            
            // Vectorized log-likelihood
            Y[i, k, ] ~ normal(mu_ik, sigma);
      }
   }
}

generated quantities {
  // Model-implied correlation matrix
  corr_matrix[2] Rho_REs = multiply_lower_tri_self_transpose(L_R);
}
"

# Store data in a three-dimensional array
Y_array <- aperm(array(dat$rt, dim = c(L, 2, I)), c(3, 2, 1))

# Prepare Stan data
sdata_array <- list(
  I = I,
  L = L,
  Y = Y_array
)

# Fit the model
fit_stan_array <- stan(
  model_code = HM_stancode_array,
  data = sdata_array,
  iter = 2000,
  warmup = 1000,
  chains = 4,
  cores = 4,
  seed = 2026
)

# Estimation time
et_stan_array <- mean(rowSums(get_elapsed_time(fit_stan_array)))

Here I use the simplest version, because the simulated design is balanced. The same idea also works with unbalanced data, as long as we can keep track of which observations belong to each subject. In fact, I used this strategy with unbalanced data in my own work, based on Haines et al. (2025), which is where I first saw this way of specifying hierarchical models.

On my computer, this model takes about 27 seconds to fit. This makes it around 6 times faster than the previous Stan version, and around 35 times faster than the brms specification.

4. Model comparison

The table below shows the elapsed time for the three approaches on my computer:

Model Seconds Minutes Relative to array
brms 962.6 16.04 34.9
Stan, long-data 171.3 2.86 6.2
Stan, array 27.6 0.46 1.0

Below, I also report the estimates of the group-level parameters obtained with the three models.

# ------------------- #
#    Results: brms    #
# ------------------- #

 Family: gaussian 
  Links: mu = identity 
Formula: rt ~ 1 + cond + (1 + cond | subj) 
   Data: dat (Number of observations: 45000) 
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000

Multilevel Hyperparameters:
~subj (Number of levels: 150) 
                    Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept)           0.12      0.01     0.11     0.14 1.06      138      311
sd(cond)                0.04      0.00     0.03     0.04 1.01     1039     2195
cor(Intercept,cond)     0.64      0.06     0.51     0.75 1.01      942     2183

Regression Coefficients:
          Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept     0.66      0.01     0.64     0.67 1.03       90      182
cond          0.05      0.00     0.05     0.06 1.01      278      667

Further Distributional Parameters:
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma     0.18      0.00     0.18     0.18 1.00     6770     2631

# ----------------------------------- #
#    Results: Long format approach    #
# ----------------------------------- #

# Results: Long format approach
  parameter      variable       mean       sd median   q2.5  q97.5  rhat ess_bulk ess_tail
  <chr>          <chr>         <dbl>    <dbl>  <dbl>  <dbl>  <dbl> <dbl>    <dbl>    <dbl>
1 mu_alpha       mu_REs[1]    0.654  0.00957  0.654  0.636  0.674   1.01     41.7     93.5
2 mu_beta        mu_REs[2]    0.0524 0.00342  0.0524 0.0457 0.0590  1.01    154.     717. 
3 sigma_alpha    sd_REs[1]    0.123  0.00738  0.123  0.110  0.139   1.02    129.     196. 
4 sigma_beta     sd_REs[2]    0.0377 0.00282  0.0376 0.0324 0.0436  1.00    851.    1826. 
5 sigma          sigma        0.181  0.000586 0.181  0.180  0.182   1.00   7633.    2823. 
6 rho_alpha_beta Rho_REs[1,2] 0.636  0.0619   0.639  0.503  0.748   1.00    755.    1665.

# ----------------------------- #
#    Results: Array approach    #
# ----------------------------- #

  parameter      variable       mean       sd median   q2.5  q97.5  rhat ess_bulk ess_tail
  <chr>          <chr>         <dbl>    <dbl>  <dbl>  <dbl>  <dbl> <dbl>    <dbl>    <dbl>
1 mu_alpha       mu_REs[1]    0.655  0.0100   0.655  0.635  0.675  1.01      85.0     164.
2 mu_beta        mu_REs[2]    0.0527 0.00348  0.0526 0.0460 0.0594 1.00     277.      891.
3 sigma_alpha    sd_REs[1]    0.124  0.00713  0.123  0.111  0.138  1.00     240.      530.
4 sigma_beta     sd_REs[2]    0.0379 0.00291  0.0378 0.0326 0.0441 1.00    1164.     1814.
5 sigma          sigma        0.181  0.000589 0.181  0.180  0.182  1.000   6423.     2537.
6 rho_alpha_beta Rho_REs[1,2] 0.634  0.0641   0.638  0.494  0.748  1.00     845.     1920.

To conclude

Overall, I think there may be room for improving the way some common hierarchical models are implemented in experimental psychology, including cognitive-process models such as drift diffusion models, which can be especially slow. I am not sure how far this idea generalizes to other hierarchical settings, such as longitudinal models with autoregressive structure.

In my case, the array-based approach was useful because, in my main work, I extend the previous model to J experimental tasks. The population-level model then becomes something like

\begin{bmatrix} \boldsymbol{\alpha}_i\\ \boldsymbol{\beta}_i \end{bmatrix} \sim \mathcal{MVN}_{2\times J}\!\left( \boldsymbol\mu = \begin{bmatrix} \boldsymbol{\mu}_\alpha\\ \boldsymbol{\mu}_\beta \end{bmatrix}, \mathbf\Sigma = \begin{bmatrix} \mathbf{\Sigma}_{\alpha} & \mathbf{\Sigma}_{\alpha\beta}\\ \mathbf{\Sigma}_{\beta\alpha} & \mathbf{\Sigma}_{\beta} \end{bmatrix} \right),

where \boldsymbol{\alpha}_i and \boldsymbol{\beta}_i contain subject i’s intercepts and experimental effects across J tasks. As a fun fact, in my own version I also parameterize \mathbf{\Sigma}_{\beta} as \mathbf{\Lambda}\mathbf{\Phi}\mathbf{\Lambda}^{\top} + \mathbf{\Psi}, as in a psychometric factor model.

This model specification, with or without the latent factor parameterization, becomes computationally expensive in brms. For example, I tried a version with 200 subjects, 6 tasks, and 100 trials per task. The array-based Stan version took about 5 minutes to fit, whereas the same model in brms took around 40 hours.

Of course, this is much less flexible than brms for specifying arbitrary models. Still, I wonder whether this kind of structured likelihood could be useful in other settings too. I would be very curious to hear what others think.

I leave below the full R code that I used for this illustration.

Click here to see the full R code used in this illustration
# ╔════════════════════════════════════════════════════════════════════════════╗
# β•‘                             SCRIPT OVERVIEW                                β•‘
# ╠════════════════════════════════════════════════════════════════════════════╣
# β•‘ Script Name   : stan_eff_hiermods.R                                        β•‘
# β•‘ Author        : Ricardo Rey-SΓ‘ez                                           β•‘
# β•‘ Role          : PhD Candidate in Psychology                                β•‘
# β•‘ Institution   : Autonomous University of Madrid, Spain                     β•‘
# β•‘ Email         : ricardoreysaez95@gmail.es                                  β•‘
# β•‘ Date          : 09-05-2026                                                 β•‘
# β•šβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•

# ─────────────────────────────────────────────────────────────────────────────
# SECTION 1: Load Packages
# ─────────────────────────────────────────────────────────────────────────────
# Libraries necessary for the script to function
library(MASS)
library(dplyr)
library(brms)
library(rstan)

# ─────────────────────────────────────────────────────────────────────────────

# ─────────────────────────────────────────────────────────────────────────────
# SECTION 2: User-Defined Functions
# ─────────────────────────────────────────────────────────────────────────────

# Basic gaussian hierarchical model with random interceps (alpha) and slopes (beta)
# Population parameters from here: https://osf.io/preprints/psyarxiv/gv6k7_v2
set.seed(2026)
mu_alpha        <- .65
mu_beta         <- .06
sigma_alpha     <- .12
sigma_beta      <- .04
rho_ab          <- matrix(c(1,.5,.5,1), ncol=2)
sigma_trial     <- .18
Sigma_REs <- diag(c(sigma_alpha, sigma_beta)) %*% rho_ab %*%
  diag(c(sigma_alpha, sigma_beta))

# Subjects and trials per subject and coondition (2, e.g., congruent/incongruent)
I <- 150
L <- 150

# Simulate random effects
REs <- MASS::mvrnorm(
  n = I, 
  mu = c(mu_alpha, mu_beta), 
  Sigma = Sigma_REs
)

# Simulate observed data
dat <- matrix(NA, ncol = 3, nrow = I * L * 2)
cnt <- 1
for(i in 1:I) {
  for(k in 1:2) {
    
    # Alpha, beta and contidion
    alpha_i <- REs[i,1]
    beta_i  <- REs[i,2]
    X_k     <- (k - 1.5)
    mu_i    <- alpha_i + beta_i * X_k
    
    for(l in 1:L) {
      dat[cnt, 1] <- i
      dat[cnt, 2] <- k
      # Trial-level data
      dat[cnt, 3] <- rnorm(1, mu_i, sigma_trial)
      cnt <- cnt + 1
    }
  }
}

# Store as data frame
dat <- as.data.frame(dat)
colnames(dat) <- c("subj", "cond", "rt")

# ─────────────────────────────────────────────────────────────────────────────

# ─────────────────────────────────────────────────────────────────────────────
# SECTION 3: Simulate and fit brms model
# ─────────────────────────────────────────────────────────────────────────────

# Set condition as contrast sum
dat$cond <- dat$cond - 1.5

# Set prior distributions
priors_brms <- c(
  prior(normal(0, 1), class = "Intercept"),
  prior(normal(0, 1), class = "b", coef = "cond"),
  prior(student_t(4, 0, 0.5), class = "sd", group = "subj", coef = "Intercept"),
  prior(student_t(4, 0, 0.5), class = "sd", group = "subj", coef = "cond"),
  prior(student_t(4, 0, 0.5), class = "sigma"),
  prior(lkj(1), class = "cor", group = "subj")
)

# Fit brms model
brms_fit <- brm(
  formula = rt ~ 1 + cond + (1 + cond | subj),
  data = dat, 
  prior = priors_brms,
  iter = 2000, 
  warmup = 1000, 
  chains = 4, 
  cores = 4, 
  seed = 2026)

# Estimation time
et_brms <- mean(rowSums(get_elapsed_time(brms_fit$fit)))

# ─────────────────────────────────────────────────────────────────────────────

# ─────────────────────────────────────────────────────────────────────────────
# SECTION 4: Fit efficient hierarchical model --- block
# ─────────────────────────────────────────────────────────────────────────────

# The efficient hierarchical model via blocking
HM_stancode_block <- "
data {
  int<lower=1> N;                         // Number of observations
  int<lower=1> I;                         // Number of subjects
  matrix[N, 3] Y;                         // Columns: (1 = subj, 2 = cond, 3 = rt)
  array[I] int<lower=1> trialsum;         // Number of observed trials per subject
  array[I] int<lower=1> idx_start;        // First row of each subject in Y
}          

parameters {
    vector[2] mu_REs;                     // Population means
    vector<lower=0>[2] sd_REs;            // Population std. devs
    real<lower=0> sigma;                  // Trial-level std. dev
    cholesky_factor_corr[2] L_R;          // Population REs correlation
    matrix[2, I] REs_std;                 // Standardized REs
}

transformed parameters{
  // Scaled random effects
  matrix[I, 2] REs = rep_matrix(mu_REs', I) + 
                    (diag_pre_multiply(sd_REs, L_R) * REs_std)';
}

model {
    // Model priors
    mu_REs ~ std_normal();
    sd_REs ~ student_t(4, 0, 0.5);
    sigma  ~ student_t(4, 0, 0.5);
    L_R    ~ lkj_corr_cholesky(1);
    to_vector(REs_std) ~ std_normal();

    // Model Log-Likelihood
    for (i in 1:I) {
      // idx of the last observed raw for subject i
      int idx_end = idx_start[i] + trialsum[i] - 1;
      
      // Vectorized likelihood for subject i
      Y[idx_start[i]:idx_end, 3] ~ normal(
        REs[i, 1] + REs[i, 2] * Y[idx_start[i]:idx_end, 2],
        sigma
      );
   }
}

generated quantities {
  // Model-implied correlation matrix
  corr_matrix[2] Rho_REs = multiply_lower_tri_self_transpose(L_R);
}
"

# Store trialsum
trialsum <- as.integer(rle(dat$subj)$lengths)

# Prepare Stan data
sdata_block <- list(
  N = nrow(dat),
  I = length(unique(dat$subj)),
  Y = as.matrix(dat),
  trialsum = trialsum,
  idx_start = as.integer(cumsum(c(1L, head(trialsum, -1L))))
)

# Fit stan model
fit_stan_block <- stan(
  model_code = HM_stancode_block, 
  data = sdata_block, 
  iter = 2000, 
  warmup = 1000, 
  chains = 4, 
  cores = 4, 
  seed = 2026)

# Estimation time
et_stan_block <- mean(rowSums(get_elapsed_time(fit_stan_block)))

# ─────────────────────────────────────────────────────────────────────────────

# ─────────────────────────────────────────────────────────────────────────────
# SECTION 5: Fit efficient hierarchical model --- array
# ─────────────────────────────────────────────────────────────────────────────

# The efficient hierarchical model
HM_stancode_array <- "
data {
  int<lower=1> I;                         // Number of subjects
  int<lower=1> L;                         // Number of trials per subj and cond
  array[I, 2, L] real Y;                  // Array with observed responses 
}          

parameters {
    vector[2] mu_REs;                     // Population means
    vector<lower=0>[2] sd_REs;            // Population std. devs
    real<lower=0> sigma;                  // Trial-level std. dev
    cholesky_factor_corr[2] L_R;          // Population REs correlation
    matrix[2, I] REs_std;                 // Standardized REs
}

transformed parameters{
  // Scaled random effects
  matrix[I, 2] REs = rep_matrix(mu_REs', I) + 
                    (diag_pre_multiply(sd_REs, L_R) * REs_std)';
}

model {
    // Model priors
    mu_REs ~ std_normal();
    sd_REs ~ student_t(4, 0, 0.5);
    sigma  ~ student_t(4, 0, 0.5);
    L_R    ~ lkj_corr_cholesky(1);
    to_vector(REs_std) ~ std_normal();

    // Model Log-Likelihood
    for (i in 1:I) {
        for(k in 1:2) {
            // Linear prediction
            real mu_ik = REs[i, 1] + REs[i, 2] * (k - 1.5);
            
            // Vectorized log-likelihood
            Y[i, k, ] ~ normal(mu_ik, sigma);
      }
   }
}

generated quantities {
  // Model-implied correlation matrix
  corr_matrix[2] Rho_REs = multiply_lower_tri_self_transpose(L_R);
}
"

# Store data in a three-dimensional array
Y_array <- aperm(array(dat$rt, dim = c(L, 2, I)), c(3, 2, 1))

# Prepare Stan data
sdata_array <- list(
  I = I,
  L = L,
  Y = Y_array
)

# Fit stan model
fit_stan_array <- stan(
  model_code = HM_stancode_array, 
  data = sdata_array, 
  iter = 2000, 
  warmup = 1000, 
  chains = 4, 
  cores = 4, 
  seed = 2026)

# Estimation time
et_stan_array <- mean(rowSums(get_elapsed_time(fit_stan_array)))

# ─────────────────────────────────────────────────────────────────────────────

# ─────────────────────────────────────────────────────────────────────────────
# SECTION 6: Model comparison
# ─────────────────────────────────────────────────────────────────────────────

# brms model is 6 times slower
round(et_brms / et_stan_block)

# brms model is 35 times slower (!!)
round(et_brms / et_stan_array)

# stan block model is X times slower than stan array
round(et_stan_block / et_stan_array)

# Same population means
cbind(true = c(mu_alpha, mu_beta),
      brms = fixef(brms_fit)[1:2,"Estimate"], 
      stan_block = summary(fit_stan_block, "mu_REs")$summary[,"mean"],
      stan_array = summary(fit_stan_array, "mu_REs")$summary[,"mean"])

# Same population Std.devs
cbind(true = c(sigma_alpha, sigma_beta),
      brms = VarCorr(brms_fit)$subj$sd[,"Estimate"], 
      stan_block = summary(fit_stan_block, "sd_REs")$summary[,"mean"],
      stan_array = summary(fit_stan_array, "sd_REs")$summary[,"mean"])

# Same trial-level residual standard deviation
cbind(true = sigma_trial,
      brms = VarCorr(brms_fit)$residual__$sd[,"Estimate"], 
      stan_block = summary(fit_stan_block, "sigma")$summary[,"mean"],
      stan_array = summary(fit_stan_array, "sigma")$summary[,"mean"])

# Same true correlation between experimental effects
cbind(true = .5,
      brms = VarCorr(brms_fit)$subj$cor["Intercept" ,"Estimate", "cond"], 
      stan_block = summary(fit_stan_block, "Rho_REs[1,2]")$summary[,"mean"],
      stan_array = summary(fit_stan_array, "Rho_REs[1,2]")$summary[,"mean"])

# ─────────────────────────────────────────────────────────────────────────────


  1. EDIT: I will use the Gaussian likelihood for the sake of clarity, but keep in mind that the same idea can be used with any other likelihood, especially those better suited for response times. β†©οΈŽ

In your particular example (Gaussian with a linear predictor that takes only two distinct values per subject, one for each condition) you could take advantage of sufficient statistics and the Stan model might be an order of magnitude faster than your fastest example.

You’d calculate the sufficient statistics, from your R code something like this:


cell_stats ← dat |>
group_by(subj, cond) |>
summarise(
ybar = mean(rt),
ss   = sum((rt - mean(rt))^2),
.groups = β€œdrop”
) |>
arrange(subj, cond)

ybar_mat ← matrix(cell_stats$ybar, nrow = I, ncol = 2, byrow = TRUE)
ss_mat   ← matrix(cell_stats$ss,   nrow = I, ncol = 2, byrow = TRUE)

sdata_ss ← list(
I    = I,
K    = 2,
L    = L,
ybar = ybar_mat,
ss   = ss_mat,
x    = c(-0.5, 0.5)
)

Then in your Stan program, something like this:

data {
  int<lower=1> I;          // subjects
  int<lower=1> K;          // conditions
  int<lower=1> L;          // trials per cell (balanced)
  matrix[I, K] ybar;       // cell means
  matrix[I, K] ss;         // sum of squared deviations from cell mean
  vector[K] x;             // condition codes, e.g. (-0.5, 0.5)
}

transformed data {
  real sum_ss = sum(ss);   // constant in the parameters
  int  N_tot  = I * K * L;
}

parameters {
  vector[2] mu_re;
  vector<lower=0>[2] sigma_re;
  real<lower=0> sigma_trial;
  cholesky_factor_corr[2] L_omega;
  matrix[2, I] eta;
}

transformed parameters {
  matrix[I, 2] re = rep_matrix(mu_re', I)
                  + (diag_pre_multiply(sigma_re, L_omega) * eta)';
}

model {
  mu_re        ~ std_normal();
  sigma_re     ~ student_t(4, 0, 0.5);
  sigma_trial  ~ student_t(4, 0, 0.5);
  L_omega      ~ lkj_corr_cholesky(1);
  to_vector(eta) ~ std_normal();

  // Cell-level means as a single I x K matrix
  matrix[I, K] mu_mat = rep_matrix(re[, 1], K) + re[, 2] * x';

  // One vectorized contribution to the log-posterior
  target += -N_tot * log(sigma_trial)
          - (sum_ss + L * sum(square(ybar - mu_mat)))
            / (2 * square(sigma_trial));
}

generated quantities {
  corr_matrix[2] omega = multiply_lower_tri_self_transpose(L_omega);
}

Thanks a lot! Yes, I am aware that sufficient statistics can be used with the Gaussian model. In fact, some multilevel confirmatory factor models in psychometrics use an extension of the idea you showed here, for example here or here.

I used the Gaussian likelihood mainly for the sake of clarity, to show how changing the data structure changes computation time. However, the same data-structuring idea applies directly to other likelihoods, such as ex-Gaussian or shifted-lognormal models[1], which are widely used for response times and where we do not have an analogous sufficient-statistics shortcut to simplify the computation.


  1. For the shifted-lognormal case, I am assuming that the non-decision time parameter varies between subjects. β†©οΈŽ

Thanks for writing this up and sharing. Great to see the speed comparisons.

I believe that the Stan documentation uses the same strategy in the ragged data structures section. I’ve used that for missing data problems, both where the observations are iid conditional on the respondent-level effects (as in HLM) and multivariate normal cases where you subset the covariance matrix as well (as with FIML). Both work well, though I’ve never timed them.

Thanks for sharing! I did not know that section of the Stan documentation on ragged data structures. I see that the example there uses segment() to work with slices of a vector. I think what I did here is like a hand-crafted version of using block() to slice data matrices (or it would be if I had included more than one predictor).

I actually started comparing computation times out of necessity. What surprised me most was the speed of the array-based version… I am not sure how far this strategy can be generalized to other settings for computational gains, although I imagine that the benefits will mostly show up in large datasets. Still, those are quite common in several empirical settings.

See also stanc optimization options for structs-of-arrays optimization in Options for improving Stan sampling speed – The Stan Blog, which can give additional speed for vectorized operations. Not certain whether it applies here, but it is possible that your code change is also improving the speed due to better memory locality.

Thanks! I did not know about the structs-of-arrays optimization option in stanc. I will take a look at it in case it can further improve the estimation time!

@paul.buerkner Any chance that brms could allow the user to choose a different parameterisation other than the standard one, such as option 2 in this example? I could see cases where this would lead to faster fitting times but I also do not want to give up using brms.

A couple follow-up suggestions…

Seems like you should be able to streamline the base brms code, either by vectorizing the assignment or avoiding assignment altogether. Assuming the assignment is the source of slowdown (I have poor intuitions on these things), then you could minimally modify the base code, maybe even with simple string replacement.

// Updated version 1: Vectorize assignment of mu
model {
  // likelihood including constants
  if (!prior_only) {
    // initialize linear predictor term
    vector[N] mu = Intercept + r_1_1[J_1] .* Z_1_1 + r_1_2[J_1] .* Z_1_2;
    target += normal_id_glm_lpdf(Y | Xc, mu, b, sigma);
  }
  // priors including constants
  target += lprior;
  target += std_normal_lpdf(to_vector(z_1));
}
// Updated version 2: No assignment
model {
  // likelihood including constants
  if (!prior_only) {
    // initialize linear predictor term
    target += normal_id_glm_lpdf(Y | Xc, Intercept + r_1_1[J_1] .* Z_1_1 + r_1_2[J_1] .* Z_1_2, b, sigma);
  }
  // priors including constants
  target += lprior;
  target += std_normal_lpdf(to_vector(z_1));
}

You might also using the profiling functionality in Stan to confirm the source of the slowdown.

// Updated version 3: profiling all 3
model{
  if (!prior_only) {
    vector[3] lp;

    profile("Base brms"){
      // initialize linear predictor term
      vector[N] mu = rep_vector(0.0, N);
      mu += Intercept;
      for (n in 1:N) {
        // add more terms to the linear predictor
        mu[n] += r_1_1[J_1[n]] * Z_1_1[n] + r_1_2[J_1[n]] * Z_1_2[n];
      }

      lp[1] = normal_id_glm_lpdf(Y | Xc, mu, b, sigma);
    }

    profile("Vectorized assignment"){
      real lp;
      // initialize linear predictor term
      vector[N] mu = Intercept + r_1_1[J_1] * Z_1_1 + r_1_2[J_1] * Z_1_2;
  
      lp[2] = normal_id_glm_lpdf(Y | Xc, mu, b, sigma);
    }
    
    profile("No assignment"){
      lp[3] = normal_id_glm_lpdf(Y | Xc, Intercept + r_1_1[J_1] * Z_1_1 + r_1_2[J_1] * Z_1_2, b, sigma);
    }

    target += lp[1];
  }

  // priors including constants
  target += lprior;
  target += std_normal_lpdf(to_vector(z_1));
}

Hi Simon,

Thanks a lot for the suggestion. I tried the three specifications you proposed, together with the slice-based and array-based Stan models, and one extra model specification.


EDIT: The Sparse-Matrix Specification

Since we are comparing different model implementations, I decided to also include a version of this hierarchical model using sparse matrix operations. Multilevel models are often presented using the following formulation:

\mathbf{y} \sim \mathcal{MVN}\!_N\!\left( \boldsymbol{\mu} = \mathbf{X}\cdot\boldsymbol{\beta} + \mathbf{Z}\cdot\mathbf{u},\; \mathbf{\Sigma} = \sigma_e^2\cdot \mathbf{I}_N \right)\!,

where \mathbf{X} is the fixed-effects design matrix, \boldsymbol{\beta} is the vector of fixed-effect coefficients, \mathbf{Z} is the random-effects design matrix, \mathbf{u} is the vector of random effects, \sigma_e is the residual standard deviation, and \mathbf{I}_N is the N \times N identity matrix. The matrix \mathbf{Z} can be stored in Stan using the sparse CSR format, and Stan provides a function to compute the product \mathbf{Z}\cdot\mathbf{u} efficiently (csr_matrix_times_vector()), without explicitly constructing the full sparse matrix.

The R code at the end of the post includes a helper function to generate all the required Stan inputs from a data frame and an lme4-style formula.

Sparse Matrix Stan code
data {
  int<lower=1> N;                  // Number of observations
  int<lower=1> P;                  // Number of fixed predictors
  int<lower=1> J;                  // Number of level-2 units
  int<lower=1> b;                  // Number of random effects per group

  matrix[N, P] X;                  // Fixed-effects design matrix
  vector[N] Y;                     // Outcome vector

  // CSR format for random-effects design matrix Z
  int<lower=1> Nz;                         // Number of non-zero entries in Z
  vector[Nz] Z_values;                     // Non-zero values of Z
  array[Nz] int<lower=1> Z_cols;           // Column indices for Z
  array[N + 1] int<lower=1> Z_row_starts;  // Row starts for Z
}

parameters {
  vector[P] beta;                  // Fixed effects
  cholesky_factor_corr[b] L_b;     // Cholesky factor of RE correlation matrix
  vector<lower=0>[b] sigma_b;      // RE standard deviations
  real<lower=0> sigma_e;           // Residual standard deviation
  vector[b * J] Z_u;               // Standard-normal random effects
}

transformed parameters {
  vector[b * J] u;

  {
    matrix[b, b] L_u = diag_pre_multiply(sigma_b, L_b);
    matrix[b, J] Z_u_mat = to_matrix(Z_u, b, J);

    u = to_vector(L_u * Z_u_mat);
  }
}

model {
  // Priors matching the other models
  beta ~ normal(0, 1);
  sigma_b ~ student_t(4, 0, 0.5);
  sigma_e ~ student_t(4, 0, 0.5);
  L_b ~ lkj_corr_cholesky(1);
  Z_u ~ std_normal();

  // Likelihood
    vector[N] mu;
    profile(\"sparse_build_mu\") {
      mu = X * beta + csr_matrix_times_vector(N, b * J, Z_values, Z_cols, Z_row_starts, u);
    }
    profile(\"sparse_log_lik\") {
      Y ~ normal(mu, sigma_e);
    }
}

generated quantities {
  matrix[b, b] Rho_REs = multiply_lower_tri_self_transpose(L_b);
  matrix[b, b] Sigma_REs = quad_form_diag(Rho_REs, sigma_b);
}

I include the full R code at the end of this response so that the benchmark can be copied and run directly.

I profiled two main parts of the computation: (1) the construction of the linear predictor mu, when this was done explicitly, and (2) the evaluation of the log-likelihood, either with mu already constructed or with its construction embedded in the likelihood call.

As a broad overview, the five models differ in the following model blocks:

Model 1: standard brms model

Click here to see the stan model code with profile()
model {
  // likelihood including constants
  if (!prior_only) {
    vector[N] mu;
    profile("base_build_mu_loop") {
      // initialize linear predictor term
      mu = rep_vector(0.0, N);
      mu += Intercept;
      for (n in 1:N) {
        // add group-level terms to the linear predictor
        mu[n] += r_1_1[J_1[n]] * Z_1_1[n] + r_1_2[J_1[n]] * Z_1_2[n];
      }
    }
    profile("base_normal_id_glm") {
      target += normal_id_glm_lpdf(Y | Xc, mu, b, sigma);
    }
  }
  // priors including constants
  profile("priors") {
    target += lprior;
    target += std_normal_lpdf(to_vector(z_1));
  }
}

Model 2: brms with vectorized mu computation with allocation

Click here to see the stan model code with profile()
model {
  // likelihood including constants
  if (!prior_only) {
    vector[N] mu;
    profile("vectorized_build_mu") {
      mu = rep_vector(Intercept, N) + r_1_1[J_1] .* Z_1_1 + r_1_2[J_1] .* Z_1_2;
    }
    profile("vectorized_normal_id_glm") {
      target += normal_id_glm_lpdf(Y | Xc, mu, b, sigma);
    }
  }
  // priors including constants
  profile("priors") {
    target += lprior;
    target += std_normal_lpdf(to_vector(z_1));
  }
}

Model 3: brms with vectorized mu computation without allocation

Click here to see the stan model code with profile()
model {
  // likelihood including constants
  if (!prior_only) {
    profile("no_assignment_normal_id_glm") {
      target += normal_id_glm_lpdf(
        Y | Xc, rep_vector(Intercept, N)
          + r_1_1[J_1] .* Z_1_1
          + r_1_2[J_1] .* Z_1_2,
        b,
        sigma
      );
    }
  }
  // priors including constants
  profile("priors") {
    target += lprior;
    target += std_normal_lpdf(to_vector(z_1));
  }
}

Model 4: Stan slicing subject data

Click here to see the stan model code with profile()
model {
    // Model priors
    mu_REs ~ std_normal();
    sd_REs ~ student_t(4, 0, 0.5);
    sigma  ~ student_t(4, 0, 0.5);
    L_R    ~ lkj_corr_cholesky(1);
    to_vector(REs_std) ~ std_normal();
    // Model Log-Likelihood
    profile("subjects_log_lik") {
      for (i in 1:I) {
        // idx of the last observed raw for subject i
        int idx_end = idx_start[i] + trialsum[i] - 1;

        // Vectorized likelihood for subject i
        Y[idx_start[i]:idx_end, 3] ~ normal(
          REs[i, 1] + REs[i, 2] * Y[idx_start[i]:idx_end, 2],
          sigma);
       }
    }
}

Model 5: Stan with array data structure

Click here to see the stan model code with profile()
model {
    // Model priors
    mu_REs ~ std_normal();
    sd_REs ~ student_t(4, 0, 0.5);
    sigma  ~ student_t(4, 0, 0.5);
    L_R    ~ lkj_corr_cholesky(1);
    to_vector(REs_std) ~ std_normal();

    // Model Log-Likelihood
    profile("array_log_lik"){
      for (i in 1:I) {
        for(k in 1:2) {
          // Linear prediction
          real mu_ik = REs[i, 1] + REs[i, 2] * (k - 1.5);
            
          // Vectorized log-likelihood
          Y[i, k, ] ~ normal(mu_ik, sigma);
        }
      }
   }
}

Model 6: Stan with Sparse Matrix operations

Click here to see the stan model code with profile()
model {
  // Priors matching the other models
  beta ~ normal(0, 1);
  sigma_b ~ student_t(4, 0, 0.5);
  sigma_e ~ student_t(4, 0, 0.5);
  L_b ~ lkj_corr_cholesky(1);
  Z_u ~ std_normal();

  // Likelihood
    vector[N] mu;
    profile(\"sparse_build_mu\") {
      mu = X * beta + csr_matrix_times_vector(N, b * J, Z_values, Z_cols, Z_row_starts, u);
    }
    profile(\"sparse_log_lik\") {
      Y ~ normal(mu, sigma_e);
    }
}

Results

The results are shown in Tables 1-3. Total (s) is the average accumulated time spent in the profiled block; Calls is the average number of autodiff calls; ms/call is the average total time per autodiff call; and Fwd/Rev ms splits this time into forward evaluation and reverse-mode autodiff. In Table 1, Relative time is computed with respect to the fastest model.

Table 1: Total sampling time

Model Total (s) Total (min) Relative time
Model 1 728 12.1 30.8
Model 2 729 12.2 30.8
Model 3 794 13.2 33.6
Model 4 156 2.60 6.61
Model 5 23.6 0.394 1.00
Model 6 506 8.43 21.4

Table 2: Average time spent building mu

Model Total (s) Calls ms/call Fwd/Rev ms
Model 1 468 162k 2.89 1.82 / 1.07
Model 2 510 174k 2.94 1.90 / 1.04
Model 6 310 165k 1.88 1.27 / 0.610

Table 3: Average time spent evaluating the log-likelihood

Model Total (s) Calls ms/call Fwd/Rev ms
Model 1 231 162k 1.43 0.955 / 0.476
Model 2 147 174k 0.849 0.676 / 0.172
Model 3 714 174k 4.11 2.85 / 1.26
Model 4 142 166k 0.856 0.638 / 0.218
Model 5 18.9 168k 0.113 0.109 / 0.003
Model 6 149 165k 0.904 0.772 / 0.132

It seems that, in the brms-style implementations, much of the runtime is associated with constructing the observation-level linear predictor. In Models 1 and 2, the profiled block that builds mu accounts for approximately 64% and 70% of the total sampling time, respectively. Once mu has been constructed, the likelihood evaluation itself is considerably cheaper. In fact, the log-likelihood evaluation times in Models 1 and 2 are very similar to that of Model 4, i.e., the slice model. The key difference is that Model 4 still uses the long data format but avoids constructing a full observation-level mu vector in the same way. Consistently, in Model 4, the log-likelihood computation accounts for about 91% of the total sampling time.

Model 3 provides an additional check. Because mu is not explicitly allocated and stored, its construction is embedded directly in the call to normal_id_glm_lpdf. The time spent in this combined block accounts for about 90% of the total sampling time, and is roughly what one would expect from adding the mu-construction and likelihood-evaluation costs observed in Models 1 and 2. This suggests that pre-allocating and storing mu is not the main bottleneck. Rather, the costly step is constructing the full observation-level linear predictor in the first place.

Finally, Model 6 shows that constructing mu through sparse matrix operations is more efficient than in the brms-style implementations. The mu-construction block is about 34% faster than in Model 1 and about 39% faster than in Model 2.

R code used in this post

stan_eff_hiermods_cmdstanr.R (34.4 KB)

@Rick_RS95 Thanks so much for your work on this. I wasn’t fully aware about the amount of time spend on creating the trial-wise predictions and how it dominates the sampling time when there are many observations. Thus, I think it is great to have a discussion on what can be done to increase sampling speed (when using brms).

It seems clear that the β€œproblem” is brms’ flexibility of allowing an arbitrary number of grouping factors. There seem to be only two general solutions for this issue, the sparse model matrix approach of lme4 and rstanarm and the observation-wise loop approach in brms. As shown in the last example of Ricardo, in certain situations with small numbers of grouping factors the sparse matrix approach can actually be faster than the loop-based approach by brms. But I can also see why the brms approach is the default as I can imagine it being faster in most situations with complicated random effect structures.

The problem of both these approaches is, however, that if the random-effect structure is not that complicated and there is only one grouping factor than both of these approaches can be rather slow (at least for the cases with many observations per level of the grouping factor, here subject). In this case it seems preferable to split the data by grouping factor. To show this in a more general way I have adapted Ricardo’s code and simulated a situation with three continuous predictors (which are different for every subject):

y ~ 1 + a + b + c + (1 + a + b + c | subj) 

I then compare the default brms code with one that uses the same general model structure (i.e., same parameters) but splits the data into one model matrix by subject. In the model block we then do not loop across observations but across subjects and use the subject-specific data vector, Ymat, and model matrix, mmarry. We then only need to construct the subject-specific parameter vector from fixed-effects and random effects and can then pass this to the GLM likelihood function.

vector[K] tmp_re;
for (n in 1:N_1) {
  tmp_re[1] = r_1_1[n];
  tmp_re[2] = r_1_2[n];
  tmp_re[3] = r_1_3[n];
  tmp_re[4] = r_1_4[n];
  // add more terms to the linear predictor
  target += normal_id_glm_lpdf(Ymat[n,] | mmarray[n], 0.0, b + tmp_re, sigma);
}

Similar to Ricardo’s simulations, this version is \approx 20 times faster than the standard brms code but produces exactly the same results.

The full timing comparison is:

  model     total_seconds total_minutes relative_to_fastest
  <chr>             <dbl>         <dbl>               <dbl>
1 brms_base        1012.         16.9                  18.8
2 brms_mm            53.7         0.896                 1  

I guess this allows me to specify my question to @paul.buerkner: Is there any chance that brms could get an option that allows something like the example above? That is, instead of building the model with an observation-wise loop with a loop over the levels of a grouping factor using the specific model matrix for each level of the grouping factor. My code below shows how this could be done at least for the example. I am happy to move this to GitHub if you prefer to discuss there.

And here the full script:

# script by Ricardo Rey-SΓ‘ez and Henrik Singmann
# Date: 15-05-2026  

# ─────────────────────────────────────────────────────────────────────────────
# SECTION 1: Load Packages
# ─────────────────────────────────────────────────────────────────────────────

library(MASS)
library(dplyr)
library(brms)
library(cmdstanr)
library(posterior)
library(tibble)

## also requires package randcorr!

# ─────────────────────────────────────────────────────────────────────────────
# SECTION 2: Simulate Data
# ─────────────────────────────────────────────────────────────────────────────

set.seed(2026)

mu_intercept <- 1.5
mu_a <- -2
mu_b <- 1.0
mu_c <- 0.25

sigma <- 1

cormat <- randcorr::randcorr(4)
cor2cov <- function(R, sds) {
  D <- diag(sds)
  D %*% R %*% D
}
Sigma <- cor2cov(cormat, c(1, 1.6, 0.4, 1))
  
I <- 150 # Subjects
L <- 300 # trials per subject

# Simulate random effects
REs <- MASS::mvrnorm(
  n = I,
  mu = c(mu_intercept, mu_a, mu_b, mu_c),
  Sigma = Sigma
)

# Simulate observed data
dat <- vector("list", I)

for (i in 1:I) {
  dat[[i]] <- tibble(
    subj = factor(i),
    a = rnorm(L),
    b = rnorm(L),
    c = rnorm(L)
  ) %>% 
    mutate(
      y = 
        REs[i, 1] + 
        REs[i, 2] * a + 
        REs[i, 3] * b + 
        REs[i, 4] * c + 
        rnorm(L, 0, sigma)
    )
}
dat <- bind_rows(dat)

# ─────────────────────────────────────────────────────────────────────────────
# SECTION 3: Generate brms Data
# ─────────────────────────────────────────────────────────────────────────────

priors_brms <- c(
  prior(normal(0, 1), class = "b")
)

brms_form <- brmsformula(
  formula = y ~ 1 + a + b + c + (1 + a + b + c | subj),
  center = FALSE
)

brms_data <- make_standata(
  formula = brms_form,
  data = dat,
  prior = priors_brms
)

brms_code <- make_stancode(
  formula = brms_form,
  data = dat,
  prior = priors_brms
)

# Generate model matrices:
dat_split <- split(dat, dat$subj)
mms <- vapply(dat_split, 
              \(x) model.matrix(~ 1 + a + b + c, x)[, , drop = FALSE], 
              model.matrix(~ 1 + a + b + c, dat_split[[1]])[, , drop = FALSE])
mms <- aperm(mms, c(3,1,2))

tmax <- dim(mms)[2]
ymat <- t(vapply(dat_split, \(x) x$y, rep(0.1, tmax)))

brms_data_new <- brms_data
brms_data_new$tmax <- tmax
brms_data_new$Ymat <- ymat
brms_data_new$mmarray <- mms

# ─────────────────────────────────────────────────────────────────────────────
# SECTION 4: Stan Model 1: brms base code 
# ─────────────────────────────────────────────────────────────────────────────

brms_base_code <-'
// generated with brms 2.23.0
functions {
 /* compute correlated group-level effects
  * Args:
  *   z: matrix of unscaled group-level effects
  *   SD: vector of standard deviation parameters
  *   L: cholesky factor correlation matrix
  * Returns:
  *   matrix of scaled group-level effects
  */
  matrix scale_r_cor(matrix z, vector SD, matrix L) {
    // r is stored in another dimension order than z
    return transpose(diag_pre_multiply(SD, L) * z);
  }
}
data {
  int<lower=1> N;  // total number of observations
  vector[N] Y;  // response variable
  int<lower=1> K;  // number of population-level effects
  matrix[N, K] X;  // population-level design matrix
  // data for group-level effects of ID 1
  int<lower=1> N_1;  // number of grouping levels
  int<lower=1> M_1;  // number of coefficients per level
  array[N] int<lower=1> J_1;  // grouping indicator per observation
  // group-level predictor values
  vector[N] Z_1_1;
  vector[N] Z_1_2;
  vector[N] Z_1_3;
  vector[N] Z_1_4;
  int<lower=1> NC_1;  // number of group-level correlations
  int prior_only;  // should the likelihood be ignored?
}
transformed data {
}
parameters {
  vector[K] b;  // regression coefficients
  real<lower=0> sigma;  // dispersion parameter
  vector<lower=0>[M_1] sd_1;  // group-level standard deviations
  matrix[M_1, N_1] z_1;  // standardized group-level effects
  cholesky_factor_corr[M_1] L_1;  // cholesky factor of correlation matrix
}
transformed parameters {
  matrix[N_1, M_1] r_1;  // actual group-level effects
  // using vectors speeds up indexing in loops
  vector[N_1] r_1_1;
  vector[N_1] r_1_2;
  vector[N_1] r_1_3;
  vector[N_1] r_1_4;
  // prior contributions to the log posterior
  real lprior = 0;
  // compute actual group-level effects
  r_1 = scale_r_cor(z_1, sd_1, L_1);
  r_1_1 = r_1[, 1];
  r_1_2 = r_1[, 2];
  r_1_3 = r_1[, 3];
  r_1_4 = r_1[, 4];
  lprior += normal_lpdf(b | 0, 1);
  lprior += student_t_lpdf(sigma | 3, 0, 2.8)
    - 1 * student_t_lccdf(0 | 3, 0, 2.8);
  lprior += student_t_lpdf(sd_1 | 3, 0, 2.8)
    - 4 * student_t_lccdf(0 | 3, 0, 2.8);
  lprior += lkj_corr_cholesky_lpdf(L_1 | 1);
}
model {
  // likelihood including constants
  if (!prior_only) {
    // initialize linear predictor term
    vector[N] mu = rep_vector(0.0, N);
    for (n in 1:N) {
      // add more terms to the linear predictor
      mu[n] += r_1_1[J_1[n]] * Z_1_1[n] + r_1_2[J_1[n]] * Z_1_2[n] + r_1_3[J_1[n]] * Z_1_3[n] + r_1_4[J_1[n]] * Z_1_4[n];
    }
    target += normal_id_glm_lpdf(Y | X, mu, b, sigma);
  }
  // priors including constants
  target += lprior;
  target += std_normal_lpdf(to_vector(z_1));
}
generated quantities {
  // compute group-level correlations
  corr_matrix[M_1] Cor_1 = multiply_lower_tri_self_transpose(L_1);
  vector<lower=-1,upper=1>[NC_1] cor_1;
  // extract upper diagonal of correlation matrix
  for (k in 1:M_1) {
    for (j in 1:(k - 1)) {
      cor_1[choose(k - 1, 2) + j] = Cor_1[j, k];
    }
  }
}'

# ─────────────────────────────────────────────────────────────────────────────
# SECTION 5: Stan Model with model matrices
# ─────────────────────────────────────────────────────────────────────────────

brms_mm_code <-'
// based on code generated with brms 2.23.0
functions {
 /* compute correlated group-level effects
  * Args:
  *   z: matrix of unscaled group-level effects
  *   SD: vector of standard deviation parameters
  *   L: cholesky factor correlation matrix
  * Returns:
  *   matrix of scaled group-level effects
  */
  matrix scale_r_cor(matrix z, vector SD, matrix L) {
    // r is stored in another dimension order than z
    return transpose(diag_pre_multiply(SD, L) * z);
  }
}
data {
  int<lower=1> K;  // number of population-level effects
  int<lower=1> tmax;  // max trial number
  
  // data for group-level effects of ID 1
  int<lower=1> N_1;  // number of grouping levels
  int<lower=1> M_1;  // number of coefficients per level
  
  array[N_1] vector[tmax] Ymat;     // Array with observed responses 
  array[N_1] matrix[tmax, K] mmarray;
  
  int<lower=1> NC_1;  // number of group-level correlations
  int prior_only;  // should the likelihood be ignored?
}
transformed data {
}
parameters {
  vector[K] b;  // regression coefficients
  real<lower=0> sigma;  // dispersion parameter
  vector<lower=0>[M_1] sd_1;  // group-level standard deviations
  matrix[M_1, N_1] z_1;  // standardized group-level effects
  cholesky_factor_corr[M_1] L_1;  // cholesky factor of correlation matrix
}
transformed parameters {
  matrix[N_1, M_1] r_1;  // actual group-level effects
  // using vectors speeds up indexing in loops
  vector[N_1] r_1_1;
  vector[N_1] r_1_2;
  vector[N_1] r_1_3;
  vector[N_1] r_1_4;
  // prior contributions to the log posterior
  real lprior = 0;
  // compute actual group-level effects
  r_1 = scale_r_cor(z_1, sd_1, L_1);
  r_1_1 = r_1[, 1];
  r_1_2 = r_1[, 2];
  r_1_3 = r_1[, 3];
  r_1_4 = r_1[, 4];
  lprior += normal_lpdf(b | 0, 1);
  lprior += student_t_lpdf(sigma | 3, 0, 2.8)
    - 1 * student_t_lccdf(0 | 3, 0, 2.8);
  lprior += student_t_lpdf(sd_1 | 3, 0, 2.8)
    - 4 * student_t_lccdf(0 | 3, 0, 2.8);
  lprior += lkj_corr_cholesky_lpdf(L_1 | 1);
}
model {
  // likelihood including constants
  if (!prior_only) {
    // initialize linear predictor term
    vector[K] tmp_re;
    for (n in 1:N_1) {
      tmp_re[1] = r_1_1[n];
      tmp_re[2] = r_1_2[n];
      tmp_re[3] = r_1_3[n];
      tmp_re[4] = r_1_4[n];
      // add more terms to the linear predictor
      target += normal_id_glm_lpdf(Ymat[n,] | mmarray[n], 0.0, b + tmp_re, sigma);
    }
    
  }
  // priors including constants
  target += lprior;
  target += std_normal_lpdf(to_vector(z_1));
}
generated quantities {
  // compute group-level correlations
  corr_matrix[M_1] Cor_1 = multiply_lower_tri_self_transpose(L_1);
  vector<lower=-1,upper=1>[NC_1] cor_1;
  // extract upper diagonal of correlation matrix
  for (k in 1:M_1) {
    for (j in 1:(k - 1)) {
      cor_1[choose(k - 1, 2) + j] = Cor_1[j, k];
    }
  }
}'

# ─────────────────────────────────────────────────────────────────────────────
# SECTION 7: Write Stan models to files
# ─────────────────────────────────────────────────────────────────────────────

dir.create("stan_models", showWarnings = FALSE)
dir.create("stan_fits", showWarnings = FALSE)

writeLines(
  brms_base_code,
  con = "stan_models/brms_base_profile.stan"
)

writeLines(
  brms_mm_code,
  con = "stan_models/brms_mm_profile.stan"
)

# ─────────────────────────────────────────────────────────────────────────────
# SECTION 8: Compile Stan models with cmdstanr
# ─────────────────────────────────────────────────────────────────────────────

base_model_brms <- cmdstan_model(
  stan_file = "stan_models/brms_base_profile.stan"
)

mm_model_brms <- cmdstan_model(
  stan_file = "stan_models/brms_mm_profile.stan"
)

# ─────────────────────────────────────────────────────────────────────────────
# SECTION 9: Fit Stan models with cmdstanr
# ─────────────────────────────────────────────────────────────────────────────

base_fit_brms <- base_model_brms$sample(
  data = brms_data,
  seed = 2026,
  chains = 4,
  parallel_chains = 4,
  iter_warmup = 1000,
  iter_sampling = 1000,
  refresh = 100,
  output_dir = "stan_fits"
)

mm_fit_brms <- mm_model_brms$sample(
  data = brms_data_new,
  seed = 2026,
  chains = 4,
  parallel_chains = 4,
  iter_warmup = 1000,
  iter_sampling = 1000,
  refresh = 100,
  output_dir = "stan_fits"
)


# ─────────────────────────────────────────────────────────────────────────────
# SECTION 10: Sampling times
# ─────────────────────────────────────────────────────────────────────────────

sampling_times <- tibble(
  model = c(
    "brms_base",
    "brms_mm"
  ),
  total_seconds = c(
    base_fit_brms$time()$total,
    mm_fit_brms$time()$total
  )
) |>
  mutate(
    total_minutes = total_seconds / 60,
    relative_to_fastest = total_seconds / min(total_seconds)
  )

print(sampling_times)

mm_fit_brms$summary(c("b", "sd_1", "sigma")) %>% 
  print(n = Inf)

base_fit_brms$summary(c("b", "sd_1", "sigma")) %>% 
  print(n = Inf)

# ─────────────────────────────────────────────────────────────────────────────

Great thread! I’m guessing the speed-up is due to the fact that in this case, multiple data points actually share the same linear predictor. Could you run another test where you pre-compute the per-subject linear predictor but then do one vectorised call? Something like:

model {
    profile("likelihood") {
      vector[I] mu = rep_vector(Intercept, I) + r_1_1 .* Z_1_1 + r_1_2 .* Z_1_2;
      y ~ normal(mu[J_1], sigma);
    }
}

Not sure if I got the notation right, but it’s to just produce a vectorised per-subject linear predictor, and then index into the linear predictor for the repeated measures of the observations.

EDIT: I don’t think that would be faster becasue the likelihood is still seeing a unique linear predictor for each observation. But I do think you could produce the per-subject linear predictor as vectorised, and then index into the obsevations using the slicing you’ve done. Probably wouldn’t make much of a difference.

Also, while you’re benchmarking, could you do me a favour and check if pre-computing the indexing (because you index multiple quantities with the same indices) makes any difference?

int idx_end = idx_start[i] + trialsum[i] - 1;
array[trial_sum[i]] int idx = linspaced_int_array(trialsum[i], idx_start[i], idx_start[i] + trial_sum[i] - 1);

Thanks for the suggestion! I ran one more set of checks using the same simulated data as before.

The new thing here is that I compared four long-format variants against the balanced array version:

  • stan_slice: the previous long-format model, evaluating the likelihood by subject-level slices.
  • stan_indexed_mu_one_call: precomputes the subject-by-condition predictors in a matrix[I, 2], then builds a full observation-level mu vector and uses a single vectorized likelihood call.
  • stan_indexed_mu_slice: precomputes the same subject-by-condition predictors, but still evaluates the likelihood by subject-level slices.
  • stan_slice_idx_td: same idea as stan_slice, but with row indices precomputed once in transformed data using linspaced_int_array (I think this is what you suggest, right?).
  • stan_array: the balanced array model, included as the fastest reference.

Since this example has two conditions, the β€œshared” predictor is not a vector of length I but a matrix[I, 2].

Table 1: Total estimation time

Model Total (s) Min Relative to fastest Speedup vs stan_slice Reduction vs stan_slice
stan_array 23.4 0.39 1.00 7.41 86.5%
stan_indexed_mu_slice 69.6 1.16 2.97 2.50 59.9%
stan_indexed_mu_one_call 134.0 2.24 5.73 1.29 22.7%
stan_slice 174.0 2.89 7.41 1.00 0.0%
stan_slice_idx_td 298.0 4.98 12.70 0.58 -71.9%

Table 2: Profile for building mu

Model Total (s) Calls ms/call Fwd/Rev ms
stan_indexed_mu_one_call 28.3 152015 0.186 0.185 / 0.001
stan_indexed_mu_slice 0.690 160645 0.004 0.004 / 0.000

Table 3: Profile for the likelihood

Model Total (s) Calls ms/call Fwd/Rev ms
stan_slice_idx_td 259.0 166302 1.560 1.120 / 0.442
stan_slice 154.0 166302 0.926 0.678 / 0.248
stan_indexed_mu_one_call 82.6 152015 0.543 0.436 / 0.108
stan_indexed_mu_slice 61.8 160645 0.385 0.272 / 0.113
stan_array 18.8 167949 0.112 0.109 / 0.003

Precomputing the subject-by-condition predictors helps quite a bit, especially when combined with subject-level slicing: stan_indexed_mu_slice ends up 2.5x faster than the original stan_slice. That said, building a full observation-level mu vector is still costly. In stan_indexed_mu_one_call, just building mu takes 28.3 seconds, and the model ends up only modestly faster than stan_slice overall. Your edit was right on this one!

The index check is also interesting. Even with row indices precomputed in transformed data, stan_slice_idx_td ends up nearly 72% slower than the original stan_slice. The likelihood alone takes 259 seconds, compared to 154 in stan_slice. So, unless I’ve done something wrong, range-based slicing is better in this specific case.

It’s worth noting that stan_array is symbolically doing the same thing as the other models, as it only computes I \times 2 unique subject-by-condition predictions. The difference is that the balanced data layout lets Stan exploit that structure directly, with no indexing overhead at all[1]. Same logic, but the data is already organized in a way that makes it trivial to vectorize.

As always, here is the R code:

R code used in this example
# ╔════════════════════════════════════════════════════════════════════════════╗
# β•‘                             SCRIPT OVERVIEW                                β•‘
# ╠════════════════════════════════════════════════════════════════════════════╣
# β•‘ Script Name   : stan_hierarchical_efficiency_checks_cmdstanr.R             β•‘
# β•‘ Author        : Ricardo Rey-SΓ‘ez                                           β•‘
# β•‘ Date          : 16-05-2026                                                 β•‘
# β•‘ Purpose       : Additional efficiency checks                               β•‘
# β•šβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•

# ─────────────────────────────────────────────────────────────────────────────
# SECTION 1: Load Packages
# ─────────────────────────────────────────────────────────────────────────────

library(MASS)
library(dplyr)
library(cmdstanr)
library(posterior)
library(tibble)

# ─────────────────────────────────────────────────────────────────────────────
# SECTION 2: Simulate Data
# ─────────────────────────────────────────────────────────────────────────────

# Basic gaussian hierarchical model with random intercepts (alpha) and slopes (beta)
# Population parameters from here: https://osf.io/preprints/psyarxiv/gv6k7_v2

set.seed(2026)

mu_alpha    <- .65
mu_beta     <- .06
sigma_alpha <- .12
sigma_beta  <- .04
rho_ab      <- matrix(c(1, .5, .5, 1), ncol = 2)
sigma_trial <- .18

Sigma_REs <- diag(c(sigma_alpha, sigma_beta)) %*% 
  rho_ab %*%
  diag(c(sigma_alpha, sigma_beta))

# Subjects and trials per subject and condition
I <- 150
L <- 150

# Simulate random effects
REs <- MASS::mvrnorm(
  n = I,
  mu = c(mu_alpha, mu_beta),
  Sigma = Sigma_REs
)

# Simulate observed data
dat <- matrix(NA, ncol = 3, nrow = I * L * 2)

cnt <- 1

for (i in 1:I) {
  for (k in 1:2) {
    
    alpha_i <- REs[i, 1]
    beta_i  <- REs[i, 2]
    X_k     <- k - 1.5
    mu_i    <- alpha_i + beta_i * X_k
    
    for (l in 1:L) {
      dat[cnt, 1] <- i
      dat[cnt, 2] <- k
      dat[cnt, 3] <- rnorm(1, mu_i, sigma_trial)
      cnt <- cnt + 1
    }
  }
}

dat <- as.data.frame(dat)
colnames(dat) <- c("subj", "cond", "rt")

# Set subject as grouping factor
dat$subj <- factor(dat$subj)

# Set condition as sum contrast: -0.5 / 0.5
dat$cond <- dat$cond - 1.5

# Store integer indices for Stan
dat$subj <- as.numeric(dat$subj)

dat <- dat |>
  mutate(
    cond_idx = as.integer(round(cond + 1.5))
  )

# ─────────────────────────────────────────────────────────────────────────────
# SECTION 3: Generate Stan Data
# ─────────────────────────────────────────────────────────────────────────────

# Data for slice models
trialsum <- as.integer(rle(dat$subj)$lengths)

sdata_block <- list(
  N = nrow(dat),
  I = length(unique(dat$subj)),
  Y = as.matrix(dat[, c("subj", "cond", "rt")]),
  trialsum = trialsum,
  idx_start = as.integer(cumsum(c(1L, head(trialsum, -1L))))
)

# Data for models that use observation-level indices
sdata_indexed <- list(
  N = nrow(dat),
  I = length(unique(dat$subj)),
  Y = dat$rt,
  subj_idx = as.integer(dat$subj),
  cond_idx = as.integer(dat$cond_idx)
)

# Data for slice models that also need condition indices
sdata_block_indexed <- list(
  N = nrow(dat),
  I = length(unique(dat$subj)),
  Y = as.matrix(dat[, c("subj", "cond", "rt")]),
  cond_idx = as.integer(dat$cond_idx),
  trialsum = trialsum,
  idx_start = as.integer(cumsum(c(1L, head(trialsum, -1L))))
)

# Data for slice model with indices precomputed in transformed data
sdata_block_idx_td <- list(
  N = nrow(dat),
  I = length(unique(dat$subj)),
  Y = as.matrix(dat[, c("subj", "cond", "rt")]),
  trialsum = trialsum,
  idx_start = as.integer(cumsum(c(1L, head(trialsum, -1L)))),
  max_trials = max(trialsum)
)

# Data for array model
Y_array <- aperm(array(dat$rt, dim = c(L, 2, I)), c(3, 2, 1))

sdata_array <- list(
  I = I,
  L = L,
  Y = Y_array
)

# ─────────────────────────────────────────────────────────────────────────────
# SECTION 4: Stan Model 1: original slice likelihood over subjects
# ─────────────────────────────────────────────────────────────────────────────

HM_stancode_slice <- "
data {
  int<lower=1> N;                         // Number of observations
  int<lower=1> I;                         // Number of subjects
  matrix[N, 3] Y;                         // Columns: (1 = subj, 2 = cond, 3 = rt)
  array[I] int<lower=1> trialsum;         // Number of observed trials per subject
  array[I] int<lower=1> idx_start;        // First row of each subject in Y
}          

parameters {
  vector[2] mu_REs;                       // Population means
  vector<lower=0>[2] sd_REs;              // Population std. devs
  real<lower=0> sigma;                    // Trial-level std. dev
  cholesky_factor_corr[2] L_R;            // Population REs correlation
  matrix[2, I] REs_std;                   // Standardized REs
}

transformed parameters {
  // Scaled random effects
  matrix[I, 2] REs = rep_matrix(mu_REs', I) + 
                    (diag_pre_multiply(sd_REs, L_R) * REs_std)';
}

model {
  // Model priors
  mu_REs ~ std_normal();
  sd_REs ~ student_t(4, 0, 0.5);
  sigma  ~ student_t(4, 0, 0.5);
  L_R    ~ lkj_corr_cholesky(1);
  to_vector(REs_std) ~ std_normal();

  // Model Log-Likelihood
  profile(\"slice_log_lik\") {
    for (i in 1:I) {
      // idx of the last observed row for subject i
      int idx_end = idx_start[i] + trialsum[i] - 1;
    
      // Vectorized likelihood for subject i
      Y[idx_start[i]:idx_end, 3] ~ normal(
        REs[i, 1] + REs[i, 2] * Y[idx_start[i]:idx_end, 2],
        sigma
      );
    }
  }
}

generated quantities {
  // Model-implied correlation matrix
  corr_matrix[2] Rho_REs = multiply_lower_tri_self_transpose(L_R);
}
"

# ─────────────────────────────────────────────────────────────────────────────
# SECTION 5: Stan Model 2: pre-compute shared predictors, one vectorized call
# ─────────────────────────────────────────────────────────────────────────────

# This model checks whether it helps to pre-compute the subject-by-condition
# linear predictors and then build a single observation-level mu vector.

HM_stancode_indexed_mu_one_call <- "
data {
  int<lower=1> N;                                  // Number of observations
  int<lower=1> I;                                  // Number of subjects
  vector[N] Y;                                     // Outcome vector

  array[N] int<lower=1, upper=I> subj_idx;         // Subject index per observation
  array[N] int<lower=1, upper=2> cond_idx;         // Condition index per observation
}

parameters {
  vector[2] mu_REs;                                // Population means
  vector<lower=0>[2] sd_REs;                       // Population std. devs
  real<lower=0> sigma;                             // Trial-level std. dev
  cholesky_factor_corr[2] L_R;                     // Population REs correlation
  matrix[2, I] REs_std;                            // Standardized REs
}

transformed parameters {
  // Scaled random effects
  matrix[I, 2] REs = rep_matrix(mu_REs', I) + 
                    (diag_pre_multiply(sd_REs, L_R) * REs_std)';
}

model {
  // Model priors
  mu_REs ~ std_normal();
  sd_REs ~ student_t(4, 0, 0.5);
  sigma  ~ student_t(4, 0, 0.5);
  L_R    ~ lkj_corr_cholesky(1);
  to_vector(REs_std) ~ std_normal();

  {
    matrix[I, 2] mu_sc;
    vector[N] mu;

    profile(\"indexed_mu_build_mu\") {
      // Subject-by-condition linear predictors
      mu_sc[, 1] = REs[, 1] - 0.5 * REs[, 2];
      mu_sc[, 2] = REs[, 1] + 0.5 * REs[, 2];

      // Observation-level linear predictor
      for (n in 1:N) {
        mu[n] = mu_sc[subj_idx[n], cond_idx[n]];
      }
    }

    profile(\"indexed_mu_log_lik\") {
      Y ~ normal(mu, sigma);
    }
  }
}

generated quantities {
  // Model-implied correlation matrix
  corr_matrix[2] Rho_REs = multiply_lower_tri_self_transpose(L_R);
}
"

# ─────────────────────────────────────────────────────────────────────────────
# SECTION 6: Stan Model 3: pre-compute shared predictors, then slice
# ─────────────────────────────────────────────────────────────────────────────

# This model checks whether the gain comes from pre-computing the shared
# subject-by-condition predictors while still evaluating the likelihood by
# subject-level slices.

HM_stancode_indexed_mu_slice <- "
data {
  int<lower=1> N;                         // Number of observations
  int<lower=1> I;                         // Number of subjects
  matrix[N, 3] Y;                         // Columns: (1 = subj, 2 = cond, 3 = rt)
  array[N] int<lower=1, upper=2> cond_idx;
  array[I] int<lower=1> trialsum;         // Number of observed trials per subject
  array[I] int<lower=1> idx_start;        // First row of each subject in Y
}          

parameters {
  vector[2] mu_REs;                       // Population means
  vector<lower=0>[2] sd_REs;              // Population std. devs
  real<lower=0> sigma;                    // Trial-level std. dev
  cholesky_factor_corr[2] L_R;            // Population REs correlation
  matrix[2, I] REs_std;                   // Standardized REs
}

transformed parameters {
  // Scaled random effects
  matrix[I, 2] REs = rep_matrix(mu_REs', I) + 
                    (diag_pre_multiply(sd_REs, L_R) * REs_std)';
}

model {
  // Model priors
  mu_REs ~ std_normal();
  sd_REs ~ student_t(4, 0, 0.5);
  sigma  ~ student_t(4, 0, 0.5);
  L_R    ~ lkj_corr_cholesky(1);
  to_vector(REs_std) ~ std_normal();

  {
    matrix[I, 2] mu_sc;

    profile(\"indexed_mu_slice_build_mu\") {
      // Subject-by-condition linear predictors
      mu_sc[, 1] = REs[, 1] - 0.5 * REs[, 2];
      mu_sc[, 2] = REs[, 1] + 0.5 * REs[, 2];
    }

    profile(\"indexed_mu_slice_log_lik\") {
      for (i in 1:I) {
        int idx_end = idx_start[i] + trialsum[i] - 1;
        vector[trialsum[i]] mu_i;

        for (j in 1:trialsum[i]) {
          int n = idx_start[i] + j - 1;
          mu_i[j] = mu_sc[i, cond_idx[n]];
        }

        Y[idx_start[i]:idx_end, 3] ~ normal(mu_i, sigma);
      }
    }
  }
}

generated quantities {
  // Model-implied correlation matrix
  corr_matrix[2] Rho_REs = multiply_lower_tri_self_transpose(L_R);
}
"

# ─────────────────────────────────────────────────────────────────────────────
# SECTION 7: Stan Model 4: slice likelihood with indices in transformed data
# ─────────────────────────────────────────────────────────────────────────────

# This is the cleaner version of the index check:
# indices are precomputed once in transformed data, not rebuilt inside model.

HM_stancode_slice_idx_td <- "
data {
  int<lower=1> N;                         // Number of observations
  int<lower=1> I;                         // Number of subjects
  matrix[N, 3] Y;                         // Columns: (1 = subj, 2 = cond, 3 = rt)
  array[I] int<lower=1> trialsum;         // Number of observed trials per subject
  array[I] int<lower=1> idx_start;        // First row of each subject in Y
  int<lower=1> max_trials;                // Maximum number of trials per subject
}          

transformed data {
  array[I, max_trials] int idx;

  for (i in 1:I) {
    array[trialsum[i]] int idx_i = linspaced_int_array(
      trialsum[i],
      idx_start[i],
      idx_start[i] + trialsum[i] - 1
    );

    for (j in 1:trialsum[i]) {
      idx[i, j] = idx_i[j];
    }

    if (trialsum[i] < max_trials) {
      for (j in (trialsum[i] + 1):max_trials) {
        idx[i, j] = 1;
      }
    }
  }
}

parameters {
  vector[2] mu_REs;                       // Population means
  vector<lower=0>[2] sd_REs;              // Population std. devs
  real<lower=0> sigma;                    // Trial-level std. dev
  cholesky_factor_corr[2] L_R;            // Population REs correlation
  matrix[2, I] REs_std;                   // Standardized REs
}

transformed parameters {
  // Scaled random effects
  matrix[I, 2] REs = rep_matrix(mu_REs', I) + 
                    (diag_pre_multiply(sd_REs, L_R) * REs_std)';
}

model {
  // Model priors
  mu_REs ~ std_normal();
  sd_REs ~ student_t(4, 0, 0.5);
  sigma  ~ student_t(4, 0, 0.5);
  L_R    ~ lkj_corr_cholesky(1);
  to_vector(REs_std) ~ std_normal();

  // Model Log-Likelihood
  profile(\"slice_idx_td_log_lik\") {
    for (i in 1:I) {
      Y[idx[i, 1:trialsum[i]], 3] ~ normal(
        REs[i, 1] + REs[i, 2] * Y[idx[i, 1:trialsum[i]], 2],
        sigma
      );
    }
  }
}

generated quantities {
  // Model-implied correlation matrix
  corr_matrix[2] Rho_REs = multiply_lower_tri_self_transpose(L_R);
}
"

# ─────────────────────────────────────────────────────────────────────────────
# SECTION 8: Stan Model 5: array model as fast reference
# ─────────────────────────────────────────────────────────────────────────────

HM_stancode_array <- "
data {
  int<lower=1> I;                         // Number of subjects
  int<lower=1> L;                         // Number of trials per subj and cond
  array[I, 2, L] real Y;                  // Array with observed responses 
}          

parameters {
  vector[2] mu_REs;                       // Population means
  vector<lower=0>[2] sd_REs;              // Population std. devs
  real<lower=0> sigma;                    // Trial-level std. dev
  cholesky_factor_corr[2] L_R;            // Population REs correlation
  matrix[2, I] REs_std;                   // Standardized REs
}

transformed parameters {
  // Scaled random effects
  matrix[I, 2] REs = rep_matrix(mu_REs', I) + 
                    (diag_pre_multiply(sd_REs, L_R) * REs_std)';
}

model {
  // Model priors
  mu_REs ~ std_normal();
  sd_REs ~ student_t(4, 0, 0.5);
  sigma  ~ student_t(4, 0, 0.5);
  L_R    ~ lkj_corr_cholesky(1);
  to_vector(REs_std) ~ std_normal();

  // Model Log-Likelihood
  profile(\"array_log_lik\") {
    for (i in 1:I) {
      for (k in 1:2) {
        // Linear prediction
        real mu_ik = REs[i, 1] + REs[i, 2] * (k - 1.5);
          
        // Vectorized log-likelihood
        Y[i, k, ] ~ normal(mu_ik, sigma);
      }
    }
  }
}

generated quantities {
  // Model-implied correlation matrix
  corr_matrix[2] Rho_REs = multiply_lower_tri_self_transpose(L_R);
}
"

# ─────────────────────────────────────────────────────────────────────────────
# SECTION 9: Write Stan models to files
# ─────────────────────────────────────────────────────────────────────────────

dir.create("stan_models_efficiency_checks", showWarnings = FALSE)
dir.create("stan_fits_efficiency_checks", showWarnings = FALSE)

writeLines(
  HM_stancode_slice,
  con = "stan_models_efficiency_checks/stan_slice_model.stan"
)

writeLines(
  HM_stancode_indexed_mu_one_call,
  con = "stan_models_efficiency_checks/stan_indexed_mu_one_call_model.stan"
)

writeLines(
  HM_stancode_indexed_mu_slice,
  con = "stan_models_efficiency_checks/stan_indexed_mu_slice_model.stan"
)

writeLines(
  HM_stancode_slice_idx_td,
  con = "stan_models_efficiency_checks/stan_slice_idx_td_model.stan"
)

writeLines(
  HM_stancode_array,
  con = "stan_models_efficiency_checks/stan_array_model.stan"
)

# ─────────────────────────────────────────────────────────────────────────────
# SECTION 10: Compile Stan models with cmdstanr
# ─────────────────────────────────────────────────────────────────────────────

stan_slice_mod <- cmdstan_model(
  stan_file = "stan_models_efficiency_checks/stan_slice_model.stan"
)

stan_indexed_mu_one_call_mod <- cmdstan_model(
  stan_file = "stan_models_efficiency_checks/stan_indexed_mu_one_call_model.stan"
)

stan_indexed_mu_slice_mod <- cmdstan_model(
  stan_file = "stan_models_efficiency_checks/stan_indexed_mu_slice_model.stan"
)

stan_slice_idx_td_mod <- cmdstan_model(
  stan_file = "stan_models_efficiency_checks/stan_slice_idx_td_model.stan"
)

stan_array_mod <- cmdstan_model(
  stan_file = "stan_models_efficiency_checks/stan_array_model.stan"
)

# ─────────────────────────────────────────────────────────────────────────────
# SECTION 11: Fit Stan models with cmdstanr
# ─────────────────────────────────────────────────────────────────────────────

stan_slice_fit <- stan_slice_mod$sample(
  data = sdata_block,
  seed = 2026,
  chains = 4,
  parallel_chains = 4,
  iter_warmup = 1000,
  iter_sampling = 1000,
  refresh = 100,
  output_dir = "stan_fits_efficiency_checks"
)

stan_indexed_mu_one_call_fit <- stan_indexed_mu_one_call_mod$sample(
  data = sdata_indexed,
  seed = 2026,
  chains = 4,
  parallel_chains = 4,
  iter_warmup = 1000,
  iter_sampling = 1000,
  refresh = 100,
  output_dir = "stan_fits_efficiency_checks"
)

stan_indexed_mu_slice_fit <- stan_indexed_mu_slice_mod$sample(
  data = sdata_block_indexed,
  seed = 2026,
  chains = 4,
  parallel_chains = 4,
  iter_warmup = 1000,
  iter_sampling = 1000,
  refresh = 100,
  output_dir = "stan_fits_efficiency_checks"
)

stan_slice_idx_td_fit <- stan_slice_idx_td_mod$sample(
  data = sdata_block_idx_td,
  seed = 2026,
  chains = 4,
  parallel_chains = 4,
  iter_warmup = 1000,
  iter_sampling = 1000,
  refresh = 100,
  output_dir = "stan_fits_efficiency_checks"
)

stan_array_fit <- stan_array_mod$sample(
  data = sdata_array,
  seed = 2026,
  chains = 4,
  parallel_chains = 4,
  iter_warmup = 1000,
  iter_sampling = 1000,
  refresh = 100,
  output_dir = "stan_fits_efficiency_checks"
)

# ─────────────────────────────────────────────────────────────────────────────
# SECTION 12: Total estimation times
# ─────────────────────────────────────────────────────────────────────────────

sampling_times <- tibble(
  model = c(
    "stan_slice",
    "stan_indexed_mu_one_call",
    "stan_indexed_mu_slice",
    "stan_slice_idx_td",
    "stan_array"
  ),
  total_seconds = c(
    stan_slice_fit$time()$total,
    stan_indexed_mu_one_call_fit$time()$total,
    stan_indexed_mu_slice_fit$time()$total,
    stan_slice_idx_td_fit$time()$total,
    stan_array_fit$time()$total
  )
) |>
  mutate(
    total_minutes = total_seconds / 60,
    relative_to_fastest = total_seconds / min(total_seconds),
    relative_to_slice = total_seconds / total_seconds[model == "stan_slice"],
    speedup_vs_slice = total_seconds[model == "stan_slice"] / total_seconds,
    reduction_vs_slice = 100 * (1 - total_seconds / total_seconds[model == "stan_slice"])
  ) |>
  arrange(total_seconds)

print(sampling_times)

sampling_times_for_thread <- sampling_times |>
  mutate(
    total_seconds = round(total_seconds, 1),
    total_minutes = round(total_minutes, 2),
    relative_to_fastest = round(relative_to_fastest, 2),
    relative_to_slice = round(relative_to_slice, 2),
    speedup_vs_slice = round(speedup_vs_slice, 2),
    reduction_vs_slice = round(reduction_vs_slice, 1)
  )

print(sampling_times_for_thread)

# ─────────────────────────────────────────────────────────────────────────────
# SECTION 13: Extract profiling information
# ─────────────────────────────────────────────────────────────────────────────

profiles_slice <- bind_rows(
  stan_slice_fit$profiles(),
  .id = "chain"
) |>
  mutate(model = "stan_slice")

profiles_indexed_mu_one_call <- bind_rows(
  stan_indexed_mu_one_call_fit$profiles(),
  .id = "chain"
) |>
  mutate(model = "stan_indexed_mu_one_call")

profiles_indexed_mu_slice <- bind_rows(
  stan_indexed_mu_slice_fit$profiles(),
  .id = "chain"
) |>
  mutate(model = "stan_indexed_mu_slice")

profiles_slice_idx_td <- bind_rows(
  stan_slice_idx_td_fit$profiles(),
  .id = "chain"
) |>
  mutate(model = "stan_slice_idx_td")

profiles_array <- bind_rows(
  stan_array_fit$profiles(),
  .id = "chain"
) |>
  mutate(model = "stan_array")

profiles_all <- rbind(
  profiles_slice,
  profiles_indexed_mu_one_call,
  profiles_indexed_mu_slice,
  profiles_slice_idx_td,
  profiles_array
)

profile_summary <- profiles_all |>
  group_by(model, name) |>
  summarise(
    total_time = mean(total_time),
    forward_time = mean(forward_time),
    reverse_time = mean(reverse_time),
    autodiff_calls = mean(autodiff_calls),
    total_time_per_call = total_time / autodiff_calls,
    forward_time_per_call = forward_time / autodiff_calls,
    reverse_time_per_call = reverse_time / autodiff_calls,
    .groups = "drop"
  ) |>
  arrange(model, desc(total_time))

print(profile_summary, n = Inf)

# Build-mu profiling
profile_build_mu <- profile_summary |>
  filter(
    name %in% c(
      "indexed_mu_build_mu",
      "indexed_mu_slice_build_mu"
    )
  ) |>
  arrange(desc(total_time))

print(profile_build_mu)

# Log-likelihood profiling
profile_log_lik <- profile_summary |>
  filter(
    name %in% c(
      "slice_log_lik",
      "indexed_mu_log_lik",
      "indexed_mu_slice_log_lik",
      "slice_idx_td_log_lik",
      "array_log_lik"
    )
  ) |>
  arrange(desc(total_time))

print(profile_log_lik)

# Clean versions for the thread
profile_build_mu_for_thread <- profile_build_mu |>
  mutate(
    total_time = round(total_time, 3),
    forward_time = round(forward_time, 3),
    reverse_time = round(reverse_time, 3),
    autodiff_calls = round(autodiff_calls),
    total_time_per_call = round(total_time_per_call * 1000, 3),
    forward_time_per_call = round(forward_time_per_call * 1000, 3),
    reverse_time_per_call = round(reverse_time_per_call * 1000, 3)
  )

profile_log_lik_for_thread <- profile_log_lik |>
  mutate(
    total_time = round(total_time, 3),
    forward_time = round(forward_time, 3),
    reverse_time = round(reverse_time, 3),
    autodiff_calls = round(autodiff_calls),
    total_time_per_call = round(total_time_per_call * 1000, 3),
    forward_time_per_call = round(forward_time_per_call * 1000, 3),
    reverse_time_per_call = round(reverse_time_per_call * 1000, 3)
  )

print(profile_build_mu_for_thread)
print(profile_log_lik_for_thread)

# ─────────────────────────────────────────────────────────────────────────────
# SECTION 14: Posterior summaries
# ─────────────────────────────────────────────────────────────────────────────

summary_slice <- stan_slice_fit$summary(
  variables = c("mu_REs", "sd_REs", "sigma", "Rho_REs")
) |>
  mutate(model = "stan_slice")

summary_indexed_mu_one_call <- stan_indexed_mu_one_call_fit$summary(
  variables = c("mu_REs", "sd_REs", "sigma", "Rho_REs")
) |>
  mutate(model = "stan_indexed_mu_one_call")

summary_indexed_mu_slice <- stan_indexed_mu_slice_fit$summary(
  variables = c("mu_REs", "sd_REs", "sigma", "Rho_REs")
) |>
  mutate(model = "stan_indexed_mu_slice")

summary_slice_idx_td <- stan_slice_idx_td_fit$summary(
  variables = c("mu_REs", "sd_REs", "sigma", "Rho_REs")
) |>
  mutate(model = "stan_slice_idx_td")

summary_array <- stan_array_fit$summary(
  variables = c("mu_REs", "sd_REs", "sigma", "Rho_REs")
) |>
  mutate(model = "stan_array")

posterior_summaries <- rbind(
  summary_slice,
  summary_indexed_mu_one_call,
  summary_indexed_mu_slice,
  summary_slice_idx_td,
  summary_array
) |>
  select(
    model,
    variable,
    mean,
    median,
    sd,
    q5,
    q95,
    rhat,
    ess_bulk,
    ess_tail
  )

print(posterior_summaries, n = Inf)

# ─────────────────────────────────────────────────────────────────────────────


  1. But this can also be implemented in unbalanced designs! β†©οΈŽ

Thank you soooo much for this expansion with several continuous predictors, @Henrik_Singmann!! The idea of using subject-specific matrices is really elegant, and it seems very useful for extending the model to more complex designs.

Thanks for doing this. If the indexing actually seems to make that much of a difference, I wonder if compiling the models as cmdstanr::cmdstan_model("mod.stan", cpp_options = list(stan_no_range_checks = TRUE)) would make a difference?

In the past I’ve pre-computed linspaced_int_array for what I thought would be efficiently indexing multiple structures by the same slice, but I may have to revisit this.

I have to take a closer look later because I don’t have time anymore.

I’ll just note that if things go well (fingers crossed) I’ll be working with @paul.buerkner on building something a bit like brms, but with more flexibility and - importantly for this discussion - an actual structured intermediate representation of the model that would be amenable to automatic transformations. We’ve had collapsing rows with identical predictors (via sufficient statistics) on our menu of simple things to try first and I’ve bookmared this thread as it has a bunch of other ideas to try to do automatically.

I am not sure how difficult it would be to get those tricks into current brms codebase, but my impression is that it would touch a lot of parts of the code generation system and thus would be pretty challenging.

Hey Martin, this sounds great. Thanks so much. Looking forward to seeing where you get to.

If I may be so bold, I would like to suggest adding a new argument, say loop_group, to brm which allows changing the level over which the loop inside the model block works. As my example code shows this can lead to significant reductions in the time it takes to create the trial-wise predictions even if we do not have identical predictors. This seems to strike a good balance between the current brms behaviour and the full sparse model matrix used by lme4/rstanarm. And it would essentially be up to the user to decide whether or not to use this argument.

Thanks for the information! This sounds like a very interesting project, especially if it could make optimizations like these easily available to applied researchers. Could you let us know here once it becomes clearer whether the project will move forward? It is always good to stay up to date with this kind of innovation.

Currently the array form is balanced, but it’d easily generalise to β€œragged” structures by doing:

Y[i, k, :N[i, k]] ~ normal(mu_ik, sigma);

where N[i, k] gives the number of observations in each group and Y is declared as stan array[I, K, N_max] real Y. Actually, you could probably make it incrementally faster by more efficient matrix indexing as array[I] matrix[N_max, K] Y. Anyway, these ragged data structures are really common in ecology and I like to write the models in a generalised form where the observations aren’t necessarily balanced. But I wonder how much the indexing would slow it down.

Yes, exactly. When I said before that the array-based implementation can be generalized to unbalanced data, this ragged-array structure is basically what we had in mind. In fact, this is what we have used in our own work, using the first representation:

array[I, K, N_max] real Y;

My only question concerns the suggested array[I] matrix[N_max, K] Y representation. Do you expect this to be faster than array[I, K, N_max] real Y because it allows Stan to exploit more efficient matrix/vector indexing? I would be curious to understand the intuition behind this. In my case, I often work with 4-dimensional arrays. Something like this:

// Observed RT from subject i in condition k, task j and trial l
array[I, K, J, L_max] real Y; 

// Number of obs. trials per subject, condition, and task
array[I, K, J] int T_subj;          

Would you also expect an improvement in sampling times from changing this to something like this?

array[I, J] matrix[L_max, K] Y