MCMC sampling in rstanarm and brms is a lot slower than in rstan for my two-level hierarchical model

I’m trying to model the following toy dataset as part of my self-study:

running.csv (4.2 KB)

You can read this data in using this function:

get_normal_hierarchical_predictor_data <- function(integer_encode_groups=F) {
  df <- read.csv("./running.csv")
  if (integer_encode_groups) {
    df$runner <- factor(df$runner)
    df$runner <- as.numeric(df$runner)
df <- get_normal_hierarchical_predictor_data()
df$age <- (df$age - mean(df$age)) / sd(df$age)  # normalize predictor

In RStan, I defined the following hierarchical models:

modelString = "
  data {
    // parameters for specifying prior for b0
    real m_0;
    real<lower=0> s_0;
    // parameters for specifying prior for sigma_0
    real<lower=0> lambda_0;
    // parameters for specifying prior for b1
    real m_1;
    real<lower=0> s_1;
    // parameters for specifying prior for sigma_1
    real<lower=0> lambda_1;
    // parameters for specifying prior for sigma_y
    real<lower=0> lambda_y;
    // parameters for data
    int<lower=0> n_obs;
    int<lower=0> n_group;
    vector[n_obs] x;
    real y[n_obs];
    int group[n_obs];
    int prior_only;
  parameters {
    real b0;
    real<lower=0> sigma_0;
    vector[n_group] b0_j;
    real b1;
    real<lower=0> sigma_1;
    vector[n_group] b1_j;
    real<lower=0> sigma_y;
  model {
    b0 ~ normal(m_0, s_0);
    sigma_0 ~ exponential(lambda_0);
    b0_j ~ normal(b0, sigma_0);
    b1 ~ normal(m_1, s_1);
    sigma_1 ~ exponential(lambda_1);
    b1_j ~ normal(b1, sigma_1);
    sigma_y ~ exponential(lambda_y);
    if (!prior_only) {
      y ~ normal(b0_j[group] + b1_j[group] .* x, sigma_y);
  generated quantities {
    real y_rep[n_obs] = normal_rng(b0_j[group] + b1_j[group] .* x, sigma_y);

Compiling and sampling are pretty fast:

stanDso = stan_model( model_code=modelString )
dataList = list(
  x = df$age,
  y = df$net,
  group = df$runner,
  n_obs = n_obs,
  n_group = n_group,
  m_0 = 0, s_0 = 100,
  lambda_0 = 0.1,
  m_1 = 0, s_1 = 100,
  lambda_1 = 0.1,
  lambda_y = 0.1,
  prior_only = 0
fit = sampling(

But the same (almost) model in rstanarm takes way (at least 5x time) longer to sample (for the same number of warmup and iter):

fit <- stan_glmer(
  net ~ 1 + age + (1 + age | runner), 
  data = df,
  family = gaussian(),
  prior_intercept = rstanarm::normal(0, 100),
  prior = rstanarm::normal(0, 100),
  prior_aux = rstanarm::exponential(0.1),
  prior_covariance = rstanarm::decov(reg=1, conc=1, shape=1, scale=1),

Does anyone know the reason behind this? Could it be because that rstanarm defines a joint prior via prior covariance, while my stan code uses independent priors?

Note: I only observe the slowdown of rstanarm for this hierarchical model; for binomial and simple linear regression models, rstanarm is consistently faster than rstan (time measured for compilation + sampling).

The key thing to remember is that the rstanarm models are coded to favour generalisability over pure performance, since they need to be able to accomodate a wide range of model specifications.

I’d also recommend checking that the returned posteriors are consistent between the two models, so you can verify that the same prior/likelihood is being specified

1 Like

Thanks for the response. I understand that there might be performance differences, but sampling in rstanarm for my model is 5x+ slower than rstan - this seems like a huge cost of generalizability. This should be concerning.

I’m having the same issue with brms. Below is my brms model:

fit <- brm(
  net ~ 1 + age + (1 + age | runner),

(@paul.buerkner Could you take a look at this as well? If time allows!)

You are comparing three different models (different priors) and in case of hierarchical model these can have a big effect in the shape of the posterior which can easily explain orders of magnitude difference in sampling speed.

You also mention that you are comparing time with compilation time, which makes it even more difficult to make useful comparisons.

You can get the Stan code from brms by using make_stancode(formula, …) or stancode(fit), You could then use that with rstan to make a fair comparison. To make a fair comparsion to rstanarm you should use the same priors, or you could also compare just the computation time per leapfrog step.

1 Like

As suggested, I took a look at brms generated code and found that it’s (1) sampling the covariance matrix between group-level effects and (2) normalizing predictors.

To get rid of these features, I used the following formulas:

No covariance:

1 + age + (1 | runner) + (0 + age | runner)

No normalizing predictors:

0 + Intercept + age + (1 + age | runner)


instead of the original:

1 + age + (1 + age | runner)

It turns out that turning off normalizing predictors has a great effect! Sampling time reduced from 30+ seconds to 4-5 seconds (and I ran this multiple times to ensure the speedup is not just from randomness), which is much similar to my code in RStan.

No covariance also improved sampling speed, but adding it on top of normalizing predictors only brought a bit additional speedup.

Unfortunately, I haven’t found a way to disable this feature for rstanarm.

Side-note: I also noticed that the code brms generated is not fully optimized (I’m guessing that it’s trying to be more general), so replacing a for-loop with vectorized code also improved sampling speed by 1-2 seconds.

1 Like

Just a warning that you have to be careful with this in Stan. Because we compile down to C++, the loops are often faster than vectorized forms because they don’t need to allocate intermediate variables. For instance, it’s faster to do this:

vector[N] alpha;
for (n in 1:N)
  alpha[n] = a[foo[n]] + b[bar[n]];
y ~ normal(alpha, sigma);

than this:

vector[N] alpha = a[foo] + b[bar];
y ~ normal(alpha, sigma);

The reason is that a[foo] winds up allocating a new N-vector and so does b[bar], whereas there’s no extra allocation in the loop form.


Just to clarify that these kinds of operations won’t always be slower when vectorised since the Math library has been moving to propagating expression templates from operations where possible

I think what’s concerning to me is why brms slows down significantly because of all the mechanisms it implements for handling normalization of predictors, at least in the context of my model. Actually, brms actually claims centering predictors to be more computational efficient (maybe this is true in some cases).

Reference: Prior Definitions for brms Models — set_prior • brms