Within-chain parallelization misbehaves with brms in a model with measurement errors

The following model specification works as expected with brms through within-chain parallelization:

brm(y ~ cond + (1|Subj), data=dat, …, backend = “cmdstanr”, threads = threading(4))

It would also work fine when the measurement error for the response variable y is incorporated into the model without using within-chain parallelization:

brm(y | se(SE, sigma = TRUE) ~ cond + (1|Subj), data=dat, …,)

However, when within-chain parallelization is invoked for the second model:

brm(y | se(SE, sigma=TRUE) ~ cond + (1|Subj), data=dat, …, backend = “cmdstanr”, threads = threading(4))

it sputters with the following:

Chain 1 Exception: Exception: normal_lpdf: Random variable has size = 710, but Scale parameter has size 11366; and they must be the same size. (in ‘/tmp/Rtmp8GdeTj/model-336f182bf4d2.stan’, line 28, column 4 to column 73) (in ‘/tmp/Rtmp8GdeTj/model-336f182bf4d2.stan’, line 28, column 4 to column 73)
Chain 1 Exception: Exception: normal_lpdf: Random variable has size = 710, but Scale parameter has size 11366; and they must be the same size. (in ‘/tmp/Rtmp8GdeTj/model-336f182bf4d2.stan’, line 28, column 4 to column 73) (in ‘/tmp/Rtmp8GdeTj/model-336f182bf4d2.stan’, line 28, column 4 to column 73)

Warning: Chain 1 finished unexpectedly!

Warning: Chain 2 finished unexpectedly!

Warning: Chain 3 finished unexpectedly!

Warning: Chain 4 finished unexpectedly!

Warning: Use read_cmdstan_csv() to read the results of the failed chains.

Error in rstan::read_stan_csv(out$output_files()) :

csvfiles does not contain any CSV file name

Calls: brm … eval2 -> eval -> eval -> .fun -> .fit_model ->

In addition: Warning messages:

1: All chains finished unexpectedly!
2: No chains finished successfully. Unable to retrieve the fit.
Execution halted

1 Like

Looks like a bug in the stan code generation. I will check.

On adhoc test data, it works with the brms github version:

dat <- data.frame(
  y = rnorm(100),
  SE = rexp(100),
  cond = sample(c("a", "b"), 100, TRUE),
  Subj = sample(1:10, 100, TRUE)
)

brm(y | se(SE, sigma=TRUE) ~ cond + (1|Subj), data=dat, 
    backend = "cmdstanr", threads = threading(4))

Please provide a minimial reproducible example for the problem you see (after testing whether it works already on github).

I have updated brms to version 2.14.2 from github, and tested that it does work fine with your ad hoc test data:

dat <- data.frame(
y = rnorm(100),
SE = rexp(100),
cond = sample(c(“a”, “b”), 100, TRUE),
Subj = sample(1:10, 100, TRUE)
)

However, it sill whines with my dataset:

dat <- read.table(‘dat.txt’, header=T)
library(‘brms’)
library(‘cmdstanr’)
set_cmdstan_path(’~/cmdstan’)
options(mc.cores = parallel::detectCores())
fm <- brm(y|se(SE, sigma = TRUE)~cond+(1|Subj), data=dat, chains = 4, iter=1000, backend = “cmdstanr”, threads = threading(2))

Found the problem and fixed it on github.

2 Likes

Thanks for the quick fix, Paul!

Hi Paul,

I am facing similar issue while specifying ‘threading’ argument. Let me share the model setup here however, I cannot share actual model and data due to data security issues.

The below setup works fine in absence of threading. Also, I have upgraded BRMS to latest version 2.14.2

Let me know the work around for this.

The error details are as below:

Warning: Chain 1 finished unexpectedly!

Error in rstan::read_stan_csv(out$output_files()) :
csvfiles does not contain any CSV file name
In addition: Warning message:
No chains finished successfully. Unable to retrieve the fit.

CODE :

my_prior <- c(prior(normal(0,100),class=‘b’,nlpar=‘Int’)+
prior(beta(6,112),class=‘b’,nlpar=‘A’,lb=0,ub=1)+
prior(beta(0.24,24),class=‘b’,nlpar=‘B’,lb=0,ub=1)+
prior(beta(2.15,70),class=‘b’,nlpar=‘C’,lb=0,ub=1)+
prior(beta(2.15,70),class=‘b’,nlpar=‘D’,lb=0,ub=1)+
prior(beta(23,202),class=‘b’,nlpar=‘E’,lb=0,ub=1)+
prior(beta(23,202),class=‘b’,nlpar=‘F’,lb=0,ub=1)+
prior(beta(32,232),class=‘b’,nlpar=‘G’,lb=0,ub=1)+
prior(beta(32,232),class=‘b’,nlpar=‘H’,lb=0,ub=1)+
prior(normal(0,100),class=‘b’,nlpar=‘I’)+
prior(normal(0,100),class=‘b’,nlpar=‘J’)+
prior(normal(0,100),class=‘b’,nlpar=‘K’)+
prior(normal(0,100),class=‘b’,nlpar=‘L’)+
prior(normal(0,100),class=‘b’,nlpar=‘M’)+
prior(normal(0,100),class=‘b’,nlpar=‘N’)+
prior(normal(0,100),class=‘b’,nlpar=‘O’))

library(future)

plan(multiprocess,workers=5)
start.time <- Sys.time()

model <- brm_multiple(bf(out ~ Int
+A1+B1+C1+D1+E1+F1+G1+H1
+I1+J1+K1+L1+M1+N1+O1
,nl=TRUE)+
lf(Int ~ 1)+
lf(A1 ~ 0+A)+
lf(B1 ~ 0+B)+
lf(C1 ~ 0+C)+
lf(D1 ~ 0+D)+
lf(E1 ~ 0+E)+
lf(F1 ~ 0+F)+
lf(G1 ~ 0+G)+
lf(H1 ~ 0+H)+
lf(I1 ~ 0+I)+
lf(J1 ~ 0+J)+
lf(K1 ~0+K)+
lf(L1 ~ 0+L)+
lf(M1 ~ 0+M)+
lf(N1 ~ 0+N)+
lf(O1 ~ 0+O),
data = df_split,
family = bernoulli(“logit”),
prior = my_prior,warmup = 1000,chains = 5,cores=5,
seed=12345,iter=2000,silent = FALSE,
backend = “cmdstanr”, threads = threading(2))

Can you provide a minimal reproducible example on fake data?

Ok, after some correspondence via email, here is a reproducible example of the problem:

library(brms)

set.seed(1234)
df <- data.frame(
  Independent = rnorm(262),
  Dependent = rbinom(262, 1, 0.01)
)

my_prior <- c(prior(normal(0,100),class='b',nlpar='Int')+
                prior(beta(6,112),class='b',nlpar='Indep', lb=0,ub=1))

model <- brm(bf(Dependent ~ Int
                         +Indep
                         ,nl=TRUE)+
                        lf(Int ~ 1)+
                        lf(Indep ~ 0+Independent),
                      data = df,
                      family = bernoulli("logit"),
                      prior = my_prior, chains = 1,
             silent = FALSE,
              backend='cmdstanr',
             threads=threading(2))

Compiling Stan program...
Start sampling
Running MCMC with 1 chain, with 2 thread(s) per chain...

Chain 1 Assertion failed!
Chain 1 
Chain 1 Program: C:\Users\paulb\AppData\Local\Temp\RtmpmQNTZp\filecbc529c272b_threads.exe
Chain 1 File: stan/lib/stan_math/lib/eigen_3.3.7/Eigen/src/Core/DenseCoeffsBase.h, Line 408
Chain 1 
Chain 1 Expression: index >= 0 && index < size()
Warnung: Chain 1 finished unexpectedly!

Fehler in rstan::read_stan_csv(out$output_files()) : 
  csvfiles does not contain any CSV file name
Zusätzlich: Warnmeldung:
 Fehler in rstan::read_stan_csv(out$output_files()) : 
  csvfiles does not contain any CSV file name 

The error doesn’t happen without threading. I don’t understand where this error is coming from, so I would appreciate some input from other developers. Perhaps @wds15 has some ideas?

Ok, after some correspondence via email, here is a reproducible example of the problem:

library(brms)

set.seed(1234)
df <- data.frame(
  Independent = rnorm(262),
  Dependent = rbinom(262, 1, 0.01)
)

my_prior <- c(prior(normal(0,100),class='b',nlpar='Int')+
                prior(beta(6,112),class='b',nlpar='Indep', lb=0,ub=1))

model <- brm(bf(Dependent ~ Int
                +Indep
                ,nl=TRUE)+
               lf(Int ~ 1)+
               lf(Indep ~ 0+Independent),
             data = df,
             family = bernoulli("logit"),
             prior = my_prior, chains = 1,
             silent = FALSE,
             backend='cmdstanr',
             threads=threading(2))

Compiling Stan program...
Start sampling
Running MCMC with 1 chain, with 2 thread(s) per chain...

Chain 1 Assertion failed!
  Chain 1 
Chain 1 Program: C:\Users\paulb\AppData\Local\Temp\RtmpmQNTZp\filecbc529c272b_threads.exe
Chain 1 File: stan/lib/stan_math/lib/eigen_3.3.7/Eigen/src/Core/DenseCoeffsBase.h, Line 408
Chain 1 
Chain 1 Expression: index >= 0 && index < size()
Warnung: Chain 1 finished unexpectedly!
  
  Fehler in rstan::read_stan_csv(out$output_files()) : 
  csvfiles does not contain any CSV file name
Zusätzlich: Warnmeldung:
  Fehler in rstan::read_stan_csv(out$output_files()) : 
  csvfiles does not contain any CSV file name 

The error doesn’t happen without threading. I don’t understand where this error is coming from, so I would appreciate some input from other developers. Perhaps @wds15 has some ideas?

Indexing is messed up:

  real partial_log_lik(int[] seq, int start, int end, int[] Y, matrix X_Int, vector b_Int, matrix X_Indep, vector b_Indep) {
    real ptarget = 0;
    int N = end - start + 1;
    // initialize linear predictor term
    vector[N] nlp_Int = X_Int[start:end] * b_Int;
    // initialize linear predictor term
    vector[N] nlp_Indep = X_Indep[start:end] * b_Indep;
    // initialize non-linear predictor term
    vector[N] mu;
    for (n in 1:N) {
      int nn = n + start - 1;
      // compute non-linear predictor values
      mu[n] = nlp_Int[nn] + nlp_Indep[nn];
    }
    ptarget += bernoulli_logit_lpmf(Y[start:end] | mu);
    return ptarget;
  }

The nlp_Int and nlp_Indep should NOT be indexed with nn, but with n.

1 Like

You are right! Thank you

The problem is now fixed in the github version of brms.

Thank you Paul for quick action. Now I do not see any issue so far.

@wds15 I’ll soon fit a large model (Bernoulli) so I have two questions:

  1. Worthwhile to use within-chain parallelization on Bernoulli?
  2. I have 16 cpus, should I then use cores = 8, chains = 4, threads = threading(2)?

Bernoulli is very hard to speed up, but you can try. If it is a hierarchical model, then you should tune the model in case you use brms (look at an issue I made in the brms repo).

I think you want cores=4, chains=4, threads=threading(2) … cores is the number of concurrent chains running I think (maybe that needs an update in notation).

3 Likes

It would be lovely if you could add a paragraph in the current brms case study where you elaborate on the different likelihoods and the potential for speedup (the most common likelihoods of course) :) Maybe also spend some time on clarifying the cores/chains/threads again, but it could be me being a bit slow :)

Under quick summary we have

Models with computationally expensive likelihoods are easier to parallelize than less expensive likelihoods. For example, the Poisson distribution involves expensive
log \Gamma functions whereas the normal likelihood is very cheap to calculate in comparison.

Not sure how to write it more clearly?

The cores/chains/thread is also very verbosely spelled out under quick summary

The example above assumes that 4 cores are available which are best used without within-chain parallelization by running 4 chains in parallel. When using within chain parallelization it is still advisable to use just as many threads in total as you have CPU cores. It’s thus sensible in this case to reduce the number of chains running in parallel to just 2, but allow each chain to use 2 threads. Obviously this will reduce the number of iterations in the posterior here as we assumed a fixed amount of 4 cores.

If that should be reworded, then maybe someone other than me should do that as I have probably looked too much at the text.

1 Like

I think it’s an excellent tutorial, don’t get me wrong :)

  1. Are you saying that log links are a good case but logit (which you kind of indicated you were a bit uncertain about?) was not a good case? Or are logit/log good for this case, or are there distributional assumptions that should guide this?

  2. So with 16 cores I assume 16 threads. Given four chains, I should use cores=16, chains=4, threads=threading(4)? All chains are chopped up into four pieces each.

My initial tests using a logit link were promising. I’m just trying to understand, and as I said, the tutorial is excellent.

1 Like

Just one quick comment: cores only applies to between-chain parallelization via different chains. Thus, you never need cores > chains

1 Like